General Linear Models Examples

This is a typical dataset with one response (stem length of snapdragons) and two factors: soil type (rows) and block (columns B1 to B3).

           B1   B2   B3
clarion  32.7 32.3 31.5
clinton  32.1 29.7 29.1
knox     35.7 35.9 33.1
o'neill  36.0 34.2 31.2
compost  31.8 28.0 29.2
wabash   38.2 37.8 31.9
webster  32.5 31.1 29.7

LINEAR MODEL FOR TWO WAY TABLE:

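Data = Total Mean + Row Effect + Column Effect + Residual

The body of the table holds the residuals, the margins hold the row and column effects, and the lower right corner is the total mean.
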
            B1     B2     B3  row effect
clarion -1.052 -0.024  1.076  -0.390
clinton  0.214 -0.757  0.543  -2.257
   knox -0.786  0.843 -0.057   2.343
o'neill  0.614  0.243 -0.857   1.243
compost  0.548 -1.824  1.276  -2.890
 wabash  0.648  1.676 -2.324   3.410
webster -0.186 -0.157  0.343  -1.457
col eff  1.586  0.157 -1.743  32.557
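
The decomposition can be checked numerically. What follows is a minimal Python sketch, not part of the original SAS analysis; the function name two_way_decomposition is made up for illustration. It takes the total mean, computes the row and column effects as deviations of the row and column means from the total mean, and leaves the rest as residuals.

```python
# Stem lengths from the table above: rows are soil types, columns are blocks B1..B3.
soils = ["clarion", "clinton", "knox", "o'neill", "compost", "wabash", "webster"]
data = [
    [32.7, 32.3, 31.5],
    [32.1, 29.7, 29.1],
    [35.7, 35.9, 33.1],
    [36.0, 34.2, 31.2],
    [31.8, 28.0, 29.2],
    [38.2, 37.8, 31.9],
    [32.5, 31.1, 29.7],
]


def two_way_decomposition(table):
    """Split a complete two-way table into total mean, row effects,
    column effects and residuals (illustrative helper)."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_eff = [sum(row) / n_cols - grand for row in table]
    col_eff = [sum(row[j] for row in table) / n_rows - grand
               for j in range(n_cols)]
    resid = [[table[i][j] - grand - row_eff[i] - col_eff[j]
              for j in range(n_cols)] for i in range(n_rows)]
    return grand, row_eff, col_eff, resid


grand, row_eff, col_eff, resid = two_way_decomposition(data)
print(f"total mean {grand:.3f}")
for name, eff, res in zip(soils, row_eff, resid):
    cells = " ".join(f"{r:7.3f}" for r in res)
    print(f"{name:8s} {cells}   row effect {eff:7.3f}")
print("col eff ", " ".join(f"{c:7.3f}" for c in col_eff))
```

Because the table is complete, the row effects and the column effects each sum to zero, and so does every row and column of residuals.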

SAS DOES IT IN A DIFFERENT WAY: IT SETS THE EFFECT OF THE LAST LEVEL
OF EACH FACTOR (HERE WEBSTER AND BLOCK 3) TO ZERO, SO THE OTHER
ESTIMATES ARE DIFFERENCES FROM THAT LEVEL.

 

          ESTIMATE   TVALUE  PVALUE  STD ERROR
INTERCEPT  29.3571 B  34.96  0.0001  0.839703
TYPE
clarion     1.0667 B   1.02  0.3285  1.047294
clinton    -0.800  B  -0.76  0.4597  1.047294
knox        3.800  B   3.63  0.0035  1.047294
o'neill     2.700  B   2.58  0.0242  1.047294
compost    -1.433  B  -1.37  0.1962  1.047294
wabash      4.867  B   4.65  0.0006  1.047294
webster     0.000  B    .     .       .

BLOCK   1   3.3285 B   4.85  0.0004  0.685615
        2   1.900  B   2.77  0.0169  0.685615
        3   0.000  B    .     .       .
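
The SAS estimates are the same fit written relative to a reference level. As a rough check, the Python sketch below (variable names are illustrative) starts from the effects printed in the two-way table, rounded to three decimals, and subtracts the effect of the reference level from each factor. The intercept is then the fitted value for webster in block 3.

```python
# Effects copied from the two-way table above (rounded to three decimals).
total_mean = 32.557
row_effect = {"clarion": -0.390, "clinton": -2.257, "knox": 2.343,
              "o'neill": 1.243, "compost": -2.890, "wabash": 3.410,
              "webster": -1.457}
col_effect = {1: 1.586, 2: 0.157, 3: -1.743}

# The levels whose estimates are printed as zero in the SAS output.
ref_type, ref_block = "webster", 3

# Intercept = fitted value in the reference cell.
intercept = total_mean + row_effect[ref_type] + col_effect[ref_block]
# Every other estimate is a difference from the reference level.
type_est = {t: e - row_effect[ref_type] for t, e in row_effect.items()}
block_est = {b: e - col_effect[ref_block] for b, e in col_effect.items()}

print(f"INTERCEPT {intercept:8.3f}")
for t, e in type_est.items():
    print(f"{t:9s} {e:8.3f}")
for b, e in block_est.items():
    print(f"BLOCK {b}   {e:8.3f}")
```

The results agree with the ESTIMATE column up to the rounding of the inputs.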

 

WE ALSO LOOK AT THE ANOVA TABLE

Source      DF      Type I SS  F Value   Pr > F

TYPE         6      103.15142    10.45   0.0004
BLOCK        2       39.03714    11.86   0.0014

Source      DF     Type III SS  F Value   Pr > F

TYPE         6      103.15142    10.45   0.0004
BLOCK        2       39.03714    11.86   0.0014
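
With complete data the design is balanced, which is why the Type I and Type III sums of squares coincide. In that case the sums of squares can be computed directly from the row and column means. The following Python sketch is an illustration under that assumption, not the algorithm PROC GLM uses.

```python
# Complete snapdragon data: rows are soil types, columns are blocks.
data = [
    [32.7, 32.3, 31.5],
    [32.1, 29.7, 29.1],
    [35.7, 35.9, 33.1],
    [36.0, 34.2, 31.2],
    [31.8, 28.0, 29.2],
    [38.2, 37.8, 31.9],
    [32.5, 31.1, 29.7],
]
n_types, n_blocks = len(data), len(data[0])

grand = sum(sum(row) for row in data) / (n_types * n_blocks)
row_means = [sum(row) / n_blocks for row in data]
col_means = [sum(row[j] for row in data) / n_types for j in range(n_blocks)]

# Balanced-design sums of squares.
ss_type = n_blocks * sum((m - grand) ** 2 for m in row_means)
ss_block = n_types * sum((m - grand) ** 2 for m in col_means)
ss_error = sum((data[i][j] - row_means[i] - col_means[j] + grand) ** 2
               for i in range(n_types) for j in range(n_blocks))

df_type, df_block = n_types - 1, n_blocks - 1
df_error = df_type * df_block
mse = ss_error / df_error

f_type = (ss_type / df_type) / mse
f_block = (ss_block / df_block) / mse

print(f"TYPE   df={df_type}  SS={ss_type:.5f}  F={f_type:.2f}")
print(f"BLOCK  df={df_block}  SS={ss_block:.5f}  F={f_block:.2f}")
print(f"ERROR  df={df_error}  SS={ss_error:.5f}")
```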

This is the SAS code and output file

options ps=50 ls=70;
   *---------------snapdragon experiment---------------*
   | as reported by stenstrom, 1940, an experiment was |
   | undertaken to investigate how snapdragons grew in |
   | various soils. each soil type was used in three   |
   | blocks.                                           |
   *---------------------------------------------------*;

   data plants;
      input type $ @;
      do block=1 to 3;
         input stemleng @;
         output;
         end;
      cards;
   clarion  32.7 32.3 31.5
   clinton  32.1 29.7 29.1
   knox     35.7 35.9 33.1
   o'neill  36.0 34.2 31.2
   compost  31.8 28.0 29.2
   wabash   38.2 37.8 31.9
   webster  32.5 31.1 29.7
   ;
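   * Fit the additive two-way model: soil type plus block;
   proc glm;
      class type block;
      model stemleng=type block;
   run;
   * Refit with the levels in data order, print the parameter;
   * estimates and compare the soil type means;
   proc glm order=data;
      class type block;
      model stemleng=type block / solution;
      means type / bon duncan tukey;
      *-type-order---clrn-cltn-knox-onel-cpst-wbsh-wstr;
      contrast 'compost v others' type -1 -1 -1 -1  6 -1 -1;
      contrast 'knox vs oneill'   type  0  0  1 -1  0  0  0;
   run;
   quit;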

Output file from "glm.sas"

This is the data but with some missing observations

          B1   B2   B3
clarion  32.7 32.3   NA
clinton  32.1 29.7 29.1
knox     35.7 35.9 33.1
o'neill    NA 34.2 31.2
compost  31.8 28.0 29.2
wabash   38.2 37.8 31.9
webster  32.5   NA 29.7

Row Effects

    clarion   clinton     knox  o'neill
 0.05238095 -2.147619 2.452381 0.252381

 compost   wabash   webster
-2.780952 3.519048 -1.347619  

Column Effects:
       B1        B2        B3
 1.427778 0.3111111 -1.738889

Main Effect:   32.44762  
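
The effects above are reproduced by a single sweep over the available values: row means first, then column means of the row-centered data. This is an inference from the printed numbers, not a statement of how they were originally produced. A minimal Python sketch, with None standing in for NA:

```python
NA = None  # marker for a missing observation

soils = ["clarion", "clinton", "knox", "o'neill", "compost", "wabash", "webster"]
data = [
    [32.7, 32.3, NA],
    [32.1, 29.7, 29.1],
    [35.7, 35.9, 33.1],
    [NA, 34.2, 31.2],
    [31.8, 28.0, 29.2],
    [38.2, 37.8, 31.9],
    [32.5, NA, 29.7],
]


def mean_present(values):
    """Mean of the non-missing entries."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


# Step 1: row means over the available cells; the main effect is their average.
row_means = [mean_present(row) for row in data]
main_effect = sum(row_means) / len(row_means)
row_eff = [m - main_effect for m in row_means]

# Step 2: subtract each row mean, then average the columns of what is left.
centered = [[None if v is None else v - m for v in row]
            for row, m in zip(data, row_means)]
col_eff = [mean_present([row[j] for row in centered]) for j in range(3)]

print(f"main effect {main_effect:.5f}")
for name, eff in zip(soils, row_eff):
    print(f"{name:8s} {eff:9.5f}")
print("column effects", [round(c, 6) for c in col_eff])
```

With missing cells these swept effects are no longer the least squares estimates, which is why the SAS estimates that follow cannot be obtained from them by simple subtraction.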

So what is SAS going to do?  

          ESTIMATE    TVALUE    PVALUE  STD ERROR
INTERCEPT  29.372 B    27.73    0.0001    1.0591
TYPE
clarion     0.281 B     0.20    0.8491    1.4390
clinton    -0.969 B    -0.76    0.4682    1.2804
knox        3.630 B     2.84    0.0196    1.2804
o'neill     2.209 B     1.54    0.1591    1.4390
compost    -1.603 B    -1.25    0.2421    1.2804
wabash      4.696 B     3.67    0.0052    1.2804
webster     0.000 B       .       .         .

BLOCK 1     3.455 B     4.16    0.0025    0.8308
        2   2.236 B     2.69    0.0247    0.8308
        3   0.000 B      .
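WE ALSO LOOK AT THE ANOVA TABLE
AND WE SEE THAT THE ORDER MATTERS: WITH MISSING CELLS THE DESIGN IS
UNBALANCED, SO THE TYPE I (SEQUENTIAL) SS DEPEND ON WHICH FACTOR
ENTERS THE MODEL FIRST.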

TYPE BEFORE BLOCK

Source   DF         Type I SS  F Value   Pr > F

TYPE      6       95.93611111     8.42   0.0028
BLOCK     2       33.76848485     8.89   0.0074

Source   DF       Type III SS  F Value   Pr > F

TYPE      6       98.19681818     8.62   0.0026
BLOCK     2       33.76848485     8.89   0.0074

BLOCK BEFORE TYPE

Source   DF         Type I SS  F Value   Pr > F

BLOCK     2       31.50777778     8.30   0.0091
TYPE      6       98.19681818     8.62   0.0026

Source   DF       Type III SS  F Value   Pr > F

BLOCK     2       33.76848485     8.89   0.0074
TYPE      6       98.19681818     8.62   0.0026
 

THE BASIC STATS

Source   DF    Sum of Squares  F Value   Pr > F

Model     8      129.70459596     8.54   0.0021
Error     9       17.08484848
Total    17      146.78944444

R-Square              C.V.     STEMLENG Mean
0.883610          4.238642        32.5055556
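
The order dependence can be traced by hand. The first term in a Type I table gets its unadjusted sum of squares, and the second term gets whatever is left of the Model sum of squares. The Python sketch below is an illustration only: it computes the unadjusted sums of squares from the data and takes the Model sum of squares (129.70459596) from the SAS table above instead of refitting the model.

```python
NA = None  # marker for a missing observation

data = [
    [32.7, 32.3, NA],
    [32.1, 29.7, 29.1],
    [35.7, 35.9, 33.1],
    [NA, 34.2, 31.2],
    [31.8, 28.0, 29.2],
    [38.2, 37.8, 31.9],
    [32.5, NA, 29.7],
]

# One (type index, block index, value) triple per observed cell.
obs = [(i, j, y) for i, row in enumerate(data)
       for j, y in enumerate(row) if y is not None]
overall = sum(y for _, _, y in obs) / len(obs)


def unadjusted_ss(factor):
    """Between-group sum of squares ignoring the other factor.
    factor = 0 groups by soil type, factor = 1 groups by block."""
    groups = {}
    for o in obs:
        groups.setdefault(o[factor], []).append(o[2])
    return sum(len(v) * (sum(v) / len(v) - overall) ** 2
               for v in groups.values())


model_ss = 129.70459596  # copied from the SAS output, not recomputed here

ss_type_first = unadjusted_ss(0)              # TYPE entered first
ss_block_after_type = model_ss - ss_type_first
ss_block_first = unadjusted_ss(1)             # BLOCK entered first
ss_type_after_block = model_ss - ss_block_first

print(f"TYPE first : TYPE {ss_type_first:.5f}  BLOCK {ss_block_after_type:.5f}")
print(f"BLOCK first: BLOCK {ss_block_first:.5f}  TYPE {ss_type_after_block:.5f}")
```

The adjusted values (each factor after the other) are the ones reported as Type III SS in both orderings.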

PRINCIPAL COMPONENTS

Principal components analysis is a method for dimension reduction.

Applications:

 

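Data: $y_i = (y_{i1}, \ldots, y_{ip})'$, $i = 1, \ldots, n$. We assume that the $y_i$ are centered, with sample covariance matrix $S$.

Let $A$ be an orthogonal transformation such that the components of $z_i = A y_i$ are uncorrelated. The covariance matrix of the $z_i$ is then diagonal:

$S_z = A S A' = \mathrm{diag}(s_{z_1}^2, \ldots, s_{z_p}^2)$

Since $A$ is orthogonal, $A'A = I$ and $S = A' S_z A$, so the rows of $A$ are normalized eigenvectors of $S$ and the total variance is preserved: $\mathrm{tr}(S_z) = \mathrm{tr}(S)$.

The eigenvalues of $S$ are $\lambda_1 = s_{z_1}^2, \ldots, \lambda_p = s_{z_p}^2$, ordered so that $\lambda_1 \ge \cdots \ge \lambda_p$.

The proportion of the variance explained by the first $k$ components is $(\lambda_1 + \cdots + \lambda_k)/(\lambda_1 + \cdots + \lambda_p)$.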
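
For two variables the orthogonal transformation is a plane rotation and can be written in closed form. The Python sketch below is an illustration only (the helper names are made up). It uses the 2 x 2 correlation matrix of BURGLARY and LARCENY from the crime example that follows (correlation 0.7921) and checks that the rotated variables are uncorrelated and that the total variance is unchanged.

```python
import math

# 2 x 2 correlation matrix of BURGLARY and LARCENY (r = 0.7921).
S = [[1.0, 0.7921],
     [0.7921, 1.0]]


def rotation_to_principal_axes(S):
    """Rotation A (2 x 2, orthogonal) that makes A S A' diagonal,
    with the larger variance in the first position."""
    a, b, c = S[0][0], S[0][1], S[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    return [[cs, sn], [-sn, cs]]


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


A = rotation_to_principal_axes(S)
S_z = matmul(matmul(A, S), transpose(A))

lam1, lam2 = S_z[0][0], S_z[1][1]
proportion_first = lam1 / (lam1 + lam2)

print("S_z =", [[round(v, 4) for v in row] for row in S_z])
print("proportion explained by the first component:", round(proportion_first, 4))
```

For a correlation matrix of two variables the rotation is always 45 degrees and the eigenvalues are 1 + r and 1 - r.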

 

Example: This is an example where we try to group crime variables into components that give a simpler interpretation of various forms of crime.

options ls=64 ps=50;
DATA CRIME;
   TITLE 'CRIME RATES PER 100,000 POPULATION BY STATE';
   INPUT STATE $1-15 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
   CARDS;
ALABAMA        14.2 25.2  96.8 278.3 1135.5 1881.9 280.7
ALASKA         10.8 51.6  96.8 284.0 1331.7 3369.8 753.3
ARIZONA         9.5 34.2 138.2 312.3 2346.1 4467.4 439.5
ARKANSAS        8.8 27.6  83.2 203.4  972.6 1862.1 183.4
CALIFORNIA     11.5 49.4 287.0 358.0 2139.4 3499.8 663.5
COLORADO        6.3 42.0 170.7 292.9 1935.2 3903.2 477.1
CONNECTICUT     4.2 16.8 129.5 131.8 1346.0 2620.7 593.2
DELAWARE        6.0 24.9 157.0 194.2 1682.6 3678.4 467.0
FLORIDA        10.2 39.6 187.9 449.1 1859.9 3840.5 351.4
GEORGIA        11.7 31.1 140.5 256.5 1351.1 2170.2 297.9
HAWAII          7.2 25.5 128.0  64.1 1911.5 3920.4 489.4
IDAHO           5.5 19.4  39.6 172.5 1050.8 2599.6 237.6
ILLINOIS        9.9 21.8 211.3 209.0 1085.0 2828.5 528.6
INDIANA         7.4 26.5 123.2 153.5 1086.2 2498.7 377.4
IOWA            2.3 10.6  41.2  89.8  812.5 2685.1 219.9
KANSAS          6.6 22.0 100.7 180.5 1270.4 2739.3 244.3
KENTUCKY       10.1 19.1  81.1 123.3  872.2 1662.1 245.4
LOUISIANA      15.5 30.9 142.9 335.5 1165.5 2469.9 337.7
MAINE           2.4 13.5  38.7 170.0 1253.1 2350.7 246.9
MARYLAND        8.0 34.8 292.1 358.9 1400.0 3177.7 428.5
MASSACHUSETTS   3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
MICHIGAN        9.3 38.9 261.9 274.6 1522.7 3159.0 545.5
MINNESOTA       2.7 19.5  85.9  85.8 1134.7 2559.3 343.1
MISSISSIPPI    14.3 19.6  65.7 189.1  915.6 1239.9 144.4
MISSOURI        9.6 28.3 189.0 233.5 1318.3 2424.2 378.4
MONTANA         5.4 16.7  39.2 156.8  804.9 2773.2 309.2
NEBRASKA        3.9 18.1  64.7 112.7  760.0 2316.1 249.1
NEVADA         15.8 49.1 323.1 355.0 2453.1 4212.6 559.2
NEW HAMPSHIRE   3.2 10.7  23.2  76.0 1041.7 2343.9 293.4
NEW JERSEY      5.6 21.0 180.4 185.1 1435.8 2774.5 511.5
NEW MEXICO      8.8 39.1 109.6 343.4 1418.7 3008.6 259.5
NEW YORK       10.7 29.4 472.6 319.1 1728.0 2782.0 745.8
NORTH CAROLINA 10.6 17.0  61.3 318.3 1154.1 2037.8 192.1
NORTH DAKOTA    0.9  9.0  13.3  43.8  446.1 1843.0 144.7
OHIO            7.8 27.3 190.5 181.1 1216.0 2696.8 400.4
OKLAHOMA        8.6 29.2  73.8 205.0 1288.2 2228.1 326.8
OREGON          4.9 39.9 124.1 286.9 1636.4 3506.1 388.9
PENNSYLVANIA    5.6 19.0 130.3 128.0  877.5 1624.1 333.2
RHODE ISLAND    3.6 10.5  86.5 201.0 1489.5 2844.1 791.4
SOUTH CAROLINA 11.9 33.0 105.9 485.3 1613.6 2342.4 245.1
SOUTH DAKOTA    2.0 13.5  17.9 155.7  570.5 1704.4 147.5
TENNESSEE      10.1 29.7 145.8 203.9 1259.7 1776.5 314.0
TEXAS          13.3 33.8 152.4 208.2 1603.1 2988.7 397.6
UTAH            3.5 20.3  68.8 147.3 1171.6 3004.6 334.5
VERMONT         1.4 15.9  30.8 101.2 1348.2 2201.0 265.2
VIRGINIA        9.0 23.3  92.1 165.7  986.2 2521.2 226.7
WASHINGTON      4.3 39.6 106.2 224.8 1605.6 3386.9 360.3
WEST VIRGINIA   6.0 13.2  42.2  90.9  597.4 1341.7 163.3
WISCONSIN       2.8 12.9  52.2  63.7  846.9 2614.2 220.7
WYOMING         5.4 21.9  39.7 173.9  811.6 2772.2 282.0
;
PROC PRINCOMP OUT=CRIMCOMP;

PROC SORT;
   BY PRIN1;
PROC PRINT;
   ID STATE;
   VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
   TITLE2 'STATES LISTED IN ORDER OF OVERALL CRIME RATE';
   TITLE3 'AS DETERMINED BY THE FIRST PRINCIPAL COMPONENT';
PROC SORT;
   BY PRIN2;
PROC PRINT;
   ID STATE;
   VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
   TITLE2 'STATES LISTED IN ORDER OF PROPERTY VS. VIOLENT CRIME';
   TITLE3 'AS DETERMINED BY THE SECOND PRINCIPAL COMPONENT';

PROC PLOT;
   PLOT PRIN2*PRIN1=STATE;
   TITLE2 'PLOT OF THE FIRST TWO PRINCIPAL COMPONENTS';
PROC PLOT;
   PLOT PRIN3*PRIN1=STATE;
   TITLE2 'PLOT OF THE FIRST AND THIRD PRINCIPAL COMPONENTS';
RUN;

        CRIME RATES PER 100,000 POPULATION BY STATE

                  Principal Component Analysis

      50 Observations
       7 Variables
                       Simple Statistics

             MURDER           RAPE        ROBBERY        ASSAULT

Mean    7.444000000    25.73400000    124.0920000    211.3000000
StD     3.866768941    10.75962995     88.3485672    100.2530492  

                  BURGLARY        LARCENY           AUTO

       Mean    1291.904000    2671.288000    377.5260000
       StD      432.455711     725.908707    193.3944175

                       Correlation Matrix

            MURDER      RAPE   ROBBERY   ASSAULT

MURDER      1.0000    0.6012    0.4837    0.6486
RAPE        0.6012    1.0000    0.5919    0.7403
ROBBERY     0.4837    0.5919    1.0000    0.5571
ASSAULT     0.6486    0.7403    0.5571    1.0000
BURGLARY    0.3858    0.7121    0.6372    0.6229
LARCENY     0.1019    0.6140    0.4467    0.4044
AUTO        0.0688    0.3489    0.5907    0.2758

             BURGLARY       LARCENY          AUTO

   MURDER      0.3858        0.1019        0.0688
   RAPE        0.7121        0.6140        0.3489
   ROBBERY     0.6372        0.4467        0.5907
   ASSAULT     0.6229        0.4044        0.2758
   BURGLARY    1.0000        0.7921        0.5580
   LARCENY     0.7921        1.0000        0.4442
   AUTO        0.5580        0.4442        1.0000

             Eigenvalues of the Correlation Matrix

      Eigenvalue  Difference  Proportion    Cumulative

 PRIN1   4.11496   2.87624      0.587851       0.58785
 PRIN2   1.23872   0.51291      0.176960       0.76481
 PRIN3   0.72582   0.40938      0.103688       0.86850
 PRIN4   0.31643   0.05846      0.045205       0.91370
 PRIN5   0.25797   0.03593      0.036853       0.95056
 PRIN6   0.22204   0.09798      0.031720       0.98228
 PRIN7   0.12406    .           0.017722       1.00000

                          Eigenvectors

             PRIN1     PRIN2         PRIN3         PRIN4

MURDER    0.300279  -.629174      0.178245      -.232114
RAPE      0.431759  -.169435      -.244198      0.062216
ROBBERY   0.396875  0.042247      0.495861      -.557989
ASSAULT   0.396652  -.343528      -.069510      0.629804
BURGLARY  0.440157  0.203341      -.209895      -.057555
LARCENY   0.357360  0.402319      -.539231      -.234890
AUTO      0.295177  0.502421      0.568384      0.419238

                        PRIN5         PRIN6         PRIN7

       MURDER        0.538123      0.259117      0.267593
       RAPE          0.188471      -.773271      -.296485
       ROBBERY       -.519977      -.114385      -.003903
       ASSAULT       -.506651      0.172363      0.191745
       BURGLARY      0.101033      0.535987      -.648117
       LARCENY       0.030099      0.039406      0.601690
       AUTO          0.369753      -.057298      0.147046
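
Because the analysis is based on the correlation matrix, a component score is the eigenvector applied to the standardized variables. As a rough illustration (a Python sketch with made-up helper names, not SAS output), the scores for ALABAMA can be rebuilt from the simple statistics and the first two eigenvectors printed above.

```python
# Order: MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO.
means = [7.444, 25.734, 124.092, 211.300, 1291.904, 2671.288, 377.526]
stds = [3.866768941, 10.75962995, 88.3485672, 100.2530492,
        432.455711, 725.908707, 193.3944175]

# First two columns of the eigenvector table.
prin1_vec = [0.300279, 0.431759, 0.396875, 0.396652, 0.440157, 0.357360, 0.295177]
prin2_vec = [-0.629174, -0.169435, 0.042247, -0.343528, 0.203341, 0.402319, 0.502421]

alabama = [14.2, 25.2, 96.8, 278.3, 1135.5, 1881.9, 280.7]


def component_score(x, vec):
    """Eigenvector applied to the standardized observation."""
    return sum(e * (xi - m) / s for e, xi, m, s in zip(vec, x, means, stds))


prin1 = component_score(alabama, prin1_vec)
prin2 = component_score(alabama, prin2_vec)
print(f"ALABAMA  PRIN1 = {prin1:.3f}  PRIN2 = {prin2:.3f}")
```

Alabama comes out near zero on the first component (about average overall crime) and near -2 on the second, which is driven by its high murder rate and low larceny and auto theft rates.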
Plot of PRINCIPAL COMPONENTS (Data and Variables)   

Plot of PRIN2*PRIN1.  Symbol is value of STATE.

PRIN2 |
      |                           M
      |
      |
      |                     R
    2 +
      |                           H
      |
      |                   C
      |                           D
      |
    1 +           V M  U      N
      |        W                           C  A
      |                           W  O
      |            M                            N
      |N            M
      |          N            O I         M          C
    0 +              I    K
      |             P                    M
      |     S                                             N
      |                         M
      |                 V  O         T        F
      |     W
   -1 +                             N
      |             K       T
      |                A        G
      |
      |                  N
      |
   -2 +                            L
      |                      A        S
      |
      |              M
      -+----------+----------+----------+----------+--------
      -4         -2          0          2          4

                                 PRIN1

        Plot of PRIN3*PRIN1.  Symbol is value of STATE.

PRIN3 |
      |                                         N
      |                           M
      |
      |
    2 +
      |
      |
      |
      |
      |
      |                         I
    1 +             P       R
      |
      |             KM    C T
      |     W                AN M         M
      |                       O    L     M
      |                   I     G                    C
      |
    0 +                A                  A
      |N    S  N N  M   VN O         T
      |                                                   N
      |        W    M     K
      |        I  VM I U          D   S
      |                           H
      |
   -1 +
      |                             N      C
      |                           W  O        F
      |
      |
      |                                       A
      -+----------+----------+----------+----------+---------
      -4         -2          0          2          4    PRIN1
 
How many components?

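·  Retain enough components to explain a fixed percentage of the variance (70%, 80%, …).

·  Exclude components whose eigenvalues are less than the average eigenvalue. (For the correlation matrix the average is 1.)

·  Graph the eigenvalues (scree plot) and look for a break.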
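
The first two rules are easy to apply to the eigenvalues of the crime correlation matrix. The Python sketch below is an illustration; the function names are made up and the thresholds are the ones mentioned in the list.

```python
# Eigenvalues of the correlation matrix from the PROC PRINCOMP output.
eigenvalues = [4.11496, 1.23872, 0.72582, 0.31643, 0.25797, 0.22204, 0.12406]


def components_for_variance(eigs, target):
    """Smallest k whose cumulative proportion of variance reaches target."""
    total = sum(eigs)
    running = 0.0
    for k, lam in enumerate(eigs, start=1):
        running += lam
        if running / total >= target:
            return k
    return len(eigs)


def components_above_average(eigs):
    """Number of eigenvalues larger than the average eigenvalue."""
    average = sum(eigs) / len(eigs)
    return sum(1 for lam in eigs if lam > average)


k70 = components_for_variance(eigenvalues, 0.70)
k80 = components_for_variance(eigenvalues, 0.80)
k_avg = components_above_average(eigenvalues)
print("components for 70% of the variance:", k70)
print("components for 80% of the variance:", k80)
print("eigenvalues above the average:", k_avg)
```

The rules need not agree: here the 80% rule keeps one more component than the average-eigenvalue rule.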

Splus/R Demo

 

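Test the null hypothesis that the last k eigenvalues are equal, $H_0: \lambda_{p-k+1} = \cdots = \lambda_p$.

Let $\bar{\lambda} = \frac{1}{k} \sum_{i=p-k+1}^{p} \lambda_i$ be the average of the last $k$ eigenvalues.

The test statistic is

$u = \left(n - \frac{2p+11}{6}\right) \left(k \ln \bar{\lambda} - \sum_{i=p-k+1}^{p} \ln \lambda_i\right)$

Under the null hypothesis, $u$ is approximately $\chi^2$ with $\nu = (k-1)(k+2)/2$ degrees of freedom.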

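In the example dataset (n = 50, p = 7) the last four eigenvalues are small, so take k = 4. Here ei holds the last k eigenvalues and mei is their mean.

> (50 - (2*7+11)/6)*(4*log(mei)-sum(log(ei)))
[1] 10.12649
> qchisq(0.95,9)
[1] 16.91898

Since 10.13 < 16.92, we do not reject the hypothesis that the last four eigenvalues are equal.

Now with the last 5 eigenvalues:

> (50 - (2*7+11)/6)*(5*log(mei)-sum(log(ei)))
[1] 39.57434
> qchisq(0.95,14)
[1] 23.68475

Since 39.57 > 23.68, we reject equality of the last five eigenvalues, so the third component is distinguishable from the last four.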
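
The same calculation can be written as a small function. The Python sketch below is an illustration (the function name is made up). It uses the eigenvalues as printed in the PROC PRINCOMP output, so the statistics differ from the R values in the last decimals, and it takes the 95% critical values from the qchisq calls above rather than computing them.

```python
import math

# Eigenvalues of the correlation matrix as printed by PROC PRINCOMP.
eigenvalues = [4.11496, 1.23872, 0.72582, 0.31643, 0.25797, 0.22204, 0.12406]
n = 50  # number of states


def equal_tail_statistic(eigs, k, n):
    """Statistic u and its degrees of freedom for testing that the
    last k eigenvalues are equal."""
    p = len(eigs)
    tail = eigs[-k:]
    mean_tail = sum(tail) / k
    u = (n - (2 * p + 11) / 6) * (
        k * math.log(mean_tail) - sum(math.log(lam) for lam in tail))
    df = (k - 1) * (k + 2) // 2
    return u, df


# qchisq(0.95, df) values copied from the R output above.
critical = {9: 16.91898, 14: 23.68475}

results = {}
for k in (4, 5):
    u, df = equal_tail_statistic(eigenvalues, k, n)
    results[k] = (u, df)
    decision = "reject" if u > critical[df] else "do not reject"
    print(f"k={k}: u={u:.3f}, df={df}, critical={critical[df]:.3f} -> {decision}")
```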