This is a typical dataset with one response (stem length) and two
factors (soil type and block).
B1 B2 B3
clarion 32.7 32.3 31.5
clinton 32.1 29.7 29.1
knox 35.7 35.9 33.1
o'neill 36.0 34.2 31.2
compost 31.8 28.0 29.2
wabash 38.2 37.8 31.9
webster 32.5 31.1 29.7
LINEAR MODEL FOR TWO WAY TABLE:
Data = Total Mean + Row Effect + Column Effect + Residual
           B1      B2      B3    row effect
clarion  -1.052  -0.024   1.076    -0.390
clinton   0.214  -0.757   0.543    -2.257
knox     -0.786   0.843  -0.057     2.343
o'neill   0.614   0.243  -0.857     1.243
compost   0.548  -1.824   1.276    -2.890
wabash    0.648   1.676  -2.324     3.410
webster  -0.186  -0.157   0.343    -1.457
col eff   1.586   0.157  -1.743    32.557
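The decomposition in the table above can be reproduced with a few lines of arithmetic. The following is a minimal sketch in Python (the notes themselves use SAS and R; the variable names here are illustrative only). It computes the total mean, the row and column effects as deviations of the row and column means from the total mean, and the residuals as what is left over.

```python
# Stem lengths: rows are soil types, columns are blocks B1..B3.
soils = ["clarion", "clinton", "knox", "o'neill", "compost", "wabash", "webster"]
y = [
    [32.7, 32.3, 31.5],
    [32.1, 29.7, 29.1],
    [35.7, 35.9, 33.1],
    [36.0, 34.2, 31.2],
    [31.8, 28.0, 29.2],
    [38.2, 37.8, 31.9],
    [32.5, 31.1, 29.7],
]
n_rows, n_cols = len(y), len(y[0])

# Total mean, then effects as deviations of the marginal means from it.
grand = sum(sum(row) for row in y) / (n_rows * n_cols)
row_eff = [sum(row) / n_cols - grand for row in y]
col_eff = [sum(y[i][j] for i in range(n_rows)) / n_rows - grand
           for j in range(n_cols)]

# Residual = Data - Total Mean - Row Effect - Column Effect.
resid = [[y[i][j] - grand - row_eff[i] - col_eff[j] for j in range(n_cols)]
         for i in range(n_rows)]

for name, res, eff in zip(soils, resid, row_eff):
    print(f"{name:8s}", " ".join(f"{v:7.3f}" for v in res), f"{eff:7.3f}")
print("col eff ", " ".join(f"{v:7.3f}" for v in col_eff), f"{grand:7.3f}")
```

The printed layout mirrors the table: residuals in the body, row effects in the last column, column effects in the last row, and the total mean in the corner.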
SAS DOES IT IN A DIFFERENT WAY: IT SETS THE EFFECT FOR ONE OF THE GROUPS TO ZERO.
                   ESTIMATE    TVALUE  PVALUE  STD ERROR
INTERCEPT          29.3571 B    34.96  0.0001   0.839703
TYPE    clarion     1.0667 B     1.02  0.3285   1.047294
        clinton    -0.800  B    -0.76  0.4597   1.047294
        knox        3.800  B     3.63  0.0035   1.047294
        o'neill     2.700  B     2.58  0.0242   1.047294
        compost    -1.433  B    -1.37  0.1962   1.047294
        wabash      4.867  B     4.65  0.0006   1.047294
        webster     0.000  B      .     .        .
BLOCK   1           3.3285 B     4.85  0.0004   0.685615
        2           1.900  B     2.77  0.0169   0.685615
        3           0.000  B      .     .        .
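The two parameterizations describe the same fit. As a rough sketch in Python (not SAS's algorithm, and relying on the assumption that the design is balanced so that least squares reduces to marginal means), the SAS estimates can be recovered by measuring every soil type against webster and every block against block 3, the levels whose effects SAS sets to zero.

```python
soils = ["clarion", "clinton", "knox", "o'neill", "compost", "wabash", "webster"]
y = [
    [32.7, 32.3, 31.5],
    [32.1, 29.7, 29.1],
    [35.7, 35.9, 33.1],
    [36.0, 34.2, 31.2],
    [31.8, 28.0, 29.2],
    [38.2, 37.8, 31.9],
    [32.5, 31.1, 29.7],
]
n_rows, n_cols = len(y), len(y[0])

grand = sum(sum(row) for row in y) / (n_rows * n_cols)
row_mean = [sum(row) / n_cols for row in y]
col_mean = [sum(y[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]

# Reference levels: the last soil type (webster) and the last block (3).
ref_row, ref_col = n_rows - 1, n_cols - 1

# Intercept = fitted value for the reference cell (webster, block 3).
intercept = row_mean[ref_row] + col_mean[ref_col] - grand
# Every other estimate is a difference from the reference level.
type_est = {s: m - row_mean[ref_row] for s, m in zip(soils, row_mean)}
block_est = [m - col_mean[ref_col] for m in col_mean]

print(f"INTERCEPT {intercept:8.4f}")
for s in soils:
    print(f"{s:9s} {type_est[s]:8.4f}")
for j, b in enumerate(block_est, start=1):
    print(f"BLOCK {j}   {b:8.4f}")
```

The differences between soil types are the same in both tables; only the baseline they are measured from changes.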
WE ALSO LOOK AT THE ANOVA TABLE
Source  DF  Type I SS    F Value  Pr > F
TYPE     6  103.15142      10.45  0.0004
BLOCK    2   39.03714      11.86  0.0014
Source  DF  Type III SS  F Value  Pr > F
TYPE     6  103.15142      10.45  0.0004
BLOCK    2   39.03714      11.86  0.0014
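With complete data the sums of squares follow directly from the effects of the two-way decomposition, which is why Type I and Type III agree here. A minimal Python sketch of that arithmetic, assuming the balanced layout of the full table (names are illustrative):

```python
y = [
    [32.7, 32.3, 31.5],
    [32.1, 29.7, 29.1],
    [35.7, 35.9, 33.1],
    [36.0, 34.2, 31.2],
    [31.8, 28.0, 29.2],
    [38.2, 37.8, 31.9],
    [32.5, 31.1, 29.7],
]
n_rows, n_cols = len(y), len(y[0])
grand = sum(sum(row) for row in y) / (n_rows * n_cols)
row_eff = [sum(row) / n_cols - grand for row in y]
col_eff = [sum(y[i][j] for i in range(n_rows)) / n_rows - grand
           for j in range(n_cols)]

# Each row effect is shared by n_cols cells, each column effect by n_rows cells.
ss_type = n_cols * sum(e ** 2 for e in row_eff)
ss_block = n_rows * sum(e ** 2 for e in col_eff)
ss_error = sum((y[i][j] - grand - row_eff[i] - col_eff[j]) ** 2
               for i in range(n_rows) for j in range(n_cols))

df_type, df_block = n_rows - 1, n_cols - 1
df_error = df_type * df_block
mse = ss_error / df_error

f_type = (ss_type / df_type) / mse
f_block = (ss_block / df_block) / mse
print(f"TYPE  {df_type} {ss_type:10.5f} {f_type:6.2f}")
print(f"BLOCK {df_block} {ss_block:10.5f} {f_block:6.2f}")
```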
This is the SAS code and output file
options ps=50 ls=70;
*---------------snapdragon experiment---------------*
| as reported by stenstrom, 1940, an experiment was |
| undertaken to investigate how snapdragons grew in |
| various soils. each soil type was used in three |
| blocks.                                           |
*---------------------------------------------------*;
data plants;
input type $ @;
do block=1 to 3;
input stemleng @;
output;
end;
cards;
clarion 32.7 32.3 31.5
clinton 32.1 29.7 29.1
knox 35.7 35.9 33.1
o'neill 36.0 34.2 31.2
compost 31.8 28.0 29.2
wabash 38.2 37.8 31.9
webster 32.5 31.1 29.7
;
proc glm;
class type block;
model stemleng=type block;
run;
proc glm order=data;
class type block;
model stemleng=type block / solution;
means type / bon duncan tukey;
*-type-order---clrn-cltn-knox-onel-cpst-wbsh-wstr;
contrast 'compost v others' type -1 -1 -1 -1 6 -1 -1;
contrast 'knox vs oneill' type 0 0 1 -1 0 0 0;
run;
This is the data but with some missing observations
B1 B2 B3
clarion 32.7 32.3 NA
clinton 32.1 29.7 29.1
knox 35.7 35.9 33.1
o'neill NA 34.2 31.2
compost 31.8 28.0 29.2
wabash 38.2 37.8 31.9
webster 32.5 NA 29.7
Row Effects:
   clarion    clinton      knox   o'neill
0.05238095  -2.147619  2.452381  0.252381
   compost     wabash    webster
 -2.780952   3.519048  -1.347619
Column Effects:
      B1         B2         B3
1.427778  0.3111111  -1.738889
Main Effect: 32.44762
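The notes do not say how these effects were computed. The numbers are consistent with a single sweep of means that skips the missing cells: row means first, then column means of what remains. The Python sketch below works under that assumption (it is not necessarily the routine that produced the output).

```python
NA = None  # marks a missing observation

soils = ["clarion", "clinton", "knox", "o'neill", "compost", "wabash", "webster"]
y = [
    [32.7, 32.3, NA],
    [32.1, 29.7, 29.1],
    [35.7, 35.9, 33.1],
    [NA, 34.2, 31.2],
    [31.8, 28.0, 29.2],
    [38.2, 37.8, 31.9],
    [32.5, NA, 29.7],
]


def mean_skip_na(values):
    """Average of the non-missing entries."""
    present = [v for v in values if v is not NA]
    return sum(present) / len(present)


# Step 1: sweep out row means; the main effect is the mean of the row means.
row_mean = [mean_skip_na(row) for row in y]
main = mean_skip_na(row_mean)
row_eff = [m - main for m in row_mean]

# Step 2: sweep column means out of the row-centered data.
col_eff = [
    mean_skip_na([y[i][j] - row_mean[i] if y[i][j] is not NA else NA
                  for i in range(len(y))])
    for j in range(len(y[0]))
]

print("Row effects:   ", [round(e, 6) for e in row_eff])
print("Column effects:", [round(e, 6) for e in col_eff])
print("Main effect:   ", round(main, 5))
```

Because cells are missing, a single sweep like this is not the least-squares fit, which is one reason the SAS estimates that follow differ from simple differences of these effects.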
So what is SAS going to do?
                   ESTIMATE    TVALUE  PVALUE  STD ERROR
INTERCEPT          29.372  B    27.73  0.0001   1.0591
TYPE    clarion     0.281  B     0.20  0.8491   1.4390
        clinton    -0.969  B    -0.76  0.4682   1.2804
        knox        3.630  B     2.84  0.0196   1.2804
        o'neill     2.209  B     1.54  0.1591   1.4390
        compost    -1.603  B    -1.25  0.2421   1.2804
        wabash      4.696  B     3.67  0.0052   1.2804
        webster     0.000  B      .     .        .
BLOCK   1           3.455  B     4.16  0.0025   0.8308
        2           2.236  B     2.69  0.0247   0.8308
        3           0.000  B      .     .        .
WE ALSO LOOK AT THE ANOVA TABLE AND WE SEE THAT THE ORDER MATTERS
TYPE BEFORE BLOCK
Source  DF  Type I SS     F Value  Pr > F
TYPE     6  95.93611111      8.42  0.0028
BLOCK    2  33.76848485      8.89  0.0074
Source  DF  Type III SS   F Value  Pr > F
TYPE     6  98.19681818      8.62  0.0026
BLOCK    2  33.76848485      8.89  0.0074
BLOCK BEFORE TYPE
Source  DF  Type I SS     F Value  Pr > F
BLOCK    2  31.50777778      8.30  0.0091
TYPE     6  98.19681818      8.62  0.0026
Source  DF  Type III SS   F Value  Pr > F
BLOCK    2  33.76848485      8.89  0.0074
TYPE     6  98.19681818      8.62  0.0026
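The reason the order matters can be seen by computing the sums of squares as differences in residual sums of squares between nested models. The sketch below is in Python and is not what PROC GLM does internally: it fits the additive model by alternating mean sweeps until they settle (an assumption made here for simplicity, which should agree with least squares for this layout), then forms the sequential (Type I) and adjusted (Type III) quantities.

```python
NA = None  # marks a missing observation

# Rows: clarion, clinton, knox, o'neill, compost, wabash, webster; columns: B1..B3.
y = [
    [32.7, 32.3, NA],
    [32.1, 29.7, 29.1],
    [35.7, 35.9, 33.1],
    [NA, 34.2, 31.2],
    [31.8, 28.0, 29.2],
    [38.2, 37.8, 31.9],
    [32.5, NA, 29.7],
]
cells = [(i, j, v) for i, row in enumerate(y)
         for j, v in enumerate(row) if v is not NA]


def residual_ss(use_type, use_block, sweeps=500):
    """Residual SS of the additive model, fitted by alternating mean sweeps."""
    mu = sum(v for _, _, v in cells) / len(cells)
    a = [0.0] * len(y)       # soil-type effects
    b = [0.0] * len(y[0])    # block effects
    for _ in range(sweeps):
        if use_type:
            for i in range(len(a)):
                r = [v - mu - b[j] for ii, j, v in cells if ii == i]
                a[i] = sum(r) / len(r)
        if use_block:
            for j in range(len(b)):
                r = [v - mu - a[i] for i, jj, v in cells if jj == j]
                b[j] = sum(r) / len(r)
    return sum((v - mu - a[i] - b[j]) ** 2 for i, j, v in cells)


sse_null = residual_ss(False, False)   # corrected total SS
sse_type = residual_ss(True, False)
sse_block = residual_ss(False, True)
sse_full = residual_ss(True, True)

type_first = sse_null - sse_type          # Type I SS for TYPE entered first
block_after_type = sse_type - sse_full    # BLOCK adjusted for TYPE
block_first = sse_null - sse_block        # Type I SS for BLOCK entered first
type_after_block = sse_block - sse_full   # TYPE adjusted for BLOCK

print(f"TYPE first        {type_first:.5f}")
print(f"BLOCK after TYPE  {block_after_type:.5f}")
print(f"BLOCK first       {block_first:.5f}")
print(f"TYPE after BLOCK  {type_after_block:.5f}")
```

The term entered first gets credit for everything it can explain on its own; the term entered second only gets what is left. With missing cells the two factors are no longer orthogonal, so these two amounts differ.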
THE BASIC STATS
Source  DF  Sum of Squares  F Value  Pr > F
Model    8    129.70459596     8.54  0.0021
Error    9     17.08484848
Total   17    146.78944444
R-Square      C.V.  STEMLENG Mean
0.883610  4.238642     32.5055556
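The summary figures in this table are simple functions of the sums of squares. As a quick check, here is a short Python sketch that takes the values printed above as given and recomputes the F value, R-Square and C.V.:

```python
import math

# Values copied from the SAS output above.
ss_model, df_model = 129.70459596, 8
ss_error, df_error = 17.08484848, 9
ss_total = 146.78944444
mean_stemleng = 32.5055556

mse = ss_error / df_error                   # error mean square
f_value = (ss_model / df_model) / mse       # overall F for the model
r_square = ss_model / ss_total              # share of variation explained
cv = 100 * math.sqrt(mse) / mean_stemleng   # coefficient of variation, in percent

print(f"F = {f_value:.2f}, R-Square = {r_square:.6f}, C.V. = {cv:.6f}")
```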
PRINCIPAL COMPONENTS
Principal components analysis is a method for dimension reduction.
Applications:
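Data: $y_i = (y_{i1}, \ldots, y_{ip})'$, $i = 1, \ldots, n$; we assume that the $y_i$ are centered.
Let $A$ be an orthogonal transformation such that the $z_i = A y_i$ are uncorrelated.
$S_z = A S A' = \mathrm{diag}(s^2_{z_1}, \ldots, s^2_{z_p})$
Since $A$ is orthogonal, $A' = A^{-1}$, so $S$ and $S_z$ have the same eigenvalues and the rows of $A$ are the eigenvectors of $S$.
The eigenvalues of $S$ are $\lambda_1 = s^2_{z_1}, \ldots, \lambda_p = s^2_{z_p}$.
The proportion of the variance explained by $k$ components is $(\lambda_1 + \cdots + \lambda_k) / (\lambda_1 + \cdots + \lambda_p)$.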
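For two variables the orthogonal transformation is just a rotation, and the angle has a closed form, which makes the idea easy to check by hand. The Python sketch below uses a small made-up dataset (the numbers are hypothetical and chosen only for illustration) and verifies that the rotated variables are uncorrelated and that the total variance is preserved.

```python
import math

# Hypothetical two-variable data, for illustration only.
y1 = [2.0, 4.0, 5.0, 7.0, 9.0, 10.0]
y2 = [1.0, 3.0, 2.0, 6.0, 7.0, 9.0]


def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


def cov(u, v):
    """Sample covariance of two already-centered variables."""
    return sum(a * b for a, b in zip(u, v)) / (len(u) - 1)


y1, y2 = center(y1), center(y2)
s11, s22, s12 = cov(y1, y1), cov(y2, y2), cov(y1, y2)

# Rotation angle that makes the covariance of the rotated variables zero.
theta = 0.5 * math.atan2(2 * s12, s11 - s22)
c, s = math.cos(theta), math.sin(theta)

# z = A y with A = [[c, s], [-s, c]], an orthogonal matrix.
z1 = [c * a + s * b for a, b in zip(y1, y2)]
z2 = [-s * a + c * b for a, b in zip(y1, y2)]

lam1, lam2 = cov(z1, z1), cov(z2, z2)   # variances of the components
print(f"cov(z1, z2) = {cov(z1, z2):.2e}")
print(f"variances: {lam1:.4f}, {lam2:.4f}")
print(f"proportion explained by the first component: {lam1 / (lam1 + lam2):.4f}")
```

With more than two variables the same role is played by the eigenvectors of the covariance (or correlation) matrix, which is what PROC PRINCOMP computes.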
Example: Here we try to group crime variables into components that give a simpler interpretation of the various forms of crime.
options ls=64 ps=50;
DATA CRIME;
TITLE 'CRIME RATES PER 100,000 POPULATION BY STATE';
INPUT STATE $1-15 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
CARDS;
ALABAMA         14.2 25.2 96.8 278.3 1135.5 1881.9 280.7
ALASKA          10.8 51.6 96.8 284.0 1331.7 3369.8 753.3
ARIZONA         9.5 34.2 138.2 312.3 2346.1 4467.4 439.5
ARKANSAS        8.8 27.6 83.2 203.4 972.6 1862.1 183.4
CALIFORNIA      11.5 49.4 287.0 358.0 2139.4 3499.8 663.5
COLORADO        6.3 42.0 170.7 292.9 1935.2 3903.2 477.1
CONNECTICUT     4.2 16.8 129.5 131.8 1346.0 2620.7 593.2
DELAWARE        6.0 24.9 157.0 194.2 1682.6 3678.4 467.0
FLORIDA         10.2 39.6 187.9 449.1 1859.9 3840.5 351.4
GEORGIA         11.7 31.1 140.5 256.5 1351.1 2170.2 297.9
HAWAII          7.2 25.5 128.0 64.1 1911.5 3920.4 489.4
IDAHO           5.5 19.4 39.6 172.5 1050.8 2599.6 237.6
ILLINOIS        9.9 21.8 211.3 209.0 1085.0 2828.5 528.6
INDIANA         7.4 26.5 123.2 153.5 1086.2 2498.7 377.4
IOWA            2.3 10.6 41.2 89.8 812.5 2685.1 219.9
KANSAS          6.6 22.0 100.7 180.5 1270.4 2739.3 244.3
KENTUCKY        10.1 19.1 81.1 123.3 872.2 1662.1 245.4
LOUISIANA       15.5 30.9 142.9 335.5 1165.5 2469.9 337.7
MAINE           2.4 13.5 38.7 170.0 1253.1 2350.7 246.9
MARYLAND        8.0 34.8 292.1 358.9 1400.0 3177.7 428.5
MASSACHUSETTS   3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
MICHIGAN        9.3 38.9 261.9 274.6 1522.7 3159.0 545.5
MINNESOTA       2.7 19.5 85.9 85.8 1134.7 2559.3 343.1
MISSISSIPPI     14.3 19.6 65.7 189.1 915.6 1239.9 144.4
MISSOURI        9.6 28.3 189.0 233.5 1318.3 2424.2 378.4
MONTANA         5.4 16.7 39.2 156.8 804.9 2773.2 309.2
NEBRASKA        3.9 18.1 64.7 112.7 760.0 2316.1 249.1
NEVADA          15.8 49.1 323.1 355.0 2453.1 4212.6 559.2
NEW HAMPSHIRE   3.2 10.7 23.2 76.0 1041.7 2343.9 293.4
NEW JERSEY      5.6 21.0 180.4 185.1 1435.8 2774.5 511.5
NEW MEXICO      8.8 39.1 109.6 343.4 1418.7 3008.6 259.5
NEW YORK        10.7 29.4 472.6 319.1 1728.0 2782.0 745.8
NORTH CAROLINA  10.6 17.0 61.3 318.3 1154.1 2037.8 192.1
NORTH DAKOTA    0.9 9.0 13.3 43.8 446.1 1843.0 144.7
OHIO            7.8 27.3 190.5 181.1 1216.0 2696.8 400.4
OKLAHOMA        8.6 29.2 73.8 205.0 1288.2 2228.1 326.8
OREGON          4.9 39.9 124.1 286.9 1636.4 3506.1 388.9
PENNSYLVANIA    5.6 19.0 130.3 128.0 877.5 1624.1 333.2
RHODE ISLAND    3.6 10.5 86.5 201.0 1489.5 2844.1 791.4
SOUTH CAROLINA  11.9 33.0 105.9 485.3 1613.6 2342.4 245.1
SOUTH DAKOTA    2.0 13.5 17.9 155.7 570.5 1704.4 147.5
TENNESSEE       10.1 29.7 145.8 203.9 1259.7 1776.5 314.0
TEXAS           13.3 33.8 152.4 208.2 1603.1 2988.7 397.6
UTAH            3.5 20.3 68.8 147.3 1171.6 3004.6 334.5
VERMONT         1.4 15.9 30.8 101.2 1348.2 2201.0 265.2
VIRGINIA        9.0 23.3 92.1 165.7 986.2 2521.2 226.7
WASHINGTON      4.3 39.6 106.2 224.8 1605.6 3386.9 360.3
WEST VIRGINIA   6.0 13.2 42.2 90.9 597.4 1341.7 163.3
WISCONSIN       2.8 12.9 52.2 63.7 846.9 2614.2 220.7
WYOMING         5.4 21.9 39.7 173.9 811.6 2772.2 282.0
;
PROC PRINCOMP OUT=CRIMCOMP;
PROC SORT;
BY PRIN1;
PROC PRINT;
ID STATE;
VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
TITLE2 'STATES LISTED IN ORDER OF OVERALL CRIME RATE';
TITLE3 'AS DETERMINED BY THE FIRST PRINCIPAL COMPONENT';
PROC SORT;
BY PRIN2;
PROC PRINT;
ID STATE;
VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
TITLE2 'STATES LISTED IN ORDER OF PROPERTY VS. VIOLENT CRIME';
TITLE3 'AS DETERMINED BY THE SECOND PRINCIPAL COMPONENT';
PROC PLOT;
PLOT PRIN2*PRIN1=STATE;
TITLE2 'PLOT OF THE FIRST TWO PRINCIPAL COMPONENTS';
PROC PLOT;
PLOT PRIN3*PRIN1=STATE;
TITLE2 'PLOT OF THE FIRST AND THIRD PRINCIPAL COMPONENTS';
CRIME RATES PER 100,000 POPULATION BY STATE
Principal Component Analysis
50 Observations
7 Variables
Simple Statistics
           MURDER         RAPE      ROBBERY      ASSAULT
Mean  7.444000000  25.73400000  124.0920000  211.3000000
StD   3.866768941  10.75962995   88.3485672  100.2530492
         BURGLARY      LARCENY         AUTO
Mean  1291.904000  2671.288000  377.5260000
StD    432.455711   725.908707  193.3944175
Correlation Matrix
          MURDER    RAPE  ROBBERY  ASSAULT
MURDER    1.0000  0.6012   0.4837   0.6486
RAPE      0.6012  1.0000   0.5919   0.7403
ROBBERY   0.4837  0.5919   1.0000   0.5571
ASSAULT   0.6486  0.7403   0.5571   1.0000
BURGLARY  0.3858  0.7121   0.6372   0.6229
LARCENY   0.1019  0.6140   0.4467   0.4044
AUTO      0.0688  0.3489   0.5907   0.2758
          BURGLARY  LARCENY    AUTO
MURDER      0.3858   0.1019  0.0688
RAPE        0.7121   0.6140  0.3489
ROBBERY     0.6372   0.4467  0.5907
ASSAULT     0.6229   0.4044  0.2758
BURGLARY    1.0000   0.7921  0.5580
LARCENY     0.7921   1.0000  0.4442
AUTO        0.5580   0.4442  1.0000
Eigenvalues of the Correlation Matrix
       Eigenvalue  Difference  Proportion  Cumulative
PRIN1     4.11496     2.87624    0.587851     0.58785
PRIN2     1.23872     0.51291    0.176960     0.76481
PRIN3     0.72582     0.40938    0.103688     0.86850
PRIN4     0.31643     0.05846    0.045205     0.91370
PRIN5     0.25797     0.03593    0.036853     0.95056
PRIN6     0.22204     0.09798    0.031720     0.98228
PRIN7     0.12406      .         0.017722     1.00000
Eigenvectors
             PRIN1     PRIN2     PRIN3     PRIN4
MURDER    0.300279  -.629174  0.178245  -.232114
RAPE      0.431759  -.169435  -.244198  0.062216
ROBBERY   0.396875  0.042247  0.495861  -.557989
ASSAULT   0.396652  -.343528  -.069510  0.629804
BURGLARY  0.440157  0.203341  -.209895  -.057555
LARCENY   0.357360  0.402319  -.539231  -.234890
AUTO      0.295177  0.502421  0.568384  0.419238
             PRIN5     PRIN6     PRIN7
MURDER    0.538123  0.259117  0.267593
RAPE      0.188471  -.773271  -.296485
ROBBERY   -.519977  -.114385  -.003903
ASSAULT   -.506651  0.172363  0.191745
BURGLARY  0.101033  0.535987  -.648117
LARCENY   0.030099  0.039406  0.601690
AUTO      0.369753  -.057298  0.147046
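The first eigenvalue and eigenvector can be recovered from the printed correlation matrix alone. The sketch below, in Python, uses power iteration, a simple method chosen here for illustration (PROC PRINCOMP uses a full eigendecomposition). Because the correlations are rounded to four decimals, the results agree with the SAS output only to about three decimals.

```python
import math

names = ["MURDER", "RAPE", "ROBBERY", "ASSAULT", "BURGLARY", "LARCENY", "AUTO"]
R = [
    [1.0000, 0.6012, 0.4837, 0.6486, 0.3858, 0.1019, 0.0688],
    [0.6012, 1.0000, 0.5919, 0.7403, 0.7121, 0.6140, 0.3489],
    [0.4837, 0.5919, 1.0000, 0.5571, 0.6372, 0.4467, 0.5907],
    [0.6486, 0.7403, 0.5571, 1.0000, 0.6229, 0.4044, 0.2758],
    [0.3858, 0.7121, 0.6372, 0.6229, 1.0000, 0.7921, 0.5580],
    [0.1019, 0.6140, 0.4467, 0.4044, 0.7921, 1.0000, 0.4442],
    [0.0688, 0.3489, 0.5907, 0.2758, 0.5580, 0.4442, 1.0000],
]
p = len(R)


def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


# Power iteration: repeatedly apply R and renormalize to unit length.
v = [1.0] * p
for _ in range(200):
    w = matvec(R, v)
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# Rayleigh quotient gives the eigenvalue for the converged unit vector.
lam1 = sum(a * b for a, b in zip(v, matvec(R, v)))

print(f"first eigenvalue: {lam1:.5f}  (proportion {lam1 / p:.4f})")
for name, loading in zip(names, v):
    print(f"{name:9s}{loading:9.6f}")
```

All seven loadings on the first component are positive and of similar size, which is why the SAS titles read PRIN1 as an overall crime rate.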
Plot of PRINCIPAL COMPONENTS (Data and Variables)
[Plot of PRIN2*PRIN1. Symbol is value of STATE. Vertical axis: PRIN2, from -2 to 2. Horizontal axis: PRIN1, from -4 to 4.]
[Plot of PRIN3*PRIN1. Symbol is value of STATE. Vertical axis: PRIN3, from -1 to 2. Horizontal axis: PRIN1, from -4 to 4.]
How many components?
· Explain some fixed percentage of the variance (70%, 80%, …)
· Exclude eigenvalues less than the average. (For the correlation matrix the average is 1.)
· Graph of eigenvalues
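The first two rules above are easy to apply mechanically. A small Python sketch, using the eigenvalues of the correlation matrix from the SAS output (the helper name `components_for_share` is illustrative):

```python
# Eigenvalues of the correlation matrix, copied from the PROC PRINCOMP output.
eig = [4.11496, 1.23872, 0.72582, 0.31643, 0.25797, 0.22204, 0.12406]
total = sum(eig)

# Cumulative proportion of variance explained by the first k components.
cum = []
running = 0.0
for e in eig:
    running += e
    cum.append(running / total)


def components_for_share(target):
    """Smallest k whose cumulative proportion reaches the target."""
    return next(k for k, c in enumerate(cum, start=1) if c >= target)


k70 = components_for_share(0.70)
k80 = components_for_share(0.80)

# Rule 2: keep components whose eigenvalue exceeds the average eigenvalue.
average = total / len(eig)
k_avg = sum(1 for e in eig if e > average)

print(f"70% rule: {k70} components, 80% rule: {k80} components")
print(f"average eigenvalue {average:.3f}: keep {k_avg} components")
```

For these data the rules disagree slightly (two or three components), which is common; the eigenvalue graph is then a useful tie-breaker.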

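Test whether the last $k$ eigenvalues are equal.
Let $\bar{\lambda} = \frac{1}{k} \sum_{i=p-k+1}^{p} \lambda_i$ be the average of the last $k$ eigenvalues.
The test statistic is $u = \left(n - \frac{2p+11}{6}\right)\left(k \ln \bar{\lambda} - \sum_{i=p-k+1}^{p} \ln \lambda_i\right)$.
The test statistic $u$ is approximately $\chi^2$ with df = $(k-1)(k+2)/2$.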
In the example dataset: The last four eigenvalues are small
> (50 - (2*7+11)/6)*(4*log(mei)-sum(log(ei)))
[1] 10.12649
> qchisq(0.95,9)
[1] 16.91898
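Now with the last 5 eigenvalues:
> (50 - (2*7+11)/6)*(5*log(mei)-sum(log(ei)))
[1] 39.57434
> qchisq(0.95,14)
[1] 23.68475
Since 10.13 < 16.92, equality of the last four eigenvalues is not rejected; since 39.57 > 23.68, equality of the last five is rejected, which suggests keeping three components.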
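The R commands above can be wrapped in a small function. The following Python sketch mirrors them, assuming that `ei` holds the last k eigenvalues and `mei` their mean, as the commands imply. The critical values are the `qchisq` results quoted above rather than recomputed, and the rounded eigenvalues from the SAS output are used, so the statistics may differ from the R output in the last digits.

```python
import math

# Eigenvalues of the correlation matrix, copied from the PROC PRINCOMP output.
eig = [4.11496, 1.23872, 0.72582, 0.31643, 0.25797, 0.22204, 0.12406]
n = 50  # number of states


def equal_tail_statistic(eigenvalues, k, n_obs):
    """Statistic and df for testing that the last k eigenvalues are equal."""
    p = len(eigenvalues)
    tail = eigenvalues[-k:]
    mean_tail = sum(tail) / k
    u = (n_obs - (2 * p + 11) / 6) * (
        k * math.log(mean_tail) - sum(math.log(e) for e in tail)
    )
    df = (k - 1) * (k + 2) // 2
    return u, df


# 95% chi-square critical values quoted in the text: qchisq(0.95, df).
critical = {9: 16.91898, 14: 23.68475}

u4, df4 = equal_tail_statistic(eig, 4, n)
u5, df5 = equal_tail_statistic(eig, 5, n)

for k, u, df in [(4, u4, df4), (5, u5, df5)]:
    verdict = "reject" if u > critical[df] else "do not reject"
    print(f"last {k}: u = {u:.3f}, df = {df}, critical = {critical[df]:.3f} -> {verdict}")
```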