Q1_df <- read.table("~/Documents/repos/PSYCH7821/assets/assignments/HW2/ex02_question1_data.txt",
header = TRUE)
head(Q1_df)
## id sex sctyp civics motiv locus concept read write math science
## 1 1 2 1 40.6 0.67 0.29 0.88 33.6 43.7 40.2 39.0
## 2 2 1 1 45.6 0.33 -0.42 0.03 46.9 35.9 41.9 36.3
## 3 6 1 1 35.6 0.00 0.46 0.03 49.5 46.3 46.2 41.7
## 4 7 1 1 55.6 0.33 0.44 -0.47 62.7 64.5 48.0 63.4
## 5 8 2 1 55.6 1.00 0.68 0.25 44.2 51.5 36.9 49.8
## 6 9 1 1 55.6 0.33 0.06 0.56 46.9 41.1 45.3 47.1
We see that the variables sex and sctyp are categorical; civics and motiv need more inspection. Let’s tabulate the counts for each:
table(Q1_df$motiv)
##
## 0 0.33 0.67 1
## 41 65 84 127
table(Q1_df$civics)
##
## 25.7 30.6 33.1 35.6 36.9 39.4 40.6 41.9 43.1 45.6 46.8 48.1 49.3 50.6 51.8
## 1 9 1 18 2 1 24 2 1 50 2 2 1 53 2
## 53.1 55.6 56.8 58.1 59.3 60.5 61.8 65.5 66.8 70.5
## 1 46 5 1 1 53 1 30 2 8
Now we can see that motiv is an ordinal variable; more specifically, it is a polytomous variable with four levels. The variable civics, on the other hand, seems to have a mixture distribution. This leads me to think that the data have been previously preprocessed in a way that might affect the analysis.
NOTE 1: The optimal coefficient vector \(\beta\) that the regression fit computes depends on the distributions of the predictor variables!
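To see this point in action, here is a minimal sketch using simulated data (not the assignment data): rescaling a predictor rescales its fitted coefficient by the inverse factor.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
coef(lm(y ~ x))          # slope is near 2
coef(lm(y ~ I(10 * x)))  # same relationship, but the slope is near 0.2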
Let’s compute the correlation matrix (correlation is scale-invariant, so we don’t need to standardize the data first). Note that I am leaving out civics and motiv: Pearson’s correlation is a measure of linear association appropriate for continuous variables, and these two variables are not really continuous.
cor(Q1_df[ , 6:11])
## locus concept read write math science
## locus 1.0000000 0.10943434 0.3752034 0.35061385 0.3618460 0.3365816
## concept 0.1094343 1.00000000 0.1112732 0.05590149 0.1338509 0.1061843
## read 0.3752034 0.11127317 1.0000000 0.63420287 0.6628906 0.6621116
## write 0.3506139 0.05590149 0.6342029 1.00000000 0.6391226 0.5480410
## math 0.3618460 0.13385089 0.6628906 0.63912259 1.0000000 0.6514850
## science 0.3365816 0.10618426 0.6621116 0.54804100 0.6514850 1.0000000
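If we did want a measure of association involving the ordinal variable motiv, a rank-based coefficient such as Spearman’s would be more defensible. A minimal sketch:
# Spearman's rank correlation does not assume interval-scale measurement
cor(Q1_df$motiv, Q1_df[ , 6:11], method = "spearman")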
If we plan to fit a regression model, we need to worry about collinearity (a quick way to quantify it is sketched below). The four test-score variables seem correlated with one another, so we need to deal with this. One solution is to obtain a second dataset with uncorrelated predictor variables. A second approach is regularization and shrinkage, methods commonly used for high-dimensional datasets (n << p). A third approach is to identify the correlated variables and include only one of them in the model. Since choosing which one is not straightforward, principal components analysis can be used instead to reduce the correlated variables to one or two uncorrelated components.
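A minimal sketch of quantifying the collinearity, assuming the car package is installed (the model below is just an illustrative regression among the test variables, not one we fit later):
# variance inflation factors above roughly 5-10 are usually taken as a
# sign of problematic collinearity
library(car)
vif(lm(read ~ write + math + science + locus + concept, data = Q1_df))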
This is the most important step in data analysis. If you do not plot your data first, you might as well be stumbling in the dark. The data we have are multidimensional, so a pairs plot is a good way to look for patterns. You can use the pairs function to plot the data, but I am writing my own function so I can use web graphics instead.
library(rbokeh)
plotPairs <- function(df, colms, colr){
  # all ordered pairs of the selected column names
  nms <- expand.grid(names(df)[colms],
                     rev(names(df)[colms]),
                     stringsAsFactors = FALSE)
  splom_list <- vector("list", nrow(nms))  # one panel per pair
  for (ii in seq_len(nrow(nms))) {
    splom_list[[ii]] <- figure(width = 110,
                               height = 110,
                               tools = c("pan", "box_zoom", "reset"),
                               xlab = nms$Var1[ii],
                               ylab = nms$Var2[ii]) %>%
      ly_points(nms$Var1[ii],
                nms$Var2[ii],
                data = df,
                color = as.factor(df[ , colr]),  # color by the grouping column
                size = 5,
                legend = FALSE)
  }
  # assemble the panels into a linked scatterplot matrix
  p <- grid_plot(splom_list,
                 ncol = length(colms),
                 same_axes = TRUE,
                 link_data = TRUE)
  return(p)
}
Let’s plot columns 4 to 11 and color the observations by sex (column 2):
plotPairs(Q1_df, 4:11, 2)
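If rbokeh is not available, a static base-graphics fallback gives a similar (non-interactive) view:
# base-graphics pairs plot, colored by sex
pairs(Q1_df[ , 4:11], col = Q1_df$sex, pch = 16, cex = 0.6)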
Not all the variables are on the same scale, which is problematic if we are running PCA. Note: read, write, math, and science seem to be measured on the same scale, so there would be no need to standardize those variables for PCA alone. That said, since we are analyzing the entire dataset, which contains variables on different scales, it makes more sense to standardize the whole dataset (except, of course, the categorical variables).
Next we standardize and plot the data.
Q1_df[ , 4:11] <- scale(Q1_df[ , 4:11],
center = TRUE,
scale = TRUE)
plotPairs(Q1_df, 4:11, 2)
Let’s reduce the four test-score variables (read, write, math, science) to fewer dimensions. You can fit the model with a single call to the function prcomp like this:
pca_fit <- prcomp(~ read + write + math + science,
                  data = Q1_df,
                  center = TRUE,
                  scale. = FALSE)  # prcomp's scaling argument is spelled "scale."
The scale. argument is FALSE since the data have been standardized already. The proportion of variance explained can be obtained using the summary function:
summary(pca_fit)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7030 0.6736 0.58044 0.55596
## Proportion of Variance 0.7251 0.1134 0.08423 0.07727
## Cumulative Proportion 0.7251 0.8385 0.92273 1.00000
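These proportions are just the normalized squared component standard deviations, which we can verify directly:
# proportion of variance explained, computed from sdev by hand
pve <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
round(rbind(Proportion = pve, Cumulative = cumsum(pve)), 5)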
Now pca_fit is a list whose main components are rotation, sdev, and x (plus the centering and scaling information). You can access them with the $ operator, so for example, Q is the loading matrix (eigenvectors):
Q <- pca_fit$rotation
Q
## PC1 PC2 PC3 PC4
## read 0.5115308 -0.089313051 0.64949737 -0.5554391
## write 0.4845572 0.754212731 0.09266702 0.4333362
## math 0.5104171 0.001790006 -0.75452230 -0.4125134
## science 0.4929654 -0.650525273 0.01618916 0.5775291
Z is the matrix of sample PC scores, and sigma is the vector of PC standard deviations:
X <- scale(Q1_df[ , 8:11],   # the four (already standardized) test scores
           center = TRUE,
           scale = FALSE)
Z <- X %*% Q                 # project the data onto the loadings
# or equivalently
Z <- pca_fit$x
# standardized scores: divide each column by its standard deviation
sigma <- pca_fit$sdev
Zs <- Z %*% diag(1/sigma)
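As a sanity check, each column of Zs should now have sample standard deviation 1:
apply(Zs, 2, sd)  # should be (close to) 1 for every component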
The standardized first principal component can be obtained by subsetting:
z1s <- Zs[ , 1]
Let’s add PC1 to the dataset:
Q1_df <- cbind(Q1_df,z1s)
plotPairs(Q1_df, c(4:7,12), 2)
The predictor variables no longer seem to be strongly correlated, so we fit the regression model:
fit1 <- lm(z1s ~ sex + sctyp + civics + motiv + locus + concept,
data = Q1_df)
summary(fit1)
##
## Call:
## lm(formula = z1s ~ sex + sctyp + civics + motiv + locus + concept,
## data = Q1_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.90895 -0.54151 -0.02537 0.58217 2.16337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03870 0.19587 -0.198 0.843
## sex -0.05733 0.08762 -0.654 0.513
## sctyp 0.10971 0.11377 0.964 0.336
## civics 0.52172 0.04422 11.798 < 2e-16 ***
## motiv 0.05322 0.04587 1.160 0.247
## locus 0.26479 0.04493 5.894 9.84e-09 ***
## concept 0.04411 0.04441 0.993 0.321
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.753 on 310 degrees of freedom
## Multiple R-squared: 0.4437, Adjusted R-squared: 0.433
## F-statistic: 41.21 on 6 and 310 DF, p-value: < 2.2e-16
# plot diagnostics
par(mfrow = c(2,2))
plot(fit1)
The model assumptions seem to hold for the most part.
To find which of these 6 variables are the most important and statistically significant predictors, we can perform stepwise regression using the AIC.
NOTE 2: The step function in R takes proper account of the number of parameters fitted, dropping or adding variables (sometimes as a group) so as to minimize the AIC. If the package you are using selects on F-statistics instead, adding “significant” terms and dropping “non-significant” ones, then multiple-comparison issues are not properly dealt with by your package (Friedman, Hastie, & Tibshirani, 2001) and I would recommend switching to another package. A better option would be to take a totally different approach and use ridge regression.
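For reference, here is a minimal ridge sketch, assuming the glmnet package is available (Xmat and ridge_cv are names introduced here, not part of the assignment):
# alpha = 0 gives the ridge penalty; cv.glmnet picks the penalty strength
# by cross-validation
library(glmnet)
Xmat <- as.matrix(Q1_df[ , c("sex", "sctyp", "civics", "motiv", "locus", "concept")])
ridge_cv <- cv.glmnet(Xmat, Q1_df$z1s, alpha = 0)
coef(ridge_cv, s = "lambda.min")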
stepFit <- step(fit1)
## Start: AIC=-172.93
## z1s ~ sex + sctyp + civics + motiv + locus + concept
##
## Df Sum of Sq RSS AIC
## - sex 1 0.243 176.02 -174.488
## - sctyp 1 0.527 176.31 -173.976
## - concept 1 0.559 176.34 -173.918
## - motiv 1 0.763 176.54 -173.552
## <none> 175.78 -172.925
## - locus 1 19.697 195.48 -141.257
## - civics 1 78.932 254.71 -57.349
##
## Step: AIC=-174.49
## z1s ~ sctyp + civics + motiv + locus + concept
##
## Df Sum of Sq RSS AIC
## - sctyp 1 0.533 176.56 -175.529
## - motiv 1 0.636 176.66 -175.345
## - concept 1 0.693 176.72 -175.243
## <none> 176.02 -174.488
## - locus 1 19.547 195.57 -143.107
## - civics 1 78.693 254.72 -59.345
##
## Step: AIC=-175.53
## z1s ~ civics + motiv + locus + concept
##
## Df Sum of Sq RSS AIC
## - motiv 1 0.676 177.23 -176.317
## - concept 1 0.677 177.23 -176.316
## <none> 176.56 -175.529
## - locus 1 19.871 196.43 -143.721
## - civics 1 79.382 255.94 -59.828
##
## Step: AIC=-176.32
## z1s ~ civics + locus + concept
##
## Df Sum of Sq RSS AIC
## - concept 1 1.098 178.33 -176.360
## <none> 177.23 -176.317
## - locus 1 21.960 199.19 -141.289
## - civics 1 81.696 258.93 -58.146
##
## Step: AIC=-176.36
## z1s ~ civics + locus
##
## Df Sum of Sq RSS AIC
## <none> 178.33 -176.360
## - locus 1 23.151 201.48 -139.667
## - civics 1 82.369 260.70 -57.985
The final model is
summary(stepFit)
##
## Call:
## lm(formula = z1s ~ civics + locus, data = Q1_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0128 -0.5817 -0.0034 0.5573 2.1460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.692e-17 4.233e-02 0.000 1
## civics 5.289e-01 4.391e-02 12.043 < 2e-16 ***
## locus 2.804e-01 4.391e-02 6.385 6.17e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7536 on 314 degrees of freedom
## Multiple R-squared: 0.4357, Adjusted R-squared: 0.4321
## F-statistic: 121.2 on 2 and 314 DF, p-value: < 2.2e-16
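If you want to compare the full and selected fits directly, AIC() gives the same ordering as step()’s printout (the absolute values differ because step drops an additive constant):
AIC(fit1, stepFit)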
Q2_df <- read.table("~/Documents/repos/PSYCH7821/assets/assignments/HW2/ex02_question2_data.txt",
header = TRUE)
head(Q2_df)
## id gpa spatial2d spatial3d mech_res arith computat
## 1 6 51 25 23 11 21 14
## 2 9 40 9 9 15 7 8
## 3 21 36 3 5 8 6 11
## 4 24 53 18 26 7 13 10
## 5 25 40 14 11 7 11 14
## 6 42 59 20 5 12 24 12
summary(Q2_df)
## id gpa spatial2d spatial3d
## Min. : 6.0 Min. :21.00 Min. : 3.00 Min. : 2.00
## 1st Qu.:101.0 1st Qu.:36.00 1st Qu.:10.50 1st Qu.: 8.00
## Median :240.0 Median :43.00 Median :16.00 Median :12.00
## Mean :221.4 Mean :42.38 Mean :16.53 Mean :13.28
## 3rd Qu.:325.0 3rd Qu.:49.00 3rd Qu.:24.00 3rd Qu.:17.50
## Max. :425.0 Max. :68.00 Max. :30.00 Max. :29.00
## mech_res arith computat
## Min. : 2.00 Min. : 6.00 Min. : 5.00
## 1st Qu.: 8.00 1st Qu.:16.00 1st Qu.:11.00
## Median :10.00 Median :21.00 Median :12.00
## Mean :10.03 Mean :20.09 Mean :12.91
## 3rd Qu.:12.00 3rd Qu.:24.00 3rd Qu.:14.00
## Max. :20.00 Max. :35.00 Max. :25.00
Except for gpa, the other variables seem to have been measured on the same reference-normed scale. Accordingly, there doesn’t seem to be a need for standardization. For PCA we also keep the optional scaling off (the default).
plotPairs(Q2_df, 2:7, 1)
The 6 variables all seem to be highly correlated.
fit2 <- lm(gpa ~ spatial2d + spatial3d + mech_res + arith + computat,
data = Q2_df)
summary(fit2)
##
## Call:
## lm(formula = gpa ~ spatial2d + spatial3d + mech_res + arith +
## computat, data = Q2_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4046 -5.3191 0.8052 4.2488 18.6125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.37285 3.90386 5.475 5.89e-07 ***
## spatial2d 0.11915 0.12831 0.929 0.35612
## spatial3d 0.08029 0.16004 0.502 0.61740
## mech_res 0.06643 0.26552 0.250 0.80314
## arith 0.31333 0.18503 1.693 0.09464 .
## computat 0.85278 0.30327 2.812 0.00632 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.749 on 73 degrees of freedom
## Multiple R-squared: 0.346, Adjusted R-squared: 0.3012
## F-statistic: 7.723 on 5 and 73 DF, p-value: 7.116e-06
NOTE 3: We see that only computat shows statistical significance, but that is expected given that the variables are correlated. Even in a situation where each of two highly correlated predictor variables perfectly predicts the response, you will see that only one of the two gets a highly significant p-value while the other doesn’t. At any rate, if the goal leans more towards prediction accuracy than model interpretability, you should not be paying much attention to p-values anyway.
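A quick illustration of this point using simulated data (not the assignment data):
# x1 and x2 are nearly identical; together they predict y well, yet the
# individual standard errors are so inflated that the t-tests look weak
set.seed(42)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)
y  <- x1 + x2 + rnorm(100)
summary(lm(y ~ x1 + x2))$coefficients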
OK, now we need to deal with the multicollinearity; PCA is one option.
First we run PCA on the five test variables:
pca_fit2 <- prcomp(~ spatial2d + spatial3d + mech_res + arith + computat,
                   data = Q2_df,
                   center = TRUE,
                   scale. = FALSE)
The proportion of variance explained can be obtained using the summary function
summary(pca_fit2)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 10.1111 6.2720 4.9561 3.27497 2.66315
## Proportion of Variance 0.5558 0.2139 0.1335 0.05831 0.03856
## Cumulative Proportion 0.5558 0.7696 0.9031 0.96144 1.00000
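A scree plot of the component variances is another common way to decide how many PCs to keep:
# look for an "elbow" in the curve
screeplot(pca_fit2, type = "lines")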
Now pca_fit2 is a list whose main components are rotation, sdev, and x. You can access them with the $ operator, so for example:
Q is the loading matrix (eigenvectors):
Q <- pca_fit2$rotation
Q
## PC1 PC2 PC3 PC4 PC5
## spatial2d -0.7086904 0.61291400 0.309177934 -0.12176929 0.10805352
## spatial3d -0.5329105 -0.19911301 -0.809081555 -0.07012241 -0.12973180
## mech_res -0.1369442 0.03961033 -0.007586042 0.98965660 0.01412676
## arith -0.4086301 -0.65739245 0.495785960 -0.02081940 -0.39321678
## computat -0.1674277 -0.38853138 0.062727794 -0.02003615 0.90370078
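Since the loading vectors are eigenvectors of the covariance matrix, they are orthonormal, which we can verify:
round(crossprod(Q), 10)  # Q'Q should be the identity matrix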
Z is the matrix of sample PC scores.
Z <- pca_fit2$x
First we add the five principal components to the data frame. Next, we fit the model with all five PCs:
Q2_df <- cbind(Q2_df, Z)
fit3 <- lm(gpa ~ PC1 + PC2 + PC3 + PC4 + PC5,
data = Q2_df)
summary(fit3)
##
## Call:
## lm(formula = gpa ~ PC1 + PC2 + PC3 + PC4 + PC5, data = Q2_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4046 -5.3191 0.8052 4.2488 18.6125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.37975 0.87187 48.608 < 2e-16 ***
## PC1 -0.40714 0.08678 -4.692 1.23e-05 ***
## PC2 -0.47764 0.13990 -3.414 0.00105 **
## PC3 0.18021 0.17704 1.018 0.31208
## PC4 0.02200 0.26792 0.082 0.93480
## PC5 0.65085 0.32947 1.975 0.05200 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.749 on 73 degrees of freedom
## Multiple R-squared: 0.346, Adjusted R-squared: 0.3012
## F-statistic: 7.723 on 5 and 73 DF, p-value: 7.116e-06
# plot diagnostics
par(mfrow = c(2, 2))
plot(fit3)
Nothing alarming! We move on to model selection:
stepFit3 <- step(fit3)
## Start: AIC=329.28
## gpa ~ PC1 + PC2 + PC3 + PC4 + PC5
##
## Df Sum of Sq RSS AIC
## - PC4 1 0.40 4384.2 327.29
## - PC3 1 62.22 4446.0 328.39
## <none> 4383.8 329.28
## - PC5 1 234.34 4618.1 331.40
## - PC2 1 700.00 5083.8 338.98
## - PC1 1 1321.86 5705.6 348.10
##
## Step: AIC=327.29
## gpa ~ PC1 + PC2 + PC3 + PC5
##
## Df Sum of Sq RSS AIC
## - PC3 1 62.22 4446.4 326.40
## <none> 4384.2 327.29
## - PC5 1 234.34 4618.5 329.40
## - PC2 1 700.00 5084.2 336.99
## - PC1 1 1321.86 5706.1 346.11
##
## Step: AIC=326.4
## gpa ~ PC1 + PC2 + PC5
##
## Df Sum of Sq RSS AIC
## <none> 4446.4 326.40
## - PC5 1 234.34 4680.7 328.46
## - PC2 1 700.00 5146.4 335.95
## - PC1 1 1321.86 5768.3 344.96
The final model is
summary(stepFit3)
##
## Call:
## lm(formula = gpa ~ PC1 + PC2 + PC5, data = Q2_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.5761 -4.8354 0.6633 4.3220 18.0740
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.37975 0.86628 48.921 < 2e-16 ***
## PC1 -0.40714 0.08622 -4.722 1.06e-05 ***
## PC2 -0.47764 0.13900 -3.436 0.000965 ***
## PC5 0.65085 0.32736 1.988 0.050446 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.7 on 75 degrees of freedom
## Multiple R-squared: 0.3366, Adjusted R-squared: 0.3101
## F-statistic: 12.69 on 3 and 75 DF, p-value: 8.596e-07
We see that the fifth principal component was included in the model. What to do? For a more interpretable model, just keep the first two.
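A minimal sketch of that refit (fit4 is a name introduced here):
fit4 <- lm(gpa ~ PC1 + PC2, data = Q2_df)
summary(fit4)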
NOTE 4: Note that I haven’t mentioned R-squared here at all. The reason is that R-squared is close to useless as a criterion of model fit: it never decreases when you add predictors (so it rewards overfitting), and it says nothing about whether the model assumptions hold or how well the model will predict new data.