Question 1 (Factor Analysis)

When we fit a factor analysis model, we usually assume that there are fewer factors than indicators (e.g. tests). For example, we can infer the existence of a hypothetical construct, a general factor of intelligence, from the shared variance explained across several ability tests (e.g. verbal, quantitative, spatial, memory).

For this exercise, I ask you to indulge me in assuming the opposite: that there are more factors than we have indicators! More specifically, that each test is actually a linear combination of a great number of latent ability traits we didn’t measure. In his 1916 paper “A hierarchy without a general factor”, the English educational psychologist Godfrey Thomson introduced exactly such a model. The data you will be using for this question has been simulated from a model based on Thomson’s sampling model. For concreteness, consider the following scenario:

Imagine a human population that possesses 550 distinct and independent mental abilities (latent traits). The abilities are normally distributed among its members (an individual’s profile is a point in this 550-dimensional space). We administer 10 tests to a sample of 50 test takers from this population. The data we collect is a 50 x 10 matrix, which you can find in the file “ThomsonData.csv”.

The simulation assumes that each test is a combination of general and specific ability traits. The general ability component of each test is a linear combination of a random subset of 500 of the 550 traits. These 500 traits are shared among the ten tests, so, for example, the general ability component of Test1 could be a linear combination of Trait1, Trait33, and Trait200, whereas for Test2 it could be a linear combination of Trait10, Trait33, and Trait200. The specific ability component is unique to each test.
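To make the setup concrete, here is a minimal sketch of a Thomson-style simulation in R. This is not the actual generating code: the weights, the subset size per test (100 here), and the seed are all arbitrary choices for illustration.

```r
# Sketch of a Thomson-style sampling model: each test's "general" component is
# a linear combination of a random subset of shared traits, plus a component
# that is unique to that test. All numeric choices below are illustrative.
set.seed(1)
n       <- 50    # test takers
n_trait <- 550   # independent latent abilities
n_test  <- 10    # observed tests
shared  <- 1:500 # the traits shared across tests

# Each person's trait profile is a point in 550-dimensional space.
traits <- matrix(rnorm(n * n_trait), nrow = n)

tests <- sapply(seq_len(n_test), function(j) {
  picked   <- sample(shared, size = 100)  # random subset of the shared traits
  general  <- drop(traits[, picked, drop = FALSE] %*% rnorm(length(picked)))
  specific <- rnorm(n)                    # component unique to test j
  general + specific
})
colnames(tests) <- paste0("test", seq_len(n_test))
dim(tests)  # 50 x 10, the same shape as ThomsonData.csv
```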

Question 1a

Specify a model with a single factor (check lavaan syntax for an example). Note that the variables in the dataset are named (“test1” , “test2” , “test3” ,“test4” , “test5” , “test6” ,“test7” , “test8” , “test9”, “test10”)

Solution 1a

The model specification for mean centered data is

\[\begin{aligned} \mathbf y_i &=\boldsymbol \Lambda \boldsymbol \eta_i + \boldsymbol \epsilon_i \\ \text{or expanded as}\quad \begin{bmatrix} y_{i,1} \\ \vdots\\ y_{i,10} \end{bmatrix} &= \begin{bmatrix} \lambda_1\\ \vdots\\ \lambda_{10} \end{bmatrix} \begin{bmatrix} \eta_i \end{bmatrix} + \begin{bmatrix} \epsilon_{i,1} \\ \vdots\\ \epsilon_{i,10} \end{bmatrix}\\ \text{where}\quad Cov(\boldsymbol \epsilon_i) = \boldsymbol \Theta=&\begin{bmatrix} \sigma^2_1 & 0 & \ldots & 0 \\ 0 & \sigma^2_2 & \ldots & 0 \\ \vdots &\vdots & \ddots & \vdots \\ 0 & \ldots & 0 &\sigma^2_{10} \end{bmatrix}, \\ Cov(\boldsymbol \eta_i) = \boldsymbol \Psi =& \begin{bmatrix} 1 \end{bmatrix}, \quad \boldsymbol \epsilon_i \perp \boldsymbol \eta_i, \quad \text{for } i=1, \ldots, 50. \end{aligned} \] where \(\mathbf y_i\) is the vector of the 10 test scores for the i’th person. The covariance structure form of the model for the population is \[ \boldsymbol \Sigma_{yy}= \boldsymbol\Lambda\boldsymbol \Psi \boldsymbol\Lambda^{T} + \boldsymbol \Theta \] In lavaan the model can be specified as shown below using the cfa function, which takes care of the above-mentioned assumptions under the hood.

fa_model <- 'f1 =~ test1 + test2 + test3 + test4 + test5 + test6 + test7 + test8 + test9 + test10'
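An equivalent, more explicit specification (a sketch; both forms describe the same model) frees the first loading and fixes the factor variance to 1 directly in the syntax, which is what the std.lv = TRUE argument to cfa() does automatically:

```r
# Equivalent explicit specification: NA* frees the first loading (which cfa()
# would otherwise fix to 1), and f1 ~~ 1*f1 standardizes the latent variable
# by fixing its variance to 1.
fa_model_explicit <- '
  f1 =~ NA*test1 + test2 + test3 + test4 + test5 +
        test6 + test7 + test8 + test9 + test10
  f1 ~~ 1*f1
'
```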

Question 1b

Read the data and fit the specified model with a standardized latent variable. Provide the standardized parameters and residuals.

Solution 1b

library(lavaan)

df <- read.csv("ThomsonData.csv")
fa_model_fit <- cfa(model = fa_model,
                    data = df, 
                    std.lv = TRUE)
lavInspect(fa_model_fit, "std")
## $lambda
##           f1
## test1  0.763
## test2  0.468
## test3  0.718
## test4  0.902
## test5  0.661
## test6  0.108
## test7  0.751
## test8  0.462
## test9  0.764
## test10 0.570
## 
## $theta
##        test1 test2 test3 test4 test5 test6 test7 test8 test9 test10
## test1  0.418                                                       
## test2  0.000 0.781                                                 
## test3  0.000 0.000 0.484                                           
## test4  0.000 0.000 0.000 0.186                                     
## test5  0.000 0.000 0.000 0.000 0.563                               
## test6  0.000 0.000 0.000 0.000 0.000 0.988                         
## test7  0.000 0.000 0.000 0.000 0.000 0.000 0.437                   
## test8  0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.786             
## test9  0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.417       
## test10 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.675 
## 
## $psi
##    f1
## f1 1

Correlation residuals tend to be easy to interpret. Here lavaan first transforms the observed and model-implied covariance matrices into correlation matrices. That is not the only choice: you could also request normalized residuals with residuals(fa_model_fit, type = "normalized"). Standardized residuals, however, are not usable here because some of them come out as NA.

corr_resid <- residuals(fa_model_fit,"cor")
round(corr_resid$cor,2)
##        test1 test2 test3 test4 test5 test6 test7 test8 test9 test10
## test1   0.00                                                       
## test2   0.11  0.00                                                 
## test3  -0.01  0.03  0.00                                           
## test4   0.03 -0.06 -0.02  0.00                                     
## test5   0.03  0.07  0.03 -0.03  0.00                               
## test6   0.12  0.12  0.04  0.00  0.07  0.00                         
## test7  -0.04 -0.10 -0.02  0.01  0.08 -0.13  0.00                   
## test8  -0.10  0.01 -0.08  0.04 -0.15 -0.16  0.05  0.00             
## test9  -0.04 -0.02  0.04 -0.01  0.02 -0.05  0.02  0.11  0.00       
## test10 -0.06  0.19  0.06  0.02 -0.07  0.01 -0.03  0.02 -0.03  0.00 

Question 1c

When assessing model fit you can compare the fit of your model to

  1. No model at all, by using an absolute fit index
  2. The null/independence model, which assumes that the indicators are uncorrelated, by using an incremental fit index
  3. The difference between what is observed and what is predicted, by using an index based on residuals.

Many researchers have recommended reporting several fit indices to assess model fit. The three most recommended (with suggested cutoffs) are

  1. The SRMR (< 0.09)
  2. The RMSEA ( < 0.06)
  3. The CFI (> 0.96)

Please provide the model fit measures for these three indices.

Solution 1c

fitMeasures(fa_model_fit, c("srmr", "rmsea", "cfi"))
##  srmr rmsea   cfi 
## 0.065 0.000 1.000

Question 1d

Which of these three indices is an absolute, an incremental, or a residual-based fit index? Name one property of each index that makes it better than other fit measures in its class (you can do a web search).

Solution 1d

  1. The RMSEA is an absolute measure of fit. It improves on the chi-squared test statistic. Whereas the chi-squared statistic is very sensitive to the sample size and to deviations from normality, the RMSEA is more robust in that it only assumes a model that holds approximately in the population.

  2. The CFI is an incremental measure of fit. It calculates the incremental improvement in fit of the model compared to the independence model, which assumes that the observed variables are uncorrelated. It is an improvement over the Normed Fit Index (NFI) in that it takes sample size into account.

  3. The SRMR is a standardized measure of the average covariance residuals, which are the differences between the sample covariance matrix and the covariance matrix estimated under the assumption that the model is correct. The SRMR improves upon the RMR, which is an unstandardized measure that depends on the scale of the variables and is therefore difficult to interpret.
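To make these definitions concrete, the CFI and RMSEA can be recomputed by hand from the chi-square statistics in the model summary below, using their standard formulas; the numbers are copied from the output of this fit.

```r
# Hand computation of CFI and RMSEA from the reported chi-square values.
chi2_m <- 31.386;  df_m <- 35   # user model test statistic and df
chi2_b <- 228.189; df_b <- 45   # baseline/independence model
n      <- 50                    # sample size

# CFI: proportional reduction in misfit relative to the baseline model.
cfi <- 1 - max(chi2_m - df_m, 0) / max(chi2_b - df_b, chi2_m - df_m, 0)

# RMSEA: per-degree-of-freedom misfit, truncated at 0.
rmsea <- sqrt(max((chi2_m / df_m - 1) / (n - 1), 0))

c(cfi = cfi, rmsea = rmsea)  # 1 and 0: chi2_m is below its df, so both
                             # misfit terms are truncated to zero
```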

summary(fa_model_fit , fit.measures = TRUE)
## lavaan (0.5-22) converged normally after  88 iterations
## 
##   Number of observations                            50
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic               31.386
##   Degrees of freedom                                35
##   P-value (Chi-square)                           0.643
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              228.189
##   Degrees of freedom                                45
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.025
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1967.851
##   Loglikelihood unrestricted model (H1)      -1952.158
## 
##   Number of free parameters                         20
##   Akaike (AIC)                                3975.702
##   Bayesian (BIC)                              4013.943
##   Sample-size adjusted Bayesian (BIC)         3951.166
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent Confidence Interval          0.000  0.087
##   P-value RMSEA <= 0.05                          0.799
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.065
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   f1 =~                                               
##     test1            11.856    1.910    6.206    0.000
##     test2             4.441    1.318    3.371    0.001
##     test3            12.621    2.213    5.702    0.000
##     test4            19.749    2.457    8.038    0.000
##     test5            10.476    2.051    5.107    0.000
##     test6             0.804    1.098    0.733    0.464
##     test7            12.120    1.998    6.065    0.000
##     test8             8.047    2.420    3.325    0.001
##     test9            13.991    2.251    6.216    0.000
##     test10           10.040    2.365    4.245    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .test1           100.954   23.111    4.368    0.000
##    .test2            70.451   14.449    4.876    0.000
##    .test3           149.604   33.095    4.520    0.000
##    .test4            89.211   29.451    3.029    0.002
##    .test5           141.380   30.388    4.652    0.000
##    .test6            55.072   11.026    4.995    0.000
##    .test7           113.791   25.770    4.416    0.000
##    .test8           238.491   48.874    4.880    0.000
##    .test9           139.852   32.040    4.365    0.000
##    .test10          209.438   43.762    4.786    0.000
##     f1                1.000

Question 1e

What is the null hypothesis? Given the values of the model fit indices, do you reject the null? Is a single factor model a good fit for the data?

Solution 1e

The likelihood ratio test of our model has the null hypothesis \[ \begin{aligned} H_0:\boldsymbol \Sigma= \boldsymbol\Lambda\boldsymbol \Lambda^{T} + \boldsymbol \Theta &=\begin{bmatrix} \lambda_1\\ \vdots\\ \lambda_{10} \end{bmatrix} \begin{bmatrix} \lambda_1 \cdots\ \lambda_{10} \end{bmatrix} + \begin{bmatrix} \sigma^2_1 & 0 & \ldots & 0 \\ 0 & \sigma^2_2 & \ldots & 0 \\ \vdots &\vdots & \ddots & \vdots \\ 0 & \ldots & 0 &\sigma^2_{10} \end{bmatrix} \\ &= \begin{bmatrix} \lambda_1^2+\sigma^2_1 & \lambda_1\lambda_2 & \ldots & \lambda_1\lambda_{10} \\ \lambda_2\lambda_1 & \lambda^2_2+\sigma^2_{2} & \ldots & \lambda_2\lambda_{10} \\ \vdots &\vdots & \ddots & \vdots \\ \lambda_{10}\lambda_{1} & \ldots & \lambda_{10}\lambda_{9} & \lambda^2_{10}+ \sigma^2_{10} \end{bmatrix} \end{aligned} \] namely that the population covariance matrix has the matrix decomposition above. The alternative hypothesis is \[ H_1:\boldsymbol \Sigma= \begin{bmatrix} \sigma_{11} & \sigma_{12} & \ldots & \sigma_{1,10} \\ \sigma_{21} & \sigma_{22} & \ldots & \sigma_{2,10} \\ \vdots &\vdots & \ddots & \vdots \\ \sigma_{10,1} & \ldots & \sigma_{10,9} &\sigma_{10,10} \end{bmatrix} \] namely that the covariance matrix is unconstrained.
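As a quick numeric illustration of the H0 structure (with made-up loadings and unique variances for three indicators, not the estimates from our fit):

```r
# Illustration: Lambda %*% t(Lambda) + Theta produces the constrained
# covariance structure of H0. All numbers are hypothetical, for 3 indicators.
lambda <- c(0.8, 0.5, 0.7)            # loadings
theta  <- diag(c(0.36, 0.75, 0.51))   # unique (error) variances
Sigma  <- lambda %*% t(lambda) + theta

Sigma[1, 2]  # off-diagonal: lambda_1 * lambda_2 = 0.8 * 0.5 = 0.4
Sigma[1, 1]  # diagonal: lambda_1^2 + sigma_1^2 = 0.64 + 0.36 = 1
```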

The fit indices show an almost perfect fit to the data. Our conclusion: we fail to reject the null hypothesis. The sample covariance matrix of our data is consistent with the decomposition implied by the null hypothesis.

For incremental fit indices, you have a nested sequence of hypotheses. At one extreme, you have the null/independence model, which states that the observed variables are uncorrelated. In this case, for the estimated covariance matrix \(\hat{\Sigma}\) and sample covariance matrix \(S\) we have

\[ \hat{\Sigma}= \begin{bmatrix} s_{11} & 0 & \ldots & 0 \\ 0 & s_{22} & \ldots & 0 \\ \vdots &\vdots & \ddots & \vdots \\ 0 & \ldots & 0 &s_{10,10} \end{bmatrix} \]

At the other extreme, we have the saturated model

\[ \hat{\Sigma}= S=\begin{bmatrix} s_{11} & s_{12} & \ldots & s_{1,10} \\ s_{21} & s_{22} & \ldots & s_{2,10} \\ \vdots &\vdots & \ddots & \vdots \\ s_{10,1} & \ldots & s_{10,9} &s_{10,10} \end{bmatrix} \] An incremental fit statistic close to 0 implies that the proposed model is no improvement over the independence model. A value close to 1 implies that the proposed model is a near-perfect fit for the data \(S\).

Residual fit indices are descriptive statistics that tend to be useful when the model does not fit well and you would like to identify the source of the deviation.
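As a rough check on the residual-based index, the SRMR can be approximated by hand from the rounded correlation residuals printed in Solution 1b (lower triangle; the diagonal residuals are all 0 here, giving p(p+1)/2 = 55 terms in total). This is an approximation from rounded values, not lavaan's exact computation.

```r
# Hand approximation of the SRMR from the rounded correlation residuals
# printed earlier (lower triangle, row by row; diagonal residuals are 0).
resid_vals <- c(
   0.11,
  -0.01,  0.03,
   0.03, -0.06, -0.02,
   0.03,  0.07,  0.03, -0.03,
   0.12,  0.12,  0.04,  0.00,  0.07,
  -0.04, -0.10, -0.02,  0.01,  0.08, -0.13,
  -0.10,  0.01, -0.08,  0.04, -0.15, -0.16,  0.05,
  -0.04, -0.02,  0.04, -0.01,  0.02, -0.05,  0.02,  0.11,
  -0.06,  0.19,  0.06,  0.02, -0.07,  0.01, -0.03,  0.02, -0.03
)
p <- 10
# Root mean square over all p(p+1)/2 = 55 unique residuals (45 off-diagonal
# values above plus 10 zero diagonal residuals).
srmr_hand <- sqrt(sum(resid_vals^2) / (p * (p + 1) / 2))
round(srmr_hand, 3)  # close to the 0.065 that fitMeasures() reports
```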

Question 1f

Model fit provides little information about how the data was actually generated. Which model, Spearman’s general factor approach or Thomson’s model, is more plausible in real world applications in psychology, educational testing, marketing, or other fields that rely on surveys and self-report data?

Solution 1f

The perfect fit we obtained from this wildly incorrect model should give you pause. Consider any single item on a self-report measure, whether it is in a depression test battery or in a survey questionnaire. How many unobserved causes (e.g. mental processes) do you think can give rise to the actual response by a subject? In the physical sciences, we attempt to uncover the causal mechanism that gives rise to the data by measuring additional observable variables and incorporating them into a model that explains the phenomenon of interest. In the social and psychological sciences, these causes are unobservable, and arguably the only way to really measure them is to do so indirectly via brain activity or other physiological proxies. Thomson’s model assumes that you have more latent causes than you have variables. It is a more plausible sampling model of how the world is, but it is not a model you can fit your data to: you simply cannot measure 550 unobservable variables using 10 observable variables. Spearman’s model is a model of how you think the world is. It can be useful, since it can provide evidence for a pattern of correlations in your data. Nonetheless, interpreting correlations in a finite sample as representing a hypothetical construct that can objectively be measured is a fallacy known in philosophy, psychometrics, and statistics as the reification fallacy. Also note that correlations are measures of linear association and are only applicable when the linearity assumption holds.

Question 1g

True or False: Thomson is a good example of the proverbial prophet crying in the wilderness.

Solution 1g

In the wilderness, there is no one to hear you even if what you are preaching is demonstrably true. Thomson’s sampling model casts serious doubt upon one aspect of Spearman’s model: namely, that a latent variable can denote a hypothetical construct such as intelligence that we can objectively measure on the basis of summary correlations in a set of observed variables. That doesn’t mean you cannot measure intelligence or test ability using test items; we do that all the time without resorting to a specific decomposition of the covariance matrix. Nor does it mean that latent variable models are useless. In fact, they are proving to be one of the most powerful classes of models in statistics for explaining patterns in data. The criticism lies in how these models can be misused by some researchers in the social sciences, researchers who are just too eager to arrive at a true understanding of complex unobservable phenomena armed only with a survey questionnaire in one hand and a point-and-click SEM software package in the other (on a smartphone, maybe). Note that the assertions made here are targeted only at the misuse of methods, not at the validity of constructs such as intelligence and extroversion. These words share similar meanings among speakers of a particular language, and most would agree that people do differ on these traits.

Do the researchers who developed the Big Five theory of personality belong to the above-mentioned camp? The second part of the assignment should help you decide.