Let \({\mathbf{y}}^{\intercal}=(y_{1}, y_{2},..., y_{P})\) be an observable P-dimensional random vector and let \({\boldsymbol{\theta}}\) be a D-dimensional random vector of latent traits that probabilistically determine the observable responses. A factor analysis model with factor loading matrix \({\boldsymbol{\Lambda}}\), latent trait correlation matrix \({\boldsymbol{\Phi}}\), and diagonal covariance matrix \({\boldsymbol{\Psi}}\) for the errors \({\boldsymbol{\varepsilon}}\) can be written as
\[ \begin{align} {\mathbf{y}}={\boldsymbol{\Lambda}}{\boldsymbol{\theta}}+ {\boldsymbol{\varepsilon}} \end{align} \]
where \[ \begin{align} \begin{pmatrix} {\boldsymbol{\theta}}\\ {\boldsymbol{\varepsilon}}\end{pmatrix} \sim &Normal\Bigl(\textbf{0},\begin{bmatrix}{\boldsymbol{\Phi}}& \textbf{0} \\ \textbf{0} & {\boldsymbol{\Psi}}\end{bmatrix}\Bigr) \end{align} \]
and
\[ \begin{align} {\boldsymbol{\Lambda}}= \begin{bmatrix} \lambda_{1,1}& \dots & \lambda_{1,D}\\ \lambda_{2,1}& \dots & \lambda_{2,D}\\ \vdots & \ddots & \vdots \\ \lambda_{P,1} & \dots & \lambda_{P,D} \end{bmatrix} \quad {\boldsymbol{\Phi}}= \begin{bmatrix} 1 & \varphi_{1,2} & \dots & \varphi_{1,D} \\ \varphi_{2,1} & 1 & \dots & \varphi_{2,D}\\ \vdots & \vdots & \ddots & \vdots \\ \varphi_{D,1} & \dots & \varphi_{D,D-1} & 1 \end{bmatrix} \quad {\boldsymbol{\Psi}}= \begin{bmatrix} \psi_1 & 0 & \dots & 0 \\ 0 & \psi_2 & \dots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \psi_P \end{bmatrix} \end{align} \]
or equivalently as
\[ \begin{align} {\mathbf{y}}\sim & \ Normal_{P}(\textbf{0},{\boldsymbol{\Lambda}}{\boldsymbol{\Phi}}{\boldsymbol{\Lambda}}^{\intercal}+{\boldsymbol{\Psi}}) \end{align} \]
Note that \({\boldsymbol{\Psi}}\) is diagonal since the assumption of local item independence entails that the partial covariance of any two latent responses given the latent traits equals zero, which in turn implies that the elements of \({\boldsymbol{\varepsilon}}\) are themselves uncorrelated (Lord & Novick, 1968).
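To make the implied covariance structure concrete, here is a small numerical sketch in R with arbitrary made-up values for \({\boldsymbol{\Lambda}}\), \({\boldsymbol{\Phi}}\), and \({\boldsymbol{\Psi}}\) (not the values used in the simulation later on):

# A small sketch (arbitrary values, P = 4 items, D = 2 factors):
# the marginal covariance of y implied by the model is Lambda Phi Lambda' + Psi
Lambda <- matrix(c(0.8, 0.0,
                   0.7, 0.0,
                   0.0, 0.6,
                   0.3, 0.5), nrow = 4, byrow = TRUE)
Phi   <- matrix(c(1.0, 0.3, 0.3, 1.0), nrow = 2)
Psi   <- diag(c(0.36, 0.51, 0.64, 0.66))
Sigma <- Lambda %*% Phi %*% t(Lambda) + Psi   # the implied P x P covariance of y
round(Sigma, 3)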
For N observations sampled from a single homogeneous population, the joint posterior distribution is then
\[ \begin{align} p(\boldsymbol{\Theta},{\boldsymbol{\Lambda}},\boldsymbol{\tilde{\xi}},{\boldsymbol{\Phi}}\mid {\mathbf{Y}})\propto & \biggl[\prod_{n=1}^N\ p({\mathbf{y}}_{n}\mid\boldsymbol{\theta}_{n},{\boldsymbol{\Lambda}})\biggr] \biggl[\prod_{n=1}^N\ p\left(\boldsymbol{\theta}_{n}\mid {\boldsymbol{\Phi}}\right)\biggr] p\left({\boldsymbol{\Lambda}}\mid \boldsymbol{\tilde{\xi}}\right) p({\boldsymbol{\Phi}})\,p(\boldsymbol{\tilde{\xi}}) \end{align} \]
where \(\boldsymbol{\Theta}=\left(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2},\dots, \boldsymbol{\theta}_{n},\dots, \boldsymbol{\theta}_{N}\right)\). Moreover, by integrating out the N D-dimensional latent vectors, we obtain the marginalized posterior
\[ \begin{align*} p({\boldsymbol{\Lambda}},\boldsymbol{\tilde{\xi}},{\boldsymbol{\Phi}}\mid {\mathbf{Y}})\propto & \biggl[ \prod_{n=1}^N \int_{R_{{\boldsymbol{\theta}}}} p({\mathbf{y}}_{n}\mid\boldsymbol{\theta}_{n},{\boldsymbol{\Lambda}})\, p\left(\boldsymbol{\theta}_{n}\mid {\boldsymbol{\Phi}}\right) d\boldsymbol{\theta}_{n} \biggr] p\left({\boldsymbol{\Lambda}}\mid \boldsymbol{\tilde{\xi}}\right) p({\boldsymbol{\Phi}})\,p(\boldsymbol{\tilde{\xi}}) \end{align*} \]
Here, \({\mathbf{y}}_{n}\) is a \(P\)-dimensional vector of responses for respondent \(n\).
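Because the model is linear and Gaussian, the integral over each \(\boldsymbol{\theta}_{n}\) is available in closed form, so every observation contributes a multivariate normal density with covariance \({\boldsymbol{\Lambda}}{\boldsymbol{\Phi}}{\boldsymbol{\Lambda}}^{\intercal}+{\boldsymbol{\Psi}}\). Below is a minimal sketch of the resulting marginal log-likelihood, assuming the mvtnorm package is available and that Lambda, Phi, Psi, and the N-by-P data matrix Y are defined as in the simulation further down:

library("mvtnorm")
# marginal log-likelihood: each row of Y is N(0, Lambda Phi Lambda' + Psi)
marginal_loglik <- function(Y, Lambda, Phi, Psi) {
  Sigma <- Lambda %*% Phi %*% t(Lambda) + Psi
  sum(dmvnorm(Y, mean = rep(0, ncol(Y)), sigma = Sigma, log = TRUE))
}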
We say that a model is not identified if there are two or more distinct parameter points that generate the same joint probability distribution. For example, if we can find a transformation matrix \(\textbf{T}\) not equal to the identity matrix \(\mathbf{I}\), such that the two distinct sets of parameters \(({\boldsymbol{\Lambda}},{\boldsymbol{\Phi}})\) and \(({\boldsymbol{\Lambda}}^\bigstar,{\boldsymbol{\Phi}}^\bigstar)\) define the same probability distribution
\[ \begin{align*} {\mathbf{y}}\sim &Normal\Bigl(\textbf{0},{\boldsymbol{\Lambda}}{\boldsymbol{\Phi}}{\boldsymbol{\Lambda}}^{\intercal}+{\boldsymbol{\Psi}}\Bigr)\\ \equiv &Normal\Bigl(\textbf{0},({\boldsymbol{\Lambda}}\textbf{T}) ({\textbf{T}}^{-1}\bigl({\boldsymbol{\Phi}}\bigr){{\textbf{T}}^{\intercal^{-1}}})({\textbf{T}}^{\intercal}{\boldsymbol{\Lambda}}^{\intercal})+{\boldsymbol{\Psi}}\Bigr)\\ \equiv&Normal\Bigl(\textbf{0},{\boldsymbol{\Lambda}}^\bigstar{\boldsymbol{\Phi}}^\bigstar{\boldsymbol{\Lambda}}^{\bigstar^\intercal}+{\boldsymbol{\Psi}}\Bigr) \end{align*} \]
then it follows that the model is not identified. Conceptually, think of the latent variables \({\boldsymbol{\theta}}\) as spanning a \(D\)-dimensional space, the latent space, so that any combination of the traits that defines a particular person can be represented by a point residing in that space. However, since \({\boldsymbol{\theta}}\) is latent, it has no inherent frame of reference (and the observed \({\mathbf{y}}\) does not provide one). To establish a frame of reference, we need to specify a point of origin, a unit of measurement or scale, and a set of \(D\) basis vectors that define the directions of the coordinate axes. These indeterminacies correspond to location, scale, rotation, and reflection invariance, and to label switching if the basis vectors are not ordered.
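To make the indeterminacy concrete, the following R sketch (with arbitrary example values) constructs a second parameter set \(({\boldsymbol{\Lambda}}^\bigstar,{\boldsymbol{\Phi}}^\bigstar)\) from an invertible matrix \(\textbf{T}\) and verifies that it implies exactly the same covariance for \({\mathbf{y}}\):

# Sketch: two distinct parameter sets (Lambda, Phi) and (Lambda*, Phi*) that
# imply exactly the same covariance for y (values are arbitrary)
Lambda <- matrix(c(0.8, 0.0,
                   0.7, 0.2,
                   0.1, 0.6,
                   0.3, 0.5), nrow = 4, byrow = TRUE)
Phi   <- matrix(c(1.0, 0.3, 0.3, 1.0), nrow = 2)
Psi   <- diag(0.4, 4)
T_mat <- matrix(c(0.9, 0.2, -0.3, 1.1), nrow = 2)   # any invertible matrix, not the identity
Lambda_star <- Lambda %*% T_mat
Phi_star    <- solve(T_mat) %*% Phi %*% solve(t(T_mat))
Sigma1 <- Lambda %*% Phi %*% t(Lambda) + Psi
Sigma2 <- Lambda_star %*% Phi_star %*% t(Lambda_star) + Psi
all.equal(Sigma1, Sigma2)   # TRUE: the two parameter sets are indistinguishable from y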
To specify a metric of measurement, we need to set the location and scale of the latent variables in the model. For \({\boldsymbol{\theta}}\), the location and scale are fixed by setting the means to zero and the variances to one.
Parameters are said to be invariant under rotation when we can find an orthogonal rotation matrix \(\textbf{T}\), where \(\textbf{T}^{\intercal}\textbf{T}=\mathbf{I}\) (and, in this case, \({\boldsymbol{\Phi}}=\mathbf{I}\)), such that when we rotate \({\boldsymbol{\Lambda}}\) and \({\boldsymbol{\theta}}\), we obtain two rotated matrices \({\boldsymbol{\Lambda}}^{\ast}={\boldsymbol{\Lambda}}\textbf{T}^{\intercal}\) and \({\boldsymbol{\theta}}^{\ast}=\textbf{T}{\boldsymbol{\theta}}\) whose product recovers the original product
\[ \begin{align*} {\boldsymbol{\Lambda}}^{\ast}{\boldsymbol{\theta}}^{\ast}={\boldsymbol{\Lambda}}\textbf{T}^{\intercal}\textbf{T}{\boldsymbol{\theta}}={\boldsymbol{\Lambda}}{\boldsymbol{\theta}}\end{align*} \]
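A quick numerical check of this rotation invariance (the orthogonal matrix here is generated from a QR decomposition purely for illustration, and the loading values are arbitrary):

# Sketch: an orthogonal T leaves the product Lambda %*% theta unchanged
D <- 2
Lambda <- matrix(c(0.8, 0.0,
                   0.7, 0.2,
                   0.1, 0.6,
                   0.3, 0.5), nrow = 4, byrow = TRUE)
theta  <- c(1.2, -0.4)
T_mat  <- qr.Q(qr(matrix(rnorm(D * D), D, D)))   # a random D x D orthogonal matrix
Lambda_star <- Lambda %*% t(T_mat)               # rotated loadings
theta_star  <- T_mat %*% theta                   # rotated factor scores
all.equal(Lambda %*% theta, Lambda_star %*% theta_star)   # TRUE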
In factor analysis, there are two approaches to dealing with rotational invariance. In Exploratory Factor Analysis (Thurstone, 1947), we set \({\boldsymbol{\Phi}}=\textbf{I}\) to make \({\boldsymbol{\Lambda}}{\boldsymbol{\Lambda}}^{\intercal}\) identifiable and choose \({\boldsymbol{\Lambda}}\) such that it meets certain criteria for interpretability. For example, one criterion, the VARIMAX criterion (Kaiser, 1958), finds a simple interpretable solution by rotating the factors so that each factor has a large number of loadings with values near zero and a small number of loadings with large values. Note that the resulting rotated matrix is not unique. For a \(D\)-dimensional model, there are still \(2^D \times D!\) matrices in its equivalence class. We should add that label switching (i.e., column permutations) and sign reversals are not considered a problem in the frequentist framework because inference can be restricted to one likelihood mode with the proper choice of initial values. Moreover, after applying a rotation criterion, the common practice is for the researcher to flip the signs and reorder the \(D\) columns of the orthogonally rotated loading matrix to obtain the most interpretable solution among the \(2^D \times D!\) possible matrices that satisfy the rotation criterion used.
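For illustration, the VARIMAX rotation is available in base R via stats::varimax; the sketch below applies it to a small made-up loading matrix (the rotated solution remains unique only up to the \(2^D \times D!\) column permutations and sign flips mentioned above):

# Sketch: rotating a small made-up loading matrix with the VARIMAX criterion
L_unrotated <- matrix(c(0.70,  0.40,
                        0.65,  0.45,
                        0.55, -0.50,
                        0.60, -0.55), nrow = 4, byrow = TRUE)
vm <- varimax(L_unrotated)   # from the stats package
vm$loadings                  # the VARIMAX-rotated loading matrix
vm$rotmat                    # the orthogonal rotation matrix T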
In the second approach, Confirmatory Factor Analysis (Joreskog, 1973), the off-diagonal elements of \({\boldsymbol{\Phi}}\) are left unconstrained, and the rotational invariance of \({\boldsymbol{\Lambda}}\) and \({\boldsymbol{\Phi}}\) is resolved by imposing theoretically informed constraints on \({\boldsymbol{\Lambda}}\). If no constraints are imposed on \({\boldsymbol{\Lambda}}\), then we can find a transformation matrix \(\textbf{T}\) satisfying the constraint
\[\operatorname{Diag}({\boldsymbol{\Phi}})=\operatorname{Diag}({\textbf{T}}^{-1}{\boldsymbol{\Phi}}\,{\textbf{T}}^{\intercal^{-1}})=\mathbf{I}\]
such that
\[ \begin{align*} {\boldsymbol{\Lambda}}{\boldsymbol{\Phi}}{\boldsymbol{\Lambda}}^{\intercal}=({\boldsymbol{\Lambda}}\textbf{T}) ({\textbf{T}}^{-1}{\boldsymbol{\Phi}}{\textbf{T}}^{\intercal^{-1}})({\textbf{T}}^{\intercal}{\boldsymbol{\Lambda}}^{\intercal}) \end{align*} \]
Note that although the transformation matrix is referred to in the literature as an oblique rotation matrix (Browne, 2001), it is not necessarily orthogonal and therefore not exactly a rotation matrix.
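The following sketch (again with arbitrary example values) illustrates such an oblique transformation: it starts from an arbitrary invertible matrix, rescales the transformed \({\boldsymbol{\Phi}}\) back to a correlation matrix, and confirms that the implied covariance of \({\mathbf{y}}\) is unchanged even though the resulting \(\textbf{T}\) is not orthogonal:

# Sketch: an "oblique rotation" T that keeps Phi in a correlation metric.
# Start from an arbitrary invertible matrix and rescale so that the
# transformed Phi has a unit diagonal; the implied covariance is unchanged.
Lambda <- matrix(c(0.8, 0.0,
                   0.7, 0.2,
                   0.1, 0.6,
                   0.3, 0.5), nrow = 4, byrow = TRUE)
Phi   <- matrix(c(1.0, 0.3, 0.3, 1.0), nrow = 2)
T_raw <- matrix(c(0.9, 0.2, -0.3, 1.1), nrow = 2)       # any invertible matrix
Phi_raw <- solve(T_raw) %*% Phi %*% solve(t(T_raw))
S       <- diag(1 / sqrt(diag(Phi_raw)))                # rescaling to unit diagonal
T_mat   <- T_raw %*% solve(S)                           # the oblique transformation matrix
Lambda_star <- Lambda %*% T_mat
Phi_star    <- solve(T_mat) %*% Phi %*% solve(t(T_mat)) # a correlation matrix
crossprod(T_mat)                                        # not the identity: T is not orthogonal
all.equal(Lambda %*% Phi %*% t(Lambda),
          Lambda_star %*% Phi_star %*% t(Lambda_star))  # TRUE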
Anderson and Rubin (1956) suggested that model identification can be achieved by imposing at least \(D^2\) restrictions on \(({\boldsymbol{\Lambda}},{\boldsymbol{\Phi}})\). Howe (1955) stated three conditions sufficient for the local uniqueness of \({\boldsymbol{\Lambda}}\), and a fourth condition has more recently been added to ensure global rotational uniqueness. To ensure uniqueness according to the four conditions, we need to:

1. Fix the diagonal elements of \({\boldsymbol{\Phi}}\) to 1, so that the factors are on a correlation metric.
2. Fix at least \(D-1\) loadings to zero in each column of \({\boldsymbol{\Lambda}}\).
3. Check that, for each column \(d\), the submatrix formed from the rows of \({\boldsymbol{\Lambda}}\) that contain the fixed zeros in that column has rank \(D-1\).
4. Restrict one nonzero loading in each column of \({\boldsymbol{\Lambda}}\) to be positive, thereby fixing the sign of that column.
Note that the first condition imposes \(D\) restrictions and the second condition imposes another \(D(D-1)\) constraints, for a total of \(D^2\) constraints. Jointly, the first three conditions reduce the number of possible modes from infinitely many to just the \(2^D\) modes produced by sign reversals. The fourth condition further reduces that number to 1, so that the resulting parameter space is unimodal, provided the remaining parameters in the model are identified. If we are not sure whether our specified constraints make the parameters locally identifiable, we can check for local identification algebraically by using the Wald Rank Rule, which evaluates the rank of the Jacobian (Bekker, 1989), or, if we are conducting maximum likelihood based analysis, we can assess local identification empirically by checking whether the information matrix produced by the model fit is of full rank.
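As an illustration of the Wald rank rule, the sketch below numerically checks local identification for a hypothetical small model (P = 5 items, D = 2 orthogonal factors, with \(\lambda_{1,2}\) fixed to zero). It assumes the numDeriv package is available and computes the Jacobian of the map from the free parameters to the unique elements of the implied covariance matrix; full column rank at a generic point suggests local identification:

library("numDeriv")
P <- 5; D <- 2
# map the free parameters to the lower-triangular elements of the implied covariance
implied_vech <- function(par) {
  Lambda <- matrix(0, P, D)
  Lambda[1, 1] <- par[1]                                  # lambda[1, 2] stays fixed at 0
  Lambda[2:P, ] <- matrix(par[2:(1 + (P - 1) * D)], P - 1, D)
  Psi <- diag(exp(par[(2 + (P - 1) * D):length(par)]))    # error variances on the log scale
  Sigma <- Lambda %*% t(Lambda) + Psi                     # orthogonal factors: Phi = I
  Sigma[lower.tri(Sigma, diag = TRUE)]
}
par0 <- c(rnorm(1 + (P - 1) * D), log(runif(P, 0.3, 1)))  # an arbitrary evaluation point
J <- jacobian(implied_vech, par0)
qr(J)$rank == length(par0)   # TRUE suggests the model is locally identified at par0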
Although some parameter constraints can guarantee a unimodal likelihood, they can have the unintended consequence of distorting the shape of the likelihood. In particular, the constraints can induce local modes in the likelihood and make estimation problematic. For example, Millsap (2001) has shown that some common constraint choices can produce a discrimination or loading matrix that lies outside the true equivalence class. Moreover, in maximum likelihood estimation, fixing the \(D(D-1)/2\) elements to nonzero constants renders the large-sample normal approximation inappropriate (Loken, 2005). Another consequence is that, since the constraints are placed on particular elements of the parameter matrix, both estimation and inference depend on the ordering of the variables in the data matrix. Sometimes, however, restricted confirmatory model estimation is not so problematic. For example, if our data are very informative, we restrict the factors to be orthogonal, and we constrain \(D(D-1)/2\) elements of \({\boldsymbol{\Lambda}}\) to zero while observing the rank condition (Algina, 1980), then the modes will most likely be well separated, allowing us to obtain valid estimates of the standard errors of the loadings based on a unimodal normal approximation (Dolan & Molenaar, 1991).
For more information about model identification, you can check my earlier post here.

Using R, we generate data for a 3-dimensional latent factor model with a 10-dimensional observation vector. We first set the random-number generator seed in order to make the results reproducible. We also load the MASS package to sample observations from a multivariate normal distribution. We generate 300 observation vectors.
library("MASS")
set.seed(42)
D <-3
P <- 10
N <-300
mu_theta <-rep(0,D) # the mean of eta
mu_epsilon<-rep(0,P) # the mean of epsilon
Phi<-diag(rep(1,D))
Psi <- diag(c(0.2079, 0.19, 0.1525, 0.20, 0.36, 0.1875, 0.1875, 1.00, 0.27, 0.27))
l1 <- c(0.99, 0.00, 0.25, 0.00, 0.80, 0.00, 0.50, 0.00, 0.00, 0.00)
l2 <- c(0.00, 0.90, 0.25, 0.40, 0.00, 0.50, 0.00, 0.00, -0.30, -0.30)
l3<- c(0.00, 0.00, 0.85, 0.80, 0.00, 0.75, 0.75, 0.00, 0.80, 0.80)
L <-cbind(l1,l2,l3) # the loading matrix
Theta <-mvrnorm(N, mu_theta, Phi) # sample factor scores
Epsilon <-mvrnorm(N, mu_epsilon, Psi) # sample error vector
Y<-Theta%*%t(L)+Epsilon# generate observable data
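As a quick sanity check on the simulated data, we can compare the sample covariance of Y with the model-implied covariance \({\boldsymbol{\Lambda}}{\boldsymbol{\Phi}}{\boldsymbol{\Lambda}}^{\intercal}+{\boldsymbol{\Psi}}\):

# sanity check: the sample covariance of Y should be close to the implied covariance
Sigma_implied <- L %*% Phi %*% t(L) + Psi
round(Sigma_implied - cov(Y), 2)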
Let’s visualise the data
library("rbokeh")
df<-data.frame(Y)
tools <- c("pan", "wheel_zoom", "resize", "reset")
nms <- expand.grid(names(df)[1:P], rev(names(df)[1:P]), stringsAsFactors = FALSE)
nms$yaxis <- rep(c(TRUE, rep(FALSE, P-1)), P)
nms$xaxis <- c(rep(FALSE, (P-1)*P), rep(TRUE, P))
nms$h <- nms$w <- 75
nms$h[nms$xaxis] <- nms$w[nms$yaxis] <- 90
splom_list <- vector("list", P^2)
for(i in seq_along(splom_list)) {
splom_list[[i]] <- figure(width = nms$w[i], height = nms$h[i],
tools = tools, min_border = 2) %>%
ly_points(nms$Var1[i], nms$Var2[i], data = df,
size = 1) %>%
x_axis(visible = nms$xaxis[i], axis_label_text_font_size = "5pt") %>%
y_axis(visible = nms$yaxis[i], axis_label_text_font_size = "5pt")
}
grid_plot(splom_list, nrow = P, ncol = P, same_axes = TRUE, link_data = TRUE)