Most first, and even
second, courses in applied statistics seldom go much further
than ordinary least squares analysis of data from controlled
experiments, group comparisons, or simple prediction studies.
Collectively, these procedures make up regression analysis,
and the linear mathematical functions on which they depend
are referred to as regression models. This basic method
of data analysis is quite suitable for curve-fitting problems
in physical science, where an empirical relationship between
an observed dependent variable and a manipulated independent
variable must be estimated. It also serves well the purposes
of biological investigation in which organisms are assigned
randomly to treatment conditions and differences in the
average responses among the treatment groups are estimated.
An essential feature
of these applications is that only the dependent variable
or the observed response is assumed to be subject to measurement
error or other uncontrolled variation. That is, there is
only one random variable in the picture. The independent
variable or treatment level is assumed to be fixed by the
experimenter at known predetermined values. The only exception
to this formulation is the empirical prediction problem.
For that purpose, the investigator observes certain values
of one or more predictor variables and wishes to estimate
the mean and variance of the distribution of a criterion
variable among respondents with given values of the predictors.
Because the prediction is conditional on these known values,
they may be considered fixed quantities in the regression
model. An example is predicting the height that a child
will attain at maturity from his or her current height and
the known heights of the parents. Even though all of the
heights are measured subject to error, only the child's
height at maturity is considered a random variable.
Where ordinary regression
methods no longer suffice, and indeed give misleading results,
is in purely observational studies in which all variables
are subject to measurement error or uncontrolled variation
and the purpose of the inquiry is to estimate relationships
that account for variation among the variables in question.
This is the essential problem of data analysis in those
fields where experimentation is impossible or impractical
and mere empirical prediction is not the objective of the
study. It is typical of almost all research in fields such
as sociology, economics, ecology, and even areas of physical
science such as geology and meteorology. In these fields,
the essential problem of data analysis is the estimation
of structural relationships between quantitative observed
variables. When the mathematical model that represents these
relationships is linear we speak of a linear structural
relationship. The various aspects of formulating, fitting,
and testing such relationships we refer to as structural
equation modeling.
Although structural equation
modeling has become a prominent form of data analysis only
in the last twenty years (thanks in part to the availability
of the LISREL program), the concept was first introduced
nearly eighty years ago by the population biologist, Sewell
Wright, at the University of Chicago. He showed that linear
relationships among observed variables could be represented
in the form of so-called path diagrams and associated path
coefficients. By tracing causal and associational paths
on the diagram according to simple rules, he was able to
write down immediately the linear structural relationship
between the variables. Wright applied this technique initially
to calculate the correlation expected between observed characteristics
of related persons on the supposition of Mendelian inheritance.
Later, he applied it to more general types of relationships
among persons.
The modern form of linear
structural analysis includes an algebraic formulation of
the model in addition to the path diagram representation.
The two forms are equivalent and the implementation of the
analysis in the LISREL program permits the user to submit
the model to the computer in either representation. The
path analytic approach is excellent when the number of variables
involved in the relationship is moderate, but the diagram
becomes cumbersome when the number of variables is large.
In that case, writing the relationships symbolically is
more convenient. The SIMPLIS manual presents examples of
both representations and makes clear the correspondence
between the paths and the structural equations. Notice that
in the above mentioned fields in which experimentation is
hardly ever possible, psychology and education do not appear.
Controlled experiments with both animal and human subjects
have been a mainstay of psychological research for more
than a century, and in the 1920s experimental evaluations
of instructional methods began to appear in education. As
empirical research developed in these fields, however, a
new type of data analytic problem became apparent that was
not encountered in other fields.
In psychology, the difficulty
was, and still is, that for the most part there are no well-defined
dependent variables. The variables of interest differ widely
from one area of psychological research to another and often
go in and out of favor within areas over relatively short
periods of time. Psychology has been variously described
as the science of behavior or the science of human information
processing. But the varieties of behavior and information
handling are so multifarious that no progress in research
can be made until investigators identify the variables to
be studied and the method of observing them. Where headway
has been made in defining a coherent domain of observation,
it has been through the mediation of a construct-some putative
latent variable that is modified by stimuli from various
sources and in turn controls or influences various observable
aspects of behavior. The archetypal example of such a latent
variable is the construct of general intelligence introduced
by Charles Spearman to account for the observed positive
correlations between successful performance on many types
of problem-solving tasks.
Investigation of mathematical
and statistical methods required in validating constructs
and measuring their influence led to the development of
the data analytic procedure called factor analysis. Its
modern form is due largely to the work of Truman Kelly and
L.L.Thurstone, who transformed Spearman's one-factor analysis
into a fully general multiple-factor analysis. More recently,
Karl Jöreskog added confirmatory factor analysis to
the earlier exploratory form of analysis. The two forms
serve different purposes. Exploratory factor analysis is
an authentic discovery procedure: it enables one to see
relationships among variables that are not at all obvious
in the original data or even in the correlations among variables.
Confirmatory factor analysis enables one to test whether
relationships expected on theoretical grounds actually appear
in the data. Derrick Lawley and Karl Jöreskog provided
a statistical procedure, based on maximum likelihood estimation,
for fitting factor models to data and testing the number
of factors that can be detected and reliably estimated in
the data.
Similar problems of defining
variables appear in educational research, even in experimental
studies of alternative methods of instruction. The goals
of education are broad and the outcomes of instruction are
correspondingly many: an innovation in instructional practice
may lead to a gain in some measured outcomes and a loss
in others. The investigator can measure a great many such
outcomes, but unless all are favorable or all unfavorable
the results become too complex to discuss or provide any
guide to educational policy. Again, factor analysis is a
great assistance in identifying the main dimensions of variation
among outcomes and suggesting parsimonious constructs for
their discussion.
In the LISREL model,
the linear structural relationship and the factor structure
are combined into one comprehensive model applicable to
observational studies in many fields. The model allows multiple
latent constructs indicated by observable explanatory (or
exogenous) variables, recursive and nonrecursive relationships
between constructs, and multiple latent constructs indicated
by observable responses (or endogenous) variables.
The connections between the latent constructs compose the
structural equation model; the relationships between the
latent constructs and their observable indicators or outcomes
compose the factor models. All parts of the comprehensive
model may be represented in the path diagram and all factor
loadings and structural relationships appear as coefficients
of the path.
Nested within the general
model are simpler models that the user of the LISREL program
may choose as special cases. If some of the variables involved
in the structural relationships are observed directly, rather
than indicated, part or all of the factor model may be excluded.
Conversely, if there are no structural relationships, the
model may reduce to a confirmatory factor analysis applicable
to the data in question. Finally, if the data arise from
a simple prediction problem or controlled experiment in
which the independent variable or treatment level is measured
without error, the user may specialize to a simple regression
model and obtain the standard results of ordinary least-squares
analysis.
These specializations
may be communicated to the LISREL computer program in three
different ways. At the most intuitive, visual level, the
user may construct the path diagram interactively on the
screen, and specify paths to be included or excluded. The
corresponding verbal level is the SIMPLIS command language.
It requires only that the user name the variables and declare
the relationships among them. The third and most detailed
level is the LISREL command language. It is phrased in terms
of matrices that appear in the matrix-algebraic representation
of the model. Various parameters of the matrices may be
fixed or set equal to other parameters, and linear and non-linear
constraints may be imposed among them. The terms and syntax
of the LISREL command language are explained and illustrated
in the LISREL program manuals. Most but not all of these
functions are included in the SIMPLIS language; certain
advanced functions are only possible in native LISREL commands.
The essential statistical
assumption of LISREL analysis is that random quantities
within the model are distributed in a form belonging to
the family of elliptical distributions, the most prominent
member of which is the multivariate normal distribution.
In applications where it is reasonable to assume multivariate
normality, the maximum likelihood method of estimating unknowns
in the model is justified and usually preferred. Where the
requirements of maximum likelihood estimation are not met,
as when the data are ordinal rather than measured, the various
least squares estimation methods are available. It is important
to understand, however, except in those cases where ordinary
least squares analysis applies or the weight matrices of
other least squares methods are known, that these are large-sample
estimation procedures. This is not a serious limitation
in observation studies, where samples are typically large.
Small-sample theory applies properly only to controlled
experiments and only when the model contains a single, univariate
or multivariate normal error component.
The great merit of restricting
the analytical methods to elliptically distributed variation
is the fact that the sample mean and covariance matrix (or
correlation matrix and standard deviations) are sufficient
statistics of the analysis. This allows all the information
in the data that bear on the choice and fitting of the model
to be compressed into the relatively small number of summary
statistics. The resulting data compression is a tremendous
advantage in large-scale sample-survey studies, where the
number of observations may run to the tens of thousands,
whereas the number of sufficient statistics are of an order
of magnitude determined by the number of variables.
The operation of reducing
the raw data to their sufficient statistics (while cleaning
and verifying the validity for the data) is performed by
the PRELIS program which accompanies LISREL. PRELIS also
computes summary statistics for qualitative data in the
form of tetrachoric or polychoric correlation matrices.
When there are several sample groups, and the LISREL model
is defined and compared across the groups, PRELIS prepares
the sufficient statistics for each sample in turn.
In many social and psychological
or educational research studies where a single sample is
involved, the variables are usually measured on a scale
with an arbitrary origin. In that case, the overall means
of the variables in the sample can be excluded from the
analysis, and the fitting of the LISREL model can be regarded
simply as an analysis of the covariance structure, in which
case the expected covariance matrix implied by the model
is fitted to the observed covariance matrix directly. Since
the sample covariance matrix is a sufficient statistic under
the distribution assumption, the result is equivalent to
fitting the data. Again, the analysis is made more manageable
because one can examine the residuals from the observed
covariances, which are moderate in number, as opposed to
analyzing residuals of the original observations in a large
sample.
Many of these aspects
of the LISREL analysis are brought out in the examples in
the PRELIS and LISREL program manuals. In addition, the
SIMPLIS manual contains exercises to help the student strengthen
and expand his or her understanding of this powerful method
of data analysis. Files containing the data of these examples
are included with the program and can be analyzed in numerous
different ways to explore and test alternative models.
Today, however, LISREL
for Windows is no longer limited to SEM. The latest
LISREL for Windows includes the following statistical
applications.
- LISREL for structural
equation modeling.
- PRELIS for data manipulations and basic statistical
analyses.
- MULTILEV for hierarchical linear and non-linear
modeling.
- SURVEYGLIM for generalized linear modeling.
- MAPGLIM for generalized
linear modeling for multilevel data.
LISREL for Windows has
a set of 12 accompanying PDF user guides that can be accessed via the Help menu of
the application.
Also see the reference
page for further reading material on structural equation
modeling.