Social scientists use a wide range of statistical methods. To keep this task view manageable, I have suppressed detail in some areas that are well covered by related task views (e.g., the
Spatial
task view for spatial statistics), and have pointed to those task views instead.
Most statistical data analysis in the social sciences is covered by the facilities in the base and recommended packages, which are part of the standard R distribution. In the package descriptions below, I identify base and recommended packages on first mention; packages that are not specifically identified as "R-base" or "recommended" are contributed packages.
One area of central interest to social scientists that I do not cover here is statistical graphics, even though this is one of the great strengths of R: Basic R graphics, trellis graphics (in the recommended
lattice
package), dynamic 3D graphs (via the
rgl
package), and the many packages that include facilities for various statistical graphs are just too extensive to detail here. Fortunately, a Graphics task view is currently in preparation.
If I have omitted something of importance, or if a new package or function should be mentioned here,
please let me know.
Linear and Generalized Linear Models:
Univariate and multivariate linear models are fit by the
lm
function, generalized linear models by the
glm
function, both in the R-base stats package. Beyond
summary
and
plot
methods for
lm
and
glm
objects, there is a wide array of functions that support these objects:
-
The generic
anova
function in the stats package constructs sequential analysis of variance and analysis of deviance tables, and can compute
F
and likelihood-ratio tests for nested models. (It is typical for other classes of statistical models in R to have
anova
methods as well.) The generic
Anova
function in the
car
package (associated with Fox,
An R and S-PLUS Companion to Applied Regression,
Sage, 2002) constructs so-called "Type-II" and "Type-III" tests for linear and generalized linear models.
-
F
and Wald tests for a variety of hypotheses are available from the
coeftest
and
waldtest
functions in the
lmtest
package, and the
linear.hypothesis
function in the
car
package. All of these functions permit the use of heteroscedasticity-consistent and heteroscedasticity/autocorrelation-consistent covariance matrices, as computed, e.g., by functions in the
sandwich
and
car
packages. Also see the
glh.test
function in the
gmodels
package. Nonlinear functions of parameters can be tested via the
delta.method
function in the
alr3
package (associated with Weisberg,
Applied Linear Regression, 3rd Ed.,
Wiley, 2005). The
multcomp
package includes functions for multiple comparisons. The
vuong
function in the
pscl
package tests non-nested hypotheses for generalized linear and some other models. Also see the
rms
package for tests on linear and generalized linear models.
-
A basic R installation has excellent facilities for linear and generalized linear model
"diagnostics," including, for example, hat-values and deletion statistics such as studentized
residuals and Cook's distances (
hatvalues,
rstudent, and
cooks.distance, all in the stats package). These are augmented by other packages: several functions in the
car
package, which emphasizes graphical methods, e.g.,
cr.plots
for component-plus-residual plots and
av.plots
for added-variable plots, in addition to numerical diagnostics, such as
vif
for (generalized) variance-inflation factors; the
dr
package for dimension reduction in regression, including SIR, SAVE, and pHd; and the
lmtest
package, which implements a wide variety of tests (e.g., for heteroscedasticity, nonlinearity, and autocorrelation). Further diagnostic methods, e.g., inverse-response plots, may be found in the
alr3
package. The
forward
package implements diagnostics based on a "forward search" (Atkinson and Riani,
Robust Diagnostic Regression Analysis,
Springer, 2000). Other collinearity diagnostics are in the
perturb
package. Diagnostics may also be found in the
rms
package.
-
Several packages contain functions that are useful for interpreting linear and generalized linear models that have been fit to data: The
qvcalc
package computes "quasi variances" for factors in linear and generalized linear models (and more generally). The
effects
package constructs effect displays, including, e.g., "adjusted means," for linear and generalized linear models. The
Zelig
package (see under
"Collections"
) creates displays for many kinds of statistical models.
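The facilities described above that ship with a standard R installation can be sketched briefly. This is a minimal illustration, assuming the built-in mtcars data; the choice of variables is arbitrary and made only for demonstration.

```r
# Sketch only: built-in mtcars data, illustrative variable choices.
# Everything used here is in the stats package.

# Two nested linear models.
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

seq_tab <- anova(m2)       # sequential analysis-of-variance table
f_test  <- anova(m1, m2)   # F test comparing the nested models

# Two nested logistic regressions, compared by a likelihood-ratio test.
g0 <- glm(am ~ 1,  family = binomial, data = mtcars)
g1 <- glm(am ~ wt, family = binomial, data = mtcars)
lr_test <- anova(g0, g1, test = "Chisq")

# Hat-values, studentized residuals, and Cook's distances for m2.
diagnostics <- data.frame(
  hat      = hatvalues(m2),
  rstudent = rstudent(m2),
  cook     = cooks.distance(m2)
)

print(f_test)
print(lr_test)
# The three cases with the largest Cook's distances.
print(head(diagnostics[order(-diagnostics$cook), ], 3))
```

The hat-values of a linear model sum to the number of estimated coefficients (here three), which is a quick sanity check on the diagnostics table.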
Analysis of Categorical and Count Data:
Binomial logit and probit models, as well as Poisson-regression and loglinear models for contingency
tables (including models for "over-dispersed" binomial and Poisson data), can be fit with the
glm
function in the stats package. For over-dispersed data, see also the
aod
package and the
glm.nb
function in the recommended
MASS
package (associated with
Venables and Ripley,
Modern Applied Statistics with S, Fourth Ed.
, Springer, 2002), which fits
negative-binomial GLMs. The multinomial logit model is fit by the
multinom
function in the
recommended
nnet
package, and ordered logit and probit models by the
polr
function in the MASS package. Also see the
MNP
package for the multinomial probit model, and
multinomRob
for the analysis of over-dispersed multinomial data.
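As a rough illustration of the model-fitting functions just named, the following sketch uses only recommended packages and the quine and housing data sets that ship with MASS. The model formulas are assumptions chosen for demonstration, not recommended specifications.

```r
library(MASS)  # recommended package: glm.nb(), polr(), quine and housing data
library(nnet)  # recommended package: multinom()

# Over-dispersed counts: a quasi-Poisson GLM and a negative-binomial GLM
# for days absent from school (illustrative main-effects formula).
qp_fit <- glm(Days ~ Eth + Sex + Age + Lrn, family = quasipoisson, data = quine)
nb_fit <- glm.nb(Days ~ Eth + Sex + Age + Lrn, data = quine)
print(summary(qp_fit)$dispersion)  # well above 1 suggests over-dispersion
print(nb_fit$theta)                # estimated negative-binomial theta

# Ordered logit for a three-level satisfaction rating, using cell
# frequencies as weights; method = "probit" would give an ordered probit.
ord_fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq,
                data = housing, Hess = TRUE)

# Multinomial logit for the same response, ignoring its ordering.
mn_fit <- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                   data = housing, trace = FALSE)

print(ord_fit$zeta)   # two cutpoints for three ordered categories
print(coef(mn_fit))   # one row of coefficients per non-baseline category
```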
There are other noteworthy facilities for analyzing categorical and count data:
-
The
table
function in the R-base base package and the
xtabs
and
ftable
functions in the stats package construct contingency tables.
-
The
chisq.test
and
fisher.test
functions in the stats package may be used to test for independence in two-way contingency tables.
-
The
loglm
function in the MASS package and the
loglin
function in the stats package fit hierarchical
loglinear models to contingency tables by iterative proportional fitting;
the former is a formula-based front end to the latter.
-
Also see the
brglm
package for bias-reduction in binomial-response GLMs (useful, e.g., in cases of complete separation);
the
exactLoglinTest
package for exact tests of loglinear models; the
clogit
function in the
survival
package for conditional logistic regression; and the
vcd
package for graphical displays of categorical data.
-
The
gnm
package estimates generalized
nonlinear
models, and can be used, e.g., to fit certain specialized models to mobility tables.
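The table-building and testing functions listed above can be tried on data sets distributed with R. The sketch below assumes the built-in HairEyeColor and Titanic tables; the particular loglinear model (all two-way associations) is an arbitrary choice for illustration.

```r
# Collapse the three-way HairEyeColor table to Hair x Eye and test independence.
hair_eye <- margin.table(HairEyeColor, c(1, 2))
indep <- chisq.test(hair_eye)
print(indep)

# Build a table from a data frame of counts with xtabs(), then flatten it.
he_df <- as.data.frame(HairEyeColor)
tab <- xtabs(Freq ~ Hair + Eye + Sex, data = he_df)
print(ftable(tab, row.vars = c("Sex", "Hair")))

# Fisher's exact test on a 2 x 2 table (Sex by Survived from Titanic).
tab22 <- xtabs(Freq ~ Sex + Survived, data = as.data.frame(Titanic))
print(fisher.test(tab22)$p.value)

# Hierarchical loglinear model with all two-way associations,
# specified by margins with loglin() and by formula with MASS::loglm().
fit_ipf <- loglin(tab, margin = list(c(1, 2), c(1, 3), c(2, 3)),
                  fit = TRUE, print = FALSE)
fit_loglm <- MASS::loglm(~ (Hair + Eye + Sex)^2, data = tab)

print(c(lrt = fit_ipf$lrt, df = fit_ipf$df))
print(fit_loglm)
```

Both fits describe the same model, so their likelihood-ratio statistics and degrees of freedom should agree.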
Other Regression Models:
It is possible to fit a very wide variety of regression models with the facilities provided by the base and recommended packages, and an even wider variety of models with contributed packages:
-
Nonlinear regression:
The
nls
function in the stats package fits nonlinear models by least squares.
-
Generalized least-squares regression and time-series regression:
The
gls
function in the
recommended
nlme
package fits models by generalized least squares. The
lm
function can also fit weighted least-squares regressions. Also see the
dynlm
package, which allows
lm
to handle time-series data structures, and the
dyn
package, which extends this
capability to
glm
and other regression functions that are sufficiently similar to
lm
in their internal structure.
-
Mixed-effects models:
The recommended
nlme
package, associated with Pinheiro and Bates,
Mixed-Effects Models in S and S-PLUS
(Springer, 2000), fits linear and nonlinear mixed-effects models, commonly used in the social sciences for hierarchical and longitudinal data. Generalized linear mixed-effects models may be fit by the
glmmPQL
function in the MASS package, and by the
lmer
function in the
Matrix
package (related to the
lme4
package, which largely supersedes
nlme
for
linear
mixed models). Also see the
lmeSplines
and
lmm
packages.
-
Generalized estimating equations:
The
gee
and
geepack
packages fit marginal models by generalized estimating equations.
-
Nonparametric regression analysis:
This is one of the conspicuous strengths of R. A standard
R installation includes several functions for smoothing scatterplots, including
loess.smooth
and
smooth.spline, both in the stats package. The
loess
function in the stats package fits simple and multiple-regression models by local polynomial regression. Generalized additive models are covered by several packages, including the recommended
mgcv
package, and the
gam
package, the latter associated with Hastie and Tibshirani,
Generalized Additive Models
(Chapman and Hall, 1990). Some other noteworthy contributed packages in this area are
gss, which fits spline regressions,
locfit, for local-polynomial regression (and also density estimation) (Loader,
Local Regression and Likelihood,
Springer, 1999),
sm, for a variety of smoothing techniques, including for regression (Bowman and Azzalini,
Applied Smoothing Techniques for Data Analysis,
Oxford, 1997), and
acepack
for ACE (alternating conditional expectations) and AVAS (additivity and variance stabilization) nonparametric transformations of the response and explanatory variables in regression.
-
Robust regression:
The
rlm
function fits linear models by M-estimation and
lqs
computes bounded-influence estimators; both are in the MASS package. (The
cov.rob
function in the same package computes a robust covariance-matrix estimator.)
Also see the
quantreg
package, which computes linear, nonlinear, and nonparametric
quantile regressions;
lmrob
in
robustbase
and
lmRob
in
robust
for MM estimation.
-
Structural-equation models:
The
sem
package fits general (i.e., latent-variable) SEMs by FIML, and structural equations in observed-variable models by 2SLS. Categorical variables in SEMs can be accommodated via the
polycor
package. The
systemfit
package implements a wider variety of estimators for observed-variable models, including nonlinear simultaneous-equations models. See also the
pls
package, for partial least-squares estimation, and the
gR
task view for graphical models.
-
Selection bias and censored regression:
Censored regression models, such as the tobit model, can be fit by the
survreg
function in the recommended
survival
package. The
rq
function in the
quantreg
package can estimate censored quantile-regression models. The
hurdle
and
zeroinfl
functions in the
pscl
package fit hurdle and zero-inflated Poisson and negative-binomial models to count data. The
heckit
function in the
micEcon
package implements two-step Heckman estimators to correct for sample-selection bias. Also see under
Survival Analysis
below.
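Several of the regression functions surveyed in this section come from the base and recommended packages and can be sketched together. The example below assumes data sets shipped with R (Puromycin, stackloss, cars) and with nlme (Ovary); the tobit data are simulated, so the variable names and true coefficients there are purely hypothetical.

```r
library(MASS)      # rlm()
library(nlme)      # gls(), corAR1(), Ovary data
library(survival)  # survreg(), Surv()

# Nonlinear least squares: Michaelis-Menten curve for the treated cases.
treated <- subset(Puromycin, state == "treated")
nls_fit <- nls(rate ~ Vm * conc / (K + conc), data = treated,
               start = list(Vm = 200, K = 0.1))

# Generalized least squares with AR(1) errors within each mare.
gls_fit <- gls(follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time),
               data = Ovary, correlation = corAR1(form = ~ 1 | Mare))

# Robust regression by M-estimation.
rlm_fit <- rlm(stack.loss ~ ., data = stackloss)

# Local polynomial regression.
lo_fit <- loess(dist ~ speed, data = cars)

# Tobit model: simulated response left-censored at zero
# (hypothetical intercept 1 and slope 2).
set.seed(1)
x <- rnorm(200)
y <- pmax(1 + 2 * x + rnorm(200), 0)
# In Surv(), status TRUE marks an uncensored (fully observed) response.
tobit_fit <- survreg(Surv(y, y > 0, type = "left") ~ x, dist = "gaussian")

print(coef(nls_fit))
print(coef(gls_fit))
print(coef(rlm_fit))
print(coef(tobit_fit))
```

Fitting the tobit model through survreg works because a response censored from below at zero is a left-censored Gaussian outcome.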
Other Statistical Methods:
Here is a brief survey of implementations in R of other statistical methods commonly used by social scientists:
-
Survival (Event-History) Analysis:
There is an extensive implementation of methods of survival analysis in the recommended
survival
package, which is associated with Therneau and Grambsch,
Modeling Survival Data
(Springer, 2000). Also see the
eha,
survrec,
frailtypack, and
rms
packages.
-
"Dimensional" Analysis:
Exploratory maximum-likelihood factor analysis is implemented in the
factanal
function in the stats package, which also provides for varimax and promax factor rotation. (Confirmatory factor-analysis models can be fit with the
sem
package.) Additional rotations are available through functions in the
GPArotation
package. The
prcomp
and
princomp
functions in the stats package perform principal-components analysis. The
cmdscale
function in the stats package performs
metric
multidimensional scaling, while the
isoMDS
and
sammon
functions in the MASS package perform
non-metric
multidimensional scaling. For methods of cluster analysis and mixtures see the
Cluster
task view. The
BradleyTerry2
package fits the Bradley-Terry model for paired comparisons. The
ltm
package fits Rasch and other item-response models to binary items. The
irr
package contains functions for assessing inter-rater reliability; also see the
psy
package.
-
Other Multivariate Statistics:
See the
Multivariate
task view, which includes information on graphs for visualizing multivariate data.
-
Missing Data:
A variety of packages implement methods for handling missing data by multiple imputation, including the
mix and
pan
packages associated with Schafer,
Analysis of Incomplete Multivariate Data
(Chapman and Hall, 1997), and the
mice
and
mitools
packages (the latter for drawing inferences from multiply imputed data sets). There are also some facilities for missing-data imputation in the general
Hmisc
package, which is described below, under
"Collections"
.
-
Bootstrapping and Other Resampling Methods:
The recommended package
boot, associated with Davison and Hinkley,
Bootstrap Methods and Their Application
(Cambridge, 1997), has excellent facilities for bootstrapping and some related methods. Also notable is the
bootstrap
package, associated with Efron and Tibshirani,
An Introduction to the Bootstrap
(Chapman and Hall, 1993), which has functions for bootstrapping and jackknifing.
-
Model Selection:
The
step
function in the stats package and the more broadly applicable
stepAIC
function in the MASS package perform forward, backward, and forward-backward stepwise selection for a variety of statistical models. The
regsubsets
function in the
leaps
package performs all-subsets regression. The
BMA
package performs Bayesian model averaging. Beyond these, see the
MachineLearning
task view.
-
Social Network Analysis:
There are several packages useful for social network analysis, including
sna
for sociometric analysis of networks (e.g., blockmodeling),
network
for manipulating and displaying network objects, and
latentnet
for latent position and cluster models for networks.
-
Bayesian Statistical Methods:
Because of its easy programmability, R is a natural environment within which to implement and use Bayesian methods, and there are many packages that provide such methods, including interfaces to external Bayesian software, such as BUGS. For details, see the
Bayesian
task view.
-
Spatial Statistics:
In addition to the recommended
spatial
package, see the
Spatial
task view for an extensive list of functions and packages for spatial data analysis.
-
Time-Series Analysis:
Beyond time-series regression (see
generalized least-squares regression,
above), R has very extensive facilities for time-series analysis, both in the standard R distribution and in contributed packages; for details, see the
Econometrics
and
Finance
task views.
-
Surveys:
The
sampling
package includes functions for selecting survey samples; the
survey
package includes functions for the analysis of data from complex sample surveys, among them functions for fitting linear and generalized linear models.
-
Meta-Analysis:
See the
meta
and
rmeta
packages.
-
Propensity Scores and Matching:
See the
Matching,
MatchIt, and
optmatch
packages.
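A few of the methods surveyed above are available without installing contributed packages. The following sketch assumes data sets shipped with R (USArrests, ability.cov, eurodist, mtcars); the bootstrap statistic and the candidate regressors for stepwise selection are illustrative assumptions.

```r
library(boot)  # recommended package: boot(), boot.ci()

# Principal-components analysis on standardized variables.
pca <- prcomp(USArrests, scale. = TRUE)
print(summary(pca))

# Maximum-likelihood factor analysis (varimax rotation by default),
# fit from a covariance matrix of six ability tests.
fa <- factanal(factors = 2, covmat = ability.cov)
print(fa$loadings)

# Metric multidimensional scaling of road distances between European cities.
mds <- cmdscale(eurodist, k = 2)

# Nonparametric bootstrap of a regression slope, with a percentile interval.
set.seed(123)
slope <- function(d, i) coef(lm(mpg ~ wt, data = d[i, ]))[["wt"]]
bt <- boot(mtcars, statistic = slope, R = 500)
ci <- boot.ci(bt, type = "perc")
print(ci)

# Backward stepwise selection by AIC from an illustrative full model.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
sel <- step(full, direction = "backward", trace = 0)
print(formula(sel))
```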
Collections of Functions:
There are some packages that are so heterogeneous that they are difficult to classify, yet contain functions (typically in multiple domains) that are potentially of interest to social scientists:
-
I have already made several references to the recommended
MASS
package, which is
associated with Venables and Ripley's
Modern Applied Statistics With S
. Other recommended
packages associated with this book are
nnet, for fitting neural networks (but also, as
mentioned, multinomial logistic-regression models);
spatial
for spatial statistics; and
class, which contains functions for classification.
-
The
Hmisc
and
rms
packages (both mentioned above), associated with Harrell,
Regression Modeling Strategies
(Springer, 2001), provide functions for data manipulation, linear models, logistic-regression models, and survival analysis, many of them "front ends" to or modifications of other facilities in R.
-
The
Zelig
package integrates a wide array of statistical models of interest to social scientists (see the
Zelig web site
for details).