CRAN Task View: Official Statistics & Survey Methodology

This CRAN task view lists packages that include methods typically used in official statistics and survey methodology. Many packages provide functionality for more than one of the topics listed below. Therefore this list is not a strict categorization and packages can be listed more than once. Data import/export facilities for commonly used statistical software tools such as SPSS, SAS or Stata are mentioned at the end of the task view.

Complex Survey Design: General Comments

Package sampling includes many different algorithms for drawing survey samples and calibrating the design weights.
Package survey can also handle moderately sized data sets and is the standard package for dealing with already drawn survey samples in R. Once the given survey design is specified within the function svydesign(), point and variance estimates can be computed.
Package simFrame is designed for performing simulation studies in official statistics. It provides a framework for comparing different point and variance estimators under different survey designs as well as different conditions regarding missing values, representative and non-representative outliers.

Complex Survey Design: Details

Package survey allows a complex survey design (stratified sampling, cluster sampling, multi-stage sampling and pps sampling with or without replacement) to be specified for an already drawn survey sample in order to compute accurate point and variance estimates.
Various algorithms for drawing a sample are implemented in package sampling (Brewer, Midzuno, pps, systematic, Sampford, balanced (cluster or stratified) sampling via the cube method, etc.).
The pps package contains functions to select samples using pps sampling. It also supports stratified simple random sampling and the computation of joint inclusion probabilities for Sampford's method of pps sampling.
Package stratification allows univariate stratification of survey populations with a generalisation of the Lavallee-Hidiroglou method.
Package SamplingStrata offers an approach for choosing the best stratification of a sampling frame in a multivariate and multidomain setting, where the sample sizes in each stratum are determined in order to satisfy accuracy constraints on target estimates. To evaluate the distribution of target variables in different strata, information from the sampling frame, or data from previous rounds of the same survey, may be used.
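
As a rough illustration of the pps sampling mentioned above, the following base-R sketch draws a systematic pps sample. It is not the implementation used by packages sampling or pps. The helper pps_systematic() and the size measures are hypothetical, and the sketch assumes that no unit is large enough to get an inclusion probability above 1.

```r
# Systematic pps sampling, a base-R sketch (hypothetical helper, not the
# implementation found in packages sampling or pps).
pps_systematic <- function(size, n) {
  stopifnot(all(size > 0), n >= 1)
  pik <- n * size / sum(size)     # first-order inclusion probabilities
  stopifnot(all(pik <= 1))        # assumption: no probability exceeds 1
  cum <- cumsum(pik)              # cumulated probabilities; last value equals n
  points <- runif(1) + 0:(n - 1)  # random start plus unit steps
  # Unit k is selected when a point falls into (cum[k - 1], cum[k]].
  idx <- findInterval(points, c(0, cum), left.open = TRUE)
  idx <- pmin(idx, length(size))  # guard against floating-point overshoot
  list(sample = idx, pik = pik)
}

set.seed(42)
size <- c(10, 20, 5, 40, 15, 30, 25, 35, 20, 50)  # made-up size measures
res <- pps_systematic(size, n = 3)
res$sample    # indices of the selected units
sum(res$pik)  # sums to the sample size n
```

Because the selection points are one unit apart and no interval is longer than one, every unit is selected at most once.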

Complex Survey Design: Point and Variance Estimation

Package survey allows a complex survey design to be specified. The resulting object can be used to estimate (Horvitz-Thompson-) totals, means, ratios and quantiles for domains or the whole survey sample, and to apply regression models. Variance estimation for means, totals and ratios can be done either by Taylor linearization or resampling (BRR, jackknife, bootstrap or user-defined).
Package EVER provides the estimation of variance for complex designs by delete-a-group jackknife replication for (Horvitz-Thompson-) totals, means, absolute and relative frequency distributions, contingency tables, ratios, quantiles and regression coefficients even for domains.
Package laeken provides functions to estimate certain Laeken indicators (at-risk-of-poverty rate, quintile share ratio, relative median risk-of-poverty gap, Gini coefficient) including their variance for domains and strata based on bootstrap resampling.
Package simFrame allows (user-defined) point and variance estimators to be compared in a simulation environment.
The lavaan.survey package provides a wrapper function for packages survey and lavaan. It can be used for fitting structural equation models (SEM) on samples from complex designs. Using the design object functionality from package survey, lavaan objects are re-fit (corrected) with the lavaan.survey() function of package lavaan.survey. This allows for the incorporation of clustering, stratification, sampling weights, and finite population corrections into a SEM analysis. lavaan.survey() also accommodates replicate weights and multiply imputed datasets.
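
To make the Horvitz-Thompson total and its variance concrete, here is a minimal base-R sketch for a stratified simple random sample without replacement. The data and population sizes are invented for illustration, and the code is a textbook computation rather than the code path of any package listed above.

```r
# Toy stratified simple random sample (hypothetical data).
smp <- data.frame(
  stratum = c("A", "A", "A", "B", "B", "B", "B"),
  y       = c(10, 12, 14, 30, 34, 38, 42)
)
N_h <- c(A = 30, B = 20)                 # assumed population sizes per stratum

strata <- as.character(smp$stratum)
n_h  <- vapply(split(smp$y, strata), length, integer(1))  # sample sizes
s2_h <- vapply(split(smp$y, strata), var, numeric(1))     # sample variances
N_h  <- N_h[names(n_h)]                  # align the ordering of the strata

# Design weights are the inverse inclusion probabilities N_h / n_h.
w <- N_h[strata] / n_h[strata]
total_hat <- sum(w * smp$y)              # Horvitz-Thompson total

# Variance of the estimated total with finite population correction.
var_hat <- sum(N_h^2 * (1 - n_h / N_h) * s2_h / n_h)
se_hat  <- sqrt(var_hat)

total_hat
se_hat
```

For more complex designs (clusters, several stages, unequal probabilities) the variance formula changes, which is where a design object as created by svydesign() becomes useful.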

Complex Survey Design: Calibration

Package survey allows for post-stratification, generalized raking/calibration, GREG estimation and trimming of weights.
Package EVER provides facilities (function kottcalibrate()) to calibrate either on a total number of units in the population, on marginal distributions or joint distributions of categorical variables, or on totals of quantitative variables.
The calib() function in package sampling allows calibration for nonresponse (with response homogeneity groups) for stratified samples.
The calibWeights() function in package laeken is a possibly faster (depending on the example) implementation of parts of calib() from package sampling.
Package reweight allows for calibration of survey weights for categorical survey data so that the marginal distributions of certain variables fit more closely to those from a given population, but does not allow complex sampling designs.
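
The idea behind raking on marginal distributions can be sketched in a few lines of base R. The helper rake(), the sample and the population margins below are hypothetical. The sketch handles exactly two categorical margins and ignores bounds on the weights, so it is far simpler than the calibration functions of the packages above.

```r
# Raking (iterative proportional fitting) of weights to two known margins.
rake <- function(w, row, col, row_tot, col_tot, tol = 1e-8, max_iter = 100) {
  for (iter in seq_len(max_iter)) {
    w <- w * row_tot[row] / ave(w, row, FUN = sum)  # match the first margin
    w <- w * col_tot[col] / ave(w, col, FUN = sum)  # match the second margin
    row_sums <- vapply(split(w, row), sum, numeric(1))
    if (max(abs(row_sums[names(row_tot)] - row_tot)) < tol) break
  }
  unname(w)
}

# Made-up sample with initial design weights of 10.
sex <- c("f", "f", "m", "m", "f", "m")
age <- c("young", "old", "young", "old", "young", "old")
w0  <- rep(10, 6)

# Assumed population totals; both margins sum to 60.
sex_tot <- c(f = 35, m = 25)
age_tot <- c(young = 28, old = 32)

w_cal <- rake(w0, sex, age, sex_tot, age_tot)
tapply(w_cal, sex, sum)  # reproduces sex_tot
tapply(w_cal, age, sum)  # reproduces age_tot
```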

Editing and Visual Inspection of Microdata

Editing tools:

Package editrules converts readable linear (in)equalities into matrix form.
Package deducorrect depends on package editrules and applies deductive correction of simple rounding, typing and sign errors based on balanced edits. Values are changed so that the given balanced edits are fulfilled. To determine which values are changed the Levenshtein metric is applied.
Package SeleMix can be used for selective editing of continuous data. A mixture model (Gaussian contamination model) based on response(s) y and a set of covariates is fit to the data to quantify the impact of errors on the estimates.
Package rrcovNA provides robust location and scatter estimation and robust principal component analysis with high breakdown point for incomplete data. It is therefore applicable to find representative and non-representative outliers.

Visual tools:

Package VIM is designed to visualize missing values using suitable plot methods. It can be used to analyse the structure of missing values in microdata using univariate, bivariate, multiple and multivariate plots in which the information on missing values from specified variables is highlighted in selected variables. It also comes with a graphical user interface.
Package tabplot provides the tableplot visualization method, which is used to profile or explore large statistical datasets. Up to a dozen variables are shown column-wise as bar charts (numeric variables) or stacked bar charts (factors). Key aspects of the analysis with tableplots are the smoothness of a data distribution, the selective occurrence of missing values, and the distribution of correlated variables.
Package treemap provides treemaps. A treemap is a space-filling visualization of aggregates of data with hierarchical structures. Colors can be used to highlight differences between comparable aggregates.

Imputation

A distinction between iterative model-based methods, k-nearest neighbor methods and miscellaneous methods is made. However, often the criteria for using a method depend on the scale of the data, which in official statistics are typically a mixture of continuous, semi-continuous, binary, categorical and count variables. In addition, measurement errors may corrupt non-robust imputation methods. Note that only a few imputation methods can deal with mixed types of variables and only a few methods account for robustness issues.

EM-based Imputation Methods:

Package mi provides iterative EM-based multiple Bayesian regression imputation of missing values and model checking of the regression models used. The regression models for each variable can also be user-defined. The data set may consist of continuous, semi-continuous, binary, categorical and/or count variables.
Package mice provides iterative EM-based multiple regression imputation. The data set may consist of continuous, binary, categorical and/or count variables.
Package mitools provides tools to perform analyses and combine results from multiply-imputed datasets.
Package Amelia provides multiple imputation where first bootstrap samples with the same dimensions as the original data are drawn, and then used for EM-based imputation. It is also possible to impute longitudinal data. In addition, the package comes with a graphical user interface.
Package VIM provides EM-based multiple imputation (function irmi()) using robust estimation, which makes it possible to deal adequately with data including outliers. It can handle data consisting of continuous, semi-continuous, binary, categorical and/or count variables.
Package mix provides iterative EM-based multiple regression imputation. The data set may consist of continuous, binary or categorical variables, but methods for semi-continuous variables are missing.
Package pan provides multiple imputation for multivariate panel or clustered data.
Package norm provides EM-based multiple imputation for multivariate normal data.
Package cat provides EM-based multiple imputation for multivariate categorical data.
Package MImix provides tools to combine results for multiply-imputed data using mixture approximations.
Package robCompositions provides iterative model-based imputation for compositional data (function impCoda()).

Nearest Neighbor Imputation Methods:

Package VIM provides an implementation of the popular sequential and random (within a domain) hot-deck algorithm.
VIM also provides a fast k-nearest neighbor (knn) algorithm which can be used for large data sets. It uses a modification of the Gower distance for numerical, categorical, ordered, continuous and semi-continuous variables.
Package yaImpute performs popular nearest neighbor routines for imputation of continuous variables where different metrics and methods can be used for determining the distance between observations.
Package robCompositions provides knn imputation for compositional data (function impKNNa()) using the Aitchison distance and adjustment of the nearest neighbor.
Package rrcovNA provides an algorithm for (robust) sequential imputation (functions impSeq() and impSeqRob()) by minimizing the determinant of the covariance of the augmented data matrix. Its application is limited to continuous data.
Package impute on Bioconductor provides knn imputation of continuous variables.
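
The principle of knn imputation can be sketched in base R as follows. The helper knn_impute() and the data are hypothetical. The sketch handles numeric variables only, uses a range-scaled absolute distance (a Gower-type distance restricted to numeric data), takes complete rows as donors and imputes the median of the k nearest donors. It assumes that every incomplete row has at least one observed value and that no variable is constant. It is not the algorithm implemented in any of the packages above.

```r
# k-nearest neighbor imputation for numeric data, a base-R sketch.
knn_impute <- function(X, k = 3) {
  X <- as.matrix(X)
  rng <- apply(X, 2, function(v) diff(range(v, na.rm = TRUE)))  # variable ranges
  donors <- which(complete.cases(X))   # only complete rows act as donors
  out <- X
  for (i in which(!complete.cases(X))) {
    miss <- is.na(X[i, ])
    # Mean range-scaled absolute difference on the observed variables.
    d <- sapply(donors, function(j) {
      mean(abs(X[i, !miss] - X[j, !miss]) / rng[!miss])
    })
    nn <- donors[order(d)[seq_len(min(k, length(donors)))]]
    # Impute the median of the donors' values for each missing variable.
    out[i, miss] <- apply(X[nn, miss, drop = FALSE], 2, median)
  }
  out
}

# Made-up data: the last row lacks variable a.
X <- data.frame(
  a = c(1, 2, 3, 10, 11, NA),
  b = c(1.0, 2.1, 2.9, 10.2, 11.1, 10.5)
)
imp <- knn_impute(X, k = 2)
imp[6, "a"]  # median of a for the two donors closest in b (rows 4 and 5)
```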

Miscellaneous Imputation Methods:

Package missMDA allows the imputation of incomplete continuous variables by principal component analysis (PCA) or of categorical variables by multiple correspondence analysis (MCA).
Package mice (function mice.impute.pmm()) and package Hmisc (function aregImpute()) allow predictive mean matching imputation.
Package VIM allows the structure of missing values to be visualized using suitable plot methods. It also comes with a graphical user interface.

Statistical Disclosure Control

Data from statistical agencies and other institutions are mostly confidential in their raw form. Data providers have to ensure confidentiality by modifying the original data so that no statistical units can be re-identified, while keeping the loss of information to a minimum.

Package sdcMicro can be used for the generation of confidential (micro)data, i.e. for the generation of public- and scientific-use files. The package also comes with a graphical user interface.
Package simPopulation simulates synthetic, confidential, close-to-reality populations for surveys based on sample data. Such population data can then be used for extensive simulation studies in official statistics, using simFrame for example.
Package sdcTable can be used to provide confidential (hierarchical) tabular data. It includes the HITAS and the HYPERCUBE technique and uses package lpSolve for solving (a large amount of) linear programs.
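
A first step in assessing the re-identification risk of microdata is to count how often each combination of key variables occurs in the sample. The following base-R sketch shows this for invented data. The key variables and the threshold k = 2 are assumptions made for illustration, and the code is not taken from sdcMicro.

```r
# Sample frequency counts for combinations of key variables (hypothetical data).
micro <- data.frame(
  region = c("N", "N", "N", "S", "S", "S", "S"),
  sex    = c("f", "f", "f", "m", "m", "f", "m"),
  age    = c("30-39", "30-39", "30-39", "40-49", "40-49", "40-49", "40-49")
)

# One factor level per observed combination of the key variables.
key <- interaction(micro$region, micro$sex, micro$age, drop = TRUE)

# fk[i] is the number of records sharing the key combination of record i.
fk <- ave(rep(1, nrow(micro)), key, FUN = sum)

k <- 2                   # assumed anonymity threshold
at_risk <- which(fk < k) # records with a unique (or too rare) key combination
fk
at_risk
```

Records flagged in this way are candidates for recoding or local suppression before a file is released.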

Seasonal Adjustment

For general time series methodology we refer to the TimeSeries task view.

Decomposition of time series can be done with the function decompose(), or more advanced by using the function stl(), both from the basic stats package. Decomposition is also possible with the StructTS() function, which can also be found in the stats package.
Many powerful tools can be accessed via packages x12, x12GUI and seasonal. x12 provides a wrapper function for the X12 binaries, which have to be installed first. It uses an S4-class interface for batch processing of multiple time series. x12GUI provides a graphical user interface for the X-12-ARIMA seasonal adjustment software. Package seasonal offers less functionality but supports the SEATS spec.
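
A short sketch of the decomposition functions from the stats package mentioned above, applied to the AirPassengers series shipped with R. Taking logs and choosing s.window = "periodic" are choices made for this illustration only. A seasonally adjusted series is obtained here simply by subtracting the estimated seasonal component.

```r
# Classical and loess-based decomposition with functions from package stats.
x <- log(AirPassengers)   # monthly series from package datasets, log scale

dec <- decompose(x)                    # moving-average based decomposition
fit <- stl(x, s.window = "periodic")   # seasonal-trend decomposition by loess

# Seasonally adjusted series: original minus estimated seasonal component.
adj_dec <- x - dec$seasonal
adj_stl <- x - fit$time.series[, "seasonal"]

plot(cbind(original = x, decompose = adj_dec, stl = adj_stl),
     main = "Seasonal adjustment (sketch)")
```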

Statistical Record Matching

Package StatMatch provides functions to perform statistical matching between two data sources sharing a number of common variables. It creates a synthetic data set after matching of two data sources via a likelihood approach or via hot-deck.
Package RecordLinkage provides functions for linking and deduplicating data sets.
Package MatchIt allows nearest neighbor matching, exact matching, optimal matching and full matching amongst other matching methods. If two data sets have to be matched, the data must come as one data frame including a factor variable which includes information about the membership of each observation.

Small Area Estimation

Package nlme provides facilities to fit Gaussian linear and nonlinear mixed-effects models and lme4 provides facilities to fit linear and generalized linear mixed-effects models, both used in small area estimation.
The hbsae package provides functions to compute small area estimates based on a basic area or unit-level model. The model is fit using restricted maximum likelihood, or in a hierarchical Bayesian way. Auxiliary information can be either counts resulting from categorical variables or means from continuous population information.
With package JoSAE, point and variance estimation for the generalized regression (GREG) estimator and a unit-level empirical best linear unbiased prediction (EBLUP) estimator can be made at domain level. It basically provides wrapper functions to the nlme package, which is used to fit the basic random effects models.

Indices and Indicators

Package laeken provides functions to estimate popular risk-of-poverty and inequality indicators (at-risk-of-poverty rate, quintile share ratio, relative median risk-of-poverty gap, Gini coefficient). In addition, standard and robust methods for tail modeling of Pareto distributions are provided for semi-parametric estimation of indicators from continuous univariate distributions such as income variables.
Package ineq computes various inequality measures (Gini, Theil, entropy, among others), concentration measures (Herfindahl, Rosenbluth), and poverty measures (Watts, Sen, SST, and Foster). It also computes and draws empirical and theoretical Lorenz curves as well as Pen's parade. It is not designed to deal with sampling weights directly (these could only be emulated via rep(x, weights)).
Package IC2 includes three inequality indices: extended Gini, Atkinson and Generalized Entropy. It can deal with sampling weights and subgroup decomposition is supported.
Function priceIndex() from package micEcon allows the estimation of the Paasche, the Fisher and the Laspeyres price indices.
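
For reference, the three price indices named above can be computed directly from their textbook definitions. The prices and quantities below are invented, and the code does not use priceIndex() from micEcon.

```r
# Laspeyres, Paasche and Fisher price indices from their definitions.
p0 <- c(2, 5, 10); q0 <- c(10, 4, 1)   # base period prices and quantities
p1 <- c(3, 5, 12); q1 <- c(8, 5, 1)    # current period prices and quantities

laspeyres <- sum(p1 * q0) / sum(p0 * q0)  # base period quantities as weights
paasche   <- sum(p1 * q1) / sum(p0 * q1)  # current period quantities as weights
fisher    <- sqrt(laspeyres * paasche)    # geometric mean of the two

c(Laspeyres = laspeyres, Paasche = paasche, Fisher = fisher)
```

Being a geometric mean, the Fisher index always lies between the Paasche and the Laspeyres index.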

Microsimulation

Synthetic, confidential, close-to-reality populations based on sample data can be simulated using simPopulation. Such population data can then be used as a basis for microsimulation scenarios.
Package sms provides facilities to simulate micro-data from given area-based macro-data. Simulated annealing is used to best satisfy the available description of an area. For computational issues, the calculations can be run in parallel mode.

Additional Packages and Functionalities

Various additional packages are available that provide certain functionality useful in official statistics and survey methodology.

Data Import and Export:

Package SAScii imports ASCII files directly into R using only a SAS input script, which is parsed and converted into arguments for a read.fwf call. This is useful whenever SAS scripts for importing data are already available.
The foreign package includes tools for reading data from SAS Xport (function read.xport()), Stata (function read.dta()), SPSS (function read.spss()) and various other formats. It also provides facilities to write files in various formats; see function write.foreign().
Also the package Hmisc provides tools to read data sets from SPSS (function spss.get()) or Stata (function stata.get()).
The pxR package provides a set of functions for reading and writing PC-Axis files, used by different statistical organizations around the globe for dissemination of their (multidimensional) tables.
With package prevR and its function import.dhs(), it is possible to directly import data from the Demographic Health Survey.
Function describe() from package questionr describes the variables of a dataset that might include labels imported with the foreign or memisc packages.
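
A small round trip with package foreign illustrates the import and export functions. The data frame is made up, and write.dta() is used here as the counterpart of read.dta(). The sketch assumes that foreign is installed, which is normally the case because it is a recommended package.

```r
library(foreign)  # recommended package, part of standard R installations

# Made-up survey extract.
survey_df <- data.frame(
  id     = 1:4,
  income = c(1000.5, 2500, 1800.25, 3200)
)

dta_file <- tempfile(fileext = ".dta")
write.dta(survey_df, dta_file)  # export in Stata format
back <- read.dta(dta_file)      # import the file again

all.equal(back$income, survey_df$income)
unlink(dta_file)                # remove the temporary file
```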

Sampling Techniques:

Package samplingbook includes sampling procedures from the book 'Stichproben. Methoden und praktische Umsetzung mit R' by Goeran Kauermann and Helmut Kuechenhoff (2010).
Package SDaA is designed to reproduce results from Lohr, S. (1999) 'Sampling: Design and Analysis', Duxbury, and includes the data sets from this book.
The main contributions of samplingVarEst are Jackknife alternatives for variance estimation of unequal probability one or two stage designs.
Package TeachingSampling includes functionality for sampling designs and parameter estimation in finite populations.
Package memisc includes tools for the management of survey data, graphics and simulation.
Package odfWeave.survey provides support for odfWeave for the survey package.
Package spsurvey includes facilities for spatial survey design and analysis for equal and unequal probability (stratified) sampling.
The FFD package is designed to calculate optimal sample sizes of a population of animals living in herds for surveys to substantiate freedom from disease. The criteria of estimating the sample sizes take the herd-level clustering of diseases as well as imperfect diagnostic tests into account and select the samples based on a two-stage design. Inclusion probabilities are not considered in the estimation. The package provides a graphical user interface as well.

Maintainer:	Matthias Templ
Contact:	matthias.templ at gmail.com
Version:	2014-01-09
