Assumptions about missingness
There are three assumptions about the process by which data become missing [1].
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Missing not at random (MNAR)
Probabilistic interpretation
The process by which data become missing is modelled as random, so missingness can be formalised in probabilistic terms.
Mathematical formalism
The following unifies the formalisms in [1], [2], and [3].
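- $D$: $n \times p$ data matrix
- $D_{\text{obs}}$: observed values of $D$
- $D_{\text{mis}}$: unobserved values of $D$
- $M$: missingness indicator matrix, where $M_{ik} = 1$ if $D_{ik}$ is observed and $M_{ik} = 0$ otherwise
- $j$: distinct missing data pattern in $M$
- $J$: total number of distinct missing data patterns
- $S_j$: set of cases with missing data pattern $j$
- $m_j$: number of cases in $S_j$
- $p_j$: number of observed variables in $S_j$
- $\mu$: population mean vector
- $\Sigma$: population covariance matrix
- $\hat{\mu}$: ML estimate of $\mu$
- $\hat{\Sigma}$: ML estimate of $\Sigma$
- $D_j$: $p_j \times p$ matrix indicating which variables are observed for pattern $j$
- $\mu_{\text{obs},j}$: subset of $\mu$ given by $D_j \mu$
- $\bar{y}_{\text{obs},j}$: vector of sample means of observed variables in pattern $j$
- $\Sigma_{\text{obs},j}$: subset of $\Sigma$ given by $D_j \Sigma D_j^\top$
- $\tilde{\Sigma}$: accounts for degrees of freedom in $\hat{\Sigma}$, given by $\tilde{\Sigma} = n\hat{\Sigma}/(n-1)$
- $d^2$: statistic used to test MCAR, where $d^2 = \sum_{j=1}^{J} m_j \left(\bar{y}_{\text{obs},j} - \hat{\mu}_{\text{obs},j}\right)^\top \tilde{\Sigma}_{\text{obs},j}^{-1} \left(\bar{y}_{\text{obs},j} - \hat{\mu}_{\text{obs},j}\right)$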
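The MCAR test statistic of [2] can be sketched in a few lines once estimates of the mean vector and covariance matrix are available. The following is a minimal illustration, not a reference implementation: the function name `little_d2` is hypothetical, missing entries are assumed to be encoded as `np.nan`, and the ML estimates are assumed to be supplied by the caller (for example from an EM fit) rather than computed here. It also returns the degrees of freedom of the asymptotic chi-squared reference distribution, the sum of the number of observed variables over patterns minus the number of variables, as given in [2].

```python
import numpy as np

def little_d2(data, mu_hat, sigma_hat):
    """Return (d2, df) for an (n, p) array in which np.nan marks missing entries.

    mu_hat and sigma_hat are assumed to be supplied by the caller,
    for example from an EM fit; this sketch does not estimate them.
    """
    data = np.asarray(data, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    n, p = data.shape
    observed = ~np.isnan(data)              # indicator matrix: True where observed
    sigma_tilde = n * sigma_hat / (n - 1)   # degrees-of-freedom correction

    # Group case indices by their row of the indicator matrix (one group per pattern).
    cases_by_pattern = {}
    for i, row in enumerate(observed):
        cases_by_pattern.setdefault(tuple(row.tolist()), []).append(i)

    d2, df = 0.0, -p
    for pattern, cases in cases_by_pattern.items():
        cols = np.flatnonzero(pattern)      # indices of the observed variables
        if cols.size == 0:
            continue                        # a fully missing case carries no information
        m_j = len(cases)
        diff = data[np.ix_(cases, cols)].mean(axis=0) - mu_hat[cols]
        cov_j = sigma_tilde[np.ix_(cols, cols)]
        d2 += m_j * (diff @ np.linalg.solve(cov_j, diff))
        df += cols.size
    return d2, df

# Toy inputs chosen only to make the arithmetic easy to follow.
data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, np.nan], [7.0, np.nan]])
d2, df = little_d2(data, mu_hat=[4.0, 3.0], sigma_hat=np.eye(2))
```

The toy estimates passed in above are placeholders, not ML estimates of the toy data. In practice the returned statistic would be compared against a chi-squared distribution with `df` degrees of freedom.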
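The definitions of MCAR, MAR, and MNAR are based on the probability distribution of the missingness indicator matrix $M$ given the data matrix $D$.
- MCAR: $P(M \mid D) = P(M)$
- MAR: $P(M \mid D) = P(M \mid D_{\text{obs}})$
- MNAR: $P(M \mid D) = P(M \mid D_{\text{obs}}, D_{\text{mis}})$, which does not simplify further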
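A small simulation can make the three mechanisms concrete. The sketch below is illustrative only: the two-variable setup, the logistic missingness probabilities, and all variable names are assumptions made for this example, and the indicator is taken to be 1 (True) when a value is observed. Variable `x` is always observed, while `y` may go missing with a probability that depends on nothing (MCAR), on `x` only (MAR), or on `y` itself (MNAR).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Two correlated variables: x is always observed, y may go missing.
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Probability that y is missing under each assumption (illustrative choices).
p_missing = {
    "MCAR": np.full(n, 0.3),        # depends on nothing
    "MAR": sigmoid(2.0 * x - 1.0),  # depends only on the observed x
    "MNAR": sigmoid(2.0 * y - 1.0), # depends on y, which is unobserved when missing
}

corr_with_x = {}
observed_mean_y = {}
for name, prob in p_missing.items():
    m = rng.random(n) >= prob       # indicator for y: True where observed
    # How well the always-observed x predicts the indicator.
    corr_with_x[name] = np.corrcoef(m.astype(float), x)[0, 1]
    # Mean of y over the cases where it is observed (the true mean is 0).
    observed_mean_y[name] = y[m].mean()
```

Under these assumptions the indicator is essentially uncorrelated with `x` for MCAR and clearly correlated with it for MAR. The complete-case mean of `y` stays near its true value of 0 only under MCAR.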
The above is summarised informally below [1].
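Assumption | You can predict the missingness indicator $M$ with: |
---|---|
MCAR | - |
MAR | observed values $D_{\text{obs}}$ |
MNAR | $D_{\text{obs}}$ and unobserved values $D_{\text{mis}}$ |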
References
[1] King G, Honaker J, Joseph A, Scheve K. Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation. American Political Science Review. 2001;95(1):49-69.
[2] Little RJA. A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal of the American Statistical Association. 1988;83(404):1198-202.
[3] Rubin DB. Inference and Missing Data. Biometrika. 1976;63(3):581-92.
[4] Ibrahim JG, Zhu H, Tang N. Model Selection Criteria for Missing-Data Problems Using the EM Algorithm. Journal of the American Statistical Association. 2008;103(484):1648-58. PMID: 19693282.