Background • missr

Assumptions about missingness

There are three assumptions about the process by which data become missing [1].

Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (MNAR)

Probabilistic interpretation

The process by which data become missing is random, and so missing data can be formalised from a probabilistic perspective.

Mathematical formalism

The following unifies the formalisms in [1], [2], and [3].

$D$ : $n \times p$ data matrix
- $D = (D_{obs}, D_{mis})$
$D_{obs}$ : observed values of $D$
$D_{mis}$ : unobserved values of $D$
$M$ : $n \times p$ missingness indicator matrix, where $M_{ij} = 1$ if $D_{ij}$ is observed and $0$ otherwise
$j$ : distinct missing data pattern in $M$
$J$ : total number of distinct missing data patterns
$S_j$ : set of cases with missing data pattern $j$
$m_j$ : number of cases in $S_j$
$p_j$ : number of observed variables in $S_j$
$\boldsymbol{\mu}$ : population mean vector
$\Sigma$ : population covariance matrix
$\hat{\boldsymbol{\mu}}$ : ML estimate of $\boldsymbol{\mu}$
$\hat{\Sigma}$ : ML estimate of $\Sigma$
$Q_j$ : $p \times p_j$ matrix indicating which variables are observed for pattern $j$
$\hat{\boldsymbol{\mu}}_{obs,j}$ : subset of $\hat{\boldsymbol{\mu}}$ given by $\hat{\boldsymbol{\mu}}_{obs,j} \equiv \hat{\boldsymbol{\mu}}Q_j$
$\bar{D}_{obs, j}$ : vector of sample means of observed variables in pattern $j$
$\hat{\Sigma}_{obs,j}$ : subset of $\hat{\Sigma}$ given by $\hat{\Sigma}_{obs,j} \equiv Q_j^{\top}\hat{\Sigma} Q_j$
$\tilde{\Sigma}_{obs, j}$ : accounts for degrees of freedom in $\hat{\Sigma}_{obs,j}$ given by $\tilde{\Sigma}_{obs, j} = \frac{n}{n-1}\hat{\Sigma}_{obs,j}$
$d^2$ : statistic used to test MCAR where $d^2 = \sum_{j=1}^J m_j (\bar{D}_{obs, j} - \hat{\boldsymbol{\mu}}_{obs, j}) \tilde{\Sigma}_{obs, j}^{-1} (\bar{D}_{obs, j} - \hat{\boldsymbol{\mu}}_{obs, j})^{\top}$
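
The notation above can be traced through a short computation. The following is a minimal Python sketch, not the package's implementation: the function name `little_d2` is hypothetical, missing values are assumed to be encoded as `NaN`, and $\hat{\boldsymbol{\mu}}$ and $\hat{\Sigma}$ are assumed to have been estimated elsewhere (for example with an EM algorithm) and are passed in as arguments. Cases with no observed variables are skipped, which is also an assumption of this sketch.

```python
import numpy as np


def little_d2(D, mu_hat, sigma_hat):
    """Compute d^2 for an n x p array D in which NaN marks a missing value.

    mu_hat (length p) and sigma_hat (p x p) are assumed to be ML estimates
    obtained elsewhere; this function does not estimate them.
    """
    D = np.asarray(D, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    n, p = D.shape

    # M_ij = 1 if D_ij is observed and 0 otherwise
    M = (~np.isnan(D)).astype(int)

    # Each distinct row of M is one missing data pattern j
    patterns, labels = np.unique(M, axis=0, return_inverse=True)
    labels = labels.ravel()

    d2 = 0.0
    for j, pattern in enumerate(patterns):
        observed = np.flatnonzero(pattern)
        if observed.size == 0:
            continue  # assumption: cases with nothing observed contribute nothing

        Q_j = np.eye(p)[:, observed]            # p x p_j selection matrix
        S_j = D[labels == j]                    # cases with pattern j
        m_j = S_j.shape[0]                      # number of cases in S_j

        mean_obs_j = S_j[:, observed].mean(axis=0)   # sample means of observed variables
        mu_obs_j = mu_hat @ Q_j                      # subset of mu_hat
        sigma_obs_j = Q_j.T @ sigma_hat @ Q_j        # subset of sigma_hat
        sigma_tilde_j = (n / (n - 1)) * sigma_obs_j  # degrees-of-freedom correction

        diff = mean_obs_j - mu_obs_j
        d2 += m_j * (diff @ np.linalg.solve(sigma_tilde_j, diff))

    return d2
```

Calling `little_d2(D, mu_hat, sigma_hat)` returns a single non-negative number. In [2], under MCAR, $d^2$ is asymptotically chi-squared distributed with $\sum_{j=1}^J p_j - p$ degrees of freedom, so large values count as evidence against MCAR.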

The definitions of MCAR, MAR, and MNAR are based on the probability distribution of $M$.

MCAR
- $P(M|D) = P(M)$
MAR
- $P(M|D) = P(M|D_{obs})$
MNAR
- $P(M|D)$ does not simplify: $M$ may depend on $D_{mis}$ as well as $D_{obs}$

The above is summarised informally below [1].

Assumption	You can predict $M$ with:
MCAR	-
MAR	$D_{obs}$
MNAR	$D_{obs}$ and $D_{mis}$
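
The three assumptions can be made concrete with a small simulation. The sketch below is illustrative only: the variable names, the logistic form of the missingness probabilities, and the 30% MCAR rate are all assumptions chosen for the example. A variable `x` is always observed, while `y` is subject to missingness. The indicator follows the convention above, so a value of 1 means observed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

x = rng.normal(size=n)        # always observed, so part of D_obs
y = x + rng.normal(size=n)    # subject to missingness


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Probability that each value of y is missing under each assumption
p_mcar = np.full(n, 0.3)      # MCAR: depends on nothing
p_mar = sigmoid(x)            # MAR: depends only on the observed x
p_mnar = sigmoid(y)           # MNAR: depends on y itself, which may be unobserved


def draw_indicator(p_missing):
    """Return M for y: 1 if observed, 0 if missing."""
    return (rng.random(n) >= p_missing).astype(int)


M_mcar = draw_indicator(p_mcar)
M_mar = draw_indicator(p_mar)
M_mnar = draw_indicator(p_mnar)

# The data as an analyst would see them under MAR: NaN where y is missing
y_seen_mar = np.where(M_mar == 1, y, np.nan)

# How well does each candidate predictor track the indicator?
corr_mcar_x = np.corrcoef(M_mcar, x)[0, 1]   # close to zero
corr_mar_x = np.corrcoef(M_mar, x)[0, 1]     # clearly negative
corr_mnar_y = np.corrcoef(M_mnar, y)[0, 1]   # clearly negative
```

Under MAR the indicator is also marginally correlated with `y`, because `y` is correlated with `x`. The defining property is that once `x` is known, `y` adds no further information about $M$.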

References

[1] King G, Honaker J, Joseph A, Scheve K. Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation. American Political Science Review. 2001;95(1):49-69.

[2] Little RJA. A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal of the American Statistical Association. 1988;83(404):1198-202.

[3] Rubin DB. Inference and Missing Data. Biometrika. 1976;63(3):581-92.

[4] Ibrahim JG, Zhu H, Tang N. Model Selection Criteria for Missing-Data Problems Using the EM Algorithm. Journal of the American Statistical Association. 2008;103(484):1648-58. PMID: 19693282.