Assumptions about missingness

There are three assumptions about the process by which data become missing [1].

  1. Missing completely at random (MCAR)
  2. Missing at random (MAR)
  3. Missing not at random (MNAR)

Probabilistic interpretation

The process by which data become missing is treated as random, so missingness can be formalised in probabilistic terms.

Mathematical formalism

The following unifies the formalisms in [1], [2], and [3].

  • $D$ : $n \times p$ data matrix
    • $D = (D_{obs}, D_{mis})$
  • $D_{obs}$ : observed values of $D$
  • $D_{mis}$ : unobserved values of $D$
  • $M$ : $n \times p$ missingness indicator matrix, where $M_{ij} = 1$ if $D_{ij}$ is observed and $0$ otherwise
  • $j$ : index of a distinct missing data pattern in $M$
  • $J$ : total number of distinct missing data patterns
  • $S_j$ : set of cases with missing data pattern $j$
  • $m_j$ : number of cases in $S_j$
  • $p_j$ : number of observed variables in $S_j$
  • $\boldsymbol{\mu}$ : population mean vector
  • $\Sigma$ : population covariance matrix
  • $\hat{\boldsymbol{\mu}}$ : ML estimate of $\boldsymbol{\mu}$
  • $\hat{\Sigma}$ : ML estimate of $\Sigma$
  • $Q_j$ : $p \times p_j$ matrix indicating which variables are observed for pattern $j$
  • $\hat{\boldsymbol{\mu}}_{obs,j}$ : subset of $\hat{\boldsymbol{\mu}}$ given by $\hat{\boldsymbol{\mu}}_{obs,j} \equiv \hat{\boldsymbol{\mu}} Q_j$
  • $\bar{D}_{obs,j}$ : vector of sample means of the observed variables in pattern $j$
  • $\hat{\Sigma}_{obs,j}$ : subset of $\hat{\Sigma}$ given by $\hat{\Sigma}_{obs,j} \equiv Q_j^{\top} \hat{\Sigma} Q_j$
  • $\tilde{\Sigma}_{obs,j}$ : degrees-of-freedom-corrected version of $\hat{\Sigma}_{obs,j}$ given by $\tilde{\Sigma}_{obs,j} = \frac{n}{n-1} \hat{\Sigma}_{obs,j}$
  • $d^2$ : statistic used to test MCAR, where $d^2 = \sum_{j=1}^{J} m_j (\bar{D}_{obs,j} - \hat{\boldsymbol{\mu}}_{obs,j}) \tilde{\Sigma}_{obs,j}^{-1} (\bar{D}_{obs,j} - \hat{\boldsymbol{\mu}}_{obs,j})^{\top}$
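
The quantities defined in the list above can be assembled into d^2 with a few lines of NumPy. This is a minimal sketch, not a reference implementation: the function name `little_d2` is hypothetical, missing entries are assumed to be coded as `np.nan`, and the ML estimates of the mean vector and covariance matrix are assumed to have been computed elsewhere (for example by an EM algorithm) and are passed in as `mu_hat` and `sigma_hat`.

```python
import numpy as np

def little_d2(D, mu_hat, sigma_hat):
    """Return (d2, J) for a data matrix D in which np.nan marks unobserved values.

    mu_hat (length p) and sigma_hat (p x p) are assumed to be ML estimates
    obtained elsewhere; this sketch does not estimate them.
    """
    D = np.asarray(D, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    n, p = D.shape

    M = (~np.isnan(D)).astype(int)            # M_ij = 1 if D_ij is observed
    patterns, inverse = np.unique(M, axis=0, return_inverse=True)
    inverse = np.ravel(inverse)               # pattern index j for every case

    d2 = 0.0
    for j, pattern in enumerate(patterns):
        obs = pattern.astype(bool)
        if not obs.any():
            continue                          # nothing observed: no contribution
        S_j = np.flatnonzero(inverse == j)    # cases with pattern j
        m_j = S_j.size
        Q_j = np.eye(p)[:, obs]               # p x p_j selection matrix
        mu_obs = mu_hat @ Q_j
        sigma_tilde = n / (n - 1) * (Q_j.T @ sigma_hat @ Q_j)
        D_bar = D[np.ix_(S_j, obs)].mean(axis=0)
        diff = D_bar - mu_obs
        d2 += m_j * (diff @ np.linalg.solve(sigma_tilde, diff))
    return d2, len(patterns)
```

The linear system is solved with `np.linalg.solve` instead of forming the inverse explicitly, which is a numerical convenience and does not change the value of the quadratic form.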

The definitions of MCAR, MAR, and MNAR are based on the probability distribution of $M$.

  • MCAR
    • $P(M|D) = P(M)$
  • MAR
    • $P(M|D) = P(M|D_{obs})$
  • MNAR
    • $P(M|D) = P(M|D_{obs}, D_{mis})$, which does not simplify
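
To make the three definitions concrete, the sketch below simulates a missingness indicator matrix under each mechanism for a two-column data matrix. Everything in it is an illustrative assumption: the function name `simulate_missingness`, the logistic form of the missingness probability, the rule that only the second column can go missing, and the independence of the two simulated columns.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_missingness(D, mechanism, rng):
    """Return M (1 = observed, 0 = missing) for a two-column matrix D.

    Column 0 is always observed; only column 1 can go missing.
    """
    n = D.shape[0]
    if mechanism == "MCAR":
        p_miss = np.full(n, 0.5)              # ignores D entirely
    elif mechanism == "MAR":
        p_miss = _sigmoid(2.0 * D[:, 0])      # depends only on the observed column
    elif mechanism == "MNAR":
        p_miss = _sigmoid(2.0 * D[:, 1])      # depends on the value that goes missing
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    M = np.ones(D.shape, dtype=int)
    M[:, 1] = (rng.random(n) >= p_miss).astype(int)
    return M

rng = np.random.default_rng(0)
D = rng.standard_normal((20000, 2))           # complete data, independent columns
M_mcar = simulate_missingness(D, "MCAR", rng)
M_mar = simulate_missingness(D, "MAR", rng)
M_mnar = simulate_missingness(D, "MNAR", rng)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print("MCAR:", corr(M_mcar[:, 1], D[:, 0]), corr(M_mcar[:, 1], D[:, 1]))
print("MAR: ", corr(M_mar[:, 1], D[:, 0]), corr(M_mar[:, 1], D[:, 1]))
print("MNAR:", corr(M_mnar[:, 1], D[:, 0]), corr(M_mnar[:, 1], D[:, 1]))
```

Because the complete data are available in a simulation, the indicator can be correlated with both columns. Under MCAR it should be roughly uncorrelated with either, under MAR it should track the always-observed column, and under MNAR it should track the column whose values are removed, which in real data could not be checked.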

The above is summarised informally below [1].

| Assumption | You can predict $M$ with: |
|---|---|
| MCAR | - |
| MAR | $D_{obs}$ |
| MNAR | $D_{obs}$ and $D_{mis}$ |

References

[1] King G, Honaker J, Joseph A, Scheve K. Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation. American Political Science Review. 2001;95(1):49-69.

[2] Little RJA. A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal of the American Statistical Association. 1988;83(404):1198-202.

[3] Rubin DB. Inference and Missing Data. Biometrika. 1976;63(3):581-92.

[4] Ibrahim JG, Zhu H, Tang N. Model Selection Criteria for Missing-Data Problems Using the EM Algorithm. Journal of the American Statistical Association. 2008;103(484):1648-58. PMID: 19693282.