Statistical inference in surveys with random forest imputed estimators

Mehdi Dagdoug
McGill

Survey sampling is concerned with the estimation of finite population parameters. In practice, the survey variable is often only partially observed because of missing data. Item nonresponse is usually handled through some form of imputation, a procedure that replaces missing values with predicted values. In recent years, imputation through machine learning procedures has attracted a lot of attention in national statistical offices. However, little is known about the theoretical properties of the resulting point estimators. In this talk, we will investigate the properties of estimators imputed by regression trees and random forests in surveys. The asymptotic properties of these estimators will be discussed. Variance estimation will also be investigated: we will show that traditional variance estimators may be biased for some configurations of hyper-parameters and suggest a novel variance estimator based on a K-fold cross-validation procedure. A simulation study will be presented to assess the performance of the proposed point and variance estimators. Finally, the choice of hyper-parameters in random forest algorithms will be discussed through a mix of theoretical and empirical results.
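
For readers unfamiliar with imputed estimators, the generic form for a population total can be written as below. This is only an illustrative sketch under assumed notation, none of which appears in the abstract: S_r and S_m denote the sets of responding and nonresponding sample units, w_i the survey weights, x_i the auxiliary variables, and \hat{m}_{rf} a random forest predictor fitted on the respondents. The estimators studied in the talk may differ in detail.

```latex
% Assumed notation (not from the abstract):
%   S_r : respondents,  S_m : nonrespondents,  w_i : survey weight of unit i
%   \hat{m}_{rf}(x_i) : random forest prediction for unit i, fitted on S_r
\[
  \hat{t}_{\mathrm{imp}}
  \;=\; \sum_{i \in S_r} w_i \, y_i
  \;+\; \sum_{i \in S_m} w_i \, \hat{m}_{rf}(x_i)
\]
```

Observed values are used for respondents, and each missing value is replaced by its predicted value, which is the imputation procedure described above.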