The candidate for the position of professor in actuarial science or financial mathematics, Philippe Gagnon, will give a talk on Wednesday, December 5, at 10:30 a.m. in room 6214-6254.
A Robust and Efficient Statistical Learning Algorithm
(Relying on Nonreversible Jump Samplers)
Nowadays, analysing high-dimensional data sets, meaning data sets composed of a massive number of variables, is common practice. It is also common for each of these variables to have a large number of observations. These characteristics give rise to many challenges. For instance, given that there is always a risk of measurement error, the likelihood that such data sets contain anomalies of this type certainly increases. The chances of encountering “extreme” observations also increase, because external factors may influence the behaviour of variables now included in the analysed data sets (think of how decisions by the US Congress may have an impact on specific sectors of the economy; analysing more variables from more sectors increases the chances of observing shocks in the data). In other words, the analysed data sets are more likely to contain outliers. There is therefore a need for robust procedures that make it possible to reliably analyse large high-dimensional and possibly contaminated data sets, in order to obtain conclusions that are consistent with the majority of the observations (the bulk of the data).
In this talk, I will give the big picture of the robust and efficient statistical learning algorithm that I would like to have the opportunity to develop at Université de Montréal to serve that purpose. It is an automatic procedure in which the models are trained through a full and exact Bayesian analysis. Like a machine learning algorithm, it can be viewed by users as a “black box” that they feed with data, after which the procedure outputs predictions, but also the usual uncertainty assessments (e.g. credible intervals and hypothesis tests). To obtain the latter, however, the robust models have to be simple enough to be trained via the statistical analysis of large high-dimensional (and possibly contaminated) data sets. Linear regression models are natural candidates for a first step towards highly flexible and complex modelling. The linearity constraint on the link between the dependent and independent variables limits their capabilities; functions of the covariates can, however, be included in the models to gain flexibility. An advantage in this case is that the uncertainty assessments are easy to interpret. The next steps of this long-term project then consist of constructing increasingly flexible procedures while preserving all the advantages mentioned.
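To make this concrete, here is a minimal sketch of the kind of robust Bayesian linear regression alluded to above. Everything in it is an illustrative assumption rather than the specific model of the talk: the usual normal errors are replaced with heavy-tailed Student-t errors (degrees of freedom nu = 4), flat priors are placed on the coefficients and on the log of the scale, and the posterior is sampled with a plain random-walk Metropolis algorithm.

    # Minimal sketch: Bayesian linear regression made robust to outliers by
    # replacing normal errors with heavy-tailed Student-t errors. This is an
    # illustrative assumption, not the specific model presented in the talk.
    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data: 100 points on a line, with 5 grossly contaminated responses.
    n = 100
    x = rng.uniform(-2, 2, n)
    y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)
    y[:5] += 15.0  # contamination

    def log_posterior(beta0, beta1, log_sigma, nu=4.0):
        """Student-t likelihood (df = nu), flat priors on (beta0, beta1, log sigma)."""
        sigma = np.exp(log_sigma)
        resid = (y - beta0 - beta1 * x) / sigma
        # log t density, up to a constant not depending on the parameters
        return np.sum(-0.5 * (nu + 1) * np.log1p(resid**2 / nu)) - n * log_sigma

    # Random-walk Metropolis over (beta0, beta1, log sigma).
    theta = np.zeros(3)
    lp = log_posterior(*theta)
    samples = []
    for _ in range(20000):
        prop = theta + 0.05 * rng.normal(size=3)
        lp_prop = log_posterior(*prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        samples.append(theta.copy())

    samples = np.array(samples[5000:])  # discard burn-in
    print("posterior means (beta0, beta1):", samples[:, :2].mean(axis=0))

On this simulated example, the heavy tails let the posterior largely discount the five contaminated points, so the slope estimate tends to stay consistent with the bulk of the data, whereas a normal-error model would be pulled towards the outliers.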
Computation relies on algorithms that Arnaud Doucet (my supervisor at the University of Oxford) and I are currently developing. We named them nonreversible jump samplers. I will describe them in detail in the talk and show the level of performance they can reach.
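The talk will give the details; in the meantime, the toy sketch below illustrates the lifting mechanism that makes a jump sampler nonreversible: the sampler carries a persistent direction variable nu in {-1, +1}, proposes moving the model index in that direction, and flips nu only upon rejection, so it sweeps through model space instead of diffusing back and forth. The one-dimensional model index and the toy target are assumptions for illustration; this is not the authors' full algorithm.

    # Toy sketch of a lifted, nonreversible move over a model index k:
    # a persistent direction nu in {-1, +1} replaces the symmetric coin flip
    # of a reversible sampler, and nu flips only when a proposal is rejected.
    # Illustrates the mechanism only; not the authors' full sampler.
    import numpy as np

    rng = np.random.default_rng(1)
    K = 20
    log_w = -0.5 * (np.arange(K) - 12) ** 2 / 9.0  # toy posterior over models

    def nonreversible_jump(n_iter=100000):
        k, nu = K // 2, +1
        visits = np.zeros(K)
        for _ in range(n_iter):
            k_prop = k + nu
            # Usual acceptance ratio; out-of-range proposals are rejected.
            if 0 <= k_prop < K and np.log(rng.uniform()) < log_w[k_prop] - log_w[k]:
                k = k_prop   # accepted: keep moving in the same direction
            else:
                nu = -nu     # rejected: reverse direction
            visits[k] += 1
        return visits / n_iter

    freq = nonreversible_jump()
    target = np.exp(log_w) / np.exp(log_w).sum()
    print("max |empirical - target| frequency:", np.abs(freq - target).max())

Flipping the direction only on rejection leaves the target distribution invariant (jointly with a uniform distribution on nu) while suppressing the back-and-forth random-walk behaviour of a reversible sampler over the model index.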