Loïc Yengo (CNRS UMR8199 – Lille Institute of Biology)

Simultaneous clustering and variable selection in regression

Friday, 14 December 2012, 9:30 - 10:30

Meeting room, Espace Turing

In high-dimensional settings, linear regression models often fail to
achieve the dual aims of prediction and interpretation. Numerous solutions
to this limitation have been proposed through different frameworks that
traditionally cover stepwise variable selection and, more recently,
penalized approaches. All these approaches rely on the assumption that
the vector of regression coefficients is intrinsically sparse. In many
cases, this assumption can be violated, since many covariates may
have relatively small but still relevant effects. Accounting for such
covariates could significantly improve the prediction accuracy as well
as the interpretation of the model, provided a proper methodology to be

We introduced a new framework in which covariates sharing
sufficiently close coefficients are grouped within clusters. This
strategy allows a high level of parsimony and can be used as a variable
selection tool by assuming one cluster to be associated with a zero
coefficient. Our approach differs from the group LASSO in that none of the
clusters are known a priori. Instead, the clusters are inferred through
the estimation of discrete parameters. Consistent estimates of the model
parameters are obtained by maximizing the likelihood. The optimal number
of groups is selected using either cross-validation or a BIC-like
criterion. Our methodology has shown good properties (bias and
prediction) when compared to the LASSO, ridge regression and the elastic net on