Simultaneous feature selection and clustering for binary data
Tuesday, 25 June 2013, 13:45 - 14:45
« Information has gone from scarce to superabundant. That brings huge new benefits but also big headaches » (Kenneth Cukier, The Economist, 2010). Indeed, the development of new technologies such as Web 2.0 has made extremely large flows of data available in many scientific fields, including genomics, information retrieval and other, often interdisciplinary, domains. Intuitively, one might think it is always in our interest to use all this information when searching for patterns. In reality, however, some of it can be irrelevant or even redundant, and can thus degrade the performance of clustering algorithms.
Feature selection aims to find the attributes that best uncover classes from data. It has been widely studied in the context of supervised learning: given the class labels, it is natural to keep only the features that are related to, or lead to, these classes. Feature selection in the context of unsupervised learning is harder to grasp: since we do not know the class labels, how can we decide whether a feature is relevant to them? We propose to extend the mixture model approach to answer this question. Basing cluster analysis on mixture models has become a classical and powerful approach. Mixture models assume that a sample is composed of subpopulations, each characterized by a probability distribution. Loosely speaking, the mixture approach maximizes the likelihood over the mixture parameters, which are commonly estimated with the EM algorithm. In this presentation we will focus on binary data.
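To make the mixture setting concrete, the following is a minimal sketch (not the speaker's method, which additionally performs feature selection) of fitting a Bernoulli mixture model to binary data with the EM algorithm; the function name and initialization scheme are illustrative assumptions:

```python
import numpy as np

def bernoulli_mixture_em(X, K, n_iter=100, seed=0):
    """Fit a K-component Bernoulli mixture to binary data X (n x d) via EM.

    Illustrative sketch only: random uniform initialization, fixed number
    of iterations, no convergence check or feature selection.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                 # mixing proportions
    theta = rng.uniform(0.25, 0.75, (K, d))  # per-component Bernoulli parameters
    for _ in range(n_iter):
        # E-step: log-likelihood of each point under each component, plus log prior
        log_p = (X @ np.log(theta).T) + ((1 - X) @ np.log(1 - theta).T) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)  # stabilize before exponentiating
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)    # posterior responsibilities
        # M-step: re-estimate parameters from responsibility-weighted counts
        nk = resp.sum(axis=0)
        pi = nk / n
        theta = np.clip((resp.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return pi, theta, resp
```

Cluster assignments are then read off as `resp.argmax(axis=1)`; an irrelevant feature would show nearly identical `theta` values across components, which is the intuition a selection criterion can exploit.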
The talk will be given in French.