Thi Thanh Yen Nguyen (MAP5, Université Paris Cité)
Optimal transport-based machine learning with applications to genomics and actuarial science
Optimal transport (OT) is a mathematical theory at the interface of optimization and probability, with applications in a wide range of fields. This thesis applies OT and statistics to two domains: biology and actuarial science.
The first part of the thesis addresses the biological challenge of better understanding micro-RNA (miRNA) regulation in the striatum of Huntington’s disease (HD) model mice. To do so, we build several algorithms designed to learn a pattern of correspondence between two data sets in situations where it is desirable to match elements whose relationship belongs to a known parametric model. The two data sets contain miRNA and messenger RNA (mRNA) data, respectively, each data point consisting of a multi-dimensional profile. The strong biological hypothesis is that if a miRNA induces the degradation of a target mRNA or blocks its translation into proteins, or both, then the profile y of the miRNA should be similar to minus the profile x of the mRNA, that is, y should be close to −x. We consider a relaxed hypothesis stating that y is then similar to t(x), where t is an affine transformation in a parametric class that includes minus the identity and encodes expert knowledge about the experiment that yielded the data. The algorithms unfold in two stages. During the first stage, an OT plan P and an optimal affine transformation are learned, using the Sinkhorn-Knopp algorithm and mini-batch gradient descent. During the second stage, P is exploited to derive either several co-clusters or several sets of matched elements. A simulation study illustrates how the algorithms work and how well they perform. An application to real data further illustrates their applicability and interest.
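The two-stage idea can be illustrated with a minimal sketch. It is not the thesis's implementation: the squared Euclidean cost, the rescaling of the cost matrix, the uniform marginals, the unconstrained affine map t(x) = Ax + c initialized at minus the identity, the equal profile dimension for both data sets, and all function names and hyperparameters are assumptions made for illustration. Each step draws a mini-batch from both data sets, computes an entropic OT plan with Sinkhorn-Knopp iterations, and takes one gradient step on (A, c) with the plan held fixed.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.05, n_iter=200):
    """Entropic OT plan between uniform marginals (Sinkhorn-Knopp scaling)."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)              # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (K.T @ u)                # rescale to match column marginals
        u = a / (K @ v)                  # rescale to match row marginals
    return u[:, None] * K * v[None, :]

def sq_cost(T, Y):
    """Pairwise squared Euclidean distances, rescaled to [0, 1] (assumption)."""
    C = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return C / max(C.max(), 1e-12)

def fit_affine_ot(X, Y, n_steps=100, batch=32, lr=0.1, seed=0):
    """Alternate Sinkhorn plans and gradient steps on t(x) = A x + c."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    A, c = -np.eye(d), np.zeros(d)       # start from t = minus the identity
    for _ in range(n_steps):
        I = rng.choice(len(X), size=min(batch, len(X)), replace=False)
        J = rng.choice(len(Y), size=min(batch, len(Y)), replace=False)
        Xb, Yb = X[I], Y[J]
        T = Xb @ A.T + c                 # transformed mini-batch t(x)
        P = sinkhorn_plan(sq_cost(T, Yb))
        # Gradient of sum_ij P_ij ||t(x_i) - y_j||^2 w.r.t. t(x_i), P fixed
        G = 2.0 * (P.sum(axis=1)[:, None] * T - P @ Yb)
        A -= lr * (G.T @ Xb)
        c -= lr * G.sum(axis=0)
    # Plan on the full data, to be exploited in a second stage
    P_full = sinkhorn_plan(sq_cost(X @ A.T + c, Y))
    return A, c, P_full
```

As a crude stand-in for the second stage, one could match each x to the y carrying the most mass in its row of the plan, for instance with `P_full.argmax(axis=1)`; the co-clustering and matching procedures of the thesis are more elaborate than this.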
The second part of the thesis addresses an actuarial problem related to drought events in France. Drought events rank as the second most costly type of natural disaster within the French legal framework of the natural disaster compensation scheme. A key step in the national compensation scheme is for cities to submit requests for a government declaration of natural disaster for a drought event. We take on the challenge of forecasting which cities will submit such requests. The problem can be tackled as a classification task, using classification algorithms. Taking a slightly different perspective, we introduce an alternative procedure that hinges on OT and iPiano, an inertial proximal algorithm for nonconvex optimization. The optimization problem is designed to yield a sparse vector of predictions because it is known that relatively few cities will submit requests. Additionally, we develop a hybrid procedure that combines both types of predictions to improve forecasting accuracy. The real data application is presented and discussed in detail. The convergence of the iPiano algorithm is established using the notion of o-minimal structures.
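To give a feel for how an inertial proximal scheme can produce sparse solutions, here is a minimal sketch of the generic iPiano update, x_{k+1} = prox_{αg}(x_k − α∇f(x_k) + β(x_k − x_{k−1})), with constant step sizes. It does not reproduce the thesis's OT-based objective: the smooth nonconvex term f (a log-type loss on the residuals Mx − y), the ℓ1 penalty g used to encourage sparsity, and the names `M`, `y`, `lam`, and `beta` are all assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def grad_f(x, M, y):
    """Gradient of the toy smooth term f(x) = 0.5 * sum(log(1 + r^2)), r = Mx - y."""
    r = M @ x - y
    return M.T @ (r / (1.0 + r ** 2))

def objective(x, M, y, lam):
    """Toy objective f(x) + lam * ||x||_1 (not the thesis's objective)."""
    r = M @ x - y
    return 0.5 * np.log1p(r ** 2).sum() + lam * np.abs(x).sum()

def ipiano(M, y, lam=0.1, beta=0.5, n_iter=500):
    """Inertial proximal gradient iterations with constant step sizes."""
    # Upper bound on the Lipschitz constant of grad_f, since |f''| <= 1
    L = np.linalg.norm(M, 2) ** 2
    # Half of the usual constant-step bound alpha < 2 * (1 - beta) / L
    alpha = (1.0 - beta) / L
    x_prev = x = np.zeros(M.shape[1])
    for _ in range(n_iter):
        # Gradient step on f plus inertial (momentum) term
        z = x - alpha * grad_f(x, M, y) + beta * (x - x_prev)
        # Proximal step on the l1 penalty sets small entries exactly to zero
        x_prev, x = x, soft_threshold(z, alpha * lam)
    return x
```

In this toy setting, increasing `lam` drives more coordinates of the solution to exactly zero, which mirrors the stated goal of predicting that only a few cities will submit requests.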