Summaries - Office of Research & Innovation

Research Summaries

Back Principal Component Analysis Over Tree Spaces and Its Applications to Phylogenomics

Fiscal Year	2019
Division	Graduate School of Operational & Information Sciences
Department	Operations Research
Investigator(s)	Yoshida, Ruriko
Sponsor	National Science Foundation (NSF)
Summary	Principal Component Analysis (PCA) is the most popular approach to rendering two- or three-dimensional representations of the major trends in multidimensional data. The problem of multidimensionality is acute in the rapidly growing area of phylogenomics, which can provide insight into relationships and evolutionary patterns of a diversity of organisms, from humans, plants and animals, to microbes and viruses. PCA is a statistical method that takes data points in a high dimensional Euclidean space into a lower dimensional plane which minimizes the sum of squares between each point in the data set and their orthogonal projection onto the plane. It has been used for clustering high dimensional data points for statistical analysis and it is one of the simplest and most robust ways of doing such dimensionality reduction in a Euclidean vector space. However, it assumes the properties of a Euclidean vector space. The space of all possible phylogenies on a fixed set of species does not form a Euclidean vector space, so principal component analysis must be reformulated in the geometry of tree-space. Motivated by the previous work by T. Nye in 2011 on construction of the first principal component, or principal geodesic, in this proposed project, we propose a geometric object which represents a k-th order principal component: the locus of the weighted Fr\'echet mean of k+1 points in tree-space, where the weights vary over the standard k-dimensional simplex under the Billera-Holmes-Vogtman (BHV), and tropical linear spaces via tropical metric in the tropical geometry known as the max-plus algebra. We first propose to establish basic properties of these objects, in particular that locally they generically have dimension k, and we propose efficient algorithms for projection onto these surfaces.
Keywords	Data Science Graph Theory Unsupervised learning
Publications	Publications, theses (not shown) and data repositories will be added to the portal record when information is available in FAIRS and brought back to the portal
Data	Publications, theses (not shown) and data repositories will be added to the portal record when information is available in FAIRS and brought back to the portal