
# On methods for prediction based on complex data with missing values and robust principal component analysis

(2016)
Author
Promoter
(UGent) and (UGent)
Organization
Abstract
Massive volumes of data are currently being generated at astonishing speed. Technological advances are making it cheaper and easier for companies and institutions to obtain or generate large flows of data. These data can contain different types of complexities, such as unobserved values, illogical values and extreme observations, among many others. On the other hand, researchers sometimes face limitations in obtaining samples. For instance, it can be costly to grow an organism in a lab, so a researcher may prefer to grow just a few of them at the expense of lower-quality results. This type of data often has a large number of features measured on only a small number of observations, so that the dimension of the data is much larger than the sample size. Think, for example, of microarray data. Very often practitioners are more concerned with the proper collection of the data than with actually performing a correct data analysis. In this work we discuss methods for two relevant steps in data analysis. We first look at methods for the exploratory step, where the practitioner wants to dig through the big flow of information to start understanding its structure and features. Next, we discuss methods for the statistical data analysis and focus on one of the most important tasks in this step: predicting an outcome. In this work we also want to address common complexities of real applications such as high-dimensional data, atypical data and missing values. More specifically, this thesis starts by discussing methods for principal component analysis, which is one of the most popular exploratory tools. These methods are extensions of the classical principal components approach that are resistant to atypical data.
Chapter 1 describes the Multivariate S- and the Multivariate least trimmed squares estimators for principal components and proposes an algorithm that can yield more robust results and be computationally faster for high-dimensional problems than existing algorithms for these and other robust methods. We show that the corresponding functionals are Fisher-consistent at elliptical distributions. Moreover, we study the robustness properties of the Multivariate S-estimator by deriving its influence function. The Multivariate S- and the Multivariate least trimmed squares estimators, however, only target casewise outliers, i.e. each observation as a whole is treated as either regular or outlying. Chapter 2 introduces a new method for principal components that is shown to be more powerful against outliers: the coordinatewise least trimmed squares estimator. In particular, our proposal can handle cellwise outliers, which are very common in modern high-dimensional datasets. We adapted our algorithm for the multivariate methods to fit coordinatewise least trimmed squares, so that it can also be computed faster in higher dimensions. In addition, we introduce the functional of the estimator, which can be shown to be Fisher-consistent at elliptical distributions. Chapter 3 extends these three methods to the functional data setting and shows that these extensions preserve the robust characteristics of the methods in the multivariate setting. In Chapter 4 we give some concluding remarks on the robust principal components procedures discussed in Chapters 1, 2 and 3. The last chapter of the thesis covers the topic of prediction with missing data values. To make predictions we consider tree-based methods. Trees are a popular data mining technique that allows one to make predictions on data of different types and with missing values.
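The difference between casewise and cellwise trimming described above can be illustrated with a small sketch. It is not the thesis algorithm: it assumes the residuals from some fitted principal subspace are already given (the matrix below is hypothetical), and it only evaluates two trimmed sums of squares on them. The function names `casewise_lts` and `coordinatewise_lts` and the trimming size `h` are illustrative choices.

```python
def casewise_lts(residuals, h):
    """Sum of the h smallest squared row norms: rows are kept or trimmed as a whole."""
    row_sq = sorted(sum(r * r for r in row) for row in residuals)
    return sum(row_sq[:h])


def coordinatewise_lts(residuals, h):
    """For each column, sum the h smallest squared residuals, then add up the columns."""
    total = 0.0
    for col in zip(*residuals):
        total += sum(sorted(r * r for r in col)[:h])
    return total


# Hypothetical 4 x 3 residual matrix: rows 0, 1 and 2 each contain one wild cell,
# so three of the four rows are contaminated although only 3 of 12 cells are.
residuals = [
    [10.0, 0.1, 0.1],
    [0.1, 10.0, 0.1],
    [0.1, 0.1, 10.0],
    [0.1, 0.1, 0.1],
]
h = 3  # keep 3 of the 4 rows, or 3 of the 4 cells in each column

case_obj = casewise_lts(residuals, h)        # cannot avoid two contaminated rows
cell_obj = coordinatewise_lts(residuals, h)  # drops the one wild cell per column
print(case_obj, cell_obj)
```

With these numbers the casewise objective stays large, because trimming one row still leaves two rows that carry a wild cell, while the coordinatewise objective stays small, because each column discards only its own outlying cell. In an actual estimator the residuals depend on the fitted subspace and the objective is minimised over it, which this sketch does not attempt.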
We compare the prediction performance of tree-based techniques when the available training data contain features with missing values. The missing values are handled either by using surrogate decisions within the trees or by combining an imputation method with a tree-based method. Both classification and regression problems are considered. Overall, our results show that for smaller fractions of missing data an ensemble method combined with surrogates or single imputation suffices. For moderate to large fractions of missing values, ensemble methods based on conditional inference trees combined with multiple imputation show the best performance, while conditional bagging using surrogates is a good alternative for high-dimensional prediction problems. Theoretical results confirm the potentially better prediction performance of multiple imputation ensembles.
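The idea of combining multiple imputation with a tree ensemble can be sketched as follows. This is a simplified stand-in, not the procedure evaluated in the thesis: it assumes random hot-deck draws from the observed values in place of a proper multiple imputation model, and depth-one regression trees (stumps) in place of conditional inference trees. All function names and the toy data are hypothetical.

```python
import random


def impute_hot_deck(rows, rng):
    """Fill each None with a random draw from the observed values of its column."""
    observed = [[v for v in col if v is not None] for col in zip(*rows)]
    return [[v if v is not None else rng.choice(observed[j])
             for j, v in enumerate(row)] for row in rows]


def fit_stump(X, y):
    """Depth-one regression tree: pick the (feature, threshold) with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # skip splits that leave one side empty
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # no valid split: fall back to the overall mean
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    j, t, ml, mr = stump
    return ml if row[j] <= t else mr


def mi_ensemble_predict(X_missing, y, x_new, m=20, seed=0):
    """Average the predictions of m stumps, each fitted on a separately imputed data set."""
    rng = random.Random(seed)
    preds = []
    for _ in range(m):
        X_imp = impute_hot_deck(X_missing, rng)
        preds.append(predict_stump(fit_stump(X_imp, y), x_new))
    return sum(preds) / len(preds)


# Toy regression data: the response jumps from 0 to 10 when feature 0 is large.
X = [[1.0, 5.0], [2.0, None], [3.0, 4.0], [None, 6.0],
     [8.0, 5.0], [9.0, None], [10.0, 4.0], [None, 6.0]]
y = [0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0]

pred = mi_ensemble_predict(X, y, x_new=[9.5, 5.0])
print(pred)
```

Averaging over several imputed data sets lets the ensemble reflect the uncertainty about the missing values, rather than committing to a single filled-in value as single imputation does.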
Keywords
Principal components, Robust statistics, outliers, data mining, decision trees, missing data


## Citation

Chicago
Cevallos Valdiviezo, Holger. 2016. “On Methods for Prediction Based on Complex Data with Missing Values and Robust Principal Component Analysis”. Ghent, Belgium: Ghent University. Faculty of Sciences.
APA
Cevallos Valdiviezo, H. (2016). On methods for prediction based on complex data with missing values and robust principal component analysis. Ghent University. Faculty of Sciences, Ghent, Belgium.
Vancouver
1. Cevallos Valdiviezo H. On methods for prediction based on complex data with missing values and robust principal component analysis. [Ghent, Belgium]: Ghent University. Faculty of Sciences; 2016.
MLA
Cevallos Valdiviezo, Holger. “On Methods for Prediction Based on Complex Data with Missing Values and Robust Principal Component Analysis.” 2016 : n. pag. Print.
@phdthesis{8502386,
abstract     = {Massive volumes of data are currently being generated at astonishing speed. Technological advances are making it cheaper and easier for companies and institutions to obtain or generate large flows of data. These data can contain different types of complexities, such as unobserved values, illogical values and extreme observations, among many others. On the other hand, researchers sometimes face limitations in obtaining samples. For instance, it can be costly to grow an organism in a lab, so a researcher may prefer to grow just a few of them at the expense of lower-quality results. This type of data often has a large number of features measured on only a small number of observations, so that the dimension of the data is much larger than the sample size. Think, for example, of microarray data.
Very often practitioners are more concerned with the proper collection of the data than with actually performing a correct data analysis. In this work we discuss methods for two relevant steps in data analysis. We first look at methods for the exploratory step, where the practitioner wants to dig through the big flow of information to start understanding its structure and features. Next, we discuss methods for the statistical data analysis and focus on one of the most important tasks in this step: predicting an outcome. In this work we also want to address common complexities of real applications such as high-dimensional data, atypical data and missing values. More specifically, this thesis starts by discussing methods for principal component analysis, which is one of the most popular exploratory tools. These methods are extensions of the classical principal components approach that are resistant to atypical data. Chapter 1 describes the Multivariate S- and the Multivariate least trimmed squares estimators for principal components and proposes an algorithm that can yield more robust results and be computationally faster for high-dimensional problems than existing algorithms for these and other robust methods. We show that the corresponding functionals are Fisher-consistent at elliptical distributions. Moreover, we study the robustness properties of the Multivariate S-estimator by deriving its influence function. The Multivariate S- and the Multivariate least trimmed squares estimators, however, only target casewise outliers, i.e. each observation as a whole is treated as either regular or outlying. Chapter 2 introduces a new method for principal components that is shown to be more powerful against outliers: the coordinatewise least trimmed squares estimator. In particular, our proposal can handle cellwise outliers, which are very common in modern high-dimensional datasets.
We adapted our algorithm for the multivariate methods to fit coordinatewise least trimmed squares, so that it can also be computed faster in higher dimensions. In addition, we introduce the functional of the estimator, which can be shown to be Fisher-consistent at elliptical distributions. Chapter 3 extends these three methods to the functional data setting and shows that these extensions preserve the robust characteristics of the methods in the multivariate setting. In Chapter 4 we give some concluding remarks on the robust principal components procedures discussed in Chapters 1, 2 and 3. The last chapter of the thesis covers the topic of prediction with missing data values. To make predictions we consider tree-based methods. Trees are a popular data mining technique that allows one to make predictions on data of different types and with missing values. We compare the prediction performance of tree-based techniques when the available training data contain features with missing values. The missing values are handled either by using surrogate decisions within the trees or by combining an imputation method with a tree-based method. Both classification and regression problems are considered. Overall, our results show that for smaller fractions of missing data an ensemble method combined with surrogates or single imputation suffices. For moderate to large fractions of missing values, ensemble methods based on conditional inference trees combined with multiple imputation show the best performance, while conditional bagging using surrogates is a good alternative for high-dimensional prediction problems.
Theoretical results confirm the potentially better prediction performance of multiple imputation ensembles.},
author       = {Cevallos Valdiviezo, Holger},
keyword      = {Principal components,Robust statistics,outliers,data mining,decision trees,missing data},
language     = {eng},
pages        = {XV, 157},
publisher    = {Ghent University. Faculty of Sciences},
school       = {Ghent University},
title        = {On methods for prediction based on complex data with missing values and robust principal component analysis},
year         = {2016},
}