Advanced data science problems often involve large volumes of high-dimensional data, i.e., datasets with a very large number of attributes, such as human-gene distributions. Dimensionality reduction is an unsupervised learning technique for constructing a low-dimensional representation of high-dimensional input data.
One of the central problems in machine learning is representing human-interpretable patterns in complex data. High dimensionality poses two challenges. First, it is hard for a person to conceptualize high-dimensional space, so interpreting a model built on such data is non-intuitive. Second, algorithms have a hard time learning patterns when there are many input features relative to the amount of available training data. The purpose of dimensionality reduction is to reduce noise so that a model can identify strong signals among complex inputs, i.e., to identify useful information.
Dimensionality reduction is just one of many advanced machine learning techniques that can be employed using the C3 AI Platform and C3 AI Applications. Examples of dimensionality reduction models include autoencoders, an artificial neural network approach that “encodes” a complex feature space to capture important signals, and principal component analysis (PCA), a statistical method that linearly combines a large number of input variables to generate a smaller, more meaningful set of features.
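To make the PCA idea concrete, the following is a minimal sketch in plain NumPy. It is not tied to the C3 AI Platform or any particular library's PCA API. The function name pca_reduce and the synthetic dataset (200 samples, 50 features driven by 2 hidden factors plus small noise) are assumptions chosen only for illustration.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components.

    Hypothetical helper for illustration; not part of any library.
    """
    # Center each feature so the components describe variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions,
    # ordered from largest to smallest singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of the total variance captured by each component.
    explained = (S ** 2) / np.sum(S ** 2)
    # Each new feature is a linear combination of the original features.
    Z = X_centered @ components.T
    return Z, components, explained[:n_components]


# Synthetic data (an assumption for illustration): 50 observed features
# that are really mixtures of 2 hidden factors, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

Z, components, explained = pca_reduce(X, n_components=2)
print(Z.shape)          # (200, 2)
print(explained.sum())  # close to 1 for this synthetic data
```

Because the synthetic data was built from two hidden factors, two components recover almost all of its variance. On real data the retained variance is usually lower, and the number of components to keep is a judgment call.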