

A visual exploration of genomic data, available in my GitHub repository.

Abstract

Interpreting and visually exploring multi-dimensional data sets is challenging. While two- or three-dimensional data are easy to display and understand, the same cannot be said for higher-dimensional data sets. To improve the visualization and interpretation of these large multi-dimensional data sets, we aim to reduce their dimensionality while preserving as much information as possible. The methods in this post achieve dimension reduction through feature extraction and enable visual interpretation of the retained information.

Explanation

This article examines current visualization techniques and dimension-reduction strategies to improve learning and interpretation. Data analysis methods generally fall into two categories:

  1. Linear: These methods use axes that are linear combinations of the original variables. They find the “optimal” or “interesting” linear projection of points onto a reduced vector space. However, they don’t account for nonlinear structures in the data.

  2. Non-linear: These methods often use the concept of neighborhood, graphs, or linear combinations of neighbors. They are more sensitive to nonlinear data structures and aim to preserve the global and local properties of the data.

Both types of methods are based on the idea that the dimensionality of many data sets is artificially high and can be reduced without significant information loss.
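To make the two families concrete, here is a minimal sketch (assuming scikit-learn is available; the S-curve toy dataset and parameter values are illustrative, not from the original study) comparing a linear projection (PCA) with a nonlinear, neighborhood-based one (Isomap):

```python
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# Toy 3-D data: points lying on a nonlinear S-shaped surface.
X, color = make_s_curve(n_samples=500, random_state=0)

# Linear: project onto the two directions of maximal variance.
# This ignores the curvature of the underlying manifold.
X_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear: approximate geodesic distances along the manifold
# via a k-nearest-neighbor graph, then embed them in 2-D.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

print(X_pca.shape, X_iso.shape)
```

Both calls reduce the data from 3 to 2 dimensions; plotting the two embeddings colored by `color` shows Isomap "unrolling" the S-curve while PCA flattens it.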

Conclusion

In this project, we explored various data visualization and analysis techniques that can reduce the dimensionality of multivariate data, aid interpretation, and assist in classification. Here are some key findings:

  • Principal Component Analysis (PCA) is useful for decorrelating the initial variables and can serve as a preprocessing step for Linear Discriminant Analysis (LDA).
  • LDA performs very well because it exploits class labels, but it struggles with nonlinear phenomena. Applying PCA before LDA can improve interpretation and results in certain cases.
  • PCA looks for relationships between variables, while Multidimensional Scaling (MDS) seeks similarities between observations. MDS discards the original variables, making it more a dimension-reduction method than a visualization method.
  • Locally Linear Embedding (LLE) and Isomap share an inherent weakness: the number of nearest neighbors must be tuned for each dataset.
  • Isomap tends to produce a point cloud in which classes are separated and dense, while LLE emphasizes local peculiarities of each individual, yielding a more scattered plot that nonetheless remains readable.
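The PCA-then-LDA idea from the findings above can be sketched as follows (assuming scikit-learn; the digits dataset, the choice of 30 components, and the cross-validation setup are illustrative assumptions, not the project's actual data or parameters):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)  # 64 pixel features, 10 classes

# LDA alone on the raw, partly collinear pixel features.
lda_alone = LinearDiscriminantAnalysis()

# Decorrelate with PCA first, then discriminate with LDA.
pca_lda = make_pipeline(PCA(n_components=30),
                        LinearDiscriminantAnalysis())

for name, model in [("LDA", lda_alone), ("PCA -> LDA", pca_lda)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {score:.3f}")
```

Because PCA removes correlated (and near-constant) directions before LDA estimates its within-class scatter, the pipeline is often more numerically stable than LDA on the raw variables, which is the sense in which PCA can "initialize" LDA.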

This project allowed us to apply several of these techniques to two datasets in which the number of observations was far smaller than the number of variables.