Dimensionality reduction is an important machine learning technique that reduces the number of features and, at the same time, retains as much information as possible. It is usually performed by obtaining a set of new principal features.
As mentioned before, it is difficult to visualize high-dimensional data. Even with a three-dimensional plot, it is sometimes not straightforward to spot any patterns, let alone in 10, 100, or 1,000 dimensions. Moreover, some of the features in high-dimensional data may be correlated and, as a result, bring in redundancy. This is why we need dimensionality reduction. Dimensionality reduction is not simply removing a pair of features from the original feature space; it is transforming the original feature space into a new space of fewer dimensions.
The data transformation can be linear, such as the well-known principal component analysis (PCA), which maximizes the variance of the projected data, or nonlinear, such as neural networks and t-SNE, which is coming up shortly. For instance, PCA maps data from a higher-dimensional space to a lower-dimensional space where the variance of the data is maximized. Non-negative matrix factorization (NMF) is another powerful algorithm, which was covered in Mining the 20 Newsgroups Dataset with Clustering and Topic Modeling Algorithms. At the end of the day, most dimensionality reduction algorithms belong to the family of unsupervised learning, as the target or label information (if available) is not used in the data transformation.
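As a quick illustration of the linear case, here is a minimal sketch (not from the original walkthrough) that uses scikit-learn's PCA to project some synthetic 10-dimensional data onto its top two principal components; the data and variable names are made up for this example:
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.random.RandomState(42).rand(100, 10)  # 100 samples with 10 features
>>> pca = PCA(n_components=2)  # keep the two directions of maximum variance
>>> X_pca = pca.fit_transform(X)  # linear projection onto the new 2D space
>>> X_pca.shape
(100, 2)
After fitting, the explained_variance_ratio_ attribute of the PCA object tells you how much of the total variance each new component captures.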
t-SNE for dimensionality reduction
t-SNE stands for t-distributed Stochastic Neighbor Embedding. It’s a nonlinear dimensionality reduction technique developed by Laurens van der Maaten and Geoffrey Hinton. t-SNE has been widely used for data visualization in various domains, including computer vision, NLP, bioinformatics, and computational genomics.
As its name implies, t-SNE embeds high-dimensional data into a low-dimensional (usually two-dimensional or three-dimensional) space where the similarity among data samples (neighbour information) is preserved. It first models a probability distribution over neighbours around data points by assigning a high probability to similar data points and an extremely small probability to dissimilar ones. Note that similarity and neighbourhood are measured by Euclidean distance or another metric.
Then, it constructs a projection onto a low-dimensional space where the Kullback-Leibler (KL) divergence between the input distribution and the output distribution is minimized. The original high-dimensional space is modelled as a Gaussian distribution, while the output low-dimensional space is modelled as a t-distribution.
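For readers who want a slightly more formal picture, the standard t-SNE formulation (a brief recap, not part of the original walkthrough) defines the pairwise similarities and the objective as follows:

p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}, \qquad \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

Here, x are the original high-dimensional points and y are their low-dimensional embeddings. The bandwidths sigma_i are set so that each point's neighbourhood matches the chosen perplexity, and the KL divergence is minimized by gradient descent, whose step size is controlled by the learning rate.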
We’ll herein implement t-SNE using the TSNE class from scikit-learn:
>>> from sklearn.manifold import TSNE
Now let’s use t-SNE to verify our count vector representation. We pick three distinct topics, talk.religion.misc, comp.graphics, and sci.space, and visualize the document vectors from these three topics.
First, just load documents of these three labels, as follows:
>>> from sklearn.datasets import fetch_20newsgroups
>>> categories_3 = ['talk.religion.misc', 'comp.graphics', 'sci.space']
>>> groups_3 = fetch_20newsgroups(categories=categories_3)
We then go through the same process as before and generate a count matrix, data_cleaned_count_3, with 500 features from the input, groups_3. You can refer to the steps in the previous sections, as you just need to repeat the same code.
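If you don't have that code handy, here is a minimal sketch of the repeated steps. It uses a simplified clean-up (lowercasing and keeping alphabetic tokens only) as a stand-in for the clean-up routine from the earlier sections, so your exact vocabulary may differ slightly:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> def clean_text(doc):
...     # simplified stand-in: keep alphabetic tokens only and lowercase them
...     return ' '.join(word.lower() for word in doc.split() if word.isalpha())
>>> data_cleaned_3 = [clean_text(doc) for doc in groups_3.data]
>>> count_vector = CountVectorizer(stop_words="english", max_features=500)
>>> data_cleaned_count_3 = count_vector.fit_transform(data_cleaned_3)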
Next, we apply t-SNE to reduce the 500-dimensional matrix to a two-dimensional one:
>>> tsne_model = TSNE(n_components=2, perplexity=40,
...                   random_state=42, learning_rate=500)
>>> data_tsne = tsne_model.fit_transform(data_cleaned_count_3.toarray())
The parameters we specify in the TSNE object are as follows:
n_components: The output dimension
perplexity: The number of nearest data points considered neighbors in the algorithm, with a typical value of between 5 and 50
random_state: The random seed, for program reproducibility
learning_rate: The factor affecting the process of finding the optimal mapping space, with a typical value of between 10 and 1,000
Note that the TSNE object only takes in a dense matrix, hence we convert the sparse matrix, data_cleaned_count_3, into a dense one using toarray().
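As a quick optional sanity check (not part of the original walkthrough), you can print the shapes before and after the reduction, assuming the variables defined above:
>>> print(data_cleaned_count_3.shape)  # (number of documents, 500) -- the sparse count matrix
>>> print(data_tsne.shape)             # (number of documents, 2) -- the reduced representation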
We have just successfully reduced the input dimension from 500 to 2. Finally, we can easily visualize the result in a two-dimensional scatter plot, where the x axis is the first dimension, the y axis is the second dimension, and the color, c, is based on the topic label of each original document:
>>> import matplotlib.pyplot as plt
>>> plt.scatter(data_tsne[:, 0], data_tsne[:, 1], c=groups_3.target)
>>> plt.show()
Refer to the following screenshot for the end result:
Data points from the three topics are in different colours, such as green, purple, and yellow. We can observe three clear clusters. Data points from the same topic are close to each other, while those from different topics are far away. Clearly, count vectors are great representations for the original text data as they preserve the distinction among the three different topics.
You can also play around with the parameters and see whether you can obtain a nicer plot where the three clusters are better separated. Count vectorization does well in keeping document disparity.
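For example, you could try a few perplexity values and compare the resulting embeddings side by side; this is just one possible way to experiment, and the values here are arbitrary:
>>> plt.figure(figsize=(15, 5))
>>> for i, perplexity in enumerate([5, 30, 50], 1):
...     tsne = TSNE(n_components=2, perplexity=perplexity,
...                 random_state=42, learning_rate=500)
...     embedded = tsne.fit_transform(data_cleaned_count_3.toarray())
...     plt.subplot(1, 3, i)  # one panel per perplexity value
...     plt.scatter(embedded[:, 0], embedded[:, 1], c=groups_3.target)
...     plt.title(f'perplexity={perplexity}')
>>> plt.show()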
How about maintaining similarity? We can also check that using documents from overlapping topics, such as these five: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, and comp.windows.x:
>>> categories_5 = ['comp.graphics', 'comp.os.ms-windows.misc',
...                 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
...                 'comp.windows.x']
>>> groups_5 = fetch_20newsgroups(categories=categories_5)
Similar processes (including text clean-up, count vectorization, and t-SNE) are repeated, as sketched below, and the resulting plot is displayed after that:
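Here is one way the repeated steps could look, assuming the imports and the simplified clean_text stand-in from the earlier sketch are still in scope (so your plot may vary slightly from the one described next):
>>> data_cleaned_5 = [clean_text(doc) for doc in groups_5.data]
>>> count_vector_5 = CountVectorizer(stop_words="english", max_features=500)
>>> data_cleaned_count_5 = count_vector_5.fit_transform(data_cleaned_5)
>>> tsne_model_5 = TSNE(n_components=2, perplexity=40,
...                     random_state=42, learning_rate=500)
>>> data_tsne_5 = tsne_model_5.fit_transform(data_cleaned_count_5.toarray())
>>> plt.scatter(data_tsne_5[:, 0], data_tsne_5[:, 1], c=groups_5.target)
>>> plt.show()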
Data points from those five computer-related topics are mixed together rather than forming separate clusters, which means the documents are contextually similar. To conclude, count vectors are great representations for the original text data, as they are also good at preserving similarity among related topics.