cMap

Back to cMap
Back to Dan Pape's Homepage

Introduction

cMap is an implementation of a special neural network known as a Self-Organizing Map (SOM). cMap discovers clusters and categories within document datasets for which the internal structure is not known. These categories are then displayed in a three-dimensional graphical fashion, to allow the user to explore and browse the dataset.

The SOM algorithm was developed in the early 1980's to help identify clusters in multidimensional datasets. The SOM does this by effectively smashing the dataset onto a two-dimensional plane. The result is that data points (documents) that were "similar" to each other in the original multidimensional data space are then placed onto nearby areas of the smashed two dimensional data space.

We use a custom implementation of the SOM algorithm, which we call cMap, to cluster and categorize large datasets of documents. This custom implementation has a number of optimizations over the basic SOM algorithm; these optimizations have improved the run-time of the algorithm by many orders of magnitude.

cMap can take an original multidimensional dataset of documents and create a two dimensional data space, or "map", of the document clusters. By examining the distribution of the documents in this map, we are able to create a three dimensional landscape visualization which can easily show the relationships and degree of similarity between the document clusters.

SOM Algorithm

The original SOM algorithm is quite simple: first the data is transformed into a number of "feature vectors" where each element of the vector is associated which some feature or attribute of the data point (for our documents, each feature represents a term or phrase that may be present in the document). The map itself consists of a number of neurons (usually 2-10 thousand) which simply contain a "reference vector" which is of the same length as the feature vectors. These reference vectors are used in the similarity computations between the documents, and are initialized with random values. Next, the following steps are taken:

  1. A feature vector is randomly selected from the input set.
  2. Using some appropriate metric, the feature vector is compared to every neurons' reference vector.
  3. The neuron whose reference vector is most similar to that particular feature vector is chosen as the winning neuron.
  4. The neighboring neurons (neurons which are topographically close in the map up to a certain geometric distance) to the winning neuron are then updated by a certain amount. The effect of this update should be to transform the reference vectors of the neighboring neurons to be slightly more similar the feature vector.
  5. Go back to the first step.

This process is repeated for an indefinite amount of time, until the placement of documents onto the map does not change significantly from one moment to another.

The effect that this process has is to place feature vectors which are similar to each other onto nearby areas of the map. Additionally, areas of similarity (clusters) which are nearby other areas of similarity can be said to be more similar to each other than to others areas further away on the map. An analogy to this, if a map of the United States was said to be a SOM, is to say that people living in Illinois and Indiana are more similar to each other than people living in Illinois and California.

Optimizations

An obvious optimization would be to parallelize the algorithm. However, it would be more beneficial to first optimize the serial algorithm, and then to parallelize.

In the algorithm above, the two most time consuming steps are the 2nd and the 4th. We have taken a number of steps to speed up these steps as much as possible.

Sparsity

One characteristic of the document data we are using is that the feature vectors for each document are very sparse. An optimization was made to rewrite our vector storage and manipulation routines to take advantage of this sparseness. Because most of the feature vectors in our datasets turned out to have a sparsity on the order of 0.01%, we have been able to decrease our memory usage by as much as a factor of 10,000. Also, because our new vector manipulation routines only need to do computations on the non-zero elements of these sparse vectors, they have also decreased their running time by a factor of 10,000. This greatly helps optimize Step 2, because the vector comparison functions are now much faster.

Map Topology

The randomly initialized map becomes somewhat smoothly ordered after only a few iterations through the algorithm. We realized that searching the entire map for a particular feature vector's winning neuron was not necessary--the new winning neuron was likely to be close to the old winning neuron from the previous iteration. We changed the algorithm to search for the winning neuron starting from the position of the last known winning neuron; again, this greatly sped up Step 2, since on average the winning neuron search would only have to computes a few tens of vector comparisons, and not a few thousand.

Update Compression

The costly part of Step 4 is due to the vector update computation that needs to be done for the neighboring neurons of each winning neuron. Statistically, each neuron's reference vector is updated many more times than it is actually used in a winning node search computation. We believe that instead of performing possibly thousands of updates as they occur, it is possible to "compress" these update requests, and only perform a single update when the vector is specifically needed for a winning node search computation.

We are still in the process of completing the update compression optimizations, as well as the parallelizing the final, optimized serial algorithm. Preliminary results look very promising.

Visualization

By examining the final values of the reference vectors contained in the neurons of the map, we can create a landscape representation of the clusters computed by the map. The landscape metaphor is easily understood--the document clusters are represented by the "valleys" in the landscape, and the boundaries between the clusters are represented by the "mountains". Another important aspect of the visualization is that the heights of the mountains represent the degree of dissimilarity between neighboring clusters. For example, two clusters that are next to each other on the map are somewhat similar to each other by virtue of their placement; a low mountain range between them shows that the two neighboring clusters are slightly dissimilar, however, if there is a high mountain range between them, it would indicate that their dissimilarity is quite large, though not so large as to move the clusters apart from each other.

Future Work

There are three main objectives for the future - finish the optimizations and parallelization, figure out a way to intelligently label the map, and improve our visualization interface. Initial progress into all of these areas has been made.