When faced with a data-rich environment, a good way to start analyzing and organizing the data, and to get a look at the big picture, is to use a classification scheme. Here I describe some ways to classify data, some practical uses, an in-progress application of the method to Visual and Infrared Mapping Spectrometer (VIMS) spectra of Titan, and some links to places with further information.
The image above to the left is a 3-color image of Titan taken by VIMS on Cassini's Tb flyby of Titan. I've used only 3 colors of the approximately 64 useful wavelengths at which VIMS sees through Titan's hazy atmosphere to the surface: 5.0 µm is red, 2.0 µm is green, and 1.6 µm is blue. With all of that color information, is there an easy way to sort out which areas are spectrally self-similar? I was looking for an easier way than sitting down with 10,000 spectra and a set of magic markers on an overhead slide, and hierarchical classification is the numerical method that I decided to use.
I actually got my information from a Wikipedia article on classification (also called cluster analysis). I chose hierarchical cluster analysis because it was the type that required the fewest assumptions or inputs regarding the data. I wanted the cluster analysis to find the things that I hadn't seen before. It does this, and well, but the solution is somewhat unstable: if I make small changes to the input dataset, the precise branching points and directions change. I think that the way to solve this is to first identify useful clusters using the hierarchical method, and then to go back later with a top-down approach to solidify the clusters that I have chosen.
I implemented the algorithm myself in C++. It wasn't terribly difficult, but making it run in a reasonable amount of time (it scales roughly as n², where n is the number of pixels) can get tricky. There were no problems with runtime for a 64×64×64 VIMS cube, but when I tried to run it on a 4-pixels-per-degree cylindrical map with 64 colors, I ran into the 2 GB address-space limit of a 32-bit process. More optimization addressed that issue without much trouble, though.
The short version of what I do is that I take the list of pixels and identify the two pixels that are nearest each other in "distance" defined as follows:
$\left(x_0^2 + x_1^2 + \dots + x_n^2\right)^{1/2}$
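For concreteness, here is a minimal C++ sketch of that distance. It assumes each x_i is the difference between the two pixels' values in spectral channel i (the usual Euclidean reading of the formula). The function name `spectral_distance` and the use of `std::vector<double>` for a spectrum are illustrative choices, not taken from the original code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: Euclidean distance between two pixel spectra.
// Assumption: x_i = a[i] - b[i], the difference in channel i.
double spectral_distance(const std::vector<double>& a,
                         const std::vector<double>& b) {
    assert(a.size() == b.size());  // both spectra must cover the same channels
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double x = a[i] - b[i];
        sum += x * x;
    }
    return std::sqrt(sum);
}
```

With this definition, two pixels with identical spectra are at distance zero, and every channel contributes with equal weight.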
I then create a new node composed of pointers to the two closest nodes, delete the pointers to those original nodes in the list, and put a pointer to the new fusion node into the list. I then keep going until there's only 1 node left.
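The merge loop described above might look roughly like the following C++ sketch. It is not the original implementation. The `Node` type, the helper names, and in particular the choice to represent a fusion node by the pixel-count-weighted mean of its children are all assumptions made for illustration, since the text does not say how the distance to a fused node is measured. The pair search is brute force, so this version is slower than the n² behavior mentioned earlier.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical node type: a leaf holds one pixel's spectrum; a fusion node
// owns its two children and (by assumption) stores their weighted mean.
struct Node {
    std::vector<double> spectrum;
    std::size_t count = 1;              // number of pixels under this node
    std::unique_ptr<Node> left, right;  // both null for a leaf
};

std::unique_ptr<Node> make_leaf(std::vector<double> spectrum) {
    auto n = std::make_unique<Node>();
    n->spectrum = std::move(spectrum);
    return n;
}

// Euclidean distance between the spectra stored in two nodes.
double node_distance(const Node& a, const Node& b) {
    assert(a.spectrum.size() == b.spectrum.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < a.spectrum.size(); ++k) {
        const double x = a.spectrum[k] - b.spectrum[k];
        sum += x * x;
    }
    return std::sqrt(sum);
}

// Repeatedly fuse the two closest nodes until a single root remains.
// Brute-force pair search: O(n^2) per merge, O(n^3) overall.
std::unique_ptr<Node> build_tree(std::vector<std::unique_ptr<Node>> nodes) {
    if (nodes.empty()) return nullptr;
    while (nodes.size() > 1) {
        std::size_t best_i = 0, best_j = 1;
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            for (std::size_t j = i + 1; j < nodes.size(); ++j) {
                const double d = node_distance(*nodes[i], *nodes[j]);
                if (d < best) { best = d; best_i = i; best_j = j; }
            }
        }
        auto fused = std::make_unique<Node>();
        const Node& a = *nodes[best_i];
        const Node& b = *nodes[best_j];
        fused->count = a.count + b.count;
        fused->spectrum.resize(a.spectrum.size());
        for (std::size_t k = 0; k < a.spectrum.size(); ++k) {
            fused->spectrum[k] =
                (a.count * a.spectrum[k] + b.count * b.spectrum[k]) / fused->count;
        }
        fused->left = std::move(nodes[best_i]);
        fused->right = std::move(nodes[best_j]);
        // Drop the two originals (higher index first), then add the fusion node.
        nodes.erase(nodes.begin() + best_j);
        nodes.erase(nodes.begin() + best_i);
        nodes.push_back(std::move(fused));
    }
    return std::move(nodes.front());
}
```

Calling `build_tree` on a list of leaves made with `make_leaf` returns the root of the tree. Other ways of measuring the distance to a fused node (nearest member, farthest member) would slot into the same loop and can give different trees.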
Viewing the resulting tree has been trickier for me. Right now I give my clusterer a set of 32 different RGB color values to use, and tell it to then proceed 5 levels into the tree. It would be nice, however, to have it query me as to which way to go at some junctions; if there are only 2 pixels or so off in one direction, I'd just as soon have it continue down the main branch to the next intersection.
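One way to picture the "proceed 5 levels" step: five binary splits yield at most 2^5 = 32 subtrees, which matches a 32-entry palette. The C++ sketch below labels every pixel by the subtree it falls into. The `TreeNode` type and the function names are hypothetical, and it does not attempt the interactive "skip tiny branches" behavior wished for above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical tree node: leaves carry a pixel index, fusion nodes two children.
struct TreeNode {
    std::size_t pixel = 0;
    std::unique_ptr<TreeNode> left, right;
    bool is_leaf() const { return !left && !right; }
};

std::unique_ptr<TreeNode> leaf(std::size_t pixel) {
    auto n = std::make_unique<TreeNode>();
    n->pixel = pixel;
    return n;
}

std::unique_ptr<TreeNode> join(std::unique_ptr<TreeNode> l,
                               std::unique_ptr<TreeNode> r) {
    auto n = std::make_unique<TreeNode>();
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

// Give every pixel under `node` the same cluster label.
void paint(const TreeNode& node, int label, std::vector<int>& labels) {
    if (node.is_leaf()) { labels[node.pixel] = label; return; }
    paint(*node.left, label, labels);
    paint(*node.right, label, labels);
}

// Descend up to `levels` splits below `node`; each subtree reached gets its
// own label. `code` records the path so far (left = 0, right = 1), so final
// labels lie in [0, 2^levels) and can index a palette of that size.
void label_clusters(const TreeNode& node, int levels, int code,
                    std::vector<int>& labels) {
    if (levels == 0 || node.is_leaf()) {
        paint(node, code << levels, labels);  // pad unused levels with zeros
        return;
    }
    label_clusters(*node.left, levels - 1, code * 2, labels);
    label_clusters(*node.right, levels - 1, code * 2 + 1, labels);
}
```

A call such as `label_clusters(*root, 5, 0, labels)` fills `labels` with values from 0 to 31, one per pixel, which can then be looked up in the list of RGB colors.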
This type of analysis can be done for more than just image data. For instance, I think that the asteroid color system, which was initially determined by hand, was rederived using this method seemingly quite easily. I should look up that reference, though, really. Let me know what experiences you have with it, and if you find a free or open-source product to implement cluster analysis in an easy way! I'd attach mine, but it depends on my Jcube software, and nobody needs all that cruft :)