Mahout Processing Example
In order to be able to extract the document IDs after clustering, there are several steps that have to be followed in Mahout after building a training corpus with GATE, processing with Tika, etc.
When generating the vectors, you have to add the --namedVector flag as in the following:
./bin/hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth -i "path to input" -o "path to output" -a org.apache.lucene.analysis.WhitespaceAnalyzer --namedVector
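What --namedVector preserves can be sketched in plain Python (the document IDs and term weights below are made up for illustration): each sparse vector carries its document ID as a name, so cluster assignments can later be traced back to documents.

```python
# A named sparse vector: {doc_id: {term_index: weight}}.
# The doc IDs and weights are hypothetical.
named_vectors = {
    "doc-001": {0: 0.8, 5: 0.3},
    "doc-002": {1: 0.5, 5: 0.9},
}

# Without --namedVector, only the vectors remain and the document
# identity is lost once vectors are grouped into clusters.
anonymous_vectors = list(named_vectors.values())

for doc_id, vec in named_vectors.items():
    print(doc_id, sorted(vec))
```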
A detailed description of all options can be found here.
When doing kmeans, take care to either compute the initial clusters using canopy in Mahout or set the -k flag to the number of clusters you aim to have. If -k is set, Mahout will write the randomly selected initial points even when the clusters folder is not empty. The --clustering flag has to be set to obtain the document/cluster mapping; it also creates a clusteredPoints directory.
./bin/mahout kmeans -i "path to input vectors folder" -o "path to output cluster folder" -c "path to initial clusters" -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -ow -k 10 -cd 0.1 --clustering
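The essentials of this step (random initialization from -k, the cosine distance measure, and the -x iteration cap) can be sketched in pure Python. This is a toy illustration, not Mahout's implementation; all data is invented.

```python
import math
import random

def cosine_distance(a, b):
    # Analogue of Mahout's CosineDistanceMeasure: 1 - cos(a, b),
    # on sparse vectors stored as {term_index: weight}.
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def kmeans(vectors, k, max_iter=10, seed=42):
    # -k: pick k random input points as initial centroids.
    # -x: cap the number of iterations.
    rng = random.Random(seed)
    names = list(vectors)
    centroids = [dict(vectors[n]) for n in rng.sample(names, k)]
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {
            n: min(range(k),
                   key=lambda c: cosine_distance(vectors[n], centroids[c]))
            for n in names
        }
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its member vectors.
        for c in range(k):
            members = [vectors[n] for n in names if assignment[n] == c]
            if members:
                merged = {}
                for v in members:
                    for i, w in v.items():
                        merged[i] = merged.get(i, 0.0) + w
                centroids[c] = {i: w / len(members) for i, w in merged.items()}
    return assignment  # doc name -> cluster index

vectors = {
    "doc-a": {0: 1.0},
    "doc-b": {0: 0.9, 1: 0.1},
    "doc-c": {2: 1.0},
    "doc-d": {2: 0.8, 3: 0.2},
}
labels = kmeans(vectors, k=2)
print(labels)
```

Because the vectors are named, the returned assignment maps document IDs directly to cluster indices, which is what the --clustering step materializes as clusteredPoints.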
Other options in kmeans include the following flags
There are different ways to analyse the results:
Clusterdump:
Here the --pointsDir flag has to be set, pointing to the clusteredPoints directory created in the previous step. The output .txt file lists the different clusters and the files in each cluster. Using the -n flag you can specify the number of top terms that will be listed for each cluster.
./bin/mahout clusterdump -i "path to cluster output folder"/clusters-*-final -d "path to dictionary folder"/dictionary.file-0 -dt sequencefile -b 100 -n 20 -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir "path to cluster output folder"/clusteredPoints -o cluster-output.txt
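What the -n flag reports can be sketched as follows: the dictionary file maps term indices back to terms, and the top-n terms of a cluster are the highest-weighted entries of its centroid. The dictionary and centroid below are made-up stand-ins.

```python
# Hypothetical dictionary (term index -> term) and cluster centroid
# (term index -> weight), standing in for Mahout's sequence files.
dictionary = {0: "hadoop", 1: "mahout", 2: "cluster", 3: "vector"}
centroid = {0: 0.9, 1: 0.2, 2: 0.7, 3: 0.4}

def top_terms(centroid, dictionary, n):
    # Rank centroid entries by weight and resolve indices to terms.
    ranked = sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)
    return [dictionary[i] for i, _ in ranked[:n]]

print(top_terms(centroid, dictionary, 2))
```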
Other options in the clusterdump include the following flags
In order to map documents to their cluster ID, run:
./bin/hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.util.ClusterDocIDDumper -i .../clusteredPoints -o cluster-mapping
which will produce a sequence file containing the mapping; it can be extracted as plain text using:
hadoop fs -text "sequence-file" > "output-file"
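The text dump prints one key/value pair per line, tab-separated. A small sketch of post-processing it into a cluster-to-documents index (the exact layout of doc ID and cluster ID shown here is an assumption; check your dump's columns):

```python
# Hypothetical dump of the mapping sequence file: "docID<TAB>clusterID".
sample = "doc-001\tCL-3\ndoc-002\tCL-3\ndoc-003\tCL-7\n"

# Invert the per-document lines into cluster -> [doc IDs].
mapping = {}
for line in sample.splitlines():
    doc_id, cluster_id = line.split("\t", 1)
    mapping.setdefault(cluster_id, []).append(doc_id)

print(mapping)
```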