Mahout Processing Example
In order to be able to extract the document IDs after clustering, there are several steps that have to be followed in Mahout after building a training corpus with GATE, processing with Tika, etc.
When generating the vectors, you have to add the --namedVector flag as in the following:
./bin/hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth -i "path to input" -o "path to output" -a org.apache.lucene.analysis.WhitespaceAnalyzer --namedVector
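What --namedVector preserves can be sketched in plain Python (the document IDs and term weights below are made up for illustration): each sparse vector carries its document ID as a name, so cluster assignments can later be traced back to documents.

```python
# A named sparse vector: {doc_id: {term_index: weight}}.
# The doc IDs and weights are hypothetical.
named_vectors = {
    "doc-001": {0: 0.8, 5: 0.3},
    "doc-002": {1: 0.5, 5: 0.9},
}

# Without --namedVector, only the vectors remain and the document
# identity is lost once vectors are grouped into clusters.
anonymous_vectors = list(named_vectors.values())

for doc_id, vec in named_vectors.items():
    print(doc_id, sorted(vec))
```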
A detailed description of all options can be found here.
When doing kmeans, take care to either compute the initial clusters using canopy in Mahout or set the -k flag to the number of clusters you aim to have. If -k is set, Mahout will write the randomly selected initial points even when the clusters folder is not empty. The --clustering flag has to be set to obtain the document/cluster mapping; it also creates a clusteredPoints directory.
./bin/mahout kmeans -i "path to input vectors folder" -o "path to output cluster folder" -c "path to initial clusters" -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -ow -k 10 -cd 0.1 --clustering
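The essentials of this step (random initialization from -k, the cosine distance measure, and the -x iteration cap) can be sketched in pure Python. This is a toy illustration, not Mahout's implementation; all data is invented.

```python
import math
import random

def cosine_distance(a, b):
    # Analogue of Mahout's CosineDistanceMeasure: 1 - cos(a, b),
    # on sparse vectors stored as {term_index: weight}.
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def kmeans(vectors, k, max_iter=10, seed=42):
    # -k: pick k random input points as initial centroids.
    # -x: cap the number of iterations.
    rng = random.Random(seed)
    names = list(vectors)
    centroids = [dict(vectors[n]) for n in rng.sample(names, k)]
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {
            n: min(range(k),
                   key=lambda c: cosine_distance(vectors[n], centroids[c]))
            for n in names
        }
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its member vectors.
        for c in range(k):
            members = [vectors[n] for n in names if assignment[n] == c]
            if members:
                merged = {}
                for v in members:
                    for i, w in v.items():
                        merged[i] = merged.get(i, 0.0) + w
                centroids[c] = {i: w / len(members) for i, w in merged.items()}
    return assignment  # doc name -> cluster index

vectors = {
    "doc-a": {0: 1.0},
    "doc-b": {0: 0.9, 1: 0.1},
    "doc-c": {2: 1.0},
    "doc-d": {2: 0.8, 3: 0.2},
}
labels = kmeans(vectors, k=2)
print(labels)
```

Because the vectors are named, the returned assignment maps document IDs directly to cluster indices, which is what the --clustering step materializes as clusteredPoints.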
Other options in kmeans include the following flags
There are different ways to analyse the results:
Clusterdump:
Here the --pointsDir flag has to be set, pointing to the clusteredPoints directory created in the previous step. The output .txt file lists the different clusters and the files in each cluster. Using the -n flag you can specify the number of top terms that will be listed for each cluster.
./bin/mahout clusterdump -i "path to cluster output folder"/clusters-*-final -d "path to dictionary folder"/dictionary.file-0 -dt sequencefile -b 100 -n 20 -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir "path to cluster output folder"/clusteredPoints -o cluster-output.txt
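What the -n flag reports can be sketched as follows: the dictionary file maps term indices back to terms, and the top-n terms of a cluster are the highest-weighted entries of its centroid. The dictionary and centroid below are made-up stand-ins.

```python
# Hypothetical dictionary (term index -> term) and cluster centroid
# (term index -> weight), standing in for Mahout's sequence files.
dictionary = {0: "hadoop", 1: "mahout", 2: "cluster", 3: "vector"}
centroid = {0: 0.9, 1: 0.2, 2: 0.7, 3: 0.4}

def top_terms(centroid, dictionary, n):
    # Rank centroid entries by weight and resolve indices to terms.
    ranked = sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)
    return [dictionary[i] for i, _ in ranked[:n]]

print(top_terms(centroid, dictionary, 2))
```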
Other options in the clusterdump include the following flags
In order to map documents to their cluster ID, run:
./bin/hadoop jar ./behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.util.ClusterDocIDDumper -i .../clusteredPoints -o cluster-mapping
which will produce a sequence file containing the mapping; it can be extracted as plain text using:
hadoop fs -text "sequence-file" > "output-file"
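The text dump prints one key/value pair per line, tab-separated. A small sketch of post-processing it into a cluster-to-documents index (the exact layout of doc ID and cluster ID shown here is an assumption; check your dump's columns):

```python
# Hypothetical dump of the mapping sequence file: "docID<TAB>clusterID".
sample = "doc-001\tCL-3\ndoc-002\tCL-3\ndoc-003\tCL-7\n"

# Invert the per-document lines into cluster -> [doc IDs].
mapping = {}
for line in sample.splitlines():
    doc_id, cluster_id = line.split("\t", 1)
    mapping.setdefault(cluster_id, []).append(doc_id)

print(mapping)
```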