This repository has been archived by the owner on Jul 10, 2019. It is now read-only.

Language id

Julien Nioche edited this page Jul 3, 2014 · 3 revisions

Detects the language of a page using the language-detection library (http://code.google.com/p/language-detection/).

Job jar: behemoth-language-id*job.jar

To run language identification on a corpus:

usage: 
hadoop jar ./behemoth-lang*job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -i Tika-corpus -o Tikacorpus-lang

To run language identification and keep only documents in a specific language (here en):

usage:
hadoop jar behemoth-lang*-SNAPSHOT-job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -D document.filter.md.keep.lang=en -i Tika-corpus -o Tikacorpus-EN
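
When the same filtering job has to be run for several languages, the command above can be parameterised. The following is a minimal sketch, not part of Behemoth: the helper name `lang_filter_cmd`, the `Tikacorpus-<lang>` output naming and the `fr` language code are assumptions made for illustration. It only prints the commands, so they can be checked before being run.

```shell
#!/bin/sh
# Hypothetical helper: builds the filtering command shown above for a given
# language code. The jar pattern and driver class are taken from this page;
# the function name and output naming scheme are assumptions.
JAR='behemoth-lang*-SNAPSHOT-job.jar'
DRIVER='com.digitalpebble.behemoth.languageidentification.LanguageIdDriver'

lang_filter_cmd() {
  lang="$1"    # language code to keep, e.g. en
  input="$2"   # input corpus, e.g. Tika-corpus
  output="$3"  # output corpus
  printf 'hadoop jar %s %s -D document.filter.md.keep.lang=%s -i %s -o %s\n' \
    "$JAR" "$DRIVER" "$lang" "$input" "$output"
}

# Dry run: print one command per language (the codes here are examples only).
for lang in en fr; do
  lang_filter_cmd "$lang" Tika-corpus "Tikacorpus-$lang"
done
```

Piping the output to `sh` would execute the printed commands; the jar pattern is left unexpanded until that point, so it is resolved in the directory where the commands are actually run.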