Core module
Core commands are found in `behemoth-core.job`.
CorpusGenerator is used to ingest a corpus of documents on the local filesystem as a collection of Behemoth documents stored in a SequenceFile on HDFS.
usage: com.digitalpebble.behemoth.util.CorpusGenerator -i <localdir> -o <outputDFSdir> [--recurse] [--unpack] [--metadata]
  <localdir>        The input path on the local filesystem
  <outputDFSdir>    The output path on HDFS
  --recurse         Process the input path recursively
  --unpack          Unpack archives
  --metadata (-md)  Add document metadata, separated by semicolons, e.g. -md source=internet;label=public
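For example, a corpus could be generated roughly as follows; the local and HDFS paths are purely illustrative, and the job file is assumed to be in the current directory:

```sh
# Ingest the local directory /data/docs (and its subdirectories) into a
# Behemoth corpus stored as a SequenceFile on HDFS. Paths are examples only.
hadoop jar behemoth-core.job com.digitalpebble.behemoth.util.CorpusGenerator \
  -i /data/docs -o /user/behemoth/corpus --recurse \
  -md "source=internet;label=public"
```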
CorpusReader is used to print out the contents of documents stored in a SequenceFile on HDFS.
usage: com.digitalpebble.behemoth.util.CorpusReader -i <inputDFSPath> [-c] [-t] [-a] [-m]
  <inputDFSPath>    The input path on HDFS
  -c                Print the first 200 characters of the binary content
  -t                Display the text
  -m                Display the metadata
  -a                Display the annotations
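As a quick check of the previous step, the reader can dump the text and metadata of the generated corpus; again, the corpus path is only an example:

```sh
# Print the text (-t) and metadata (-m) of every document in the corpus
hadoop jar behemoth-core.job com.digitalpebble.behemoth.util.CorpusReader \
  -i /user/behemoth/corpus -t -m
```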
CorpusFilter is used to filter the documents stored in an input SequenceFile and write the ones that pass the filters to a new SequenceFile.
usage: com.digitalpebble.behemoth.util.CorpusFilter -i <inputPath> -o <outputPath>
  <inputPath>       The input path on HDFS
  <outputPath>      The output path on HDFS
In addition to the parameters above, the filtering options are set through the standard Hadoop configuration mechanism, e.g. `-D document.filter.url.keep=.*`.
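A minimal sketch of a filtering run, assuming the corpus from the examples above and that `document.filter.url.keep` keeps only documents whose URL matches the given regular expression (the expression and paths are illustrative):

```sh
# Write the documents whose URL ends in .pdf to a new SequenceFile.
# As with other Hadoop tools, generic -D options go before the
# tool-specific arguments.
hadoop jar behemoth-core.job com.digitalpebble.behemoth.util.CorpusFilter \
  -D document.filter.url.keep='.*\.pdf' \
  -i /user/behemoth/corpus -o /user/behemoth/corpus-pdf
```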
ContentExtractor TODO