This repository has been archived by the owner on Jul 10, 2019. It is now read-only.

IO Module

Julien Nioche edited this page Apr 25, 2018 · 6 revisions

IO commands are found in the behemoth-io job file (e.g. behemoth-io-1.1-SNAPSHOT-job.jar, as used in the example below).

usage: com.digitalpebble.behemoth.io.nutch.NutchSegmentConverterJob [-dir <segDir> | <segment>] <output>
<segDir>              Nutch segment directory
<segment>             individual segment
<output>              The output path on HDFS.

Converts a Nutch segment into a Behemoth corpus.
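
A minimal sketch of an invocation that follows the usage line above. The jar path is borrowed from the WARC example on this page; `SEG_DIR`, `OUTPUT`, the `run` helper and the `DRY_RUN` switch are hypothetical names introduced for illustration only. By default the script just prints the command, so it can be tried without a cluster.

```shell
#!/bin/sh
# Sketch only: adjust the paths to your own build and HDFS layout.
JOB_JAR="io/target/behemoth-io-1.1-SNAPSHOT-job.jar"  # jar name taken from the WARC example
SEG_DIR="crawl/segments"                              # hypothetical Nutch segment directory
OUTPUT="behemothCorpus"                               # output path on HDFS

# Print the command unless DRY_RUN=0 is set (hypothetical helper, not part of Behemoth).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Pass a segment directory with -dir, as in the usage line above.
nutch_to_behemoth() {
  run hadoop jar "$JOB_JAR" \
    com.digitalpebble.behemoth.io.nutch.NutchSegmentConverterJob \
    -dir "$SEG_DIR" "$OUTPUT"
}

nutch_to_behemoth
```

Set `DRY_RUN=0` to actually submit the job. To convert a single segment instead, drop `-dir` and pass the segment path as the first argument.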


usage: com.digitalpebble.behemoth.io.warc.WARCConverterJob -i <archive> -o <output>
<archive>             The WARC archive on HDFS.
<output>              The output path on HDFS.
--metadata            Add document metadata as semicolon-separated key=value pairs, e.g. -md "source=internet;label=public" (quote the value so the shell does not treat ';' as a command separator)

Converts a WARC archive into a Behemoth corpus.

Example:

hadoop jar io/target/behemoth-io-1.1-SNAPSHOT-job.jar \
  com.digitalpebble.behemoth.io.warc.WARCConverterJob \
  -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY \
  -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_KEY \
  -i s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00099-ip-10-60-113-184.ec2.internal.warc.gz \
  -o behemothCorpus
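
The metadata option from the usage text can be sketched in the same way. This assumes that `-md` is the short form of `--metadata`, as the usage text suggests; `ARCHIVE`, `METADATA`, the `run` helper and the `DRY_RUN` switch are hypothetical names for illustration. The metadata value is kept in a quoted variable because an unquoted `;` would end the command in the shell.

```shell
#!/bin/sh
# Sketch only: adjust the paths to your own build and HDFS layout.
JOB_JAR="io/target/behemoth-io-1.1-SNAPSHOT-job.jar"  # jar name taken from the example above
ARCHIVE="warc/sample.warc.gz"                         # hypothetical WARC archive on HDFS
OUTPUT="behemothCorpus"                               # output path on HDFS
METADATA="source=internet;label=public"               # key=value pairs separated by ';'

# Print the command unless DRY_RUN=0 is set (hypothetical helper, not part of Behemoth).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Assumption: -md is accepted as the short form of --metadata.
warc_to_behemoth() {
  run hadoop jar "$JOB_JAR" \
    com.digitalpebble.behemoth.io.warc.WARCConverterJob \
    -i "$ARCHIVE" -o "$OUTPUT" -md "$METADATA"
}

warc_to_behemoth
```

In dry-run mode the printed command shows the metadata value without quotes; when the job is actually run with `DRY_RUN=0`, the value is still passed as a single argument.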

