IO Module
IO commands are found in the behemoth-io job jar (e.g. behemoth-io-1.1-SNAPSHOT-job.jar).
usage: com.digitalpebble.behemoth.io.nutch.NutchSegmentConverterJob [-dir <segDir> | <segment>] <output>
<segDir> Nutch segment directory
<segment> individual segment
<output> The output path on HDFS.
Converts a Nutch segment into a Behemoth corpus.
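For instance, converting a single segment could look like the following (a minimal sketch: the segment path crawl/segments/20180425 and the output name nutchCorpus are only illustrative, and the job jar is the same one used in the WARC example below):

hadoop jar io/target/behemoth-io-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.io.nutch.NutchSegmentConverterJob crawl/segments/20180425 nutchCorpus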
usage: com.digitalpebble.behemoth.io.warc.WARCConverterJob -i <archive> -o <output>
<archive> The WARC archive on HDFS.
<output> The output path on HDFS.
--metadata Add document metadata separated by semicolons, e.g. -md source=internet;label=public
Converts a WARC archive into a Behemoth corpus.
Example:
hadoop jar io/target/behemoth-io-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.io.warc.WARCConverterJob -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_KEY -i s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00099-ip-10-60-113-184.ec2.internal.warc.gz -o behemothCorpus
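Metadata can be attached to every document in the resulting corpus via the metadata option (-md, as shown in the usage above). A sketch of such an invocation, assuming a WARC file already on HDFS at warcs/example.warc.gz (the path and the key/value pairs are only illustrative; the value is quoted so the shell does not split it at the semicolon):

hadoop jar io/target/behemoth-io-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.io.warc.WARCConverterJob -i warcs/example.warc.gz -o behemothCorpus -md "source=internet;label=public"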