The `ngrams` tool

Fast rust-based command-line tool for processing text files, extracting ngrams, and making a statistical baseline (rough "language model").

command-line tool (ngrams) for Windows, Linux, Mac OS
normalizes ascii text (lowercases; removes non-alphabetical text; recognizes word and sentence boundaries)
takes windowed word slices as n-grams
tallies the occurrences of n-grams
optional client/server architecture for fast look-up of in-memory n-gram tallies

Installing

Install rust & cargo, then:

cargo build --release

Usage

ngrams

The ngrams tool takes an ascii text file and outputs tallied n-grams. The tally comes first and is space-padded, followed by a tab separator, and lastly the n-gram.

Input is taken to be STDIN if no files are given; otherwise, each given ascii file is processed in turn.

You can optionally specify the window size, i.e. the value of n (--number) in the n-grams (e.g. 3 for 3-grams, 4 for 4-grams, etc.): ngrams -n4 [FILES].

You can also optionally request the output to be sorted with the -s (--sort) flag. Alphabetical sorting is given by -sa and numerical sorting is given with -sn: ngrams -n4 -sn [FILES].

Example:

$ ./ngrams -n4 -sa ../bomdb/bom.txt
     1398	it came to pass
     1299	came to pass that
     1155	and it came to
      257	i say unto you
      234	to pass that the
      169	to pass that when
      159	now it came to
...

If more than one file is given on the command line, all of the results will be merged (accumulatd) in the output tally.

Saving off and reloading results

If you send the tallied results to a file, you can load them back again with the -t (--tallied-input) flag:

# Redirect tallied, unsorted output to file `bom.4grams.tallied`:
$ ./ngrams -n4 ../bomdb/bom.txt >bom.4grams.tallied

# Re-read tallied data, then sort it and show the result:
$ ./ngrams -t -sn bom.4grams.tallied
     1398	it came to pass
     1299	came to pass that
     1155	and it came to
      257	i say unto you
      234	to pass that the
      169	to pass that when
      159	now it came to
...

This capability is particularly useful if you've created a large aggregate file (i.e. baseline) that is the result of processing many files.

Client / Server

For advanced usage, you can also set ngrams up as a server, storing a large amount of data in-memory. This significantly speeds things up if you have a large baseline that you need to compare many books with.

Setting up the server looks like this:

$ ./ngrams --server -t baseline.4grams.tallied
Server mode enabled (unix:/tmp/ngramd.sock)
  -> serving 17417138 ngrams

Now, you can take an ascii file (i.e. an English book, in text format) and score it using the baseline:

$ ./ngrams --client -n4 book.txt
Client mode enabled (unix:/tmp/ngramd.sock)
  -> requesting ngrams
Score: 3.182188199

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
fixtures		fixtures
scripts		scripts
src		src
.gitignore		.gitignore
.gitmodules		.gitmodules
Cargo.toml		Cargo.toml
README.md		README.md
build.rs		build.rs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The `ngrams` tool

Installing

Usage

ngrams

Saving off and reloading results

Client / Server

About

Releases

Packages

Languages

wordtreefoundation/ngram-tools

Folders and files

Latest commit

History

Repository files navigation

The ngrams tool

Installing

Usage

ngrams

Saving off and reloading results

Client / Server

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

The `ngrams` tool

Packages