Quick-and-dirty language model with OpenGRM


The script below assumes that the corpus has already been preprocessed (tokenized, whitespace replaced with an entity token, etc.). It uses ngrammake's default smoothing method, which is Katz backoff; pass --method=kneser_ney to ngrammake for Kneser-Ney instead.
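The preprocessing step itself is not shown in these notes. As a rough sketch only, assuming a character-level model in which every character becomes one token and each space is replaced by a placeholder token, something like the following could produce the .split file. The function name preprocess, the token <space>, and the file name corpus.raw are made up for illustration, and the awk loop assumes single-byte (ASCII) input.

```shell
# Hypothetical preprocessing: one token per character, spaces become <space>.
# Assumes ASCII input; multi-byte handling depends on the awk implementation.
preprocess() {
    awk '{
        out = ""
        sep = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == " ") c = "<space>"   # placeholder token for whitespace
            out = out sep c
            sep = " "                      # separate tokens with a single space
        }
        print out
    }'
}

# Example (hypothetical file names):
# preprocess < corpus.raw > corpus.split
```

With this sketch, the input line "ab cd" becomes "a b <space> c d", so the whitespace-separated tokens seen by ngramsymbols are individual characters plus the placeholder.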

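#!/bin/sh
# usage: buildlm.sh ngram_order corpus_filename
# Reads the preprocessed corpus from corpus_filename.split.

# N-gram order; defaults to 3 when the first argument is empty or missing.
NGRAM_SIZE=${1:-3}
echo "ng: $NGRAM_SIZE"

# Base name of the corpus; required.
FNAME=$2
if [ -z "$FNAME" ]; then
    echo "usage: buildlm.sh ngram_order corpus_filename" >&2
    exit 1
fi

set -x  # print each command before running it
set -e  # abort on the first failing command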

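# Build the symbol table from the corpus.
ngramsymbols < "$FNAME.split" > "$FNAME.syms"

# Compile each corpus line into an FST and store them all in a FAR archive.
farcompilestrings -symbols="$FNAME.syms" -keep_symbols=1 "$FNAME.split" > "$FNAME.far"

# Count n-grams up to the requested order, then smooth the counts into a model.
ngramcount --order="$NGRAM_SIZE" < "$FNAME.far" > "$FNAME.$NGRAM_SIZE.counts"
ngrammake "$FNAME.$NGRAM_SIZE.counts" > "$FNAME.$NGRAM_SIZE.smoothed.mod"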
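
Once the script has run, the model can be sanity-checked with other tools from the same OpenGRM distribution. A minimal sketch, assuming ngraminfo and ngramprint are installed alongside the tools used above; the function name inspect_model and the .arpa output name are illustrative and not part of the original script.

```shell
# Hypothetical helper: summarize a model built by buildlm.sh and dump it as ARPA text.
inspect_model() {
    model=$1
    # Bail out early if the OpenGRM tools are not on PATH.
    for tool in ngraminfo ngramprint; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            return 127
        fi
    done
    ngraminfo "$model"                          # print summary statistics
    ngramprint --ARPA "$model" > "$model.arpa"  # write the model in ARPA text format
}

# Example (file name follows the script's naming scheme for order 3):
# inspect_model corpus.3.smoothed.mod
```

For reference, running `sh buildlm.sh 3 corpus` with corpus.split present should leave corpus.syms, corpus.far, corpus.3.counts and corpus.3.smoothed.mod next to it.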
