Tutorials

Deployment of a Statistical Machine Translation (English & American Sign Language)


Hello!

In this tutorial, you will deploy a statistical machine translation system for the language pair English and American Sign Language (ASL) in written form.

If you want to cite my work in your research papers, please refer to this publication:

Achraf Othman, Mohamed Jemni, “Designing High Accuracy Statistical Machine Translation for Sign Language Using Parallel Corpus—Case Study English and American Sign Language”, Journal of Information Technology Research, Volume 12, Issue 2, 2019.

Here, we assume that you have already built the statistical machine translation tools (check my previous tutorials). If the build failed, you can download pre-built binaries of Moses and the related tools as follows:

    mkdir smt
    cd smt
    wget https://www.achrafothman.net/aslsmt/tools/smt-moses-ao-ubuntu-16.04.tgz
    tar -xzvf smt-moses-ao-ubuntu-16.04.tgz
    cd ubuntu-16.04/
    mv bin ../
    mv scripts ../
    mv training-tools ../
    cd ..
    rm -r ubuntu-16.04
    rm -r smt-moses-ao-ubuntu-16.04.tgz

Next, we create a folder named tools and copy the compiled GIZA++ binaries into it (to see how to build GIZA++, check my previous tutorial). We use the cp command to copy the binary files. I assume that I compiled everything in a folder named smt2, so we just need to run the following commands:

    mkdir tools
    cd tools
    cp ../../smt2/tools/GIZA++ .
    cp ../../smt2/tools/mkcls .
    cp ../../smt2/tools/snt2cooc.out .
    cd ..

For this tutorial, I prepared a small corpus for testing between English and American Sign Language. You can use any pair of languages. Run the commands below to download the files:

    mkdir corpus
    cd corpus
    wget http://www.achrafothman.net/aslsmt/corpus/corpus-mini.asl
    wget http://www.achrafothman.net/aslsmt/corpus/corpus-mini.en
    mv corpus-mini.asl corpus.asl
    mv corpus-mini.en corpus.en
    cd ..

Now, we run the tokenization step. Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token. Let’s take an example. Consider the string “This is a cat.”. After the tokenization step, we get the following: [‘This’, ‘is’, ‘a’, ‘cat’, ‘.’]. For our corpus, we run:

    /root/smt/scripts/tokenizer/tokenizer.perl -l en < /root/smt/corpus/corpus.asl > /root/smt/corpus/corpus.tok.asl
    /root/smt/scripts/tokenizer/tokenizer.perl -l en < /root/smt/corpus/corpus.en > /root/smt/corpus/corpus.tok.en
    # to check output files
    cd corpus
    tail corpus.tok.en
    cd ..
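The splitting that tokenizer.perl performs can be sketched in a few lines of Python (a simplified illustration; the real script also handles abbreviations, URLs, and language-specific rules):

```python
import re

def tokenize(text):
    # Put a space before sentence punctuation so it becomes its own token,
    # then split on whitespace (a rough sketch of what tokenizer.perl does).
    text = re.sub(r"([.,!?])", r" \1", text)
    return text.split()

print(tokenize("This is a cat."))  # ['This', 'is', 'a', 'cat', '.']
```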

The next step is Truecasing. Truecasing is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. This commonly comes up due to the standard practice (in English and many other languages) of automatically capitalizing the first word of a sentence. It can also arise in a badly cased or non-cased text (for example, all-lowercase or all-uppercase text messages). To do that, we run the following commands:

    /root/smt/scripts/recaser/train-truecaser.perl --model /root/smt/corpus/truecase-model.en --corpus /root/smt/corpus/corpus.tok.en
    /root/smt/scripts/recaser/train-truecaser.perl --model /root/smt/corpus/truecase-model.asl --corpus /root/smt/corpus/corpus.tok.asl
    /root/smt/scripts/recaser/truecase.perl --model /root/smt/corpus/truecase-model.en < /root/smt/corpus/corpus.tok.en > /root/smt/corpus/corpus.true.en
    /root/smt/scripts/recaser/truecase.perl --model /root/smt/corpus/truecase-model.asl < /root/smt/corpus/corpus.tok.asl > /root/smt/corpus/corpus.true.asl
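Roughly, train-truecaser.perl learns the most frequent casing of each word, and truecase.perl recases sentence-initial tokens accordingly. A toy Python sketch of that idea (not the actual scripts, which handle many more cases):

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    # Count the casings seen for each word, keyed by its lowercase form.
    counts = defaultdict(Counter)
    for sent in sentences:
        # Skip the first token: its capitalization is positional, not lexical.
        for tok in sent.split()[1:]:
            counts[tok.lower()][tok] += 1
    # Keep the most frequent surface form of each word.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def truecase(sentence, model):
    toks = sentence.split()
    # Recase only the sentence-initial token, as the default behavior does.
    if toks:
        toks[0] = model.get(toks[0].lower(), toks[0].lower())
    return " ".join(toks)

model = train_truecaser(["I saw John yesterday .", "Yesterday John saw me ."])
print(truecase("John is here .", model))  # John is here .
```

Here “John” keeps its capital because the model saw it capitalized mid-sentence, while a sentence starting with an ordinary word like “Yesterday” would be lowercased.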

For cleaning, we run the script clean-corpus-n.perl. It is a small script that cleans up a parallel corpus so that it works well with the training script. It removes empty lines, strips redundant space characters, and drops sentence pairs outside a given length range (here, 1 to 22 tokens):

    /root/smt/scripts/training/clean-corpus-n.perl /root/smt/corpus/corpus.true en asl /root/smt/corpus/corpus.clean 1 22

Now, it is time to build the language model (LM). It is used to ensure fluent output, so it is built with the target language (i.e., ASL in this case, since we translate from English to ASL). The KenLM documentation gives a full explanation of the command-line options, but the following will build an appropriate 3-gram language model:

    /root/smt/bin/lmplz -o 3 < /root/smt/corpus/corpus.true.asl > /root/smt/corpus/corpus.arpa.asl
    # Then we binarise the *.arpa.asl file (for faster loading) using KenLM:
    /root/smt/bin/build_binary /root/smt/corpus/corpus.arpa.asl /root/smt/corpus/corpus.blm.asl

We can test the language model by querying it as follows:

    echo "NAME X-YOU WHAT" | /root/smt/bin/query /root/smt/corpus/corpus.blm.asl
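The counts behind such a 3-gram model can be illustrated with a toy Python sketch; note that lmplz additionally applies modified Kneser-Ney smoothing and backoff, which this maximum-likelihood version ignores:

```python
from collections import Counter

def train_trigram_counts(sentences):
    # Count trigrams and their bigram histories over sentences padded with
    # <s> and </s> markers, as an n-gram LM toolkit does.
    tri, bi = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(len(toks) - 2):
            tri[tuple(toks[i:i+3])] += 1
            bi[tuple(toks[i:i+2])] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3):
    # Maximum-likelihood estimate of P(w3 | w1, w2); real toolkits smooth
    # these counts so that unseen trigrams do not get zero probability.
    return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

tri, bi = train_trigram_counts(["NAME X-YOU WHAT", "NAME X-ME JOHN"])
print(trigram_prob(tri, bi, "<s>", "<s>", "NAME"))  # 1.0
```

The two training sentences here are hypothetical ASL gloss lines, stand-ins for the corpus.true.asl file.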

Finally, we proceed to training the translation model. To do this, we run word alignment (using GIZA++), phrase extraction and scoring, create lexicalized reordering tables, and create our Moses configuration file, all with a single command. I recommend that you create an appropriate directory as follows, and then run the training command, catching logs:

    mkdir working
    cd working
    nohup nice /root/smt/scripts/training/train-model.perl -root-dir train -corpus /root/smt/corpus/corpus.clean -f en -e asl -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/root/smt/corpus/corpus.blm.asl:8 -external-bin-dir /root/smt/tools >& training.out &
    tail -f training.out 
    # once the line starting with "(9) create moses.ini @..." appears, you can type CTRL+C to exit the tail mode.
    cd ..
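To see what the word-alignment stage is doing, here is a minimal Python sketch of IBM Model 1, the first of the alignment models GIZA++ trains (GIZA++ continues through the HMM model and Models 3 and 4; this toy EM loop, run on hypothetical example pairs, shows only the core idea):

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    # Minimal EM training of IBM Model 1 lexical translation
    # probabilities t(f|e); only the first stage of what GIZA++ computes.
    t = defaultdict(lambda: 1.0)  # start from a uniform (unnormalised) table
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        for src, tgt in pairs:
            src_toks, tgt_toks = src.split(), tgt.split()
            for f in tgt_toks:
                z = sum(t[(f, e)] for e in src_toks)  # normalisation term
                for e in src_toks:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]  # re-estimate t(f|e)
    return t

# Hypothetical two-pair English / ASL-gloss corpus:
pairs = [("what your name", "WHAT X-YOU NAME"), ("your name", "X-YOU NAME")]
t = ibm_model1(pairs)
```

After a few EM iterations the table concentrates on co-occurring pairs, e.g. the gloss WHAT ends up aligned most strongly to the English word "what".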

Now comes the slowest part of the process, tuning, so we might want to line up something to read while it is progressing. Tuning requires a small amount of parallel data, separate from the training data, so again we’ll download some. Run the following commands to download the data and put it in a sensible place.

    cd corpus
    wget http://www.achrafothman.net/aslsmt/corpus/corpus-tuning.asl
    wget http://www.achrafothman.net/aslsmt/corpus/corpus-tuning.en
    # Tokenization
    /root/smt/scripts/tokenizer/tokenizer.perl -l en < /root/smt/corpus/corpus-tuning.asl > /root/smt/corpus/corpus-tuning.tok.asl
    /root/smt/scripts/tokenizer/tokenizer.perl -l en < /root/smt/corpus/corpus-tuning.en > /root/smt/corpus/corpus-tuning.tok.en
    # Truecasing
    /root/smt/scripts/recaser/truecase.perl --model /root/smt/corpus/truecase-model.asl < /root/smt/corpus/corpus-tuning.tok.asl > /root/smt/corpus/corpus-tuning.true.asl
    /root/smt/scripts/recaser/truecase.perl --model /root/smt/corpus/truecase-model.en < /root/smt/corpus/corpus-tuning.tok.en > /root/smt/corpus/corpus-tuning.true.en
    # Tuning Process
    cd ..
    cd working
    nohup nice /root/smt/scripts/training/mert-moses.pl /root/smt/corpus/corpus-tuning.true.en /root/smt/corpus/corpus-tuning.true.asl /root/smt/bin/moses /root/smt/working/train/model/moses.ini --mertdir /root/smt/bin/ &> mert.out &
    tail -f mert.out
    # once the line starting with "Saving new config to: ./moses.ini Saved: ./moses.ini..." appears, you can type CTRL+C to exit the tail mode.
    cd ..

We can now run the statistical machine translation Moses 💪🏻 with:

    /root/smt/bin/moses -f /root/smt/working/mert-work/moses.ini
    # and type in your favorite English sentence e.g., "what is your name ?" to see the results. 
    # To exit the moses mode, type CTRL+C.

We’ll notice, though, that the decoder takes at least a couple of minutes to start up. To make it start quickly, we can binarise the phrase table and lexicalized reordering models. To do this, create a suitable directory and binarise the models as follows:

    cd working
    mkdir binarised-model
    /root/smt/bin/processPhraseTableMin -in train/model/phrase-table.gz -nscores 4 -out binarised-model/phrase-table
    /root/smt/bin/processLexicalTableMin -in train/model/reordering-table.wbe-msd-bidirectional-fe.gz -out binarised-model/reordering-table
    cp /root/smt/working/mert-work/moses.ini /root/smt/working/binarised-model/
    cd binarised-model/
    vim moses.ini
        # in the vim editor, press i to enter insert mode and make the following changes:
        # 1. Change PhraseDictionaryMemory to PhraseDictionaryCompact
        # 2. Set the path of the PhraseDictionaryCompact feature to: /root/smt/working/binarised-model/phrase-table.minphr
        # 3. Set the path of the LexicalReordering feature to: /root/smt/working/binarised-model/reordering-table
        # 4. Save moses.ini
        # to save and quit, press ESC, then type :wq! and press ENTER.

Loading and running a translation should now be much faster. To test the statistical machine translation again:

    cd ..
    cd ..
    /root/smt/bin/moses -f /root/smt/working/binarised-model/moses.ini
    # and type in your favorite English sentence e.g., "what is your name ?" to see the results.
    # To exit the moses mode, type CTRL+C.

At this stage, we are probably wondering how good the translation system is. To measure this, we use another parallel data set (the test set), distinct from the ones we’ve used so far. We download a manually created test corpus and, as before, first tokenize and truecase it. The model we’ve trained can then be filtered for this test set, meaning that we only retain the entries needed to translate it. This makes the translation a lot faster. We test the decoder by first translating the test set, then running the BLEU script on it:

    cd corpus
    wget http://www.achrafothman.net/aslsmt/corpus/corpus-bleu.asl
    wget http://www.achrafothman.net/aslsmt/corpus/corpus-bleu.en
    # Tokenization
    /root/smt/scripts/tokenizer/tokenizer.perl -l en < /root/smt/corpus/corpus-bleu.asl > /root/smt/corpus/corpus-bleu.tok.asl
    /root/smt/scripts/tokenizer/tokenizer.perl -l en < /root/smt/corpus/corpus-bleu.en > /root/smt/corpus/corpus-bleu.tok.en
    # Truecasing
    /root/smt/scripts/recaser/truecase.perl --model /root/smt/corpus/truecase-model.asl < /root/smt/corpus/corpus-bleu.tok.asl > /root/smt/corpus/corpus-bleu.true.asl
    /root/smt/scripts/recaser/truecase.perl --model /root/smt/corpus/truecase-model.en < /root/smt/corpus/corpus-bleu.tok.en > /root/smt/corpus/corpus-bleu.true.en
    cd ..
    # Filter the model for the test set, then translate it
    cd working
    /root/smt/scripts/training/filter-model-given-input.pl filtered-corpus-mini mert-work/moses.ini /root/smt/corpus/corpus-bleu.true.en -Binarizer /root/smt/bin/processPhraseTableMin
    nohup nice /root/smt/bin/moses -f /root/smt/working/filtered-corpus-mini/moses.ini < /root/smt/corpus/corpus-bleu.true.en > /root/smt/working/corpus.translated.asl 2> /root/smt/working/corpus.translated.out
    # See the log
    tail -f /root/smt/working/corpus.translated.out
    cd ..

To calculate the BLEU score:

    /root/smt/scripts/generic/multi-bleu.perl -lc /root/smt/corpus/corpus-bleu.true.asl < /root/smt/working/corpus.translated.asl
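What multi-bleu.perl computes can be sketched as follows (a simplified single-sentence version; the real script accumulates n-gram statistics over the whole test set before taking the geometric mean):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    # Geometric mean of modified 1..4-gram precisions, times a brevity
    # penalty, for one candidate/reference pair.
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        precisions.append(overlap / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any zero precision zeroes the geometric mean
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("NAME X-YOU WHAT QUESTION", "NAME X-YOU WHAT QUESTION"))  # 1.0
```

Because the score is a geometric mean of the 1- to 4-gram precisions, a tiny test set with no matching 4-grams yields exactly 0.0, which is why very small corpora give uninformative BLEU scores.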

Thanks for following this long tutorial 🙂

2 Comments

  1. Thank you for preparing and sharing this fantastic tutorial. I got a BLEU score of 0.0. Is it because of the size of the corpus, or did I make a mistake in the process? Thanks.

    • Hello, thank you for your feedback. In fact, the corpora used here are too small to calculate an accurate BLEU score. I invite you to prepare a bigger corpus, along with another manually translated corpus, in order to calculate a meaningful BLEU score. At this stage, there isn’t a benchmark for testing between ASL and English.
