Machine translation - OpenNMT-py low BLEU scores for translators to German


I've trained OpenNMT-py models from English to German and from Italian to German on Europarl, and I got low BLEU scores: 8.13 for English -> German and 4.79 for Italian -> German.

As I'm no expert in NNs (yet), I adopted the default configurations provided by the library. Training for 13 epochs took approximately 20 hours in both cases. In both cases I used 80% of the dataset for training, 10% for validation, and 10% for testing.

Below are the commands I used for creating the Italian -> German model; I used a similar sequence of commands for the other model. Can anybody give me advice on how to improve the effectiveness of my models?

    # $ wc -l europarl.de-it.de
    # 1832052 europarl.de-it.de

    # 80% of the lines for training
    head -n 1465640 europarl.de-it.de > train_de-it.de
    head -n 1465640 europarl.de-it.it > train_de-it.it

    # next 10% for validation
    tail -n 366412 europarl.de-it.de | head -n 183206 > dev_de-it.de
    tail -n 366412 europarl.de-it.it | head -n 183206 > dev_de-it.it

    # last 10% for testing
    tail -n 183206 europarl.de-it.de > test_de-it.de
    tail -n 183206 europarl.de-it.it > test_de-it.it

    # tokenize the German side
    perl tokenizer.perl -a -no-escape -l de < ../data/train_de-it.de > ../data/train_de-it.atok.de
    perl tokenizer.perl -a -no-escape -l de < ../data/dev_de-it.de > ../data/dev_de-it.atok.de
    perl tokenizer.perl -a -no-escape -l de < ../data/test_de-it.de > ../data/test_de-it.atok.de

    # tokenize the Italian side
    perl tokenizer.perl -a -no-escape -l it < ../data/train_de-it.it > ../data/train_de-it.atok.it
    perl tokenizer.perl -a -no-escape -l it < ../data/dev_de-it.it > ../data/dev_de-it.atok.it
    perl tokenizer.perl -a -no-escape -l it < ../data/test_de-it.it > ../data/test_de-it.atok.it

    python3 preprocess.py \
        -train_src ../data/train_de-it.atok.it \
        -train_tgt ../data/train_de-it.atok.de \
        -valid_src ../data/dev_de-it.atok.it \
        -valid_tgt ../data/dev_de-it.atok.de \
        -save_data ../data/europarl_de_it.atok.low \
        -lower

    python3 train.py \
        -data ../data/europarl_de_it.atok.low.train.pt \
        -save_model ../models_en_de/europarl_it_de_models \
        -gpus 0
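The line counts passed to head and tail above can be double-checked with a few lines of Python. This is only a sketch: the function name `split_sizes` is made up for illustration, and it assumes the two 10% held-out sets were rounded up with the remainder going to training, which is what reproduces the numbers in the commands.

```python
def split_sizes(total_lines, parts=10):
    """Return (train, dev, test) line counts for an 80/10/10-style split.

    Assumption: dev and test each get ceil(total_lines / parts) lines,
    and the training set takes whatever is left.
    """
    held_out = -(-total_lines // parts)  # ceiling division on integers
    train = total_lines - 2 * held_out
    return train, held_out, held_out


train, dev, test = split_sizes(1832052)
print(train, dev, test)  # 1465640 183206 183206
```

The three parts sum to the full corpus, so no sentence pair is dropped or shared between the sets.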

You can get a lot of hints from the "Training Romance Multi-Way model" and "Training English-German WMT15 NMT engine" tutorials. The main idea is to run BPE tokenization on a concatenated XXYY training corpus and then tokenize the training corpora with the learned BPE models.

Byte Pair Encoding tokenization should be beneficial for German because of its compounding: the algorithm helps to segment words into subword units. The trick is that you need to train the BPE model on a single training corpus containing both source and target. See Jean Senellart's comment:

The BPE model should be trained on the training corpus only - and ideally, you train one single model for source and target, so that the model learns to translate identical word fragments from source to target. So I would concatenate the source and target training corpus - tokenize it once, learn a BPE model on this single corpus, and then use it for tokenization of the test/valid/train corpus in source and target.
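To make the joint-BPE idea concrete, here is a toy pure-Python sketch. It is not the tooling OpenNMT ships; the function names (`learn_bpe`, `merge_pair`, `apply_bpe`), the `</w>` end-of-word marker, and the tiny example sentences are all assumptions made for illustration. It learns merge operations on the concatenation of source and target training lines and then segments words from either side with the same merges.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(lines, num_merges):
    """Learn up to `num_merges` BPE merges from whitespace-tokenized lines."""
    vocab = Counter()
    for line in lines:
        for word in line.split():
            vocab[tuple(word) + ("</w>",)] += 1  # characters plus end-of-word marker

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order (simplified)."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical, already tokenized and lowercased training lines.
src_lines = ["il parlamento europeo", "la politica del parlamento"]
tgt_lines = ["das europäische parlament", "die politik des parlaments"]

# One single model, learned on source and target together.
merges = learn_bpe(src_lines + tgt_lines, num_merges=50)

print(apply_bpe("parlamentsdebatte", merges))
print(apply_bpe("parlamento", merges))
```

Because the merges come from one shared corpus, fragments that appear in both languages get the same subword units on both sides, which is the point of the quoted advice. The merges must be learned on training data only and then applied unchanged to the validation and test sets.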

Another idea is to tokenize with -case_feature. It is a good idea for all languages where letters can have different case. See Jean's comment:

In general, using -case_feature is a good idea for almost all languages (with case) - and it shows good performance in dealing with, and rendering in the target, case variation in the source (for instance all uppercase/lowercase, or capitalized words, ...).
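As a rough illustration of what a case feature carries, the sketch below lowercases each token and attaches a label describing its original casing. The label set (`L`, `U`, `C`, `M`, `N`), the `|` separator, and the function names are assumptions for this example and may not match the exact output of the OpenNMT tokenizer.

```python
def case_feature(token):
    """Return (lowercased token, case label) for one token.

    Illustrative labels: L = lowercase, U = uppercase, C = capitalized,
    M = mixed case, N = no cased characters (digits, punctuation, ...).
    """
    if token.lower() == token.upper():
        label = "N"
    elif token.islower():
        label = "L"
    elif token.isupper():
        label = "U"
    elif token[0].isupper() and token[1:].islower():
        label = "C"
    else:
        label = "M"
    return token.lower(), label


def annotate(line, sep="|"):
    """Annotate every whitespace-separated token of a line with its case label."""
    return " ".join(f"{low}{sep}{label}" for low, label in map(case_feature, line.split()))


print(annotate("Das Haus der EU 2018"))  # das|C haus|C der|L eu|U 2018|N
```

Unlike plain lowercasing with -lower, which throws the casing away, this keeps the vocabulary small while still giving the model what it needs to restore capitalization in the output. That matters for German, where all nouns are capitalized.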

To improve MT quality, you might also try

  1. Getting more corpora (e.g. WMT16 corpora)
  2. Tuning using in-domain training
