Attention Is All You Need

Note

I adapted the code from this awesome PyTorch version. Please check it out as well.

Important

I am using Python 3.9 with TensorFlow 2.10, as this is the last TensorFlow version with GPU support on native Windows.

Steps

pip install -r requirements.txt
download.py downloads all the data (en-de file pairs from Europarl, Common Crawl and News Commentary) to the folder passed as an argument.
encode.py filters the data based on the arguments (origin, maximum length, etc.) and trains the BPE model, saving it to a file.
train.py runs the whole training pipeline; its logic reads top-down in the file. Everything else is managed by the Trainer from trainer.py (logging embeddings, checkpointing, etc.).
translate.py runs model inference and optionally evaluates it with sacrebleu using the Evaluator from evaluator.py.
docs contains notes with SVG drawings from the original repo and Markdown files explaining the choices I had to make when adapting the code from one framework to another.
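
To give a rough idea of what "training the BPE model" means in the encode.py step, here is a toy sketch of byte-pair-encoding merge learning. It is not the code used in this repo: the function name `learn_bpe`, the tie-breaking rule and the tiny word list are assumptions made purely for illustration.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

A real BPE model is trained on the filtered corpus and stores the learned merges in a file, which is what the encode.py step saves.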

The code itself is heavily commented and you can get a feel for how language models work by looking at the tests.

Overfitting on one sentence

Input sequence:

"I declare resumed the session of the European Parliament "
"adjourned on Friday 17 December 1999, and I would like "
"once again to wish you a happy new year in the hope that "
"you enjoyed a pleasant festive period."

This results in the following generated hypotheses (the top one should be the exact label for this sentence):

Top generated sequence:

('Ich erkläre die am Freitag, dem 17. Dezember unterbrochene Sitzungsperiode '
 'des Europäischen Parlaments für wiederaufgenommen, wünsche Ihnen nochmals '
 'alles Gute zum Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.')

All generated sequences in the beam (k=5) search:

[{'hypothesis': 'Ich die am Freitag, dem 17. Dezember unterbrochene '
                'Sitzungsperiode des Europäischen Parlaments für '
                'wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum '
                'Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.',
  'score': -3.3601136207580566},
 {'hypothesis': 'Ich erkläre die am Freitag, dem 17. Dezember unterbrochene '
                'Sitzungsperiode des Europäischen Parlaments für '
                'wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum '
                'Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.',
  'score': -1.4448045492172241},
 {'hypothesis': 'Ich Ich erkläre die am Freitag, dem 17. Dezember '
                'unterbrochene Sitzungsperiode des Europäischen Parlaments für '
                'wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum '
                'Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.',
  'score': -3.1513545513153076},
 {'hypothesis': 'Ich erkläre die die am Freitag, dem 17. Dezember '
                'unterbrochene Sitzungsperiode des Europäischen Parlaments für '
                'wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum '
                'Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.',
  'score': -3.3080737590789795},
 {'hypothesis': 'Ich erkläre erkläre die am Freitag, dem 17. Dezember '
                'unterbrochene Sitzungsperiode des Europäischen Parlaments für '
                'wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum '
                'Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.',
  'score': -3.3361663818359375}]

The scores are negative because they are log probabilities; the one closest to zero belongs to the top sequence.
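
As a small sketch of how such scores can be read, the snippet below picks the top hypothesis and converts its score back to a probability. It assumes the scores are plain sequence log probabilities, as stated above; the hypothesis strings are abbreviated placeholders and the variable names are not taken from the repo.

```python
import math

# Scores copied from the beam output above; hypothesis texts are abbreviated.
beam = [
    {"hypothesis": "Ich die am Freitag ...", "score": -3.3601136207580566},
    {"hypothesis": "Ich erkläre die am Freitag ...", "score": -1.4448045492172241},
    {"hypothesis": "Ich Ich erkläre die am Freitag ...", "score": -3.1513545513153076},
    {"hypothesis": "Ich erkläre die die am Freitag ...", "score": -3.3080737590789795},
    {"hypothesis": "Ich erkläre erkläre die am Freitag ...", "score": -3.3361663818359375},
]

# The highest log probability is the one closest to zero.
best = max(beam, key=lambda item: item["score"])

# exp() maps the log probability back to a probability between 0 and 1.
probability = math.exp(best["score"])

print(best["hypothesis"], round(probability, 3))
```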

As a sanity check, the BLEU score should be a perfect 100/100 in all cases:

INFO:root:13a tokenization, cased
INFO:root:BLEU = 100.00 100.0/100.0/100.0/100.0 (BP = 1.000 ratio = 1.000 hyp_len = 34 ref_len = 34)
INFO:root:13a tokenization, caseless
INFO:root:BLEU = 100.00 100.0/100.0/100.0/100.0 (BP = 1.000 ratio = 1.000 hyp_len = 34 ref_len = 34)
INFO:root:International tokenization, cased
INFO:root:BLEU = 100.00 100.0/100.0/100.0/100.0 (BP = 1.000 ratio = 1.000 hyp_len = 34 ref_len = 34)
INFO:root:International tokenization, caseless
INFO:root:BLEU = 100.00 100.0/100.0/100.0/100.0 (BP = 1.000 ratio = 1.000 hyp_len = 34 ref_len = 34)
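
To see why an exact match gives 100, here is a simplified, single-reference BLEU sketch in plain Python. It is not sacrebleu and not the Evaluator from evaluator.py: it assumes whitespace tokenization and no smoothing, and the function names are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified BLEU in [0, 100] for one hypothesis and one reference."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as in the reference.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty only punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled to 0..100.
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)


label = "Ich erkläre die am Freitag unterbrochene Sitzungsperiode für wiederaufgenommen"
perfect = simple_bleu(label, label)
```

When the hypothesis equals the reference, every n-gram precision is 1 and the brevity penalty is 1, which corresponds to the `100.0/100.0/100.0/100.0` and `BP = 1.000` fields in the log lines above.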

Embeddings

After training for a while, some interesting patterns arise in the embeddings. This project logs them so they can be explored in the Embedding Projector.

In the vocabulary shared between the encoder (English) and the decoder (German), some tokens end up close to each other in terms of cosine similarity:

British with britischen and nationaler

will and wollte (& konnte, mochte, wurde)

Bedenken (pondering) is closest to glaube (believe)

Entschließung (resolution) gets associated with completed

gesammelt (collected) maps to decision and Bestimmung (determination) as well as verstärkt (strengthened)

Change also gets associated with neuer (new)
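
For reference, the nearest-neighbour lookup behind associations like these can be sketched in a few lines. The 3-dimensional vectors below are invented for illustration only (the real embeddings come from the trained model and have many more dimensions), and the helper names `cosine` and `nearest` are assumptions, not functions from this repo.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(token, embeddings, k=2):
    """Return the k tokens most similar to `token`, best first."""
    query = embeddings[token]
    scored = [
        (other, cosine(query, vector))
        for other, vector in embeddings.items()
        if other != token
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, made up for this example; not values from the trained model.
toy_embeddings = {
    "British": [0.9, 0.1, 0.0],
    "britischen": [0.8, 0.2, 0.1],
    "nationaler": [0.7, 0.3, 0.2],
    "Change": [0.0, 0.2, 0.9],
}

neighbours = nearest("British", toy_embeddings)
```

The Embedding Projector performs this kind of lookup interactively when a token is selected.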
