[wip] Entmax loss #3

Open · wants to merge 26 commits into main

Conversation

bpopeters
Collaborator

This pull request adds support for entmax loss when training GPT models. The loss is selected through the --loss_function argument, which accepts the following values: 'cross_entropy' (default), 'entmax15', 'sparsemax', and 'entmax_bisect'. 'entmax15' and 'sparsemax' use an additional --entmax-topk argument, which sensibly defaults to 512. For 'entmax_bisect', the alpha can be set with --entmax-alpha (defaulting to 1.5) and the number of bisection iterations with --entmax-n-iter (defaulting to 30). Note that these flags currently work only for GPT models without pipeline parallelism: supporting other model types should be easy (although I doubt anyone is interested right now), but I don't know what would be required for pipeline parallelism.
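For reference, here is a minimal sketch of how the flag could be dispatched to a loss module, assuming the `entmax` package's `Entmax15Loss`, `SparsemaxLoss`, and `EntmaxBisectLoss` classes; the `build_loss_function` helper is hypothetical, and the exact keyword arguments (`k`, `alpha`, `n_iter`, `ignore_index`) are assumptions that may differ between entmax versions and from the actual code in this PR:

```python
# Hypothetical sketch of mapping --loss_function to a loss module.
# Assumes the `entmax` package (pip install entmax) exposes Entmax15Loss,
# SparsemaxLoss, and EntmaxBisectLoss; the k / alpha / n_iter / ignore_index
# keyword arguments are assumptions and may vary by entmax version.
import torch.nn as nn
from entmax import Entmax15Loss, SparsemaxLoss, EntmaxBisectLoss


def build_loss_function(args, ignore_index: int = -100) -> nn.Module:
    """Return a loss module over [tokens, vocab] logits based on CLI flags."""
    if args.loss_function == "cross_entropy":
        return nn.CrossEntropyLoss(ignore_index=ignore_index)
    if args.loss_function == "entmax15":
        # --entmax-topk restricts the support computation to the top-k logits.
        return Entmax15Loss(k=args.entmax_topk, ignore_index=ignore_index)
    if args.loss_function == "sparsemax":
        return SparsemaxLoss(k=args.entmax_topk, ignore_index=ignore_index)
    if args.loss_function == "entmax_bisect":
        # --entmax-alpha and --entmax-n-iter control the bisection-based mapping.
        return EntmaxBisectLoss(
            alpha=args.entmax_alpha,
            n_iter=args.entmax_n_iter,
            ignore_index=ignore_index,
        )
    raise ValueError(f"unknown loss function: {args.loss_function}")
```

With something like this in place, switching losses is just a matter of passing, e.g., `--loss_function entmax15 --entmax-topk 512` to the training script.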

I've run some quick tests with entmax15 on artemis with a very small (3-layer, 128-dim) model on {1, 2, 4} GPUs. Throughput is quite a bit worse than with cross-entropy, but I believe this is at least partially an artifact of how small the model was: the output layer and loss computation probably dominate the runtime in a way that they would not with a more reasonably sized model. However, my attempts to train bigger models have been unsuccessful because memory usage is shockingly high (not just with entmax loss, but also with cross-entropy).

Note also that entmax loss does not currently support sequence-parallel loss computation. I'm not sure whether this matters for our use case (i.e., scaling up to 1B-parameter models), but it shouldn't be difficult to implement if we need it.

Before merging, we should probably investigate these performance and memory issues further.
