Refactor qa #984

bmosaicml · 2024-02-20T21:54:37Z

This PR is stacked on top of the migration PR #936

It does 5 things:

1. Refactor CodeEval and QA tasks to have a shared superclass called InContextLearningGenerationTaskDataset.
2. Rename QA tasks to InContextLearningGenerationTaskWithAnswersDataset.
3. Introduce post-processing functionality shared between all generation tasks. Users can now write arbitrary post-processing functions and add them to a registry that is then accessible via config.
4. Implement 3 starter post-processing functions that had previously been hardcoded: early stopping, triviaqa-style normalization, and regex parsing.
5. Modify the QAAccuracy and CodeEval accuracy metrics to apply post-processing functions to the generations at update time.
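To illustrate the registry idea, here is a minimal sketch and not the code from this PR. Every name in it (`POST_PROCESSING_REGISTRY`, `register`, `build_pipeline`, the function names, and the config shape) is hypothetical, and the normalization only approximates triviaqa-style cleaning.

```python
import re
import string
from typing import Callable, Dict, List

# Hypothetical registry: maps a config-visible name to a post-processing function.
POST_PROCESSING_REGISTRY: Dict[str, Callable[..., str]] = {}


def register(name: str):
    """Decorator that adds a function to the registry under `name`."""
    def decorator(fn):
        POST_PROCESSING_REGISTRY[name] = fn
        return fn
    return decorator


@register("early_stopping")
def early_stopping(text: str, stop_sequences=("\n",)) -> str:
    """Truncate the generation at the earliest stop sequence, if any occurs."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


@register("triviaqa_style_normalization")
def triviaqa_style_normalization(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (approximation)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


@register("regex_parse")
def regex_parse(text: str, pattern: str = r"(-?\d+(?:\.\d+)?)") -> str:
    """Return the first capture group of `pattern`, or the text unchanged if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else text


def build_pipeline(config: List[dict]) -> Callable[[str], str]:
    """Build one callable from config entries shaped like {"name": ..., "kwargs": {...}}."""
    steps = []
    for entry in config:
        fn = POST_PROCESSING_REGISTRY[entry["name"]]
        kwargs = entry.get("kwargs", {})
        # Bind fn and kwargs as defaults so each step keeps its own values.
        steps.append(lambda text, fn=fn, kwargs=kwargs: fn(text, **kwargs))

    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text

    return run
```

With a config listing `early_stopping` followed by `triviaqa_style_normalization`, the generation `"The Eiffel Tower!\nQuestion: next"` is first cut at the newline and then normalized to `"eiffel tower"`.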

InContextLearningGenerationTaskDataset handles initialization of the post-processing functions from the config, and the metrics are then responsible for applying them to the outputs. This split is necessary because CodeEval receives many outputs per input, whereas QAAccuracy receives one.
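The reason for applying post-processing inside the metrics can be sketched as follows. This is an illustrative assumption, not the PR's implementation: `SingleOutputAccuracy`, `MultiOutputPassRate`, and the `passes` callback are hypothetical stand-ins for a QA-style metric (one generation per input) and a code-eval-style metric (several generations per input).

```python
from typing import Callable, List, Sequence

PostProcessor = Callable[[str], str]


def apply_all(text: str, fns: Sequence[PostProcessor]) -> str:
    """Apply each post-processing function in order."""
    for fn in fns:
        text = fn(text)
    return text


class SingleOutputAccuracy:
    """Hypothetical QA-style metric: one generation per input."""

    def __init__(self, post_processing_fns: Sequence[PostProcessor] = ()):
        self.fns = list(post_processing_fns)
        self.correct = 0
        self.total = 0

    def update(self, outputs: List[str], labels: List[List[str]]) -> None:
        for output, answers in zip(outputs, labels):
            cleaned = apply_all(output, self.fns)
            self.correct += int(cleaned in answers)
            self.total += 1

    def compute(self) -> float:
        return self.correct / self.total if self.total else 0.0


class MultiOutputPassRate:
    """Hypothetical code-eval-style metric: several generations per input."""

    def __init__(self, post_processing_fns: Sequence[PostProcessor] = ()):
        self.fns = list(post_processing_fns)
        self.correct = 0
        self.total = 0

    def update(self, outputs: List[List[str]], passes: Callable[[str], bool]) -> None:
        for generations in outputs:
            # Post-process every generation before checking it.
            cleaned = [apply_all(g, self.fns) for g in generations]
            # Count the input as solved if any cleaned generation passes.
            self.correct += int(any(passes(c) for c in cleaned))
            self.total += 1

    def compute(self) -> float:
        return self.correct / self.total if self.total else 0.0
```

Both metrics receive the same list of functions from the dataset, but each decides how to map them over the shape of output it gets, which is why the application step lives in `update` rather than in the dataset.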

This refactoring brings us more in line with Eleuther's eval harness, which allows specifying custom post-processing functions for generation tasks. They support arbitrary regex parsing, whereas we support arbitrary modifications in order to capture what is common to triviaqa normalization, early stopping, and regex parsing.

test: mcli logs mpt-eval-rTlNa9

Confirm all performance is identical to before.

| model_name      |   core_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |
|:----------------|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|
| mosaicml/mpt-7b |       0.343081 |          0.421662 |                0.256372 |                 0.634086 |                   0.155426 |                0.247861 |

| Category                 | Benchmark                    | Subtask                             |   Accuracy | Number few shot   | Model           |
|:-------------------------|:-----------------------------|:------------------------------------|-----------:|:------------------|:----------------|
| symbolic_problem_solving | gsm8k                        |                                     |  0.0871873 | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | copa                         |                                     |  0.8       | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | commonsense_qa               |                                     |  0.225225  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | piqa                         |                                     |  0.799238  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | bigbench_strange_stories     |                                     |  0.568965  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | bigbench_strategy_qa         |                                     |  0.561817  | 0-shot            | mosaicml/mpt-7b |
| language_understanding   | lambada_openai               |                                     |  0.702892  | 0-shot            | mosaicml/mpt-7b |
| language_understanding   | hellaswag                    |                                     |  0.761601  | 0-shot            | mosaicml/mpt-7b |
| reading_comprehension    | coqa                         |                                     |  0.453213  | 0-shot            | mosaicml/mpt-7b |
| reading_comprehension    | boolq                        |                                     |  0.747401  | 0-shot            | mosaicml/mpt-7b |
| world_knowledge          | triviaqa_sm_sub              |                                     |  0.493667  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | jeopardy                     | Average                             |  0.459835  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | american_history                    |  0.513317  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | literature                          |  0.557143  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | science                             |  0.386555  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | word_origins                        |  0.265753  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | world_history                       |  0.576407  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | bigbench_qa_wikidata         |                                     |  0.655824  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_easy                     |                                     |  0.718855  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_challenge                |                                     |  0.440273  | 3-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | siqa                         |                                     |  0.54913   | 3-shot            | mosaicml/mpt-7b |
| language_understanding   | winograd                     |                                     |  0.85348   | 3-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_operators           |                                     |  0.333333  | 3-shot            | mosaicml/mpt-7b |
| reading_comprehension    | squad                        |                                     |  0.553264  | 3-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | svamp                        |                                     |  0.32      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          | mmlu                         | Average                             |  0.281358  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | abstract_algebra                    |  0.26      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | anatomy                             |  0.303704  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | astronomy                           |  0.309211  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | business_ethics                     |  0.38      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | clinical_knowledge                  |  0.286792  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_biology                     |  0.291667  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_chemistry                   |  0.21      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_computer_science            |  0.25      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_mathematics                 |  0.31      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_medicine                    |  0.225434  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_physics                     |  0.215686  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | computer_security                   |  0.35      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | conceptual_physics                  |  0.289362  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | econometrics                        |  0.245614  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | electrical_engineering              |  0.324138  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | elementary_mathematics              |  0.272487  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | formal_logic                        |  0.222222  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | global_facts                        |  0.32      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_biology                 |  0.3       | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_chemistry               |  0.187192  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_computer_science        |  0.34      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_european_history        |  0.321212  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_geography               |  0.313131  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_government_and_politics |  0.264249  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_macroeconomics          |  0.266667  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_mathematics             |  0.211111  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_microeconomics          |  0.247899  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_physics                 |  0.291391  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_psychology              |  0.251376  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_statistics              |  0.208333  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_us_history              |  0.181373  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_world_history           |  0.253165  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | human_aging                         |  0.403587  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | human_sexuality                     |  0.259542  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | international_law                   |  0.347107  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | jurisprudence                       |  0.324074  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | logical_fallacies                   |  0.251534  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | machine_learning                    |  0.321429  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | management                          |  0.242718  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | marketing                           |  0.299145  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | medical_genetics                    |  0.22      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | miscellaneous                       |  0.301405  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | moral_disputes                      |  0.32659   | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | moral_scenarios                     |  0.259218  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | nutrition                           |  0.30719   | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | philosophy                          |  0.315113  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | prehistory                          |  0.302469  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_accounting             |  0.248227  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_law                    |  0.269231  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_medicine               |  0.198529  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_psychology             |  0.271242  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | public_relations                    |  0.381818  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | security_studies                    |  0.236735  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | sociology                           |  0.268657  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | us_foreign_policy                   |  0.36      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | virology                            |  0.349398  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | world_religions                     |  0.269006  | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_dyck_languages      |                                     |  0.304     | 5-shot            | mosaicml/mpt-7b |
| language_understanding   | winogrande                   |                                     |  0.722178  | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | agi_eval_lsat_ar             |                                     |  0.23913   | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | simple_arithmetic_nospaces   |                                     |  0.082     | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | simple_arithmetic_withspaces |                                     |  0.089     | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_lsat_rc             |                                     |  0.235075  | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_lsat_lr             |                                     |  0.247059  | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_sat_en              |                                     |  0.257282  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_challenge                |                                     |  0.4343    | 25-shot           | mosaicml/mpt-7b |
| commonsense_reasoning    | openbook_qa                  |                                     |  0.452     | 10-shot           | mosaicml/mpt-7b |
| language_understanding   | hellaswag                    |                                     |  0.765385  | 10-shot           | mosaicml/mpt-7b |
|                          | bigbench_cs_algorithms       |                                     |  0.480303  | 10-shot           | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_elementary_math_qa  |                                     |  0.281787  | 1-shot            | mosaicml/mpt-7b |

bmosaicml added 20 commits January 27, 2024 14:51

start

cd18e74

still need to migrate fixtures

1fffbad

Merge branch 'main' into migrate_subclasses_to_foundry

5a6e81c

wip onboarding tests

4aac81e

still workin'

946a4af

still wip

289ca55

maybe done; test out on mcli now

3696f8d

mcli

a20877d

remove calibration error

53da3ea

merge

16b8e32

migration

a90766e

migration

72ce793

Merge branch 'migrate_subclasses_to_foundry' of github.com:mosaicml/l…

667bdec

…lm-foundry into migrate_subclasses_to_foundry

full migration

ceff0c4

precommit

5bb06cc

fix

fe83828

fix pytests

b54a12b

refactor QA

71e8391

refactor generation tasks

0cafbab

refactor generation tasks

9099495

bmosaicml requested review from dakinggg, tbarton16, codestar12 and maxisawesome February 21, 2024 18:20

bmosaicml and others added 6 commits February 22, 2024 17:48

update

414153e

restore

a3f5a31

Merge branch 'main' into migrate_subclasses_to_foundry

820069a

add

4a1cd79

Merge branch 'migrate_subclasses_to_foundry' of github.com:mosaicml/l…

d265979

…lm-foundry into migrate_subclasses_to_foundry

Merge branch 'main' into migrate_subclasses_to_foundry

ddfd7b5

bmosaicml and others added 29 commits March 4, 2024 12:47

Merge branch 'main' into migrate_subclasses_to_foundry

2516c24

allow QA task name stil lfor backward compatibility

f213a40

Merge branch 'migrate_subclasses_to_foundry' of github.com:mosaicml/l…

35fd2f1

…lm-foundry into migrate_subclasses_to_foundry

fix

d570e5d

fix test

a5cd308

Merge branch 'main' into migrate_subclasses_to_foundry

0fb37cd

add generation length

901fc69

Merge branch 'migrate_subclasses_to_foundry' of github.com:mosaicml/l…

a313499

…lm-foundry into migrate_subclasses_to_foundry

remove max_new_tokens

df19c0d

fix cpu trsts

54bb4c7

Merge branch 'main' into migrate_subclasses_to_foundry

9ebeaa0

Merge branch 'main' into migrate_subclasses_to_foundry

ca9816c

try and fix lm eval test

b9d6cd1

Merge branch 'migrate_subclasses_to_foundry' of github.com:mosaicml/l…

691ab20

…lm-foundry into migrate_subclasses_to_foundry

temp disable lm task eval test

c207cd9

fix test?

c85813b

fix tet

08ef908

finish

aca0e63

fix

30fcedd

Merge branch 'main' into migrate_subclasses_to_foundry

59daa26

Update scripts/eval/README.md

4217a78

Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>

fix comments

6f597a9

Merge branch 'migrate_subclasses_to_foundry' of github.com:mosaicml/l…

8c6e622

…lm-foundry into migrate_subclasses_to_foundry

fix bug with seq len

f387a73

Merge branch 'main' into migrate_subclasses_to_foundry

cbfa3da

restore mcli

2f405d9

Merge branch 'migrate_subclasses_to_foundry' of github.com:mosaicml/l…

76e600a

…lm-foundry into migrate_subclasses_to_foundry

Merge branch 'main' into migrate_subclasses_to_foundry

898928e

merge

65962d7

dakinggg closed this Jul 30, 2024
