Fine-tuned NER models for banking and regulation texts, trained on eCFR Title 12 using manual annotations and few-shot annotations generated with GPT-3.5 (v3).
Please see the executive write-up for metrics and process details.
- Eric Phann (data, programming, modeling)
- Kristen Zhang (annotation, reporting, documentation)
- Yaxin Zhao (annotation, research, procedure)
- Sydney Kelly (annotation, future considerations)
- Jake Stallard (annotation, future considerations)
- corpuses folder (configs, .spacy binary files, etc. for each pipeline)
- data folder (few-shot, manual, and unlabeled data)
- models folder (best/last model for each type)
- milestones 2 & 3 folder (prior deliverables)
- spacy-llm folder (configs and assets for generating few-shot annotations)
- ecfr_ner_models.ipynb (step-by-step Colab notebook)
- write-up.pdf (executive summary; conclusions)
- requirements.txt (for reproducibility)
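The few-shot annotations in the spacy-llm folder are driven by a pipeline config. A minimal sketch of such a config is below; the entity labels, the examples file name, and the exact task/model versions are assumptions for illustration — the actual values live in the repo's spacy-llm folder.

```ini
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
# Hypothetical labels; the project's real label set is defined in its own config.
labels = ["AGENCY", "REGULATION", "FINANCIAL_TERM"]

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "examples.yml"

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v3"
```

With a config like this, `spacy_llm.util.assemble` builds an `nlp` pipeline whose `llm` component prompts GPT-3.5 with the few-shot examples and writes the returned spans onto each `Doc`.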
- Domain: Banking compliance and risk
- Possible applications: NER, text mining, classification, policy and regulatory analysis
- eCFR Title 12 (https://github.com/ericphann/dsba6188-group6-project/tree/main/data)
- Generate Entity Labels, Definitions, and Few-shot Data
- Train/Test a Model Using ecfr-few-shot.jsonl
- Compile Metrics and Review
- Label 100 examples from ecfr-unlabeled.jsonl
- Review Labels and Refine Annotation Guidelines
- Create a Final Test Dataset (ecfr-manual.jsonl)
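The labeling steps above produce Prodigy-style JSONL files (e.g. ecfr-manual.jsonl), where each record carries a `"text"` field and a list of `"spans"` with character offsets and a label. A small stdlib-only sketch for inspecting such a file's label distribution — the sample records and label names here are invented for illustration:

```python
import json
from collections import Counter

def label_counts(jsonl_lines):
    """Count entity labels across Prodigy-style JSONL annotation records."""
    counts = Counter()
    for line in jsonl_lines:
        record = json.loads(line)
        for span in record.get("spans", []):
            counts[span["label"]] += 1
    return counts

# Illustrative records; the real data lives in ecfr-manual.jsonl.
sample = [
    '{"text": "The OCC supervises national banks.", '
    '"spans": [{"start": 4, "end": 7, "label": "AGENCY"}]}',
    '{"text": "See 12 CFR 1026.", '
    '"spans": [{"start": 4, "end": 15, "label": "REGULATION"}]}',
]
print(label_counts(sample))
```

Reviewing these counts before training helps catch label imbalance early, which feeds directly into the "Review Labels and Refine Annotation Guidelines" step.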
- Model Development
- few-shot-model
- manual-model
- mixed-model
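Comparing the few-shot, manual, and mixed models comes down to entity-level precision, recall, and F1, where a predicted span only counts on an exact (start, end, label) match — the convention behind spaCy's `ents_f` score. A minimal sketch (the example spans are invented):

```python
def span_f1(gold, pred):
    """Entity-level precision/recall/F1 over (start, end, label) tuples.
    A prediction is a true positive only on an exact match with gold."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical gold vs. predicted spans: one exact match, one boundary miss.
gold = [(4, 7, "AGENCY"), (20, 35, "REGULATION")]
pred = [(4, 7, "AGENCY"), (20, 30, "REGULATION")]
print(span_f1(gold, pred))
```

Running the same scorer over each model's output on the shared ecfr-manual.jsonl test set makes the three-way comparison direct.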
- Refine Annotation Guidelines
- Expand Dataset
- Fine-tuning with Prodigy and spaCy
- Chunking Data
- Data Privacy and Security
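On the chunking consideration: eCFR sections can run far longer than a model's comfortable input window, so one option is splitting text into overlapping windows that break on whitespace, reducing the chance an entity is cut in half. A stdlib-only sketch; the window and overlap sizes are placeholder values, not tuned for this project:

```python
def chunk_text(text, max_len=200, overlap=40):
    """Split long regulation text into overlapping character chunks,
    backing up to whitespace so entity spans are less likely to be severed."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # Back up to the last whitespace inside the window, if any.
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        # Overlap consecutive chunks so entities near a boundary
        # appear whole in at least one chunk.
        start = max(end - overlap, start + 1)
    return chunks

chunks = chunk_text("word " * 100)
print(len(chunks), [len(c) for c in chunks])
```

Overlapping chunks duplicate some annotations, so downstream deduplication (or offset remapping back into the original document) would be part of the same workstream.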