diff --git a/README.md b/README.md index 7021a8a..b4895e0 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,3 @@ # CPSC 477/577 Website -Course website for CPSC 477/577 Natural Language Processing Spring 2024 at Yale University +Course website for CPSC 477/577 Natural Language Processing Spring 2025 at Yale University diff --git a/_config.yml b/_config.yml index d130d73..3811c9b 100644 --- a/_config.yml +++ b/_config.yml @@ -6,7 +6,7 @@ name: CPSC 477/577 title: Natural Language Processing email: arman.cohan@yale.edu description: > - Natural Language Processing - Yale University - Spring 2024 + Natural Language Processing - Yale University - Spring 2025 footer_text: > Powered by Jekyll with al-folio
Theme based on CMU Deep RL. @@ -17,7 +17,7 @@ baseurl: # the subpath of your site, e.g. /blog/ # ----------------------------------------------------------------------------- # Social integration # ----------------------------------------------------------------------------- -github_username: armancohan +github_username: google_analytics: youtube_channel: rss: # notes rss diff --git a/_data/lectures.yml b/_data/lectures.yml index e69de29..75b5347 100644 --- a/_data/lectures.yml +++ b/_data/lectures.yml @@ -0,0 +1,288 @@ +- date: Tue 01/14/25 + lecturer: + - Arman + title: + - Course Introduction + - Logistics + slides: + readings: + optional: + logistics: + +- date: Thu 01/16/25 + lecturer: + title: + - Word embeddings and vector semantics + slides: + readings: + optional: + logistics: + +- date: Tue 01/21/25 + lecturer: + title: + - Word embeddings and vector semantics (cont.) + slides: + readings: + optional: + logistics: + +- date: Thu 01/23/25 + lecturer: + title: + - Basics of Neural Networks and Language Model Training + slides: + readings: + - The Matrix Calculus You Need For Deep Learning (Terence Parr and Jeremy Howard) [link] + - Little book of deep learning (François Fleuret) - Ch 3 + optional: + logistics: + +- date: Tue 01/28/25 + lecturer: + - Arman + title: + - Autograd + - Building blocks of Neural Networks + - Convolutional layers + - Network layers and optimizers + slides: + readings: + - Little book of deep learning (François Fleuret) - Ch 4 + optional: + logistics: + +- date: Thu 01/30/25 + lecturer: + - Arman + title: + - Building blocks of Neural Networks for NLP + - Task-specific neural network architectures + - RNNs + slides: + - https://yaleedu-my.sharepoint.com/:b:/g/personal/arman_cohan_yale_edu/ERiCgJVHJoxMreRomSuVlHkBc2IfZPv7K6JRV7JfsSW5OQ?e=1HGdDK + readings: + - Goldberg Chapter 9 + optional: + logistics: + +- date: Tue 02/04/25 + lecturer: + - Arman + title: + - RNNs (contd.) + - Machine translation + slides: + readings: + - Understanding LSTM Networks (Christopher Olah) [link] + - Eisenstein, Chapter 18 + optional: + - Neural Machine Translation and Sequence-to-sequence Models- A Tutorial (Graham Neubig) [link] + logistics: + +- date: Thu 02/06/25 + lecturer: + title: + - Machine translation (contd.) + - Attention + - Transformers + slides: + readings: + - Statistical Machine Translation (Koehn) [link] + - Neural Machine Translation and Sequence-to-sequence Models- A Tutorial (Graham Neubig) [link] + - Learning to Align and Translate with Attention (Bahdanau et al., 2015) [link] + - Luong et al. (2015) Effective Approaches to Attention-based Neural Machine Translation [link] + - Attention is All You Need (Vaswani et al., 2017) [link] + - Illustrated Transformer [link] + logistics: + +- date: Tue 02/11/25 + lecturer: + title: + - Machine translation (contd.) + - Attention + - Transformers + slides: + readings: + - Statistical Machine Translation (Koehn) [link] + - Neural Machine Translation and Sequence-to-sequence Models- A Tutorial (Graham Neubig) [link] + - Learning to Align and Translate with Attention (Bahdanau et al., 2015) [link] + - Luong et al. (2015) Effective Approaches to Attention-based Neural Machine Translation [link] + - Attention is All You Need (Vaswani et al., 2017) [link] + - Illustrated Transformer [link] + optional: + +- date: Thu 02/13/25 + lecturer: + - Arman + title: + - Transformers (cont'd.)
+ - Language modeling with Transformers + slides: + readings: + - Illustrated Transformer [link] + - Attention is All You Need (Vaswani et al., 2017) [link] + - The Annotated Transformer (Harvard NLP) [link] + - GPT-2 (Radford et al., 2019) [link] + optional: + logistics: + +- date: Tue 02/18/25 + lecturer: + - Arman + title: + - Pre-training and transfer learning + - Objective functions for pre-training + - Model architectures + - ELMo, BERT, GPT, T5 + slides: + readings: + - The Illustrated BERT, ELMo, and co. (Jay Alammar) [link] + - BERT- Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018) [link] + - GPT-2 (Radford et al., 2019) [link] + optional: + logistics: + +- date: Thu 02/20/25 + lecturer: + - Arman + title: + - Transfer learning (contd.) + - Encoder-decoder pretrained models + - Architecture and pretraining objectives + slides: + readings: + - T5- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2020) [link] + - BART- Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (Lewis et al., 2019) [link] + - What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? (Wang et al., 2022) [link] + optional: + logistics: + +- date: Tue 02/25/25 + lecturer: + - Arman + title: > + Midterm Exam 1 + +- date: Thu 02/27/25 + lecturer: + - Arman + title: + - Decoding and generation + - Large language models and impact of scale + - In-context learning and prompting + slides: + readings: + - The Curious Case of Neural Text Degeneration (Holtzman et al., 2019) [link] + - How to generate text- using different decoding methods for language generation with Transformers [link] + - Scaling Laws for Neural Language Models (Kaplan et al., 2020) [link] + - Training Compute-Optimal Large Language Models (Hoffmann et al., 2022) [link] + - GPT3 paper - Language Models are Few-Shot Learners (Brown et al., 2020) [link] + optional: + logistics: + +- date: Tue 03/04/25 + lecturer: Arman + title: + - In-context learning and prompting (cont'd) + - Improving instruction following and few-shot learning + slides: + readings: + - Few-Shot Learning with Language Models (Brown et al., 2020) [link] + - Finetuned Language Models Are Zero-Shot Learners (Wei et al., 2022) [link] + - Multitask Prompted Training Enables Zero-Shot Task Generalization (Sanh et al., 2021) [link] + - Scaling Instruction-Finetuned Language Models (Chung et al., 2022) [link] + - Are Emergent Abilities of Large Language Models a Mirage?
(Schaeffer et al., 2023) [link] + - Emergent Abilities of Large Language Models (Wei et al., 2022) [link] + logistics: + +- date: 03/07/25 - 03/24/25 + title: > + Spring recess - No classes + +- date: Tue 03/25/25 + title: + - Post-training + - Reinforcement Learning from Human Feedback + - Alignment + slides: + readings: + - Training language models to follow instructions with human feedback (Ouyang et al., 2022) [link] + - Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019) [link] + - Direct Preference Optimization- Your Language Model is Secretly a Reward Model (Rafailov et al., 2023) [link] + - RLAIF- Scaling Reinforcement Learning from Human Feedback with AI Feedback (Lee et al., 2023) [link] + optional: + +- date: Thu 03/27/25 + title: + - Post-training (cont'd) + slides: + readings: + optional: + +- date: Tue 04/01/25 + lecturer: + - Arman + title: > + Midterm Exam 2 + +- date: Thu 04/03/25 + lecturer: + title: + - Evaluation + slides: + readings: + optional: + logistics: + +- date: Tue 04/08/25 + lecturer: + title: + - Parameter-efficient Fine-Tuning + slides: + readings: + optional: + logistics: + +- date: Thu 04/10/25 + lecturer: + title: + - Safety + - Noncompliance + slides: + readings: + optional: + logistics: + +- date: Tue 04/15/25 + lecturer: + title: + - Agent-based systems + slides: + readings: + optional: + logistics: + +- date: Thu 04/17/25 + guest: + - name: TBD + title: + slides: + readings: + optional: + +- date: Tue 04/22/25 + guest: + - name: TBD + title: + slides: + readings: + optional: + +- date: Thu 04/24/25 + guest: + - name: TBD + title: + slides: + readings: + optional: diff --git a/_data/projects.yml b/_data/projects.yml index e69de29..f5bff15 100644 --- a/_data/projects.yml +++ b/_data/projects.yml @@ -0,0 +1,189 @@ +- title: "Training and Benchmarking Neural Machine Translation Models" + members: "Ethan Mathieu, Shankara Abbineni" + abstract: "In this project, we ask two questions: what are the gains to fine-tuning general language models on translation; and can general language models, when fine-tuned, perform better on translation tasks than a model trained solely for translation. As such, we train the DeLighT transformer model for English-to-French translation and compare its BLEU performance to other neural machine translation models which we fine-tune. We find that fine-tuned general language models can perform better than language-specific models. Additionally, we build a NextJS web application to allow end users to experiment with the different models and view their performance." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/abbinenishankara_116308_9500712_CPSC477_Final_Report.pdf?csf=1&web=1&e=ahF1g0" + +- title: "Improved Protein Function Prediction by Combining Persistent Cohomology and ProteinBERT Embeddings" + members: "Anna Su, Jason Apostol" + abstract: "Understanding the molecular function of proteins is extremely important in elucidating their biological mechanisms and in engineering new therapeutics. We present a protein function classifier combining features from both sequence and structure, through embeddings generated by a pretrained ProteinBERT model trained on ~100M proteins supplemented with structural embeddings generated on a molecular function-specific implementation of PersLay trained on our smaller target dataset of ~6,000 human protein structures.
We show that supplementing the sequence embeddings with structural embeddings improves classifier accuracy by approximately 4 % by using a relatively small number of parameters, and demonstrate that the H_1 homology group is the most important for performance. This work has applications to drug discovery, elucidation of biological pathways, and protein engineering, as it provides a high-fidelity estimate of the role of a protein in a biological system." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/apostoljason_148729_9500993_NLP_Final_Project%20(2).pdf?csf=1&web=1&e=cVnHaD" + +- title: "Biomedical Lay Summarization" + members: "Xincheng Cai, Mengmeng Du" + abstract: "Biomedical research articles contain vital information for a wide audience, yet their complex language and specialized terminology often hinder comprehension for non-experts. Inspired by the BIONLP 2024 workshop, we propose a NLP solution to generate lay summaries, which are more readable to diverse audiences. We implemented two transformer-based models, specifically BART and BART-PubMed. Our study investigates the performance of these models across different biomedical topics and explores methods to improve summarization quality through definition retrieval from Webster Medical Dictionary. By enhancing the readability of biomedical publications, our work aims to promote knowledge accessibility to scientific information." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/caixincheng_192392_9496589_CPSC577_Final_Project.pdf?csf=1&web=1&e=6qHdOP" + +- title: "Advancing AI Safety in LLMs through Dynamic Multi-Agent Debates" + members: "Vincent Li, Anna Zhang, Lindsay Chen" + abstract: "The safety and security of large language models (LLMs) has garnered significant attention with the advent of multi-agent frameworks. Our research expands on methodologies proposed in 'Combating Adversarial Attacks with Multi-Agent Debate'(1) by introducing dynamic role allocation and diversifying agent capabilities within multi-agent frameworks. These enhancements address key limitations, including static role allocation and agent homogeneity, which limit the adaptability of debates in uncovering adversarial strategies. Our proposed framework incorporates dynamic roles such as proposer, opposer, questioner, and mediator, alongside enhanced agent capabilities that allow for nuanced exploration of adversarial dialogues. The framework is implemented and trained using state-of-the-art LLMs and evaluated on existing datasets, demonstrating its effectiveness in identifying and mitigating adversarial threats in LLMs. This innovative approach advances AI safety by fostering more robust and versatile multi-agent interactions, contributing to secure and reliable LLM applications." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/chenlindsay_108121_9496997_CPSC477_Final_Report.pdf?csf=1&web=1&e=Jt3YIA" + +- title: "Transfer Learning is All You Need for Sentiment Analysis" + members: "Minyi Chen, Zishun Zhou, Bowen Duanmu" + abstract: "Transfer learning is a crucial technique that helps us learn from external sources, thus improving model performance on small datasets. 
In this paper, we work on Twitter Sentiment Datasets with three categories: Neutral, Positive, and Negative, using models like Bert and Gemma, and explore the impact of transfer learning on classification performance. We experimented with various data preprocessing strategies, such as removing stop words and special characters like emojis. We pre-trained our model on different datasets with similar or different tasks. During fine-tuning, we tried various freeze strategies as well. Our best results get 93.5% accuracy, 93.1 % recall, and 93.4 % F1 score in test set. Experimental results indicate that the performance of transfer learning is influenced by various factors, including the model, dataset relationships, and freeze strategies." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/chenminyi_194162_9501120_NeurIPS_2023__Copy_%20(1).pdf?csf=1&web=1&e=KSrOgh" + +- title: "Deciphering Clinical Trial Reports: A Novel NLP Task and Corpus for Evidence Inference" + members: "Xinyi Di, Chengxi Wang, Yun Yang" + abstract: "In healthcare, accurate assessment of treatment efficacy is crucial but hindered by the complex and voluminous nature of clinical trial reports. Traditional methods fall short, highlighting the need for advanced automated solutions. Our research addresses this challenge by developing NLP models that utilize sophisticated attention mechanisms to improve the extraction and synthesis of evidence from these reports. By incorporating LoRA, we enhance the fine-tuning efficiency of our models, making large language models more accessible and effective. We evaluate our approach by comparing the performance of a BERT-based baseline model with advanced models constructed using BioBERT and ClinicalBert. This study not only advances the field of NLP in healthcare but also has the potential to revolutionize the way clinical evidence is processed, hence enhancing patient care." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/dixinyi_168124_9488283_NLP_Final_Report.pdf?csf=1&web=1&e=WEPIz7" + +- title: "Llama3-8-Bing A sarcastic language model learns from Chandler Bing" + members: "Yuntian Liu, Zihan Dong" + abstract: "In this project, we explored various large language models and fine tuning or alignment techniques to classify and generate sarcasm dialogues. We adopted generative AI to boost the sarcasm study and trained a sarcastic chatbot based on llama3-8B model that learned from Chandler Bing." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/dongzihan_106584_9500895_CPSC577_SP24_Final_project.pdf?csf=1&web=1&e=cZFG6n" + +- title: "Biomedical Document Summarization Models (BDSM)" + members: "Lleyton Emery, Diego Aspinwall" + abstract: "In this writeup, we address the natural language processing task of BioLaySumm, which aims to generate layperson-friendly summaries of biomedical research articles. We implemented and compared various summarization approaches, including extractive and abstractive summarization models, large language models, and ensemble models. Through systematic evaluation of the relevance, readability, and factuality of summaries, we sought to identify the most effective summarization strategies for this domain, ultimately contributing to the advancement of health literacy and informed decision-making in the biomedical field." 
+ link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/emerylleyton_127205_9500990_Emery_Aspinwall_CPSC477_Final_Project_Report.pdf?csf=1&web=1&e=LYIdOF" + +- title: "Models Understand Models: Predicting Unknown from What We Know" + members: "Kaiyuan Guan" + abstract: "Our research builds upon these existing frameworks with the aim of linking the assessment of abstract capabilities to the model's performance on problem sets with established ground truths. We propose a computing economical, easy to use and interpretable method to diagnose the inherent ability of any given LLMs, by leveraging the advantages of linear probes and the discoveries in self-consistency. If we want to build a model that is better than humans, it is crucial to know what leads to failure. Similar to a variety of research, we start with a crucial discovery: language models can produce well-calibrated predictions for token probabilities on-distribution (Guo et al. (2017)). Based on this, we train an MLP model based on the activations of the last token in CoT answers by the LLM, which is elicited by our compound strategy that digging the potential of few-shot, reasoning and model's intuition." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/guankaiyuan_192577_9500874_CPSC_477_Final_Report.pdf?csf=1&web=1&e=vxHdB3" + +- title: "Predicting Primary Sub-Categories of Statistics arXiv Papers" + members: "Ali Aldous, Eugene Han, Elder Veliz" + abstract: We investigate the application of natural language processing techniques for automatic classification and category moderation within the arXiv repository, specifically for classifying statistics papers by primary sub-category using their titles and abstracts. Previous work has demonstrated the efficacy of fine-tuning BERT-based models for classifying arXiv papers but only on balanced datasets with broad categories such as biology and physics or distinct sub-categories under subjects other than statistics. Using a dataset of 60,648 arXiv papers within the statistics category, we experiment with TF-IDF embeddings combined with a Linear Support Vector Classifier, SPECTER2 embeddings combined with Logistic Regression, and RoBERTa models, extending past research to include imbalanced sub-categories with significant content overlap. Our results show that while fine-tuning RoBERTa substantially increases performance on unseen paper titles and abstracts, it underperforms compared to other baselines which may highlight potential shortcomings with this approach. Comprehensive details on the source code are available in the GitHub repository ehan03/arxiv-stat-nlp. Instructions for setup are provided to facilitate replication and verification by other researchers." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/haneugene_146694_9488292_report.pdf?csf=1&web=1&e=GuAEvQ" + +- title: "Synthetic Data for Cross-Domain Uncertainty Analysis" + members: "Stephanie Hu" + abstract: "In many real-world applications of machine learning, obtaining labeled data in sufficient quantities can be a challenging and resource-intensive task. This project adapts a common approach in image generation and processing to address this issue. 
I design and train a Conditional Generative Adversarial Network (CGAN) for synthetic labeled data generation in a domain distinct from that of the training data. Unfortunately, my results show that the CGAN model architecture and finetuning methods I chose to use are not capable of generating high-quality synthetic data. They are not recognizably English and perform no better than random bag-of-words sampling. Furthermore, it appears that the conditional label has limited weight in the generator model, suggesting my model was unable to extract aspect-level features. I end by positing the limitations of my approach and suggesting further experimentation for model improvement." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/hustephanie_109044_9491677_combinepdf.pdf?csf=1&web=1&e=Edbyf9" + +- title: "Reimplementation of Topic Modeling with Wasserstein Autoencoders" + members: "Aryaan Khan, Yuhang Cui, Raymond Lee" + abstract: "This project re-implements topic modeling with Wasserstein auto-encoders (WAE) from 2 , which have much faster training time compared to traditional topic matching using LDA, and allows direct Dirichlet distribution matching without Gaussian approximation compared to variational auto-encoders (VAE). Re-implementing WAE in the more popular PyTorch framework allows easier integration and better usability, and we will also be verifying the original papers claims by comparing the performance of WAE against LDA. Our implementation of WAE confirmed the original papers results that WAE can have performance on par or better than LDA, and have much faster training time. This project verifies the results of the original WAE paper and provides a PyTorch implementation for future use, which is available on GitHub" + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/khanaryaan_128257_9490100_Topic_Modeling_with_Wasserstein_Autoencoders-combined_1.pdf?csf=1&web=1&e=J1W6BO" + +- title: "Discrimination Risks in LLMs" + members: "Conrad Lee, Irine Juliet Otieno" + abstract: "We explore the risks of discrimination and bias in LLMs with a short survey paper and an experiment. We find that biases in LLMs have their roots in many sources, not just in training corpora. We categorize different manifestations and targets of LLM biases as well as types of debiasing solutions. We support these findings through direct experimentation following the procedure of Dhamala et al (2021). We validate that racial biases exist within the BERT LLM, particularly towards the African American population." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/leeconrad_142837_9500189_CPSC_477_Final_Project_Lee_Otieno.pdf?csf=1&web=1&e=wVKyET" + +- title: "Advancing Author-Specific Language Generation with a Custom Generative Pre-trained Transformer" + members: "Emilia Liu, Jingjia Meng" + abstract: "This project advances natural language processing by developing a custom Generative Pre-trained Transformer (GPT) model designed specifically to emulate Ernest Hemingway's distinctive writing style. Utilizing 'The First Forty-Nine Stories' as the training corpus, the model leverages a multi-head attention mechanism, inspired by the paper 'Attention Is All You Need'. 
The model's performance was evaluated using various metrics, including ROUGE, METEOR, and BERT scores, to assess its efficiency in style mimicry compared to traditional language models. Results indicated that while the model can generate text with lexical diversity and sentence complexity akin to Hemingway's style, challenges remain in capturing the full spectrum of his stylistic essence. Future work will focus on optimizing model architecture and training processes to enhance the fidelity of generated text to Hemingway's style. This approach not only demonstrates the capabilities of GPT models in personalized language modeling but also opens avenues for future research into author-specific language generation. Such developments hold significant promise for applications in digital humanities, authorial style emulation, and beyond. Code is available at: https://github.com/jjmeng08/CPSC_Project.git" + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/liuemilia_167819_9491967_CPSC_final_report.pdf?csf=1&web=1&e=VfpmNm" + +- title: "Instruction Tuning to Improve Multi-Document Processing Capabilities of LLMs" + members: "Gabrielle Kaili-May Liu, Richard Luo" + abstract: "Multi-document pre-training objectives are a strong approach to boosting LLM performance on downstream multi-document tasks. Yet such approaches tend to be less general and scalable to broader model types and sizes. Additionally complicating multi-document task capabilities is the 'lost-in-the-middle' phenomenon, whereby the performance of long-context language models decreases significantly when relevant information is located in the middle of the context as opposed to the beginning or end. Recent work suggests instruction tuning as a scalable method for enabling automatic instruction generation/following in LLMs. In this project we therefore leverage such an approach to confer LLMs with improved long-context multi-document capabilities in a more scalable way. Preliminary results demonstrate promise for our proposed approach in the 0-shot setting. Our code is available at https://github.com/pybeebee/577_final_project."
+ link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/liukaili_195424_9498068_577_Final_Report.pdf?csf=1&web=1&e=PDfWcj" + +- title: "Multimodal ClinicalEDBERT: Predicting Hospital Admissions with MIMIC-IV ED Database" + members: "Yufei Deng, Yihan Liu, Hang Shi" + abstract: + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/liuyihan_169918_9492645_CPSC_577_Final_project_report.pdf?csf=1&web=1&e=kBn3p8" + +- title: "Advancements in NLP for Autonomous Robotic Systems through Transformer Models and Neural Networks" + members: "Liam Merz Hoffmeister, Stephen Miner" + abstract: + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/minerstephen_195487_9500534_Final_Project_Report.pdf?csf=1&web=1&e=RPoUf9" + +- title: "Mixture-of-Experts Transformers: A Survey" + members: "Yizheng (Jerry) Shi, Ginny Xiao, Abhisar Mittal, Ardavan (Harry) Abiri" + abstract: + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/mittalabhisar_147890_9498554_Final%20Project.pdf?csf=1&web=1&e=rnQhtB" + +- title: "Multimodal Training of Transformers" + members: "Leo deJong, Reese Johnson, Siva Nalabothu" + abstract: "The transformer architecture by Vaswani et al. (2017) has replaced RNN-LSTM based approaches in state-of-the-art language modeling. In particular, decoder-only generative large language models between the range of 7 billion and 300 billion parameters (or more than 1 trillion for some mixture of experts models) have shown remarkable performance in mimicking human conversational abilities on a wide range of topics. One of the important directions for continuous improvement of these transformer models is natively adding support for multimodal input and output. To this end, this paper reviews state-of-the-art approaches in multimodality today and presents results related to replicating Lewkowycz et al.'s (2022) attempt to train language models to solve quantitative problems. Our code is available at https://github.com/snalabothu/multimodal-training-of-transformers." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/nalabothusiva_146996_9490849_final_proj.pdf?csf=1&web=1&e=35plWt" + +- title: "Contextual Embeddings for Sentiment Classification Accuracy" + members: "Carl Viyar, Christopher Nathan" + abstract: "Our project seeks to explore the benefits of contextual word embeddings for the task of sentiment classification. We base our approach on the findings of the 2017 paper 'Learned in Translation: Contextualized Word Vectors' by McCann et. al. that uses ELMo embeddings to achieve a 6.8 % increase in sentiment classification accuracy. In this paper, we compare performance using BERT-derived token embeddings with baseline performance using GloVe embeddings, finding that BERT-derived embeddings demonstrated a similar increase in improvement." 
+ link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/nathanchristopher_103084_9489343_ViyarNathan_CPSC477_FinalProject.pdf?csf=1&web=1&e=XBAF6x" + +- title: "Knowledge Distillation From Gemini to Mistral for Earnings Call Transcript Summarization" + members: "Rohan Phanse, Joonhee Park" + abstract: "Earnings call transcripts are invaluable to investors because they contain insights that can lead to profitable investments and optimal decision-making. However, these calls are often lengthy, making it difficult for investors to quickly identify key insights from them. Prior work with applying large language models to financial document summarization partly addresses this need, but still struggles to identify the most important information that should be included in summaries. In this project, we approach this challenge by finetuning Mistral 7B-Instruct upon an augmented version of Mukherjee et al.'s ECTSum benchmark, in which we replaced the bullet-point summaries in ECTSum with longer summaries. We used Gemini Pro to create this augmented ECTSum dataset and developed a quality ranking system to select the augmented summaries that best aligned with the information in ECTSum's bullet-point summaries. We then performed knowledge distillation by finetuning Mistral 7B-Instruct on the augmented dataset to align it with Gemini's outputs. After finetuning, we observed improvements in ROUGE performance across the board and an increase in ability to recall important statistics from the transcripts. " + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/parkjoonhee_125890_9501197_CPSC_477_NLP_Final_Project_Report.pdf?csf=1&web=1&e=XfzUNF" + +- title: "Domain-Specific Value Alignment of Large Language Models" + members: "Kevin Chan, Paul Lin, Jonah Sparling, Rami Pellumbi" + abstract: "Large Language Models (LLMs) are increasingly being used in a wide range of applications, and organizations may wish to align these models with specific sets of values for particular use cases. In this project, we explore the feasibility of aligning an open-source LLM, Mistral, to create an educational chatbot suitable for young children. We generate a dataset of child-appropriate and inappropriate prompt-completion pairs using ChatGPT-4 and Claude-3 models. We then employ three approaches to align Mistral: 1) a prompt-based method using a safety prefix, 2) supervised fine-tuning using the LoRA technique, and 3) applying control vectors to model activations during inference. To evaluate the safety of the model outputs, we use GPT-4 to perform automatic evaluation on the prompt-completion pairs. Our results show that while the prompt-based approach and fine-tuning significantly improve the model's ability to provide child-appropriate responses, the control vector method underperforms in comparison. This work demonstrates the feasibility of aligning LLMs for specific use cases and highlights the importance of carefully evaluating alignment approaches to ensure they meet the desired safety criteria. 
" + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/pellumbirami_170007_9490546_CPSC_477_Final_Report.pdf?csf=1&web=1&e=O0aAj3" + +- title: "Genre, Period, and Provenience Classification in Akkadian Cuneiform Documents" + members: "Avital Romach" + abstract: "Many under resourced languages still lag behind the promise that large language models have to offer. It is inefficient to train large language models on small amounts of data, and there are no clear guidelines or recommendations for transferability between languages and scripts [1,2]. For ancient and dead languages, such as Akkadian, written in the cuneiform script, the challenges are more pronounced as Akkadian texts are interpreted through the bias of modern scholars. Particularly, there are several levels of interpretation which can be used as input for machine learning models. This paper presents the first attempt to perform genre, period, and provenience classification in Akkadian cuneiform documents, while trying to assess the issues discussed above. I use two baseline models, Naive Bayes and Logistic Regression, and three BERT models fine-tuned to this task. Each model and classification task was trained and tested on four versions of the same Akkadian texts: lemmatized, normalized (phonetically reconstructed), segmented Unicode cuneiform, and unsegmented Unicode cuneiform. The best performing models for each classification task are multilingual BERT with normalization for genre (96 % weighted F1), Arabic BERT with segmented Unicode cuneiform for period ( 97 % ), and multilingual BERT with normalization for provenience ( 93 % ). I further assess how preprocessing and tokenization methods effect the models' accuracies, how modern editorial practices potentially contribute bias to certain identifications, and the specific difficulties in each type of classification task. The code is available in a GitHub repository." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/romachavital_166881_9500094_Akkadian_Classification_CPSC_577_Final_Project.pdf?csf=1&web=1&e=ky2dRF" + +- title: "Noise and Nuance: Impact of Input Noise on Translation Accuracy in Transformer Models" + members: "Andrew Pan, Nandan Sarkar, Aditya Kulkarni" + abstract: + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/sarkarnandan_169554_9492845_CPSC477__NLP_Project_Writeup.pdf?csf=1&web=1&e=QjXwea" + +- title: "DPO Can Reveal Latent Hallucinations" + members: "Nathan Shan" + abstract: "Hallucinations in large language models, defined as convincing yet incorrect outputs, are a critical challenge that can arise during instruction tuning. Techniques adjacent to reinforcement learning from human feedback (RLHF) methods like Direct Preference Optimization (DPO) can theoretically reduce model tendencies to hallucinate by incorporating human preferences for appropriate certainty and accuracy. We empirically analyze DPO by investigating logit differences of model checkpoints in the TÜLU 2 family (Ivison et al., 2023) to better understand the effect of DPO on hallucinations. We introduce the concept of 'latent hallucinations' and demonstrate their prevalence in models tuned with DPO, suggesting that current alignment methods may not adequately capture human preferences for uncertainty over inaccuracy. 
Additionally, we show that DPO fails to fulfil its theoretical capabilities of reducing hallucinations expressed by IT models. We also bring attention to limitations of popular benchmarks in detecting hallucinations Our findings highlight the need for improved evaluation methods and understanding of alignment techniques to reduce hallucinations in LLMs. Our code is posted at https://github.com/nshan144/DPO Hallucination" + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/shannathan_194701_9500806_CPSC%20477%20Project%20Paper.pdf?csf=1&web=1&e=jYEOPW" + +- title: "Enhancing Text Summarization of Biomedical Journal Papers with Domain-Specific Knowledge Integration in State-of-the-Art NLP Models" + members: "Nilay Bhatt, Tom Shin, Luning Yang" + abstract: "The exponential growth of biomedical literature over the past decade has not just created a need, but a pressing need for efficient summarization tools. These tools are crucial for researchers to stay informed about recent developments in their field. As the volume and complexity of scientific papers increase, automated summarization has become indispensable for researchers aiming to distill key information rapidly. Although modern Natural Language Processing (NLP) models like BERT and GPT have shown promising results in text summarization, they often need help to fully capture the nuances and domain-specific language inherent in biomedical texts. This results in summaries that lack accuracy or comprehensiveness, posing a significant challenge for researchers. To address these challenges, this project not only leverages state-of-the-art NLP models, including BART, T5, BioGPT, and LED but also supplements them with domain-specific biomedical knowledge. This unique approach is designed to enhance the summarization quality of biomedical journal papers. By integrating specialized knowledge with these advanced models, we aim to not just improve the accuracy and conciseness of summaries but also make them contextually relevant. This will enable researchers to navigate the rapidly expanding scientific literature more effectively. Our experimental design involves in-domain and cross-domain summarization tasks to rigorously assess and refine our models. Ultimately, our goal is to establish new benchmarks for summarization in this specialized field, a significant step towards advancing biomedical literature summarization." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/shintom_191654_9500686_NLP_Final_Report.pdf?csf=1&web=1&e=tgXG9x" + +- title: "Retrieval augmented generation to improve text summarization of biomedical research articles" + members: "Andrew Ton, Yuxuan Cheng" + abstract: + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/tonandrew_125453_9488155_NeurIPS_2023%20(4).pdf?csf=1&web=1&e=LZaFVf" + +- title: "Multi-Modal Data Augmentation for Radiology Report Generation" + members: "Andrew Tran, Haroon Mohamedali, Howard Dai" + abstract: "We approach the Stanford AIMI Radiology Report Generation challenge via finetuning RadFM, a pretrained and instruction tuned model on a variety of radiology tasks. The task is to write accurate and useful radiology reports given sets of x-ray images. 
Considering the relatively small scale of radiology datasets compared to general image-text datasets, our method involves the creation of synthetic data to supplement training data by directly interpolating between images and also text via GPT-3.5. We make our dataset generation, model training, and inference code publicly accessible on GitHub." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/tranandrew_171733_9499892_Multi-Modal%20Data%20Aug%20For%20Radiology-1.pdf?csf=1&web=1&e=rRWzT7" + +- title: "Decoder-only Cognate Prediction" + members: "Lasse van den Berg, Adnan Bseisu" + abstract: "This project introduces a decoder-only transformer-based approach to cognate prediction, leveraging its capacity to handle sequential data and capture linguistic patterns. Cognates, words across different languages with shared origins, provide significant insights into language history and evolution. However, their prediction is challenging, primarily due to the subtle phonetic and semantic shifts over long periods of time. Our method employs a decoder-only architecture typically used in generative tasks. We adapt this architecture for the task of predicting a cognate in one language, given its cognate pair in a related language. We train our model on a dataset of Romance language and Germanic language cognates. The results demonstrate non-trivial performance with interesting generalization patterns." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/vandenberglasse_171607_9501025_Final_Project_Lasse_van_den_Berg_Adnan_Bseisu.pdf?csf=1&web=1&e=wv6ao1" + +- title: "Enhancing Text Classification with GraphSAGE" + members: "Shurui Wang, Lang Ding, Weiyi You" + abstract: "This project explores the enhancement of text classification through the novel application of TextGraphSAGE, a graph-based neural network model that integrates textual and relational data. By constructing text graphs at a granular level with nodes representing individual words or phrases and edges reflecting adjacency, we aim to capture both local and global textual contexts more effectively. The project compares the performance of our TextGraphSAGE model with conventional deep learning models like CNNs and LSTMs, as well as another graph-based method, TextGCN, across two datasets: Reuters R8 and Twitter Asian Prejudice. Our results indicate that TextGraphSAGE outperforms the baseline models, demonstrating its potential to leverage relational information for superior text classification accuracy and efficiency. Our findings affirm the potential of graph-based methods in advancing text classification tasks. Code is available at https://github.com/JadenWSR/TextGraphSAGE." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/wangshurui_168607_9491945_CPSC477_Final_Report.pdf?csf=1&web=1&e=y3tDvV" + +- title: "Evaluating the Efficacy of Two LLM Defenses Against Adversarial Prompting" + members: "Feranmi Oluwadairo, Liam Varela, Bryan Wee" + abstract: "Large Language Models (LLMs) are increasingly utilised in various user-facing settings. This opens them up to possible adversarial prompts that can be used to generate unaligned responses, jeopardising the safety of the system. 
This paper investigates the efficacy of two LLM defenses, SmoothLLM and Erase-and-Check, by utilising Greedy Coordinate Gradient attack and PAIR attacks to adversarially prompt both a Llama2-7b model and gpt-3.5-turbo. Our results reiterate the ongoing threat that GCG attacks continue to pose to LLMs, and also explore the feasibility of using SmoothLLM as a plug-and-play defense for closed source models." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/weebryan_166691_9500298_CPSC477_Final_Report.pdf?csf=1&web=1&e=Om2CgF" + +- title: "Data Augmentation for Machine-Generated Text Detection" + members: "June Yoo, Helen Zhou" + abstract: "Machine-Generated Text (MGT) Detection is the problem of classifying the author of a body of text as being either a human or a machine. With the advent of larger LLMs, this task has become increasingly difficult due to the risk of zero-day attacks. In this project, we test if we can increase robustness to unseen authors by fine-tuning a PLM (RoBERTa) against various Authorship Obfuscation (AO) methods for MGT detection in English, and investigate if using a mixture of these models can create a model which is robust to unseen data. Finally, we apply this to SemEval-2024 Task 8 subtask A, which deals with monolingual binary black-box machine-generated text detection." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/yoojune_146352_9501104_477_Report.pdf?csf=1&web=1&e=T4J41J" + +- title: "Multimodal NLP for Patent Documents" + members: "Katherine He, Bill Qian, Aaron Yu" + abstract: "Millions of patent applications are submitted every year, each one containing both extensive text and visual content. Our research integrated techniques from Natural Language Processing (NLP) and Computer Vision (CV) to improve the experience of referencing and understanding patent documents with a multimodal approach that reflects the nature and complexity of the documents. In this paper, we investigated the current capabilities of large vision language models in NLP tasks and finetuned an existing open source model on U.S. patent databases, using a multimodal dataset with a combination of text and image data that we constructed ourselves. We found that using multimodal approaches for patent classification outperforms both text-only and image-only approaches, in addition to demonstrating the gains in performance that can be achieved by finetuning on multimodal patent data. Our work seeks to improve the efficiency of patent work for inventors, patent attorneys, examiners, and the public, but also lay groundwork for future advancements in multimodal analysis techniques and applications in traditional expert domains such as law and intellectual property." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/yuaaron_143270_9501060_CS477_Final_Project_Report%20(2).pdf?csf=1&web=1&e=7b9mzT" + +- title: "HanScripter: Unveiling the Wisdom of Classical Chinese with Llama" + members: "Ke Lyu, Abbey Yuan, Kai Gao" + abstract: "The HanScripter project presents a specialized translation model aimed at accurately translating classical Chinese texts into English. Leveraging Meta's LLaMA 3 architecture, the model was fine-tuned using Quantized Low-Rank Adaptation (QLoRA) instruction tuning and parallel instruction-output datasets.
This methodology involved constructing a comprehensive dataset containing parallel corpora and developing instruction-tuning templates for nuanced translation. Evaluation metrics, including sacreBLEU, chrF, METEOR, and BERTScore, were employed to assess translation quality. The results indicate that the HanScripter model significantly improves translation accuracy and preserves the meaning of classical Chinese texts, offering a robust framework for bridging the linguistic gap between classical and modern languages." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/yuanabbey_195314_9498956_NLP_FinalReport_final.pdf?csf=1&web=1&e=4zqgtZ" + +- title: "In Sync with Stories: A Book Recommendation System Aligning with Reader Preferences" + members: "Runqiu Zhang" + abstract: "Our study introduces 'In Sync with Stories,' a recommendation system designed to match books with readers' narrative tastes. The system uses Latent Dirichlet Allocation (LDA) to analyze introductory book texts, aligning them with user-provided inputs. By understanding readers' preferences, this genre-aware system recommends books that closely resonate with the requested themes. Utilizing a dataset of science fiction titles, the system identifies topics and matches introductions with user preferences through KL-divergence scoring. The results are promising, with the recommendations accurately reflecting relevant narrative genres. While the current implementation relies on LDA, future work will explore integrating deep learning models to enhance accuracy. GitHub Link is here:" + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/zhangrunqiu_167927_9500207_NLP_Final_Version.pdf?csf=1&web=1&e=CchmSD" + +- title: "Adapting Transformer Model for Live EN to CN Translation" + members: "Jason Zheng, Kenny Li" + abstract: "In this paper, we construct the framework for a machine translation model designed to facilitate real-time, bidirectional communication between English and Chinese speakers, specifically addressing the unique linguistic challenges faced by multigenerational immigrant families. Our approach leverages an encoder-decoder architecture enhanced with multi-head self-attention mechanisms to ensure that both linguistic accuracy and cultural nuances are preserved in translations. By potentially integrating this model with a user-friendly interface in the future, we aim to provide a practical tool for immediate language translation, thereby reducing the emotional and practical challenges associated with language barriers within these communities."
+ link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/zhengjason_148859_9500662_CPSC_477_Final_Project_Paper.pdf?csf=1&web=1&e=6W47ez" + +- title: "Unpacking Large Language Model's Performance on Quantitative Understanding:NumEval @ SemEval - 2024" + members: "Jielan Helen Zheng, Xiaomeng Miranda Zhu" + abstract: + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/zhumiranda_192081_9500880_Zheng_Zhu_577_Final_Report.pdf?csf=1&web=1&e=sWowtN" + +- title: "Supreme Court Verdict Prediction with LLMs" + members: "Zachary Zitzewitz, Raja Moreno, Tom Sutter" + abstract: "In this paper, we investigate several approaches to applying language models to predicting United States Supreme Court case outcomes. We use GPT-2 combined with a classification head to perform text classification on the facts and legal question of the case. We also use LLaMA-2 to perform open-ended generation based on the facts of the case that we then classify to make a final prediction. Despite testing several language models and architectures, we were unable to attain accuracy scores better than human evaluation, task-specific architectures, or even simple heuristics. However, fine-tuning LLaMA-2 on our dataset led to improved accuracy scores. We conclude that language models may not be natively well-suited to predicting Supreme Court outcomes, but that fine-tuning on tailored datasets can improve their capabilities in this task." + link: "https://yaleedu-my.sharepoint.com/:b:/r/personal/arman_cohan_yale_edu/Documents/courses/cpsc477-sp24/project-reports/zitzewitzzachary_172383_9497403_CPSC%20477%20Final%20Project.pdf?csf=1&web=1&e=b8uPBx" diff --git a/_pages/about.md b/_pages/about.md index 024ee1d..b468843 100644 --- a/_pages/about.md +++ b/_pages/about.md @@ -2,41 +2,59 @@ layout: about permalink: / title: Natural Language Processing -description: CPSC 477/577 • Spring 2024 • Yale University +description: CPSC 477/577 • Spring 2025 • Yale University logo: yale-logo.png news: true --- Welcome to CPSC 477/577! -This course provides a deep dive into Natural Language Processing (NLP), a pivotal and dynamic subfield of Artificial Intelligence (AI) that focuses on the interaction between computers and human language. +This course provides a deep dive into modern Natural Language Processing (NLP), focusing on deep learning approaches and their applications. +The curriculum spans both foundational concepts and cutting-edge developments in the field including Large Language Models (LLMs). +The course begins with core neural network concepts in NLP, covering word embeddings, sequence modeling, and attention mechanisms. Students will gain a strong understanding of these building blocks while learning their practical implementations. +Building on these foundations, we explore transformer architectures and their evolution, including landmark models like BERT, GPT, and T5. The course examines how these models enable sophisticated language understanding and generation through pre-training and transfer learning. +The latter portion covers contemporary advances: LLMs, parameter-efficient fine-tuning, and efficiency techniques. We'll analyze the capabilities and limitations of current systems while discussing emerging research directions. 
+Through lectures, hands-on assignments, and projects, students will gain both theoretical understanding and practical experience implementing modern NLP systems. The course emphasizes reproducible experimentation and real-world applications. -The course begins by exploring the fundamental principles of NLP, providing a solid grounding in how natural language is processed and understood by machines. Students will first explore the traditional methods of NLP, and study the classic NLP tasks as well as understanding their historical significance and foundational role. These methods, based on statistical and machine learning approaches, lay the groundwork for understanding how machines interpret language. +**Prerequisites:** -Transitioning to modern NLP, the course delves into the revolutionary impact of deep learning and neural networks. Here, students will learn about representation learning methods, including word representations and sentence representations. Then the course dives into the foundations of language modeling and self-supervised learning in NLP. Specifically, we will discuss sequence-to-sequence models, transformers, and transfer learning, including models like GPT (Generative Pre-trained Transformer), BERT (Bidirectional Encoder Representations from Transformers), and T5. These models have transformed the landscape of NLP by enabling more general language understanding and generation capabilities. We then transition into contemporary topics in NLP including LLMs, parameter-efficient fine-tuning, efficiency, and incorporating other modalities. +Intro to ML or Intro to AI is required. -Through a blend of lectures, hands-on projects and assignments, and case studies, students will gain practical experience in both traditional and modern NLP techniques. The goal of the course is to introduce the students to the field and provide them with a comprehensive overview of fundamentals that helped shaped today's advanced AI models. + FAQ: + Q: Can I take this course? + + *A: You should have taken one of the above courses. If you have not, please consult the instructor before enrolling.* + + This course requires: + - Strong programming skills in Python and prior exposure to libraries such as numpy, PyTorch, or TensorFlow + - Familiarity with probability and statistics, and linear algebra + - Prior exposure to machine learning concepts through courses like CPSC 381/581 (Intro to Machine Learning) or CPSC 370/570 (Artificial Intelligence) + + Q: I have equivalent experience but haven't taken the prereq courses. Can I enroll? + + Contact the instructor with your relevant coursework and grades, your programming experience (including ML projects), and your math background. + + Q: I may miss several lectures. Can I still take the course? + + Regular attendance is mandatory, and we do not guarantee lecture recordings. + Missing more than a few classes significantly impacts learning outcomes and your participation grade. + If you anticipate scheduling conflicts, consider taking the course in a future semester. -Intro to AI or Intro to Machine Learning or permission of instructor. +**Important Note:** We strongly advise against taking this course if you do not meet the prerequisites. **Resources** - Dan Jurafsky and James H. Martin. Speech and Language Processing (2024 pre-release) - Yoav Goldberg. A Primer on Neural Network Models for Natural Language Processing - Jacob Eisenstein.
Natural Language Processing We will also using papers from major conferences in the field including ACL, EMNLP, NAACL, ICLR, NeurIPS, etc. *** - - - - - +- **Lectures:** Tue/Thur 2:30PM - 3:45PM +- **Lecture Location:** TBD +- **Office Hours Location:** [17HH #326](https://maps.app.goo.gl/uySqex4xtLZH2KAK9){:target="\_blank"} +- **Discussion:** [Ed Discussions](https://edstem.org/){:target="\_blank"} *** diff --git a/_pages/logistics.md b/_pages/logistics.md index 8713d92..0f3649f 100644 --- a/_pages/logistics.md +++ b/_pages/logistics.md @@ -11,33 +11,35 @@ title: Logistics ### Introduction -This course provides a deep dive into Natural Language Processing (NLP), a pivotal and dynamic subfield of Artificial Intelligence (AI) that focuses on the interaction between computers and human language. +This course provides a deep dive into modern Natural Language Processing (NLP), focusing on neural approaches and their applications. The curriculum spans both foundational neural concepts and cutting-edge developments in the field. +The course begins with core neural network concepts in NLP, covering word embeddings, sequence modeling, and attention mechanisms. Students will gain a strong understanding of these building blocks while learning their practical implementations. +Building on these foundations, we explore transformer architectures and their evolution, including landmark models like BERT, GPT, and T5. The course examines how these models enable sophisticated language understanding and generation through pre-training and transfer learning. +The latter portion covers contemporary advances: Large Language Models (LLMs), multi-modal integration, parameter-efficient fine-tuning, and model compression. We'll analyze the capabilities and limitations of current systems while discussing emerging research directions. +Through lectures, hands-on assignments, and projects, students will gain both theoretical understanding and practical experience implementing modern NLP systems. The course emphasizes reproducible experimentation and real-world applications. -The course begins by exploring the fundamental principles of NLP, providing a solid grounding in how natural language is processed and understood by machines. Students will first explore the traditional methods of NLP, and study the classic NLP tasks as well as understanding their historical significance and foundational role. These methods, based on statistical and machine learning approaches, lay the groundwork for understanding how machines interpret language. +### Prerequisites +This course requires: -Transitioning to modern NLP, the course delves into the revolutionary impact of deep learning and neural networks. Here, students will learn about representation learning methods, including word representations and sentence representations. Then the course dives into the foundations of language modeling and self-supervised learning in NLP. Specifically, we will discuss sequence-to-sequence models, transformers, and transfer learning, including models like GPT (Generative Pre-trained Transformer), BERT (Bidirectional Encoder Representations from Transformers), and T5. These models have transformed the landscape of NLP by enabling more general language understanding and generation capabilities. We then transition into contemporary topics in NLP including LLMs, parameter-efficient fine-tuning, efficiency, and incorporating other modalities. 
+- Strong programming skills in Python and prior exposure to libraries such as NumPy, PyTorch, or TensorFlow
+- Familiarity with probability and statistics, and linear algebra
+- Prior exposure to machine learning concepts through courses like CPSC 381/581 (Intro to Machine Learning) or CPSC 370/570 (Artificial Intelligence)

-Through a blend of lectures, hands-on projects and assignments, and case studies, students will gain practical experience in both traditional and modern NLP techniques. The goal of the course is to introduce the students to the field and provide them with a comprehensive overview of fundamentals that helped shaped today's advanced AI models.
+**Important Note:** This course assumes familiarity with fundamental machine learning concepts like gradient descent, neural networks, and backpropagation.
+Although we review these concepts, students without prior ML/AI coursework should first take an introductory ML or AI course before enrolling. The course material builds heavily on these concepts from day one, and there will not be time to cover these basics in detail.
+If you are unsure about the prerequisites, please consult the instructor before enrolling (see the optional self-check sketch below).

 ### Learning Resources

 #### Textbook

-- Dan Jurafsky and James H. Martin. Speech and Language Processing (2024 pre-release)
+- Dan Jurafsky and James H. Martin. Speech and Language Processing (2024)
 - Yoav Goldberg. A Primer on Neural Network Models for Natural Language Processing
-- Jacob Eisenstein. Natural Language Processing

 We will also using papers from major conferences in the field including ACL, EMNLP, NAACL, ICLR, NeurIPS, etc.

-
-### Anonymous feedback
-
-If you wish to share comments, questions, or feedback anonymously please use this form: [Anonymous Form](https://forms.gle/KNuS32Ns69GEorrh7).
-I will check this regularly and respond to questions/comments.
-
 ### Communication

-We use Canvas and email for main announcements.
+We use Canvas, Ed, and email for main announcements.
 For questions about the course, discussions about material, and faciliatating discussions for projects between students, we will mainly use Ed Discussion.
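+
+**Prerequisite self-check (optional):** As a rough illustration of the level of PyTorch fluency we assume (a minimal sketch only, not part of any assignment), you should be comfortable reading a short training loop like the one below and predicting what it does: it fits y = 2x with a single linear layer using gradient descent and backpropagation.
+
+```python
+# Illustrative sketch only: fit y = 2x with one linear layer and plain SGD.
+import torch
+
+x = torch.randn(64, 1)           # 64 scalar inputs
+y = 2.0 * x                      # targets from the function y = 2x
+
+model = torch.nn.Linear(1, 1)
+optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
+loss_fn = torch.nn.MSELoss()
+
+for step in range(200):
+    optimizer.zero_grad()        # clear gradients from the previous step
+    loss = loss_fn(model(x), y)  # forward pass + mean squared error
+    loss.backward()              # backpropagation: compute gradients
+    optimizer.step()             # one gradient descent update
+
+print(model.weight.item())       # should be close to 2.0
+```
+
+If the roles of `loss.backward()` and `optimizer.step()` here are unfamiliar, we recommend taking an introductory ML or AI course first.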
 ***

@@ -46,26 +48,44 @@ For questions about the course, discussions about material, and faciliatating di

 ### Grading

 Final grades will be comprised of:

-- 32%: Assignments, which includes both written and coding problem sets
-- 20%: Midterm, in person, closed book
+- 22%: Assignments, which include both written and coding problem sets
+- 40%: Two midterm exams (20% each), in person, closed book
 - 8%: Participation and quizzes
-- 40%: Final projects, including a project proposal (5%), project final presentation (15%), project final report (15%), code and reproducibility checklist (5%)
+- 30%: Final project, including a project proposal (5%), progress report (5%), final report (10%), final presentation (8%), and code and reproducibility checklist (2%)
 - Grading for graduate students: Graduate students will need to incorporate a novelty element and a more in-depth literature review in their final projects

 ### AI Assistant policies

-Using assistance from AIs such as ChatGPT to complete your homeworks, quizzes, projects, and exam is not allowed except for the following circumstances:
+The use of AI tools (including but not limited to ChatGPT, Claude, GPT-4o, etc.) for coursework is regulated as follows:
+
+**Permitted Uses:**
+
+- Writing assistance: Grammar checks, style improvements, and clarity enhancements
+- Learning tool: Exploring concepts and asking questions while studying

-- The assignment explicitly asks for it
-- AI Assistant is used to improve writing or check grammar. If you take advantage of any sort of AI assistance for an assignment, you should explicitly mention how you used AI when submitting the assignment.
+**Prohibited Uses:**

-**Employing AI tools to complete assignments in cases other than above or without permission of instructor will be considered a violation of the Honor Code.**
+- Generating code solutions for assignments
+- Completing homework problems
+- Answering quiz/exam questions
+- Producing project deliverables
+- Using GitHub Copilot or similar coding assistants

-**Co-pilot and coding assistants are also NOT allowed.**
+**Required Disclosure:**
+
+You must document any AI assistance in your submissions by:
+
+- Specifying which parts received AI assistance
+- Explaining how the AI was used
+- Including relevant prompts/interactions
+
+**Academic Integrity:**
+
+**Using AI tools beyond these guidelines constitutes an Honor Code violation. When in doubt, consult the instructor before using AI tools.**

 ### Late submissions

-You can still submit your assignment after the deadlines for up to **3 days**.
+You can still submit your assignment after the deadline for up to **5 days**.
 You will, however, receive partial credit for late submissions. Every late day will result in 10% deduction in full credit for that assignment
 Note: Late days can only be used on the assignments, and not on the project proposal or the final report and the presentation.

@@ -76,25 +96,28 @@ Grading components for graduate students will be the same as undergraduate stude

 For class projects we expect graduate students to work on a research problem (The project should propose either a novel research, a novel investigation of existing methods, an extension of prior work for a specific purpose, or a new application.). Graduate student projects are also expected to have a more thorough literature review component in their final project report.

-#### Class project (**40%**)
+#### Class project (**30%**)

 Students must complete a final research project on a topic of their choice related to the class.
 The students should team up with other students and the team size is limited to 2 to 3 students. If you don't choose a team you will be randomly assigned a team mate. Invidiual projects are allowed only in exceptional cases and by providing reasonable justification.

 - **5%**: proposal
- - Students should submit a 1-2 page proposal for their project. The proposal should state and motivate the problem, and position the proposed project within related work. The project proposal should also include a brief description of the approach as well as the experimental plan (e.g., baselines, datasets, etc) to validate the effectiveness of the approach. Here are some ideas on types of projects.:
+ - Students should submit a 1-page proposal for their project. The proposal should state and motivate the problem, and position the proposed project within related work. The project proposal should also include a brief description of the approach as well as the experimental plan (e.g., baselines, datasets, etc.) to validate the effectiveness of the approach. Here are some ideas for types of projects:
 - For undergraduate students the project could be reimplementation of an exsiting method, a new user-facing application that uses NLP models for a new problem, a comprehensive survey into a subtopic of interest, deeper investigation of a paper and providing further insights by conducting additional experiments, or novel reseach.
 - For graduate students the project should include a component of novelty. E.g., it could propose a novel research, a novel investigation of existing methods, an extension of prior work for a specific purpose, or a new application.

-- **15%**: Final project report
- - 4-6 (no more than 6 pages) page conference format report (e.g., [NeurIPS](https://www.overleaf.com/latex/templates/neurips-2023/vstgtvjwgdng)) detailing the project motivation, related work, proposed approach, results, and discussion. You can think of this as a conference paper. Negative results will not be penalized, but should be accompanied with detailed analysis of why the proposed methods didn't work and provide some additional insights into the problem.
+- **5%**: Progress report
+ - Students should submit a 1-page progress report for their project. The progress report should include the current status of the project, the literature review, the challenges faced, and the plan for the remainder of the project.
+
+- **10%**: Final project report
+ - A 2-4 page report (no more than 4 pages) in conference format (e.g., [NeurIPS](https://www.overleaf.com/latex/templates/neurips-2023/vstgtvjwgdng)) detailing the project motivation, related work, proposed approach, results, and discussion. You can think of this as a short conference paper. Negative results will not be penalized, but they should be accompanied by a detailed analysis of why the proposed methods didn't work and should provide some additional insights into the problem.
 - References and appendix won't count towards the page limit

-- **15%**: Final project presentation
+- **8%**: Final project presentation
 - 5 minute in person in-class presentations

-- **5%**: Code and reproducibility checklist
- - Your project code should be clean, readable, with clear running instructions, and the results should be fully reproducible. We will provide a reproducibility checklist that should be returned.
+- **2%**: Code and reproducibility checklist
+ - Your project code should be clean and readable, with clear instructions for running it, and the results should be fully reproducible. We will provide a reproducibility checklist that should be completed and returned.

 ## Integrity

@@ -105,7 +128,7 @@ If you don't choose a team you will be randomly assigned a team mate. Invidiual

 We embrace and celebrate diversity, understanding that the richest learning experiences come from the exchange of ideas among individuals from varied backgrounds, cultures, and perspectives. We uphold a commitment to mutual respect and open-mindedness, encouraging each participant to both share their unique insights and actively listen to others. Recognizing that learning is a collaborative and evolving process, we foster an inclusive environment where constructive criticism is welcomed, mistakes are embraced as opportunities for growth, and every student is both a teacher and a learner. Our goal is to cultivate a dynamic, respectful, and inclusive classroom environment.

-## FAQs
+
\ No newline at end of file
diff --git a/_pages/student_projects.md b/_pages/student_projects.md
index b14de4d..4bbc2b0 100644
--- a/_pages/student_projects.md
+++ b/_pages/student_projects.md
@@ -1,7 +1,7 @@
 ---
 layout: project
 permalink: /student-projects/
-title: Projects completed by our amazing students
+title: Final projects from last year (2024)
 description: Final projects completed by amazing students of Spring 2024.
 ---