Course materials for Applied Natural Language Processing (Spring 2019). Syllabus: http://people.ischool.berkeley.edu/~dbamman/info256.html
Date | Activity | Summary |
---|---|---|
1/22 | Follow setup instructions in 0.setup/ | Install anaconda and set up environment for class with specific Python libraries. |
1/24 | Complete 1.words/ExploreTokenization_TODO.ipynb before class | This notebook outlines several methods for tokenizing text into words (and sentences), including whitespace, nltk (Penn Treebank tokenizer), nltk (Twitter-aware), spaCy, and custom regular expressions, highlighting differences between them (see the tokenizer sketch after this table). |
1/24 | Execute 1.words/EvaluateTokenizationForSentiment.ipynb | This notebook evaluates different methods for tokenization and stemming/lemmatization and assesses their impact on binary sentiment classification, using a train/dev dataset of 1,000 reviews sampled from the Large Movie Review Dataset. Each tokenization method is evaluated with the same learning algorithm (L2-regularized logistic regression); the only difference is the tokenization process. For more, see: http://sentiment.christopherpotts.net/tokenizing.html |
1/24 | Complete 1.words/TokenizePrintedBooks_TODO.ipynb | Design a better tokenizer for printed texts that have been OCR'd (where words are often hyphenated at line breaks). |
1/29 | Complete 2.distinctive_terms/CompareCorpora_TODO.ipynb | This notebook explores methods for comparing two different textual datasets to identify the terms that are distinctive of each one: the difference of proportions (described in Monroe et al. 2009, Fighting Words, section 3.2.2) and the Mann-Whitney rank-sums test (described in Kilgarriff 2001, Comparing Corpora, section 2.3). Both are sketched after this table. |
1/29 | Complete 2.distinctive_terms/ChiSquare.ipynb | This notebook illustrates the Chi-Square test for finding distinctive terms between tweets from @realdonaldtrump and @AOC (sketched after this table). |
1/31 | Complete 3.dictionaries/DictionaryTimeSeries_TODO.ipynb | This notebook introduces the use of dictionaries for counting the frequency of some category of words in text, using sentiment (from the AFINN sentiment lexicon) in a time series of tweets as an example (sketched after this table). |
2/5 | Complete 4.classification/CheckData_TODO.ipynb | Collect data for classification; verify that it's in the proper format. |
2/5 | Complete 4.classification/Hyperparameters_TODO.ipynb | This notebook explores text classification, introducing a majority-class baseline and analyzing the effect of hyperparameter choices on accuracy (see the classification sketch after this table). |
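
The sketches below illustrate, outside the notebooks, the main techniques listed in the table. They are minimal examples on made-up data, not the course's reference solutions.

**Tokenizers (1/24).** A minimal comparison of the tokenization methods named above: whitespace splitting, nltk's Penn Treebank and Twitter-aware tokenizers, spaCy, and a simple custom regular expression. It assumes the NLTK `punkt` models and the spaCy `en_core_web_sm` model are already downloaded (as in the course setup); the sample sentence is only an illustration.

```python
import re
import nltk
from nltk.tokenize import TweetTokenizer
import spacy

text = "Can't wait for #NLP class :) -- see http://example.com"

tokenizations = {
    "whitespace": text.split(),
    "treebank": nltk.word_tokenize(text),          # Penn Treebank-style tokenizer
    "twitter": TweetTokenizer().tokenize(text),    # Twitter-aware tokenizer
    "spacy": [tok.text for tok in spacy.load("en_core_web_sm")(text)],
    # a simple custom regex: keep URLs, hashtags, emoticons, and apostrophes inside words together
    "regex": re.findall(r"http\S+|#\w+|:\)|[\w']+", text),
}

for name, tokens in tokenizations.items():
    print(f"{name:12s} {tokens}")
```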
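**Difference of proportions (1/29).** A minimal sketch of the difference-of-proportions comparison: each word is scored by its relative frequency in corpus A minus its relative frequency in corpus B, and words are ranked by that score. The two toy corpora are placeholders for real datasets.

```python
from collections import Counter

corpus_a = ["the economy is strong and jobs are growing".split(),
            "taxes and jobs and the economy".split()]
corpus_b = ["climate change is an urgent crisis".split(),
            "healthcare and climate policy".split()]

counts_a = Counter(w for doc in corpus_a for w in doc)
counts_b = Counter(w for doc in corpus_b for w in doc)
total_a, total_b = sum(counts_a.values()), sum(counts_b.values())

vocab = set(counts_a) | set(counts_b)
scores = {w: counts_a[w] / total_a - counts_b[w] / total_b for w in vocab}

# most distinctive terms for each corpus (largest positive / most negative scores)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print("distinctive of A:", ranked[:5])
print("distinctive of B:", ranked[-5:])
```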
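**Chi-Square test (1/29).** A minimal sketch of the Chi-Square test for a single term: build a 2x2 contingency table of (count of the term, count of all other tokens) in the two corpora and test for independence. The counts below are made up.

```python
from scipy.stats import chi2_contingency

count_in_a, total_a = 120, 50_000   # occurrences of the term vs. total tokens in corpus A
count_in_b, total_b = 40, 48_000    # same for corpus B

table = [[count_in_a, total_a - count_in_a],
         [count_in_b, total_b - count_in_b]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```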
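**Dictionary-based time series (1/31).** A minimal sketch of dictionary-based scoring in the spirit of the AFINN lexicon: sum word scores per tweet and aggregate by date. The tiny lexicon and tweets below are illustrative stand-ins for the real AFINN file and tweet data.

```python
from collections import defaultdict

lexicon = {"love": 3, "great": 3, "bad": -3, "terrible": -3}   # AFINN-style word scores

tweets = [("2019-01-22", "love this great class"),
          ("2019-01-22", "terrible traffic today"),
          ("2019-01-23", "bad weather but great coffee")]

daily_scores = defaultdict(list)
for date, text in tweets:
    score = sum(lexicon.get(tok, 0) for tok in text.lower().split())
    daily_scores[date].append(score)

for date in sorted(daily_scores):
    scores = daily_scores[date]
    print(date, sum(scores) / len(scores))   # mean sentiment score per day
```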
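**Classification baseline and hyperparameters (1/24, 2/5).** A minimal sketch of the classification setup referenced above: bag-of-words features fed to L2-regularized logistic regression, with the regularization strength C as one hyperparameter to vary, compared against a majority-class baseline. The few labeled documents are placeholders for real train/dev data.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["loved the movie", "great acting and plot",
               "boring and too long", "terrible script", "great fun"]
train_labels = ["pos", "pos", "neg", "neg", "pos"]

X_train = CountVectorizer().fit_transform(train_texts)

# majority-class baseline: always predicts the most frequent training label
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, train_labels)
print("baseline accuracy:", baseline.score(X_train, train_labels))

# vary C (inverse L2 regularization strength); in practice, score on held-out dev data
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(penalty="l2", C=C).fit(X_train, train_labels)
    print(f"C={C}: train accuracy={clf.score(X_train, train_labels):.2f}")
```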