Due date: 17:00 on Thursday, November 7, 2024.
- Late assignments will not be accepted without a valid medical certificate or other documentation of an emergency.
- For CSC485 students, this assignment is worth 33% of your final grade. For CSC2501 students, it is worth 25% of your final grade.
- Read the whole assignment carefully.
- Type the written parts of your submission in a font no smaller than 12 pt.
- Your work must be your own. Do not work with anyone else on any of the problems. If you need assistance, contact the instructor or TA.
- Any clarifications to the problems will be posted on the Discourse forum for the class. Check the page regularly.
- The starter code directory for this assignment is distributed via MarkUs. Refer to code files in that directory.
- When implementing code, read the docstrings as some provide important instructions, implementation details, or hints.
- Fill in your name, student number, and UTORid on the relevant lines at the top of each file you submit.
- Implement the deepest function in q0.py to find the synset in WordNet with the largest maximum depth and report both the synset and its depth on each of its paths to a root hyperonym. Hint: use the wn.all_synsets and synset.max_depth methods.
- Implement the superdefn function in q0.py that takes a synset s and returns a list consisting of all of the tokens in the definitions of s, its hyperonyms, and its hyponyms. Use word_tokenize as shown in chapter 3 of the NLTK book.
- Implement the stop_tokenize function in q0.py that takes a string, tokenizes it using word_tokenize, removes any tokens that occur in NLTK’s list of English stop words, and also removes any tokens that consist entirely of punctuation characters. Use Python’s punctuation characters from the string module. Maintain the original case in the return value.
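The phrase "consist entirely of punctuation characters" in the stop_tokenize item can be illustrated with a small standalone sketch. The helper name is_all_punctuation and the toy token list are hypothetical and do not come from the starter code; this shows only the punctuation test, not tokenization or stop-word removal.

```python
import string

# string.punctuation holds the ASCII punctuation characters only.
PUNCT = set(string.punctuation)

def is_all_punctuation(token):
    """Return True if the token is non-empty and every character is punctuation."""
    return len(token) > 0 and all(ch in PUNCT for ch in token)

# Hypothetical, already-tokenized input for illustration.
tokens = ["Hello", ",", "world", "...", "U.S.", "--", "don't"]
kept = [t for t in tokens if not is_all_punctuation(t)]
print(kept)  # ['Hello', 'world', 'U.S.', "don't"]
```

Tokens such as "U.S." and "don't" survive because they contain at least one non-punctuation character. Note that string.punctuation does not include non-ASCII marks such as curly quotes.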
- Implement the mfs function that returns the most frequent sense for a given word in a sentence. Note that wordnet.synsets() orders its synsets by decreasing frequency.
- In the lesk function in q1.py, implement the simplified Lesk algorithm as specified in Algorithm 1, including Overlap. Overlap(signature, context) returns the cardinality of the intersection of the bags signature and context. Use the stop_tokenize function to tokenize the examples and definitions.
- In the lesk_ext function in q1.py, implement a version of Algorithm 1 where the signature also includes the words in the definitions and examples of the sense’s hyponyms, holonyms, and meronyms. Use stop_tokenize as before.
- Explain why the extension in lesk_ext is helpful. Consider the likely sizes of the overlaps.
- In the lesk_cos function in q1.py, implement a variant of lesk_ext that uses CosSim instead of Overlap. Modify signature and context to be vector-valued and construct the vectors as described. Use stop_tokenize to get the tokens for the signature.
- In the lesk_cos_oneside function in q1.py, implement a variant of lesk_cos that, when constructing the vectors for the signature and context, does not include words that occur only in the signature. Use stop_tokenize to get the tokens for the signature.
- Compare the performance of lesk_cos_oneside with that of lesk_cos. Justify your answer with examples.
- If we use CosSim for vectors with binary values (representing sets), how is it related to the set intersection? (No implementation required.)
- In the lesk_w2v function in q1.py, implement a variant of lesk_cos where the vectors for the signature and context are constructed by taking the mean of the word2vec vectors for the words in the signature and sentence, respectively. Treat the signature and context as sets rather than multisets. Use stop_tokenize to get the tokens for the signature.
- Alter your code so that all tokens are lowercased before they are used for any of the comparisons, vector lookups, etc. Analyze how this alters the different methods’ performance and explain why. Do not submit this lowercased version.
- Is context really necessary? Give an example of a sentence where word order–invariant methods such as those implemented for Q1 will never be able to completely disambiguate. Explain the more general pattern and why these methods cannot provide the correct sense for each ambiguous word.
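The distinction between bags (multisets) and sets comes up repeatedly in the items above: Overlap is defined on bags, while lesk_w2v treats the signature and context as sets. The following minimal sketch shows how the two kinds of intersection differ on toy data. The function name bag_overlap and the example tokens are hypothetical, and this is one possible way to count a bag intersection, not necessarily the approach the starter code expects.

```python
from collections import Counter

def bag_overlap(signature_tokens, context_tokens):
    """Cardinality of the multiset (bag) intersection of two token lists."""
    # Counter & Counter keeps each token with the minimum of its two counts.
    common = Counter(signature_tokens) & Counter(context_tokens)
    return sum(common.values())

# Hypothetical token lists for illustration only.
signature = ["bank", "river", "water", "river"]
context = ["river", "river", "flows", "bank"]

print(bag_overlap(signature, context))       # bag intersection: 3
print(len(set(signature) & set(context)))    # set intersection: 2
```

Here "river" appears twice on both sides, so it contributes 2 to the bag intersection but only 1 to the set intersection.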
- Implement gather_sense_vectors in q2.py to assign sense vectors as described.
- In the docstring for gather_sense_vectors, explain why sorting the corpus by length before batching is much faster than leaving it as-is. Hint: think about padding.
- Implement bert_1nn in q2.py to predict the sense for a word in a sentence given sense vectors produced by gather_sense_vectors. Keep in mind the note in the docstring about loop usage.
- Think of at least one other issue that would come up when attempting to use the code for this assignment to disambiguate arbitrary sentences. Consider either the Lesk variants from Q1 or the BERT-based method here (or both).
- Implement get_forward_hooks.
- Implement causal_trace_analysis to compute the impact of states, MLP, and attention.
- Include in your report the causal tracing result plots you generated for the prompt “The Eiffel Tower is located in the city of” with the output “Paris”.
- Experiment with different sizes of GPT-2 models (e.g., small, medium, large, and XL) to examine how model size impacts causal tracing patterns. Address the following in your report:
  - At what model size do you observe that the causal tracing pattern no longer appears?
  - Discuss potential reasons for how and why this change in causal tracing patterns occurs as the model size increases or decreases.
- Using GPT-2 XL, experiment with various prompts to identify prompt types that result in a causal tracing pattern similar to the one illustrated in Figure 2. Document your findings with examples and discuss what characteristics of the prompts might contribute to this similarity.
- For GPT-2 XL, explore different prompts and tasks to find cases where the causal tracing pattern is absent or significantly diminished. Describe the prompt/task and hypothesize why the pattern does not emerge. Discuss any trends or patterns you identified and reflect on the broader implications of how language models process, store and generate factual information obtained from pretraining.
- Submit electronically via MarkUs.
- Submit a total of five required files:
  - a2written.pdf: a PDF document containing answers to questions 0a, 1d, 1f, 1h, 2a, and 2d. Also include a typed copy of the Student Conduct declaration and sign it by typing your name.
  - q0.py: the entire file with your implementations filled in.
  - q1.py: the entire file with your implementations filled in. Do not include the alterations for question 1h.
  - q2.py: the entire file with your implementations filled in.
  - q3.py: the entire file with your implementations filled in.