
Commit

describe the pitfalls of NLP modeling
breandan committed Dec 2, 2024
1 parent 27d17e1 commit 03f4b8f
Showing 5 changed files with 18 additions and 6 deletions.
7 changes: 7 additions & 0 deletions latex/bib/acmart.bib
@@ -2259,4 +2259,11 @@ @phdthesis{monperrus2018living
author={Monperrus, Martin},
year={2018},
school={HAL Archives Ouvertes}
}

@inproceedings{kernighan1990spelling,
title={A spelling correction program based on a noisy channel model},
author={Kernighan, Mark D and Church, Kenneth and Gale, William A},
booktitle={COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics},
year={1990}
}
Binary file modified latex/thesis/Thesis.pdf
Binary file not shown.
12 changes: 7 additions & 5 deletions latex/thesis/content/Ch0_Literature_Review.tex
@@ -1,17 +1,19 @@
\chapter{\rm\bfseries Related Literature}
\label{ch:litreview}

Translating concepts into computer programs demands a high degree of precision, as computers have strict criteria for admitting valid programs. These constraints help eliminate meaningless programs and reduce runtime errors, but can be cumbersome to debug. During the process of programming, these constraints are inevitably violated, requiring typically manual repair. Automated repair attempts to suggest potential repairs from which the author may choose. This subject has been closely investigated in programming language research and treated in a number of existing literature reviews~\cite{monperrus2018living, le2021automatic}. We direct our attention primarily towards syntax repair, which attempts to repair parsing errors, the earliest stage in program analysis.
Translating ideas into computer programs demands a high degree of precision, as computers have strict criteria for admitting valid programs. These constraints act as a failsafe against faulty programs and runtime errors, but can be tedious to debug. During the editing process, these constraints are invariably violated by the hasty or inexperienced programmer, requiring manual repair. To assist with this task, automated program repair (APR) attempts to generate possible revisions from which the author may choose. This subject has been closely investigated by programming language researchers and treated in a number of existing literature reviews~\cite{monperrus2018living, le2021automatic}. We direct our attention primarily towards syntax repair, which attempts to fix parsing errors, the earliest stage in program analysis.

\section{Syntax Repair}

Various strategies have been proposed to handle syntactic program errors, which have been a longstanding open problem since the advent of context-free languages. In 1972, Aho and Peterson~\cite{aho1972minimum} introduce an algorithm that returns a syntactically valid sequence whose distance from the original sequence is minimal. Their method guarantees that a valid repair will be found, but only selects one and does not produce every valid repair within the same distance.
Spellchecking is an early precursor to syntax repair that originates from word processing and seeks to find, among a finite dictionary, the most likely intended revision of a misspelled word~\cite{kernighan1990spelling}. Akin to spellchecking, syntax repair considers the case where this dictionary is not necessarily finite, but rather generated by a grammar representing a potentially infinite collection of words called a language. In the case of programming language syntax, the language and corresponding grammar are typically context-free~\cite{chomsky1959algebraic}.
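As a rough illustration of the noisy channel view cited above (the notation below is ours, not the thesis's): writing $\tilde{w}$ for the observed string and $\mathcal{L}$ for the dictionary, or more generally the language, the most likely intended revision is

$\hat{w} = \arg\max_{w \in \mathcal{L}} P(w)\, P(\tilde{w} \mid w)$

where $P(w)$ scores how plausible the intended word is and $P(\tilde{w} \mid w)$ scores how the observed corruption could have arisen. Syntax repair keeps the same objective but lets $\mathcal{L}$ be generated by a (typically context-free) grammar rather than enumerated as a finite dictionary.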

While algorithmically elegant, this approach is problematic, because source code has both formal and natural characteristics. A pragmatic solution must not only suggest valid repairs, but also generate suggestions a human being is likely to write in practice. To model the natural distribution of valid programs, researchers have borrowed techniques from natural language processing to generate natural repairs.
Various methods have been proposed to handle syntactic program errors, a longstanding open problem since the advent of context-free languages. In 1972, Aho and Peterson~\cite{aho1972minimum} introduce an algorithm that returns a syntactically valid sequence whose distance from the original sequence is minimal. Their method guarantees that a valid repair will be found, but only generates a single repair and does not attempt to optimize the naturalness of the generated solution.
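To make the minimum-distance criterion concrete, the following purely illustrative Python sketch (not Aho and Peterson's algorithm, which uses error productions inside an Earley-style parser) brute-forces every valid repair at the smallest edit distance over a toy language of balanced parentheses; the helper names and the toy grammar are ours:

# Illustrative brute-force minimum-distance repair over a toy language of balanced
# parentheses (Dyck-1). This only demonstrates what "a syntactically valid sequence
# at minimal distance from the original" means; it is not a practical algorithm.
from itertools import count

SIGMA = "()"  # toy terminal alphabet

def is_valid(s: str) -> bool:
    """Membership test for the toy language: balanced parentheses."""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def edits(s: str):
    """All strings one insertion, deletion or substitution away from s."""
    for i in range(len(s) + 1):
        for c in SIGMA:
            yield s[:i] + c + s[i:]        # insertion
    for i in range(len(s)):
        yield s[:i] + s[i + 1:]            # deletion
        for c in SIGMA:
            yield s[:i] + c + s[i + 1:]    # substitution

def minimal_repairs(s: str, max_dist: int = 3):
    """Breadth-first search over edit balls: every valid string at the least distance."""
    frontier = {s}
    for d in count(0):
        valid = sorted(w for w in frontier if is_valid(w))
        if valid or d == max_dist:
            return d, valid
        frontier = {e for w in frontier for e in edits(w)}

print(minimal_repairs("(()"))  # -> (1, ['(())', '()', '()()'])

Unlike this exponential search, Aho and Peterson's method avoids enumerating the edit ball outright, but as noted above it commits to a single minimal repair.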

Recent work Yasunaga et al.~\cite{yasunaga2021break} and Sakkas et al.~\cite{sakkas2022seq2parse} use language models to sample probable fixes, but do not sample from the space of all valid repairs, and have difficulty with inference scaling, where additional test time samples are not . Furthermore, the generated samples are not all syntactically valid.
While algorithmically elegant, this approach is problematic because source code has both formal and natural characteristics. A pragmatic solution must not merely suggest valid repairs, but also generate suggestions a human is likely to prefer in practice. To model coding conventions and stylistic patterns, researchers have borrowed techniques from natural language processing, but often sacrifice validity, precision or latency, as these techniques are prone to misgeneralizing and can hallucinate syntactically invalid repairs.

Our work addresses all these concerns. We try to generate all valid programs and prioritize them by naturalness, while ensuring that the latency and response time is minimal. In other words, we target soundness, completeness, naturalness and latency.
Recent work by Yasunaga et al.~\cite{yasunaga2021break} and Sakkas et al.~\cite{sakkas2022seq2parse} uses natural language models to generate probable fixes, but does not attempt to sample from the space of all and only valid repairs. As a consequence, these methods have difficulty with inference scaling, where additional test-time compute does not translate into improved precision. Furthermore, the generated samples may not even be syntactically valid, as we observe in practice.

Our work addresses all of these concerns. We try to generate every nearby valid program and prioritize the solutions by naturalness, while ensuring response time is tolerable. In other words, we attempt to satisfy soundness, completeness, naturalness and latency simultaneously.


\clearpage
2 changes: 1 addition & 1 deletion latex/thesis/content/Ch2_Formal_Language_Theory.tex
@@ -80,7 +80,7 @@ \chapter{\rm\bfseries Formal Language Theory}

Union, concatenation and repetition are all mundane in the theory of formal languages. Intersection and negation are more challenging concepts to borrow from set theory, and do not translate naturally into the Chomsky hierarchy. For example, the intersection of two CFLs is not context-free in general and can encode Turing machine computation histories, but the intersection of a CFL and a regular language is always a CFL.

Deciding intersection non-emptiness of a finite collection of finite automata is known to be PSPACE-complete~\cite{kozen1977lower}. It is still unknown whether a faster algorithm than the product construction exists for deciding intersection non-emptiness of just two finite automata.
Deciding intersection non-emptiness (INE) of a finite collection of finite automata is known to be PSPACE-complete~\cite{kozen1977lower}. It is still unknown whether a faster algorithm than the product construction exists for deciding INE of just two finite automata.

The textbook algorithm proceeds as follows: create an automaton containing the cross-product of states, and simulate both automata in lockstep, creating arcs between state pairs that are co-reachable. If a final state is reachable in the product automaton, then the intersection is non-empty. This requires space proportional to the Cartesian product of the two automata's state sets.
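A minimal Python sketch of this product construction, assuming each DFA is given as a start state, a partial transition map over (state, symbol) pairs and a set of accepting states (the encoding and names are illustrative, not drawn from the thesis):

from collections import deque

def intersection_nonempty(start1, delta1, finals1, start2, delta2, finals2, alphabet):
    """Simulate both DFAs in lockstep over reachable state pairs and report whether
    a pair of accepting states is co-reachable, i.e. the intersection is non-empty."""
    seen = {(start1, start2)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if p in finals1 and q in finals2:
            return True                              # accepting product state reached
        for a in alphabet:
            succ = (delta1.get((p, a)), delta2.get((q, a)))
            if None not in succ and succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return False                                     # no accepting pair is reachable

# Toy example: L1 = strings over {a, b} with an even number of a's, L2 = strings ending in b.
delta1 = {("e", "a"): "o", ("o", "a"): "e", ("e", "b"): "e", ("o", "b"): "o"}
delta2 = {("n", "a"): "n", ("n", "b"): "y", ("y", "a"): "n", ("y", "b"): "y"}
print(intersection_nonempty("e", delta1, {"e"}, "n", delta2, {"y"}, alphabet="ab"))  # True ("b")

Only the reachable portion of the product is materialized here, but in the worst case that is still the full Cartesian product of the two state sets.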

3 changes: 3 additions & 0 deletions latex/thesis/content/Terminology.tex
@@ -7,6 +7,9 @@ \chapter*{\rm\bfseries Terminology}
\item \textbf{Automaton}: A mathematical model of computation that can occupy one of a finite number of states at any given time, and makes transitions between states according to a set of rules.
\item \textbf{Deterministic}: A property of a system that, given the same input, will always produce the same output.
\item \textbf{Grammar}: A set of rules that define the syntax of a language.
\item \textbf{Language}: A set of words generated by a grammar. For the purposes of this thesis, the language can be finite or infinite.
\item \textbf{Word}: A member of a language, consisting of a sequence of terminals. For the purposes of this thesis, words are always finite.
\item \textbf{Terminal}: A single token from an alphabet. For the purposes of this thesis, the alphabet is always finite.
\item \textbf{Intersection}: The set of elements common to two or more sets.
\item \textbf{Probabilistic}: A property of a system that, given the same input, may produce different outputs.
\item \textbf{Theory}: A set of sentences in a formal language.
