diff --git a/latex/bib/acmart.bib b/latex/bib/acmart.bib index 2a57ade7..ea0a5552 100644 --- a/latex/bib/acmart.bib +++ b/latex/bib/acmart.bib @@ -2044,4 +2044,15 @@ @inproceedings{pasti2023intersection url = "https://aclanthology.org/2023.eacl-main.52", doi = "10.18653/v1/2023.eacl-main.52", pages = "737--749", +} + +@article{nederhof2004language, + title={The language intersection problem for non-recursive context-free grammars}, + author={Nederhof, Mark-Jan and Satta, Giorgio}, + journal={Information and Computation}, + volume={192}, + number={2}, + pages={172--184}, + year={2004}, + publisher={Elsevier} } \ No newline at end of file diff --git a/latex/splash2024/splash.tex b/latex/splash2024/splash.tex index 630ef38e..c0a30fb6 100644 --- a/latex/splash2024/splash.tex +++ b/latex/splash2024/splash.tex @@ -944,7 +944,7 @@ The Seq2Parse and BIFI experiments were conducted on a single Nvidia V100 GPU with 32 GB of RAM. For Seq2Parse, we use the default pretrained model provided in commit \texttt{7ae0681}~\footnote{https://github.com/gsakkas/seq2parse/tree/7ae0681f1139cb873868727f035c1b7a369c3eb9}. Since it was unclear how to extract multiple repairs from their model, we only take a single repair prediction. For BIFI, we use the Round 2 breaker and fixer from commit \texttt{ee2a68c}\footnote{https://github.com/michiyasunaga/BIFI/tree/ee2a68cff8dbe88d2a2b2b5feabc7311d5f8338b}, the highest-performing model reported by the authors, with a variable-width beam search to control the number of predictions, and let the fixer model predict the top-k repairs, for $k=\{1, 5, 10, 20\times10^5\}$. - The language intersection experiments were conducted on 40 Intel Skylake cores running at 2.4 GHz, with 150 GB of RAM, running bytecode compiled for JVM 17.0.2. To train our scoring function, we use a order-5 Markov chain trained on 55 million BIFI tokens. Training takes roughly 10 minutes, after which re-ranking is nearly instantaneous. Sequences are scored using NLL with Laplace smoothing and our evaluation measures the Precision@\{1, 5, 10, All\} for samples at varying latency cutoffs. We apply a 30-second latency cutoff for our sampler. + The language intersection experiments were conducted on 40 Intel Skylake cores running at 2.4 GHz, with 150 GB of RAM, running bytecode compiled for JVM 17.0.2. To train our scoring function, we use an order-5 Markov chain trained on 55 million BIFI tokens. Training takes roughly 10 minutes, after which re-ranking is nearly instantaneous. Sequences are scored using NLL with Laplace smoothing and our evaluation measures the Precision@\{1, 5, 10, All\} for samples at varying latency cutoffs. We apply a 30-second latency cutoff for our sampler. % Pairwise naturally-occurring errors and human fixes are the most authentic source of real-world syntax repairs, but can be difficult to obtain due to the paucity of parallel syntax error corpi. In the absence of natural syntax repairs, one viable alternative is to collect a dataset of syntactically valid code, and synthetically corrupt it. The original source code becomes the ground truth repair for the synthetically generated typo, and the target for evaluating the precision of our repair procedure. @@ -1302,7 +1302,7 @@ Latency can vary depending on several factors including string length, grammar size, and critically the Levenshtein edit distance. This can be an advantage because in the absence of any contextual or statistical information, syntax and Levenshtein edits are often sufficiently constrained to identify a small number of valid repairs. It is also a limitation, because as the number of edits grows, the admissible set expands rapidly and the number of valid repairs quickly becomes too large to be useful without a very precise naturalness metric to distinguish equidistant repairs. - Space complexity increases sharply with edit distance and to a lesser extent with length. This can be partly alleviated with more precise criteria to avoid creating superfluous productions, but the memory overhead is still considerable. Memory pressure can be attributed to engineering factors such as the grammar encoding, but is also an inherent challenge of grammar intersection. Managing the size of the intersection grammar during construction is a critical factor in scaling up this technique. Weighted intersections would also be help to limit the size of the intersection grammar, as it would enable us to prune improbable productions without diminishing precision. + Space complexity increases sharply with edit distance and to a lesser extent with length. This can be partly alleviated with more precise criteria to avoid creating superfluous productions, but the memory overhead is still considerable. Memory pressure can be attributed to engineering factors such as the grammar encoding, but is also an inherent challenge of grammar intersection. Managing the size of the intersection grammar during construction is a critical factor in scaling up this technique. Weighted intersections would also help to limit the size of the intersection grammar, as it would enable us to prune improbable productions without diminishing precision. \subsubsection{Toolchain integration} @@ -1332,7 +1332,7 @@ Being an algebraic formalism, language equations naturally give rise to a kind of calculus, vaguely reminiscent of Leibniz' and Newton's. First studied by Brzozowski~\cite{brzozowski1964derivatives, brzozowski1980equations} and Antimirov~\cite{antimirov1996partial}, one can take the derivative of a language equation, yielding another equation. This can be interpreted as a kind of continuation or language quotient, revealing the suffixes that complete a given prefix. This technique leads to an elegant family of algorithms for incremental parsing~\cite{might2011parsing, adams2016complexity} and automata minimization~\cite{brzozowski1962canonical}. In our setting, differentiation corresponds to code completion. - Bar-Hillel~\cite{bar1961formal} establishes the closure of CFLs under intersection with regular languages, but does not elaborate on how to construct the corresponding grammar in order to recognize it. Beigel~\cite{beigelproof} and Pasti et al.~\cite{pasti2023intersection} provide helpful insights into the construction of the intersection grammar, which our work specializes to intersection with nominal Levenshtein automata. + Bar-Hillel~\cite{bar1961formal} establishes the closure of CFLs under intersection with regular languages, but does not elaborate on how to construct the corresponding grammar in order to recognize it. Beigel~\cite{beigelproof} and Pasti et al.~\cite{pasti2023intersection} provide helpful insights into the construction of the intersection grammar, and Nederhof and Satta~\cite{nederhof2004language} specifically consider finite CFL intersections, but neither consider the Levenshtein distance. Our work specializes language intersection with nominal Levenshtein automata. In this paper, we restrict our attention to language equations for context-free languages, whose variables coincide with edit locations in the source code of a computer program, and solutions correspond to syntax repairs. Although prior work has studied the use of language equations for parsing~\cite{might2011parsing}, to our knowledge they have never specifically been considered for the purpose of code completion or syntax error correction. diff --git a/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/Grammars.kt b/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/Grammars.kt index 9dc660e3..f5f8e769 100644 --- a/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/Grammars.kt +++ b/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/Grammars.kt @@ -86,6 +86,34 @@ object Grammars { term -> id | int | paren_expr """.parseCFG().freeze() +// https://aclanthology.org/2020.conll-1.41.pdf#page=12 + val hardestCFL: CFG = """ + S' -> R ${'$'} Q S L ; + L -> L' , U + L' -> , V L' + L' -> ε + R -> U , R' + R' -> R' V , + R' -> ε + U -> W U + U -> ε + V -> W V + V -> W + W -> ( + W -> ) + W -> [ + W -> ] + W -> ${'$'} + Q -> L ; R + Q -> ε + S -> S Q T + S -> T + T -> ( Q S Q ) + T -> [ Q S Q ] + T -> ( Q ) + T -> [ Q ] + """.trimIndent().parseCFG().noNonterminalStubs + val seq2parsePythonCFG: CFG = """ START -> Stmts_Or_Newlines Stmts_Or_Newlines -> Stmt_Or_Newline | Stmt_Or_Newline Stmts_Or_Newlines diff --git a/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiantTest.kt b/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiantTest.kt index 67436bac..2be846e9 100644 --- a/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiantTest.kt +++ b/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiantTest.kt @@ -397,6 +397,15 @@ class SetValiantTest { } } +/* +./gradlew jvmTest --tests "ai.hypergraph.kaliningraph.parsing.SetValiantTest.testHardestCFL" +*/ + @Test + fun testHardestCFL() { + Grammars.hardestCFL.enumSeqSmart(List(10){ "_" }).distinct() + .forEach { println(it) } + } + /* ./gradlew jvmTest --tests "ai.hypergraph.kaliningraph.parsing.SetValiantTest.testLevenshteinAutomata" */