diff --git a/latex/tacm2024/tacm_poster.tex b/latex/tacm2024/tacm_poster.tex index 42d07a4e..0e8d3704 100644 --- a/latex/tacm2024/tacm_poster.tex +++ b/latex/tacm2024/tacm_poster.tex @@ -175,7 +175,7 @@ \mysection{Motivation} \null\hspace*{3cm}\begin{minipage}[c]{0.85\columnwidth} -Suppose we want to force an autoregressive LLM to generate syntactically valid next tokens $P(x_n \mid x_1, \ldots, x_{n-1})$, under certain resource constraints. Here is a concrete example: ``Generate an arithmetic expression with two or more variables in ten or fewer tokens.''. If we sample the partial trajectory, +Suppose we want to force an autoregressive LLM to generate syntactically valid next tokens $P(x_n \mid x_1, \ldots, x_{n-1})$, under certain resource constraints. Here is a concrete example: ``Generate an arithmetic expression with two or more variables in ten or fewer tokens.'' If we sample the partial trajectory, \begin{center}\texttt{( x + ( y * }\underline{\texttt{(}}\end{center}\\ then we will spend quite a long time rejecting invalid completions, because this trajectory has passed the point of no return. Even though \texttt{(} is a locally valid continuation, we need to avoid this scenario, because we would like a linear sampling delay, and to guarantee this we must avoid backtracking. \end{minipage} @@ -398,7 +398,7 @@ Consider a time series, $A$, whose points are neither too close together nor too far apart, and $n \leq \sum_{i=1}^{|A|} \mathbf{1}[A_i = \bs]$. We want to sample the typical set using an LLM.\vspace{0.5cm} \begin{itemize}[leftmargin=2cm] \item The words are bitvectors of some length, $T$, i.e., $A = \{\ws, \bs\}^T$ -\item Consecutive $\bs$ separated by $\ws^{[a,b]}$, i.e., $B = \ws^*(\bs\ws^{[a, b]})^{[n,\infty)}\{\bs,\epsilon\}\ws^*$ +\item Consecutive $\bs$ separated by $\ws^{[a,b]}$, i.e., $B = \ws^*(\bs\ws^{[a, b]})^{[n,\infty)}\{\bs,\varepsilon\}\ws^*$ \end{itemize}\vspace{0.5cm} The DPP language is regular. 
Let $C$ be an FSA such that $\mathcal{L}(C) = \mathcal{L}(A) \cap \mathcal{L}(B)$. For example, here is the minimal automaton for $T=13, a=3, b=5, n=2$. diff --git a/latex/thesis/Thesis.pdf b/latex/thesis/Thesis.pdf index 1095f1d1..184d79d6 100644 Binary files a/latex/thesis/Thesis.pdf and b/latex/thesis/Thesis.pdf differ diff --git a/latex/thesis/content/Ch2_Formal_Language_Theory.tex b/latex/thesis/content/Ch2_Formal_Language_Theory.tex index 531b8b91..ce4fa679 100644 --- a/latex/thesis/content/Ch2_Formal_Language_Theory.tex +++ b/latex/thesis/content/Ch2_Formal_Language_Theory.tex @@ -88,7 +88,7 @@ \chapter{\rm\bfseries Formal Language Theory} \caption{TODO: depict product construction for finite automata here.} \end{figure} -The goal of this thesis is to speed up the product construction by leveraging (1) parameterized complexity (2) pruning and (3) parallelization to speed up the wallclock runtime of the product construction and generalize it to CFG-REG intersections. We show it is possible to decide intersection non-emptiness in realtime for Levenshtein automata and build a tool to demonstrate it on real-world programming languages and grammars. +The goal of this thesis is to speed up the product construction by leveraging (1) parameterized complexity, (2) pruning, and (3) parallelization to reduce its wallclock runtime, and to generalize it to CFG-REG intersections. We show it is possible to decide intersection non-emptiness (INE) in realtime for Levenshtein automata and build a tool to demonstrate it on real-world programming languages and grammars. Finally, we show a probabilistic extension of the REG-CFL product construction, which can be used to decode the top-K most probable words in the intersection of two languages. This is useful for applications in natural language processing, where we might want to find the most natural word that satisfies multiple constraints, such as being a valid repair with fewer than $k$ edits whose probability is maximized. 
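Since both the poster and this chapter lean on the product construction and INE, here is a minimal standalone sketch of the classical DFA product (illustrative names only, not the thesis tool's API): product states are pairs $(p, q)$, so $\mathcal{L}(\text{product}) = \mathcal{L}(A) \cap \mathcal{L}(B)$, and INE reduces to reachability of an accepting pair state.

```kotlin
// Hypothetical sketch of the DFA product construction. States are Ints,
// transitions are a partial map on (state, symbol) pairs.
data class DFA(
  val start: Int,
  val accept: Set<Int>,
  val delta: Map<Pair<Int, Char>, Int>
)

fun product(a: DFA, b: DFA): DFA {
  val sigma = a.delta.keys.map { it.second }.toSet() + b.delta.keys.map { it.second }
  val pairs = mutableListOf(a.start to b.start)          // reachable pair states
  val index = mutableMapOf((a.start to b.start) to 0)    // pair -> product state id
  val queue = ArrayDeque(listOf(a.start to b.start))
  val delta = mutableMapOf<Pair<Int, Char>, Int>()
  while (queue.isNotEmpty()) {                           // BFS over reachable pairs
    val (p, q) = queue.removeFirst()
    val i = index[p to q]!!
    for (c in sigma) {
      val pn = a.delta[p to c] ?: continue               // both components must step
      val qn = b.delta[q to c] ?: continue
      val j = index.getOrPut(pn to qn) { pairs.add(pn to qn); queue.add(pn to qn); pairs.size - 1 }
      delta[i to c] = j
    }
  }
  // A pair state accepts iff both components accept
  val accept = pairs.indices.filter { pairs[it].first in a.accept && pairs[it].second in b.accept }.toSet()
  return DFA(0, accept, delta)
}

// INE: since only reachable pair states are materialized, the intersection
// is nonempty iff some accepting product state exists.
fun intersectionNonempty(a: DFA, b: DFA): Boolean = product(a, b).accept.isNotEmpty()
```

Because the BFS only materializes reachable pair states, the construction is output-sensitive, which is the property the pruning and parallelization chapters exploit.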
diff --git a/latex/thesis/content/Terminology.tex b/latex/thesis/content/Terminology.tex index 3459a537..4f8522aa 100644 --- a/latex/thesis/content/Terminology.tex +++ b/latex/thesis/content/Terminology.tex @@ -8,7 +8,7 @@ \chapter*{\rm\bfseries Terminology} \item \textbf{Deterministic}: A property of a system that, given the same input, will always produce the same output. \item \textbf{Grammar}: A set of rules that define the syntax of a language. \item \textbf{Language}: A set of words generated by a grammar. For the purposes of this thesis, the language can be finite or infinite. - \item \textbf{Word}: A member of a language, consisting of a sequence of terminals. For the purposes of this thesis, words are always finite. + \item \textbf{Word}: A member of a language, consisting of a sequence of terminals. For the purposes of this thesis, a word is always finite. \item \textbf{Terminal}: A single token from an alphabet. For the purposes of this thesis, the alphabet is always finite. \item \textbf{Intersection}: The set of elements common to two or more sets. \item \textbf{Probabilistic}: A property of a system that, given the same input, may produce different outputs. 
diff --git a/src/commonMain/kotlin/ai/hypergraph/kaliningraph/automata/GRE.kt b/src/commonMain/kotlin/ai/hypergraph/kaliningraph/automata/GRE.kt new file mode 100644 index 00000000..938b8b9e --- /dev/null +++ b/src/commonMain/kotlin/ai/hypergraph/kaliningraph/automata/GRE.kt @@ -0,0 +1,70 @@ +package ai.hypergraph.kaliningraph.automata + +import ai.hypergraph.kaliningraph.parsing.* +import ai.hypergraph.kaliningraph.tensor.UTMatrix +import ai.hypergraph.kaliningraph.types.* + +// Generalized regular expression: https://planetmath.org/generalizedregularexpression +sealed class GRE(vararg val args: GRE) { + companion object { operator fun invoke(s: Σᐩ) = ONE(s) } + + class EPS: GRE() + class ONE(val s: Σᐩ): GRE() + class SET(val s: Set<Σᐩ>): GRE() + class NEG(val g: GRE): GRE(g) + class UNI(val l: GRE, val r: GRE): GRE(l, r) + class CAT(val l: GRE, val r: GRE): GRE(l, r) + class INT(val l: GRE, val r: GRE): GRE(l, r) + + infix fun and(a: GRE): GRE = INT(this, a) + operator fun plus(g: GRE): GRE = UNI(this, g) + operator fun times(g: GRE): GRE = CAT(this, g) + operator fun not(): GRE = NEG(this) + + override fun toString(): String = when (this) { + is ONE -> s + is SET -> "( ${s.joinToString(" ")} )" + is NEG -> "! ( $g )" + is UNI -> "( $l ∪ $r )" + is CAT -> "$l $r" + is INT -> "$l ∩ $r" + is EPS -> "ε" + } +} + + +fun CFG.initGREListMat(tokens: List<Σᐩ>): UTMatrix<List<GRE?>> = + UTMatrix( + ts = tokens.map { token -> + val ptreeList = MutableList<GRE?>(nonterminals.size) { null } + (if (token != HOLE_MARKER) bimap[listOf(token)] else unitNonterminals) + .associateWith { nt -> + if (token != HOLE_MARKER) GRE.ONE(token) + else bimap.UNITS[nt]?.let { GRE.SET(it) } + }.forEach { (k, v) -> ptreeList[bindex[k]] = v } + ptreeList + }.toTypedArray(), + algebra = greAlgebra + ) + +val CFG.greAlgebra: Ring<List<GRE?>> by cache { + vindex.let { + Ring.of( + nil = List(nonterminals.size) { null }, + plus = { x, y -> greUnion(x, y) }, + times = { x, y -> greJoin(x, y) } + ) + } +} + +fun greUnion(l: List<GRE?>, r: List<GRE?>): List<GRE?> = + l.zip(r) { l, r -> if (l == null) r else if (r == null) l else l + r } + +fun CFG.greJoin(left: List<GRE?>, right: List<GRE?>): List<GRE?> = vindex2.map { + val t = it.map { (B, C) -> if (left[B] != null && right[C] != null) left[B]!! * right[C]!! else null } + if (t.isEmpty()) null else t.reduce { acc, int -> if (acc == null) int else if (int == null) acc else acc + int } +} + +fun CFG.startGRE(tokens: List<Σᐩ>): GRE?
 = + initGREListMat(tokens).seekFixpoint().diagonals.last()[0][bindex[START_SYMBOL]] + diff --git a/src/commonMain/kotlin/ai/hypergraph/kaliningraph/parsing/CFG.kt b/src/commonMain/kotlin/ai/hypergraph/kaliningraph/parsing/CFG.kt index 366112a8..2a4099b4 100644 --- a/src/commonMain/kotlin/ai/hypergraph/kaliningraph/parsing/CFG.kt +++ b/src/commonMain/kotlin/ai/hypergraph/kaliningraph/parsing/CFG.kt @@ -69,6 +69,13 @@ val CFG.vindex: Array<IntArray> by cache { } } +val CFG.vindex2: Array<List<List<Int>>> by cache { + Array(bindex.indexedNTs.size) { i -> + bimap[bindex[i]].filter { it.size > 1 } + .map { listOf(bindex[it[0]], bindex[it[1]]) } + } +} + val CFG.bindex: Bindex<Σᐩ> by cache { Bindex(nonterminals) } val CFG.normalForm: CFG by cache { normalize() } val CFG.depGraph: LabeledGraph by cache { dependencyGraph() } diff --git a/src/commonMain/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiant.kt b/src/commonMain/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiant.kt index 14510cec..99099f0c 100644 --- a/src/commonMain/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiant.kt +++ b/src/commonMain/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiant.kt @@ -3,6 +3,7 @@ package ai.hypergraph.kaliningraph.parsing import ai.hypergraph.kaliningraph.* +import ai.hypergraph.kaliningraph.automata.GRE import ai.hypergraph.kaliningraph.sampling.* import ai.hypergraph.kaliningraph.tensor.* import ai.hypergraph.kaliningraph.types.* diff --git a/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/BrzozowskiTest.kt b/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/BrzozowskiTest.kt index 5f96dd46..980c9e18 100644 --- a/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/BrzozowskiTest.kt +++ b/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/BrzozowskiTest.kt @@ -1,5 +1,8 @@ package ai.hypergraph.kaliningraph.parsing +import ai.hypergraph.kaliningraph.automata.* +import ai.hypergraph.kaliningraph.repair.vanillaS2PCFG +import ai.hypergraph.kaliningraph.tokenizeByWhitespace 
import ai.hypergraph.kaliningraph.types.* import ai.hypergraph.kaliningraph.types.powerset import kotlin.test.* @@ -8,6 +11,21 @@ import kotlin.test.* ./gradlew jvmTest --tests "ai.hypergraph.kaliningraph.parsing.BrzozowskiTest" */ class BrzozowskiTest { +/* +./gradlew jvmTest --tests "ai.hypergraph.kaliningraph.parsing.BrzozowskiTest.testGRE" +*/ + @Test + fun testGRE() { + val ab = GRE("A") + GRE("B") + val nabab = !(ab * ab) + + println(nabab.toString()) + + val t = Grammars.ifThen.startGRE(List(5) { "_" }) + + println(t?.toString()?.length) + } + /* ./gradlew jvmTest --tests "ai.hypergraph.kaliningraph.parsing.BrzozowskiTest.testLeftQuotient" */
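As a reading aid for `testGRE`, here is a hedged standalone sketch (hypothetical names, independent of the repo's `GRE` class and `Σᐩ` alias) of how the positive `UNI`/`CAT` fragment of the algebra denotes a finite language, which is what the CYK-style `greJoin`/`greUnion` chart accumulates per nonterminal:

```kotlin
// Hypothetical mini-GRE: only the positive fragment (EPS/ONE/UNI/CAT),
// interpreted as a finite-language enumerator rather than an automaton.
sealed class Gre {
  object Eps : Gre()                                  // the empty word
  data class One(val s: String) : Gre()               // a single token
  data class Uni(val l: Gre, val r: Gre) : Gre()      // union of languages
  data class Cat(val l: Gre, val r: Gre) : Gre()      // concatenation

  operator fun plus(g: Gre): Gre = Uni(this, g)       // mirrors GRE.plus
  operator fun times(g: Gre): Gre = Cat(this, g)      // mirrors GRE.times

  // The finite language this expression denotes, as whitespace-joined words
  fun lang(): Set<String> = when (this) {
    is Eps -> setOf("")
    is One -> setOf(s)
    is Uni -> l.lang() + r.lang()
    is Cat -> l.lang().flatMap { a -> r.lang().map { b -> (a + " " + b).trim() } }.toSet()
  }
}
```

For example, `(Gre.One("A") + Gre.One("B")) * (Gre.One("A") + Gre.One("B"))` denotes the four two-token words over {A, B}, which matches how the repo's `GRE("A") + GRE("B")` composes under `*` before `NEG` complements it.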