Language programs #11
-
This was a delightful read. Not sure which format you suggest for adding questions here (I'm used to gdocs): adding a new comment for each chunk of questions as they come to mind?
-
Yes, this works; you can use the quote-reply feature for context.
-
Left general comments. One particular area I'd like to understand better is the interfaces that get closer to "prompt engineering" or templating for a particular LM call: I'm thinking of few-shot learning, instruction prompting, chain of thought. This interface is indeed "higher level", and so may be outside the scope of what you wanted to align on first!
So it seems like the `lm.Normal()` object is the next-token RV, which can be sampled?
So, the generator then fixes the sampling procedure to generate sequences? Hence, for OpenAI we might have params like `temperature`, `top_p`, `n`, `best_of`, `presence_penalty`, `frequency_penalty`, `logit_bias` in here. Makes perfect sense that we can view this as a program transformation, perhaps as a PDF over character-sequence space.
nit: noticed the same comment twice
Embeddings-based memory might be one to flesh out for this composition flow. Powerful. Others in SK, langchain, and our good friend the cookbook may be exercises for the reader.
Another section we could think about is the corresponding interface for auto-optimization of programs, particularly in the context of prompt engineering. I caught this recently.
Powerful.
Yeah I think this direction is pretty interesting. It should pair nicely with UQ methods that attempt to quantify a decoupled epistemic uncertainty.
Would be super curious how this would look for one of the visual-chatGPT or ViperGPT examples.
-
#1 defines what would be the equivalent of
In my current thinking the RV represents the sequence (i.e. everything before an EOS token).
If you're thinking about storing and retrieving from vector stores, this could be a simple Op, as discussed in #6.
Yes, this is the "just adding stuff" part of the development, where adding new features is a breeze. These could even be defined in external libraries by users should they need to.
Agreed. I'd really like to see what these guys do; my guess is just pattern-matching. It can be implemented as a program rewrite if that's the case.
This is exactly what David Dohan was after. He was kind of stuck because he wasn't seeing decoding methods as program transformations yet.
You could try to reproduce their results by implementing the operators they have in their toy library, sending the definition of the API to GPT as the beginning of the prompt, appending the question, and executing the returned code.
-
This all makes sense to me, and I find it very compelling, especially around program transformations. Where would you think to start: an end-to-end example? Perhaps the buildt use case, or something simpler from one of the aforementioned chaining libs?
-
An example similar to this one (without constraints to begin with) that plugs in a diffuser model for the hype:

```python
import txt

llm = txt.lm.Normal()

def meta_prompt(question):
    prompt = txt.prompt("""{{ question }}
I believe the best person to answer this question is {{ expert }}.
Indeed, {{ expert }} addressed this question: {{ answer }}""")
    expert = llm
    answer = llm
    return prompt(question=question, expert=expert, answer=answer)

out = txt.lm.beam_search(meta_prompt("What is the Earth's diameter?"))
out.eval()
```
-
## Language models are distributions over sequences
A language model is a distribution over sequences of tokens. Sampling from a language model returns sequences of tokens that follow the model's distribution. The output of a pre-trained language model parametrized by a prompt $P$ is a random variable:
What would this look like in code? In the following, $s_{rv}$ represents a random variable:
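A minimal sketch, using the hypothetical `txt` library from the example above (the exact call signature is an assumption):

```python
import txt

llm = txt.lm.Normal()  # a pre-trained LM, i.e. a distribution over sequences

# `s_rv` is a random variable, not a string; evaluating it draws a
# sequence that follows the model's distribution given the prompt P.
s_rv = llm("What is the Earth's diameter? ")
s_rv.eval()
```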
## Constrained language models
We can further constrain the output of the LM, in which case we are defining a new distribution.
Say we want the sequences to stop after a set of tokens has been found, or to start with a given set of tokens. The constraints apply to the LM distribution itself:
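A sketch using the `starts_with` and `stops_at` names introduced below; whether these constraints attach to the model or to the random variable is left open here:

```python
llm = txt.lm.Normal()

# Constraining the LM defines a new distribution over sequences.
constrained = llm.starts_with(["The"]).stops_at(["\n", "."])
s_rv = constrained("What is the Earth's diameter? ")
```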
We can expand these constraints to add more complex validation methods, for example for code-generation tasks (see this, this and this paper for instance). The LQML paper suggests an efficient way to apply these constraints.
An interesting case is when we limit the output of the LM to a finite number of tokens. In this case we define a new random variable we can truly sample from. The syntax is not yet clear in my mind, but I feel we should distinguish this case from the `starts_with` and `stops_at` constraints above:
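One possible syntax, reusing the `choose_between` construct that appears later in this document (an assumption, not a settled API):

```python
llm = txt.lm.Normal()

# Restricting the output to a finite set of sequences yields a random
# variable with finite support that we can truly sample from.
answer = llm("Is the Earth an oblate spheroid? ").choose_between(["Yes", "No"])
answer.eval()
```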
## Language generators

A language generator is a function that returns a token sequence given an input token sequence. It may be deterministic or stochastic, and may or may not be parametrized. The combination of an LLM with a decoding method (argmax, sample, beam search, nucleus sampling, etc.) is a language generator. Decoders can be seen as program transformations, the same way `joint_logprob` is in AePPL: they produce an execution graph that returns a string. Self-consistency is defined in this paper.
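A sketch of decoders as program transformations; `txt.lm.sample` is an assumed name, while `argmax` and `beam_search` are the operators discussed below:

```python
llm = txt.lm.Normal()
s_rv = llm("What is the Earth's diameter? ")

# Each decoding method transforms the graph into a generator whose
# evaluation returns a string.
txt.lm.argmax(s_rv).eval()       # deterministic generator
txt.lm.sample(s_rv).eval()       # stochastic generator
txt.lm.beam_search(s_rv).eval()  # parametrized, e.g. by the number of beams
```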
## Language programs
Language programs are Directed Acyclic Graphs that link different LM-distributed random variables together. They are typically applied recursively to an initial prompt that is augmented with the RVs:
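A minimal sketch, in the spirit of the `meta_prompt` example in the comments above:

```python
import txt

llm = txt.lm.Normal()

prompt = txt.prompt("""{{ question }}
I believe the best person to answer this question is {{ expert }}.
Indeed, {{ expert }} addressed this question: {{ answer }}""")

# Two LM-distributed random variables linked through the prompt template;
# the filled-in prompt is the root of the DAG.
program = prompt(
    question="What is the Earth's diameter?",
    expert=llm,
    answer=llm,
)
```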
In theory, executing this graph with e.g. `prompt.eval()` should return random strings (maybe with `ancestral_sampling`?). In practice, we often want to get an optimal-ish output. In this case we can transform the graph using the previously-defined operators. Different operators behave in different ways. For instance, `argmax` greedily decodes the graph, so the two programs in the sketch below are equivalent:
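A sketch, reusing `prompt` and `llm` from above:

```python
question = "What is the Earth's diameter?"

# Greedy decoding applied to the whole graph...
out = txt.lm.argmax(prompt(question=question, expert=llm, answer=llm))

# ...is equivalent to greedily decoding each LM variable separately:
out = prompt(question=question, expert=txt.lm.argmax(llm), answer=txt.lm.argmax(llm))
```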
Other program transformations, like `beam_search`, yield different results when they're applied to a whole graph or to individual LM RVs. When applied to a graph with multiple LM calls, the beams used to decode a variable are continued when decoding the next variable, thus trying to find the most likely sequence for the program as a whole (called *scripted beam search* in the LQML paper). When applied to the LM calls individually, the beams are re-initialized after each decoding:
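Again with the names from the sketches above:

```python
# Scripted beam search: the beams that decode `expert` are continued when
# decoding `answer`, maximizing the likelihood of the program as a whole.
out = txt.lm.beam_search(prompt(question=question, expert=llm, answer=llm))

# Per-variable beam search: beams are re-initialized after each decoding.
out = prompt(question=question, expert=txt.lm.beam_search(llm), answer=txt.lm.beam_search(llm))
```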
## Other random variables

Other random variables can be part of a language program. They are not affected by generators, in the sense that an `.eval()` call on the output will consist in first drawing from the random variables' distributions and then decoding. An example of such a random variable:
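A sketch; `txt.choice`, a uniform choice over a finite set, is a hypothetical constructor:

```python
import txt

llm = txt.lm.Normal()

# A random variable over candidate few-shot examples, independent of the LM.
example = txt.choice([
    "Q: What is the boiling point of water? A: 100°C",
    "Q: What is the speed of light? A: 299,792 km/s",
])

prompt = txt.prompt("""{{ example }}
Q: {{ question }} A: {{ answer }}""")

program = prompt(example=example, question="What is the Earth's diameter?", answer=llm)

# `.eval()` first draws `example` from its distribution, then decodes.
txt.lm.beam_search(program).eval()
```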
This also applies to `llm(prompt).choose_between(["The", "A", "All"])` types of random variables. Such variables can be used in a context where we want to infer the best few-shot prompts for a given task, for instance.

## Infer the posterior distribution of values
In a program where we do not apply a generating transformation (such as `beam_search`) to graphs containing LM-distributed random variables like `a_rv = model(prompt)`, it is not clear how to perform efficient inference, because defining good proposal distributions in this space is non-trivial afaik. It nevertheless remains an open possibility with this implementation.

It is however possible to perform simulation-based inference when using one of the generators, thus treating language programs as simulators. We can use humans in the loop to validate the samples, or apparently even use LMs as discriminators.
## Use tools (like web search)
Tools are operators that take a sequence as an input and return a sequence (or a list of sequences). They can thus easily be added to the graph:
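A sketch with a hypothetical `txt.ops.web_search` operator:

```python
import txt

llm = txt.lm.Normal()

# A tool is an operator: sequence in, sequence (or list of sequences) out.
context = txt.ops.web_search("Earth diameter")

prompt = txt.prompt("""Context: {{ context }}
Q: {{ question }} A: {{ answer }}""")

program = prompt(context=context, question="What is the Earth's diameter?", answer=llm)
txt.lm.argmax(program).eval()
```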
Here is a survey on augmented language models. We could use web search, API calls, code execution, etc.
We can even add humans in the loop, with for instance a `human_input` operator.

## Multi-modality
Multi-modality is achieved by defining an `ImageVariable` type, and defining operators that act on/return these types. For instance with a `stable_diffusion` operator:
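A sketch; `txt.ops.stable_diffusion` and the `ImageVariable` semantics are assumptions:

```python
import txt

llm = txt.lm.Normal()

# The LM generates a caption...
caption = llm("Describe a photograph of the Earth seen from the Moon: ")

# ...and `stable_diffusion` maps the sequence to an `ImageVariable`.
image = txt.ops.stable_diffusion(caption)

# Evaluating first decodes the caption, then generates the image.
image.eval()
```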