Fix some typos
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
jwm4 authored Nov 18, 2024
1 parent 630d637 commit 92eb6b5
Showing 1 changed file with 5 additions and 5 deletions: docs/sdg/sdg-refactor.md
@@ -13,16 +13,16 @@ The existing synthetic data generation (SDG) repository includes several related

Of all of these, only the one emphasized (*Given the seed data ... generate ... tuples*) is core SDG functionality. The others are essentially preprocessing and postprocessing steps to enable the core SDG functionality and produce outputs usable for future steps. In the current flow, the input to preprocessing is a taxonomy with some new seed data added to it. The output of preprocessing includes a set of context/question/answer tuples for both knowledge and skill taxonomy nodes. For knowledge taxonomy nodes, it also includes a set of document chunks. SDG uses the context/question/answer tuples as seed examples, and it uses the document chunks (if there are any) as example contexts from which to generate additional data. That additional data is then sent to the postprocessing step to produce the final outputs.
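To make the hand-off concrete, here is a minimal sketch of the data that preprocessing produces for core SDG. The class and field names are invented for illustration; they are not the structures used in the actual code base.

```python
# Hypothetical illustration of the data handed from preprocessing to core SDG.
from dataclasses import dataclass, field


@dataclass
class SeedExample:
    """One seed context/question/answer tuple taken from a qna.yaml file."""
    context: str
    question: str
    answer: str


@dataclass
class PreprocessedNode:
    """What preprocessing hands to core SDG for a single taxonomy node.

    Skill nodes carry only seed examples; knowledge nodes also carry the
    document chunks that core SDG uses as example contexts.
    """
    taxonomy_path: str
    seed_examples: list[SeedExample]
    document_chunks: list[str] = field(default_factory=list)  # empty for skill nodes
```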

We have heard that some users want a stand-alone SDG capability that includes only the core SDG functionality. Specifically, they already have a set of seed context/question/answer tuples and optionally a set of document chunks. All they want from SDG is to take that input and produce a new synthetic data set as output, without doing any mixing into pre-computed data or splitting into train and test. The preprocessing and postprocessing capabilities currently in SDG are not relevant to those users.
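A stand-alone capability along these lines might look roughly like the following. The function name and signature are hypothetical and only illustrate the proposed contract: seed tuples in, synthetic samples out, with no mixing into pre-computed data and no train/test splitting.

```python
# Hypothetical entry point for stand-alone SDG; not an existing InstructLab API.
from typing import Optional


def generate_synthetic_data(
    seed_examples: list[dict],                    # seed context/question/answer tuples
    document_chunks: Optional[list[str]] = None,  # optional chunks for knowledge seeds
) -> list[dict]:
    """Generate new synthetic samples from the seeds.

    Deliberately excludes mixing into pre-computed data and splitting into
    train and test sets; callers who want those steps run them separately.
    """
    raise NotImplementedError  # sketch only
```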

Also as context, in the near future we are absorbing a set of updates to the core SDG functionality to make it more modular and flexible. That might turn out to be irrelevant to this document, which is focused on what to do with the non-core functionality (preprocessing and postprocessing). However, it is mentioned here in the context section in case it winds up being useful.

Furthermore, in 2025 we are hoping to have some sort of retrieval-augmented generation (RAG) capability that is either part of or tightly integrated with InstructLab. Such a capability would have significant overlap with the functionality of the preprocessing for SDG. As noted above, when a taxonomy has a knowledge qna.yaml file that references a document, SDG uses Docling to convert the document to JSON and then splits it into chunks of appropriate size for SDG. The RAG capability would *also* want the same Docling JSON output but would need to split it into chunks that are sized appropriately for vector retrieval (i.e., that fit within the context window of the semantic encoding model).
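The overlap can be summarized as one shared Docling conversion feeding two different chunking profiles. The sketch below is illustrative only; the chunk sizes and the naive character-based splitter are assumptions, not the actual Docling or InstructLab behavior.

```python
# Illustrative chunking profiles; sizes and splitting strategy are assumptions.
def chunk_text(text: str, max_chars: int) -> list[str]:
    """Naively split converted document text into pieces of at most max_chars."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def chunk_for_sdg(converted_text: str) -> list[str]:
    # Larger chunks: enough surrounding context for the teacher model to
    # generate grounded question/answer pairs.
    return chunk_text(converted_text, max_chars=4000)


def chunk_for_rag(converted_text: str) -> list[str]:
    # Smaller chunks: each one must fit the context window of the semantic
    # encoding (embedding) model used for vector retrieval.
    return chunk_text(converted_text, max_chars=1000)
```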

An additional complication is the fact that InstructLab's existing "taxonomy" structure is a tree structure encoded as a git repo that can be cloned/pushed/shared using the normal git constructs and flow. A taxonomy has *staged* nodes that are presumed to already be fully addressed by the model and *unstaged* nodes that are not, which is why the first item in the list above involves identifying only the unstaged qna.yaml files. However, some users might have the essential elements of a taxonomy (seed context/question/answer tuples for both skills and knowledge plus documents for knowledge) but do not want to put that information in a tree in a git repo. For the purposes of this document, we will refer to those essential elements as "raw seed content". The "raw seed content" includes all of the things that go into a qna.yaml file. In the current code base, the way InstructLab gets to the raw seed content is by identifying unstaged qna.yaml files from a local clone of a taxonomy. However, in the future we might add functionality that allows users to simply point at some raw seed content without having to tie it to a GitHub repository for a taxonomy. If the raw seed content includes knowledge elements (not just skills), then those knowledge elements will have references to documents. When the raw seed content is processed, the documents are fetched, converted, and chunked (the third step in the list above). For this document, we will use the term "processed seed content" to refer to the outputs of that processing. So to summarize the data structure terms being discussed here:

- *Raw seed content* -- A set of elements, each of which has a set of context/question/answer tuples. Some elements may be *knowledge* elements, which also have references to documents.
- *Processed seed content* -- The same as raw seed content except all references to documents are replaced with a set of document chunks of appropriate size for SDG.
- *Taxonomy* -- A tree structure encoded as a git repo. Some leaves of the taxonomy are unstaged, indicating that they should be used for raw seed content.
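The following sketch makes the raw-to-processed transition concrete. Every name in it is hypothetical: `fetch_and_convert` and `chunk_for_sdg` stand in for the real document fetching, Docling conversion, and chunking steps, and a taxonomy is not modeled at all because it is a git repository rather than an in-memory structure.

```python
# Hypothetical model of raw vs. processed seed content; not InstructLab code.
from dataclasses import dataclass, field, replace


@dataclass
class SeedElement:
    """One element of seed content; knowledge elements also reference documents."""
    seed_examples: list[tuple[str, str, str]]                      # (context, question, answer)
    document_references: list[str] = field(default_factory=list)   # raw seed content only
    document_chunks: list[str] = field(default_factory=list)       # processed seed content only


def fetch_and_convert(reference: str) -> str:
    """Stand-in for fetching a referenced document and converting it with Docling."""
    raise NotImplementedError


def chunk_for_sdg(text: str, max_chars: int = 4000) -> list[str]:
    """Stand-in for splitting converted text into chunks of appropriate size for SDG."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def process_raw_seed(raw: list[SeedElement]) -> list[SeedElement]:
    """Turn raw seed content into processed seed content (references -> chunks)."""
    processed = []
    for element in raw:
        chunks = [
            chunk
            for ref in element.document_references
            for chunk in chunk_for_sdg(fetch_and_convert(ref))
        ]
        processed.append(replace(element, document_references=[], document_chunks=chunks))
    return processed
```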

## Question 1: What user flows should be supported?
@@ -71,10 +71,10 @@ Pro:
Con:

- The core logic of SDG is inherently complex and represents some of the most sophisticated and differentiating elements of InstructLab. For that reason, it would be nice to have it in a repository by itself. New contributors to that core logic find it challenging enough to navigate the core functionality without also having to figure out where the core logic starts and the preprocessing and postprocessing capabilities end. This could be mitigated by having better technical documentation (README, comments) for the SDG library.
- To the extent that the plan is for SDG to be run independently, there will be tooling built around the SDG repo. The more tooling built around just running SDG independently, the more risk of breaking contracts for that tooling. The more functionality living in SDG that isn't SDG, the more surface area there is to break.
- As noted in the Context section earlier, in the near future we are absorbing a set of updates to the core SDG functionality. Absorbing those updates is somewhat simpler if the core SDG logic is all alone in a repository of its own.
- Keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. We certainly *could* have well-documented API contracts among preprocessing, postprocessing, and core SDG functionality that make it clear how they interact even when all of these exist in the same repository, but it is probably more likely that we *will* do so if they are separated.
- The logic behind the core SDG algorithms is mainly developed and maintained by the Red Hat AI Innovations team (commonly referred to as the "research" team because many people on that team used to work for IBM Research), while the logic behind the preprocessing and postprocessing is mainly developed and maintained by the Red Hat AI engineering "data" team. Having multiple teams working on a component increases the amount of coordination required. Note, however, that preprocessing, postprocessing, and core SDG all belong to the entire InstructLab community and *not* Red Hat (much less any one team in Red Hat). So the teams really need to keep collaborating with the entire community at all times and not get into a mindset of "owning" a single piece of code.
- The expected RAG functionality in 2025 will have some complex interactions with both preprocessing and postprocessing, perhaps even involving user flows in which the core SDG functionality is not needed. In that case, it would be confusing to have the code path for RAG include a call out to the SDG library for preprocessing without actually doing the core SDG.
- It would just be simpler to explain to all stakeholders if the functionality that I've been calling "core SDG" were really just called "SDG". We can't do that now because the SDG library has preprocessing and postprocessing in it too.

