Refactor preprocessing and postprocessing in SDG #155

jwm4 · 2024-11-14T21:28:33Z

This is a draft proposal that describes pros and cons of various options for refactoring preprocessing and postprocessing in SDG. It includes draft decisions, but the purpose of the draft is not to assert that these are the final decisions but rather to act as a vehicle for reaching final decisions. So comments are strongly encouraged, and readers should not expect that the decisions in the current draft will be unchanged between now and the time when this document is merged.

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

cdoern

Left a few lengthy comments so I will let people read those individually.

I generally agree with the idea that separate libraries does not make sense. Though, I strongly feel like logically it makes sense for the "cli" repo instructlab/instructlab should own data ingestion as it owns model, data, config, and filesystem management generally and is dedicated to the user experience of the project.

docs/sdg/sdg-refactor.md

cdoern · 2024-11-14T21:50:00Z

docs/sdg/sdg-refactor.md

+
+One way to support both 1.3.1 and 1.3.2 would be to have separate CLI commands for the preprocessing, core SDG, and postprocessing step .  Alternatively, a single CLI command that does all of these and also saves the outputs of preprocessing to disk would support 1.3.1 but _not_ 1.3.2.  Even if we only want to support 1.3.1, having separate CLI commands for each step might be desirable because it is just more intuitive that if a user wants to save the outputs of preprocessing to disk to have a command to do that instead of having it be a "side effect" of an omnibus SDG command.  Here is a rough outline of what the separate commands would be:
+
+- `ilab data prep` would handle all the preprocessing (the first three bullets in the Context section above, plus any additional preprocessing we add in the future).


As we discussed, I like the split of these commands, the modularity makes sense to me!

I too like splitting this out, although I'm not entirely sure if this is the right dimension or if conversion of documents with Docling should be entirely separate from traversing a taxonomy, extracting data from qna.yamls, and fetching the referenced knowledge. To put it more specifically, we may want to end up with something like ilab document convert that converts a set of input documents to a set of docling jsons or markdown that may be subsequently hand-edited, committed to a git repository, or something else. In the spirit of this document, we probably don't have to decide this here and now. But, document conversion may have a separate iteration loop from a user's point of view, and not just be something that happens automatically during SDG.

docs/sdg/sdg-refactor.md

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

docs/sdg/sdg-refactor.md

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

docs/sdg/sdg-refactor.md

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

anastasds · 2024-11-18T18:17:01Z

I think that this document should be contextualized by the purpose of the change. "Refactor SDG" - okay, but why? To accomplish what goal? We can keep refactoring forever if we want to.

It seems like the goal here is to modularize the parts of the codebase that deal with the data augmentation phase of the end to end workflow. In order to modularize it effectively, we need to identify and distinguish pre-processing, data generation, and post-processing.

The question that follows is where these new, modularized pieces should live. This is another architectural decision - it looks everyone thinks it should go into the CLI, at least for now. In order to do that while still achieving the goal of modularization and without creating a new repository, we will have to do it in a certain way.

The methodical line of thinking here becomes much easier when individual decisions are treated individually, which is why I have been advocating for a series of ADRs that are each individually digestible, easier to discuss, and easier to peer review. That would look like a series of discrete decisions that would be something like this sequence:

The SDG codebase will be refactored in order to modularize based on pre-processing, data generation, and post-processing steps.
Pre-processing logic for SDG will be moved into the CLI codebase.
Post-processing logic for SDG will be moved into the CLI codebase
The SDG codebase will be designed around the principle of "dataset in, dataset out".

As I discuss in the my ADR doc, treating each of these decisions individually and critically engaging with the "Consequences" section of the format would yield direct next steps. For example, in order to move pre-/post-processing out of the SDG codebase, there will like have to be inter-team collaboration, there will have to be a discussion around how to incorporate it into the CLI codebase, backwards compatibility will probably require some care, and so on - these things fall naturally out of that vehicle for technical writing and decision making.

Finally, I think that this document should make a decision and state that clearly. If no objections are raised from stakeholders, it moves forward. If there are, then we can address those and adjust as appropriate. Without taking some sort of stance, however, this will continue to sit in limbo.

docs/sdg/sdg-refactor.md

nathan-weinberg

I've finally had time to review this document - thanks @jwm4 for putting it together!

My general impression is I agree with the decisions at the bottom of the current draft - most strongly I feel the preprocessing and postprocessing steps should be a part of the CLI at this point - that being said, one thing we should consider in the design is ensuring that the interfaces SDG is exposing for these inputs and outputs is not too tightly coupled to the CLI - we need SDG to be consumable independently by the future API, UI, etc. With that work still in the early stages, I think moving these things into the CLI makes sense at this time as a POC of what such an interface should look like.

docs/sdg/sdg-refactor.md

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

docs/sdg/sdg-refactor.md

khaledsulayman · 2024-11-21T19:23:27Z

docs/sdg/sdg-refactor.md

+
+- Avoids the pros of both Option 1 and Option 2.
+- As with Option 1, this approach involves a lot of coordination.  There are a lot of stakeholders involved in the `instructlab/instructlab` repository and locating preprocessing and postprocessing there drags those stakeholders into issues relating to preprocessing and postprocessing.  However, as with Option 1, coordinating across stakeholders is something an open source project needs to do well anyway to remain engaged with the community.
+- As with Option 1, this approach suffers from the fact that keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them.  In the case of Option 1, the interconnected components that would not have as much pressure to be documented would be preprocessing/postprocessing and core SDG.  In the case of Option 3, the interconnected components that would not have as much pressure to be documented would be the command-line interface code and preprocessing/postprocessing.  However, in both cases, this con could be alleviated by just having the discipline to document the APIs well even without such pressure.


as it pertains to this point, I will say that the dataset in / dataset out principle should also alleviate some of the burden

anastasds · 2024-11-22T04:09:06Z

I think that this document should be contextualized by the purpose of the change. "Refactor SDG" - okay, but why? To accomplish what goal? We can keep refactoring forever if we want to.

It seems like the goal here is to modularize the parts of the codebase that deal with the data augmentation phase of the end to end workflow. In order to modularize it effectively, we need to identify and distinguish pre-processing, data generation, and post-processing.

I have not seen a response on this and feel that it is important.

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

jjasghar · 2024-12-12T16:19:13Z

So reading this, isn't going back to puttind sdg back into the "core" package defeating the purpose of of "core" anyway. I thought the goal per the renaming of core was to start modularizing things.

jwm4 · 2024-12-13T17:15:39Z

@jjasghar writes:

So reading this, isn't going back to puttind sdg back into the "core" package defeating the purpose of of "core" anyway. I thought the goal per the renaming of core was to start modularizing things.

Sorry for the confusion! The plan here is to keep the actual synthetic data generation in the /SDG repo where it is now, but to move all the preprocessing and postprocessing to core. I will update the "goals" section to make that clearer.

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

jjasghar · 2024-12-13T18:46:13Z

This is extremely well thought out, and clear now. Thank you. I approve this to be merged.

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

Create sdg-refactor.md

984af49

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

nathan-weinberg self-requested a review November 14, 2024 21:44

cdoern reviewed Nov 14, 2024

View reviewed changes

Fix linting and spelling errors

36f4ceb

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

williamcaban reviewed Nov 15, 2024

View reviewed changes

docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved

docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved

cdoern mentioned this pull request Nov 15, 2024

Allow documents to be loaded locally for Knowledge Injection instructlab/instructlab#2659

Open

6 tasks

Update in response to review comments

3d41e04

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

anastasds reviewed Nov 15, 2024

View reviewed changes

docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved

jwm4 added 2 commits November 17, 2024 19:40

Update draft conclusions

630d637

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

Fix some typos

92eb6b5

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

bbrowning reviewed Nov 18, 2024

View reviewed changes

docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved

bbrowning reviewed Nov 18, 2024

View reviewed changes

docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved

nathan-weinberg reviewed Nov 18, 2024

View reviewed changes

docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved

jwm4 added 3 commits November 21, 2024 13:06

Update sdg-refactor.md

f05ecf4

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

Update .spellcheck-en-custom.txt

7c336b1

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

Fix spell error

850300f

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

jwm4 marked this pull request as ready for review November 21, 2024 18:20

nathan-weinberg approved these changes Nov 21, 2024

View reviewed changes

khaledsulayman reviewed Nov 21, 2024

View reviewed changes

jwm4 mentioned this pull request Dec 3, 2024

RAG ingestion and chat pipelines #161

Open

jwm4 added 2 commits December 11, 2024 16:19

Use new name for core repo. Also, less specifics about CLI.

4c3c4da

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

More changes in response to review comments

a268c7b

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

Update sdg-refactor.md

b1e1a83

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

jjasghar approved these changes Dec 13, 2024

View reviewed changes

jwm4 force-pushed the patch-1 branch from 215817e to b1e1a83 Compare December 17, 2024 14:36

Merge branch 'main' into patch-1

7503a09

Signed-off-by: Bill Murdock <bmurdock@redhat.com>

jjasghar merged commit c601d42 into instructlab:main Dec 17, 2024
4 checks passed

mairin mentioned this pull request Dec 17, 2024

InstructLab Maintainer nomination for Anastas Stoyanovsky instructlab/instructlab#2932

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor preprocessing and postprocessing in SDG #155

Refactor preprocessing and postprocessing in SDG #155

jwm4 commented Nov 14, 2024

cdoern left a comment

cdoern Nov 14, 2024

bbrowning Nov 18, 2024

anastasds commented Nov 18, 2024 •

edited

Loading

nathan-weinberg left a comment

khaledsulayman Nov 21, 2024

anastasds commented Nov 22, 2024

jjasghar commented Dec 12, 2024

jwm4 commented Dec 13, 2024 •

edited

Loading

jjasghar commented Dec 13, 2024


		One way to support both 1.3.1 and 1.3.2 would be to have separate CLI commands for the preprocessing, core SDG, and postprocessing step . Alternatively, a single CLI command that does all of these and also saves the outputs of preprocessing to disk would support 1.3.1 but _not_ 1.3.2. Even if we only want to support 1.3.1, having separate CLI commands for each step might be desirable because it is just more intuitive that if a user wants to save the outputs of preprocessing to disk to have a command to do that instead of having it be a "side effect" of an omnibus SDG command. Here is a rough outline of what the separate commands would be:

		- `ilab data prep` would handle all the preprocessing (the first three bullets in the Context section above, plus any additional preprocessing we add in the future).

Refactor preprocessing and postprocessing in SDG #155

Refactor preprocessing and postprocessing in SDG #155

Conversation

jwm4 commented Nov 14, 2024

cdoern left a comment

Choose a reason for hiding this comment

cdoern Nov 14, 2024

Choose a reason for hiding this comment

bbrowning Nov 18, 2024

Choose a reason for hiding this comment

anastasds commented Nov 18, 2024 • edited Loading

nathan-weinberg left a comment

Choose a reason for hiding this comment

khaledsulayman Nov 21, 2024

Choose a reason for hiding this comment

anastasds commented Nov 22, 2024

jjasghar commented Dec 12, 2024

jwm4 commented Dec 13, 2024 • edited Loading

jjasghar commented Dec 13, 2024

anastasds commented Nov 18, 2024 •

edited

Loading

jwm4 commented Dec 13, 2024 •

edited

Loading