Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor preprocessing and postprocessing in SDG #155

Merged
merged 12 commits into from
Dec 17, 2024

Conversation

jwm4
Copy link
Contributor

@jwm4 jwm4 commented Nov 14, 2024

This is a draft proposal that describes pros and cons of various options for refactoring preprocessing and postprocessing in SDG. It includes draft decisions, but the purpose of the draft is not to assert that these are the final decisions but rather to act as a vehicle for reaching final decisions. So comments are strongly encouraged, and readers should not expect that the decisions in the current draft will be unchanged between now and the time when this document is merged.

Signed-off-by: Bill Murdock <bmurdock@redhat.com>
@nathan-weinberg nathan-weinberg self-requested a review November 14, 2024 21:44
Copy link
Contributor

@cdoern cdoern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few lengthy comments so I will let people read those individually.

I generally agree with the idea that separate libraries does not make sense. Though, I strongly feel like logically it makes sense for the "cli" repo instructlab/instructlab should own data ingestion as it owns model, data, config, and filesystem management generally and is dedicated to the user experience of the project.

docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved
docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved
docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved
docs/sdg/sdg-refactor.md Show resolved Hide resolved

One way to support both 1.3.1 and 1.3.2 would be to have separate CLI commands for the preprocessing, core SDG, and postprocessing step . Alternatively, a single CLI command that does all of these and also saves the outputs of preprocessing to disk would support 1.3.1 but _not_ 1.3.2. Even if we only want to support 1.3.1, having separate CLI commands for each step might be desirable because it is just more intuitive that if a user wants to save the outputs of preprocessing to disk to have a command to do that instead of having it be a "side effect" of an omnibus SDG command. Here is a rough outline of what the separate commands would be:

- `ilab data prep` would handle all the preprocessing (the first three bullets in the Context section above, plus any additional preprocessing we add in the future).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed, I like the split of these commands, the modularity makes sense to me!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I too like splitting this out, although I'm not entirely sure if this is the right dimension or if conversion of documents with Docling should be entirely separate from traversing a taxonomy, extracting data from qna.yamls, and fetching the referenced knowledge. To put it more specifically, we may want to end up with something like ilab document convert that converts a set of input documents to a set of docling jsons or markdown that may be subsequently hand-edited, committed to a git repository, or something else. In the spirit of this document, we probably don't have to decide this here and now. But, document conversion may have a separate iteration loop from a user's point of view, and not just be something that happens automatically during SDG.

docs/sdg/sdg-refactor.md Show resolved Hide resolved
docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved
docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved
jwm4 added 2 commits November 17, 2024 19:40
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
@anastasds
Copy link
Contributor

anastasds commented Nov 18, 2024

I think that this document should be contextualized by the purpose of the change. "Refactor SDG" - okay, but why? To accomplish what goal? We can keep refactoring forever if we want to.

It seems like the goal here is to modularize the parts of the codebase that deal with the data augmentation phase of the end to end workflow. In order to modularize it effectively, we need to identify and distinguish pre-processing, data generation, and post-processing.

The question that follows is where these new, modularized pieces should live. This is another architectural decision - it looks everyone thinks it should go into the CLI, at least for now. In order to do that while still achieving the goal of modularization and without creating a new repository, we will have to do it in a certain way.

The methodical line of thinking here becomes much easier when individual decisions are treated individually, which is why I have been advocating for a series of ADRs that are each individually digestible, easier to discuss, and easier to peer review. That would look like a series of discrete decisions that would be something like this sequence:

  1. The SDG codebase will be refactored in order to modularize based on pre-processing, data generation, and post-processing steps.
  2. Pre-processing logic for SDG will be moved into the CLI codebase.
  3. Post-processing logic for SDG will be moved into the CLI codebase
  4. The SDG codebase will be designed around the principle of "dataset in, dataset out".

As I discuss in the my ADR doc, treating each of these decisions individually and critically engaging with the "Consequences" section of the format would yield direct next steps. For example, in order to move pre-/post-processing out of the SDG codebase, there will like have to be inter-team collaboration, there will have to be a discussion around how to incorporate it into the CLI codebase, backwards compatibility will probably require some care, and so on - these things fall naturally out of that vehicle for technical writing and decision making.

Finally, I think that this document should make a decision and state that clearly. If no objections are raised from stakeholders, it moves forward. If there are, then we can address those and adjust as appropriate. Without taking some sort of stance, however, this will continue to sit in limbo.

docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved
docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved
Copy link
Member

@nathan-weinberg nathan-weinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've finally had time to review this document - thanks @jwm4 for putting it together!

My general impression is I agree with the decisions at the bottom of the current draft - most strongly I feel the preprocessing and postprocessing steps should be a part of the CLI at this point - that being said, one thing we should consider in the design is ensuring that the interfaces SDG is exposing for these inputs and outputs is not too tightly coupled to the CLI - we need SDG to be consumable independently by the future API, UI, etc. With that work still in the early stages, I think moving these things into the CLI makes sense at this time as a POC of what such an interface should look like.

docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved
jwm4 added 3 commits November 21, 2024 13:06
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
@jwm4 jwm4 marked this pull request as ready for review November 21, 2024 18:20
docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved
docs/sdg/sdg-refactor.md Outdated Show resolved Hide resolved

- Avoids the pros of both Option 1 and Option 2.
- As with Option 1, this approach involves a lot of coordination. There are a lot of stakeholders involved in the `instructlab/instructlab` repository and locating preprocessing and postprocessing there drags those stakeholders into issues relating to preprocessing and postprocessing. However, as with Option 1, coordinating across stakeholders is something an open source project needs to do well anyway to remain engaged with the community.
- As with Option 1, this approach suffers from the fact that keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. In the case of Option 1, the interconnected components that would not have as much pressure to be documented would be preprocessing/postprocessing and core SDG. In the case of Option 3, the interconnected components that would not have as much pressure to be documented would be the command-line interface code and preprocessing/postprocessing. However, in both cases, this con could be alleviated by just having the discipline to document the APIs well even without such pressure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as it pertains to this point, I will say that the dataset in / dataset out principle should also alleviate some of the burden

@anastasds
Copy link
Contributor

I think that this document should be contextualized by the purpose of the change. "Refactor SDG" - okay, but why? To accomplish what goal? We can keep refactoring forever if we want to.

It seems like the goal here is to modularize the parts of the codebase that deal with the data augmentation phase of the end to end workflow. In order to modularize it effectively, we need to identify and distinguish pre-processing, data generation, and post-processing.

I have not seen a response on this and feel that it is important.

jwm4 added 2 commits December 11, 2024 16:19
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
@jjasghar
Copy link
Member

So reading this, isn't going back to puttind sdg back into the "core" package defeating the purpose of of "core" anyway. I thought the goal per the renaming of core was to start modularizing things.

@jwm4
Copy link
Contributor Author

jwm4 commented Dec 13, 2024

@jjasghar writes:

So reading this, isn't going back to puttind sdg back into the "core" package defeating the purpose of of "core" anyway. I thought the goal per the renaming of core was to start modularizing things.

Sorry for the confusion! The plan here is to keep the actual synthetic data generation in the /SDG repo where it is now, but to move all the preprocessing and postprocessing to core. I will update the "goals" section to make that clearer.

Signed-off-by: Bill Murdock <bmurdock@redhat.com>
@jjasghar
Copy link
Member

This is extremely well thought out, and clear now. Thank you. I approve this to be merged.

Signed-off-by: Bill Murdock <bmurdock@redhat.com>
@jjasghar jjasghar merged commit c601d42 into instructlab:main Dec 17, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants