Refactor preprocessing and postprocessing in SDG #155
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
Left a few lengthy comments so I will let people read those individually.
I generally agree with the idea that separate libraries do not make sense. However, I strongly feel that logically the "cli" repo, instructlab/instructlab, should own data ingestion, since it owns model, data, config, and filesystem management generally and is dedicated to the user experience of the project.
docs/sdg/sdg-refactor.md
One way to support both 1.3.1 and 1.3.2 would be to have separate CLI commands for the preprocessing, core SDG, and postprocessing steps. Alternatively, a single CLI command that does all of these and also saves the outputs of preprocessing to disk would support 1.3.1 but _not_ 1.3.2. Even if we only want to support 1.3.1, separate CLI commands for each step might still be desirable: a user who wants to save the outputs of preprocessing to disk would more intuitively expect a dedicated command for that than a "side effect" of an omnibus SDG command. Here is a rough outline of what the separate commands would be:
- `ilab data prep` would handle all the preprocessing (the first three bullets in the Context section above, plus any additional preprocessing we add in the future).
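
As a rough illustration of the split described above, here is a minimal sketch of one subcommand per step. It uses the standard-library `argparse` only to stay self-contained; it is not how the `ilab` CLI is implemented. Only the `prep` name comes from the proposal. The `generate` and `postprocess` names, the `--input` and `--output` flags, and the `plan` helper are placeholders assumed for this sketch.

```python
import argparse

# One entry per step. Only "prep" is named in the proposal; the other two
# subcommand names are placeholders assumed for this sketch.
STEPS = {
    "prep": "preprocess inputs and save the results to disk",
    "generate": "run core SDG on previously saved preprocessing output",
    "postprocess": "postprocess previously saved SDG output",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per pipeline step."""
    parser = argparse.ArgumentParser(prog="ilab-data-sketch")
    subparsers = parser.add_subparsers(dest="step", required=True)
    for name, help_text in STEPS.items():
        sub = subparsers.add_parser(name, help=help_text)
        # Hypothetical flags: every step reads from disk and writes to disk,
        # so the output of one step can be inspected or edited before the next.
        sub.add_argument("--input", required=True, help="directory to read from")
        sub.add_argument("--output", required=True, help="directory to write to")
    return parser


def plan(argv: list) -> str:
    """Parse the arguments and describe what the chosen step would do."""
    args = build_parser().parse_args(argv)
    return f"{args.step}: {args.input} -> {args.output}"


if __name__ == "__main__":
    print(plan(["prep", "--input", "taxonomy", "--output", "prepped"]))
```

Because each step takes its input from disk, a user could run only the preprocessing step and keep its output, or supply their own preprocessed data to the later steps.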
As we discussed, I like the split of these commands, the modularity makes sense to me!
I too like splitting this out, although I'm not entirely sure whether this is the right dimension, or whether conversion of documents with Docling should be entirely separate from traversing a taxonomy, extracting data from qna.yaml files, and fetching the referenced knowledge. To put it more specifically, we may want to end up with something like `ilab document convert` that converts a set of input documents to a set of Docling JSON or markdown files that may subsequently be hand-edited, committed to a git repository, or something else. In the spirit of this document, we probably don't have to decide this here and now. But document conversion may have a separate iteration loop from a user's point of view, and not just be something that happens automatically during SDG.
I think that this document should be contextualized by the purpose of the change. "Refactor SDG" - okay, but why? To accomplish what goal? We can keep refactoring forever if we want to. It seems like the goal here is to modularize the parts of the codebase that deal with the data augmentation phase of the end-to-end workflow. In order to modularize it effectively, we need to identify and distinguish pre-processing, data generation, and post-processing. The question that follows is where these new, modularized pieces should live. This is another architectural decision - it looks like everyone thinks it should go into the CLI, at least for now. In order to do that while still achieving the goal of modularization and without creating a new repository, we will have to do it in a certain way. The methodical line of thinking here becomes much easier when individual decisions are treated individually, which is why I have been advocating for a series of ADRs that are each individually digestible, easier to discuss, and easier to peer review. That would look like a series of discrete decisions that would be something like this sequence:
As I discuss in my ADR doc, treating each of these decisions individually and critically engaging with the "Consequences" section of the format would yield direct next steps. For example, in order to move pre-/post-processing out of the SDG codebase, there will likely have to be inter-team collaboration, there will have to be a discussion around how to incorporate it into the CLI codebase, backwards compatibility will probably require some care, and so on - these things fall naturally out of that vehicle for technical writing and decision making. Finally, I think that this document should make a decision and state that clearly. If no objections are raised from stakeholders, it moves forward. If there are, then we can address those and adjust as appropriate. Without taking some sort of stance, however, this will continue to sit in limbo.
I've finally had time to review this document - thanks @jwm4 for putting it together!
My general impression is that I agree with the decisions at the bottom of the current draft. Most strongly, I feel the preprocessing and postprocessing steps should be a part of the CLI at this point. That being said, one thing we should consider in the design is ensuring that the interfaces SDG exposes for these inputs and outputs are not too tightly coupled to the CLI: we need SDG to be consumable independently by the future API, UI, etc. With that work still in the early stages, I think moving these things into the CLI makes sense at this time as a POC of what such an interface should look like.
- Avoids the pros of both Option 1 and Option 2.
- As with Option 1, this approach involves a lot of coordination. There are a lot of stakeholders involved in the `instructlab/instructlab` repository and locating preprocessing and postprocessing there drags those stakeholders into issues relating to preprocessing and postprocessing. However, as with Option 1, coordinating across stakeholders is something an open source project needs to do well anyway to remain engaged with the community.
- As with Option 1, this approach suffers from the fact that keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. In the case of Option 1, the interconnected components that would not have as much pressure to be documented would be preprocessing/postprocessing and core SDG. In the case of Option 3, the interconnected components that would not have as much pressure to be documented would be the command-line interface code and preprocessing/postprocessing. However, in both cases, this con could be alleviated by just having the discipline to document the APIs well even without such pressure.
As it pertains to this point, I will say that the dataset in / dataset out principle should also alleviate some of the burden.
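
To make the dataset in / dataset out idea concrete, here is a minimal sketch in which every step is just a function from a dataset to a dataset, so the contract between steps is the record shape and nothing else. The `Dataset` alias, the record fields, and the `toy_*` functions are assumptions made for illustration; none of them are part of the SDG library or the CLI.

```python
from typing import Callable, Dict, List

# Hypothetical record shape for this sketch: a flat dict of strings.
Record = Dict[str, str]
Dataset = List[Record]
Step = Callable[[Dataset], Dataset]


def run_pipeline(dataset: Dataset, steps: List[Step]) -> Dataset:
    """Apply each step in order; every step takes a dataset and returns one."""
    for step in steps:
        dataset = step(dataset)
    return dataset


def toy_preprocess(dataset: Dataset) -> Dataset:
    # Stand-in for preprocessing: normalize whitespace in the text field.
    return [{**record, "text": record["text"].strip()} for record in dataset]


def toy_generate(dataset: Dataset) -> Dataset:
    # Stand-in for core SDG: attach a placeholder "synthetic" field.
    return [{**record, "synthetic": f"Q: {record['text']}"} for record in dataset]


def toy_postprocess(dataset: Dataset) -> Dataset:
    # Stand-in for postprocessing: drop records whose text ended up empty.
    return [record for record in dataset if record["text"]]


if __name__ == "__main__":
    raw = [{"text": "  hello "}, {"text": "   "}]
    print(run_pipeline(raw, [toy_preprocess, toy_generate, toy_postprocess]))
```

With this shape, the same step functions could be called from a CLI command, an API handler, or a UI backend without any of them knowing about the others.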
I have not seen a response on this and feel that it is important.
So reading this, isn't going back to putting
@jjasghar writes:
Sorry for the confusion! The plan here is to keep the actual synthetic data generation in the
This is extremely well thought out, and clear now. Thank you. I approve this to be merged.
This is a draft proposal that describes pros and cons of various options for refactoring preprocessing and postprocessing in SDG. It includes draft decisions, but the purpose of the draft is not to assert that these are the final decisions but rather to act as a vehicle for reaching final decisions. So comments are strongly encouraged, and readers should not expect that the decisions in the current draft will be unchanged between now and the time when this document is merged.