See one important comment about splitting the load() function into two functions.
If it's easy, I suggest doing it now, since we'll need it in the future anyway. If it's too much effort due to test changes etc., leave it and we'll deal with it once it actually becomes a problem.
def load_dataframe_from_path(path: str) -> pd.DataFrame:
def load_from_path(path: str) -> List[Document]:
Please split this function into two - one that loads a directory of files into a DF, and a separate one that takes a DF and converts it to List[Document].
It would make it easier to parallelize / pipeline these operations in the future, especially as we'll start dealing with larger-than-memory DFs.
I think that this is exactly what I did, am I missing something?
You are doing:

for f in all_files:
    documents.extend(_load_single_file_by_suffix(f))

where _load_single_file_by_suffix() returns List[Document].
What I mean is a _load_single_file() -> DF, and a separate _convert_df_to_docs(), then do something like:

df = pd.concat(
    [_load_single_file(f) for f in all_files],
    axis=0,
    ignore_index=True,
)
documents = _convert_df_to_docs(df)

So we can separate the loading from the conversion (and parallelize them separately in the future).
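For illustration, a minimal sketch of what the two suggested helpers could look like. The names follow the suggestion above; the Document stand-in and the id / text / source / metadata columns are assumptions for the sketch, not the repo's actual schema.

from dataclasses import dataclass, field
import os
from typing import Any, Dict, List

import pandas as pd


@dataclass
class Document:
    # Stand-in for the project's Document model, used only in this sketch.
    id: str
    text: str
    source: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)


def _load_single_file(path: str) -> pd.DataFrame:
    # Load one file into a DataFrame based on its suffix - loading only,
    # no conversion to documents here.
    suffix = os.path.splitext(path)[1].lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".jsonl":
        return pd.read_json(path, lines=True)
    if suffix == ".parquet":
        return pd.read_parquet(path)
    raise ValueError(f"Unsupported file type: {suffix}")


def _convert_df_to_docs(df: pd.DataFrame) -> List[Document]:
    # Convert each row of the already-loaded DataFrame into a Document.
    return [
        Document(
            id=str(row["id"]),
            text=str(row["text"]),
            source=str(row.get("source", "") or ""),
            metadata=row.get("metadata") or {},
        )
        for _, row in df.iterrows()
    ]

With this split, the per-file loads are plain pandas I/O that could later be parallelized, while the DF-to-Document conversion stays a single, separately testable step.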
Not super critical - we can also do that later.
I'm approving anyway, up to you.
If I'm not wrong, the right solution for large amounts of data is to use generators; if we want to hold all the documents in RAM together, we are already aiming for low volume. Anyway, I don't think the intention of the CLI right now is to support loading very large volumes - if it works for 100K documents we are good.
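If the generator route is ever needed, a minimal sketch could look like the following, reusing the hypothetical _load_single_file / _convert_df_to_docs helpers from the sketch above:

from typing import Iterator, List


def iter_documents(all_files: List[str]) -> Iterator[Document]:
    # Stream documents file by file, so only one file's DataFrame has to
    # be held in memory at a time.
    for f in all_files:
        df = _load_single_file(f)
        yield from _convert_df_to_docs(df)

A caller could then upsert in fixed-size batches from this iterator, keeping memory bounded regardless of corpus size.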
All in all LGTM.
Remove upsert dataframe method from KB, and implement it in CLI
Type of Change
Test Plan
Adjusted unit tests for the new functionality.
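For context, a rough sketch of what a CLI-side replacement for the removed KB method could look like. The kb.upsert call, the batch size, and the _convert_df_to_docs helper name are assumptions for illustration, not the actual implementation in this PR.

import pandas as pd


def upsert_dataframe(df: pd.DataFrame, kb, batch_size: int = 100) -> None:
    # CLI-side: convert the DataFrame to documents here, then push them
    # through the knowledge base's regular document upsert in batches.
    docs = _convert_df_to_docs(df)
    for i in range(0, len(docs), batch_size):
        kb.upsert(docs[i:i + batch_size])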