See one important comment about splitting the load() function into two functions.
If it's easy, I suggest doing it now, since we'll need it in the future anyway. If it's too much effort due to test changes etc., leave it and we'll deal with it once it actually becomes a problem.
def load_dataframe_from_path(path: str) -> pd.DataFrame:
def load_from_path(path: str) -> List[Document]:
Please split this function into two - one that loads a directory of files into a DF, and a separate one that takes a DF and converts it to List[Document].
It would make it easier to parallelize / pipeline these operations in the future, especially as we'll start dealing with larger-than-memory DFs.
I think that this is exactly what I did, am I missing something?
You are doing:

for f in all_files:
    documents.extend(_load_single_file_by_suffix(f))

where _load_single_file_by_suffix() returns List[Document].
What I mean is a _load_single_file() -> DF, and a separate _convert_df_to_docs(), then do something like:

df = pd.concat(
    [_load_single_file(f) for f in all_files],
    axis=0,
    ignore_index=True,
)
documents = _convert_df_to_docs(df)

So we can separate the loading from the conversion (and parallelize them separately in the future).
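For illustration, a minimal sketch of what the two suggested helpers could look like. The names follow the suggestion above; the Document stand-in and the id / text / source / metadata columns are assumptions for the sketch, not the repo's actual schema.

from dataclasses import dataclass, field
import os
from typing import Any, Dict, List

import pandas as pd


@dataclass
class Document:
    # Stand-in for the project's Document model, used only in this sketch.
    id: str
    text: str
    source: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)


def _load_single_file(path: str) -> pd.DataFrame:
    # Load one file into a DataFrame based on its suffix - loading only,
    # no conversion to documents here.
    suffix = os.path.splitext(path)[1].lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".jsonl":
        return pd.read_json(path, lines=True)
    if suffix == ".parquet":
        return pd.read_parquet(path)
    raise ValueError(f"Unsupported file type: {suffix}")


def _convert_df_to_docs(df: pd.DataFrame) -> List[Document]:
    # Convert each row of the already-loaded DataFrame into a Document.
    return [
        Document(
            id=str(row["id"]),
            text=str(row["text"]),
            source=str(row.get("source", "") or ""),
            metadata=row.get("metadata") or {},
        )
        for _, row in df.iterrows()
    ]

With this split, the per-file loads are plain pandas I/O that could later be parallelized, while the DF-to-Document conversion stays a single, separately testable step.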
Not super critical - we can also do that later.
I'm approving anyway, up to you.
If I'm not wrong, the right solution for large amounts of data is to use generators; if we want to hold all the documents in RAM together, we are already aiming for low volume. Anyway, I don't think the intention of the CLI right now is to support loading very large volumes - if it works for 100K documents we are good.
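If the generator route is ever needed, a minimal sketch could look like the following, reusing the hypothetical _load_single_file / _convert_df_to_docs helpers from the sketch above:

from typing import Iterator, List


def iter_documents(all_files: List[str]) -> Iterator[Document]:
    # Stream documents file by file, so only one file's DataFrame has to
    # be held in memory at a time.
    for f in all_files:
        df = _load_single_file(f)
        yield from _convert_df_to_docs(df)

A caller could then upsert in fixed-size batches from this iterator, keeping memory bounded regardless of corpus size.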
All in all LGTM.
Remove upsert dataframe method from KB, and implement it in CLI
Type of Change
Test Plan
Adjusted unit tests for the new functionality.
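For context, a rough sketch of what a CLI-side replacement for the removed KB method could look like. The kb.upsert call, the batch size, and the _convert_df_to_docs helper name are assumptions for illustration, not the actual implementation in this PR.

import pandas as pd


def upsert_dataframe(df: pd.DataFrame, kb, batch_size: int = 100) -> None:
    # CLI-side: convert the DataFrame to documents here, then push them
    # through the knowledge base's regular document upsert in batches.
    docs = _convert_df_to_docs(df)
    for i in range(0, len(docs), batch_size):
        kb.upsert(docs[i:i + batch_size])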