This repository has been archived by the owner on Nov 13, 2024. It is now read-only.

Remove upsert df from kb #67

Merged
merged 7 commits into from
Oct 15, 2023

Conversation

acatav
Contributor

@acatav commented Oct 11, 2023

Remove upsert dataframe method from KB, and implement it in CLI

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Infrastructure change (CI configs, etc)
  • Non-code change (docs, etc)
  • None of the above: (explain here)

Test Plan

Adjusted unit tests for the new functionality

Contributor

@igiloh-pinecone left a comment


See one important comment about splitting the load() function into two functions.
If it's easy, I suggest doing it now, since we'll need it in the future anyway. If it's too much effort due to test changes etc., leave it; we'll deal with it once it actually becomes a problem.

resin_cli/cli.py (outdated, resolved)
resin_cli/cli.py (outdated, resolved)
resin/knoweldge_base/knowledge_base.py (resolved)
resin_cli/data_loader/data_loader.py (outdated, resolved)
resin_cli/data_loader/data_loader.py (outdated, resolved)
resin_cli/data_loader/data_loader.py (resolved)


def load_dataframe_from_path(path: str) -> pd.DataFrame:
def load_from_path(path: str) -> List[Document]:
Contributor


Please split this function into two: one that loads a directory of files into a DataFrame, and a separate one that takes a DataFrame and converts it to List[Document].

It would make it easier to parallelize / pipeline these operations in the future, especially as we start dealing with larger-than-memory DataFrames.

Contributor Author


I think this is exactly what I did; am I missing something?

Contributor


You are doing:

        for f in all_files:
            documents.extend(_load_single_file_by_suffix(f))

Where _load_single_file_by_suffix() returns List[Document].

What I mean is a _load_single_file() that returns a DataFrame, plus a separate _convert_df_to_docs(); then do something like:

        df = pd.concat(
            [_load_single_file(f) for f in all_files],
            axis=0,
            ignore_index=True,
        )
        documents = _convert_df_to_docs(df)

So we can separate the loading from the conversion (and parallelize them separately, in the future)

Contributor


Not super critical - we can also do that later.
I'm approving anyway, up to you.

Contributor Author


If I'm not wrong, the right solution for large amounts of data is to use generators; if we hold all the documents in RAM at once, we're already targeting low volume. In any case, I don't think the CLI is currently intended to support very large volumes. If it works for 100K documents, we're good.
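The generator approach mentioned here could be sketched roughly as follows. The function name, chunk size, and JSONL-only handling are assumptions for illustration, not resin's actual CLI code:

```python
from typing import Iterator, List

import pandas as pd


def iter_dataframes(paths: List[str], chunksize: int = 1000) -> Iterator[pd.DataFrame]:
    """Yield DataFrame chunks lazily instead of concatenating all files
    into one in-memory DataFrame (useful for larger-than-memory inputs)."""
    for path in paths:
        # With lines=True and a chunksize, pd.read_json returns an
        # iterator of DataFrames rather than one big frame.
        for chunk in pd.read_json(path, lines=True, chunksize=chunksize):
            yield chunk
```

Downstream code could then convert and upsert each chunk as it arrives, keeping peak memory bounded by the chunk size rather than the dataset size.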

tests/unit/cli/test_data_loader.py (resolved)
Contributor

@igiloh-pinecone left a comment


All in all LGTM.

@acatav acatav merged commit b0ad780 into dev Oct 15, 2023
9 checks passed
@acatav acatav deleted the remove-upsert-df-from-kb branch October 15, 2023 20:07