Refactor data import #144

Open
zeyus opened this issue Mar 28, 2023 · 0 comments

The new implementation reads the schema and imports the data straight after upload.

This needs benchmarking, but even though reading from the filesystem is slow, it may be quicker to read the schema by iterating over all the rows first and then import only the selected fields. Right now, importing 400k tweets takes about 40 minutes. The subsequent delete query (pre-index; the indexed version is being tested now) takes an additional 20 minutes if done in a single query, and more than 60 minutes if done in individual queries.
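A rough sketch of the two-pass idea, assuming the upload is a JSON Lines file (one tweet per line). All function names here are hypothetical and not part of the current codebase; keys that themselves contain dots are not handled.

```python
from __future__ import annotations

import json
from typing import Any, Iterable, Iterator


def iter_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed document per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def key_paths(doc: Any, prefix: str = "") -> Iterator[str]:
    """Yield dotted paths for every non-dict value in a nested dict."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                yield from key_paths(value, path)
            else:
                # Lists and scalars are both treated as leaves here.
                yield path


def infer_schema(rows: Iterable[dict]) -> set[str]:
    """First pass: collect the union of key paths over all rows."""
    schema: set[str] = set()
    for row in rows:
        schema.update(key_paths(row))
    return schema


def project(doc: dict, selected: Iterable[str]) -> dict:
    """Second pass: keep only the selected dotted paths that exist in doc."""
    out: dict = {}
    for path in selected:
        parts = path.split(".")
        value: Any = doc
        for part in parts:
            if not isinstance(value, dict) or part not in value:
                break  # path is missing from this document, skip it
            value = value[part]
        else:
            target = out
            for part in parts[:-1]:
                target = target.setdefault(part, {})
            target[parts[-1]] = value
    return out
```

The first pass only builds a set of paths, so memory stays small even for 400k rows; the second pass re-reads the file and stores `project(row, selected)` instead of the full document, which would make the later delete query unnecessary.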

While we're at it, consider using MongoDB for document storage, joining on a unique key (or the document ID).

See also: https://github.com/NLP4ALL/nlp4all/wiki/Performance

If we go this route (probably more performant), it will require hooks on init-db and drop-db, as well as when adding and deleting data sources.
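One possible shape for the add/delete hooks, as a sketch only. It assumes `collection` is a pymongo `Collection` (only `insert_many` and `delete_many` are used), and that each stored document is tagged with a `data_source_id` field; that field name and both function names are made up for illustration.

```python
from typing import Any, Iterable, Iterator, List


def batched(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group documents into lists of at most `size` items."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def add_data_source(
    collection: Any,
    source_id: int,
    docs: Iterable[dict],
    batch_size: int = 1000,
) -> List[Any]:
    """Hook for adding a data source: insert its documents in batches.

    Returns the inserted document IDs so the caller can store them
    as the join key on the relational side.
    """
    # Tag each document with the (assumed) data_source_id join field.
    tagged = ({**doc, "data_source_id": source_id} for doc in docs)
    ids: List[Any] = []
    for batch in batched(tagged, batch_size):
        result = collection.insert_many(batch, ordered=False)
        ids.extend(result.inserted_ids)
    return ids


def delete_data_source(collection: Any, source_id: int) -> int:
    """Hook for deleting a data source: remove all of its documents."""
    result = collection.delete_many({"data_source_id": source_id})
    return result.deleted_count
```

Batching keeps memory bounded during large imports, and tagging by data source means deletion is a single filtered query rather than one query per document.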

Update
The version with a GIN index on the document column actually takes longer, both for import and for property deletion. This makes sense, as the index has to be updated at each step, and the indexing probably doesn't extend to such deep nesting (it could, if the structure were consistent). It seems MongoDB may be the way to go.

Update 2

MongoDB has now been implemented, and the import now takes 3 minutes. Key deletion still takes around 8 minutes, which leaves one remaining task: process the schema BEFORE import, and only import the required keys. The whole process will then be much quicker, probably totalling around the same 3 minutes.
