
Feat: Update convert function in tardis.py to handle .parquet files #120

Merged
merged 6 commits into nkaz001:master on Aug 12, 2024

Conversation

ian-wazowski

@ian-wazowski ian-wazowski commented Aug 12, 2024

Changes

    for file in input_files:
        print('Reading %s' % file)
        if file.endswith('.csv'):
            df = pl.read_csv(file)
        elif file.endswith('.parquet'):
            df = pl.read_parquet(file, pyarrow_options={'use_threads': True})
        else:
            raise ValueError('Unsupported file format: %s' % file)
        if df.columns == trade_cols:
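The extension-based dispatch in the snippet above can be factored into a small standalone helper. This is an illustrative sketch, not code from the PR; `pick_reader` and the reader mapping are hypothetical names:

```python
# Sketch: map file extensions to reader callables, mirroring the
# .csv / .parquet / else-raise dispatch in the PR diff above.

def pick_reader(path: str, readers: dict):
    """Return the reader registered for the file's extension.

    Raises ValueError for unsupported formats, like the PR's else branch.
    """
    for suffix, reader in readers.items():
        if path.endswith(suffix):
            return reader
    raise ValueError('Unsupported file format: %s' % path)
```

With polars, the mapping could be `{'.csv': pl.read_csv, '.parquet': pl.read_parquet}`, keeping the reading loop itself free of format-specific branches.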

Related

discord chat

@nkaz001 nkaz001 merged commit 1734cda into nkaz001:master Aug 12, 2024
3 checks passed
@nkaz001
Owner

nkaz001 commented Aug 12, 2024

Tardis.dev provides the file in .csv.gz format.
By the way, does Tardis also provide data in parquet format?

@ian-wazowski
Author

ian-wazowski commented Aug 13, 2024

> Tardis.dev provides the file in .csv.gz format. By the way, does Tardis also provide data in parquet format?

No. I'm working on downloading the Tardis dataset and then converting it to Parquet (LZ4 compression, column-wise encoding).

It's about 10x faster to read than .csv.gz, and the compression ratio improves by about 10-15%.
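The csv.gz → Parquet conversion described here could be sketched as follows, assuming polars is available. The file names and the `parquet_path` helper are hypothetical; polars' `read_csv` can decompress gzip input, and `write_parquet` accepts `compression='lz4'`:

```python
def parquet_path(src: str) -> str:
    """Derive the output .parquet path from a .csv.gz source path."""
    if not src.endswith('.csv.gz'):
        raise ValueError('expected a .csv.gz file: %s' % src)
    return src[:-len('.csv.gz')] + '.parquet'


def convert_csv_gz_to_parquet(src: str) -> str:
    """Read a Tardis .csv.gz file and rewrite it as LZ4-compressed Parquet."""
    import polars as pl  # deferred import so the path helper works standalone

    dst = parquet_path(src)
    pl.read_csv(src).write_parquet(dst, compression='lz4')
    return dst
```

LZ4 trades a little compression ratio for very fast decompression, which matches the read-speed motivation above; `zstd` (polars' default) would compress smaller but decode somewhat slower.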

@nkaz001
Owner

nkaz001 commented Aug 13, 2024

The processing time required to convert raw Tardis data into Parquet format also needs to be taken into account. In any case, I believe it's more appropriate to provide this as a separate data utility, since the converted data has already been processed and is no longer the raw Tardis data.


3 participants