Use arrow format #56

Open
5 tasks
HelenCEBM opened this issue Jan 24, 2022 · 3 comments

HelenCEBM commented Jan 24, 2022

Feather files are smaller than CSV, so they are more efficient in both storage and processing. They also store dtypes, which helps avoid some type-inference problems when loading the data for further processing.

We initially moved to .csv.gz, which was an improvement on uncompressed CSVs. However, gzip compression uses a significant amount of CPU. We believe that moving to Arrow/Feather would use much less CPU and be an overall improvement.
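
To illustrate the dtype point, here is a minimal sketch using only the Python standard library. The column names and values are invented for illustration and are not taken from any real study. It shows that a round trip through .csv.gz returns every value as a string, so the loading code has to re-declare integers and dates itself. Arrow/Feather files carry a schema alongside the data, which is what removes that step.

```python
import csv
import gzip
import io
from datetime import date

# Hypothetical rows, invented for illustration only.
rows = [
    {"patient_id": 1, "age": 34, "index_date": date(2022, 1, 24)},
    {"patient_id": 2, "age": 71, "index_date": date(2022, 2, 3)},
]

# Write the rows to an in-memory gzip-compressed CSV.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

# Read them back. CSV carries no type information, so every value is a str.
buffer.seek(0)
with gzip.open(buffer, "rt", newline="") as f:
    loaded = list(csv.DictReader(f))

original_types = {k: type(v).__name__ for k, v in rows[0].items()}
loaded_types = {k: type(v).__name__ for k, v in loaded[0].items()}
print(original_types)
print(loaded_types)
```

With a format that stores a schema, the integer and date columns would come back typed without any extra parsing arguments at load time.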

To do:

Contributor

StevenMaude commented Dec 6, 2023

This issue was opened almost two years ago.

What's changed since then:

  • We moved to .csv.gz by default in the documentation examples and this template. It's maybe not as small as .feather/.arrow, but smaller than CSV, and doesn't require much special handling.
  • ehrQL supports .arrow files.
  • The OpenSAFELY Stata image includes a library that can load .arrow files.

What hasn't been done:

  • updating .gitignore
  • explaining somewhere in the ehrQL output formats why you might want to use .arrow (as mentioned above)

@bloodearnest bloodearnest changed the title Use feather format Use arrow format Aug 29, 2024
@bloodearnest
Contributor

Updating this old issue: as discussed in a recent team meeting, I think we should make arrow the default, and not use or mention csv.gz in this template.

We may also want to test an extension like this one and include it in the VS Code setup for viewing arrow files:

https://marketplace.visualstudio.com/items?itemName=w568w.datasets-viewer

Contributor

lucyb commented Oct 1, 2024

I'm going to remove this from the team's board. It's on Eli's list to investigate, and I don't think we should do this work until they've investigated.


4 participants