
Do we need the rawdata? #2199

Open · scarlehoff opened this issue Nov 7, 2024 · 7 comments

@scarlehoff (Member)

I'm starting to get worried about the growth of the repository due to the rawdata...

When we cannot rely directly on hepdata it makes sense to save it, but can't we just download the rawdata as part of the filter.py run? I remember that some of you were working on that; would it be possible?

cc @enocera @Radonirinaunimi @giacomomagni

@scarlehoff (Member, Author)

If it is necessary (because we want to keep a copy), I will create a separate repository for the rawdata, since it does not need version control.

@Radonirinaunimi (Member)

I have expressed this concern before, and the solution I proposed at the time was a download_hepdata_table function which reads both the hepdata address and the version from the metadata.yaml (we absolutely want to make sure that the correct version is downloaded) and downloads the appropriate table. As things stand, the best place to put such a function is in the filter_utils. I still think this is the best solution.

What also makes it worse currently is that for some datasets all the hepdata tables are downloaded while only a few of them are needed.
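
A minimal sketch of what such a helper could look like (the function name, the metadata keys, and the single-table query parameters are assumptions for illustration, not the actual filter_utils API):

```python
# Hypothetical sketch: assumes metadata.yaml contains a block like
#
#   hepdata:
#     url: "https://www.hepdata.net/record/ins1234567"
#     version: 1
#
# The URL scheme and query parameters may need adjusting against HEPData.
import pathlib

import requests
import yaml


def download_hepdata_table(metadata_file, table_name, dest_dir):
    """Download a single HEPData table at the version pinned in metadata.yaml."""
    metadata = yaml.safe_load(pathlib.Path(metadata_file).read_text())
    record_url = metadata["hepdata"]["url"]
    version = metadata["hepdata"]["version"]  # pin the exact submission version

    response = requests.get(
        record_url,
        params={"format": "json", "version": version, "table": table_name},
        timeout=30,
    )
    response.raise_for_status()

    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    out = dest / f"{table_name.replace(' ', '_')}.json"
    out.write_bytes(response.content)
    return out
```

Pinning the version explicitly is the important part: without it HEPData serves the latest submission, which may silently change.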

@Radonirinaunimi (Member)

This way we don't store the rawdata in the repo; it only lives locally for those implementing the data, and is downloaded during the CI check of the commondata.

FWIW, we can revive and refine this module.

@enocera (Contributor) commented Nov 7, 2024

I indeed remember @Radonirinaunimi's concern and proposal. I also remember that I did not support it much, hoping (naively) that ONLY the few relevant tables would have been downloaded and stored. Even if you decide not to store the tables, I think that ONLY the relevant ones should be downloaded locally.

@Radonirinaunimi (Member)

> If it is necessary (because we want to keep a copy), I will create a separate repository for the rawdata, since it does not need version control.

But this is also a solution, of course (maybe using git submodules).

@scarlehoff (Member, Author)

Yes, what prompted me to open the issue is that, while revising some of the datasets, I noticed many more tables than necessary were included.

As a first step (in the automatic download approach) we can just download the whole hepdata record for a given version, so we avoid having to deal with specific tables (I don't think that information is correctly included in all metadata). The data would then live only on the computer of the person implementing the dataset and in the CI.
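
A minimal sketch of that first step, assuming HEPData's bulk "Download All" URL pattern (which is an assumption and may need adjusting) and a hypothetical download_full_record helper:

```python
# Hypothetical sketch: fetches the full HEPData submission at a pinned
# version as a gzipped tarball and unpacks it locally (e.g. for filter.py
# or the CI). The URL pattern is an assumption, not a documented guarantee.
import io
import pathlib
import tarfile

import requests


def download_full_record(inspire_id, version, dest_dir):
    """Fetch every table of a HEPData record at a fixed version."""
    url = f"https://www.hepdata.net/download/submission/ins{inspire_id}/{version}/yaml"
    response = requests.get(url, timeout=60)
    response.raise_for_status()

    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # The "Download All" endpoint serves a gzipped tarball of YAML files.
    with tarfile.open(fileobj=io.BytesIO(response.content), mode="r:gz") as tar:
        tar.extractall(dest)
    return dest
```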

@giacomomagni (Contributor)

For what it's worth, I'm also in favour of downloading them when needed; otherwise, storing them as .tar files could be a possibility.
