
Do we need the rawdata? #2199

Open · scarlehoff opened this issue Nov 7, 2024 · 7 comments

@scarlehoff (Member)

I'm starting to get worried about the growth of the repository due to the rawdata...

When we cannot rely directly on hepdata it makes sense to save it, but can't we just download the rawdata as part of the filter.py run? I remember that some of you were working on that; would it be possible?

cc @enocera @Radonirinaunimi @giacomomagni

@scarlehoff (Member, Author)

If it is necessary (because we want to keep a copy), I will create a separate repository for the rawdata, since it does not need version control.

@Radonirinaunimi (Member)

I have expressed this concern before, and the solution I proposed at the time was a download_hepdata_table function which reads both the hepdata address and the version from the metadata.yaml (we absolutely want to make sure that the correct version is downloaded) and downloads the appropriate table. As things stand, the best place to put such a function is in the filter_utils. I still think this is the best solution.

What also makes it worse currently is that for some datasets all the hepdata tables are downloaded while only a few of them are needed.
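
A minimal sketch of what such a helper could look like (the function name, the metadata keys, and the single-table query parameters are assumptions for illustration, not the actual filter_utils API):

```python
# Hypothetical sketch: assumes metadata.yaml contains a block like
#
#   hepdata:
#     url: "https://www.hepdata.net/record/ins1234567"
#     version: 1
#
# The URL scheme and query parameters may need adjusting against HEPData.
import pathlib

import requests
import yaml


def download_hepdata_table(metadata_file, table_name, dest_dir):
    """Download a single HEPData table at the version pinned in metadata.yaml."""
    metadata = yaml.safe_load(pathlib.Path(metadata_file).read_text())
    record_url = metadata["hepdata"]["url"]
    version = metadata["hepdata"]["version"]  # pin the exact submission version

    response = requests.get(
        record_url,
        params={"format": "json", "version": version, "table": table_name},
        timeout=30,
    )
    response.raise_for_status()

    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    out = dest / f"{table_name.replace(' ', '_')}.json"
    out.write_bytes(response.content)
    return out
```

Pinning the version explicitly is the important part: without it HEPData serves the latest submission, which may silently change.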

@Radonirinaunimi (Member)

This way we don't store the rawdata in the repo; it only lives locally for those implementing the data, and is downloaded during the CI check of the commondata.

FWIW, we can revive and refine this module.

@enocera (Contributor) commented Nov 7, 2024

I indeed remember @Radonirinaunimi's concern and proposal. I also remember that I did not support it much, hoping (naively) that ONLY the few relevant tables would have been downloaded and stored. Even if you decide not to store the tables, I think that ONLY the relevant ones should be downloaded locally.

@Radonirinaunimi (Member)

> If it is necessary (because we want to keep a copy), I will create a separate repository for the rawdata, since it does not need version control.

But this is also a solution, of course (maybe using git submodules).

@scarlehoff (Member, Author)

Yes, what prompted me to open the issue is that, while revising some of the datasets, I noticed many more tables than necessary were included.

As a first step (in the automatic download approach) we can just download the whole hepdata record for a given version, so we avoid having to deal with specific tables (I don't think that information is correctly included in all metadata). The data would then live only on the computer of the person implementing the dataset and in the CI.
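
A minimal sketch of that first step, assuming HEPData's bulk "Download All" URL pattern (which is an assumption and may need adjusting) and a hypothetical download_full_record helper:

```python
# Hypothetical sketch: fetches the full HEPData submission at a pinned
# version as a gzipped tarball and unpacks it locally (e.g. for filter.py
# or the CI). The URL pattern is an assumption, not a documented guarantee.
import io
import pathlib
import tarfile

import requests


def download_full_record(inspire_id, version, dest_dir):
    """Fetch every table of a HEPData record at a fixed version."""
    url = f"https://www.hepdata.net/download/submission/ins{inspire_id}/{version}/yaml"
    response = requests.get(url, timeout=60)
    response.raise_for_status()

    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # The "Download All" endpoint serves a gzipped tarball of YAML files.
    with tarfile.open(fileobj=io.BytesIO(response.content), mode="r:gz") as tar:
        tar.extractall(dest)
    return dest
```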

@giacomomagni (Contributor)

For what it's worth, I'm also in favour of downloading them when needed; otherwise, storing them as .tar files could be a possibility.
