This repository has been archived by the owner on Sep 4, 2024. It is now read-only.
Is your feature request related to a problem? Please describe.
Aimee suggests that it is too complicated for users to know how and why they should configure the split_occurrences tool to group their occurrence records so that no single directory ends up with too many files.
Describe the solution you'd like
The tool should detect when the number of files it would generate in a single directory (assume one per species) exceeds a threshold, configurable or not. When that threshold is reached, it should chunk the key field to create a directory hierarchy that reduces the number of files in any one directory. Ideally, the directories created make sense to the user.
Describe alternatives you've considered
Leave things as they are and elaborate in the documentation on how the tool should be used for larger datasets.
Always chunk the key field(s)
For species binomials, I think a clean structure would be something like {First letter of genus}/{Genus}/{Species name}.csv, or maybe the top level is the first two letters of the genus. Acer rubrum -> A/Acer/Acer rubrum.csv or Ac/Acer/Acer rubrum.csv
For something that doesn't have any stand-alone information, like a GBIF accepted taxon id, I don't see a better option than just chunking it by some number of characters. 123456789 -> 123/456/123456789.csv
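A minimal sketch of the fixed-width chunking described above, in Python. The function name `chunked_path` and its `width`, `depth`, and `ext` parameters are hypothetical and are not part of split_occurrences; this only illustrates the idea.

```python
from pathlib import PurePosixPath


def chunked_path(key, width=3, depth=2, ext=".csv"):
    """Build a relative path by slicing the key into fixed-width directory names.

    Hypothetical helper: `width` is the number of characters per directory
    level and `depth` is the number of directory levels to create.
    """
    # Take the first `depth` slices of `width` characters each.
    parts = [key[i * width:(i + 1) * width] for i in range(depth)]
    # Short keys may not fill every level, so drop empty slices.
    parts = [p for p in parts if p]
    return PurePosixPath(*parts, f"{key}{ext}")


print(chunked_path("123456789"))  # 123/456/123456789.csv
```

The full key is kept in the file name so a file can still be identified if it is moved out of its directory.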
It would be easiest to default to chunking by some number of characters when needed, but is that acceptable when the field holds some human-discernible value? Is there a solution that works no matter what the data is and still retains helpful information?
Maybe we could do the following (moving to the next step as needed):
Don't chunk (retains the most information) -> Acer rubrum.csv or 1234567890.csv
Chunk by splitting on space (retains genus name so still clear, would not help numeric values) -> Acer/Acer rubrum.csv or 1234567890/1234567890.csv
Chunk by space first, then by first letter (picks up fields without spaces) -> A/Acer/Acer rubrum.csv or 1/1234567890.csv
Chunk by second (third, fourth, etc) character -> A/Ac/Acer/Acer rubrum.csv or 1/12/1234567890.csv
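The escalation above could be sketched roughly as follows. This is only an illustration: `path_for`, `choose_level`, `max_files`, and `max_level` are made-up names, the default threshold of 1000 is arbitrary, and only files per directory are counted (not subdirectories per directory). Keys are assumed to be unique.

```python
from collections import Counter
from pathlib import PurePosixPath


def path_for(key, level, ext=".csv"):
    """Relative path for `key` at an escalation level (0 = no chunking)."""
    head = key.split(" ", 1)[0]  # genus for a binomial, whole key otherwise
    # Levels 2 and up add prefix directories of 1, 2, ... characters.
    dirs = [head[:n] for n in range(1, level)]
    # Level 1 always uses the first token as a directory; deeper levels
    # keep it only when the key had a space (matches the examples above).
    if level == 1 or (level > 1 and " " in key):
        dirs.append(head)
    return PurePosixPath(*dirs, f"{key}{ext}")


def choose_level(keys, max_files=1000, max_level=6):
    """Return the lowest level at which no directory exceeds `max_files` files."""
    for level in range(max_level + 1):
        counts = Counter(path_for(k, level).parent for k in keys)
        if max(counts.values(), default=0) <= max_files:
            return level
    return max_level  # give up escalating; the caller may want to warn


keys = ["Acer rubrum", "Acer saccharum", "Quercus alba"]
level = choose_level(keys, max_files=2)
for k in keys:
    print(path_for(k, level))
```

Note that a single very large genus can never be broken up by this scheme, since every level groups all of its species together; that case would need a different fallback.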