iRODS ingest

In the last few years there has been a lot of development around iRODS at the WUR. As main author, I was personally involved in quite a few of these innovations. Each one taught me new lessons, which I combined in this script to help others get started with uploading data to the tape archive.

This repository contains a configurable script to move data to iRODS/tape and add the metadata provided in an Excel file. It was built to work with the WUR iRODS instances, but its modular design makes it easy to adapt to other iRODS setups. Although Ubuntu and macOS are supported, this has not been tested and might contain some bugs.

How it works in a nutshell:

  1. Load the Excel file and check whether the Foldernames with a v in the _to_upload column exist in the folder
  2. Generate in_progress.csv; this file is used to store the iRODS paths, zip paths and the current status of the processing
  3. Use multiprocessing to zip folders, if desired
  4. Upload the files/folders to iRODS
  5. Add the metadata to the file/folder
  6. If desired, send the file to tape
  7. Check whether the file is on tape

The statuses in the _status columns, and the transitions from one to another, are visualized in the image below:

[Image: statuses]

At each status change the in_progress.csv file is updated, so the script can continue where it stopped. In addition, it uses multiprocessing to make optimal use of the available resources, including a way to limit the disk space usage. Before the upload, various checks are performed to ensure that iRODS and SQL naming conventions are met. It is also advised to check the metadata for consistency (not implemented).
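The resume behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the column name `Foldername` comes from this README, while the single `_status` column, the status values in `STATUS_ORDER` and all function names are assumptions made for the example.

```python
import csv
from pathlib import Path

# Hypothetical status order; the real script defines its own statuses.
STATUS_ORDER = ["to_zip", "zipped", "uploaded", "on_tape"]


def load_progress(path):
    """Return the rows of the progress file, or an empty list if it does not exist yet."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def save_progress(path, rows):
    """Rewrite the whole progress file so it always reflects the latest statuses."""
    with Path(path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["Foldername", "_status"])
        writer.writeheader()
        writer.writerows(rows)


def advance(path, rows, foldername):
    """Move one task to its next status and persist the change immediately."""
    for row in rows:
        if row["Foldername"] == foldername and row["_status"] != STATUS_ORDER[-1]:
            row["_status"] = STATUS_ORDER[STATUS_ORDER.index(row["_status"]) + 1]
    save_progress(path, rows)


def pending(rows):
    """Tasks that still need work after a restart."""
    return [row for row in rows if row["_status"] != STATUS_ORDER[-1]]
```

Because the file is rewritten after every status change, a restarted run can call `load_progress` and continue with whatever `pending` returns.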

Important notes:

  • It is advised to only upload files, or zipped folders, to tape
  • The WUR iRODS instances use S3 to talk to underlying storage like the tape archive. The S3 API limits the maximum file size to 5TB. When using WinRAR, files and folders > 5TB are zipped in parts using the standard multi-part zip approach. Unfortunately, Python's zip implementations don't support this at the time of writing. Metadata is only added to the first part.
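To get a feel for the 5TB limit, the sketch below estimates how many parts a file or folder would need. It is only an illustration: the function names are hypothetical, "5TB" is assumed to mean 5 TiB, and the uncompressed size is used as an upper bound for the size of the resulting zip.

```python
from pathlib import Path

MAX_PART_BYTES = 5 * 1024**4  # assumption: "5TB" is read as 5 TiB


def folder_size(path):
    """Sum the sizes of all regular files below path (or the size of the file itself)."""
    path = Path(path)
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def parts_needed(size_bytes, limit=MAX_PART_BYTES):
    """Number of archive parts needed so that no part exceeds limit (at least 1)."""
    return max(1, -(-size_bytes // limit))  # integer ceiling division
```

A result greater than 1 means the item can only be handled when splitting is available.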

NPEC-specific details one has to change before adopting the script:

  • Metadata is set with a 'NPEC_' prefix, in i_operations.py > add_metadata > tagname = f"NPEC_{col}"
  • iRODS paths are created using the NPEC structure: ipath = target_ipath.joinpath('M4', row['System'].upper(), str(row['Year'])) in helpers.py > create_task_df.
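One way to make these two spots easier to adapt is to pull the prefix and the path layout into small functions. The sketch below is not the code from i_operations.py or helpers.py; `TAG_PREFIX`, `tag_name` and `build_ipath` are hypothetical names, `PurePosixPath` is an assumption for the path type, and the zone path in the usage note is made up.

```python
from pathlib import PurePosixPath

TAG_PREFIX = "NPEC_"  # change (or empty) this for another organisation


def tag_name(col, prefix=TAG_PREFIX):
    """Metadata attribute name for an Excel column, e.g. 'NPEC_Crop'."""
    return f"{prefix}{col}"


def build_ipath(target_ipath, row):
    """Mirror the NPEC layout <target>/M4/<SYSTEM>/<Year>; replace with your own structure."""
    return PurePosixPath(target_ipath).joinpath("M4", row["System"].upper(), str(row["Year"]))
```

For example, `build_ipath("/zone/home/npec", {"System": "g8", "Year": 2024})` gives `/zone/home/npec/M4/G8/2024`.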

Preparation

The key to this whole process is the Excel file accompanying the data in the folder. It has two mandatory columns:

  • Foldername: name of the file/folder to upload; this must be placed in the same folder as the Excel file!
  • _to_upload: add a v in this column to indicate the row should be uploaded; all rows without a v are ignored.

Additional columns can be added; these will become metadata fields. They may not start with a _! As an example, NPEC uses the following columns:

  • Foldername: name of the file/folder
  • Year: 2024
  • NPEC module: ClimateCells, Greenhouse, OpenField
  • System: G8, Traitseeker, UAVS
  • Client: requester of the experiment
  • Crop: plants (comma-separated list)
  • Comment: free text
  • _to_upload: add a v

An example Excel is provided in the test_data folder:

[Image: test_excel]
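The column rules from this section can be checked before anything is uploaded. The following is a minimal sketch under a few assumptions: the rows are already available as a list of dictionaries (for instance read from the Excel file with a library of your choice), the function names are hypothetical, and the two mandatory columns are assumed not to become metadata themselves.

```python
from pathlib import Path

REQUIRED = ("Foldername", "_to_upload")


def select_rows(rows, source_dir):
    """Return (rows marked with a v whose file/folder exists, list of problems found)."""
    selected, problems = [], []
    for number, row in enumerate(rows, start=1):
        missing = [col for col in REQUIRED if col not in row]
        if missing:
            problems.append(f"row {number}: missing column(s) {missing}")
            continue
        if str(row["_to_upload"]).strip() != "v":
            continue  # rows without a v are ignored
        if not (Path(source_dir) / str(row["Foldername"])).exists():
            problems.append(f"row {number}: {row['Foldername']} not found in {source_dir}")
            continue
        selected.append(row)
    return selected, problems


def metadata_columns(row):
    """Columns that become metadata: not mandatory and not starting with an underscore."""
    return [col for col in row if col not in REQUIRED and not col.startswith("_")]
```

Reporting all problems at once, instead of stopping at the first one, makes it easier to fix the Excel file in a single pass.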

Config

A config file is used to pass all the parameters. The goal is to make it easy to repurpose the code:

{
    "SMB_MOUNT": true, # Used to automatically mount an SMB mountable disk like the W (isilon) at WUR
    "ZIP_FOLDERS": true, # Zip the folders before uploading, this is advised when sending data to the tape archive
    "ZIP_SPLIT_ABOVE_5TB": true, # The S3 API has a maximum file size of 5TB. When true, bigger files are split. When false, they are ignored.
    "TO_TAPE": true, # Whether or not to trigger the archive rule to move the data from disk to tape after uploading
    "NUM_ZIPPERS": 1, # Number of zip processes
    "NUM_IWORKERS": 1, # Number of iRODS upload processes
    "SMB": {
        "SMB_USER": "<user>", # SMB username
        "SMB_PATH":"\\\\fs02mixedsmb.wurnet.nl\\TPE-STANDARD_PROJECTS$\\PROJECTS~NPEC_climaterooms\\", # Example path
        "SMB_LETTER": "W" # Device letter to mount the disk to
    },
    "IRODS_ENV_FILE": "irods_environment.json",
    "LOCAL_SOURCE_PATH": "C:\\iRODS_ingest\\test_data", # Folder containing the data to upload
    "LOCAL_ZIP_TEMP": "Z:\\test_zips", # Path to create the temporary zipfiles when uploading folders
    "LOCAL_ZIP_SPACE": "30GB", # Size of the temporary zip area in human readable size, to avoid overflowing disks
    "IRODS_TARGET_PATH": "", # ignored due to bug, see notes
    "METADATA_EXCEL": "test_metadata.xlsx", # Excel file with the list of files to upload and metadata
    "PROGRESS_FILE": "progres.csv" # Optional: absolute location of the progress CSV file. If only a filename is entered, the current working directory is used. The default location is the directory of the code.
}

The default location of the config file is the code folder. The config file can also be passed as an argument: python main.py --config path/to/config.json
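Reading such a config could look roughly like the sketch below. It is not the code from main.py: it assumes the file on disk is plain JSON (the # comments shown in the example above are not valid JSON), it uses config.json in the working directory as a simplified default, it assumes binary units for sizes such as "30GB", and `LOCAL_ZIP_SPACE_BYTES` is a hypothetical derived key.

```python
import argparse
import json
import re

UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(text):
    """Convert a human-readable size such as '30GB' to bytes (binary units assumed)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", text.upper())
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    return int(float(match.group(1)) * UNITS[match.group(2)])


def load_config(argv=None):
    """Parse --config from the command line and load the JSON file it points to."""
    parser = argparse.ArgumentParser(description="iRODS ingest (sketch)")
    parser.add_argument("--config", default="config.json")  # simplified default location
    args = parser.parse_args(argv)
    with open(args.config, encoding="utf-8") as handle:
        config = json.load(handle)
    # Hypothetical derived key so the rest of the code can compare byte counts.
    config["LOCAL_ZIP_SPACE_BYTES"] = parse_size(config.get("LOCAL_ZIP_SPACE", "0B"))
    return config
```

Calling `load_config()` without arguments reads `sys.argv`, so `python main.py --config path/to/config.json` would work unchanged with this approach.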

Zipping

The default zip implementations in Python work, but they are awfully slow, as is the one in Windows itself. The search for alternatives turned up this comparison: https://peazip.github.io/peazip-compression-benchmark.html As WinRAR was one of the fastest and also has an API, it was preferred and implemented. Note that special file formats like 7z might be faster, but they also require every user to install that tool, while zips are universal.

A simple benchmark comparison on a folder of 1.18 TB containing 103,488 files and 8,099 folders:

  • WinRAR created a zip file of 666 GB in 6:47:00
  • shutil.make_archive created a zip file of 666 GB in 1 day and 9:17:00

When a user installs WinRAR, the zip implementation will detect it and use it instead of shutil.
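The detect-and-fall-back idea can be sketched as follows. This is an illustration only: the executable names in `WINRAR_CANDIDATES` and the function names are assumptions, the script may locate WinRAR differently, and the actual WinRAR call is left out here.

```python
import shutil
from pathlib import Path

# Assumed executable names; adjust these to the local installation.
WINRAR_CANDIDATES = ("WinRAR.exe", "rar")


def find_winrar(candidates=WINRAR_CANDIDATES):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        found = shutil.which(name)
        if found:
            return found
    return None


def zip_with_shutil(folder, zip_dir):
    """Fallback: zip folder into zip_dir/<foldername>.zip with shutil.make_archive."""
    folder = Path(folder)
    base = Path(zip_dir) / folder.name
    return shutil.make_archive(
        str(base), "zip", root_dir=str(folder.parent), base_dir=folder.name
    )
```

A caller would check `find_winrar()` first and only use `zip_with_shutil` when it returns None.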