Merge branch 'main' into remove_reff_resolver
h-mayorquin authored Dec 6, 2024
2 parents d60f12d + c4afad3 commit 4f4e6e0
Showing 13 changed files with 504 additions and 11 deletions.
@@ -11,7 +11,6 @@ concurrency: # Cancel previous workflows on the same pull request
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
DANDI_API_KEY: ${{ secrets.DANDI_API_KEY }}

jobs:
run:
@@ -36,8 +35,8 @@ jobs:
git config --global user.email "CI@example.com"
git config --global user.name "CI Almighty"
- name: Install full requirements
- name: Install AWS requirements
run: pip install .[aws,test]

- name: Run subset of tests that use S3 live services
run: pytest -rsx -n auto tests/test_minimal/test_tools/aws_tools.py
- name: Run generic AWS tests
run: pytest -rsx -n auto tests/test_minimal/test_tools/aws_tools_tests.py
46 changes: 46 additions & 0 deletions .github/workflows/rclone_aws_tests.yml
@@ -0,0 +1,46 @@
name: Rclone AWS Tests
on:
  schedule:
    - cron: "0 16 * * 2"  # Weekly at noon on Tuesday
  workflow_dispatch:

concurrency:  # Cancel previous workflows on the same pull request
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  RCLONE_DRIVE_ACCESS_TOKEN: ${{ secrets.RCLONE_DRIVE_ACCESS_TOKEN }}
  RCLONE_DRIVE_REFRESH_TOKEN: ${{ secrets.RCLONE_DRIVE_REFRESH_TOKEN }}
  RCLONE_EXPIRY_TOKEN: ${{ secrets.RCLONE_EXPIRY_TOKEN }}
  DANDI_API_KEY: ${{ secrets.DANDI_API_KEY }}

jobs:
  run:
    name: ${{ matrix.os }} Python ${{ matrix.python-version }}
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.12"]
        os: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v4
      - run: git fetch --prune --unshallow --tags
      - name: Setup Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Global Setup
        run: |
          python -m pip install -U pip  # Official recommended way
          git config --global user.email "CI@example.com"
          git config --global user.name "CI Almighty"
      - name: Install AWS requirements
        run: pip install .[aws,test]

      - name: Run RClone on AWS tests
        run: pytest -rsx -n auto tests/test_on_data/test_yaml/yaml_aws_tools_tests.py
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,12 @@
# Upcoming

## Features
* Added the `rclone_transfer_batch_job` helper function for executing Rclone data transfers in AWS Batch jobs. [PR #1085](https://github.com/catalystneuro/neuroconv/pull/1085)



## v0.6.4

## Deprecations
* Removed use of `jsonschema.RefResolver` as it will be deprecated from the jsonschema library [PR #1133](https://github.com/catalystneuro/neuroconv/pull/1133)
* Completely removed compression settings from most places [PR #1126](https://github.com/catalystneuro/neuroconv/pull/1126)
@@ -38,6 +45,8 @@
* Avoid running link test when the PR is on draft [PR #1093](https://github.com/catalystneuro/neuroconv/pull/1093)
* Centralize gin data preparation in a github action [PR #1095](https://github.com/catalystneuro/neuroconv/pull/1095)



# v0.6.4 (September 17, 2024)

## Bug Fixes
5 changes: 5 additions & 0 deletions docs/api/tools.aws.rst
@@ -0,0 +1,5 @@
.. _api_docs_aws_tools:

AWS Tools
---------
.. automodule:: neuroconv.tools.aws
1 change: 1 addition & 0 deletions docs/api/tools.rst
@@ -13,3 +13,4 @@ Tools
tools.signal_processing
tools.data_transfers
tools.nwb_helpers
tools.aws
136 changes: 136 additions & 0 deletions docs/user_guide/aws_demo.rst
@@ -0,0 +1,136 @@
NeuroConv AWS Demo
------------------

The :ref:`neuroconv.tools.aws <api_docs_aws_tools>` submodule provides a number of tools for deploying NWB conversions
within AWS cloud services. These tools primarily facilitate the transfer of source data from cloud storage
to AWS, where the NWB conversion takes place, followed by immediate direct upload to the `DANDI Archive <https://dandiarchive.org/>`_.

The following is an explicit demonstration of how to use these tools to create a pipeline that runs a remote data conversion.
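
For orientation, the public helpers that this changeset exposes from the submodule (see ``src/neuroconv/tools/aws/__init__.py`` further down in this diff) can be imported as sketched below; the demo on this page additionally uses ``deploy_neuroconv_batch_job``, which is assumed to be available in your installed version of NeuroConv.

.. code-block:: python

    # Helpers listed in neuroconv.tools.aws.__all__ in this changeset.
    from neuroconv.tools.aws import rclone_transfer_batch_job, submit_aws_batch_job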

This tutorial relies on setting up several cloud-based aspects ahead of time:

a. Download some of the GIN data from the main testing suite; see :ref:`example_data` for more
details. Specifically, you will need the ``spikeglx`` and ``phy`` folders.

b. Have access to a `Google Drive <https://drive.google.com>`_ folder to mimic a typical remote storage
location. The example data from (a) only takes up about 20 MB of space, so ensure you have that available. In
practice, any `cloud storage provider that can be accessed via Rclone <https://rclone.org/#providers>`_ can be used.

c. Install `Rclone <https://rclone.org>`_, run ``rclone config``, and follow all instructions while giving your
remote the name ``test_google_drive_remote``. This step creates a file called ``rclone.conf`` containing the
credentials needed to access the Google Drive folder from other locations. You can find the path to this file,
which you will need in a later step, by running ``rclone config file``.

d. Have access to an `AWS account <https://aws.amazon.com/resources/create-account/>`_. Then, from
the `AWS console <https://aws.amazon.com/console/>`_, sign in and navigate to the "IAM" page. Here, you will
generate credentials by creating a new user with programmatic access. Save your access key and secret key
somewhere safe (for example, by installing the `AWS CLI <https://aws.amazon.com/cli>`_ and running ``aws configure``
to store the values on your local device). A minimal sketch of exporting these values, and of locating the
``rclone.conf`` file from step (c), is shown just after this list.

e. Have access to an account on the DANDI `staging/testing server <https://gui-staging.dandiarchive.org/>`_ (you
will probably want one on the main archive as well, but please do not upload demonstration data to the primary
server). This request can take a few days for the admin team to process. Once you have access, you will need
to create a new Dandiset on the staging server and record the six-digit Dandiset ID.
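
Before moving on, it can help to confirm that the credentials from steps (c) through (e) are visible on the machine you will launch the demo from. The sketch below is one way to do so; it assumes you prefer environment variables over ``aws configure``, that the variable names match those used by NeuroConv's CI workflows (``AWS_ACCESS_KEY_ID``, ``AWS_SECRET_ACCESS_KEY``, ``DANDI_API_KEY``), and that ``rclone config file`` prints the configuration path on its last output line; the placeholder values are hypothetical.

.. code-block:: python

    import os
    import subprocess

    # Hypothetical placeholders - substitute the IAM keys from step (d) and the
    # API key from your staging-server account in step (e).
    os.environ["AWS_ACCESS_KEY_ID"] = "< your access key >"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "< your secret key >"
    os.environ["DANDI_API_KEY"] = "< your DANDI staging API key >"

    # Locate the rclone.conf created in step (c); note this path for a later step.
    result = subprocess.run(["rclone", "config", "file"], capture_output=True, text=True)
    print(result.stdout.strip().splitlines()[-1])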

.. warning::

*Cloud costs*. While NeuroConv optimizes the operations it deploys on your behalf as much as it can, cloud services can still become expensive. Please be aware of the costs associated with running these services and ensure you have the necessary permissions and budget to run these operations. While NeuroConv makes every effort to ensure that no resources are left stalled, it is ultimately your responsibility to monitor and manage them. We recommend checking the AWS dashboards regularly while running these operations, manually removing any spurious resources, and setting up billing alerts to ensure you do not exceed your budget.

Then, to set up the remaining pieces of the tutorial:

1. In your Google Drive, make a new folder for this demo conversion named ``demo_neuroconv_aws`` at the outermost
level (not nested in any other folders).

2. Create a file on your local device named ``demo_neuroconv_aws.yml`` with the following content:

.. code-block:: yaml

    metadata:
      NWBFile:
        lab: My Lab
        institution: My Institution
    data_interfaces:
      ap: SpikeGLXRecordingInterface
      phy: PhySortingInterface
    upload_to_dandiset: "< enter your six-digit Dandiset ID here >"
    experiments:
      my_experiment:
        metadata:
          NWBFile:
            session_description: My session.
        sessions:
          - source_data:
              ap:
                file_path: spikeglx/Noise4Sam_g0/Noise4Sam_g0_imec0/Noise4Sam_g0_t0.imec0.ap.bin
            metadata:
              NWBFile:
                session_start_time: "2020-10-10T21:19:09+00:00"
              Subject:
                subject_id: "1"
                sex: F
                age: P35D
                species: Mus musculus
          - metadata:
              NWBFile:
                session_start_time: "2020-10-10T21:19:09+00:00"
              Subject:
                subject_id: "002"
                sex: F
                age: P35D
                species: Mus musculus
            source_data:
              phy:
                folder_path: phy/phy_example_0/

3. Copy and paste the ``Noise4Sam_g0`` and ``phy_example_0`` folders from the :ref:`example_data` into this demo
folder so that you have the following structure...

.. code::

    demo_neuroconv_aws/
    ¦   demo_output/
    ¦   spikeglx/
    ¦   +-- Noise4Sam_g0/
    ¦       +-- ...  # .nidq streams
    ¦       +-- Noise4Sam_g0_imec0/
    ¦           +-- Noise4Sam_g0_t0.imec0.ap.bin
    ¦           +-- Noise4Sam_g0_t0.imec0.ap.meta
    ¦           +-- ...  # .lf streams
    ¦   phy/
    ¦   +-- phy_example_0/
    ¦       +-- ...  # The various file contents from the example Phy folder

4. Now run the following Python code to deploy the AWS Batch job:

.. code:: python

    from neuroconv.tools.aws import deploy_neuroconv_batch_job

    rclone_command = (
        "rclone copy test_google_drive_remote:demo_neuroconv_aws /mnt/efs/source "
        "--verbose --progress --config ./rclone.conf"
    )

    # Remember - you can find this via `rclone config file`
    rclone_config_file_path = "/path/to/rclone.conf"

    yaml_specification_file_path = "/path/to/demo_neuroconv_aws.yml"

    job_name = "demo_deploy_neuroconv_batch_job"
    efs_volume_name = "demo_deploy_neuroconv_batch_job"

    deploy_neuroconv_batch_job(
        rclone_command=rclone_command,
        yaml_specification_file_path=yaml_specification_file_path,
        job_name=job_name,
        efs_volume_name=efs_volume_name,
        rclone_config_file_path=rclone_config_file_path,
    )

Voilà! If everything ran successfully, you should eventually (~2-10 minutes) see the files uploaded to your
Dandiset on the staging server. You should also be able to monitor the running resources in the AWS Batch dashboard,
as well as the status entries in the DynamoDB table.
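
If you prefer to inspect the status entries programmatically rather than through the console, the sketch below reads the table directly with ``boto3``; it assumes the default status-tracker table name (``neuroconv_batch_status_tracker``) and region (``us-east-2``) used by the helpers, and that your AWS credentials are already configured locally.

.. code-block:: python

    import boto3

    # Default table name and region used by the NeuroConv AWS helpers; adjust if you overrode them.
    table = boto3.resource("dynamodb", region_name="us-east-2").Table("neuroconv_batch_status_tracker")

    # Print every tracked submission row in the status table.
    for item in table.scan()["Items"]:
        print(item)
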
1 change: 1 addition & 0 deletions docs/user_guide/index.rst
@@ -27,3 +27,4 @@ and synchronize data across multiple sources.
backend_configuration
yaml
docker_demo
aws_demo
3 changes: 2 additions & 1 deletion src/neuroconv/tools/aws/__init__.py
@@ -1,3 +1,4 @@
from ._submit_aws_batch_job import submit_aws_batch_job
from ._rclone_transfer_batch_job import rclone_transfer_batch_job

__all__ = ["submit_aws_batch_job"]
__all__ = ["submit_aws_batch_job", "rclone_transfer_batch_job"]
113 changes: 113 additions & 0 deletions src/neuroconv/tools/aws/_rclone_transfer_batch_job.py
@@ -0,0 +1,113 @@
"""Collection of helper functions for assessing and performing automated data transfers related to AWS."""

import warnings
from typing import Optional

from pydantic import FilePath, validate_call

from ._submit_aws_batch_job import submit_aws_batch_job


@validate_call
def rclone_transfer_batch_job(
*,
rclone_command: str,
job_name: str,
efs_volume_name: str,
rclone_config_file_path: Optional[FilePath] = None,
status_tracker_table_name: str = "neuroconv_batch_status_tracker",
compute_environment_name: str = "neuroconv_batch_environment",
job_queue_name: str = "neuroconv_batch_queue",
job_definition_name: Optional[str] = None,
minimum_worker_ram_in_gib: int = 4,
minimum_worker_cpus: int = 4,
submission_id: Optional[str] = None,
region: Optional[str] = None,
) -> dict[str, str]:
"""
Submit a job to AWS Batch for processing.
Requires AWS credentials saved to files in the `~/.aws/` folder or set as environment variables.
Parameters
----------
rclone_command : str
The command to pass directly to Rclone running on the EC2 instance.
E.g.: "rclone copy my_drive:testing_rclone /mnt/efs"
Must move data from or to '/mnt/efs'.
job_name : str
The name of the job to submit.
efs_volume_name : str
The name of an EFS volume to be created and attached to the job.
The path exposed to the container will always be `/mnt/efs`.
rclone_config_file_path : FilePath, optional
The path to the Rclone configuration file to use for the job.
If unspecified, method will attempt to find the file in `~/.rclone` and will raise an error if it cannot.
status_tracker_table_name : str, default: "neuroconv_batch_status_tracker"
The name of the DynamoDB table to use for tracking job status.
compute_environment_name : str, default: "neuroconv_batch_environment"
The name of the compute environment to use for the job.
job_queue_name : str, default: "neuroconv_batch_queue"
The name of the job queue to use for the job.
job_definition_name : str, optional
The name of the job definition to use for the job.
If unspecified, a name starting with 'neuroconv_batch_' will be generated.
minimum_worker_ram_in_gib : int, default: 4
The minimum amount of base worker memory required to run this job.
Determines the EC2 instance type selected by the automatic 'best fit' selector.
Recommended to be several GiB to allow comfortable buffer space for data chunk iterators.
minimum_worker_cpus : int, default: 4
The minimum number of CPUs required to run this job.
A minimum of 4 is required, even if only one will be used in the actual process.
submission_id : str, optional
The unique ID to pair with this job submission when tracking the status via DynamoDB.
Defaults to a random UUID4.
region : str, optional
The AWS region to use for the job.
If not provided, we will attempt to load the region from your local AWS configuration.
If that file is not found on your system, we will default to "us-east-2", the location of the DANDI Archive.
Returns
-------
info : dict
A dictionary containing information about this AWS Batch job.
info["job_submission_info"] is the return value of `boto3.client.submit_job` which contains the job ID.
info["table_submission_info"] is the initial row data inserted into the DynamoDB status tracking table.
"""
docker_image = "ghcr.io/catalystneuro/rclone_with_config:latest"

if "/mnt/efs" not in rclone_command:
message = (
f"The Rclone command '{rclone_command}' does not contain a reference to '/mnt/efs'. "
"Without utilizing the EFS mount, the instance is unlikely to have enough local disk space."
)
warnings.warn(message=message, stacklevel=2)

rclone_config_file_path = rclone_config_file_path or pathlib.Path.home() / ".rclone" / "rclone.conf"
if not rclone_config_file_path.exists():
raise FileNotFoundError(
f"Rclone configuration file not found at: {rclone_config_file_path}! "
"Please check that `rclone config` successfully created the file."
)
with open(file=rclone_config_file_path, mode="r") as io:
rclone_config_file_stream = io.read()

region = region or "us-east-2"

info = submit_aws_batch_job(
job_name=job_name,
docker_image=docker_image,
environment_variables={"RCLONE_CONFIG": rclone_config_file_stream, "RCLONE_COMMAND": rclone_command},
efs_volume_name=efs_volume_name,
status_tracker_table_name=status_tracker_table_name,
compute_environment_name=compute_environment_name,
job_queue_name=job_queue_name,
job_definition_name=job_definition_name,
minimum_worker_ram_in_gib=minimum_worker_ram_in_gib,
minimum_worker_cpus=minimum_worker_cpus,
submission_id=submission_id,
region=region,
)

return info
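
For reference, a minimal usage sketch of the new helper follows. It mirrors the keyword-only signature above; the remote name, folder, and paths are hypothetical placeholders borrowed from the demo, and AWS credentials are assumed to already be configured locally.

.. code-block:: python

    from neuroconv.tools.aws import rclone_transfer_batch_job

    # The command must reference '/mnt/efs'; otherwise the helper warns that the
    # instance is unlikely to have enough local disk space.
    rclone_command = (
        "rclone copy test_google_drive_remote:demo_neuroconv_aws /mnt/efs/source "
        "--verbose --progress --config ./rclone.conf"
    )

    info = rclone_transfer_batch_job(
        rclone_command=rclone_command,
        job_name="demo_rclone_transfer_batch_job",
        efs_volume_name="demo_rclone_transfer_batch_job",
        rclone_config_file_path="/path/to/rclone.conf",  # hypothetical path from `rclone config file`
    )

    # The AWS Batch submission response, which contains the job ID (see the docstring above).
    print(info["job_submission_info"])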