This repository helps index objects from an S3 bucket attached to a Synapse project/folder into Synapse. The code is written in R and offers two methods for running the pipeline:
- Executing the code in an R environment, e.g., RStudio in an EC2 instance (see Method 1: Running in an R Environment)
- Running the pipeline using a Docker container for easy setup and portability (see Method 2: Running with Docker)
- Synapse account with relevant access permissions
- Synapse authentication token
A Synapse authentication token is required for use of the Synapse APIs and CLI client. For help with Synapse, Synapse APIs, Synapse authentication tokens, etc., please refer to the Synapse documentation.
- Provision an EC2 RStudio Notebook instance
- Upgrade pip in the terminal of the EC2 instance with `python -m pip install --upgrade pip`
- Install the Synapse CLI client in the terminal with `pip install synapseclient`
Note: If you are having issues during installation of the Synapse CLI client, consider upgrading pip with `python -m pip install --upgrade pip` before attempting to reinstall `synapseclient`. If you still have issues, force re-install `synapseclient` from the terminal via `pip install --upgrade --force-reinstall synapseclient`.
- Create a `.Renviron` file in the home folder with the environment variable `SYNAPSE_AUTH_TOKEN='<personal-access-token>'`. Replace `<personal-access-token>` with your actual token. Your personal access token should have View, Modify, and Download permissions. If you don't have a Synapse personal access token, refer to the instructions here to get a new token: Personal Access Token in Synapse.
- Please follow the instructions to Create a Synapse personal access token (for AWS SSM Access)
- Please follow instructions 3-5 to set up SSM Access to an Instance. (Note: the AWS CLI version installed on the EC2 offering is 2.x.)
Note: After completing this step, you should be able to run data_sync.R, i.e., sync data between two buckets and from a bucket to the local EC2 instance.
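As a sanity check, the kind of transfer data_sync.R performs can be exercised directly with the AWS CLI. This is a minimal sketch; the bucket names and prefixes are placeholders, not values from this repository:

```shell
# Placeholders -- substitute your own buckets and prefixes.
# Bucket-to-bucket sync (requires read access on the source and write access on the destination):
aws s3 sync s3://<source-bucket>/<prefix>/ s3://<destination-bucket>/<prefix>/

# Bucket-to-local sync onto the EC2 instance:
aws s3 sync s3://<source-bucket>/<prefix>/ ~/local-data/
```

If both commands succeed, SSM access and credentials are configured correctly for the pipeline.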
- Clone this repository and switch to the new project
- Modify the parameters in params.R
- Run install_requirements.R
- Start a new R session (type `q()` in the R console)
- Run ingress_pipeline.sh in the terminal using the command `bash ~/<path to>/ingress_pipeline.sh`
- Set up ingress_pipeline.sh as a cron job at your required frequency
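For the cron job in the last step, a crontab entry along these lines runs the pipeline once a day; the paths are hypothetical and should point at your clone of the repository:

```shell
# Open the crontab editor with: crontab -e
# Then add one line, e.g. to run daily at 02:00 and append output to a log:
0 2 * * * bash /home/rstudio/recover-s3-synindex/ingress_pipeline.sh >> /home/rstudio/ingress_pipeline.log 2>&1
```

Because cron runs with a minimal environment, make sure `.Renviron` (and any AWS credentials) are readable by the user that owns the crontab.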
- Pull the docker image with `docker pull ghcr.io/sage-bionetworks/recover-s3-synindex:main`
- Run a container with `docker run -e AWS_SYNAPSE_TOKEN=<aws-cli-token> -e SYNAPSE_AUTH_TOKEN=<synapse-auth-token> <image-name>`
- If desired, set up a scheduled job (AWS Scheduled Jobs, cron, etc.) using the docker image (ghcr.io/sage-bionetworks/recover-s3-synindex) to run the pipeline at your desired frequency
Note: Replace `<aws-cli-token>` and `<synapse-auth-token>` with the actual token values. When provisioning a Scheduled Job, `<aws-cli-token>` and `<synapse-auth-token>` should be specified in the Secrets and/or EnvVars fields of the provisioning settings page.
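If you schedule the container with plain cron rather than AWS Scheduled Jobs, one way to avoid hard-coding tokens in the crontab is an env file passed via `docker run --env-file`; the file path and schedule below are assumptions, not part of this repository:

```shell
# /home/ec2-user/.synindex.env (hypothetical path, readable only by the crontab owner) contains:
#   AWS_SYNAPSE_TOKEN=<aws-cli-token>
#   SYNAPSE_AUTH_TOKEN=<synapse-auth-token>
# Crontab entry running the published image daily at 03:00, removing the container afterwards:
0 3 * * * docker run --rm --env-file /home/ec2-user/.synindex.env ghcr.io/sage-bionetworks/recover-s3-synindex:main
```

Restrict the env file's permissions (e.g., `chmod 600`) since it holds credentials.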