This document describes the All of Us NLP deliverables associated with data ingestion and quality control, intended to support alpha release requirements. It is version controlled; read the version that lives in the branch or tag you need. The specification should always be consistent with the implemented curation processes.
- `src` - Source code in Java
- `main` - Scripts for setup, maintenance, deployment, etc.
- `test` - Unit tests
- `docker` - Dockerfile with all tools necessary for running the package
- `config` - Cloud Build configuration
Please reference the developer guide for development setup, and ensure the required environment variables are set as indicated there.
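As a sketch, a small helper can fail fast when required environment variables are missing before a build or deployment. The helper name and the example variable names are assumptions; consult the developer guide for the authoritative list.

```shell
# Hypothetical helper: verify that the named environment variables are set.
# Returns non-zero if any are missing.
check_env() {
  local missing=0
  for var in "$@"; do
    # ${!var} is bash indirect expansion: the value of the variable named $var
    if [ -z "${!var}" ]; then
      echo "ERROR: required environment variable $var is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (variable names are assumptions):
#   check_env GOOGLE_APPLICATION_CREDENTIALS PROJECT_ID || exit 1
```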
The following command builds the package with Maven for a given profile:

```shell
mvn clean install -U -P {profile}
```

where `{profile}` can be `direct`, `spark`, `flink`, or `dataflow`.
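When verifying the runner bundles, it can be convenient to wrap the build in a small helper. `build_profile` is a hypothetical helper for illustration, not part of the repository:

```shell
# Sketch: build a chosen runner profile; the profile names mirror the list above.
build_profile() {
  # -U forces a check for updated snapshots; -P selects the Maven profile
  mvn clean install -U -P "$1"
}

# Example: build the Dataflow bundle
#   build_profile dataflow
```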
To deploy to Google Dataflow, use the following command:

```shell
java -cp target/curation-nlp-bundled-dataflow-1.2-SNAPSHOT.jar \
  org.allofus.curation.pipeline.CurationNLPMain \
  --runner=DataflowRunner \
  --gcpTempLocation={bucket}/gcp_tmp \
  --stagingLocation={bucket}/staging \
  --tempLocation={bucket}/tmp \
  --resourcesDir={bucket}/resources \
  --input={bucket}/input \
  --output={bucket}/output \
  --inputType=jsonl \
  --outputType=jsonl \
  --project={project} \
  --region={region} \
  --subnetwork={subnet} \
  --usePublicIps=false \
  --maxNumWorkers=5 \
  --numberOfWorkerHarnessThreads=2 \
  --workerMachineType=n1-highmem-4 \
  --diskSizeGb=50 \
  --experiments=use_runner_v2 \
  --pipeline={pipeline} \
  --maxClampThreads=4 \
  --maxOutputPartitionSeconds=60 \
  --maxOutputBatchSize=100 \
  [--streaming --enableStreamingEngine]
```

The bracketed `--streaming --enableStreamingEngine` flags are optional and run the pipeline as a streaming job.
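For repeated deployments, the placeholder values can be centralized in a small wrapper script. The following sketch (the function name and argument layout are assumptions) assembles the core of the command above; extend it with the remaining flags as needed:

```shell
# Sketch: build the core of the Dataflow launch command from a few
# placeholder settings. Not part of the repository; for illustration only.
build_launch_cmd() {
  local bucket="$1" project="$2" region="$3"
  echo "java -cp target/curation-nlp-bundled-dataflow-1.2-SNAPSHOT.jar" \
    "org.allofus.curation.pipeline.CurationNLPMain" \
    "--runner=DataflowRunner" \
    "--gcpTempLocation=$bucket/gcp_tmp" \
    "--stagingLocation=$bucket/staging" \
    "--tempLocation=$bucket/tmp" \
    "--project=$project" \
    "--region=$region"
}

# Example (placeholder values):
#   build_launch_cmd gs://my-bucket my-project us-central1
```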
All actors calling APIs in production will use service accounts.