# Sample app - RouterFS + lakeFS

The app writes a Parquet file to two separate destinations: an S3 bucket and a lakeFS server (using lakeFS's S3 gateway).
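
Conceptually, the write looks something like the sketch below. This is a minimal illustration, not the sample app's actual code: the paths, prefix, repository and branch names are hypothetical placeholders, and RouterFS is assumed to already be configured (via the jar and the Spark client configuration) to remap the mapped prefix to the lakeFS S3 gateway.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("routerfs-sample").getOrCreate()

# A small DataFrame to write as Parquet.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Path under the mapped prefix: RouterFS rewrites it to the lakeFS S3 gateway,
# so this copy lands in the lakeFS repository.
df.write.mode("overwrite").parquet("s3a://<mapped-prefix>/<repo>/<branch>/sample-data/")

# Path outside the mapping: handled by plain S3A, so the same file is written
# directly to the S3 bucket.
df.write.mode("overwrite").parquet("s3a://<your-s3-bucket>/sample-data/")
```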

## Running the app

### Pre-requisites

  1. Before running the app, make sure you have placed the `hadoop-router-fs-0.1.0-jar-with-dependencies.jar` file you built under your `$SPARK_HOME/jars` directory (see the example command after this list).
  2. `cd sample_app`.
  3. Run `pip install -r requirements.txt`.
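
For example, assuming the jar was built with Maven into its default `target/` directory (adjust the path to wherever your build placed it):

```sh
cp target/hadoop-router-fs-0.1.0-jar-with-dependencies.jar "$SPARK_HOME/jars/"
```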

### Configurations

Under `spark_client`:

  1. Set the `lakefs_` and `aws_` variables in the code to reflect the correct information. Alternatively, set the `LAKEFS_` and `AWS_` environment variables as specified in the code (see the sketch after these lists).
  2. Optionally, set the `repo_name`, `branch_name` and `path` variables in the code (if `path` is set, make sure it ends with a `/`).
  3. Optionally, set the `replace_prefix` variable in the code to reflect the mapped prefix. Alternatively, set the `S3A_REPLACE_PREFIX` environment variable as specified in the code.

Under `main.py`:

  1. Set the `s3a_replace_prefix` variable in the code to reflect the mapped prefix. Make sure this is the same value as `replace_prefix` in the `spark_client` file. Alternatively, set the `S3A_REPLACE_PREFIX` environment variable as specified in the code.
  2. Set the `s3_bucket_s3a_prefix` variable in the code to reflect the S3 bucket namespace to which the Parquet file will be written. This should be a valid and accessible S3 bucket prefix.
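
As an illustration of the environment-variable fallback mentioned above, the configuration might be read roughly as in the sketch below. The exact variable and environment-variable names are assumptions; check the code for the names the app actually expects.

```python
import os

# Credentials: taken from the environment if set, otherwise edit the defaults in place.
lakefs_access_key = os.getenv("LAKEFS_ACCESS_KEY_ID", "<your-lakefs-access-key>")
lakefs_secret_key = os.getenv("LAKEFS_SECRET_ACCESS_KEY", "<your-lakefs-secret-key>")
aws_access_key = os.getenv("AWS_ACCESS_KEY_ID", "<your-aws-access-key>")
aws_secret_key = os.getenv("AWS_SECRET_ACCESS_KEY", "<your-aws-secret-key>")

# Mapped prefix: must match replace_prefix in the spark_client file.
s3a_replace_prefix = os.getenv("S3A_REPLACE_PREFIX", "<your-mapped-prefix>")

# Plain S3 destination for the second copy of the Parquet file.
s3_bucket_s3a_prefix = "s3a://<your-s3-bucket>/<some-prefix>/"
```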

### Run

```sh
spark-submit --packages "org.apache.hadoop:hadoop-aws:<your.hadoop.version>" main.py
```
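
For example, with a Spark distribution built against Hadoop 3.2.0 (substitute the Hadoop version your own Spark build uses):

```sh
spark-submit --packages "org.apache.hadoop:hadoop-aws:3.2.0" main.py
```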

## Result

After running the app, you should see that the same Parquet file was written to two different locations (to the lakeFS server and directly to the configured S3 bucket) using a single mapping scheme.
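
One way to verify this, assuming the AWS CLI and `lakectl` are installed and configured, is to list the written objects in the S3 bucket (for example with `aws s3 ls`) and to browse the corresponding path in the lakeFS UI or list it with `lakectl fs ls lakefs://<repo>/<branch>/<path>`.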