A small sketchpad for Apache Beam pipelines in Python, to be used as a Flex Template for Google Cloud Dataflow.
It uses the Belarus Used Cars Prices Kaggle dataset, performs transform operations on its data, and loads the results into MongoDB Atlas and Google BigQuery (using an existing MongoDB table to enrich the clean, valid records).
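The actual transforms live in beam_example/main.py; purely as an illustration of the pipeline's shape, a minimal Beam sketch might look like the following. The CSV layout, field names, table names, and connection strings below are assumptions, not taken from this repo, and the enrichment join against the existing MongoDB table is omitted:

```python
# Illustrative sketch only -- field names, CSV layout, and sink targets are assumptions.
import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery
from apache_beam.io.mongodbio import WriteToMongoDB
from apache_beam.options.pipeline_options import PipelineOptions


def parse_row(line):
    # Assumed four-column layout; the real dataset has more columns.
    make, model, year, price = line.split(",")[:4]
    return {"make": make, "model": model, "year": int(year), "price": float(price)}


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        rows = (
            p
            | "ReadCSV" >> beam.io.ReadFromText("cars.csv", skip_header_lines=1)
            | "Parse" >> beam.Map(parse_row)
            | "KeepValid" >> beam.Filter(lambda row: row["price"] > 0)
        )
        # Load the clean records into MongoDB Atlas...
        rows | "ToMongo" >> WriteToMongoDB(
            uri="mongodb+srv://user:password@host", db="cars", coll="clean"
        )
        # ...and into BigQuery.
        rows | "ToBigQuery" >> WriteToBigQuery(
            "my-project:cars.clean",
            schema="make:STRING,model:STRING,year:INTEGER,price:FLOAT",
            write_disposition=BigQueryDisposition.WRITE_APPEND,
            create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
        )


if __name__ == "__main__":
    run()
```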
A Jupyter notebook (cars_data_exploration.ipynb) is included, containing data exploration drafts for the CSV dataset that give an idea of its contents and the treatment it requires.
Create a virtual environment and install the apache_beam and dnspython dependencies with pip.
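For example (a minimal sketch; the [gcp] extra is an assumption here, added because the pipeline targets Dataflow and BigQuery):

```sh
python -m venv .venv
source .venv/bin/activate
pip install "apache-beam[gcp]" dnspython
```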
To run locally, execute the main.py script:
python beam_example/main.py <arguments>
The required arguments are:
--db_user=<database user>
--db_pass=<database password>
--db_host=<database host>
--gcp_project_id=<project id>
--gcp_bucket_id=<bucket id>
The input file path can be specified using the optional --input=<path to input data> argument.
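For example, with placeholder values:

```sh
python beam_example/main.py \
  --db_user=beam_user \
  --db_pass=beam_pass \
  --db_host=cluster0.example.mongodb.net \
  --gcp_project_id=my-project \
  --gcp_bucket_id=my-bucket \
  --input=data/cars.csv
```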
To deploy on GCP, execute the deploy.sh script. It requires the following variables:
GCP_REGION
GCP_PROJECT
GCP_ARTIFACT_REPO
GCP_BUCKET
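For example, assuming deploy.sh reads these variables from the environment (all values below are placeholders):

```sh
export GCP_REGION=us-central1
export GCP_PROJECT=my-project
export GCP_ARTIFACT_REPO=my-repo
export GCP_BUCKET=my-bucket
./deploy.sh
```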