Load Hop and other lineage data to Neo4j
Data lineage information is an indispensable part of data engineering projects. This repository provides a starting point for loading your lineage information into a Neo4j graph database using Apache Hop.
⚠️ This repository is a work in progress. Some areas of the code have been ported and may be largely untested.
⚠️ This repository currently runs a scheduled (batch) import of your infrastructure and data integration code. Apache Hop will gradually include more built-in lineage functionality.
This project currently covers the following areas:
- AWS infrastructure (JSON-based)
- GCP (BigQuery)
- data integration (Apache Hop) workflows and pipelines
- Git (commit history)
- Pentaho (prpt reports, mondrian, analyzer)
- RDBMS (database, schema, table, column)
Download a recent Apache Hop release build and unzip it
Download the Neo4j Hop plugins and unzip them to hop/plugins/transforms
Clone this repository:
git clone https://github.com/knowbi/knowbi-hop-meta-to-neo4j.git
Copy project-config.json.template to project-config.json and change the variables (in the JSON file or through Hop Gui):
Name | Value | Description (optional information) |
---|---|---|
NEO4J_COMM_HOST | | |
NEO4J_COMM_BOLT_PORT | 7687 | |
NEO4J_COMM_BROWSER_PORT | 7474 | |
NEO4J_COMM_USER | neo4j | |
NEO4J_COMM_PASS | | |
do.aws | Y | include AWS infrastructure? |
do.aws.dms | N | include AWS DMS (depends on do.aws) |
do.aws.rds | N | include AWS RDS (depends on do.aws) |
do.clean.neo4j.data | Y | clean Neo4j database before running (DELETES EVERYTHING!!) |
do.etl | Y | include Hop ETL parsing? |
do.gcp | N | include GCP infrastructure (BigQuery only for now) |
do.git | Y | include git commit history? |
do.pentaho | N | include Pentaho reports and cubes? |
do.pentaho.report | N | include Pentaho reports (depends on do.pentaho)? |
do.pentaho.mondrian.schema | N | include Pentaho/Mondrian schemas and cubes (depends on do.pentaho)? |
do.rdbms | N | include RDBMS schema parsing? |
do.rdbms.columns | Y | include columns in RDBMS schema parsing (if 'N', only database, schema, table parsing) |
git.tmp.dir | /tmp | temporary directory to store git parsing working files |
etl.kettle.properties.dir | | directory to read kettle.properties file from (deprecated) |
etl.dir | /customer/project/directory | directory to read Hop workflows and pipelines from |
neo4j.host | localhost | Neo4j database host |
neo4j.bolt.port | 7687 | Neo4j bolt port (default 7687) |
neo4j.browser.port | 7474 | Neo4j browser port (default 7474) |
neo4j.user | neo4j | Neo4j username (default neo4j) |
neo4j.pass | knowbi | Neo4j password |
pentaho.mondrian.analyzer.dir | | Pentaho Analyzer reports directory |
pentaho.mondrian.properties.dir | | mondrian.properties path |
pentaho.mondrian.schema.dir | | Pentaho Mondrian schema files directory |
pentaho.report.dir | | Pentaho Reports (prpt) directory |
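
Several flags in the table are described as depending on a parent flag (for example, do.aws.dms depends on do.aws). The sketch below is a hypothetical helper, not part of this repository. It takes the variables as a plain name-to-value mapping and reports child flags that are set to Y while their parent is not, on the assumption that such flags have no effect. It also builds a Bolt URI from neo4j.host and neo4j.bolt.port. The function names and the mapping are illustrative only; how you read the values out of project-config.json is left to you.

```python
# Hypothetical helper, not part of this repository.
# It works on a plain name -> value mapping of the variables listed in the table.

# Child flag -> parent flag, taken from the "depends on" notes in the table.
DEPENDS_ON = {
    "do.aws.dms": "do.aws",
    "do.aws.rds": "do.aws",
    "do.pentaho.report": "do.pentaho",
    "do.pentaho.mondrian.schema": "do.pentaho",
}


def find_ignored_flags(variables):
    """Return child flags set to 'Y' whose parent flag is not 'Y'."""
    ignored = []
    for child, parent in DEPENDS_ON.items():
        if variables.get(child) == "Y" and variables.get(parent) != "Y":
            ignored.append(child)
    return sorted(ignored)


def bolt_uri(variables):
    """Build a bolt:// URI from neo4j.host and neo4j.bolt.port."""
    host = variables.get("neo4j.host", "localhost")
    port = variables.get("neo4j.bolt.port", "7687")
    return f"bolt://{host}:{port}"


# Example values; do.pentaho.report is enabled while do.pentaho is not.
example = {
    "do.aws": "Y",
    "do.aws.dms": "N",
    "do.pentaho": "N",
    "do.pentaho.report": "Y",
    "neo4j.host": "localhost",
    "neo4j.bolt.port": "7687",
}

print(find_ignored_flags(example))
print(bolt_uri(example))
```

Running a check like this before the import may help catch a configuration where a report or schema flag is switched on but its parent area is still disabled.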