Load data lineage information to a Neo4j graph database with Apache Hop (Incubating)

knowbi/knowbi-neo4j-knowledge-graph

Load Hop and other lineage data to Neo4j

(Meta)data Lineage in Neo4j through Apache Hop

Data lineage information is an indispensable part of data engineering projects. This repository provides a starting point for loading your lineage information into a Neo4j graph database using Apache Hop.

⚠️ This repository is a work in progress. Some areas of the code have been ported and may be largely untested.
⚠️ This repository currently runs a scheduled (batch) import of your infrastructure and data integration code. Apache Hop will gradually include more built-in lineage functionality.

The areas this project currently covers are:

  • AWS infrastructure (JSON-based)
  • GCP (BigQuery)
  • data integration (Apache Hop) workflows and pipelines
  • Git (commit history)
  • Pentaho (prpt reports, mondrian, analyzer)
  • RDBMS (database, schema, table, column)

Loading your lineage data to Neo4j

Download a recent Apache Hop release build and unzip it.

Download the Neo4j Hop plugins and unzip them to hop/plugins/transforms.

Clone this repository:

git clone https://github.com/knowbi/knowbi-hop-meta-to-neo4j.git

Copy project-config.json.template to project-config.json and change the variables (in the JSON file or through Hop GUI):

| Name | Value | Description (optional information) |
| --- | --- | --- |
| NEO4J_COMM_HOST | | |
| NEO4J_COMM_BOLT_PORT | 7687 | |
| NEO4J_COMM_BROWSER_PORT | 7474 | |
| NEO4J_COMM_USER | neo4j | |
| NEO4J_COMM_PASS | | |
| do.aws | Y | include AWS infrastructure? |
| do.aws.dms | N | include AWS DMS (depends on do.aws) |
| do.aws.rds | N | include AWS RDS (depends on do.aws) |
| do.clean.neo4j.data | Y | clean Neo4j database before running (DELETES EVERYTHING!!) |
| do.etl | Y | include Hop ETL parsing? |
| do.gcp | N | include GCP infrastructure (BigQuery only for now) |
| do.git | Y | include git commit history? |
| do.pentaho | N | include Pentaho reports and cubes? |
| do.pentaho.report | N | include Pentaho reports (depends on do.pentaho)? |
| do.pentaho.mondrian.schema | N | include Pentaho/Mondrian schemas and cubes (depends on do.pentaho)? |
| do.rdbms | N | include RDBMS schema parsing? |
| do.rdbms.columns | Y | include columns in RDBMS schema parsing (if 'N', only database, schema, table parsing) |
| git.tmp.dir | /tmp | temporary directory to store git parsing working files |
| etl.kettle.properties.dir | | directory to read kettle.properties file from (deprecated) |
| etl.dir | /customer/project/directory | directory to read Hop workflows and pipelines from |
| neo4j.host | localhost | Neo4j database host |
| neo4j.bolt.port | 7687 | Neo4j bolt port (default 7687) |
| neo4j.browser.port | 7474 | Neo4j browser port (default 7474) |
| neo4j.user | neo4j | Neo4j username (default neo4j) |
| neo4j.pass | knowbi | Neo4j password |
| pentaho.mondrian.analyzer.dir | | Pentaho Analyzer reports directory |
| pentaho.mondrian.properties.dir | | mondrian.properties path |
| pentaho.mondrian.schema.dir | | Pentaho Mondrian schema files directory |
| pentaho.report.dir | | Pentaho Reports (prpt) directory |
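
Several of the `do.*` flags depend on a parent flag, and `do.clean.neo4j.data` is destructive, so it can help to sanity-check your values before a run. The following is a minimal Python sketch and not part of this repository. It assumes you have already gathered the variables into a plain dictionary of name/value strings (how you read them from project-config.json is left out, since the file layout is not described here), and it assumes that "depends on" means a child flag has no effect while its parent is `N`. The names `DEPENDS_ON` and `check_flags` are hypothetical.

```python
# Parent/child relationships taken from the "depends on" notes in the variable list.
# do.rdbms.columns is left out because no dependency is stated explicitly for it.
DEPENDS_ON = {
    "do.aws.dms": "do.aws",
    "do.aws.rds": "do.aws",
    "do.pentaho.report": "do.pentaho",
    "do.pentaho.mondrian.schema": "do.pentaho",
}


def check_flags(variables):
    """Return a list of warning strings for a dict of variable name -> value."""

    def enabled(name):
        # Treat a missing variable as 'N'; accept 'y'/'Y' with stray whitespace.
        return str(variables.get(name, "N")).strip().upper() == "Y"

    warnings = []
    for child, parent in DEPENDS_ON.items():
        if enabled(child) and not enabled(parent):
            warnings.append(f"{child} is Y but {parent} is N")
    if enabled("do.clean.neo4j.data"):
        warnings.append(
            "do.clean.neo4j.data is Y: the Neo4j database will be wiped before the run"
        )
    return warnings


if __name__ == "__main__":
    # Example values only; replace with the variables from your own configuration.
    example = {
        "do.aws": "N",
        "do.aws.dms": "Y",
        "do.pentaho": "Y",
        "do.pentaho.report": "Y",
        "do.clean.neo4j.data": "Y",
    }
    for line in check_flags(example):
        print("WARNING:", line)
```

Returning warnings instead of raising an error keeps the check advisory, which suits a setup where a child flag left at `Y` may simply be ignored.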
