NextiaJD

A scalable data discovery solution using profiles, built on Apache Spark.

About | Key Features | How it works | Usage | Installation | Demo | Reproducibility

About

NextiaJD, from nextia in the Nahuatl language (the old Aztec language), is a scalable data discovery system. NextiaJD computes profiles, succinct representations of the underlying characteristics of datasets and their attributes, to efficiently discover joinable attributes across datasets. The goal is to automatically discover, in a massive collection of heterogeneous datasets (i.e., a data lake), the pairs of attributes that can be crossed.

Here we provide detailed information on how to run and evaluate NextiaJD. To learn more about the project, visit our website: https://www.essi.upc.edu/dtim/nextiajd/

Key features

  • Attribute profiling built into Spark
  • A fully distributed end-to-end framework for joinable attribute discovery
  • Easy data discovery for everyone

How it works

We encourage you to read our paper to better understand what NextiaJD is and how it can fit your scenarios.

The simple way to describe it:


You have one dataset and a collection of independent datasets, and you would like to find which datasets in the collection have attributes that would yield a high-quality join with yours.

NextiaJD reduces the effort of manual exploration by predicting which attributes are join candidates, according to a set of predefined quality classes.

As examples, consider two scenarios:

  • In a data lake, a profile is computed whenever a new dataset is ingested. Then, whenever a data analyst brings in a dataset, NextiaJD can find other datasets in the lake that can be joined with it.
  • In an ordinary repository holding a few datasets, where we want to know how they can be crossed against one given dataset.

Requirements

  • Spark 3.0.1
  • Scala 2.12
  • Java 8 or 11

Installation

There are two ways to install NextiaJD on your machine: building the JARs from this repository using Maven, or downloading the precompiled NextiaJD JARs (recommended).

Build from sources

To build NextiaJD from source, follow the steps below:

  • Clone this project
$ git clone https://github.com/dtim-upc/NextiaJD
  • Go to the project root directory in a terminal
  • Run the command below. It builds the Spark Catalyst, Spark SQL, and Spark NextiaJD JARs through Maven. Note that this will take some time.
    • Alternatively, you can build the whole Spark project as specified here
$ ./build/mvn clean package -pl :spark-catalyst_2.12,:spark-sql_2.12,:spark-nextiajd_2.12 -DskipTests 
  • If the build succeeds, you can find the compiled jars under the following directories:
    • /sql/nextiajd/target/spark-nextiajd_2.12-3.0.1.jar
    • /sql/core/target/spark-sql_2.12-3.0.1.jar
    • /sql/catalyst/target/spark-catalyst_2.12-3.0.1.jar
  • Copy the compiled JARs into the jars folder of your Spark installation, e.g. $SPARK_HOME/jars (replacing existing JARs if necessary)
  • You are now ready to use NextiaJD

Download the compiled JARs

To install the precompiled JARs, follow the steps below:

  • First, download the compiled JARs using these links:
  • Copy the downloaded JARs into the jars folder of your Spark installation, e.g. $SPARK_HOME/jars (replacing existing JARs if necessary)
  • You are now ready to use NextiaJD
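
With the JARs in place, NextiaJD is available from a regular Spark session. As a quick sanity check (a minimal sketch, assuming a standard Spark setup and the import used later in this README), start a Spark shell and verify that the NextiaJD entry point resolves:

$ spark-shell
// inside the shell: if this import resolves, the JARs were picked up correctly
import org.apache.spark.sql.NextiaJD.discovery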

Usage

Attribute profiling

To profile a dataset, call the method attributeProfile() on a DataFrame object. By default, once a profile is computed it is saved in the dataset's directory, which allows the profile to be reused in future discoveries without computing it again. While you can use any dataset format, we recommend Parquet files, which make profile computation faster.

val dataset = spark.read.csv(...)
// computes the attribute profile
dataset.attributeProfile()
// returns a DataFrame with the profile information
dataset.getAttributeProfile()
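
For example, profiling a CSV end to end might look as follows (a sketch: the file path and CSV reader options are illustrative, not part of NextiaJD):

// read a CSV with a header row (path and options are illustrative)
val dataset = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/lake/companies.csv")

// computed once and persisted in the dataset's directory
dataset.attributeProfile()

// later calls reuse the stored profile instead of recomputing it
dataset.getAttributeProfile().show()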

Join Discovery

Our join discovery focuses on the expected quality of the resulting join statement. To that end, we define a totally-ordered set of quality classes (see the sketch after this list for a concrete illustration):

  • High: attribute pairs with a containment similarity of at least 0.75 and a cardinality proportion of at most 4.
  • Good: attribute pairs with a containment similarity of at least 0.5 and a cardinality proportion of at most 8.
  • Moderate: attribute pairs with a containment similarity of at least 0.25 and a cardinality proportion of at most 12.
  • Poor: attribute pairs with a containment similarity of at least 0.1.
  • None: otherwise.
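
To make the thresholds concrete, the following self-contained sketch classifies a pair of attributes. It assumes containment similarity is the fraction of A's distinct values that also appear in B, and cardinality proportion is the ratio between the larger and smaller distinct-value counts; qualityClass is a hypothetical helper for illustration, not part of the NextiaJD API:

// hypothetical helper illustrating the quality classes above
def qualityClass(a: Set[String], b: Set[String]): String = {
  // containment similarity: fraction of A's distinct values also present in B
  val containment = a.intersect(b).size.toDouble / a.size
  // cardinality proportion: larger distinct-value count over the smaller one
  val cardProp = math.max(a.size, b.size).toDouble / math.min(a.size, b.size)
  if (containment >= 0.75 && cardProp <= 4) "High"
  else if (containment >= 0.5 && cardProp <= 8) "Good"
  else if (containment >= 0.25 && cardProp <= 12) "Moderate"
  else if (containment >= 0.1) "Poor"
  else "None"
}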

You can start a discovery with the function discovery() from org.apache.spark.sql.NextiaJD. As an example, the following code starts a discovery to find any attribute from our dataset that can be used to join it with some dataset from the repository.

import org.apache.spark.sql.NextiaJD.discovery

val dataset = spark.read.csv(...)
val repository = ... // the collection of datasets to search

discovery(dataset, repository)

By default, only candidate attributes that yield High or Good quality joins are shown. To explore Moderate and Poor results, the discovery function has the boolean parameters showModerate and showPoor; when one of them is enabled, the discovery only shows results of that quality.
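
For example, to inspect Moderate candidates (treating showModerate as a named boolean parameter is an assumption based on the description above):

// sketch: show Moderate-quality candidates
discovery(dataset, repository, showModerate = true)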

Demo (Zeppelin Notebook)

Check out the demo project for a quick example of how NextiaJD works. Bear in mind that, in order to access the notebooks, you must first log in with the following credentials (user: user1, password: nextiajd).

Note that we also have a step-by-step notebook, which can be found here.

More information and a video can be found here.

Reproducibility of Experiments

We performed different experiments to evaluate the predictive performance and efficiency of NextiaJD. In the spirit of open research and experimental reproducibility, we provide detailed information on how to reproduce them, which can be found here.
