NextiaJD

A scalable data discovery solution using profiles, built on Apache Spark.

About | Key Features | How it works | Usage | Installation | Demo | Reproducibility

About

NextiaJD, from nextia in the Nahuatl language (the old Aztec language), is a scalable data discovery system. NextiaJD computes profiles, succinct representations of the underlying characteristics of datasets and their attributes, to efficiently discover joinable attributes across datasets. The goal is to automatically discover, in a massive collection of heterogeneous datasets (i.e., a data lake), the pairs of attributes that can be crossed.

Here we provide detailed information on how to run and evaluate NextiaJD. To learn more about the project, visit our website: https://www.essi.upc.edu/dtim/nextiajd/

Key features

  • Attribute profiling built into Spark
  • A fully distributed end-to-end framework for joinable attribute discovery
  • Easy data discovery for everyone

How it works

We encourage you to read our paper to better understand what NextiaJD is and how it can fit your scenarios.

The simple way to describe it:


You have one dataset and a collection of independent datasets, and you would like to find which datasets in the collection have attributes that would yield a high-quality join with yours.

NextiaJD reduces the effort of manual exploration by predicting which attributes are join candidates, according to a set of predefined quality classes.

As examples, consider two scenarios:

  • In a data lake, a profile is computed whenever a new dataset is ingested. Then, whenever a data analyst brings in a dataset, NextiaJD can find other datasets in the lake that can be joined with it.
  • In an ordinary repository holding a few datasets, where we want to know how they can be crossed against one given dataset.

Requirements

  • Spark 3.0.1
  • Scala 2.12
  • Java 8 or 11

Installation

There are two ways to install NextiaJD on your machine: building the JARs from this repository using Maven, or downloading the precompiled NextiaJD JARs (recommended).

Build from sources

To build NextiaJD from source, follow the steps below:

  • Clone this project
$ git clone https://github.com/dtim-upc/NextiaJD
  • Go to the project root directory in a terminal
  • Run the command below. It builds the Spark Catalyst, Spark SQL, and Spark NextiaJD JARs through Maven. Note that this will take some time.
    • Alternatively, you can build the whole Spark project as specified here
$ ./build/mvn clean package -pl :spark-catalyst_2.12,:spark-sql_2.12,:spark-nextiajd_2.12 -DskipTests 
  • If the build succeeds, you can find the compiled jars under the following directories:
    • /sql/nextiajd/target/spark-nextiajd_2.12-3.0.1.jar
    • /sql/core/target/spark-sql_2.12-3.0.1.jar
    • /sql/catalyst/target/spark-catalyst_2.12-3.0.1.jar
  • Copy the compiled JARs into the jars folder of your Spark installation, e.g. $SPARK_HOME/jars (replacing existing JARs if necessary)
  • You are now ready to use NextiaJD

Download the compiled JARs

To install the precompiled JARs, follow the steps below:

  • First, download the compiled JARs using these links:
  • Copy the downloaded JARs into the jars folder of your Spark installation, e.g. $SPARK_HOME/jars (replacing existing JARs if necessary)
  • You are now ready to use NextiaJD
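
With the JARs in place, NextiaJD is available from a regular Spark session. As a quick sanity check (a minimal sketch, assuming a standard Spark setup and the import used later in this README), start a Spark shell and verify that the NextiaJD entry point resolves:

$ spark-shell
// inside the shell: if this import resolves, the JARs were picked up correctly
import org.apache.spark.sql.NextiaJD.discovery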

Usage

Attribute profiling

To profile a dataset, call the method attributeProfile() on a DataFrame object. By default, once a profile is computed it is saved in the dataset's directory, which allows the profile to be reused in future discoveries without computing it again. While you can use any dataset format, we recommend Parquet files, which make profile computation faster.

val dataset = spark.read.csv(...)
// computes the attribute profile
dataset.attributeProfile()
// returns a DataFrame with the profile information
dataset.getAttributeProfile()
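
For example, profiling a CSV end to end might look as follows (a sketch: the file path and CSV reader options are illustrative, not part of NextiaJD):

// read a CSV with a header row (path and options are illustrative)
val dataset = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/lake/companies.csv")

// computed once and persisted in the dataset's directory
dataset.attributeProfile()

// later calls reuse the stored profile instead of recomputing it
dataset.getAttributeProfile().show()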

Join Discovery

Our join discovery focuses on the expected quality of the resulting join statement. To that end, we define a totally-ordered set of quality classes (see the sketch after this list for a concrete illustration):

  • High: attribute pairs with a containment similarity of at least 0.75 and a cardinality proportion of at most 4.
  • Good: attribute pairs with a containment similarity of at least 0.5 and a cardinality proportion of at most 8.
  • Moderate: attribute pairs with a containment similarity of at least 0.25 and a cardinality proportion of at most 12.
  • Poor: attribute pairs with a containment similarity of at least 0.1.
  • None: otherwise.
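
To make the thresholds concrete, the following self-contained sketch classifies a pair of attributes. It assumes containment similarity is the fraction of A's distinct values that also appear in B, and cardinality proportion is the ratio between the larger and smaller distinct-value counts; qualityClass is a hypothetical helper for illustration, not part of the NextiaJD API:

// hypothetical helper illustrating the quality classes above
def qualityClass(a: Set[String], b: Set[String]): String = {
  // containment similarity: fraction of A's distinct values also present in B
  val containment = a.intersect(b).size.toDouble / a.size
  // cardinality proportion: larger distinct-value count over the smaller one
  val cardProp = math.max(a.size, b.size).toDouble / math.min(a.size, b.size)
  if (containment >= 0.75 && cardProp <= 4) "High"
  else if (containment >= 0.5 && cardProp <= 8) "Good"
  else if (containment >= 0.25 && cardProp <= 12) "Moderate"
  else if (containment >= 0.1) "Poor"
  else "None"
}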

You can start a discovery with the function discovery() from org.apache.spark.sql.NextiaJD. As an example, the following code starts a discovery to find any attribute from our dataset that can be used to join it with some dataset from the repository.

import org.apache.spark.sql.NextiaJD.discovery

val dataset = spark.read.csv(...)
val repository = ... // the collection of datasets to search

discovery(dataset, repository)

By default, only candidate attributes that yield High or Good quality joins are shown. To explore Moderate and Poor results, the discovery function has the boolean parameters showModerate and showPoor; when one of them is enabled, the discovery only shows results of that quality.
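
For example, to inspect Moderate candidates (treating showModerate as a named boolean parameter is an assumption based on the description above):

// sketch: show Moderate-quality candidates
discovery(dataset, repository, showModerate = true)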

Demo (Zeppelin Notebook)

Check out the demo project for a quick example of how NextiaJD works. Bear in mind that, in order to access the notebooks, you must first log in with the following credentials (user: user1, password: nextiajd).

Note that we also have a step-by-step notebook, which can be found here.

More information and a video can be found here.

Reproducibility of Experiments

We performed different experiments to evaluate the predictive performance and efficiency of NextiaJD. In the spirit of open research and experimental reproducibility, we provide detailed information on how to reproduce them, which can be found here.
