Aurum helps users identify relevant content among multiple data sources, which may consist of tabular files, such as CSV, and relational tables. These may be stored in relational database management systems (RDBMS) or file systems, and they may live in cloud services, data lakes, or other on-premise repositories.
Aurum helps you find data through different interfaces. The most flexible one is an API of primitives that can be composed to build queries that describe the data of interest. For example, you can write a query that says "find tables that contain a column named 'ID' and have at least one column that looks similar to a given input column". You can also query with very simple primitives, such as "find columns that contain the keyword 'caffeine'", or ask more complex questions, such as which tables join with a table of interest. The idea is that the API is flexible enough to support a wide range of use cases and that it works over all the data you feed into the system, regardless of where that data lives.
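To give a flavor of how primitives compose, here is a minimal sketch; the primitive and method names are hypothetical placeholders, not the exact API, which is described in the front-end API section below.

# Hypothetical sketch -- primitive and method names are illustrative only.
cols = api.keyword_search("caffeine")     # simple query: columns that contain the keyword
cols.print_columns()                      # inspect the matching columns
joins = api.content_similar_to(cols)      # more complex: columns with similar content, i.e., join candidates
joins.print_tables()                      # tables those columns belong to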
Aurum consists of three independent modules that work together to achieve all of the above. We briefly explain each module next:
- DDProfiler: The ddprofiler is in charge of reading the data from wherever it lives (e.g., CSV files, tables, the cloud, or an on-premise lake) and creating a set of summaries that succinctly represent the data in a way that allows us to discover it later. All the data summaries are stored in a store, which at the moment is Elasticsearch.
- Model Builder: The model builder is in charge of creating a model that can respond to the different user queries. To build this model, networkbuildercoordinator.py reads the data summaries created by the profiler from the store (Elasticsearch) and outputs the model to another store, which for now is simply a pickle serialization.
- Front-end API: Last, the front-end API contains the primitives and utilities that allow users to create discovery queries. The API is configured with the path to an existing model, which represents some underlying data. The API primitives are then combined and query both Elasticsearch and the model to answer users' queries.
This project is a work in progress. We give some detail on how to use each module below. Note that this will change often as development continues.
git clone git@github.com:mitdbg/aurum-datadiscovery.git
cd aurum-datadiscovery
Next we explain how to configure the modules to get a barebones installation. We do this in a series of three stages.
The profiler is built in Java (you can find it under /ddprofiler). Its input is the set of data sources (files and tables) to analyze; its output is stored in Elasticsearch. Next, you can find instructions to build and deploy the profiler as well as to install and configure Elasticsearch.
You will need JVM 8 available on the system for this step. From the root directory, go to 'ddprofiler' and do:
$> cd ddprofiler
$> bash build.sh
Download the software (note the currently supported version is 6.0.0) from:
https://www.elastic.co/products/elasticsearch
Uncompress it and then simply run the following from the Elasticsearch root directory:
$> ./bin/elasticsearch
That will start the server on localhost:9200 by default, which is the address you should use to configure the ddprofiler, as we show next.
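As an optional sanity check, you can verify from Python that Elasticsearch is reachable (assuming the requests package is installed):

import requests

# A running Elasticsearch node answers on its root endpoint with basic cluster info.
r = requests.get("http://localhost:9200")
print(r.status_code)   # expect 200
print(r.json())        # cluster name, version, etc.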
There are two different ways of interacting with the profiler. One is through a YAML file, which describes and configures the different data sources to profile. The second is through an interactive interface, which we are currently working on. We describe next how to configure sources through the YAML file.
The jar file produced in the previous step accepts a number of flags, of which the most relevant one is:
--sources: Accepts a path to a YAML file in which to configure access to the different data sources, e.g., a folder with CSV files or a JDBC-compatible RDBMS.
You can find an example template file here, which contains documentation explaining how to use it.
A typical usage of the profiler from the command line will look like:
Example:
$> bash run.sh --sources <path_to_sources.yml>
You can consult all configuration parameters by appending --help or <?> as a parameter. In particular, you may be interested in changing the default Elasticsearch ports (consult --store.http.port and --store.port) in case your installation does not use the default ones.
Note that, although the YAML file accepts any number of data sources, at the moment we recommend profiling a single source at a time. You can, however, run ddprofiler as many times as necessary, each time with a YAML file configured for a different data source. For example, if you want to index a repository of CSV files and an RDBMS, you will need to run ddprofiler twice, each run configured to read the data from one of the sources. All data summaries will be created and stored in Elasticsearch; just make sure to edit the YAML file appropriately each time.
Once you have used the ddprofiler to create data summaries of all the data sources you want, the second stage will read those and create a model. We briefly explain next the requirements for running the model builder.
As is typical with Python deployments, we recommend using a virtual environment (see virtualenv) so that you can quickly wipe out the environment if you no longer need it, without affecting any system-wide dependencies.
Requires Python 3 (tested with 3.4.2, 3.5.0 and 3.5.1). Use requirements.txt to install all the dependencies:
$> pip install -r requirements.txt
On a vanilla Linux (Debian-based) system, the following packages will need to be installed system-wide:
sudo apt-get install \
pkg-config libpng-dev libfreetype6-dev `#(requirement of matplotlib)` \
libblas-dev liblapack-dev `#(speeding up linear algebra operations)` \
lib32ncurses5-dev
Some notes for Mac users:
If you run within a virtual environment, Matplotlib will fail due to a mismatch with the backend it wants to use. A way of fixing this is to create a file ~/.matplotlib/matplotlibrc and add a single line: backend: TkAgg.
Note that you need to use Elasticsearch 6.0.0 with the current version.
The model builder is executed from 'networkbuildercoordinator.py', which takes exactly one parameter, --opath, that expects a path to an existing folder where you want to store the built model (in the form of Python pickle files). For example:
$> python networkbuildercoordinator.py --opath test/testmodel/
Once the model is built, it will be serialized and stored in the provided path.
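If you want to double-check the output, the folder passed to --opath should now contain the serialized model files, e.g.:

import os

# List the files produced by the model builder (path taken from the example above).
print(os.listdir("test/testmodel/"))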
The file ddapi.py is the core implementation of Aurum's API. One easy way to access it is to deserialize a desired model and construct an API object with that model. The easiest way to do so is by importing the init_system() function from main. Something like:
from main import init_system
api, reporting = init_system(<path_to_serialized_model>, reporting=False)
The last parameter of init_system, reporting, controls whether you want to create a reporting API that gives you access to statistics about the model. Feel free to say yes, but beware that it may take a long time when the models are big.
The discovery API consists of a collection of primitives that can be combined to write more complex data discovery queries. Consider a scenario in which you want to identify buildings at MIT. There is a discovery primitive to search for specific values in a column, e.g., "Stata Center". There is another primitive to find columns with a specific schema name, e.g., "Building Name". If you use either of them individually, you may find a lot of values, with only a subset being relevant, e.g., many organizations may have a table that contains a column named "Building Name". Combining both makes the query more specific and therefore narrows down the qualifying data, hopefully yielding relevant results; see the sketch below.
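A minimal sketch of this scenario follows; the primitive names are assumptions used for illustration (the exact names may differ, so inspect the API Handler described below):

# Hypothetical sketch -- names are illustrative, not the exact API.
values = api.keyword_search("Stata Center")        # columns whose values contain 'Stata Center'
names = api.schema_name_search("Building Name")    # columns whose attribute name matches
mit_buildings = values.intersection(names)         # keep only data that satisfies both predicates
mit_buildings.print_tables()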
To use the discovery API it is useful to know about the primitives available and about two special objects that we use to connect the primitives together and help you navigate the results. These objects are the API Handler and the Discovery Result Set (DRS). We describe them both next:
API Handler: This is the object that you obtain when initializing the API, that is:
api, reporting = init_system(<path_to_serialized_model>, reporting=False)
The API Handler gives you access to the different primitives available in the system, so it should be the first object to inspect when learning how to use the system.
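Because the handler is a plain Python object, standard introspection is an easy way to list the primitives it exposes:

# After initializing the handler as shown above:
print([name for name in dir(api) if not name.startswith("_")])   # available primitives and utilities
help(api)   # built-in Python help on the handler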
The Discovery Result Set (DRS) is an object that essentially represents data within the discovery system. For example, by creating a DRS over a table in a storage system, we create a reference to that table that can be used with the primitives. If, for example, we want to identify columns similar to a column A of interest, we first need a reference to column A that we can use in the API. That reference is a DRS, and we provide several primitives to obtain such references. Then, if we run a similarity primitive on column A, the results will also be available in a DRS object --- this is what allows us to combine primitives arbitrarily.
DRS objects have a few functions that help inspect their content, for example, to print the tables or the columns they represent. The more nuanced aspect of a DRS is that it has an internal state that determines whether it represents tables or columns. This is the most important aspect to understand about the Aurum discovery API. We explain it in some detail next:
Consider the intersection primitive, which combines two DRS by taking their intersection, e.g., similar content and similar schema. It is possible to intersect at the table level (tables that appear in both DRS) or at the column level (columns that appear in both), and this is achieved by setting the state of the input DRS to table or column.
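The following sketch illustrates the idea, assuming hypothetical method names such as set_table_mode, set_fields_mode, and intersection; consult the handler for the exact names in your build.

# Hypothetical sketch -- method names are assumptions for illustration.
# column_a is a DRS referencing the column of interest, obtained with another primitive beforehand.
similar_content = api.similar_content_to(column_a)       # columns whose content resembles column_a
similar_schema = api.similar_schema_name_to(column_a)    # columns whose names resemble column_a's

# Intersect at the table level: tables that appear in both DRS.
similar_content.set_table_mode()
similar_schema.set_table_mode()
common_tables = similar_content.intersection(similar_schema)
common_tables.print_tables()

# Intersect at the column level instead: columns that appear in both DRS.
similar_content.set_fields_mode()
similar_schema.set_fields_mode()
common_columns = similar_content.intersection(similar_schema)
common_columns.print_columns()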
Soon...