For full documentation, please check ML-Navigator.
ML-Navigator is a tutorial-based Machine Learning framework. The main component of ML-Navigator is the flow. A flow is a collection of compact methods/functions that can be chained together, with guidance texts.
The flow functions as a map that shows the road from point A to point B. The guidance texts function as navigator instructions that help the user figure out the next step after executing the current one.
As with a car navigator, the user is not forced to follow the path. At any point, the user can take a break to explore the data, modify the features, and make any necessary changes. The user can always come back to the main path that the flow defines.
The flows are created by the community for the community.
ML-Navigator is a free, editable collection of maps of the data science world.
ML-Navigator standardizes the exchange of knowledge among data scientists. Junior data scientists can learn and apply data science best practices in a white-box mode. Senior data scientists can automate many repetitive processes and share their experience effectively. Enterprises can standardize data science across different departments.
Data science has been attracting smart people with different backgrounds, experiences, and field knowledge. The journey of transforming from another discipline into data science is not a straightforward process. It consumes time and effort, depending on the motivation, the background, and the industry in which the data scientist wants to work. New joiners to data science follow multiple paths to sharpen their skills. Some of these paths are:
- Online courses: Some e-learning platforms, e.g., LinkedIn Learning, provide practical courses for solving specific data science problems. Other platforms, such as Coursera, offer theory-based courses. On LinkedIn Learning alone, there are 400+ courses related to data science. Selecting the best among these numerous courses is a challenge for newbies. Moreover, the theory-based courses require solid mathematical knowledge, especially in calculus and linear algebra.
- Data science online platforms: Platforms such as Kaggle offer playground and prize-based competitions. Junior data scientists can learn a lot by applying their knowledge and reading kernels, which data scientists write to share their experience. However, poorly written code and a lack of documentation can be frustrating for newbies who want to learn what happened behind the scenes.
- Manuals of well-known data science frameworks: Many open-source frameworks provide industry-proven implementations of methods commonly used by data scientists. However, many of these frameworks don't share the same syntax, so data scientists may need to learn new syntax each time they switch to a new framework.
- Learning from senior data scientists: Onboarding junior data scientists may require time that senior data scientists don't always have.
A new joiner is a person who wants to move into data science from a different discipline. A new joiner can also be a person who wants to be part of the data team without being a full-time data scientist, e.g., a developer with sufficient coding skills. ML-Navigator provides data science new joiners with a path to analyzing real data. It helps the user navigate through predefined flows, which are end-to-end data science pipelines. The user can load a specific flow and follow the instructions, starting from reading the data until training the model. The user can start with the most straightforward flow and later use more complicated flows to train more accurate models if needed.
Experienced data scientists may be interested in automating many of the processes that they follow frequently. They can build a flow for each specific problem type. A flow can be created from scratch or by modifying or combining other flows. They can share their flows with the community and exchange experience with other data scientists.
ML-Navigator can standardize the data science experience in large enterprises. Junior data scientists can be productive and efficient from day one. The onboarding process can be fast and concrete rather than abstract.
Data scientists may use AutoML to produce multiple types of models as an alternative to digging deep into the data and gaining new knowledge. AutoML can create a large number of models, but it doesn't guarantee a model that satisfies the quality requirements. It needs a long time to test a wide range of hyperparameter values, and model reproducibility can be an issue when creating models with AutoML.
To install the ML-Navigator package, you need to have Python 3.6.

You can install ML-Navigator directly using the `pip` tool:

```
pip install ML-Navigator
```
To install the ML-Navigator package from the GitHub repo:

- Clone the git repository:

```
$ git clone https://github.com/KI-labs/ML-Navigator.git
$ cd ML-Navigator
```

- Create a directory under the name "data" and move your data files into it, e.g. "train.csv" and "test.csv".

- Create a virtual environment:

```
$ pip install virtualenv
$ virtualenv venv
$ source venv/bin/activate
```

- After setting up the virtual environment, install the package using the pip command as follows:

```
$ pip install .
```
**IMPORTANT:** On macOS, you may have a problem loading the lightgbm library; in that case, please install it with brew:

```
brew install lightgbm
```
The structure of the directories looks like the following:

```
.
├── LICENSE
├── setup.py
├── MANIFEST.in
├── data
│   ├── flow_0
│   ├── flow_1
│   ├── flow_2
│   └── flow_3
├── feature_engineering
│   ├── __init__.py
│   ├── feature_generator.py
│   └── test.py
├── flows
│   ├── __init__.py
│   ├── example.yaml
│   ├── flow_0.drawio
│   ├── flow_0.json
│   ├── flow_1.drawio
│   ├── flow_1.json
│   ├── flow_2.drawio
│   ├── flow_2.json
│   ├── flow_3.drawio
│   ├── flow_3.json
│   ├── flows.py
│   ├── utils.py
│   ├── text_helper.py
│   └── yaml_reader.py
├── images
│   ├── flow_0_record_middle_size.gif
│   └── logo.png
├── prediction
│   ├── __init__.py
│   └── model_predictor.py
├── preprocessing
│   ├── README.md
│   ├── __init__.py
│   ├── data_clean.py
│   ├── data_explorer.py
│   ├── data_science_help_functions.py
│   ├── data_transformer.py
│   ├── data_type_detector.py
│   ├── json_preprocessor.py
│   ├── test_loading_data.py
│   ├── test_preprocessing.py
│   └── utils.py
├── readme.md
├── requirements.txt
├── training
│   ├── __init__.py
│   ├── model_evaluator.py
│   ├── optimizer.py
│   ├── test_split.py
│   ├── test_training.py
│   ├── training.py
│   ├── utils.py
│   └── validator.py
├── tutorials
│   ├── flow_0.png
│   ├── flow_0.ipynb
│   ├── flow_1.ipynb
│   ├── flow_1.png
│   ├── flow_2.ipynb
│   ├── flow_2.png
│   ├── flow_3.ipynb
│   └── flow_3.png
├── venv
└── visualization
    ├── __init__.py
    └── visualization.py
```
Create a directory under the name "data" inside the project root directory.
To run the tutorials, you can download the "train.csv" and "test.csv" datasets from the Kaggle website:
* inside `./data/flow_0` and `./data/flow_1`, store the data from the [House Prices: Advanced Regression Techniques competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)
* inside the `./data/flow_2` directory, store the data from the [TMDB Box Office Prediction competition](https://www.kaggle.com/c/tmdb-box-office-prediction)
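For example, after downloading the competition files, the layout can be created like this (the download locations below are placeholders; use wherever you saved the Kaggle files):

```
$ mkdir -p data/flow_0 data/flow_1 data/flow_2
$ cp ~/Downloads/house-prices/train.csv ~/Downloads/house-prices/test.csv data/flow_0/
$ cp ~/Downloads/house-prices/train.csv ~/Downloads/house-prices/test.csv data/flow_1/
$ cp ~/Downloads/tmdb-box-office/train.csv ~/Downloads/tmdb-box-office/test.csv data/flow_2/
```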
Please check the tutorials directory: Tutorials
For more information please check the documentation: ML-Navigator
- `flows`: The core module of the ML-Navigator framework. It contains the flows, which are defined using YAML and JSON files. The `Flows` class has multiple methods that get their functionality from calling other packages (self-implemented or external).
- `preprocessing`: Contains the main functions and classes used to prepare the data for further processing, such as discovering the type of data in each column, encoding categorical features, and scaling numeric features.
- `feature_engineering`: Contains functions and classes to produce new features, for example, one-hot encoding.
- `visualization`: Contains functions and classes to produce graphs. For example, there are functions for comparing the statistical properties of different datasets, visualizing the count of missing values, and drawing histograms.
- `training`: Contains the functions and classes to train Machine Learning models. Currently, there are two regression models: Ridge linear regression (scikit-learn) and LightGBM.
- `prediction`: Contains the functions and classes to predict the target using the pre-trained models. Currently, all trained models are saved locally.
- `logs`: Contains the log messages that are produced by the different modules of the framework.
- `models`: Contains the trained models, saved in pkl format.
- `venv`: The virtual environment that should be created by the user.
- `data`: Contains the data, e.g. `train.csv` and `test.csv`.
- `requirements.txt`: Contains a list of all packages that are required to run the framework.
Loading the data is straightforward: you point to the location of your data and the names of the datasets after loading a particular flow. Currently, the framework supports reading CSV files only. Here is an example:

```python
path = './data'
files_list = ['train.csv', 'test.csv']
```
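As a sketch of how these variables are used (based on the flow tutorials; the constructor and the exact return values of `load_data` are assumptions if your version differs):

```python
from flows.flows import Flows

# Load a predefined flow by its ID, e.g. flow_0 (assumed constructor signature).
flow = Flows(flow_id=0)

path = './data'
files_list = ['train.csv', 'test.csv']

# load_data reads the CSV files and prints the guidance text for the next step.
dataframe_dict, columns_set = flow.load_data(path, files_list)
```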
For the first version, we have not yet prepared a tool for creating a flow in a convenient way. However, this is one of the main areas of focus for the near future.
The flow should contain two elements:
- Visualization: I show a flow as a flowchart. I use a free online tool called draw.io to draw the chart. Feel free to use any other tool or method to visualize a flow. You can use the `drawio` files provided in the flows module to create visualizations of new flows.
- Guidance text: I use YAML files to define the guidance instructions for the flows. Currently, this method is not scalable, and it requires setting the steps manually. In the future, the flows will be created using a user interface, and they will be saved in a database using a unique key for each flow.
Each method in the `Flows` class in `flows/flows.py` has an ID. Each ID is defined as a string using the variable `function_id`. For example, the method `load_data` has `function_id = "0"`.
The instruction texts are stored in `flows/flow_instructions_database.yaml`. Each instruction text is defined using an integer ID and has two keys:

- `function`: describes the function or the method that is defined in `flows/flows.py`.
- `guide`: the guidance text that describes how to use the defined function or method.
To build a flow, you need to create a JSON file, e.g. `flow_0.json`, that maps each `function_id` as a key to the ID of the guidance text defined in `flows/flow_instructions_database.yaml`.
**IMPORTANT:** In `flow_x.json`:

- `function_id` refers to the ID of the currently running function.
- The ID of the guidance text in `flow_instructions_database.yaml` refers to the function that should be executed after the currently running function that has the `function_id`.

```
{"function_id of running function": "the ID of the guidance text of the function that should be executed next"}
```
An example of the mapping is `{"0": 1}`, where `function_id = "0"` refers to the `load_data` method, and the value 1 refers to the ID of the `Encode categorical features` guidance text, which should run after the `load_data` method.
```json
{
  "0": 1,
  "1": 2,
  "2": 3,
  "4": 1000
}
```

The translation of the JSON object is as follows:

```
{
  "the current function: load the data": "the next function: Encode categorical features",
  "the current function: Encode categorical features": "the next function: Scale numeric features",
  "the current function: Scale numeric features": "the next function: Train a model",
  "the current function: Train a model": "the next function: Finish or nothing, which indicates the end of the flow"
}
```
You can create your own method inside the `Flows` class in `flows/flows.py` and assign a unique ID to it by defining the variable `function_id`.
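A minimal sketch of what such a method could look like (the method name, its body, and the way the guidance text is printed are assumptions for illustration, not part of the shipped framework):

```python
class Flows:
    # ... existing methods such as load_data are omitted here ...

    def drop_constant_columns(self, dataframe_dict: dict) -> dict:
        """Hypothetical new flow step: drop columns that hold a single unique value.

        dataframe_dict is assumed to map dataset names (e.g. "train") to
        pandas DataFrames, as returned by load_data.
        """
        function_id = "5"  # a unique string ID not used by any other method yet

        for name, dataframe in dataframe_dict.items():
            constant = [c for c in dataframe.columns if dataframe[c].nunique() <= 1]
            dataframe_dict[name] = dataframe.drop(columns=constant)

        # The framework would now look up function_id in flow_x.json and print
        # the mapped guidance text from flow_instructions_database.yaml
        # (mechanism simplified here).
        return dataframe_dict
```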
Inside `flow_instructions_database.yaml`, you can create your own guidance text for already existing methods or for new methods. You should assign a unique ID to each newly created guidance text. Please include the `function` and `guide` keys to help other users understand and find your guidance text easily. The `function` key is optional, but `guide` is required. You can create multiple new guidance texts for the same defined function, but each guidance text should have a unique ID.
When creating a flow, the essential information to add at the end of each step is what the next step is. Adding an example that shows how to perform the next step and which variables are required is beneficial to the user.
Your contributions are always welcome and appreciated. The following are things you can do to contribute to this project.
If you think you have encountered a bug and I should know about it, feel free to report it here, and I will take care of it.
You can also request a feature here, and if it is viable, it will be picked up for development.
It can't get better than this: your pull request will be appreciated by the community. You can get started by picking up any of the open issues from here and making a pull request.
If you want to submit a flow, please provide the following in your pull request:

- `flow_x.drawio` and `flow_x.png`, where x is an integer that has not been given to another flow yet. Please check the flows module `./flows`.
- `flow_x.json`, where x is an integer that has not been given to another flow yet and has the same value as in `flow_x.drawio`.
- `flow_x.ipynb`, where x is an integer that has not been given to another flow yet and has the same value as in `flow_x.drawio` and `flow_x.json`. The Jupyter Notebook `flow_x.ipynb` should work end-to-end without any errors.
- Make a PR to the master branch.
- Comply with the best practices and guidelines.
- It must pass all continuous integration checks and get positive reviews.
- After this, changes will be merged.
Copyright 2019 KI labs GmbH
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.