Add SapientML to automl benchmark #630

Merged
merged 3 commits into from
Nov 15, 2024
Changes from 2 commits
79 changes: 79 additions & 0 deletions docs/website/frameworks.html
Original file line number Diff line number Diff line change
@@ -944,6 +944,85 @@ <h3 class="paper-title">
</svg>
</label>
</div>
<div class="accordion acard">
<div class="framework-header">
<img src="img/logos/Sapientml_favicon.ico" height="28px" />
<h3>SapientML</h3>
<div class="framework-links">
<a href="https://github.com/sapientml/sapientml" target="_blank"
><img src="img/logos/GitHub-Mark-64px.png" height="24px"
/></a>
<a href="https://sapientml.readthedocs.io/en/latest/#" target="_blank"
>📖</a
>
</div>
</div>
<div>
SapientML is an AutoML technology that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset.
</div>
<input type="checkbox" id="more-SapientML" class="accordion-input" />
<div class="accordion-content">
<div class="paper">
<h3 class="paper-title">
SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions
</h3>
<div class="paper-authors">
Ripon K. Saha, Akira Ura, Sonal Mahajan, Chenguang Zhu, Linyi Li, Yang Hu, Hiroaki Yoshida, Sarfraz Khurshid, Mukul R. Prasad
</div>
<div class="paper-abstract">
Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML), by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques, generate sub-optimal pipelines, or none at all, especially on large,
complex datasets. In this work we propose an AutoML technique SapientML, that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML,
SapientML employs a novel divide-and-conquer strategy realized as a three-stage program synthesis approach, that reasons on successively smaller search spaces. The first stage uses a machine-learned model to predict a set of plausible ML components to constitute a pipeline. In the second stage, this is then refined into a small pool of viable concrete pipelines using syntactic constraints derived from the corpus and the machine-learned model. Dynamically evaluating these few pipelines, in the third stage, provides the best solution. We instantiate SapientML as part of a fully automated tool-chain that creates a cleaned, labeled learning corpus by mining Kaggle, learns from it, and uses the learned models to then synthesize pipelines for new predictive tasks. We have created a training corpus of 1094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets, including 10 new, large, real-world datasets from Kaggle, and against 3 state-of-the-art AutoML tools and 2 baselines. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks while the second best tool fails to even produce a pipeline on 9 of the instances.
</div>
<div class="paper-links">
<div class="hover-expand">
<strong>2022</strong>
<div>
ICSE '22: Proceedings of the 44th International Conference on Software Engineering, May 2022, Pages 1932–1944
</div>
</div>
<a href="https://arxiv.org/pdf/2202.10451.pdf" target="_blank"
>PDF</a
>
<a
href="https://arxiv.org/abs/2202.10451"
target="_blank"
>arxiv</a
>
</div>
</div>
</div>
<label for="more-SapientML">
<svg
xmlns="http://www.w3.org/2000/svg"
class="accordion-chevron-down accordion-icon"
fill="none"
viewBox="0 0 24 24"
stroke="currentColor"
stroke-width="2"
>
<path
stroke-linecap="round"
stroke-linejoin="round"
d="M19 9l-7 7-7-7"
/>
</svg>
<svg
xmlns="http://www.w3.org/2000/svg"
class="accordion-chevron-up accordion-icon"
fill="none"
viewBox="0 0 24 24"
stroke="currentColor"
stroke-width="2"
>
<path
stroke-linecap="round"
stroke-linejoin="round"
d="M5 15l7-7 7 7"
/>
</svg>
</label>
</div>
</section>
</div>
</body>
25 changes: 25 additions & 0 deletions frameworks/SapientML/__init__.py
@@ -0,0 +1,25 @@
from amlb.benchmark import TaskConfig
from amlb.data import Dataset
from amlb.utils import call_script_in_same_dir


def setup(*args, **kwargs):
    call_script_in_same_dir(__file__, "setup.sh", *args, **kwargs)


def run(dataset: Dataset, config: TaskConfig):
    from frameworks.shared.caller import run_in_venv

    data = dict(
        train=dict(path=dataset.train.data_path("csv")),
        test=dict(path=dataset.test.data_path("csv")),
        target=dict(name=dataset.target.name, classes=dataset.target.values),
        problem_type=dataset.type.name,
    )
    return run_in_venv(
        __file__,
        "exec.py",
        input_data=data,
        dataset=dataset,
        config=config,
    )
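
A note on the `data` payload built in `run()`: `exec.py` reads it with attribute access (`dataset.train.path`, `dataset.target.classes`, `dataset.problem_type`), so the nested dict evidently reaches the callee as a namespace-like object. The sketch below only illustrates that shape. `to_namespace` is a hypothetical helper, not amlb's actual conversion code, and the paths and class labels are made-up placeholders.

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively turn nested dicts into attribute-accessible objects.

    Hypothetical helper for illustration only; amlb's real mechanism may differ.
    """
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    return value


# Placeholder values mirroring the keys built in run().
data = dict(
    train=dict(path="/tmp/train.csv"),
    test=dict(path="/tmp/test.csv"),
    target=dict(name="class", classes=["no", "yes"]),
    problem_type="binary",
)

dataset = to_namespace(data)
print(dataset.train.path, dataset.target.name, dataset.problem_type)
```

Keeping the payload to plain strings and lists, as the PR does, means it can be handed to the separate virtual environment without either side needing the other's classes.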
97 changes: 97 additions & 0 deletions frameworks/SapientML/exec.py
@@ -0,0 +1,97 @@
import logging
import os
import tempfile as tmp

from frameworks.shared.callee import call_run, result
from frameworks.shared.utils import Timer
from sapientml import SapientML
from sapientml.util.logging import setup_logger
from sklearn.preprocessing import OneHotEncoder

os.environ["JOBLIB_TEMP_FOLDER"] = tmp.gettempdir()
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"


log = logging.getLogger(__name__)


def run(dataset, config):
    import re

    import pandas as pd

    log.info("\n**** Sapientml ****\n")

    is_classification = config.type == "classification"
    is_multiclass = dataset.problem_type == "multiclass"
    training_params = {k: v for k, v in config.framework_params.items() if not k.startswith("_")}

    train_path, test_path = dataset.train.path, dataset.test.path
    target_col = dataset.target.name

    # Read the CSV files using pandas
    X_train = pd.read_csv(train_path)
    X_test = pd.read_csv(test_path)

    # Remove unwanted symbols from column names (exception case)
    X_train.columns = [re.sub("[^A-Za-z0-9_.]+", "", col) for col in X_train.columns]
    X_test.columns = [re.sub("[^A-Za-z0-9_.]+", "", col) for col in X_test.columns]
    target_col = re.sub("[^A-Za-z0-9_.]+", "", target_col)

    # y_train and y_test
    y_train = X_train[target_col].reset_index(drop=True)
    y_test = X_test[target_col].reset_index(drop=True)

    # Drop target col from X_test
    X_test.drop([target_col], axis=1, inplace=True)

    # Sapientml
    output_dir = os.path.join(config.output_dir, "outputs", config.name, str(config.fold))
    predictor = SapientML([target_col], task_type="classification" if is_classification else "regression")

    # Fit the model
    with Timer() as training:
        predictor.fit(X_train, output_dir=output_dir)
    log.info(f"Finished fit in {training.duration}s.")

    # Predict
    with Timer() as predict:
        predictions = predictor.predict(X_test)
    log.info(f"Finished predict in {predict.duration}s.")

    if is_classification:
        # Normalize predicted and true labels the same way: string, lower-case, stripped
        predictions[target_col] = predictions[target_col].astype(str).str.lower().str.strip()
        y_test = y_test.to_frame()
        y_test[target_col] = y_test[target_col].astype(str).str.lower().str.strip()

        # One-hot encode the predicted labels to use as class probabilities.
        # Note: the encoder yields one column per label present in the predictions,
        # so the column count matches dataset.target.classes only if every class is predicted.
        probabilities = OneHotEncoder(handle_unknown="ignore").fit_transform(predictions.to_numpy())
        probabilities = pd.DataFrame(probabilities.toarray(), columns=dataset.target.classes)

        return result(
            output_file=config.output_predictions_file,
            predictions=predictions,
            truth=y_test,
            probabilities=probabilities,
            training_duration=training.duration,
            predict_duration=predict.duration,
        )
    else:
        return result(
            output_file=config.output_predictions_file,
            predictions=predictions,
            truth=y_test,
            training_duration=training.duration,
            predict_duration=predict.duration,
        )


if __name__ == "__main__":
    call_run(run)
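
One thing to watch in the classification branch of `run()`: `OneHotEncoder` is fitted on the predictions, so it emits one column per label that was actually predicted, while the column names come from `dataset.target.classes`. If a class is never predicted, the two lengths differ. Below is a minimal standard-library sketch of an alternative that always produces one column per known class. `normalize` and `one_hot_rows` are hypothetical names, and the sketch assumes the class list should be normalized the same way as the labels (string, lower-case, stripped).

```python
def normalize(label):
    """Mirror the label clean-up in run(): cast to str, lower-case, strip."""
    return str(label).lower().strip()


def one_hot_rows(predictions, classes):
    """Return one row per prediction with a 1.0 in the predicted class column.

    Columns follow the order of `classes`, so classes that are never
    predicted still get an all-zero column. Labels that are not in
    `classes` yield an all-zero row.
    """
    columns = [normalize(c) for c in classes]
    index = {c: i for i, c in enumerate(columns)}
    rows = []
    for label in predictions:
        row = [0.0] * len(columns)
        position = index.get(normalize(label))
        if position is not None:
            row[position] = 1.0
        rows.append(row)
    return columns, rows


# Example labels are placeholders; "maybe" is never predicted.
columns, rows = one_hot_rows([" Yes", "no", "yes"], classes=["no", "yes", "maybe"])
print(columns)
print(rows)
```

The rows could then be wrapped with `pd.DataFrame(rows, columns=columns)` before being passed as `probabilities`.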
1 change: 1 addition & 0 deletions frameworks/SapientML/requirements.txt
@@ -0,0 +1 @@
requests
26 changes: 26 additions & 0 deletions frameworks/SapientML/setup.sh
@@ -0,0 +1,26 @@
#!/usr/bin/env bash
HERE=$(dirname "$0")
VERSION=${1:-"stable"}
REPO=${2:-"https://github.com/sapientml/sapientml"}
PKG=${3:-"sapientml"}
if [[ "$VERSION" == "latest" ]]; then
    VERSION="main"
fi

# create local venv
. ${HERE}/../shared/setup.sh ${HERE} true

PIP install -r ${HERE}/requirements.txt
if [[ "$VERSION" == "stable" ]]; then
    PIP install --no-cache-dir -U ${PKG}
elif [[ "$VERSION" =~ ^[0-9] ]]; then
    PIP install --no-cache-dir -U ${PKG}==${VERSION}
else
    # PIP install --no-cache-dir -e git+${REPO}@${VERSION}#egg=${PKG}
    TARGET_DIR="${HERE}/lib/${PKG}"
    rm -Rf ${TARGET_DIR}
    git clone --depth 1 --single-branch --branch ${VERSION} --recurse-submodules ${REPO} ${TARGET_DIR}
    PIP install -U -e ${TARGET_DIR}
fi

PY -c "import pkg_resources; print(pkg_resources.get_distribution('sapientml').version)" >> "${HERE}/.setup/installed"
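
The branching on `VERSION` in `setup.sh` can be summarised as follows. This Python sketch only restates the shell logic for readability: `install_plan` is a hypothetical name, the returned command strings are abbreviated, the version number in the example is a placeholder, and nothing here is executed by the benchmark.

```python
import re


def install_plan(version="stable", pkg="sapientml",
                 repo="https://github.com/sapientml/sapientml"):
    """Describe which install path setup.sh takes for a given VERSION."""
    if version == "latest":
        version = "main"  # "latest" is treated as the main branch
    if version == "stable":
        return f"pip install -U {pkg}"
    if re.match(r"[0-9]", version):
        # Anything starting with a digit is treated as a PyPI version pin
        return f"pip install -U {pkg}=={version}"
    # Everything else is treated as a git branch or tag to clone
    return f"git clone --branch {version} {repo} && pip install -e <clone>"


for v in ("stable", "1.2.3", "latest", "my-branch"):
    print(v, "->", install_plan(v))
```

Note that a branch name starting with a digit would be taken as a version pin under this rule.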
5 changes: 5 additions & 0 deletions resources/frameworks.yaml
@@ -215,6 +215,11 @@ FEDOT:
# params:
# _save_artifacts: ['leaderboard', 'models', 'info']

SapientML:
  description: |
    SapientML is an AutoML tool that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset.
  project: https://github.com/sapientml/sapientml

#######################################
### Non AutoML reference frameworks ###
#######################################