From 49f4a22f0fa5a80fec3e7af7b7b78a705f677d18 Mon Sep 17 00:00:00 2001
From: Kosaku Kimura <kimura.kosaku@fujitsu.com>
Date: Mon, 29 Jul 2024 14:03:39 +0900
Subject: [PATCH 1/3] Add SapientML to automl benchmark

Signed-off-by: Kosaku Kimura <kimura.kosaku@fujitsu.com>
---
 docs/website/frameworks.html          | 79 +++++++++++++++++++++
 frameworks/SapientML/__init__.py      | 25 +++++++
 frameworks/SapientML/exec.py          | 99 +++++++++++++++++++++++++++
 frameworks/SapientML/requirements.txt |  3 +
 frameworks/SapientML/setup.sh         |  8 +++
 resources/config.yaml                 |  2 +-
 resources/frameworks.yaml             |  5 ++
 7 files changed, 220 insertions(+), 1 deletion(-)
 create mode 100644 frameworks/SapientML/__init__.py
 create mode 100644 frameworks/SapientML/exec.py
 create mode 100644 frameworks/SapientML/requirements.txt
 create mode 100755 frameworks/SapientML/setup.sh
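
Review note on frameworks/SapientML/exec.py: the classification branch builds
its probabilities by one-hot encoding the predicted labels, so the number of
columns follows the labels that happen to be predicted rather than the full
class list. The sketch below shows one way to pin the columns to the known
classes. It is only an illustration: `one_hot_probabilities` is a hypothetical
helper, not part of amlb or SapientML, and it assumes that labels can be
matched after the same lower-case/strip normalization that exec.py applies.

```python
def one_hot_probabilities(predictions, classes):
    """Return one 0/1 row per prediction, with one column per known class.

    Hypothetical helper for illustration; not part of amlb or SapientML.
    """
    # Normalize class names the same way exec.py normalizes predictions.
    column_of = {str(c).lower().strip(): i for i, c in enumerate(classes)}
    rows = []
    for label in predictions:
        row = [0.0] * len(classes)
        col = column_of.get(str(label).lower().strip())
        if col is not None:  # unknown labels keep an all-zero row
            row[col] = 1.0
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Class "c" is never predicted, yet it still gets a column.
    for row in one_hot_probabilities(["a", " B "], ["A", "b", "c"]):
        print(row)
```

The rows could then be wrapped with `pd.DataFrame(rows, columns=classes)`, so
the frame always has one column per class even when a class is never predicted.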
diff --git a/docs/website/frameworks.html b/docs/website/frameworks.html
index a2247ee4c..5e7b18451 100644
--- a/docs/website/frameworks.html
+++ b/docs/website/frameworks.html
@@ -944,6 +944,85 @@ <h3 class="paper-title">
             </svg>
           </label>
         </div>
+        <div class="accordion acard">
+          <div class="framework-header">
+            <img src="img/logos/Sapientml_favicon.ico" height="28px" />
+            <h3>SapientML</h3>
+            <div class="framework-links">
+              <a href="https://github.com/sapientml/sapientml" target="_blank"
+                ><img src="img/logos/GitHub-Mark-64px.png" height="24px"
+              /></a>
+              <a href="https://sapientml.readthedocs.io/en/latest/#" target="_blank"
+                >📖</a
+              >
+            </div>
+          </div>
+          <div>
+            SapientML is an AutoML technology that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset.
+          </div>
+          <input type="checkbox" id="more-SapientML" class="accordion-input" />
+          <div class="accordion-content">
+            <div class="paper">
+              <h3 class="paper-title">
+                SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions
+              </h3>
+              <div class="paper-authors">
+                Ripon K. Saha, Akira Ura, Sonal Mahajan, Chenguang Zhu, Linyi Li, Yang Hu, Hiroaki Yoshida, Sarfraz Khurshid, Mukul R. Prasad
+              </div>
+              <div class="paper-abstract">
+                Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML), by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques, generate sub-optimal pipelines, or none at all, especially on large,
+                complex datasets. In this work we propose an AutoML technique SapientML, that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML,
+                SapientML employs a novel divide-and-conquer strategy realized as a three-stage program synthesis approach, that reasons on successively smaller search spaces. The first stage uses a machine-learned model to predict a set of plausible ML components to constitute a pipeline. In the second stage, this is then refined into a small pool of viable concrete pipelines using syntactic constraints derived from the corpus and the machine-learned model. Dynamically evaluating these few pipelines, in the third stage, provides the best solution. We instantiate SapientML as part of a fully automated tool-chain that creates a cleaned, labeled learning corpus by mining Kaggle, learns from it, and uses the learned models to then synthesize pipelines for new predictive tasks. We have created a training corpus of 1094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets, including 10 new, large, real-world datasets from Kaggle, and against 3 state-of-the-art AutoML tools and 2 baselines. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks while the second best tool fails to even produce a pipeline on 9 of the instances.
+              </div>
+              <div class="paper-links">
+                <div class="hover-expand">
+                  <strong>2022</strong>
+                  <div>
+                    ICSE '22: Proceedings of the 44th International Conference on Software Engineering, May 2022, Pages 1932–1944
+                  </div>
+                </div>
+                <a href="https://arxiv.org/pdf/2202.10451.pdf" target="_blank"
+                  >PDF</a
+                >
+                <a
+                  href="https://arxiv.org/abs/2202.10451"
+                  target="_blank"
+                  >arxiv</a
+                >
+              </div>
+            </div>
+          </div>
+          <label for="more-SapientML">
+            <svg
+              xmlns="http://www.w3.org/2000/svg"
+              class="accordion-chevron-down accordion-icon"
+              fill="none"
+              viewBox="0 0 24 24"
+              stroke="currentColor"
+              stroke-width="2"
+            >
+              <path
+                stroke-linecap="round"
+                stroke-linejoin="round"
+                d="M19 9l-7 7-7-7"
+              />
+            </svg>
+            <svg
+              xmlns="http://www.w3.org/2000/svg"
+              class="accordion-chevron-up accordion-icon"
+              fill="none"
+              viewBox="0 0 24 24"
+              stroke="currentColor"
+              stroke-width="2"
+            >
+              <path
+                stroke-linecap="round"
+                stroke-linejoin="round"
+                d="M5 15l7-7 7 7"
+              />
+            </svg>
+          </label>
+        </div>
       </section>
     </div>
   </body>
diff --git a/frameworks/SapientML/__init__.py b/frameworks/SapientML/__init__.py
new file mode 100644
index 000000000..db17acd2b
--- /dev/null
+++ b/frameworks/SapientML/__init__.py
@@ -0,0 +1,25 @@
+from amlb.benchmark import TaskConfig
+from amlb.data import Dataset
+from amlb.utils import call_script_in_same_dir
+
+
+def setup(*args, **kwargs):
+    call_script_in_same_dir(__file__, "setup.sh", *args, **kwargs)
+
+
+def run(dataset: Dataset, config: TaskConfig):
+    from frameworks.shared.caller import run_in_venv
+
+    data = dict(
+        train=dict(path=dataset.train.data_path("csv")),
+        test=dict(path=dataset.test.data_path("csv")),
+        target=dict(name=dataset.target.name, classes=dataset.target.values),
+        problem_type=dataset.type.name,
+    )
+    return run_in_venv(
+        __file__,
+        "exec.py",
+        input_data=data,
+        dataset=dataset,
+        config=config,
+    )
diff --git a/frameworks/SapientML/exec.py b/frameworks/SapientML/exec.py
new file mode 100644
index 000000000..1c54c673a
--- /dev/null
+++ b/frameworks/SapientML/exec.py
@@ -0,0 +1,99 @@
+import logging
+import os
+import tempfile as tmp
+
+from amlb.benchmark import TaskConfig
+from amlb.data import Dataset
+from frameworks.shared.callee import call_run, result
+from frameworks.shared.utils import Timer
+from sapientml import SapientML
+from sapientml.util.logging import setup_logger
+from sklearn.preprocessing import OneHotEncoder
+
+os.environ["JOBLIB_TEMP_FOLDER"] = tmp.gettempdir()
+os.environ["OMP_NUM_THREADS"] = "1"
+os.environ["OPENBLAS_NUM_THREADS"] = "1"
+os.environ["MKL_NUM_THREADS"] = "1"
+
+
+log = logging.getLogger(__name__)
+
+
+def run(dataset, config):
+    import re
+
+    import pandas as pd
+
+    log.info("\n**** SapientML ****\n")
+
+    is_classification = config.type == "classification"
+    is_multiclass = dataset.problem_type == "multiclass"
+    training_params = {k: v for k, v in config.framework_params.items() if not k.startswith("_")}
+
+    train_path, test_path = dataset.train.path, dataset.test.path
+    target_col = dataset.target.name
+
+    # Read the CSV train/test splits using pandas
+    X_train = pd.read_csv(train_path)
+    X_test = pd.read_csv(test_path)
+
+    # Remove unwanted symbols from column names (exception case)
+    X_train.columns = [re.sub("[^A-Za-z0-9_.]+", "", col) for col in X_train.columns]
+    X_test.columns = [re.sub("[^A-Za-z0-9_.]+", "", col) for col in X_test.columns]
+    target_col = re.sub("[^A-Za-z0-9_.]+", "", target_col)
+
+    # y_train and y_test
+    y_train = X_train[target_col].reset_index(drop=True)
+    y_test = X_test[target_col].reset_index(drop=True)
+
+    # Drop target col from X_test
+    X_test.drop([target_col], axis=1, inplace=True)
+
+    # Sapientml
+    output_dir = os.path.join(config.output_dir, "outputs", config.name, str(config.fold))
+    predictor = SapientML([target_col], task_type="classification" if is_classification else "regression")
+
+    # Fit the model
+    with Timer() as training:
+        predictor.fit(X_train, output_dir=output_dir)
+    log.info(f"Finished fit in {training.duration}s.")
+
+    # predict
+    with Timer() as predict:
+        predictions = predictor.predict(X_test)
+    log.info(f"Finished predict in {predict.duration}s.")
+
+    if is_classification:
+
+        predictions[target_col] = predictions[target_col].astype(str)
+        predictions[target_col] = predictions[target_col].str.lower()
+        predictions[target_col] = predictions[target_col].str.strip()
+        y_test = y_test.to_frame()
+        y_test[target_col] = y_test[target_col].astype(str)
+        y_test[target_col] = y_test[target_col].str.lower()
+        y_test[target_col] = y_test[target_col].str.strip()
+
+    if is_classification:
+        probabilities = OneHotEncoder(handle_unknown="ignore").fit_transform(predictions.to_numpy())
+        probabilities = pd.DataFrame(probabilities.toarray(), columns=dataset.target.classes)
+
+        return result(
+            output_file=config.output_predictions_file,
+            predictions=predictions,
+            truth=y_test,
+            probabilities=probabilities,
+            training_duration=training.duration,
+            predict_duration=predict.duration,
+        )
+    else:
+        return result(
+            output_file=config.output_predictions_file,
+            predictions=predictions,
+            truth=y_test,
+            training_duration=training.duration,
+            predict_duration=predict.duration,
+        )
+
+
+if __name__ == "__main__":
+    call_run(run)
diff --git a/frameworks/SapientML/requirements.txt b/frameworks/SapientML/requirements.txt
new file mode 100644
index 000000000..98b5e3045
--- /dev/null
+++ b/frameworks/SapientML/requirements.txt
@@ -0,0 +1,3 @@
+sapientml
+openml
+boto3==1.26.98
\ No newline at end of file
diff --git a/frameworks/SapientML/setup.sh b/frameworks/SapientML/setup.sh
new file mode 100755
index 000000000..4de55b470
--- /dev/null
+++ b/frameworks/SapientML/setup.sh
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+HERE=$(dirname "$0")
+
+#create venv
+. ${HERE}/.setup/setup_env
+. ${HERE}/../shared/setup.sh ${HERE} true
+PIP install --upgrade pip
+PIP install --no-cache-dir -r $HERE/requirements.txt
\ No newline at end of file
diff --git a/resources/config.yaml b/resources/config.yaml
index d9976b1f6..4d0c5a457 100644
--- a/resources/config.yaml
+++ b/resources/config.yaml
@@ -102,7 +102,7 @@ openml:                # configuration namespace for openML.
 
 versions:              # configuration namespace for versions enforcement (libraries versions are usually enforced in requirements.txt for the app and for each framework).
   pip:
-  python: 3.9          # the Python minor version that will be used by the application in containers and cloud instances, also used as a based version for virtual environments created for each framework.
+  python: 3.11          # the Python minor version that will be used by the application in containers and cloud instances, also used as a based version for virtual environments created for each framework.
 
 container: &container          # parent configuration namespace for container modes.
   force_branch: true           # set to true if image can only be built from a clean branch, with same tag as defined in `project_repository`.
diff --git a/resources/frameworks.yaml b/resources/frameworks.yaml
index 7a71e54ce..7c9965276 100644
--- a/resources/frameworks.yaml
+++ b/resources/frameworks.yaml
@@ -215,6 +215,11 @@ FEDOT:
 #  params:
 #    _save_artifacts: ['leaderboard', 'models', 'info']
 
+SapientML:
+  description: |
+    SapientML is an AutoML tool that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset.
+  project: https://github.com/sapientml/sapientml
+
 #######################################
 ### Non AutoML reference frameworks ###
 #######################################

From 5b55360a8d634ebd0c99fda4de94b87666fa4123 Mon Sep 17 00:00:00 2001
From: Kosaku Kimura <kimusaku@gmail.com>
Date: Wed, 9 Oct 2024 14:02:38 +0900
Subject: [PATCH 2/3] fix setup.sh (#2)

* WIP

Signed-off-by: Kosaku Kimura <kimura.kosaku@fujitsu.com>

* Fix issue in setup.sh for SapientML

* Update config file, as SapientML supports Python 3.9

Signed-off-by: HimanshuRRai <himanshu.rai@fujitsu.com>

---------

Signed-off-by: Kosaku Kimura <kimura.kosaku@fujitsu.com>
Signed-off-by: HimanshuRRai <himanshu.rai@fujitsu.com>
Co-authored-by: muhammed-nafi-k-a <muhammednafi.a@fujitsu.com>
Co-authored-by: HimanshuRRai <himanshu.rai@fujitsu.com>
---
 frameworks/SapientML/exec.py          |  2 --
 frameworks/SapientML/requirements.txt |  4 +---
 frameworks/SapientML/setup.sh         | 26 ++++++++++++++++++++++----
 resources/config.yaml                 |  2 +-
 4 files changed, 24 insertions(+), 10 deletions(-)
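
Review note on frameworks/SapientML/setup.sh: the version handling added in this
patch has three branches ("stable", a numeric release, or a git branch, with
"latest" mapped to "main"). The Python sketch below restates that dispatch so
the branches are easy to check. It is an illustration only: `plan_install` is a
hypothetical helper, it uses plain `pip` where the script uses the `PIP` wrapper
from shared/setup.sh, and it uses a relative `lib/<pkg>` target where the script
uses `${HERE}/lib/${PKG}`.

```python
import re


def plan_install(version="stable",
                 repo="https://github.com/sapientml/sapientml",
                 pkg="sapientml"):
    """Return the commands the version dispatch would run (hypothetical)."""
    if version == "latest":
        version = "main"  # "latest" is an alias for the main branch
    if version == "stable":
        # Newest release from the package index.
        return [["pip", "install", "--no-cache-dir", "-U", pkg]]
    if re.match(r"[0-9]", version):
        # A version starting with a digit is treated as a pinned release.
        return [["pip", "install", "--no-cache-dir", "-U", f"{pkg}=={version}"]]
    # Anything else is treated as a git branch: clone it, then install editable.
    target = f"lib/{pkg}"
    return [
        ["git", "clone", "--depth", "1", "--single-branch",
         "--branch", version, "--recurse-submodules", repo, target],
        ["pip", "install", "-U", "-e", target],
    ]


if __name__ == "__main__":
    for requested in ("stable", "latest", "my-branch"):
        print(requested, plan_install(requested))
```

Returning the commands as lists, instead of running them, keeps the sketch free
of side effects and makes each branch easy to assert on.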

diff --git a/frameworks/SapientML/exec.py b/frameworks/SapientML/exec.py
index 1c54c673a..f22fb62d2 100644
--- a/frameworks/SapientML/exec.py
+++ b/frameworks/SapientML/exec.py
@@ -2,8 +2,6 @@
 import os
 import tempfile as tmp
 
-from amlb.benchmark import TaskConfig
-from amlb.data import Dataset
 from frameworks.shared.callee import call_run, result
 from frameworks.shared.utils import Timer
 from sapientml import SapientML
diff --git a/frameworks/SapientML/requirements.txt b/frameworks/SapientML/requirements.txt
index 98b5e3045..663bd1f6a 100644
--- a/frameworks/SapientML/requirements.txt
+++ b/frameworks/SapientML/requirements.txt
@@ -1,3 +1 @@
-sapientml
-openml
-boto3==1.26.98
\ No newline at end of file
+requests
\ No newline at end of file
diff --git a/frameworks/SapientML/setup.sh b/frameworks/SapientML/setup.sh
index 4de55b470..47193fbf3 100755
--- a/frameworks/SapientML/setup.sh
+++ b/frameworks/SapientML/setup.sh
@@ -1,8 +1,26 @@
 #!/usr/bin/env bash
 HERE=$(dirname "$0")
+VERSION=${1:-"stable"}
+REPO=${2:-"https://github.com/sapientml/sapientml"}
+PKG=${3:-"sapientml"}
+if [[ "$VERSION" == "latest" ]]; then
+    VERSION="main"
+fi
 
-#create venv
-. ${HERE}/.setup/setup_env
+#create local venv
 . ${HERE}/../shared/setup.sh ${HERE} true
-PIP install --upgrade pip
-PIP install --no-cache-dir -r $HERE/requirements.txt
\ No newline at end of file
+
+PIP install -r ${HERE}/requirements.txt
+if [[ "$VERSION" == "stable" ]]; then
+    PIP install --no-cache-dir -U ${PKG}
+elif [[ "$VERSION" =~ ^[0-9] ]]; then
+    PIP install --no-cache-dir -U ${PKG}==${VERSION}
+else
+#    PIP install --no-cache-dir -e git+${REPO}@${VERSION}#egg=${PKG}
+    TARGET_DIR="${HERE}/lib/${PKG}"
+    rm -Rf ${TARGET_DIR}
+    git clone --depth 1 --single-branch --branch ${VERSION} --recurse-submodules ${REPO} ${TARGET_DIR}
+    PIP install -U -e ${TARGET_DIR}
+fi
+
+PY -c "import pkg_resources; print(pkg_resources.get_distribution('sapientml').version)" >> "${HERE}/.setup/installed"
\ No newline at end of file
diff --git a/resources/config.yaml b/resources/config.yaml
index 4d0c5a457..d9976b1f6 100644
--- a/resources/config.yaml
+++ b/resources/config.yaml
@@ -102,7 +102,7 @@ openml:                # configuration namespace for openML.
 
 versions:              # configuration namespace for versions enforcement (libraries versions are usually enforced in requirements.txt for the app and for each framework).
   pip:
-  python: 3.11          # the Python minor version that will be used by the application in containers and cloud instances, also used as a based version for virtual environments created for each framework.
+  python: 3.9          # the Python minor version that will be used by the application in containers and cloud instances, also used as a based version for virtual environments created for each framework.
 
 container: &container          # parent configuration namespace for container modes.
   force_branch: true           # set to true if image can only be built from a clean branch, with same tag as defined in `project_repository`.

From 0f3739b9913aac2b6d11d3e12438d58b5f46408b Mon Sep 17 00:00:00 2001
From: HimanshuRRai <himanshu.rai@fujitsu.com>
Date: Thu, 10 Oct 2024 15:45:08 +0000
Subject: [PATCH 3/3] Remove requirements.txt file from master branch

Signed-off-by: HimanshuRRai <himanshu.rai@fujitsu.com>
---
 frameworks/SapientML/requirements.txt | 1 -
 frameworks/SapientML/setup.sh         | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)
 delete mode 100644 frameworks/SapientML/requirements.txt

diff --git a/frameworks/SapientML/requirements.txt b/frameworks/SapientML/requirements.txt
deleted file mode 100644
index 663bd1f6a..000000000
--- a/frameworks/SapientML/requirements.txt
+++ /dev/null
@@ -1 +0,0 @@
-requests
\ No newline at end of file
diff --git a/frameworks/SapientML/setup.sh b/frameworks/SapientML/setup.sh
index 47193fbf3..2ce34ed8e 100755
--- a/frameworks/SapientML/setup.sh
+++ b/frameworks/SapientML/setup.sh
@@ -10,7 +10,7 @@ fi
 #create local venv
 . ${HERE}/../shared/setup.sh ${HERE} true
 
-PIP install -r ${HERE}/requirements.txt
+# PIP install -r ${HERE}/requirements.txt
 if [[ "$VERSION" == "stable" ]]; then
     PIP install --no-cache-dir -U ${PKG}
 elif [[ "$VERSION" =~ ^[0-9] ]]; then