xinai is a lightweight toolkit for AI research, designed for questions that require preprocessing large datasets and training out of core, i.e. on data and models too large to hold in memory at once. It offers an end-to-end pipeline for data handling, model training, and interpretability analysis, with a focus on scalability and reproducibility.
- Scalable data preprocessing using Apache Spark
- Distributed model training with Horovod
- Integration with MLflow for experiment tracking and reproducibility
- Interpretability analysis tools using Captum
- Out-of-core training support for datasets and models that do not fit in memory (see the sketch after this list)
- Comprehensive testing suite for ensuring reliability
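Out-of-core here means streaming batches from disk instead of loading the full dataset into memory. The sketch below illustrates that general pattern with a PyTorch IterableDataset reading a Parquet file in chunks; the file path and column names are hypothetical and not part of xinai's actual API.

```python
# Minimal out-of-core streaming sketch (illustrative only).
# Assumes a Parquet file with a list-valued "features" column and an
# integer "label" column; neither is guaranteed to exist in this repo.
import pyarrow.parquet as pq
import torch
from torch.utils.data import DataLoader, IterableDataset


class ParquetStream(IterableDataset):
    def __init__(self, path, chunk_size=1024):
        self.path = path
        self.chunk_size = chunk_size

    def __iter__(self):
        # Read the file chunk by chunk so memory use stays bounded.
        parquet_file = pq.ParquetFile(self.path)
        for batch in parquet_file.iter_batches(batch_size=self.chunk_size):
            columns = batch.to_pydict()
            features = torch.tensor(columns["features"], dtype=torch.float32)
            labels = torch.tensor(columns["label"], dtype=torch.long)
            yield from zip(features, labels)


# The DataLoader re-batches the streamed rows into training batches.
loader = DataLoader(ParquetStream("data/raw/train.parquet"), batch_size=64)
```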
This toolkit is particularly suited for researchers and data scientists working on questions such as:
- "How does model drift impact attention and response variance in transformer architectures?"
- "What are the effects of different preprocessing techniques on model interpretability?"
- "How do attention patterns in large language models evolve during fine-tuning on domain-specific tasks?"
By providing a robust framework for handling large-scale data and models, this toolkit aims to accelerate research in AI interpretability and promote more transparent and understandable AI systems.
Project structure:

xinai/
├── MLproject
├── conda.yaml
├── CONTRIBUTING.md
├── LICENSE.md
├── README.md
├── data/
│   └── raw/
│       └── sample_data.csv
├── notebooks/
│   └── exploratory_analysis.ipynb
├── src/
│   ├── __init__.py
│   ├── data_preprocessing.py
│   ├── model_training.py
│   ├── model_evaluation.py
│   ├── interpretability_analysis.py
│   ├── setup_spark.py
│   ├── setup_horovod.py
│   └── utils.py
├── config/
│   └── model_config.yaml
├── tests/
│   ├── __init__.py
│   ├── test_data_preprocessing.py
│   ├── test_model_training.py
│   └── test_model_evaluation.py
├── models/
│   └── .gitkeep
├── scripts/
│   ├── start-spark-master.sh
│   ├── start-spark-worker.sh
│   └── stop-spark-cluster.sh
└── .gitignore
- Clone this repository:
  git clone https://github.com/stoille/xinai.git
  cd xinai
- Create and activate the conda environment:
  conda env create -f conda.yaml
  conda activate xinai_env
- Install the project in editable mode:
  pip install -e .
- Install Spark on all nodes. Download it from the Apache Spark website.
- Start the Spark master:
  ./sbin/start-master.sh
- Start Spark workers and connect them to the master (on Spark 3.1 and later the script is named start-worker.sh):
  ./sbin/start-slave.sh spark://master:7077
- Set the following environment variables:
  export SPARK_MASTER_HOST=<master-ip>
  export SPARK_MASTER_PORT=7077
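Once the master and workers are running, preprocessing jobs reach the cluster through a SparkSession. The snippet below is a minimal connection sketch, not the project's actual src/data_preprocessing.py; the application name and memory setting are placeholders.

```python
# Sketch: connect to the standalone cluster started above.
# Replace <master-ip> with the value used for SPARK_MASTER_HOST.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://<master-ip>:7077")
    .appName("xinai-preprocessing")              # placeholder app name
    .config("spark.executor.memory", "4g")       # tune for your nodes
    .getOrCreate()
)

df = spark.read.csv("data/raw/sample_data.csv", header=True, inferSchema=True)
print(df.count())
spark.stop()
```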
- Ensure OpenMPI or another MPI implementation is installed on all nodes.
- Install Horovod with GPU support (NCCL is the backend Horovod documents for GPU operations):
  HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[pytorch]
- Test the Horovod installation:
  horovodrun -np 4 python -c "import horovod.torch as hvd; hvd.init()"
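Once horovodrun works, distributed training follows Horovod's standard PyTorch pattern. The sketch below shows that pattern on a toy model and synthetic data; it is not xinai's actual training code (see src/model_training.py for that).

```python
# Sketch of Horovod's PyTorch training pattern; launch with e.g.
#   horovodrun -np 4 python train_sketch.py
import torch
import horovod.torch as hvd

hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())      # one GPU per process

# Toy model and synthetic data stand in for the real ones.
model = torch.nn.Linear(10, 2)
dataset = torch.utils.data.TensorDataset(
    torch.randn(256, 10), torch.randint(0, 2, (256,))
)
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank()
)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

# Scale the learning rate by the number of workers, average gradients
# across workers, and start every worker from identical state.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```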
- Install MLflow:
  pip install mlflow
- Start the MLflow server:
  mlflow server --host 0.0.0.0 --port 5000
- Set the MLflow tracking URI:
  export MLFLOW_TRACKING_URI=http://<mlflow-server-ip>:5000
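With the server running and MLFLOW_TRACKING_URI exported, any Python process can log runs to it. The snippet below is a minimal logging sketch; the experiment name, parameters, and metric values are illustrative.

```python
# Sketch: log a run to the MLflow server configured above.
# The tracking URI is picked up from MLFLOW_TRACKING_URI automatically.
import mlflow

mlflow.set_experiment("xinai-demo")                  # illustrative name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 32)
    for epoch, loss in enumerate([0.9, 0.6, 0.4]):   # placeholder losses
        mlflow.log_metric("train_loss", loss, step=epoch)
    mlflow.log_artifact("config/model_config.yaml")  # record the config used
```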
To preprocess the data:
mlflow run . -e preprocess
To train the model:
mlflow run . -e train
To evaluate the model:
mlflow run . -e evaluate
To run the interpretability analysis:
mlflow run . -e analyze
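The analysis entry point builds on Captum (see src/interpretability_analysis.py). The snippet below sketches the kind of attribution involved, using Integrated Gradients on a toy model; the model and inputs are placeholders rather than the project's actual pipeline.

```python
# Sketch: Integrated Gradients attribution with Captum on a toy model.
import torch
from captum.attr import IntegratedGradients

model = torch.nn.Sequential(
    torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2)
)
model.eval()

inputs = torch.randn(4, 10, requires_grad=True)      # placeholder batch
ig = IntegratedGradients(model)
attributions, delta = ig.attribute(
    inputs, target=1, return_convergence_delta=True
)
print(attributions.shape)   # per-feature attributions, same shape as inputs
```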
To run the tests:
python -m unittest discover tests
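New tests follow the standard unittest layout used in tests/. The example below is hypothetical; the normalize helper is illustrative and not necessarily part of src/utils.py.

```python
# Hypothetical test module in the style of tests/; the normalize helper
# is a stand-in for whatever utility is under test.
import unittest


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


class TestNormalize(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([1.0, 3.0])), 1.0)


if __name__ == "__main__":
    unittest.main()
```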
You can find an exploratory Jupyter notebook in the notebooks/ directory. To run it:
- Start Jupyter:
  jupyter notebook
- Navigate to notebooks/exploratory_analysis.ipynb and open it.
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
This project is licensed under the BSD 3-Clause License - see the LICENSE.md file for details.