Commit: Merge pull request #6 from owczr/develop
Updates and small improvements
Showing 47 changed files with 2,251 additions and 238 deletions.
@@ -6,4 +6,5 @@ notebooks/
docs/
.pytest_cache/
.github/
logs/
*.log
@@ -4,4 +4,5 @@ test/
 LIDC-IDRI/
 .vscode/
 __pycache__/
-.env
+.env
+logs/
@@ -1,19 +1,50 @@
 # Lung Cancer Detection

-## Table of Contents
-- [About](#about)
-- [Usage](#usage)
-- [License](#license)
+## Table Of Contents
+1. [About](#about)
+2. [Project Structure](#project-structure)
+3. [Usage](#usage)
+4. [License](#license)

 ## About
-Lung Cancer Detection is a project made as part of Engineers Thesis *"Applications of artificial intellingence in oncology on computer tomography dataset"* by **Jakub Owczarek**, under the guidance of Thesis Advisor dr. hab. inz **Mariusz Mlynarczuk** prof. AGH.
+Lung Cancer Detection is a project made as part of the Engineer's Thesis *"Applications of artificial intelligence in oncology on computer tomography dataset"* by **Jakub Owczarek**, under the guidance of Thesis Advisor dr. hab. inz **Mariusz Mlynarczuk** prof. AGH.
 <br>

-The goal of this projet is to process the [LIDC-IDRI](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=1966254) dataset and measure the performence of deep learning models pre-trained on Image Net by using transfer learning methods.
+The goal of this project is to process the [LIDC-IDRI](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=1966254) dataset and evaluate the performance of deep learning models pre-trained on ImageNet by leveraging transfer learning.
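As an illustration of what transfer learning from an ImageNet-pretrained backbone looks like in TensorFlow (a minimal sketch only; the input shape and binary classification head are assumptions, not the repository's builder classes):

```python
# Minimal transfer-learning sketch: ImageNet-pretrained backbone, new binary head.
# The input shape and head are assumptions for illustration, not the project's builders.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3)
)
base.trainable = False  # freeze the pretrained weights; only the new head is trained

inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # e.g. nodule vs. no nodule
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```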

-## Usage
+## Project Structure
+This repository contains the following directories:

-TODO: Fill in how to use this project locally and on Azure ML
+- *docs* - contains markdown files with more specific descriptions of the project components
+- *notebooks* - contains Jupyter Notebooks that were used for experiments, analysis, visualizations, etc.
+- *scripts* - this directory is the actual workhorse and contains two notable subdirectories:
+
+  - *azure* - contains scripts for the Azure Virtual Machine and Azure Machine Learning
+  - *local* - contains scripts that were used for local development
+
+- *src* - contains the main components of the project:
+
+  - *azure* - contains utilities specific to Azure services
+  - *dataset* - contains the `DatasetLoader` component used to feed data during model training
+  - *model* - contains the model builder and director classes
+  - *preprocessing* - contains classes used for LIDC-IDRI dataset preprocessing
+  - *config.py* - some constants used throughout the project
+
+- *tests* - contains a (small) set of tests for the project components
+
+## Usage
+This project was created with Azure in mind, so the main scripts are meant to be run on Azure.
+
+![usage_img](docs/assets/usage.png)
+
+### 1. Preprocessing
+1. The first step is to download the LIDC-IDRI dataset onto an Azure Virtual Machine. The `azure/virtual_machine/download_dataset.sh` script is meant for this task.
+2. Then the dataset is preprocessed into a format suitable for supervised deep learning model training with the `azure/virtual_machine/process_dataset.py` script. In the same directory is `train_test_split.py`, which should be used to split the processed data.
+3. Finally, the preprocessed dataset can be uploaded to Azure Blob Storage with the `upload_dataset_2.sh` script. There is also an `upload_dataset.sh` script, but it doesn't use the `azcopy` utility and is much slower.
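The interface of `train_test_split.py` is not shown in this commit; purely to illustrate the idea of the split step, a case-level 80/20 split with a fixed seed could look like this (the paths and directory layout are assumptions):

```python
# Hypothetical illustration of splitting preprocessed cases into train/test sets;
# the repository's train_test_split.py may work differently. Paths are placeholders.
import random
import shutil
from pathlib import Path

random.seed(42)  # fixed seed so the split is reproducible

processed = Path("processed")  # assumed layout: one subdirectory per LIDC-IDRI case
cases = sorted(p for p in processed.iterdir() if p.is_dir())
random.shuffle(cases)

split = int(0.8 * len(cases))  # 80/20 split
for subset, subset_cases in (("train", cases[:split]), ("test", cases[split:])):
    for case in subset_cases:
        shutil.copytree(case, Path(subset) / case.name)
```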

+### 2. Model training
+1. With the preprocessed dataset on Azure Blob Storage, the Virtual Machine is no longer necessary. From the dataset an Azure Machine Learning data asset can be created and used during model training.
+2. The actual model training is run with the `run_training_job.py` script under `scripts/azure/machine_learing`. It creates a job on AML that builds, compiles and trains the desired model.
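The `run_training_job.py` script itself is not part of this diff; as a hedged sketch only, submitting the training entry point (assumed here to be `train.py`) as an Azure ML v2 command job typically looks like this. The environment, compute target and data-asset names below are placeholders, not values from the repository; the CLI flags match the training script shown later in this commit.

```python
# Hypothetical sketch of submitting the training script as an Azure ML command job.
# The entry-point file name, environment, compute target and data-asset paths are
# placeholders, not values taken from the repository.
from azure.ai.ml import MLClient, Input, command
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

job = command(
    code=".",  # project root containing src/ and the training entry point
    command=(
        "python train.py --model mobilenet "
        "--train ${{inputs.train}} --test ${{inputs.test}} "
        "--optimizer adam --loss binary_crossentropy "
        "--epochs 10 --batch_size 64 --job_name demo-run"
    ),
    inputs={
        "train": Input(type=AssetTypes.URI_FOLDER, path="azureml:lidc-train:1"),
        "test": Input(type=AssetTypes.URI_FOLDER, path="azureml:lidc-test:1"),
    },
    environment="tensorflow-gpu-env:1",
    compute="gpu-cluster",
    experiment_name="lung-cancer-detection",
)
ml_client.jobs.create_or_update(job)
```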

 ## License
 This project is licensed under the MIT License - see the LICENSE.md file for details
7 files renamed without changes.
Large diffs are not rendered by default.
@@ -0,0 +1,150 @@
import os
import logging
from datetime import datetime

import click
import mlflow
import numpy as np
import tensorflow as tf
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

from src.model.director import ModelDirector
from src.dataset.dataset_loader import DatasetLoader
from src.config import (
    RANDOM_SEED,
    EARLY_STOPPING_CONFIG,
    REDUCE_LR_CONFIG,
    MODELS,
    BUILDERS,
    CALLBACKS,
    METRICS,
    config_logging
)

config_logging()
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("azure")


def get_compiled_model(model, optimizer, loss):
    builder = BUILDERS[model]()

    director = ModelDirector(builder)
    model_nn = director.make()
    logger.info(f"Built model_nn with {str(builder)}")

    # Map the CLI choices to the corresponding Keras classes and instantiate them.
    optimizer_cls = {
        "adam": tf.keras.optimizers.Adam,
        "sgd": tf.keras.optimizers.SGD,
    }[optimizer]()

    loss_cls = {
        "binary_crossentropy": tf.keras.losses.BinaryCrossentropy,
        "categorical_crossentropy": tf.keras.losses.CategoricalCrossentropy,
    }[loss]()

    metrics = [metric() for metric in METRICS]

    model_nn.compile(optimizer=optimizer_cls, loss=loss_cls, metrics=metrics, run_eagerly=False)
    logger.info("Compiled model")

    return model_nn


def get_compiled_distributed_model(model, optimizer, loss):
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    # Build and compile inside the strategy scope so variables are mirrored across workers.
    with strategy.scope():
        model_nn = get_compiled_model(model, optimizer, loss)

    return model_nn

@click.command()
@click.option(
    "--model", type=click.Choice(MODELS), default="mobilenet", help="Model to train"
)
@click.option(
    "--train", type=click.Path(exists=True), help="Path to the training dataset"
)
@click.option("--test", type=click.Path(exists=True), help="Path to the test dataset")
@click.option(
    "--optimizer",
    type=click.Choice(["adam", "sgd"]),
    default="adam",
    help="Optimizer to use",
)
@click.option(
    "--loss",
    type=click.Choice(["binary_crossentropy", "categorical_crossentropy"]),
    default="binary_crossentropy",
    help="Loss function to use",
)
@click.option("--epochs", type=click.INT, default=10, help="Number of epochs to train for")
@click.option("--batch_size", type=click.INT, default=64, help="Batch size for dataset loaders")
@click.option("--job_name", type=click.STRING, help="Azure Machine Learning job name")
@click.option("--distributed", is_flag=True, help="Use distributed strategy")
def run(model, train, test, optimizer, loss, epochs, batch_size, job_name, distributed):
    mlflow.set_experiment("lung-cancer-detection")
    mlflow_run = mlflow.start_run(run_name=f"train_{model}_{datetime.now().strftime('%Y%m%d%H%M%S')}")

    mlflow.log_param("optimizer", optimizer)
    mlflow.log_param("loss", loss)
    mlflow.log_param("epochs", epochs)
    mlflow.log_param("batch_size", batch_size)
    mlflow.log_param("random_seed", RANDOM_SEED)

    logger.info(f"Started training run at {datetime.now()}")
    logger.info(
        f"Run parameters - optimizer: {optimizer}, loss: {loss}"
    )

    if not distributed:
        model_nn = get_compiled_model(model, optimizer, loss)
    else:
        model_nn = get_compiled_distributed_model(model, optimizer, loss)

    train_loader = DatasetLoader(train)
    test_loader = DatasetLoader(test)

    train_loader.set_seed(RANDOM_SEED)
    test_loader.set_seed(RANDOM_SEED)

    train_dataset = train_loader.get_dataset()
    test_dataset = test_loader.get_dataset()
    logger.info("Loaded train and test datasets")

    history = model_nn.fit(train_dataset, epochs=epochs, callbacks=CALLBACKS)
    logger.info("Trained model")

    # Log per-epoch training metrics to MLflow.
    for metric, values in history.history.items():
        for step, value in enumerate(values):
            mlflow.log_metric(f"{metric}", value, step=step)

    results = model_nn.evaluate(test_dataset, return_dict=True)
    logger.info("Evaluated model")

    for metric, value in results.items():
        mlflow.log_metric(f"Final {metric}", value)

    logger.info(f"Finished training at {datetime.now()}")

    # Save the trained model locally under the job name, then log and register it with MLflow.
    try:
        mlflow.tensorflow.save_model(
            model=model_nn,
            path=os.path.join(job_name, model),
        )
    except TypeError as e:
        logger.error(f"Saving model raised an error:\n{e}")

    mlflow.tensorflow.log_model(
        model=model_nn,
        registered_model_name=model,
        artifact_path=model,
    )

    mlflow.end_run()


if __name__ == "__main__":
    run()  # pylint: disable=no-value-for-parameter
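Since the script above logs and registers the trained network with MLflow, a registered version could later be reloaded for inference along these lines (the model name follows the `--model` option, while the version number and input shape are placeholders):

```python
# Hypothetical sketch: reload a model version registered by the training script above.
# The registered model name matches the --model option (e.g. "mobilenet");
# the version number and the dummy input shape are placeholders.
import mlflow
import numpy as np

model = mlflow.tensorflow.load_model("models:/mobilenet/1")

# Single dummy batch just to show the call shape; the real input shape depends
# on the preprocessing pipeline.
dummy = np.zeros((1, 224, 224, 3), dtype="float32")
prediction = model.predict(dummy)
print(prediction)
```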