This is the 1st place solution to the PANDA Competition; the detailed writeup is here.
The code and models were created by Team PND (@yukkyo and @kentaroy47).
Our models and code are open-sourced under CC-BY-NC 4.0. Please see LICENSE for specifics.
You can skip some steps, because some outputs are already included in the input directory.
Related publications:
- Nature Medicine: W. Bulten et al., "Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge"
- npj Precision Oncology: Y. Tolkach et al., "An international multi-institutional validation study of the algorithm for prostate cancer detection and Gleason grading"
Solution slides: https://docs.google.com/presentation/d/1Ies4vnyVtW5U3XNDr_fom43ZJDIodu1SV6DSK8di6fs/
You can choose whether or not to use Docker.
- Ubuntu 18.04
- Python 3.7.2
- CUDA 10.2
- NVIDIA/apex == 1.0 installed
# main dependency
$ pip install -r docker/requirements.txt
# arutema code dependency
$ pip install git+https://github.com/ildoonet/pytorch-gradual-warmup-lr.git
$ pip install efficientnet_pytorch
# build
$ sh docker/build.sh
# run
$ sh docker/run.sh
# exec
$ sh docker/exec.sh
Download only train_images and train_masks.
$ cd input
$ kaggle download ...
$ unzip ...
- If you want to generate it yourself: https://www.kaggle.com/yukkyo/imagehash-to-detect-duplicate-images-and-grouping
- We simply place the output of that script at:
input/duplicate_imgids_imghash_thres_090.csv
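For reference, here is a minimal sketch of the perceptual-hashing idea behind that notebook. The hash size and the 0.90 grouping threshold (implied by the file name) are assumptions; the linked notebook is authoritative.

```python
# Sketch only: group slides whose perceptual hashes are >= 90% similar.
from PIL import Image
import imagehash

def phash_of(path, hash_size=16):
    # hash_size is an assumption; the notebook may use a different value
    with Image.open(path) as img:
        return imagehash.phash(img, hash_size=hash_size)

def similarity(h1, h2):
    # h1 - h2 is the Hamming distance between the two bit arrays
    return 1.0 - (h1 - h2) / h1.hash.size

# two image ids are treated as duplicates when similarity(...) >= 0.90
```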
$ cd src
$ python data_process/s00_make_k_fold.py
- The split is deterministic with a fixed seed (a sketch of the idea follows the output note below)
- output:
input/train-5kfold.csv
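The gist is a plain stratified 5-fold split with a fixed seed, roughly like the sketch below. The column names and the seed value are assumptions; s00_make_k_fold.py is the source of truth.

```python
# Sketch only: deterministic 5-fold split stratified by ISUP grade.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

train = pd.read_csv("../input/train.csv")
train["kfold"] = -1
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # seed value is an assumption
for fold, (_, valid_idx) in enumerate(skf.split(train, train["isup_grade"]), 1):
    train.loc[valid_idx, "kfold"] = fold  # folds numbered 1..5, matching --kfold
train.to_csv("../input/train-5kfold.csv", index=False)
```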
$ cd src
$ python data_process/s07_simple_tile.py --mode 0
$ python data_process/s07_simple_tile.py --mode 2
$ python data_process/a00_save_tiles.py
$ cd ../input
$ cd numtile-64-tilesize-192-res-1-mode-0
$ unzip train.zip -d train
$ cd ..
$ cd numtile-64-tilesize-192-res-1-mode-2
$ unzip train.zip -d train
$ cd ..
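If you are curious what the tiling scripts do conceptually, here is a minimal sketch of the common PANDA tiling approach: pad the slide, cut it into fixed-size patches, and keep the patches with the most tissue. The numbers match the directory names above (64 tiles of 192 px); the white padding and the darkness-based selection rule are assumptions about s07_simple_tile.py.

```python
# Sketch only: cut a slide into tiles and keep the most tissue-dense ones.
import numpy as np

def make_tiles(img, tile_size=192, num_tiles=64):
    h, w, _ = img.shape
    pad_h = (tile_size - h % tile_size) % tile_size
    pad_w = (tile_size - w % tile_size) % tile_size
    # pad with white (255) so the image divides evenly into tiles
    img = np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)), constant_values=255)
    tiles = (img.reshape(img.shape[0] // tile_size, tile_size,
                         img.shape[1] // tile_size, tile_size, 3)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, tile_size, tile_size, 3))
    # lower pixel sums mean darker tiles, i.e. more tissue and less background
    # (the real script would also handle slides with fewer than num_tiles tiles)
    order = np.argsort(tiles.reshape(tiles.shape[0], -1).sum(axis=-1))
    return tiles[order[:num_tiles]]
```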
Each fold needs about 18 hours.
$ cd src
$ python train.py --config configs/final_1.yaml --kfold 1
$ python train.py --config configs/final_1.yaml --kfold 2
$ python train.py --config configs/final_1.yaml --kfold 3
$ python train.py --config configs/final_1.yaml --kfold 4
$ python train.py --config configs/final_1.yaml --kfold 5
- output:
output/model/final_1
- Weights and training logs for each fold
Each fold needs about 1 hour.
$ cd src
$ python kernel.py --kfold 1
$ python kernel.py --kfold 2
$ python kernel.py --kfold 3
$ python kernel.py --kfold 4
$ python kernel.py --kfold 5
- The outputs are prediction results on the held-out training data:
output/model/final_1/local_preds~~~.csv
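Conceptually, each run loads the fold's trained weights and predicts on the samples that fold never trained on, writing the predictions next to the true labels, roughly as below. The column and file names are assumptions; kernel.py is authoritative.

```python
# Sketch only: out-of-fold predictions for one fold.
import pandas as pd
import torch

@torch.no_grad()
def save_local_preds(model, valid_loader, valid_df, fold):
    model.eval()
    preds = []
    for images, _ in valid_loader:
        # assumes a regression head that outputs one value per image
        preds.append(model(images.cuda()).squeeze(1).cpu())
    out = valid_df[["image_id", "isup_grade"]].copy()
    out["pred"] = torch.cat(preds).numpy()
    out.to_csv(f"../output/model/final_1/local_preds_fold{fold}.csv", index=False)
```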
$ cd src
$ python data_process/s12_remove_noise_by_local_preds.py
- output:
output/model/final_1
- local_preds_final_1_efficientnet-b1.csv
  - Concatenated prediction results on the held-out data
  - Used to clean the labels (see the sketch below)
- local_preds_final_1_efficientnet-b1_removed_noise_thresh_16.csv
  - Used to train Model 1
  - Base label-cleaning result
- local_preds_final_1_efficientnet-b1_removed_noise_thresh_rad_13_08_ka_15_10.csv
  - Used to train Model 2
  - Labels cleaned to remove about 20% of the Radboud labels
- FYI: we used this CSV for our final submission in the competition (the seed was not fixed at the time):
input/train-5kfold_remove_noisy_by_0622_rad_13_08_ka_15_10.csv
- You can replace output/train-5kfold_remove_noisy.csv with input/train-5kfold_remove_noisy_by_0622_rad_13_08_ka_15_10.csv in the config.
- Only folds 1, 4, and 5 are used for the final inference.
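The cleaning rule is essentially: drop a training row when its hold-out prediction disagrees with its label by more than a provider-specific threshold. A sketch of that idea follows; the thresholds below are read off the rad_13_08 / ka_15_10 file names and are assumptions, with s12_remove_noise_by_local_preds.py being authoritative.

```python
# Sketch only: remove rows whose OOF prediction is far from the label.
import pandas as pd

def remove_noisy(df, rad_thresh=1.3, ka_thresh=1.5):
    # threshold values are assumptions inferred from the output file names
    err = (df["pred"] - df["isup_grade"]).abs()
    keep = ((df["data_provider"] == "radboud") & (err <= rad_thresh)) | \
           ((df["data_provider"] == "karolinska") & (err <= ka_thresh))
    return df[keep].copy()
```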
Each fold needs about 15 hours.
$ cd src
# only best LB folds are trained
$ python train.py --config configs/final_2.yaml --kfold 1
$ python train.py --config configs/final_2.yaml --kfold 4
$ python train.py --config configs/final_2.yaml --kfold 5
Please run train_famdata-kfolds.ipynb in Jupyter Notebook, or:
# go to home
$ python train_famdata-kfolds.py
We haven't tested the .py script, so please use the .ipynb version.
The final models are saved to models.
Each fold takes about 4 hours.
The models reproducing the 1st place score are saved in ./final_models.
kernels:
- final sub on competition:
- score: public 0.904, private 0.940 (1st)
- url: https://www.kaggle.com/yukkyo/latesub-pote-fam-aru-ensemble-0722-ew-1-0-0?scriptVersionId=39271011
- reproduced results (seed fixed as in these scripts, so you can reproduce them)
- score: public 0.894, private 0.939 (1st)
- url: https://www.kaggle.com/kyoshioka47/late-famrepro-fam-reproaru-ensemble-0725?scriptVersionId=39879219
submitted_notebook.ipynb
- final sub on competition: a simple 5-fold model that gets private 0.935 (3rd)
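The ensemble kernels above combine the raw regression outputs of the models before rounding to an ISUP grade. A minimal sketch of that idea is below; the exact weighting scheme (including the ew-1-0-0 hinted at in the kernel name) is an assumption, and the linked kernels are authoritative.

```python
# Sketch only: weighted-average ensemble of raw regression outputs.
import numpy as np

def ensemble(preds_per_model, weights=None):
    # preds_per_model: list of 1-D arrays, one prediction array per model
    avg = np.average(np.stack(preds_per_model), axis=0, weights=weights)
    return np.clip(np.rint(avg), 0, 5).astype(int)  # final ISUP grades 0..5
```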
You can change the paths by editing the code below.
- You must change the Kaggle Dataset path to use your own reproduced weights.
### Model 2
# Line [7]
class Config:
    def __init__(self, on_kernel=True, kfold=1, debug=False):
        ...
        # You can change the weight name, but that is not needed for this README
        self.weight_name = "final_2_efficientnet-b1_kfold_{}_latest.pt"
        self.weight_name = self.weight_name.format(kfold)
        ...

    def get_weight_path(self):
        if self.on_kernel:
            # You should change this path to your Kaggle Dataset path
            return os.path.join("../input/030-weight", self.weight_name)
        else:
            dir_name = self.weight_name.split("_")[0]
            return os.path.join("../output/model", dir_name, self.weight_name)
### Model 1
# Line [13]
def load_models(model_files):
    models = []
    for model_f in model_files:
        # You should change this path to your Kaggle Dataset path
        model_f = os.path.join("../input/latesubspanda", model_f)
        ...
model_files = [
    'efficientnet-b0famlabelsmodelsub_avgpool_tile36_imsize256_mixup_final_epoch20_fold0.pth',
]
model_files2 = [
    'efficientnet-b0famlabelsmodelsub_avgpool_tile36_imsize256_mixup_final_epoch20_fold0.pth',
    'efficientnet-b0famlabelsmodelsub_avgpool_tile36_imsize256_mixup_final_epoch20_fold1.pth',
    'efficientnet-b0famlabelsmodelsub_avgpool_tile36_imsize256_mixup_final_epoch20_fold2.pth',
    'efficientnet-b0famlabelsmodelsub_avgpool_tile36_imsize256_mixup_final_epoch20_fold3.pth',
    'efficientnet-b0famlabelsmodelsub_avgpool_tile36_imsize256_mixup_final_epoch20_fold4.pth',
]
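Hypothetical usage of the lists above: model_files loads the single fold-0 checkpoint, while model_files2 loads all five fold checkpoints for the ensemble:

```python
models = load_models(model_files)    # single fold-0 model
models2 = load_models(model_files2)  # all five folds
```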