An image captioning model built on a pretrained ViT-based model (mPLUG) that produces human-readable captions for deciphering daily progress and work activities from construction photologs.
This demo generates daily construction reports from multiple images, which can be CCTV frames from a fixed camera viewpoint or ad-hoc photos taken by an engineer.
This paper presents VisualSiteDiary, a Vision Transformer-based image captioning model that creates human-readable captions for daily progress and work-activity logs and enhances image retrieval tasks. As a model for deciphering construction photologs, VisualSiteDiary incorporates pseudo-region features, utilizes high-level knowledge in pretraining, and fine-tunes for diverse captioning styles. To validate VisualSiteDiary, a new image captioning dataset, VSD, is presented. This dataset includes many realistic yet challenging cases commonly observed in commercial building projects. Experimental results using five different metrics demonstrate that VisualSiteDiary provides superior-quality captions compared to state-of-the-art image captioning models.
- Pre-trained models
- We provide trained VisualSiteDiary checkpoints for convenience.
| Model | Visual Backbone | Image Enc Layers | Text Dec Layers | Download |
|---|---|---|---|---|
| visualsitediary.total | vit-b-16 | 12 | 12 | visualsitediary.total |
| visualsitediary.compact | vit-b-16 | 12 | 12 | visualsitediary.compact |
| visualsitediary.detailed | vit-b-16 | 12 | 12 | visualsitediary.detailed |
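As a quick sanity check after downloading, a checkpoint can be opened with plain PyTorch. This is only an illustrative sketch: the filename `visualsitediary.total.pth` and the assumption that the file is a dict wrapping the weights under a `model` key follow common mPLUG conventions and may differ from the actual checkpoints.

```python
import torch

# Hypothetical filename; substitute whichever checkpoint you downloaded above.
ckpt_path = "visualsitediary.total.pth"

# Load on CPU so no GPU is needed just to inspect the file.
ckpt = torch.load(ckpt_path, map_location="cpu")

# mPLUG-style checkpoints often wrap the weights under a "model" key (assumption).
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

print(f"{len(state_dict)} entries in checkpoint")
for name, value in list(state_dict.items())[:5]:
    print(name, getattr(value, "shape", type(value)))
```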
- VSD Datasets (the images must be requested from the authors of each original paper)
|  | ACID | ACTV | SAFE | SODA |
|---|---|---|---|---|
| image | 4,000 | 964 | 1,762 | 1,089 |
| text | 8,000 | 1,928 | 3,524 | 2,178 |
- We share our captions for each image dataset in the `construction_dataset` folder (a loading sketch follows).
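The exact file names and record schema inside `construction_dataset` may vary; the sketch below assumes a COCO-style JSON list of image/caption pairs and only illustrates how the shared captions can be paired with the requested images (the file name `vsd_train.json` and the keys `image`/`caption` are assumptions).

```python
import json
from pathlib import Path

# Hypothetical annotation file; check the folder for the actual names.
ann_path = Path("construction_dataset") / "vsd_train.json"
records = json.loads(ann_path.read_text())

# Assumed schema: a list of {"image": <file name>, "caption": <text>} records.
for rec in records[:3]:
    print(rec["image"], "->", rec["caption"])
```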
Caption-quality comparison against state-of-the-art captioning models (higher is better):

| Model | B@4 | METEOR | ROUGE | CIDEr | SPICE | Avg. |
|---|---|---|---|---|---|---|
| VisualSiteDiary | 58.0 | 39.9 | 70.6 | 333.7 | 53.0 | 111.0 |
| mPLUG | 57.2 | 39.6 | 70.4 | 331.5 | 52.8 | 110.3 |
| GiT | 44.8 | 32.8 | 60.1 | 237.2 | 43.4 | 83.2 |
| CLIP | 44.7 | 32.9 | 60.3 | 235.9 | 42.9 | 83.3 |
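The scores above come from the same `language_evaluation` package that the training scripts download (the `coco` tools). Assuming that package's `CocoEvaluator` interface, scoring your own predictions against reference captions could look like the sketch below (the example captions are made up):

```python
import language_evaluation

# Made-up example captions, just to show the input format:
# one predicted caption per image and a list of reference captions per image.
predictions = ["a worker is installing rebar on the second floor slab"]
references = [["a worker installs rebar on the second floor slab"]]

evaluator = language_evaluation.CocoEvaluator()
scores = evaluator.run_evaluation(predictions, references)
print(scores)  # BLEU-4, METEOR, ROUGE_L, CIDEr and SPICE style scores
```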
- PyTorch version >= 1.11.0 (a quick version-check sketch follows this list)
- Install the other libraries via `pip install -r requirements.txt`
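As a convenience (not part of the repository), a quick check that the installed PyTorch meets the >= 1.11.0 requirement:

```python
import torch
from packaging import version

# Strip any local build suffix such as "+cu113" before comparing versions.
installed = version.parse(torch.__version__.split("+")[0])
assert installed >= version.parse("1.11.0"), (
    f"Found torch {torch.__version__}; please install >= 1.11.0"
)
print("PyTorch", torch.__version__, "OK; CUDA available:", torch.cuda.is_available())
```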
If you want to train your own model, you can follow the instructions below.
- Download the construction image datasets from their original papers.
- Modify `configs/VSD_all.yaml` so that the directories of the images and JSON files are correct (a path-check sketch follows this list).
- Download `ViT-B-16.tar` and place it in the repository root (`./`).
- Download `mPLUG_base.pth` and place it in the repository root (`./`).
- Download the language evaluation tool (`language_evaluation`).
- Follow Steps 1-4 below.
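Before launching Step 1, it can help to confirm that every path referenced in `configs/VSD_all.yaml` exists on disk. This is a minimal sketch (not part of the repository) that makes no assumption about the YAML key names and simply scans string values that look like paths:

```python
import os
import yaml  # PyYAML; install with `pip install pyyaml` if needed

with open("configs/VSD_all.yaml") as f:
    cfg = yaml.safe_load(f)

def check_paths(node, prefix=""):
    """Recursively print whether path-like string values exist on disk."""
    if isinstance(node, dict):
        for key, value in node.items():
            check_paths(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for idx, value in enumerate(node):
            check_paths(value, f"{prefix}{idx}.")
    elif isinstance(node, str) and ("/" in node or node.endswith((".json", ".yaml", ".pth"))):
        status = "ok" if os.path.exists(node) else "MISSING"
        print(f"{status:7s} {prefix.rstrip('.')} = {node}")

check_paths(cfg)
```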
- Step 1. We begin with the HP process. Run the following script to fine-tune an mPLUG model:
```
python -c "import language_evaluation; language_evaluation.download('coco')"
python run_VSD.py \
    --config ./configs/VSD_all.yaml \
    --checkpoint ./mPLUG_base.pth \
    --do_two_optim \
    --lr 1e-5 \
    --min_length 8 \
    --max_length 25 \
    --max_input_length 25 \
    --eval_start_epoch 0 \
    --save_for_HP True \
    --use_PR False
```
- The outputs will be saved as `./output/result/train_loader_epoch_{i}.json` (a quick inspection sketch follows).
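To sanity-check the Step 1 outputs before creating HP labels, you can print a few records from one of the saved files. The epoch index and the record keys are assumptions here; inspect the file to see the actual schema:

```python
import json

epoch = 0  # pick whichever epoch you want to feed into Step 2
with open(f"./output/result/train_loader_epoch_{epoch}.json") as f:
    predictions = json.load(f)

print(len(predictions), "records")
for record in predictions[:3]:
    print(record)
```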
- Step 2. Create the HP labels from the Step 1 model outputs by running the following script:
```
python -c "import language_evaluation; language_evaluation.download('coco')"
python create_HP_labels.py \
    --source_prediction_train './output/result/train_loader_epoch_{i}.json' \
    --source_prediction_val './output/result/val_loader_epoch_{i}.json' \
    --save_dir './construction_dataset/'
```
- The HP labels will be saved at `./construction_dataset/`.
- Step 3. Run HP with the following script:
```
python -c "import language_evaluation; language_evaluation.download('coco')"
python run_VSD_HP.py \
    --config ./configs/VSD_HP.yaml \
    --checkpoint ./mPLUG_base.pth \
    --do_two_optim \
    --lr 1e-5 \
    --min_length 8 \
    --max_length 25 \
    --max_input_length 25 \
    --eval_start_epoch 0
```
- The HP-pretrained model will be saved as `./HP_pretrained.pth`.
- Step 4. Fine-tune the HP-pretrained model with PR on the VSD image captioning dataset by running the following script:
```
python -c "import language_evaluation; language_evaluation.download('coco')"
python run_VSD.py \
    --config ./configs/VSD_all.yaml \
    --checkpoint ./HP_pretrained.pth \
    --do_two_optim \
    --lr 1e-5 \
    --min_length 8 \
    --max_length 25 \
    --max_input_length 25 \
    --eval_start_epoch 0 \
    --save_for_HP False \
    --use_PR True
```
Please refer to the link above for further details. You should be able to follow the cells to download any required dependencies and upload your own image to test our model checkpoints for the different prediction styles.
If you use our work, please cite:
@article{jung2024,
title={VisualSiteDiary: A Detector-Free Vision Transformer Model for Captioning Photologs for Daily Construction Reporting},
author={Jung, Yoonhwa and Cho, Ikhyun and Hsu, Shun-Hsiang and Golparvar-Fard, Mani},
journal = {Automation in Construction},
volume = {165},
pages = {105483},
year = {2024},
issn = {0926-5805},
doi = {https://doi.org/10.1016/j.autcon.2024.105483}
}
The implementation of VisualSiteDiary relies on resources from mPLUG and S2-Transformer. We thank the original authors for open-sourcing their work.