Abstract. In this work, we build a recommender system for clothing retrieval from combined image-and-text queries: the user supplies an image of a garment together with a textual modification, and the system returns images matching the request. Using Transformer-based image and text feature extractors, learning the composed features with supervised deep metric learning, and enforcing a rotational-symmetry constraint in complex feature space, our ComposeTransformers retrieves 55.42% of relevant images in the top-50 results over 1,200 queries against a database of 2,646 test images.
Keywords: Vision Transformer, BERT, multi-modal search.
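The rotational-symmetry constraint mentioned above can be illustrated with a minimal sketch: the image embedding is viewed as a complex vector and rotated element-wise by text-conditioned angles, so the text acts as a rotation in complex feature space. All names here (`compose`, `theta_proj`) are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

def compose(img_feat, txt_feat, theta_proj):
    """Hedged sketch: rotate the image embedding by text-conditioned
    angles in complex space. img_feat: (d,) real vector viewed as d/2
    complex coordinates; theta_proj: projection mapping the text
    embedding to d/2 rotation angles."""
    d = img_feat.shape[0]
    z = img_feat[: d // 2] + 1j * img_feat[d // 2 :]  # complex view
    theta = txt_feat @ theta_proj                     # text -> angles
    z_rot = z * np.exp(1j * theta)                    # element-wise rotation
    return np.concatenate([z_rot.real, z_rot.imag])
```

A useful property of this parameterization is that rotation preserves the norm of the image embedding, so the text query changes direction, not magnitude.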
⭐ For the detailed report, see this article.
⭐ For the seminar slides, see here.
Annotations for the modules of the source code:
- requirements: lists the necessary libraries.
- config: a configuration file used in both the training and inference phases.
- Fashion200k: the folder containing all data and annotations; it is not available at the moment.
- dataloader: code for the dataloader (including image and text pre-processing).
- img_text_composition_model: the image-text composition module.
- logger: logger for the training phase.
- tester: evaluates the performance of the retrieval model.
- trainer: code for the training phase, with loss and evaluation metrics tracked during training.
- triplet_loss: the soft triplet loss module.
- utils: utility functions.
- ComposeTransformers_Notebook: notebook for training, evaluation, and inference.
- IMAGE_FTRS: extracted features and paths for all images in the sub-dataset.
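The soft triplet loss used for metric learning can be sketched as a softplus over the difference of positive and negative distances, so that a composed query is pulled toward its target image and pushed away from negatives. This is a generic illustration, not necessarily the exact formulation in `triplet_loss`.

```python
import numpy as np

def soft_triplet_loss(q, p, n):
    """Hedged sketch of a soft triplet loss: q is the composed query
    embedding, p the positive (target) image embedding, n a negative.
    Penalises cases where the negative is closer than the positive,
    using squared Euclidean distance and a smooth (log1p-exp) margin."""
    d_pos = np.sum((q - p) ** 2)
    d_neg = np.sum((q - n) ** 2)
    return np.log1p(np.exp(d_pos - d_neg))  # softplus(d_pos - d_neg)
```

The loss is near zero when the positive is much closer than the negative, and grows linearly as the negative overtakes the positive, giving smooth gradients everywhere.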
Open an issue if you would like the pre-trained model.
Precision (%)

| P@1 | P@10 | P@50 | P@100 |
|---|---|---|---|
| 0.25 | 5.5 | 31.6 | 58.0 |

Recall (%)

| R@1 | R@10 | R@50 | R@100 |
|---|---|---|---|
| 4.9 | 22.6 | 55.4 | 75.7 |
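The Recall@K figures above can be computed with a sketch like the following. The function name and the use of cosine similarity are assumptions for illustration; the repository's `tester` module may differ in details.

```python
import numpy as np

def recall_at_k(query_feats, gallery_feats, gt_indices, k):
    """Sketch: percentage of queries whose ground-truth gallery image
    appears in the top-k results ranked by cosine similarity."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T                              # (num_queries, num_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of k best matches
    hits = [gt in row for gt, row in zip(gt_indices, topk)]
    return 100.0 * np.mean(hits)
```

Precision@K can be derived the same way by counting how many of the top-k results are relevant rather than whether the single ground truth appears.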