We investigate a variant of multimodal search that we call multimodal search of target modality (MSTM). In this problem, a query in a target modality (such as video) is enhanced with information from auxiliary modalities (such as text), and the goal is to retrieve objects whose content in the target modality matches the multimodal query. For example, we can search for images using a reference image together with an auxiliary image and text. In our paper, "MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality," we present MUST, an effective and scalable framework for this problem. Our evaluation shows that MUST improves search accuracy by 93% on average and runs more than 10× faster than the baseline methods, and that it scales to datasets with more than 10 million records.
This repo contains the code, datasets, optimal parameters, and other details used in our experiments.
We compare MUST with two existing frameworks:
- Multi-streamed retrieval (MR). MR is a traditional strategy for answering hybrid queries in the IR and DB communities [VLDB'20, SIGMOD'21]. We adapt this framework to the MSTM problem and enhance it with advanced unimodal and multimodal encoders, such as CLIP [CVPR'22].
- Joint embedding (JE). JE is a mainstream approach to multimodal fusion in the CV community. We use three representative multimodal encoders: TIRG [CVPR'19], CLIP [CVPR'22], and MPC [CVPR'22].
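To make the difference between the two baselines concrete, the sketch below shows how each one might answer an MSTM query. It is a minimal illustration only: the encoders, the brute-force search, and the helper names (`encode_image`, `mr_search`, `je_search`, etc.) are hypothetical placeholders, not the actual implementations in this repo.

```python
import numpy as np

# Hypothetical per-modality encoders; in the repo these would be
# TIRG/CLIP/MPC models rather than simple normalizers.
def encode_image(img_feat):
    return img_feat / np.linalg.norm(img_feat)

def encode_text(txt_feat):
    return txt_feat / np.linalg.norm(txt_feat)

def top_k(query_vec, base_vecs, k=10):
    # Brute-force inner-product search, standing in for an ANN index.
    scores = base_vecs @ query_vec
    order = np.argsort(-scores)[:k]
    return list(zip(order.tolist(), scores[order].tolist()))

# Multi-streamed retrieval (MR): search each modality separately,
# then merge the per-modality ranked lists by aggregated score.
def mr_search(img_query, txt_query, img_base, txt_base, k=10):
    img_hits = dict(top_k(encode_image(img_query), img_base, k * 5))
    txt_hits = dict(top_k(encode_text(txt_query), txt_base, k * 5))
    candidates = set(img_hits) | set(txt_hits)
    merged = [(c, img_hits.get(c, 0.0) + txt_hits.get(c, 0.0)) for c in candidates]
    return sorted(merged, key=lambda x: -x[1])[:k]

# Joint embedding (JE): fuse the multimodal query into a single vector
# with a multimodal encoder, then run one search in the joint space.
def je_search(img_query, txt_query, joint_base, fuse, k=10):
    return top_k(fuse(img_query, txt_query), joint_base, k)
```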
In MUST, we design three pluggable components: (1) Embedding; (2) Vector weight learning; (3) Indexing and searching.
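As a rough illustration of how the three components fit together, the sketch below scales each modality's embedding by a learned weight, concatenates the scaled vectors into one fused vector, and searches over the fused vectors. This is a simplification under our own assumptions (the helper names and the brute-force scan are hypothetical), not the framework's actual indexing and search code.

```python
import numpy as np

def fuse_vectors(modality_vecs, weights):
    # Scale each modality's (L2-normalized) embedding by its learned
    # weight and concatenate into one fused vector.
    parts = [w * (v / np.linalg.norm(v)) for v, w in zip(modality_vecs, weights)]
    return np.concatenate(parts)

def build_fused_base(objects, weights):
    # objects: list of per-modality embedding tuples, one tuple per object.
    return np.stack([fuse_vectors(vecs, weights) for vecs in objects])

def search(query_vecs, weights, fused_base, k=10):
    # MUST builds a graph index over the fused vectors; a brute-force
    # scan is used here only to keep the sketch self-contained.
    q = fuse_vectors(query_vecs, weights)
    scores = fused_base @ q
    return np.argsort(-scores)[:k]

# Toy usage: 2-modality objects with 4-d image and 3-d text embeddings.
rng = np.random.default_rng(0)
objects = [(rng.normal(size=4), rng.normal(size=3)) for _ in range(100)]
weights = [0.7, 0.3]  # e.g., the output of the vector weight learning module
fused_base = build_fused_base(objects, weights)
print(search(objects[0], weights, fused_base, k=5))
```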
Dataset | # Modalities | # Objects | # Queries | Type | Source |
---|---|---|---|---|---|
CelebA (link) | 2 | 191,549 | 34,326 | Image; Text | real-world |
MIT-States (link) | 2 | 53,743 | 72,732 | Image; Text | real-world |
Shopping* | 2 | 96,009 | 47,658 | Image; Text | real-world |
MS-COCO (link) | 3 | 19,711 | 1,237 | Image ×2; Text | real-world |
CelebA+ (link) | 4 | 191,549 | 34,326 | Image ×3; Text | real-world |
ImageText1M (link) | 2 | 1,000,000 | 1,000 | Image; Text | semi-synthetic |
AudioText1M (link) | 2 | 992,272 | 200 | Audio; Text | semi-synthetic |
VideoText1M (link) | 2 | 1,000,000 | 10,000 | Video; Text | semi-synthetic |
ImageText16M (link) | 2 | 16,000,000 | 10,000 | Image; Text | semi-synthetic |
*Please contact the author of the dataset to get access to the images.
We obtain embedding vectors using the same training hyperparameters specified in the encoders' original papers, and the encoder configuration is consistent across all three frameworks. For the vector weight learning module, we set the default learning rate to 0.002 and train for 700 iterations. For additional parameters and the weights output by the module on different datasets, please refer to the appendix.
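For reference, here is a minimal sketch of what such a weight-learning loop might look like with the stated hyperparameters (learning rate 0.002, 700 iterations). The triplet-style objective on fused vectors is illustrative only and the names are placeholders; the module's actual objective is defined in the paper and the `vector_weight_learning` code.

```python
import torch

def fuse(weights, modality_vecs):
    # Concatenate per-modality vectors scaled by the learnable weights.
    return torch.cat([w * v for w, v in zip(weights, modality_vecs)], dim=-1)

def train_weights(train_queries, lr=0.002, iterations=700, margin=0.1):
    # train_queries: iterable of (query_vecs, pos_vecs, neg_vecs), each a
    # tuple of per-modality tensors. Two modalities are assumed here.
    weights = torch.nn.Parameter(torch.ones(2))
    opt = torch.optim.Adam([weights], lr=lr)
    for _ in range(iterations):
        for query_vecs, pos_vecs, neg_vecs in train_queries:
            q = fuse(weights, query_vecs)
            pos = fuse(weights, pos_vecs)
            neg = fuse(weights, neg_vecs)
            # Pull the fused query toward its positive target and away
            # from a negative one (illustrative triplet loss).
            loss = torch.relu(margin + (q - pos).norm() - (q - neg).norm())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return weights.detach()
```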
PyTorch
Pybind
GCC 4.9+ with OpenMP
CMake 2.8+
(i) Embedding
We convert the vectors of all objects and query inputs to `fvecs` or `ivecs` format, and the ground-truth data to `ivecs` format. For a description of the `fvecs` and `ivecs` formats, see here.
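The `fvecs`/`ivecs` layout is simple enough to produce directly with NumPy: each vector is stored as a 4-byte integer giving its dimension, followed by its components as 4-byte floats (`fvecs`) or 4-byte integers (`ivecs`). The helpers below are an illustrative sketch of such a converter, not the repo's own tooling.

```python
import numpy as np

def write_fvecs(path, vectors):
    # vectors: 2-D float array, one vector per row.
    vectors = np.asarray(vectors, dtype=np.float32)
    n, d = vectors.shape
    # Prepend the dimension (as int32 bits) to every row, then dump the bytes.
    dims = np.full((n, 1), d, dtype=np.int32)
    np.hstack([dims.view(np.float32), vectors]).tofile(path)

def write_ivecs(path, vectors):
    # Same layout as fvecs, but with int32 components (used for ground truth).
    vectors = np.asarray(vectors, dtype=np.int32)
    n, d = vectors.shape
    dims = np.full((n, 1), d, dtype=np.int32)
    np.hstack([dims, vectors]).tofile(path)

def read_fvecs(path):
    # Read back an fvecs file, dropping the per-row dimension field.
    raw = np.fromfile(path, dtype=np.float32)
    d = raw[:1].view(np.int32)[0]
    return raw.reshape(-1, d + 1)[:, 1:]
```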
(ii) Vector weight learning
cd ./vector_weight_learning
python setup.py install
python main.py
(iii) Indexing and searching
cd ./scripts
./run release build_<framework> # index build
./run release search_<framework> # search
We use the embedding implementations from TIRG, CLIP, and MPC, and we build our indexing and search components on top of CGraph. We appreciate the inspiration and reference implementations these projects provided.