We investigate a variant of multimodal search that we call multimodal search of target modality (MSTM). In this problem, a query in a target modality (such as video) is enhanced with information from auxiliary modalities (such as text), and the goal is to retrieve objects whose content in the target modality matches the multimodal query. For example, we can search for images using a reference image together with an auxiliary image and text. In our paper, "MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality," we present MUST, an effective and scalable framework for this problem. Our evaluation shows that MUST improves search accuracy by 93% on average and runs more than 10× faster than the baseline methods, and that it scales to datasets with more than 10 million records.
This repo contains the code, datasets, optimal parameters, and other details used in our experiments.
We compare MUST with two existing frameworks:
- Multi-streamed retrieval (MR). MR is a traditional strategy for answering hybrid queries in the IR and DB communities [VLDB'20, SIGMOD'21]. We adapt this framework to the MSTM problem and enhance it with advanced unimodal and multimodal encoders, such as CLIP [CVPR'22].
- Joint embedding (JE). JE is a mainstream approach to multimodal fusion in the CV community. We use three representative multimodal encoders: TIRG [CVPR'19], CLIP [CVPR'22], and MPC [CVPR'22].
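To make the difference between the two baselines concrete, the sketch below shows how each one might answer an MSTM query. It is a minimal illustration only: the encoders, the brute-force search, and the helper names (`encode_image`, `mr_search`, `je_search`, etc.) are hypothetical placeholders, not the actual implementations in this repo.

```python
import numpy as np

# Hypothetical per-modality encoders; in the repo these would be
# TIRG/CLIP/MPC models rather than simple normalizers.
def encode_image(img_feat):
    return img_feat / np.linalg.norm(img_feat)

def encode_text(txt_feat):
    return txt_feat / np.linalg.norm(txt_feat)

def top_k(query_vec, base_vecs, k=10):
    # Brute-force inner-product search, standing in for an ANN index.
    scores = base_vecs @ query_vec
    order = np.argsort(-scores)[:k]
    return list(zip(order.tolist(), scores[order].tolist()))

# Multi-streamed retrieval (MR): search each modality separately,
# then merge the per-modality ranked lists by aggregated score.
def mr_search(img_query, txt_query, img_base, txt_base, k=10):
    img_hits = dict(top_k(encode_image(img_query), img_base, k * 5))
    txt_hits = dict(top_k(encode_text(txt_query), txt_base, k * 5))
    candidates = set(img_hits) | set(txt_hits)
    merged = [(c, img_hits.get(c, 0.0) + txt_hits.get(c, 0.0)) for c in candidates]
    return sorted(merged, key=lambda x: -x[1])[:k]

# Joint embedding (JE): fuse the multimodal query into a single vector
# with a multimodal encoder, then run one search in the joint space.
def je_search(img_query, txt_query, joint_base, fuse, k=10):
    return top_k(fuse(img_query, txt_query), joint_base, k)
```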
In MUST, we design three pluggable components: (1) Embedding; (2) Vector weight learning; (3) Indexing and searching.
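As a rough illustration of how the three components fit together, the sketch below scales each modality's embedding by a learned weight, concatenates the scaled vectors into one fused vector, and searches over the fused vectors. This is a simplification under our own assumptions (the helper names and the brute-force scan are hypothetical), not the framework's actual indexing and search code.

```python
import numpy as np

def fuse_vectors(modality_vecs, weights):
    # Scale each modality's (L2-normalized) embedding by its learned
    # weight and concatenate into one fused vector.
    parts = [w * (v / np.linalg.norm(v)) for v, w in zip(modality_vecs, weights)]
    return np.concatenate(parts)

def build_fused_base(objects, weights):
    # objects: list of per-modality embedding tuples, one tuple per object.
    return np.stack([fuse_vectors(vecs, weights) for vecs in objects])

def search(query_vecs, weights, fused_base, k=10):
    # MUST builds a graph index over the fused vectors; a brute-force
    # scan is used here only to keep the sketch self-contained.
    q = fuse_vectors(query_vecs, weights)
    scores = fused_base @ q
    return np.argsort(-scores)[:k]

# Toy usage: 2-modality objects with 4-d image and 3-d text embeddings.
rng = np.random.default_rng(0)
objects = [(rng.normal(size=4), rng.normal(size=3)) for _ in range(100)]
weights = [0.7, 0.3]  # e.g., the output of the vector weight learning module
fused_base = build_fused_base(objects, weights)
print(search(objects[0], weights, fused_base, k=5))
```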
Dataset | # Modalities | # Objects | # Queries | Type | Source |
---|---|---|---|---|---|
CelebA (link) | 2 | 191,549 | 34,326 | Image; Text | real-world |
MIT-States (link) | 2 | 53,743 | 72,732 | Image; Text | real-world |
Shopping* | 2 | 96,009 | 47,658 | Image; Text | real-world |
MS-COCO (link) | 3 | 19,711 | 1,237 | Image ×2; Text | real-world |
CelebA+ (link) | 4 | 191,549 | 34,326 | Image ×3; Text | real-world |
ImageText1M (link) | 2 | 1,000,000 | 1,000 | Image; Text | semi-synthetic |
AudioText1M (link) | 2 | 992,272 | 200 | Audio; Text | semi-synthetic |
VideoText1M (link) | 2 | 1,000,000 | 10,000 | Video; Text | semi-synthetic |
ImageText16M (link) | 2 | 16,000,000 | 10,000 | Image; Text | semi-synthetic |
*Please contact the author of the dataset to get access to the images.
We obtain embedding vectors using the same training hyperparameters specified in the encoders' original papers, and the encoder configuration is consistent across all three frameworks. For the vector weight learning module, we set the default learning rate to 0.002 and train for 700 iterations. For additional parameters and the weights output by the module on different datasets, please refer to the appendix.
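For reference, here is a minimal sketch of what such a weight-learning loop might look like with the stated hyperparameters (learning rate 0.002, 700 iterations). The triplet-style objective on fused vectors is illustrative only and the names are placeholders; the module's actual objective is defined in the paper and the `vector_weight_learning` code.

```python
import torch

def fuse(weights, modality_vecs):
    # Concatenate per-modality vectors scaled by the learnable weights.
    return torch.cat([w * v for w, v in zip(weights, modality_vecs)], dim=-1)

def train_weights(train_queries, lr=0.002, iterations=700, margin=0.1):
    # train_queries: iterable of (query_vecs, pos_vecs, neg_vecs), each a
    # tuple of per-modality tensors. Two modalities are assumed here.
    weights = torch.nn.Parameter(torch.ones(2))
    opt = torch.optim.Adam([weights], lr=lr)
    for _ in range(iterations):
        for query_vecs, pos_vecs, neg_vecs in train_queries:
            q = fuse(weights, query_vecs)
            pos = fuse(weights, pos_vecs)
            neg = fuse(weights, neg_vecs)
            # Pull the fused query toward its positive target and away
            # from a negative one (illustrative triplet loss).
            loss = torch.relu(margin + (q - pos).norm() - (q - neg).norm())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return weights.detach()
```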
PyTorch
Pybind
GCC 4.9+ with OpenMP
CMake 2.8+
(i) Embedding
We convert the vectors of all objects and query inputs to `fvecs` or `ivecs` format, and the ground-truth data to `ivecs` format. For a description of the `fvecs` and `ivecs` formats, see here.
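The `fvecs`/`ivecs` layout is simple enough to produce directly with NumPy: each vector is stored as a 4-byte integer giving its dimension, followed by its components as 4-byte floats (`fvecs`) or 4-byte integers (`ivecs`). The helpers below are an illustrative sketch of such a converter, not the repo's own tooling.

```python
import numpy as np

def write_fvecs(path, vectors):
    # vectors: 2-D float array, one vector per row.
    vectors = np.asarray(vectors, dtype=np.float32)
    n, d = vectors.shape
    # Prepend the dimension (as int32 bits) to every row, then dump the bytes.
    dims = np.full((n, 1), d, dtype=np.int32)
    np.hstack([dims.view(np.float32), vectors]).tofile(path)

def write_ivecs(path, vectors):
    # Same layout as fvecs, but with int32 components (used for ground truth).
    vectors = np.asarray(vectors, dtype=np.int32)
    n, d = vectors.shape
    dims = np.full((n, 1), d, dtype=np.int32)
    np.hstack([dims, vectors]).tofile(path)

def read_fvecs(path):
    # Read back an fvecs file, dropping the per-row dimension field.
    raw = np.fromfile(path, dtype=np.float32)
    d = raw[:1].view(np.int32)[0]
    return raw.reshape(-1, d + 1)[:, 1:]
```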
(ii) Vector weight learning
cd ./vector_weight_learning
python setup.py install
python main.py
(iii) Indexing and searching
cd ./scripts
./run release build_<framework> # index build
./run release search_<framework> # search
We use the embedding implementations from TIRG, CLIP, and MPC, and we build our indexing and search components on top of CGraph. We appreciate the inspiration and reference implementations these projects provided.