How to accelerate loading/releasing collections? #38173
-
The error "node offline[node=11]" indicates a query node is down. A similar issue where the full log was helpful: #32492. 1B vectors is a huge dataset; the raw data size could be more than 4TB if the dimension is 1024. If you have deployed a monitoring system, you can observe these metrics to view the loading progress:
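As a rough check of the size estimate above, and one way to watch loading progress programmatically, here is a minimal sketch. The `utility.loading_progress` call and the connection endpoint are assumptions based on the pymilvus 2.x client API; adjust them to your deployment.

```python
# Rough raw-data size estimate for float32 vectors, plus a progress poller.
import time

def raw_size_tb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw vector data size in TB (decimal), ignoring index overhead."""
    return num_vectors * dim * bytes_per_value / 1e12

def watch_loading(collection_name: str, interval_s: float = 10.0) -> None:
    """Poll load progress until the collection reports fully loaded."""
    from pymilvus import connections, utility  # assumed pymilvus 2.x API
    connections.connect(host="localhost", port="19530")  # placeholder endpoint
    while True:
        progress = utility.loading_progress(collection_name)
        print(collection_name, progress)
        if progress.get("loading_progress") == "100%":
            break
        time.sleep(interval_s)

# 1B vectors at dim=1024 in float32 is ~4.1 TB of raw data:
print(f"{raw_size_tb(1_000_000_000, 1024):.2f} TB")
```

Note that `raw_size_tb` counts only the raw float32 payload; HNSW graphs and other index structures add overhead on top of that.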
-
Update:
-
There is a fix for Milvus load CPU usage.
-
We would expect all data to be loaded in under 10 minutes. Check the fix and give us some profiling results for your case.
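To produce the profiling results requested above, a minimal sketch of timing each collection's load. The collection handling via `Collection.release()`/`Collection.load()` is an assumption based on the pymilvus 2.x API; names and endpoints are placeholders.

```python
# Time how long one collection takes to load, to compare before/after the fix.
import time

def to_minutes(seconds: float) -> float:
    """Convert seconds to minutes, rounded to one decimal place."""
    return round(seconds / 60.0, 1)

def profile_load(collection_name: str) -> float:
    """Release and re-load one collection, returning wall-clock seconds."""
    from pymilvus import Collection  # assumed pymilvus 2.x API
    coll = Collection(collection_name)
    coll.release()
    start = time.perf_counter()
    coll.load()  # blocks until the collection is loaded
    return time.perf_counter() - start

# Example: a 1200-second load is 20 minutes.
print(to_minutes(1200))  # -> 20.0
```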
-
We are still working on it; this might be released with 2.4.18 and 2.5.x.
-
Hi, I have a Milvus cluster (2.4.17) with 100 collections; each collection has 10 partitions (using a partition key).
Each collection has ~10M vectors, so in total the cluster has ~1B vectors. (HNSW + MMAP + local NVMe SSD disk)
For some reason, we rotated the query nodes of the EKS cluster, so all query nodes were recreated. After that, all ingestion failed with the following errors:
<MilvusException: (code=503, message=node offline[node=11]: channel not available[channel=milvus-1000p-10m-bases-hnsw-mmap-rootcoord-dml_44_454283181056069954v0])
All searches failed with the following errors:
<MilvusException: (code=503, message=failed to search: segment lacks[segment=454283181884829810]: channel not available[channel=milvus-1000p-10m-bases-hnsw-mmap-rootcoord-dml_21_454283181056068380v0])> <MilvusException: (code=503, message=failed to search: node offline[node=13]: channel not available[channel=milvus-1000p-10m-bases-hnsw-mmap-rootcoord-dml_53_454283181056070695v0])>
So I'm trying to release and load all 100 collections again. However, it takes ~20 minutes to load each collection,
so releasing and loading all 100 collections would take >30 hours.
The cluster has 10 query nodes; each query node is an "i4i.4xlarge" (16 cores, 128GB memory, 3.5TB NVMe disk).
Per the dashboard, each query node's received network traffic is ~100 MB/s while re-loading a collection.
Is there any way to make the query nodes load all the collections faster?
Thanks!
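The release-and-reload loop described above can be sketched as follows, issuing the loads asynchronously so they overlap instead of running one collection at a time. The `_async=True` kwarg and `utility.wait_for_loading_complete` helper are assumptions based on the pymilvus 2.x API; the endpoint is a placeholder.

```python
# Release all collections, then kick off loads in parallel rather than
# serially (100 collections x ~20 min each is >30 hours sequentially).
def sequential_hours(num_collections: int, minutes_each: float) -> float:
    """Worst-case total time if collections are loaded one after another."""
    return num_collections * minutes_each / 60.0

def reload_all() -> None:
    from pymilvus import Collection, connections, utility  # assumed 2.x API
    connections.connect(host="localhost", port="19530")  # placeholder endpoint
    names = utility.list_collections()
    for name in names:
        Collection(name).release()
    for name in names:
        Collection(name).load(_async=True)  # don't block per collection
    for name in names:
        utility.wait_for_loading_complete(name)

# 100 collections at ~20 minutes each, done serially:
print(f"{sequential_hours(100, 20):.1f} hours")  # -> 33.3 hours
```

Actual wall-clock time with overlapping loads still depends on how the query nodes schedule segment loading, so this only removes the per-collection serialization on the client side.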