How to accelerate loading/releasing collections? #38173
-
The error "node offline[node=11]" indicates a query node is down. A similar issue where the full log was helpful: #32492. 1B vectors is a huge dataset; the raw data size could be more than 4TB if the dimension is 1024. If you have deployed a monitoring system, you can observe these metrics to view the loading progress:
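As a rough check of the size estimate above, and one way to watch loading progress programmatically, here is a minimal sketch. The `utility.loading_progress` call and the connection endpoint are assumptions based on the pymilvus 2.x client API; adjust them to your deployment.

```python
# Rough raw-data size estimate for float32 vectors, plus a progress poller.
import time

def raw_size_tb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw vector data size in TB (decimal), ignoring index overhead."""
    return num_vectors * dim * bytes_per_value / 1e12

def watch_loading(collection_name: str, interval_s: float = 10.0) -> None:
    """Poll load progress until the collection reports fully loaded."""
    from pymilvus import connections, utility  # assumed pymilvus 2.x API
    connections.connect(host="localhost", port="19530")  # placeholder endpoint
    while True:
        progress = utility.loading_progress(collection_name)
        print(collection_name, progress)
        if progress.get("loading_progress") == "100%":
            break
        time.sleep(interval_s)

# 1B vectors at dim=1024 in float32 is ~4.1 TB of raw data:
print(f"{raw_size_tb(1_000_000_000, 1024):.2f} TB")
```

Note that `raw_size_tb` counts only the raw float32 payload; HNSW graphs and other index structures add overhead on top of that.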
-
Update:
-
There is a fix for Milvus load CPU usage.
-
We would expect all data to be loaded in under 10 minutes. Check the fix and give us some profiling results for your case.
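To produce the profiling results requested above, a minimal sketch of timing each collection's load. The collection handling via `Collection.release()`/`Collection.load()` is an assumption based on the pymilvus 2.x API; names and endpoints are placeholders.

```python
# Time how long one collection takes to load, to compare before/after the fix.
import time

def to_minutes(seconds: float) -> float:
    """Convert seconds to minutes, rounded to one decimal place."""
    return round(seconds / 60.0, 1)

def profile_load(collection_name: str) -> float:
    """Release and re-load one collection, returning wall-clock seconds."""
    from pymilvus import Collection  # assumed pymilvus 2.x API
    coll = Collection(collection_name)
    coll.release()
    start = time.perf_counter()
    coll.load()  # blocks until the collection is loaded
    return time.perf_counter() - start

# Example: a 1200-second load is 20 minutes.
print(to_minutes(1200))  # -> 20.0
```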
-
We are still working on it; this might be released with 2.4.18 and 2.5.x.
-
Hi, I have a Milvus cluster (2.4.17) with 100 collections; each collection has 10 partitions (using a partition key).
Each collection has ~10M vectors, so in total the cluster has ~1B vectors. (HNSW + MMAP + local NVMe SSD disk)
For some reason, we rotated the query nodes of the EKS cluster, so all query nodes were recreated. After that, all ingestion failed with the following errors:
<MilvusException: (code=503, message=node offline[node=11]: channel not available[channel=milvus-1000p-10m-bases-hnsw-mmap-rootcoord-dml_44_454283181056069954v0])
All searches failed with the following errors:
<MilvusException: (code=503, message=failed to search: segment lacks[segment=454283181884829810]: channel not available[channel=milvus-1000p-10m-bases-hnsw-mmap-rootcoord-dml_21_454283181056068380v0])> <MilvusException: (code=503, message=failed to search: node offline[node=13]: channel not available[channel=milvus-1000p-10m-bases-hnsw-mmap-rootcoord-dml_53_454283181056070695v0])>
So I'm trying to release and load all 100 collections again. However, it takes ~20 minutes to load each collection,
so releasing and loading all 100 collections would take >30 hours.
The cluster has 10 query nodes; each query node is an "i4i.4xlarge" (16 cores, 128GB memory, 3.5TB NVMe disk).
Per the dashboard, each query node's received network traffic is ~100 MB/s while re-loading a collection.
Is there any way to make the query nodes load all the collections faster?
Thanks!
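The release-and-reload loop described above can be sketched as follows, issuing the loads asynchronously so they overlap instead of running one collection at a time. The `_async=True` kwarg and `utility.wait_for_loading_complete` helper are assumptions based on the pymilvus 2.x API; the endpoint is a placeholder.

```python
# Release all collections, then kick off loads in parallel rather than
# serially (100 collections x ~20 min each is >30 hours sequentially).
def sequential_hours(num_collections: int, minutes_each: float) -> float:
    """Worst-case total time if collections are loaded one after another."""
    return num_collections * minutes_each / 60.0

def reload_all() -> None:
    from pymilvus import Collection, connections, utility  # assumed 2.x API
    connections.connect(host="localhost", port="19530")  # placeholder endpoint
    names = utility.list_collections()
    for name in names:
        Collection(name).release()
    for name in names:
        Collection(name).load(_async=True)  # don't block per collection
    for name in names:
        utility.wait_for_loading_complete(name)

# 100 collections at ~20 minutes each, done serially:
print(f"{sequential_hours(100, 20):.1f} hours")  # -> 33.3 hours
```

Actual wall-clock time with overlapping loads still depends on how the query nodes schedule segment loading, so this only removes the per-collection serialization on the client side.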