Decoupled mode, dimensionality explosion #7378

SeibertronSS · 2024-06-26T16:40:06Z

I implemented a continuous batching backend in C++ that supports streaming the LLM results back. However, sometimes when the LLM results are passed to postprocessing for decoding, the number of token IDs roughly doubles after each step, and I eventually get an output_id with tens of millions of elements. I don't know whether this is because Triton caches the responses or because my buffer is not being cleared in time. The output passed to postprocessing looks like this: [8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 99662, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 99662, 99808, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 99662, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 8908, 8908, 234, 8908, 8908, 234, 114, 8908, 8908, 234, 8908, 8908, 234, 114, 103081, 99662, 99808, 99219, 9909]
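For illustration only: the first seven IDs in the sample, [8908, 8908, 234, 8908, 8908, 234, 114], are what you would see if each step emitted the previous output twice followed by one new token. The sketch below reproduces that growth under an assumed bug (a response buffer that is reused across steps and has the history appended to it again) and contrasts it with sending only the new tokens per step. `RequestState`, `buggy_step`, and `next_delta` are hypothetical names, not Triton API, and the actual cause in the backend may be different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-request state (illustrative, not part of Triton).
struct RequestState {
    std::vector<int32_t> generated;  // every token generated so far
    std::size_t sent = 0;            // number of tokens already streamed out
};

// Assumed buggy pattern: the buffer still holds the previous response, and
// the history is appended to it again before the new token is added.
// The length goes n -> 2n + 1 on every step.
std::vector<int32_t> buggy_step(std::vector<int32_t>& response_buffer,
                                int32_t new_token) {
    const std::vector<int32_t> history = response_buffer;  // previous response
    response_buffer.insert(response_buffer.end(), history.begin(), history.end());
    response_buffer.push_back(new_token);
    return response_buffer;
}

// One possible fix: record the new token, then return only the tokens that
// have not been streamed yet, so each response stays small.
std::vector<int32_t> next_delta(RequestState& state, int32_t new_token) {
    state.generated.push_back(new_token);
    assert(state.sent <= state.generated.size());
    const auto first =
        state.generated.begin() + static_cast<std::ptrdiff_t>(state.sent);
    std::vector<int32_t> delta(first, state.generated.end());
    state.sent = state.generated.size();
    return delta;
}
```

Calling `buggy_step` three times with 8908, 234, and 114 on an initially empty buffer yields the seven-element prefix seen in the sample, while `next_delta` returns a single token per step. If postprocessing needs the full sequence instead of deltas, building a fresh response vector from `generated` on every step (rather than appending to a reused one) would avoid the same growth.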
