
Releases: huggingface/optimum-tpu

v0.2.0

20 Nov 13:06
1fc59ce

This is the first release of Optimum TPU that includes support for the JetStream PyTorch engine as a backend for Text Generation Inference (TGI).
JetStream is a throughput- and memory-optimized engine for LLM inference on TPUs, and its PyTorch implementation allows for a seamless integration into the TGI code. The supported models are, for now, Llama 2 and Llama 3, Gemma 1, and Mixtral; serving inference on these models has shown close to a 10x improvement in tokens/sec compared to the previously used backend (PyTorch XLA/transformers).
On top of that, quantization can be used to serve with even fewer resources while maintaining similar throughput and quality.
Details follow.
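
As a point of reference, here is a minimal client-side sketch of querying a TGI server that uses the new JetStream PyTorch backend. The server address and port are assumptions (they depend on how the TGI container was launched); the client call uses the standard `huggingface_hub` API, and the backend change is transparent to clients.

```python
# Minimal sketch: query a TGI server running on a TPU host with the
# JetStream PyTorch backend. Assumes a server is already listening on
# localhost:8080 (an assumption -- adjust to your deployment).
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Standard TGI text-generation request; the backend change is
# transparent here, only the serving throughput differs.
output = client.text_generation(
    "What are the benefits of TPUs for LLM inference?",
    max_new_tokens=64,
)
print(output)
```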

What's Changed

New Contributors

Full Changelog: v0.1.5...v0.2.0

v0.1.5

08 Aug 14:56
426d7be

This release is essentially the same as the previous one (v0.1.4), but it enables correct publication of the package on PyPI.

v0.1.4

23 Jul 16:20
7f5b0cc

These changes improve support for instruct models and fix an issue that appeared when using those models through the web UI with invalid settings.

What's Changed

Full Changelog: v0.1.3...v0.1.4

v0.1.3

09 Jul 10:31
e09a66b

Cleanup of previous fixes, and a lower batch size to prevent memory issues on Inference Endpoints with some models.

What's Changed

Full Changelog: v0.1.2...v0.1.3

v0.1.2

08 Jul 08:31
fd29591

What's Changed

This release contains only a few small fixes, mainly for Inference Endpoints.

Full Changelog: v0.1.1...v0.1.2

v0.1.1

25 Jun 14:20
7050cf4

First TPU release, making TPU Text Generation Inference and Inference Endpoints container images available.

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/optimum-tpu/commits/v0.1.1