Releases: huggingface/optimum-tpu
v0.2.0
This is the first release of Optimum TPU that includes support for the Jetstream Pytorch engine as a backend for Text Generation Inference (TGI).
JetStream is a throughput- and memory-optimized engine for LLM inference on TPUs, and its Pytorch implementation allows for seamless integration into the TGI code. The supported models are, for now, Llama 2 and Llama 3, Gemma 1, and Mixtral; serving inference on these models has given results close to 10x in tokens/sec compared to the previously used backend (Pytorch XLA/transformers).
On top of that, it is possible to use quantization to serve with even fewer resources while maintaining similar throughput and quality.
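As with any TGI deployment, a server backed by Jetstream Pytorch can be queried through TGI's standard `/generate` REST endpoint. A minimal sketch in Python, assuming a server is already running locally on port 8080 (host, port, and prompt are placeholders):

```python
# Minimal sketch: query a running TGI server over its standard /generate endpoint.
# Assumes the TPU-backed TGI container is already serving on localhost:8080.
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is the capital of France?",
        "parameters": {"max_new_tokens": 64},
    },
)
print(response.json()["generated_text"])
```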
Details follow.
What's Changed
- Update colab examples by @wenxindongwork in #86
- ci(docker): update torch-xla to 2.4.0 by @tengomucho in #89
- ✈️ Introduce Jetstream/Pytorch in TGI by @tengomucho in #88
- 🦙 Llama3 on TGI - Jetstream Pytorch by @tengomucho in #90
- ☝️ Update Jetstream Pytorch revision by @tengomucho in #91
- Correct extra token, start preparing docker image for TGI/Jetstream Pt by @tengomucho in #93
- Fix generation using Jetstream Pytorch by @tengomucho in #94
- Fix slow tests by @tengomucho in #95
- 🧹 Cleanup and fixes for TGI by @tengomucho in #96
- Small TGI enhancements by @tengomucho in #97
- fix(TGI Jetstream Pt): prefill should be done with max input size by @tengomucho in #98
- 💎 Gemma on TGI Jetstream Pytorch by @tengomucho in #99
- Fix ci nightly jetstream by @tengomucho in #101
- CI ephemeral TPUs by @tengomucho in #102
- 🍃 Added Mixtral on TGI / Jetstream Pytorch by @tengomucho in #103
- Add CLI to install dependencies by @tengomucho in #104
- ⛰ CI: mount hub cache and fix issues with cli by @tengomucho in #106
- fix(docker): correct jetstream installation in TGI docker image by @tengomucho in #107
- ✏️ docs: Add training guide and improve documentation consistency by @baptistecolle in #110
- Quantization Jetstream Pytorch by @tengomucho in #111
- fix: graceful shutdown was not working with entrypoint, exec launcher by @co42 in #112
- fix(doc): correct link to deploy page by @tengomucho in #115
- More Jetstream Pytorch fixes, prepare for release by @tengomucho in #116
New Contributors
- @wenxindongwork made their first contribution in #86
- @baptistecolle made their first contribution in #110
- @co42 made their first contribution in #112
Full Changelog: v0.1.5...v0.2.0
v0.1.5
This release is essentially the same as the previous one (v0.1.4), but it fixes the PyPI package publication.
v0.1.4
These changes focus on improving support for instruct models and solve an issue that appeared when using those models through the web UI with invalid settings.
What's Changed
- Fix secret leak workflow by @tengomucho in #72
- Handle selector exception by @tengomucho in #73
- chore(tgi): update TGI base image by @tengomucho in #75
- Fix instruct models UI issue by @tengomucho in #78
Full Changelog: v0.1.3...v0.1.4
v0.1.3
Cleanup of previous fixes and a lower batch size to prevent memory issues on Inference Endpoints with some models.
What's Changed
- Few more Inference Endpoints fixes by @tengomucho in #69
- feat(cache): use optimized StaticCache class for XLA by @tengomucho in #70
- Lower TGI IE batch size by @tengomucho in #71
Full Changelog: v0.1.2...v0.1.3
v0.1.2
What's Changed
This release contains only a few small fixes, mainly for Inference Endpoints.
- Several Inference Endpoint fixes by @tengomucho in #66
- More Inference Endpoints features and fixes by @tengomucho in #68
Full Changelog: v0.1.1...v0.1.2
v0.1.1
First TPU release, making TPU Text Generation Inference and Inference Endpoints container images available.
What's Changed
- Basic TGI server on XLA by @tengomucho in #1
- Enable CI/CD by @tengomucho in #2
- Fix TGI Dockerfile by @shub-kris in #3
- Add static KV cache and test on Gemma-2B by @tengomucho in #4
- Small optimizations by @tengomucho in #5
- Enable compilation by @tengomucho in #6
- Revert "fix: attention mask should be 1 or 0" by @tengomucho in #8
- feat: use dynamic batching when generating by @tengomucho in #9
- Repo layout by @tengomucho in #10
- Add PyPI release workflow by @regisss in #11
- Xla parallel proxy by @tengomucho in #12
- Add documentation to the repository by @mfuntowicz in #13
- Adopt naming convention of transformers API by @mfuntowicz in #14
- Fix main doc build workflow by @regisss in #15
- Improve readme by @mfuntowicz in #16
- Fix layout in README by @mfuntowicz in #17
- Fix rule and instructions for TGI by @mfuntowicz in #18
- Fix typo in index.mdx by @mfuntowicz in #19
- Added some links to Cloud TPU documentation by @mikegre-google in #20
- Parallel sharding by @tengomucho in #21
- Bump version to 0.1.0.dev1 by @mfuntowicz in #24
- Bump version to 0.1.0.dev2 by @mfuntowicz in #25
- Fix TGI missing import by @mfuntowicz in #27
- Forward arguments from TGI launcher to the model by @mfuntowicz in #28
- Fix optimum-tpu pip install instructions by @mfuntowicz in #29
- Fix tests with do_sample=True by @tengomucho in #30
- Sharding in tgi by @tengomucho in #31
- Fix missing '=' to assign environment variables in the default case w… by @mfuntowicz in #33
- Include two different stages for building TGI image: by @mfuntowicz in #34
- Llama support by @tengomucho in #32
- chore(ci): added workflow for nightly tests by @tengomucho in #35
- fix(build): setup.py removed from build_dist dependencies by @tengomucho in #36
- Try again to fix nightly builds by @tengomucho in #37
- Basic Llama2 Tuning by @tengomucho in #39
- Bug doc builder by @pagezyhf in #40
- Fix typo ; Update llama_tuning.md by @furkanakkurt1335 in #42
- Update to Pytorch 2.3.0 and transformers v4.40.2 by @tengomucho in #41
- Fine tuning with FSDP v2 by @tengomucho in #44
- Minor fix for mispelled stage in TGI dockerfile. by @thealmightygrant in #46
- Align to Transformers 4.41.1 by @tengomucho in #45
- chore(training): Allow training on torch xla > 2.3.0, add warning by @tengomucho in #48
- fix(build): add missing setuptools_scm section by @tengomucho in #49
- fix(logging): correct logging usage by @tengomucho in #50
- fix(tests): fix decode sample expected outputs again by @tengomucho in #52
- fix(doc): update server and port when serving TGI by @tengomucho in #53
- fix(ci): correct secrets leak workflow check by @tengomucho in #55
- Add Mistral support 💨 by @tengomucho in #54
- Mistral nits by @tengomucho in #57
- chore: bump to version v0.1.0a1 by @tengomucho in #60
- feat(TGI): add release docker image build and push to registry workflow by @tengomucho in #62
- chore: bump to version v0.1.1 by @tengomucho in #63
New Contributors
- @tengomucho made their first contribution in #1
- @shub-kris made their first contribution in #3
- @regisss made their first contribution in #11
- @mfuntowicz made their first contribution in #13
- @mikegre-google made their first contribution in #20
- @pagezyhf made their first contribution in #40
- @furkanakkurt1335 made their first contribution in #42
- @thealmightygrant made their first contribution in #46
Full Changelog: https://github.com/huggingface/optimum-tpu/commits/v0.1.1