xFasterTransformer provides C++ and Python (PyTorch) examples that demonstrate API usage, along with Gradio-based web demos for some models. All examples and web demos support multi-rank execution.
The C++ example automatically identifies the model and its tokenizer. Tokenizers are implemented with SentencePiece, except for the Opt model, whose tokenizer is hard-coded.
The Python (PyTorch) example runs end-to-end inference with streaming output, using the tokenizer from the Hugging Face Transformers library.
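The flow described above can be sketched roughly as follows. This is a minimal single-rank illustration, not the example shipped in the repo: the helper names `load_model_and_tokenizer` and `generate_text`, the paths, and the `dtype="bf16"` and `max_length=200` values are assumptions chosen for illustration. It also assumes that `xfastertransformer.AutoModel.from_pretrained` and `model.generate(..., streamer=...)` accept the arguments shown, so check the repo's Python example for the exact usage.

```python
def load_model_and_tokenizer(model_path, token_path, dtype="bf16"):
    """Load a converted xFasterTransformer model plus a Hugging Face tokenizer.

    Imports are done lazily so this file can be imported without the
    libraries installed. The argument values are illustrative assumptions.
    """
    import xfastertransformer
    from transformers import AutoTokenizer, TextStreamer

    tokenizer = AutoTokenizer.from_pretrained(
        token_path, use_fast=False, padding_side="left", trust_remote_code=True
    )
    # TextStreamer prints decoded tokens to stdout as they are generated.
    streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)
    model = xfastertransformer.AutoModel.from_pretrained(model_path, dtype=dtype)
    return model, tokenizer, streamer


def generate_text(model, tokenizer, prompt, max_length=200, streamer=None):
    """Tokenize the prompt, run generation, and decode the resulting ids."""
    input_ids = tokenizer(prompt, return_tensors="pt", padding=False).input_ids
    generated_ids = model.generate(input_ids, max_length=max_length, streamer=streamer)
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)


# Hypothetical usage (paths are placeholders):
#   model, tokenizer, streamer = load_model_and_tokenizer("/path/to/xft-model", "/path/to/hf-model")
#   generate_text(model, tokenizer, "Once upon a time", streamer=streamer)
```

Passing the model and tokenizer in as arguments keeps `generate_text` independent of how they were loaded, which makes it easy to try with a different model from the supported list.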
A web demo based on Gradio is provided in the repo.
Supported models:
- ChatGLM
- ChatGLM2
- ChatGLM3
- ChatGLM4
- Llama2
- Llama3
- Gemma
- Yi
- Baichuan2
- Qwen
- Qwen2