The official repo of Qwen (通义千问), the chat and pretrained large language model series proposed by Alibaba Cloud.
Official release of InternLM2.5 base and chat models, with 1M-token context support.
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
🎉 Modern CUDA learning notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
FlashInfer: Kernel Library for LLM Serving
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.
Triton implementation of FlashAttention2 that adds Custom Masks.
Python package for rematerialization-aware gradient checkpointing
Utilities for efficient fine-tuning, inference and evaluation of code generation models
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference.
Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.
A simple PyTorch implementation of flash multi-head attention.
Poplar implementation of FlashAttention for IPU
A toy flash attention implementation in PyTorch.
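Several of the repos above implement the core flash-attention trick: tile over the key/value sequence and keep a running (online) softmax, so the full N×N score matrix is never materialized. A minimal sketch of that idea in plain PyTorch (not taken from any repo listed here; `toy_flash_attention` and its `block_size` parameter are illustrative names):

```python
import torch

def toy_flash_attention(q, k, v, block_size=64):
    """Toy single-head flash attention. q, k, v have shape (N, d).

    Iterates over key/value blocks, maintaining per-row running max and
    running softmax denominator, so only an (N, block_size) score tile
    exists at any time instead of the full (N, N) matrix.
    """
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))  # running max per query row
    row_sum = torch.zeros(n, 1)                  # running softmax denominator

    for start in range(0, n, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale           # (N, block) partial scores

        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)  # rescale old accumulators
        p = torch.exp(scores - new_max)

        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum
```

The result matches `softmax(q @ k.T / sqrt(d)) @ v` up to floating-point error; the real kernels additionally tile over queries and fuse everything into one GPU kernel.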
My project on a custom AI architecture. It combines cutting-edge machine learning techniques such as Flash-Attention, Grouped-Query Attention, ZeRO-Infinity, BitNet, etc.
Training GPT-2 on FineWeb-Edu in JAX/Flax