SalesforceAIResearch/ThinK


Paper | Installation | Eviction | Quantization

We provide three implementations: ThinK_eager contains the code for eager attention, ThinK_flash uses FlashAttention, and ThinK_KIVI integrates with KV cache quantization. Please note that the current implementations may not be fully optimized; we are actively working on improving their efficiency. We use LongBench to evaluate performance.

✅ TODO

  • Support More Models
  • Support Multi-GPUs
  • Optimize Efficiency

Installation

Step 1: Clone this repository
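
For example (the clone URL follows from the repository name above):

git clone https://github.com/SalesforceAIResearch/ThinK.git
cd ThinK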

Step 2: Set up the environment

conda create -n think python=3.10
conda activate think
pip install -r requirements.txt

Evaluation

Eviction

Evaluate on LongBench: first modify the hyperparameters in scripts/scripts_longBench/eval.sh (e.g., pruning_ratio), then run the commands below.
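
For illustration, the settings you adjust in eval.sh might look like the following (the variable names and values here are assumptions based on the description above, not the exact contents of the script; check eval.sh for the actual ones):

pruning_ratio=0.4                               # fraction of key-cache channels to prune (illustrative value)
model_path=meta-llama/Llama-2-7b-chat-hf        # hypothetical model identifier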

cd ThinK_flash
sh ./scripts/scripts_longBench/eval.sh

Results:

sh ./scripts/scripts_longBench/metrics.sh

Quantization

cd ThinK_kivi

Set up the environment following the instructions from KIVI, then pass the additional pruning_ratio argument when running evaluation. Currently, only LLaMA-2 is supported.
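
For example, a run might look like the following (the script name, model identifier, and value are hypothetical; use the exact command from the KIVI instructions and append the pruning_ratio argument):

python pred_long_bench.py --model_name_or_path meta-llama/Llama-2-7b-chat-hf --pruning_ratio 0.4   # hypothetical invocation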

Notes

Users need to make their own assessment regarding any obligations or responsibilities under the corresponding licenses or terms and conditions pertaining to the original datasets and data. This repository is being released for research purposes only.

Citation

@article{xu2024think,
  title={ThinK: Thinner Key Cache by Query-Driven Pruning},
  author={Xu, Yuhui and Jie, Zhanming and Dong, Hanze and Wang, Lei and Lu, Xudong and Zhou, Aojun and Saha, Amrita and Xiong, Caiming and Sahoo, Doyen},
  journal={arXiv preprint arXiv:2407.21018},
  year={2024}
}

Acknowledgement

This repo builds on the SnapKV, PyramidKV, and KIVI repos.
