    Repositories list

    • Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks
      Python
      Apache License 2.0
      1871.3k399 · Updated Nov 12, 2024
    • OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) across 100+ datasets.
      Python
      Apache License 2.0
      4364.1k20522 · Updated Nov 12, 2024
    • [NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
      Python
      Apache License 2.0
      12810 · Updated Nov 12, 2024
    • GTA
      Official repository for the paper "GTA: A Benchmark for General Tool Agents" (NeurIPS 2024 D&B Track)
      Python
      Apache License 2.0
      54310 · Updated Nov 6, 2024
    • ANAH
      [ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2
      Python
      Apache License 2.0
      12210 · Updated Oct 29, 2024
    • 46700 · Updated Oct 25, 2024
    • ProSA
      [EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
      Python
      Apache License 2.0
      11800 · Updated Oct 22, 2024
    • Python
      Apache License 2.0
      0100 · Updated Sep 23, 2024
    • MMBench
      Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
      Apache License 2.0
      1016430 · Updated Sep 1, 2024
    • hinode
      A clean documentation and blog theme for your Hugo site based on Bootstrap 5
      HTML
      MIT License
      52000 · Updated Sep 1, 2024
    • storage
      Apache License 2.0
      0000 · Updated Aug 18, 2024
    • Demo data of CompassBench
      3320 · Updated Aug 7, 2024
    • CIBench
      Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"
      Python
      Apache License 2.0
      1800 · Updated Jul 19, 2024
    • Jupyter Notebook
      69440 · Updated Jul 17, 2024
    • MathBench
      [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
      Apache License 2.0
      18460 · Updated Jul 12, 2024
    • .github
      1000 · Updated May 31, 2024
    • DevBench
      A Comprehensive Benchmark for Software Development.
      Python
      Apache License 2.0
      58410 · Updated May 30, 2024
    • CodeBench
      0200 · Updated May 21, 2024
    • Ada-LEval
      The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
      Python
      25000 · Updated Apr 22, 2024
    • T-Eval
      [ACL 2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
      Python
      Apache License 2.0
      14230342 · Updated Apr 3, 2024
    • Code for the paper "Evaluating Large Language Models Trained on Code"
      Python
      MIT License
      343200 · Updated Mar 14, 2024
    • Apache License 2.0
      24030 · Updated Mar 8, 2024
    • A multi-language code evaluation tool.
      Python
      Apache License 2.0
      71701 · Updated Jan 26, 2024
    • evalplus
      EvalPlus for rigorous evaluation of LLM-synthesized code
      Python
      Apache License 2.0
      109100 · Updated Dec 20, 2023
    • A toolkit for inference and evaluation of 'mixtral-8x7b-32kseqlen' from Mistral AI
      Python
      Apache License 2.0
      80764120 · Updated Dec 15, 2023
    • LawBench
      Benchmarking Legal Knowledge of Large Language Models
      Python
      Apache License 2.0
      3826230 · Updated Nov 13, 2023
    • BotChat
      Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
      Jupyter Notebook
      Apache License 2.0
      613910 · Updated Nov 2, 2023
    • Sphinx Theme for OpenCompass - Modified from PyTorch
      CSS
      MIT License
      138000 · Updated Aug 30, 2023