
Feature Tracker: Evaluation Benchmarks #237

Open
1 of 27 tasks
rmusser01 opened this issue Sep 8, 2024 · 1 comment
Assignees
Labels: enhancement (New feature or request), Feature-Addition

Comments


rmusser01 commented Sep 8, 2024

Title.

Benchmarks:
Evaluation Methodologies

  • G-Eval
  • QAG

Coding Ability

  • Aider Benchmark
  • CodeMMLU

Confabulation Rate

  • TruthfulQA
  • f

Context Length

  • Ruler
  • InfiniteBench
  • Babilong
  • LongICLBench
  • HelloBench
  • Snorkel Working Memory Test

Creative Writing

  • EQ Bench
  • f

Pop Culture

  • f

Reasoning

  • MMLU Pro
  • ARC public

Role Playing

  • Conversational Relevancy
  • Role Adherence
  • Knowledge Retention
  • Conversation Completeness

Summarization

  • DeepEval
  • Salesforce
  • F

Tool Calling

Toxicity Testing

  • DeepEval
  • f

Vibes

  • AidanBench
  • f
@rmusser01 rmusser01 added this to the Continual-Improvements milestone Sep 8, 2024
@rmusser01 (Owner, Author) commented:

Using an LLM as a Response Judge

Some metrics cannot be scored with simple objective checks; an LLM judge is particularly useful for these more subjective or complex criteria. We care about correctness, faithfulness, and relevance.

    Answer Correctness - Is the generated answer correct compared to the reference, and does it thoroughly answer the user's query?
    Answer Relevancy - Is the generated answer relevant and comprehensive?
    Answer Faithfulness - Is the generated answer factually consistent with the context document?
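A minimal sketch of how these judged metrics might be wired up. All names here (`JUDGE_PROMPT`, `build_judge_prompt`, `parse_judge_reply`) are hypothetical, the 1-5 scale is an assumption, and the actual chat-model call is deliberately omitted since this tracker does not fix a model API:

```python
import re

# Hypothetical prompt template for an LLM judge. The model call itself is
# left out; only prompt construction and reply parsing are sketched here.
JUDGE_PROMPT = """You are an impartial judge.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Rate the candidate's {criterion} on a 1-5 scale.
Respond in exactly this format:
SCORE: <1-5>
REASON: <one sentence>"""


def build_judge_prompt(question: str, reference: str, candidate: str,
                       criterion: str = "answer correctness") -> str:
    """Fill the template for one (question, reference, candidate) triple."""
    return JUDGE_PROMPT.format(question=question, reference=reference,
                               candidate=candidate, criterion=criterion)


def parse_judge_reply(reply: str) -> tuple[int, str]:
    """Extract the 1-5 score and the one-line rationale from a judge reply."""
    score = re.search(r"SCORE:\s*([1-5])", reply)
    if score is None:
        raise ValueError("judge reply missing SCORE line")
    reason = re.search(r"REASON:\s*(.+)", reply)
    return int(score.group(1)), reason.group(1).strip() if reason else ""
```

The same prompt builder could serve correctness, relevancy, or faithfulness by swapping the `criterion` argument; the strict `SCORE:`/`REASON:` format makes the reply machine-parseable regardless of which model does the judging.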

@rmusser01 rmusser01 self-assigned this Nov 17, 2024
@rmusser01 rmusser01 added enhancement New feature or request Feature-Addition labels Nov 17, 2024
@rmusser01 rmusser01 pinned this issue Nov 17, 2024