AI Observability & Evaluation
Test your prompts, agents, and RAG pipelines. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
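As an illustration of the declarative configs mentioned above, a minimal promptfoo setup might compare two providers on the same prompt and assert on the output. This is a sketch: the provider IDs and model names are assumptions, so check the promptfoo docs for the current identifiers.

```yaml
# promptfooconfig.yaml -- illustrative example, not taken from the repository itself
prompts:
  - "Answer concisely: {{question}}"
providers:
  - openai:gpt-4o-mini                             # assumed provider ID / model name
  - anthropic:messages:claude-3-5-sonnet-20240620  # assumed provider ID / model name
tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
```

Running `npx promptfoo@latest eval` executes the test matrix across both providers; the same config file can be checked in and run from CI.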
AI dataframe to enrich, transform, and analyze data from cloud storage for ML training and LLM apps
Python SDK for running evaluations on LLM-generated responses
🐢 Open-Source Evaluation & Testing for ML models & LLMs
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Realign is an evaluation and experimentation framework for AI applications.
Community Plugin for Genkit to use Promptfoo
An open source library for asynchronous querying of LLM endpoints
The prompt engineering, prompt management, and prompt evaluation tool for Python
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Sample implementation demonstrating how to use Firebase Genkit with Promptfoo
Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models
The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
🎯 A free LLM evaluation toolkit that helps you assess factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root cause analysis on failure cases, and gives insights on how to resolve them.
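A minimal sketch of running a few of UpTrain's preconfigured checks over one RAG response, based on the EvalLLM interface shown in UpTrain's documentation; exact check names and constructor arguments may differ between versions.

```python
from uptrain import EvalLLM, Evals

# One record to grade: the question, the retrieved context, and the model's response.
data = [{
    "question": "What is the boiling point of water at sea level?",
    "context": "At standard atmospheric pressure, water boils at 100 degrees Celsius.",
    "response": "Water boils at 100 degrees Celsius at sea level.",
}]

# EvalLLM uses an LLM (here OpenAI) as the judge for each preconfigured check.
eval_llm = EvalLLM(openai_api_key="sk-...")  # placeholder key

results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_RELEVANCE],
)

# Each result carries a grade per check, which feeds the root cause analysis
# UpTrain performs on failing cases.
print(results)
```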
Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across diverse datasets.
Generate ideal question-answer pairs for testing RAG
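Such a generator can be approximated in a few lines: for each document chunk, ask an LLM for one question that is answerable only from that chunk, plus its reference answer. The sketch below uses the OpenAI Python SDK; the prompt, model name, and output parsing are illustrative assumptions, not the repository's actual method.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_qa_pairs(chunks: list[str], model: str = "gpt-4o-mini") -> list[dict]:
    """Produce one question-answer pair per document chunk (illustrative sketch)."""
    pairs = []
    for chunk in chunks:
        resp = client.chat.completions.create(
            model=model,  # assumed model name
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Write one question answerable only from the given passage, "
                        "then its answer. Format:\nQ: ...\nA: ..."
                    ),
                },
                {"role": "user", "content": chunk},
            ],
        )
        text = resp.choices[0].message.content
        question, _, answer = text.partition("\nA:")
        pairs.append({
            "question": question.removeprefix("Q:").strip(),
            "answer": answer.strip(),
            "source_chunk": chunk,
        })
    return pairs
```

The resulting pairs can then be fed to any of the evaluation tools above as a ground-truth test set for a RAG pipeline.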
Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities
The prompt engineering, prompt management, and prompt evaluation tool for C# and .NET