AI Observability & Evaluation
Test your prompts, agents, and RAG pipelines. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
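As an illustration of the declarative configs mentioned above, a minimal promptfoo setup might compare two providers on the same prompt and assert on the output. This is a sketch: the provider IDs and model names are assumptions, so check the promptfoo docs for the current identifiers.

```yaml
# promptfooconfig.yaml -- illustrative example, not taken from the repository itself
prompts:
  - "Answer concisely: {{question}}"
providers:
  - openai:gpt-4o-mini                             # assumed provider ID / model name
  - anthropic:messages:claude-3-5-sonnet-20240620  # assumed provider ID / model name
tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
```

Running `npx promptfoo@latest eval` executes the test matrix across both providers; the same config file can be checked in and run from CI.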
AI dataframe to enrich, transform, and analyze data from cloud storage for ML training and LLM apps
Python SDK for running evaluations on LLM-generated responses
🐢 Open-Source Evaluation & Testing for ML models & LLMs
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Realign is an evaluation and experimentation framework for AI applications.
Community Plugin for Genkit to use Promptfoo
An open source library for asynchronous querying of LLM endpoints
The prompt engineering, prompt management, and prompt evaluation tool for Python
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Sample implementation demonstrating how to use Firebase Genkit with Promptfoo
Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models
The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
🎯 A free LLM evaluation toolkit that helps you assess factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root cause analysis on failure cases, and gives insights on how to resolve them.
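A minimal sketch of running a few of UpTrain's preconfigured checks over one RAG response, based on the EvalLLM interface shown in UpTrain's documentation; exact check names and constructor arguments may differ between versions.

```python
from uptrain import EvalLLM, Evals

# One record to grade: the question, the retrieved context, and the model's response.
data = [{
    "question": "What is the boiling point of water at sea level?",
    "context": "At standard atmospheric pressure, water boils at 100 degrees Celsius.",
    "response": "Water boils at 100 degrees Celsius at sea level.",
}]

# EvalLLM uses an LLM (here OpenAI) as the judge for each preconfigured check.
eval_llm = EvalLLM(openai_api_key="sk-...")  # placeholder key

results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_RELEVANCE],
)

# Each result carries a grade per check, which feeds the root cause analysis
# UpTrain performs on failing cases.
print(results)
```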
Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across diverse datasets.
Generate ideal question-answer pairs for testing RAG
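Such a generator can be approximated in a few lines: for each document chunk, ask an LLM for one question that is answerable only from that chunk, plus its reference answer. The sketch below uses the OpenAI Python SDK; the prompt, model name, and output parsing are illustrative assumptions, not the repository's actual method.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_qa_pairs(chunks: list[str], model: str = "gpt-4o-mini") -> list[dict]:
    """Produce one question-answer pair per document chunk (illustrative sketch)."""
    pairs = []
    for chunk in chunks:
        resp = client.chat.completions.create(
            model=model,  # assumed model name
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Write one question answerable only from the given passage, "
                        "then its answer. Format:\nQ: ...\nA: ..."
                    ),
                },
                {"role": "user", "content": chunk},
            ],
        )
        text = resp.choices[0].message.content
        question, _, answer = text.partition("\nA:")
        pairs.append({
            "question": question.removeprefix("Q:").strip(),
            "answer": answer.strip(),
            "source_chunk": chunk,
        })
    return pairs
```

The resulting pairs can then be fed to any of the evaluation tools above as a ground-truth test set for a RAG pipeline.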
Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities
The prompt engineering, prompt management, and prompt evaluation tool for C# and .NET