evaluation

Here are 1,240 public repositories matching this topic...

mrgloom / awesome-semantic-segmentation

🤘 awesome-semantic-segmentation

benchmark evaluation deeplearning semantic-segmentation

Updated May 8, 2021

explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀

evaluation llm llmops

Updated Nov 20, 2024
Python

langfuse / langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

Updated Nov 22, 2024
TypeScript

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Nov 22, 2024
TypeScript

open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

benchmark evaluation openai llm chatgpt large-language-model llama2 llama3

Updated Nov 21, 2024
Python

Knetic / govaluate

Arbitrary expression evaluation for golang

go parsing evaluation expression

Updated May 31, 2024
Go

MichaelGrupp / evo

Python package for the evaluation of odometry and SLAM

benchmark robotics tum mapping metrics evaluation ros slam trajectory-analysis odometry trajectory ros2 kitti euroc trajectory-evaluation

Updated Oct 31, 2024
Python

sdiehl / write-you-a-haskell

Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)

compiler functional-programming book lambda-calculus evaluation type-theory type pdf-book type-checking haskel type-system functional-language hindley-milner type-inference intermediate-representation

Updated Jan 11, 2021
Haskell

viebel / klipse

Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.

react javascript ruby python scheme clojure lua clojurescript reactjs common-lisp ocaml brainfuck evaluation prolog codemirror-editor reasonml interactive-snippets code-evaluation klipse-plugin

Updated Oct 1, 2024
HTML

CLUEbenchmark / SuperCLUE

SuperCLUE: A comprehensive benchmark for general-purpose Chinese foundation models

evaluation chinese gpt-4 foundation-models chatgpt

Updated May 23, 2024

zzw922cn / Automatic_Speech_Recognition

End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow

audio deep-learning tensorflow paper end-to-end evaluation cnn lstm speech-recognition rnn automatic-speech-recognition feature-vector data-preprocessing phonemes timit-dataset layer-normalization rnn-encoder-decoder chinese-speech-recognition

Updated Mar 24, 2023
Python

Marker-Inc-Korea / AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

python open-source qa benchmarking ops pipeline analysis optimization evaluation embeddings automl document-parser rag llm retrieval-augmented-generation llm-ops llm-evaluation rag-evaluation

Updated Nov 22, 2024
Python

microsoft / promptbench

A unified evaluation framework for large language models

benchmark evaluation prompt robustness adversarial-attacks large-language-models prompt-engineering chatgpt

Updated Oct 28, 2024
Python

ianarawjo / ChainForge

An open-source visual programming environment for battle-testing prompts to LLMs.

ai evaluation large-language-models prompt-engineering llms llmops

Updated Oct 29, 2024
TypeScript

uptrain-ai / uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

machine-learning monitoring evaluation experimentation jailbreak-detection autoevaluation root-cause-analysis prompt-engineering llmops openai-evals llm-prompting llm-eval llm-test hallucination-detection

Updated Aug 18, 2024
Python

huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

machine-learning evaluation

Updated Sep 17, 2024
Python

Helicone / helicone

🧊 Open source LLM-Observability Platform for Developers. One-line integration for monitoring, metrics, evals, agent tracing, prompt management, playground, etc. Supports OpenAI SDK, Vercel AI SDK, Anthropic SDK, LiteLLM, LlamaIndex, LangChain, and more. 🍓 YC W23