Langtrace 🔍 is an open-source, OpenTelemetry-based end-to-end observability tool for LLM applications, providing real-time tracing, evaluations, and metrics for popular LLMs, LLM frameworks, vector DBs, and more. Integrate using TypeScript or Python. 🚀💻📊
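A minimal sketch of what the Python integration might look like, assuming the `langtrace_python_sdk` package and an API key in the environment; check the Langtrace docs for the current initialization API:

```python
# Minimal tracing setup; assumes the langtrace_python_sdk package and a
# LANGTRACE_API_KEY environment variable.
import os

# The SDK should be imported before any LLM client libraries so their calls
# can be instrumented.
from langtrace_python_sdk import langtrace

# Initialize once at process start; spans are exported via OpenTelemetry.
langtrace.init(api_key=os.environ["LANGTRACE_API_KEY"])
```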
AgentEval is a comprehensive .NET toolkit for AI agent evaluation: tool-usage validation, RAG quality metrics, stochastic evaluation, and model comparison, built first for Microsoft Agent Framework (MAF) and Microsoft.Extensions.AI. What RAGAS, PromptFoo, and DeepEval do for Python, AgentEval does for .NET.
Every Eval Ever is a shared schema and crowdsourced eval database. It defines a standardized metadata format for storing AI evaluation results — from leaderboard scrapes and research papers to local evaluation runs — so that results from different frameworks can be compared, reproduced, and reused.
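As an illustration of what such a standardized record could capture, here is a hedged sketch in Python; every field name below is a hypothetical assumption for illustration, not Every Eval Ever's actual schema:

```python
# Hypothetical eval-result record; field names and values are illustrative
# assumptions, not the project's actual metadata format.
from dataclasses import dataclass


@dataclass
class EvalResult:
    model: str       # e.g. the model identifier being evaluated
    benchmark: str   # e.g. "MMLU"
    metric: str      # e.g. "accuracy"
    score: float     # normalized score where applicable
    source: str      # leaderboard scrape, paper citation, or local run
    framework: str   # evaluation framework that produced the result


# A record like this lets results from different frameworks be compared,
# reproduced, and reused.
result = EvalResult(
    model="example-model-v1",
    benchmark="example-benchmark",
    metric="accuracy",
    score=0.87,
    source="local run",
    framework="example-eval-harness",
)
```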
This workshop teaches systematic approaches to evaluating Generative AI workloads for production use. You'll learn to build evaluation frameworks that go beyond basic metrics, ensuring reliable model quality while optimizing cost and performance.
A framework for building scenario-simulation projects in which both human and LLM-based agents can participate, with a user-friendly web UI to visualize simulations and support for automatic evaluation at the agent-action level. A sketch of the action-level evaluation idea follows.
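All names in this sketch are hypothetical, since the project's actual API isn't shown above; it only illustrates scoring each agent action rather than just the final outcome:

```python
# Hypothetical action-level evaluator; names are illustrative assumptions,
# not the framework's actual API.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentAction:
    agent_id: str  # a human participant or an LLM-based agent
    action: str    # e.g. "move", "speak", "trade"
    payload: dict = field(default_factory=dict)


# An evaluator assigns a score to each individual action.
Evaluator = Callable[[AgentAction], float]


def evaluate_trajectory(actions: list[AgentAction], evaluator: Evaluator) -> float:
    """Average the per-action scores over one simulation run."""
    if not actions:
        return 0.0
    return sum(evaluator(a) for a in actions) / len(actions)
```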
Practice and evaluation for the IELTS listening, speaking, reading, and writing modules, with IELTS band calculation based on speech and text analysis.
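The band calculation itself is well defined: the overall band is the mean of the four module bands, rounded to the nearest half band, with .25 rounding up to .5 and .75 rounding up to the next whole band. A minimal sketch:

```python
import math


def overall_band(listening: float, reading: float, writing: float, speaking: float) -> float:
    """IELTS overall band: the mean of the four module bands, rounded to the
    nearest half band (.25 rounds up to .5, .75 rounds up to the next whole
    band, per the official rounding rule)."""
    mean = (listening + reading + writing + speaking) / 4
    # Round-half-up at 0.5 granularity implements the official rule.
    return math.floor(mean * 2 + 0.5) / 2


# Example: 6.5, 6.5, 5.0, 7.0 -> mean 6.25 -> overall band 6.5
assert overall_band(6.5, 6.5, 5.0, 7.0) == 6.5
```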