
KG AI Benchmark

A React + TypeScript playground for benchmarking local LLMs hosted in LM Studio (or any OpenAI-compatible runtime). The project now ships with a full dashboard, profile management, a diagnostics workflow, and an embedded dataset of 100 GATE previous-year questions (PYQs), so you can launch end-to-end evaluations without additional scaffolding.

Features

  • ⚡️ Vite-powered React 19 + TypeScript setup with strict linting
  • 🧭 Tabbed dashboard (Dashboard · Profiles · Runs · Run Detail) powered by a shared benchmark context
  • 🧪 Level 1/Level 2 diagnostics against LM Studio with JSON-mode fallback and log history
  • 📋 Question selector with filter/search + evaluation engine for MCQ/MSQ/NAT/TRUE_FALSE question types (a scoring sketch follows this list)
  • 📊 Recharts-based analytics (accuracy vs latency trends, pass/fail vs latency, KPI tiles)
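
The scoring rules for these question types can be pictured with a short sketch. This is a minimal illustration, not the project's actual engine; the `Question` shape, `natRange` field, and `evaluate` helper are all assumptions:

```ts
type QuestionType = "MCQ" | "MSQ" | "NAT" | "TRUE_FALSE";

interface Question {
  type: QuestionType;
  correct: string[];           // option keys for MCQ/MSQ, or "TRUE"/"FALSE"
  natRange?: [number, number]; // inclusive accepted range for NAT answers
}

// Returns true when the model's answer matches the key for the question type.
function evaluate(question: Question, answer: string[]): boolean {
  switch (question.type) {
    case "MCQ":
    case "TRUE_FALSE":
      // Exactly one selected option, and it must match the single correct key.
      return answer.length === 1 && answer[0] === question.correct[0];
    case "MSQ": {
      // Set equality: every correct option selected and nothing else.
      const expected = new Set(question.correct);
      return answer.length === expected.size && answer.every((a) => expected.has(a));
    }
    case "NAT": {
      // Numeric answer must fall inside the accepted range.
      const value = Number(answer[0]);
      if (!question.natRange || Number.isNaN(value)) return false;
      const [lo, hi] = question.natRange;
      return value >= lo && value <= hi;
    }
  }
}
```

MSQ is the subtle case: GATE MSQ questions award credit only when the selected set exactly equals the key, which is why the sketch uses set equality rather than per-option matching.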

Getting started

npm install
npm run dev

The development server runs at http://localhost:5173.

Available scripts

| Script            | Description                                      |
| ----------------- | ------------------------------------------------ |
| `npm run dev`     | Start the Vite development server                |
| `npm run lint`    | Run ESLint with the configured TypeScript rules  |
| `npm run build`   | Type-check and build the production bundle       |
| `npm run preview` | Preview the production build locally             |

Usage workflow

  1. Create a profile – open the Profiles tab, click “New profile”, and supply the LM Studio base URL (e.g., http://127.0.0.1:1234), model identifier, API key (if required), and prompt settings (an illustrative profile shape follows this list).
  2. Run diagnostics – execute Level 1 (handshake), then Level 2 (readiness). The UI records logs, flags JSON-mode fallbacks, and blocks benchmarks until readiness passes (see the diagnostics sketch after this list).
  3. Launch a benchmark – switch to the Runs tab, click “New run”, filter and select questions from the embedded PYQ dataset, and start the run. Progress streams live, and results persist to Supabase so you can resume from any device.
  4. Analyze results – open any run to inspect accuracy, latency, token usage, and per-question responses/explanations. Dashboard trend lines summarize the most recent completions.
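
For step 1, a profile might look like the sketch below. The `BenchmarkProfile` shape and its field names are illustrative assumptions, not the project's actual schema:

```ts
// Hypothetical profile record; field names are illustrative only.
interface BenchmarkProfile {
  name: string;
  baseUrl: string;       // e.g. http://127.0.0.1:1234 for a local LM Studio server
  model: string;         // model identifier as listed by the runtime
  apiKey?: string;       // optional; a local LM Studio server typically needs none
  systemPrompt?: string; // prompt settings captured alongside connection details
  temperature?: number;
}

const example: BenchmarkProfile = {
  name: "local-qwen",
  baseUrl: "http://127.0.0.1:1234",
  model: "qwen2.5-7b-instruct", // any model identifier your runtime reports
  temperature: 0,               // deterministic answers suit benchmarking
};
```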
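
For step 2, the two diagnostic levels map naturally onto the OpenAI-compatible endpoints LM Studio exposes (GET /v1/models and POST /v1/chat/completions). Below is a minimal sketch, not the project's actual implementation; the `handshake` and `readiness` helpers are assumptions:

```ts
// Level 1: handshake – verify the server is reachable and the model is listed.
async function handshake(baseUrl: string, model: string, apiKey?: string): Promise<boolean> {
  const res = await fetch(`${baseUrl}/v1/models`, {
    headers: apiKey ? { Authorization: `Bearer ${apiKey}` } : undefined,
  });
  if (!res.ok) return false;
  const body = await res.json();
  return body.data?.some((m: { id: string }) => m.id === model) ?? false;
}

// Level 2: readiness – request a trivial completion in JSON mode; if the server
// rejects response_format, retry without it (the "JSON-mode fallback" in the UI).
async function readiness(
  baseUrl: string,
  model: string,
  apiKey?: string,
): Promise<{ ok: boolean; jsonMode: boolean }> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (apiKey) headers.Authorization = `Bearer ${apiKey}`;
  const payload = {
    model,
    messages: [{ role: "user", content: 'Reply with {"ready": true} as JSON.' }],
    response_format: { type: "json_object" },
  };
  let res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers,
    body: JSON.stringify(payload),
  });
  if (res.ok) return { ok: true, jsonMode: true };
  // Fallback: some runtimes reject response_format entirely.
  const { response_format: _omitted, ...plain } = payload;
  res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers,
    body: JSON.stringify(plain),
  });
  return { ok: res.ok, jsonMode: false };
}
```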

Roadmap

  1. Harden Supabase schemas/policies (per-user scoping, migrations) and backfill analytics views.
  2. Add cancellation controls, progress indicators, and screenshot/export helpers.
  3. Extend evaluation to descriptive/FILL_BLANK questions with rubric scoring.
  4. Support dataset import/export to drive custom benchmark suites.

License

MIT License © 2025 Complete Coding with Prashant Sir
