Examples
Examples for common scenarios
Basic
Uses an entity extraction use case to check that outputs are valid JSON.
Tool calling
Uses an LLM to grade the output responses and ensure that they do not contain “as an AI language model”.
Spider (TypeScript)
Runs a subset of the Spider dataset to demonstrate text-to-SQL and the relevant scorer functions in TypeScript.
Spider (Python)
Runs a subset of the Spider dataset to demonstrate text-to-SQL and the relevant scorer functions in Python.
RAG
Tests a retrieval-augmented generation (RAG) application built with LlamaIndex, scored on metrics from Ragas.
OpenAI Assistants
Runs Empirical on an OpenAI Assistant.
HumanEval
Uses a custom Python scoring function to run HumanEval, a popular benchmark dataset for code generation tasks.
Chat bot with LLM scorer
Uses an LLM to grade the output responses and ensure that they do not contain “as an AI language model”.
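
To give a rough sense of what a scoring function like the ones in these examples might look like, here is a minimal sketch of a valid-JSON check similar in spirit to the Basic example. The function name `score_valid_json` and the shape of the returned dictionary (`name`, `score`, `message`) are assumptions made for illustration, not Empirical's actual scorer interface; see the individual examples for the real signatures.

```python
import json


def score_valid_json(output_text):
    """Hypothetical scorer: 1.0 if the model output parses as JSON, else 0.0.

    The return shape is an assumption for illustration only.
    """
    try:
        json.loads(output_text)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: the text is not valid JSON.
        # TypeError: the output was not a string (for example, None).
        return {"name": "is-json", "score": 0.0, "message": "output is not valid JSON"}
    return {"name": "is-json", "score": 1.0, "message": ""}
```

A deterministic check like this is cheap to run on every output, which is why it is a common first scorer before adding LLM-graded ones.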