To configure a custom Python evaluator, add a scorer of type py-script to the scorers section of the configuration. Set the path key to the path of the Python script.

"scorers": [
  {
    "type": "py-script",
    "path": "score.py",
    "name": "my-custom-scorer"
  }
]

In the script, you need to define an evaluate function with the following signature:

  • Arguments
    • output: dict with key value holding the output value (string) and key metadata holding metadata (dict); see output object
    • inputs: dict of key-value pairs from the dataset sample
  • Returns
    • A result dict, or a list of result dicts: each result has score (number between 0 and 1), message (optional, string) and name (optional, string)
score.py
def evaluate(output, inputs):
    model_response = output["value"]
    metadata = output["metadata"]
    # ... score the model response
    return {
        "score": 1,
        "message": "Optional reasoning for this score"
    }
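For instance, a minimal exact-match scorer can compare the model response against a reference answer from the dataset sample. The expected key below is an assumption about the dataset; use whatever key your samples actually define.

def evaluate(output, inputs):
    model_response = output["value"]
    # Assumes each dataset sample has an "expected" key (adjust to your data)
    expected = inputs.get("expected", "")
    matched = model_response.strip() == expected.strip()
    return {
        "score": 1 if matched else 0,
        "message": "" if matched else f"Expected {expected!r}"
    }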

Multiple scores

The Python script can also return a list of scores. Use the name key to distinguish between them.

score.py
def evaluate(output, inputs):
    model_response = output["value"]
    metadata = output["metadata"]
    # ... score the model response
    return [
        {"score": 1, "name": "syntax-score"},
        {"score": 0, "name": "semantic-score", "message": "failure reason"},
    ]

Example

The HumanEval example uses this scorer.
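As a rough illustration of what such a scorer can look like, the sketch below assembles the generated code with the sample's unit tests and runs them in a subprocess. The prompt and test input keys are assumptions about the dataset layout, not part of the scorer API.

import subprocess
import tempfile

def evaluate(output, inputs):
    # Assumed dataset keys: "prompt" (function stub) and "test" (unit tests)
    program = inputs["prompt"] + output["value"] + "\n" + inputs["test"]
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        # A non-zero exit code means the assembled tests failed
        result = subprocess.run(["python", path], capture_output=True, timeout=15)
        passed = result.returncode == 0
        message = "" if passed else result.stderr.decode()[:200]
    except subprocess.TimeoutExpired:
        passed, message = False, "Execution timed out"
    return {"score": 1 if passed else 0, "message": message}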

Python Path

The Python script is executed on your machine using the python binary available on your PATH. This determines the Python version that is used.

The Python script can use any Python modules (built-in or third-party). If you are using third-party libraries or want a specific version of Python, override the Python path when running the CLI:

npx empiricalrun --python-path PATH_TO_PYTHON_BINARY
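For example, to run scorers with a project virtual environment's interpreter (the path below is illustrative):

npx empiricalrun --python-path .venv/bin/python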

Limitations

  • The Python script must complete execution within 20 seconds (a guard pattern for slow steps is sketched below)
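If a scoring step can be slow (for example, a network call), you can guard it inside the script so evaluation stays within the limit. A minimal Unix-only sketch, assuming evaluate runs on the process's main thread; expensive_check is a hypothetical placeholder:

import signal

def expensive_check(response):
    # Placeholder for slow work (e.g., calling an external service)
    return 1

def _raise_timeout(signum, frame):
    raise TimeoutError("scoring step timed out")

def evaluate(output, inputs):
    signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(10)  # abort well before the 20-second limit
    try:
        score = expensive_check(output["value"])
    except TimeoutError:
        return {"score": 0, "message": "Scoring timed out"}
    finally:
        signal.alarm(0)  # cancel the pending alarm
    return {"score": score}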