AI Extensions: Evals

What are AI Evals?

AI evals serve two purposes at once:

They show users of your AI Extension how to use it and what it can do. Users will see suggested prompts when they @-mention your extension.
They let developers test that the AI Extension works reliably. Evals are like integration tests, but for AI: they let you iterate on the extension's implementation and prompts while ensuring that previously tested scenarios still work.

Structure

An AI eval consists of three parts:

input – a text prompt that you expect from users of your AI Extension. It should include an @-mention of your extension's name (the name from package.json).
mocks – mocked results of tool calls. They are required to give the AI context: for example, if you write an eval for "@todo-list What are my todos?", you need to provide the actual list in the get-todos mock.
expected – an array of expectations, similar to expect statements in unit / integration tests (you’ll find the list of all supported expectations below).

Add evals to the ai.evals array in package.json, then run ray evals from your extension directory to execute them and see the results. Note: you must be authenticated as a member of the Raycast AI Extensions Beta organization.

Example:

```
{
  "input": "@todo-list Mark the posting the announcement as completed",
  "mocks": {
    "get-todos": {
      "todos": [
        {
          "id": "aef13ef3-9c37-463e-9c93-3573325c0231",
          "text": "Post the announcement"
        }
      ]
    },
    "toggle-todo": {
      "success": "true"
    }
  },
  "expected": [
    {
      "callsTool": "get-todos"
    },
    {
      "callsTool": {
        "name": "toggle-todo",
        "arguments": {
          "id": "aef13ef3-9c37-463e-9c93-3573325c0231"
        }
      }
    },
    {
      "meetsCriteria": "Tells that item was successfully marked as completed"
    }
  ]
}
```

You can find more examples in the ai-extensions-beta repository.
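
For orientation, here is a minimal sketch of how an eval sits inside package.json, assuming ai is a top-level key as the ai.evals path suggests. The extension name, mock data, and expectations are illustrative, and all other package.json fields are omitted.

```json
{
  "name": "todo-list",
  "ai": {
    "evals": [
      {
        "input": "@todo-list What are my todos?",
        "mocks": {
          "get-todos": {
            "todos": [
              {
                "id": "aef13ef3-9c37-463e-9c93-3573325c0231",
                "text": "Post the announcement"
              }
            ]
          }
        },
        "expected": [
          { "callsTool": "get-todos" },
          { "includes": "announcement" }
        ]
      }
    ]
  }
}
```

With a file shaped like this, running ray evals from the extension directory should pick up the single eval in the array.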

Expectations

includes – checks that the AI response includes a given substring (case-insensitive). Example: { "includes": "added" }
matches – checks that the AI response matches a regular expression. Example (checks that the response contains a markdown link): { "matches": "\\[([^\\]]+)\\]\\(([^\\s\\)]+)(?:\\s+\"([^\"]+)\")?\\)" }
meetsCriteria – checks that the AI response meets plain-text criteria (validated using AI). Useful when the AI varies its response and it is hard to match using includes or matches. Example: { "meetsCriteria": "Tells that label with this name doesn't exist" }
callsTool – checks that, while handling the request, the AI called a tool included in your AI Extension. There are two forms:
- Short form: checks that the tool with the given name was called. Example: { "callsTool": "get-todos" }
- Long form: also checks the tool arguments: { "callsTool": { "name": "name", "arguments": { "arg1": matcher, "arg2": matcher } } }. Matchers can be complex and combine any of the supported rules:
  - eq (used by default for any value that is not object or array)
  - includes
  - matches
  - and (used by default if array is used)
  - or
  - not
  Example:
```
{
  "callsTool": {
    "name": "create-comment",
    "arguments": {
      "issueId": "ISS-1",
      "body": {
        "includes": "waiting for design"
      }
    }
  }
}
```
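
As a further sketch, the eval below combines several expectation types. It assumes that an array of matchers on a single argument is treated as an implicit and (as the rule list above suggests) and that matches accepts a regular expression string for arguments just as it does for responses. The @issues extension name, the create-comment mock result, and the criteria text are hypothetical.

```json
{
  "input": "@issues Comment on ISS-1 that we are waiting for design",
  "mocks": {
    "create-comment": {
      "success": true
    }
  },
  "expected": [
    {
      "callsTool": {
        "name": "create-comment",
        "arguments": {
          "issueId": { "matches": "ISS-\\d+" },
          "body": [
            { "includes": "waiting" },
            { "includes": "design" }
          ]
        }
      }
    },
    { "meetsCriteria": "Confirms that the comment was added to the issue" },
    { "includes": "ISS-1" }
  ]
}
```

Checking the body with two includes rules rather than one exact string leaves the AI free to phrase the comment in its own words while still requiring both key terms.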