Dev-tools have a new user: AI
January 10, 2025

AI coding assistants are changing how we set up developer tools. The best tools will leverage this to help developers get running in minutes, not hours. I explored this topic over the holidays by shipping simboba, a lightweight evals framework for LLM agents.
The traditional way of setting up a tool is to have the developer pick between a few options on the command line, then do the work of wiring the tool into their codebase themselves. AI coding assistants have completely changed how that second step happens: your user is going to hand it to Cursor, Claude Code, or whatever AI they prefer. Instead of fighting this, your tool should own it and make the job easier for them.
I explored this problem for something we needed inside our own product: evals. We’ve found it most helpful to keep evals as simple Python scripts. Every time we wrote evals, though, I found myself writing a little boilerplate and spending a lot of time on the annotated dataset itself. I wanted a library that helped with both of those parts, and that let me leverage AI to do it (I’m a heavy Claude Code user).
I set out to build Simboba with these principles in mind:
1. Gets you running a customised eval for your product in under 5 minutes
2. Lives in your codebase so that you have version history via Git
3. Handles multi-turn conversations
4. Allows you to run probabilistic and deterministic evals
To achieve (1), I needed to make sure that tools like Claude Code had enough context about both simboba and the user’s code base while doing the setup. After a few iterations, I found the following patterns work really well.
Clear data model
Define a clear data model that the AI can use when setting up your tool. For the evals product, the two most important data models were for datasets and for running the eval scripts. I used Pydantic to define these models.
```python
from typing import Optional

from pydantic import BaseModel, Field


class MessageInput(BaseModel):
    """A single message in a conversation."""

    role: str
    message: str
    attachments: list[dict] = Field(default_factory=list)
    metadata: Optional[dict] = None


class CaseCreate(BaseModel):
    """Request model for creating a case."""

    inputs: list[MessageInput]
    expected_outcome: str
    expected_metadata: Optional[dict] = None


class AgentResponse(BaseModel):
    output: str
    metadata: Optional[dict] = None
```
When Claude Code sees these models, it understands exactly what shape the data needs to be. It can create datasets that match the schema without guessing.
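For example, here is the shape of a single annotated case built from the models above. This is a sketch: it assumes `CaseCreate` is importable from simboba the same way `MessageInput` is, which the README example further down only shows for `MessageInput`.

```python
# A sketch of one annotated case, using the models defined above.
# Assumption: CaseCreate is exported from simboba alongside MessageInput.
from simboba import CaseCreate, MessageInput

case = CaseCreate(
    inputs=[
        MessageInput(role="user", message="Where's my order?"),
        MessageInput(role="assistant", message="Could you share the order number?"),
        MessageInput(role="user", message="It's #123."),
    ],
    expected_outcome="Should look up order #123 and report its shipping status",
    expected_metadata={"tool_calls": ["get_orders"]},
)
```

This is the kind of case an AI assistant can generate directly from the schema, whether from real conversations in your product or from scratch.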
Your docs are executable
Provide a clear set of instructions for AI models to read when accessing the tool. This sounds obvious, but without it Claude Code would jump straight to generating synthetic test cases that don't reflect how the product actually works.
Anything that is wrong, misleading or confusing compounds because the AI assistant reads this and assumes it's true. I cannot tell you how often I've been frustrated by a package because Claude Code is following documentation that is out of date.
I added a section to simboba's README specifically for AI coding assistants:
```
Instructions for AI coding assistants

If you are helping a user set up Boba, please use the following instructions to guide you:

[Set up instructions go here]
```
A simple example showing the end-to-end flow really helps AI coding assistants do a good job. Here's what I added to simboba's README for running an eval script:
```python
from simboba import Boba, MessageInput

boba = Boba()


def agent(inputs: list[MessageInput]) -> str:
    return "Hi there! How can I help?"


if __name__ == "__main__":
    result = boba.run(agent, dataset="my-first-eval")
    print(f"{result['passed']}/{result['total']} passed")
```
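The stub agent above just returns a string, but the `AgentResponse` model from earlier suggests an agent can also hand back metadata for the metadata-aware eval modes described below. A hedged sketch, assuming `boba.run` accepts an `AgentResponse` return value in place of a plain string and that `AgentResponse` is exported from simboba (neither is shown in the README example):

```python
# Sketch only: assumes boba.run accepts an AgentResponse return value,
# and that AgentResponse is importable from simboba.
from simboba import AgentResponse, Boba, MessageInput

boba = Boba()


def agent(inputs: list[MessageInput]) -> AgentResponse:
    # A real agent would call your LLM here and record the tool calls it made.
    return AgentResponse(
        output="Your order #123 is shipped.",
        metadata={"tool_calls": ["get_orders"]},
    )


if __name__ == "__main__":
    result = boba.run(agent, dataset="order-status-evals")
    print(f"{result['passed']}/{result['total']} passed")
```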
Your own AI instructions will also be read. Every AI coding tool has an instructions file: CLAUDE.md for Claude Code, Cursor rules for Cursor, and so on. These are meant to help contributors leverage AI when working on the tool itself, but I found they also serve as useful context when a developer's AI assistant is figuring out how to set up the package.
Guide the AI's choices
If your tool has multiple ways to do something, tell the AI when to use each one. For example, simboba supports three eval modes: LLM-as-judge on output alone, LLM-as-judge on output and metadata together, or a hybrid approach (the LLM evaluates the output while metadata is checked deterministically). I included examples of all three in the README so that AI assistants could see when each applies.
```python
# Mode 1: No metadata - LLM judges output only
boba.eval(input="Hello", output="Hi!", expected="Should greet")

# Mode 2: LLM evaluates output + metadata together
boba.eval(
    input="What's my order status?",
    output="Your order #123 is shipped.",
    expected="Should look up order status",
    expected_metadata={"tool_calls": ["get_orders"]},
    actual_metadata={"tool_calls": ["get_orders"]},
)


# Mode 3: LLM evaluates + deterministic check (both must pass)
def check_tool_calls(expected, actual):
    if not expected or not actual:
        return True
    return set(expected.get("tool_calls", [])) == set(actual.get("tool_calls", []))


boba.eval(
    input="What's my order status?",
    output="Your order #123 is shipped.",
    expected="Should look up order status",
    expected_metadata={"tool_calls": ["get_orders"]},
    actual_metadata={"tool_calls": ["get_orders"]},
    metadata_checker=check_tool_calls,
)
```
The new bar for developer experience
Here are the stakes: if your tool is hard for AI to set up, developers won't struggle through it. They'll ask Claude Code to build a custom solution instead. Every minute spent fighting your setup is time a developer counts as lost, because Claude Code could have built something bespoke in the meantime.
The dev tools that win the next few years will be the ones designed for AI to read, understand, and use. The good news is this isn't hard. It just requires thinking about a new user: the AI assistant sitting between your tool and the developer.