Dev-tools have a new user: AI
January 10, 2025

AI coding assistants are changing how we set up developer tools. The best tools will leverage this to help developers get running in minutes, not hours. I explored this topic over the holidays by shipping simboba, a lightweight evals framework for LLM agents.
The traditional way of setting up a tool is to have the developer pick between a few options on the command line, then do the work of wiring the tool into their codebase themselves. AI coding assistants have completely changed how that second step happens: your user is going to hand it to Cursor, Claude Code, or whatever AI they prefer. Instead of fighting this, your tool should own it and make the job easier for them.
I explored this problem for something we needed inside our own product: evals. We’ve found it most helpful to keep evals as simple Python scripts. Every time we wrote evals, though, I found myself writing a little boilerplate and spending a lot of time on the annotated dataset itself. I wanted a library that helped with both of those parts, and that let me leverage AI to do it (I’m a heavy Claude Code user).
I set out to build Simboba with these principles in mind:
1. Gets you running a customised eval for your product in under 5 minutes
2. Lives in your codebase so that you have version history via Git
3. Handles multi-turn conversations
4. Allows you to run probabilistic and deterministic evals
To achieve (1), I needed to make sure that tools like Claude Code had enough context about both simboba and the user’s code base while doing the setup. After a few iterations, I found the following patterns work really well.
Clear data model
Define a clear data model that the AI can use when setting up your tool. For the evals product, the two most important data models were for datasets and for running the eval scripts. I used Pydantic to define these models.
```python
from typing import Optional

from pydantic import BaseModel, Field


class MessageInput(BaseModel):
    """A single message in a conversation."""

    role: str
    message: str
    attachments: list[dict] = Field(default_factory=list)
    metadata: Optional[dict] = None


class CaseCreate(BaseModel):
    """Request model for creating a case."""

    inputs: list[MessageInput]
    expected_outcome: str
    expected_metadata: Optional[dict] = None


class AgentResponse(BaseModel):
    output: str
    metadata: Optional[dict] = None
```
When Claude Code sees these models, it understands exactly what shape the data needs to be. It can create datasets that match the schema without guessing.
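For example, here is the shape of a single annotated case built from the models above. This is a sketch: it assumes `CaseCreate` is importable from simboba the same way `MessageInput` is, which the README example further down only shows for `MessageInput`.

```python
# A sketch of one annotated case, using the models defined above.
# Assumption: CaseCreate is exported from simboba alongside MessageInput.
from simboba import CaseCreate, MessageInput

case = CaseCreate(
    inputs=[
        MessageInput(role="user", message="Where's my order?"),
        MessageInput(role="assistant", message="Could you share the order number?"),
        MessageInput(role="user", message="It's #123."),
    ],
    expected_outcome="Should look up order #123 and report its shipping status",
    expected_metadata={"tool_calls": ["get_orders"]},
)
```

This is the kind of case an AI assistant can generate directly from the schema, whether from real conversations in your product or from scratch.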
Your docs are executable
Provide a clear set of instructions for AI models to read when accessing the tool. This sounds obvious, but without it Claude Code would jump straight to generating synthetic test cases that don't reflect how the product actually works.
Anything that is wrong, misleading or confusing compounds because the AI assistant reads this and assumes it's true. I cannot tell you how often I've been frustrated by a package because Claude Code is following documentation that is out of date.
I added a section to simboba's README specifically for AI coding assistants:
```
Instructions for AI coding assistants

If you are helping a user set up Boba, please use the following instructions to guide you:

[Set up instructions go here]
```
A simple example showing the end-to-end flow really helps AI coding assistants do a good job. Here's what I added to simboba's README for running an eval script:
```python
from simboba import Boba, MessageInput

boba = Boba()


def agent(inputs: list[MessageInput]) -> str:
    return "Hi there! How can I help?"


if __name__ == "__main__":
    result = boba.run(agent, dataset="my-first-eval")
    print(f"{result['passed']}/{result['total']} passed")
```
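The stub agent above just returns a string, but the `AgentResponse` model from earlier suggests an agent can also hand back metadata for the metadata-aware eval modes described below. A hedged sketch, assuming `boba.run` accepts an `AgentResponse` return value in place of a plain string and that `AgentResponse` is exported from simboba (neither is shown in the README example):

```python
# Sketch only: assumes boba.run accepts an AgentResponse return value,
# and that AgentResponse is importable from simboba.
from simboba import AgentResponse, Boba, MessageInput

boba = Boba()


def agent(inputs: list[MessageInput]) -> AgentResponse:
    # A real agent would call your LLM here and record the tool calls it made.
    return AgentResponse(
        output="Your order #123 is shipped.",
        metadata={"tool_calls": ["get_orders"]},
    )


if __name__ == "__main__":
    result = boba.run(agent, dataset="order-status-evals")
    print(f"{result['passed']}/{result['total']} passed")
```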
Your own AI instructions will also be read. Every AI coding tool has an instructions file: CLAUDE.md for Claude Code, Cursor rules for Cursor, and so on. These are meant to help contributors leverage AI when working on the tool itself, but I found they also serve as useful context when a developer's AI assistant is figuring out how to set up the package.
Guide the AI's choices
If your tool has multiple ways to do something, tell the AI when to use each one. For example, simboba supports three eval modes: LLM-as-judge on output alone, LLM-as-judge on output and metadata together, or a hybrid approach (the LLM evaluates the output while metadata is checked deterministically). I included examples of all three in the README so that AI assistants could see when each applies.
```python
# Mode 1: No metadata - LLM judges output only
boba.eval(input="Hello", output="Hi!", expected="Should greet")

# Mode 2: LLM evaluates output + metadata together
boba.eval(
    input="What's my order status?",
    output="Your order #123 is shipped.",
    expected="Should look up order status",
    expected_metadata={"tool_calls": ["get_orders"]},
    actual_metadata={"tool_calls": ["get_orders"]},
)


# Mode 3: LLM evaluates + deterministic check (both must pass)
def check_tool_calls(expected, actual):
    if not expected or not actual:
        return True
    return set(expected.get("tool_calls", [])) == set(actual.get("tool_calls", []))


boba.eval(
    input="What's my order status?",
    output="Your order #123 is shipped.",
    expected="Should look up order status",
    expected_metadata={"tool_calls": ["get_orders"]},
    actual_metadata={"tool_calls": ["get_orders"]},
    metadata_checker=check_tool_calls,
)
```
The new bar for developer experience
Here are the stakes: if your tool is hard for AI to set up, developers won't struggle through it. They'll ask Claude Code to build a custom solution instead. Every minute spent fighting your setup is time a developer counts as lost, because Claude Code could have built something bespoke in the meantime.
The dev tools that win the next few years will be the ones designed for AI to read, understand, and use. The good news is this isn't hard. It just requires thinking about a new user: the AI assistant sitting between your tool and the developer.