Research Explainer · Koc, Verre, Blank & Morgan (2025)

Your IDE should watch your AI's metrics, not just your code's syntax

A conceptual framework for wiring real-time LLM telemetry (traces, evaluations, prompt versions) directly into the code editor through the Model Context Protocol, turning prompt engineering from guesswork into a data-driven feedback loop.

Key Contribution

This paper proposes that AI-first IDEs should become observability dashboards, not just text editors. Using the Model Context Protocol (MCP) as a universal broker, the authors describe three progressively autonomous design patterns for feeding LLM metrics, traces, and evaluation scores back into the developer's workflow. It is a theoretical architecture paper: no benchmarks, no measured improvements, but a blueprint for how prompt engineering could work if we treated prompts with the same rigour as compiled code.

Large language models are non-deterministic, opaque, and stubbornly difficult to debug. When a traditional application breaks, you read the stack trace. When an LLM-powered feature breaks, you stare at a prompt and wonder which of its 400 tokens is making the model hallucinate. The feedback cycle is painfully slow: edit the prompt, run it against a few examples, squint at the outputs, revise, repeat.

Observability platforms for LLMs do exist (Comet's Opik, OpenTelemetry extensions, various logging dashboards), but they live outside the IDE. You write the prompt in VS Code, switch to a browser tab to check your traces, then switch back to edit. Koc and colleagues argue this context-switching is the bottleneck. If the IDE itself could surface token counts, latency, hallucination scores, and evaluation results the moment you run a prompt, you would iterate far faster.

Their proposed fix: wire the IDE directly into a telemetry backend using the Model Context Protocol (MCP), an open standard Anthropic released in late 2024 for connecting AI models to external tools. The paper re-interprets MCP as a "Metrics, Control, Prompt" broker, a single API through which an IDE can log traces, query performance data, version prompts, and (eventually) send control commands to running agents.
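To make the "Metrics, Control, Prompt" reading concrete, here is a rough sketch of what such a broker's tool surface could look like, written with the official MCP Python SDK's FastMCP helper. The tool names and the in-memory store are assumptions for illustration only, not the Opik MCP server's actual API; they simply group the three capability families the paper describes.

```python
# Hypothetical "Metrics, Control, Prompt" broker sketched with the MCP Python SDK.
# Tool names and the in-memory store are illustrative, not Opik's real interface.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("telemetry-broker")

TRACES: list[dict] = []             # stand-in for the telemetry store
PROMPTS: dict[str, list[str]] = {}  # prompt name -> list of versions

@mcp.tool()
def log_trace(prompt_name: str, latency_ms: float, tokens: int, score: float) -> str:
    """Metrics: record one run of a prompt."""
    TRACES.append({"prompt": prompt_name, "latency_ms": latency_ms,
                   "tokens": tokens, "score": score})
    return f"logged trace #{len(TRACES)}"

@mcp.tool()
def recent_metrics(prompt_name: str, last_n: int = 10) -> list[dict]:
    """Metrics: return the last N traces for a prompt, for the IDE to summarise."""
    rows = [t for t in TRACES if t["prompt"] == prompt_name]
    return rows[-last_n:]

@mcp.tool()
def save_prompt(name: str, template: str) -> str:
    """Prompt: store a new version of a named prompt template."""
    PROMPTS.setdefault(name, []).append(template)
    return f"{name} v{len(PROMPTS[name])}"

@mcp.tool()
def run_evaluation(prompt_name: str) -> dict:
    """Control: kick off an evaluation job (stubbed as a mean over logged scores)."""
    scores = [t["score"] for t in TRACES if t["prompt"] == prompt_name]
    return {"prompt": prompt_name,
            "mean_score": sum(scores) / len(scores) if scores else None}

if __name__ == "__main__":
    mcp.run()  # serve the tools over stdio for an MCP client such as an IDE
```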

The core of the paper is a progression of three patterns, each adding more automation to the telemetry feedback loop. They move from a developer reading metrics in real time, to a CI pipeline that auto-tests prompts against historical baselines, to fully autonomous agents that monitor production systems and propose prompt patches without human intervention.

The authors are careful to note that these are not incremental steps on a single ladder. All three can coexist. A team might use Pattern 1 during local development, Pattern 2 in their merge pipeline, and Pattern 3 in production monitoring. The unifying thread is MCP: every pattern reads from and writes to the same telemetry store, so a trace captured during a developer's sandbox test is queryable by a CI job or a monitoring agent later.

1. Local Metrics-in-the-Loop

The developer runs a prompt inside the IDE and immediately sees token counts, latency, evaluation scores, and trace excerpts. They can ask the IDE's LLM assistant questions like "What are common errors in my last 10 runs?" and get answers grounded in stored telemetry. The loop is: write prompt, run, read metrics, revise.
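To make that inner loop concrete, here is a minimal, self-contained sketch with the model call stubbed out. The trace fields, the word-count token proxy, and the error labels are illustrative assumptions, not the paper's implementation.

```python
# Pattern 1 sketch: run a prompt, capture basic telemetry, ask about recent errors.
# The stubbed model call, trace fields, and token proxy are illustrative assumptions.
import time
from collections import Counter

TRACE_LOG: list[dict] = []  # stand-in for the shared telemetry store

def run_prompt(prompt: str, query: str) -> dict:
    """Stubbed LLM call that returns the output plus per-run telemetry."""
    start = time.perf_counter()
    output = f"(model output for: {query})"             # placeholder response
    trace = {
        "query": query,
        "output": output,
        "latency_ms": (time.perf_counter() - start) * 1000,
        "tokens": len((prompt + " " + query).split()),  # crude word-count proxy
        "error": None,                                  # e.g. "hallucination", "tool_call_failed"
    }
    TRACE_LOG.append(trace)
    return trace

def common_errors(last_n: int = 10) -> Counter:
    """The kind of answer the IDE assistant gives when asked about recent failures."""
    return Counter(t["error"] for t in TRACE_LOG[-last_n:] if t["error"])

# The inner loop: write prompt, run, read metrics, revise.
trace = run_prompt("You are a support triage assistant.", "Where is my order?")
print(f"{trace['latency_ms']:.2f} ms, ~{trace['tokens']} tokens")
print("common errors in last 10 runs:", common_errors())
```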

2. CI-Integrated Optimization

Prompt quality checks run automatically on every commit, just like unit tests for code. The CI job executes a suite of test queries, logs outputs to the MCP server, and compares evaluation metrics against historical baselines. If a relevance score drops by more than 10%, the build fails. Optionally, an optimizer (such as DSPy's MIPRO or PromptWizard) can run headless to suggest a better prompt and open a pull request.
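A CI gate of this kind is simple to express. The sketch below applies the 10% rule and fails the job with a non-zero exit code; the suite runner and baseline lookup are placeholder stubs that would query the MCP server in practice, and the prompt name and scores are invented for illustration.

```python
# Pattern 2 sketch: fail the build if relevance drops more than 10% below baseline.
# fetch_baseline_relevance() and run_eval_suite() are assumed helpers, not real APIs.
import sys

DROP_THRESHOLD = 0.10  # fail if relevance falls more than 10% below the baseline

def fetch_baseline_relevance(prompt_name: str) -> float:
    # placeholder: in practice, query historical evaluation scores via the MCP server
    return 0.82

def run_eval_suite(prompt_name: str) -> float:
    # placeholder: in practice, run the test queries, log traces, return mean relevance
    return 0.79

def main() -> int:
    baseline = fetch_baseline_relevance("support-triage")
    current = run_eval_suite("support-triage")
    drop = (baseline - current) / baseline
    if drop > DROP_THRESHOLD:
        print(f"FAIL: relevance {current:.3f} is {drop:.1%} below baseline {baseline:.3f}")
        return 1  # non-zero exit fails the CI job
    print(f"OK: relevance {current:.3f} (baseline {baseline:.3f})")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```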

3. Autonomous Monitor Agents

In production, a separate LLM (or script) subscribes to the telemetry stream via MCP, watches for patterns like repeated tool-call failures or rising hallucination scores, diagnoses the root cause, and proposes a prompt patch. The patch feeds back through the CI pipeline for validation before deployment. This is the most speculative pattern: the authors acknowledge it is "forward-looking."
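A minimal sketch of such a monitor follows, with the MCP subscription and the diagnosing LLM stubbed out; the thresholds, trace fields, and helper names are assumptions chosen purely for illustration.

```python
# Pattern 3 sketch: watch the telemetry stream, flag drift, propose a patch.
# poll_recent_traces() and propose_prompt_patch() are stand-ins for the MCP query
# and the diagnosing LLM; thresholds and field names are illustrative.
import time

HALLUCINATION_THRESHOLD = 0.3  # mean hallucination score that triggers a diagnosis
FAILURE_STREAK = 5             # consecutive tool-call failures that trigger a diagnosis

def poll_recent_traces(window: int = 50) -> list[dict]:
    # placeholder: in practice, query or subscribe to the telemetry store via MCP
    return []

def propose_prompt_patch(traces: list[dict]) -> str:
    # placeholder: in practice, an LLM diagnoses the traces and drafts a revised prompt,
    # which is then submitted to the Pattern 2 CI pipeline for validation
    return "revised prompt text"

def monitor_once() -> None:
    traces = poll_recent_traces()
    if not traces:
        return
    hallucination = sum(t.get("hallucination_score", 0.0) for t in traces) / len(traces)
    failures = sum(1 for t in traces[-FAILURE_STREAK:] if t.get("error") == "tool_call_failed")
    if hallucination > HALLUCINATION_THRESHOLD or failures == FAILURE_STREAK:
        patch = propose_prompt_patch(traces)
        print("proposed patch (pending CI validation):", patch)

if __name__ == "__main__":
    while True:
        monitor_once()
        time.sleep(60)  # poll the stream roughly once a minute
```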

The paper uses Comet's open-source Opik MCP server as its reference implementation. The architecture has four components: the AI application (which streams traces to a telemetry store), the store itself (Opik), the MCP server (which exposes a query API over the store), and the MCP client (embedded in the IDE). The IDE's tool-calling LLM uses the MCP client to fetch metrics on demand, then generates human-readable summaries or prompt recommendations.
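On the client side, the official MCP Python SDK is enough to show how an IDE-embedded assistant could fetch metrics on demand. The sketch below assumes the broker from the earlier sketch is saved as telemetry_broker.py and exposes a recent_metrics tool; the real Opik MCP server's launch command and tool names will differ.

```python
# Sketch of the IDE-side MCP client fetching metrics on demand.
# The server command and the "recent_metrics" tool come from the broker sketch
# above; they are assumptions, not the Opik MCP server's documented interface.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def fetch_recent_metrics(prompt_name: str) -> None:
    server = StdioServerParameters(command="python", args=["telemetry_broker.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "recent_metrics", {"prompt_name": prompt_name, "last_n": 10}
            )
            # The IDE's assistant would feed this into its summary or recommendation step.
            print(result.content)

if __name__ == "__main__":
    asyncio.run(fetch_recent_metrics("support-triage"))
```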

Prompt management is a first-class feature. The MCP server stores versioned prompt templates keyed by name, so when a CI optimizer finds a better variant, it can save it centrally and every team member (or agent) can pull the latest version. The authors compare this to how logging frameworks and A/B testing platforms in web development let analytics modules plug in without modifying the core application. MCP plays the same enabling role for LLM development: decouple the collection of telemetry from the optimization logic, and any optimizer (current or future) can hook into the same data stream.
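The registry semantics described here (append-only versions keyed by prompt name, with developers and agents pulling the latest) can be sketched in a few lines. This mirrors the behaviour the paper attributes to the MCP server, not its actual schema, and the prompt names are invented.

```python
# Sketch of a name-keyed, append-only prompt version store with a "latest" pointer.
# Mirrors the described behaviour; not the Opik MCP server's real schema.
from dataclasses import dataclass, field

@dataclass
class PromptRegistry:
    _versions: dict[str, list[str]] = field(default_factory=dict)

    def save(self, name: str, template: str) -> int:
        """Store a new version (e.g. from a CI optimizer) and return its version number."""
        self._versions.setdefault(name, []).append(template)
        return len(self._versions[name])

    def latest(self, name: str) -> str:
        """What a teammate or agent pulls before running the prompt."""
        return self._versions[name][-1]

registry = PromptRegistry()
registry.save("support-triage", "You are a support triage assistant. Answer briefly.")
v2 = registry.save("support-triage", "You are a support triage assistant. Cite the order ID.")
print(f"v{v2} is now live:", registry.latest("support-triage"))
```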

The "Control" dimension is the least developed. Today it mostly means saving prompts or initiating evaluations through the API. The authors sketch more ambitious possibilities (A/B testing prompt versions, halting a runaway agent based on trace analysis) but acknowledge these remain aspirational.

The authors state explicitly that this is a "theoretical and architectural insight paper." There are no benchmarks, no user studies, no measured performance improvements. They do not claim that telemetry-aware IDEs make developers faster or prompts better. They propose that they should, and they describe the plumbing that would make it testable.

This matters because the paper's value is entirely in the design patterns and the conceptual framing. If you are looking for evidence that wiring metrics into an IDE actually improves prompt quality, you will not find it here. The authors are upfront about this gap: they call for future work on user studies (comparing telemetry-integrated IDEs to standard ones), convergence benchmarks (how quickly optimizers reach good prompts with live data versus static datasets), and evaluation metric design (which telemetry signals actually predict real performance problems).

The paper is also a product paper in spirit. The authors work at Comet ML, and Opik is Comet's open-source LLM evaluation platform. The architecture is illustrated through Opik's MCP server throughout. This does not invalidate the ideas, but the reader should note the commercial context.

The core insight is genuinely useful: prompts are code, and code deserves observability. Traditional software engineering spent decades building the toolchain (debuggers, profilers, CI runners, APM dashboards) that makes complex systems manageable. LLM development is still in the "print-statement debugging" phase. This paper sketches a plausible path toward bringing that same infrastructure to prompt engineering.

The comparison to the Language Server Protocol (LSP) is the most telling analogy. LSP standardised how IDEs talk to language tooling, which meant every editor got autocomplete, go-to-definition, and linting for free once a single server was built. If MCP (or something like it) becomes the standard way IDEs talk to LLM telemetry backends, the same kind of ecosystem-wide lift becomes possible. Every optimizer, every evaluation library, every monitoring agent could plug into the same socket.

Whether that future arrives depends on adoption, and adoption depends on whether the tooling actually helps developers ship better AI products. The paper sets up the hypothesis. Proving it is someone else's job.

Bottom Line

Koc et al. make a case that IDEs for AI development should be observability platforms first and text editors second. The Model Context Protocol provides a plausible architectural backbone for this: a single API connecting the editor, CI pipeline, and production monitoring to a shared store of prompt versions, traces, and metrics. The three design patterns (local metrics, CI-integrated testing, autonomous monitors) form a coherent progression. No empirical evidence yet, but the blueprint is clear and the analogy to established software engineering practices is strong.

Reference

Koc, V., Verre, J., Blank, D., & Morgan, A. (2025). Mind the Metrics: Patterns for Telemetry-Aware In-IDE AI Application Development using Model Context Protocol (MCP). arXiv preprint arXiv:2506.11019. https://arxiv.org/abs/2506.11019