Research · Academic & arXiv


Research sweep · deep · 2025 – 2026

Code Intelligence & Code-Graph Indexing for AI Agents

Tools and emerging approaches for code intelligence and code-graph indexing for AI coding agents from June 2025 through early June 2026, spanning local/embedded indexers (CodeGraph/Caveman-style repo maps, tree-sitter, SQLite and embedded graph stores), enterprise-scale code understanding (SCIP, code knowledge graphs, embeddings+retrieval), LSP-to-MCP bridges such as Serena, and the semantic-vs-syntactic-vs-embedding trade-off.

  • GPT-5.5
  • tech
  • frontier
  • academic
  • financial
  • blogs

Synthesised 2026-06-03

Narrative

The 2025 to early-2026 academic frontier is converging on hybrid repository intelligence: deterministic or static-analysis-backed graph layers for symbol, dependency and navigation tasks; embeddings or sparse retrieval for broad matching; and MCP/LSP bridges that expose those capabilities to agents. The strongest empirical theme is that purely textual grep-and-read workflows are inefficient at scale, while graph-native or graph-plus-embedding systems can cut tokens and tool calls substantially: Codebase-Memory (a1) reports 10x fewer tokens and 2.1x fewer tool calls than a file-exploration agent across 31 repositories. One caveat matters: many of the newest practitioner tools are still benchmark-light and vendor-driven.
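To make the graph-native idea concrete, here is a minimal sketch of a change-impact query over a reverse call graph. The `CALLERS` data and the `impacted_symbols` helper are hypothetical and are not taken from any tool in the sources. The point is only that the agent receives a short list of symbol names instead of reading every file that mentions the changed function.

```python
from collections import deque

# Hypothetical reverse call-graph edges: callee -> direct callers.
# A real indexer would derive these from parsing or a language server.
CALLERS = {
    "parse_config": {"load_settings", "cli_main"},
    "load_settings": {"start_server"},
    "start_server": {"cli_main"},
    "cli_main": set(),
}


def impacted_symbols(changed, callers):
    """Return every symbol that transitively calls `changed`, nearest first."""
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        # Sort so the traversal order is deterministic across runs.
        for caller in sorted(callers.get(current, ())):
            if caller not in seen:
                seen.add(caller)
                order.append(caller)
                queue.append(caller)
    return order


if __name__ == "__main__":
    print(impacted_symbols("parse_config", CALLERS))
```

A breadth-first walk is used so that direct callers appear before indirect ones, which is usually the order an agent wants when deciding what to inspect first.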

Semantic, syntactic and embedding methods are not substitutes for one another. Semantic systems are best for symbol navigation, call chains, type-aware lookup and change impact when the repository is buildable or language-server support is available. Syntactic graphs built from Tree-sitter parses or ASTs are easier to deploy locally and across many languages; they are robust and cheap, but miss deeper type and build semantics. Embedding retrieval is strongest for broad natural-language matching and fuzzy discovery, but weak on precise cross-file dependency reasoning unless paired with reranking or graph constraints.
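The syntactic end of that trade-off can be sketched in a few lines. The example below is an illustration under stated assumptions: it uses Python's standard `ast` module as a stand-in for Tree-sitter and an in-memory SQLite database as the embedded store. The table layout and the names `index_source`, `call_name` and `callers_of` are hypothetical. It also shows the limitation named above: `item.save()` is recorded only as a call to something named `save`, because no type information is available to resolve which class owns it.

```python
import ast
import sqlite3
from typing import Optional

SOURCE = '''
def helper(x):
    return x + 1

class Service:
    def run(self, item):
        item.save()
        return helper(item.value)
'''


def call_name(func: ast.expr) -> Optional[str]:
    """Best-effort callee name; attribute calls lose their receiver type."""
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr  # purely syntactic: the receiver is not resolved
    return None


def index_source(conn, path, source):
    """Store definitions and unresolved call edges for one file."""
    conn.execute("CREATE TABLE IF NOT EXISTS symbols "
                 "(path TEXT, name TEXT, kind TEXT, line INTEGER)")
    conn.execute("CREATE TABLE IF NOT EXISTS calls "
                 "(path TEXT, caller TEXT, callee TEXT, line INTEGER)")
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            conn.execute("INSERT INTO symbols VALUES (?, ?, 'class', ?)",
                         (path, node.name, node.lineno))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            conn.execute("INSERT INTO symbols VALUES (?, ?, 'function', ?)",
                         (path, node.name, node.lineno))
            # Calls inside nested functions are also attributed to the outer one.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    callee = call_name(inner.func)
                    if callee is not None:
                        conn.execute("INSERT INTO calls VALUES (?, ?, ?, ?)",
                                     (path, node.name, callee, inner.lineno))
    conn.commit()


def callers_of(conn, callee):
    """Names of functions containing a call to `callee`."""
    rows = conn.execute(
        "SELECT DISTINCT caller FROM calls WHERE callee = ? ORDER BY caller",
        (callee,))
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    index_source(conn, "svc.py", SOURCE)
    print(callers_of(conn, "helper"))
    print(callers_of(conn, "save"))
```

A semantic indexer backed by a language server could attach `save` to a concrete class; this sketch cannot, which is the gap between the syntactic and semantic tiers.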

The evidence quality is uneven. The best-measured local-indexing evidence in this lane is Codebase-Memory's reported token and tool-call savings (a1), alongside the benchmark discipline modelled by METR's evaluations (a22, a23, a24). LSP-to-MCP tooling is moving quickly, but much of the evidence is still documentation, GitHub repositories or blog posts rather than peer-reviewed evaluation. For large repositories, the most resilient pattern appears to combine symbols, dependencies, commit history and embeddings rather than flattening code into text chunks.
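One simple way to combine several signals is reciprocal rank fusion (RRF), sketched below. This is a swapped-in illustration, not a method attributed to any of the listed papers. The four signal names, the file paths and the constant `k = 60` are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranked in rankings.values():
        for rank, item in enumerate(ranked, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical per-signal rankings for the query "where are auth tokens refreshed?"
RANKINGS = {
    "symbols": ["auth/session.py", "auth/token.py"],
    "dependencies": ["auth/token.py", "auth/session.py", "api/routes.py"],
    "commit_history": ["auth/token.py", "api/routes.py"],
    "embeddings": ["docs/auth.md", "auth/token.py", "auth/session.py"],
}

if __name__ == "__main__":
    for path, score in reciprocal_rank_fusion(RANKINGS):
        print(f"{score:.4f}  {path}")
```

RRF needs only rank positions, so signals with incomparable score scales (graph distance, BM25, cosine similarity) can be merged without calibration. A file that several signals agree on, here `auth/token.py`, rises above one that only the embedding retriever favours.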


Sources

| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| a1 | Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP | arXiv | 2026 | Persistent Tree-sitter knowledge graph exposed via MCP; parses 66 languages and reports 10x fewer tokens and 2.1x fewer tool calls than a file-exploration agent on 31 repos. |
| a2 | Repository Intelligence Graph: Deterministic Architectural Map for LLM Code Assistants | arXiv | 2026 | Deterministic, evidence-backed architectural map of buildable components, aggregators, runners, tests, external packages, and package managers with explicit dependency and coverage edges. |
| a3 | On the Challenges and Opportunities of Learned Sparse Retrieval for Code | arXiv | 2026 | Introduces SPLADE-Code and argues that learned sparse retrieval can be competitive for code; reports sub-millisecond retrieval on 1M passages with little effectiveness loss. |
| a4 | SemanticForge: Repository-Level Code Generation through Semantic Knowledge Graphs and Constraint Satisfaction | arXiv | 2025 | Combines dual static-dynamic knowledge graphs, neural graph-query generation, SMT-guided beam search, and incremental KG maintenance. |
| a5 | GRACE: Graph-Guided Repository-Aware Code Completion through Hierarchical Code Fusion | arXiv | 2025 | Builds a multi-level code graph unifying files, ASTs, call graphs, class hierarchies, and data-flow graphs; hybrid retriever plus graph attention reranker. |
| a6 | RepoScope: Leveraging Call Chain-Aware Multi-View Context for Repository-Level Code Generation | arXiv | 2025 | Static-analysis-only repository structural semantic graph with call-chain prediction and structure-preserving serialization. |
| a7 | RANGER -- Repository-Level Agent for Graph-Enhanced Retrieval | arXiv | 2025 | Repository knowledge graph augmented with node text and embeddings; uses Cypher for entity queries and MCTS-guided graph exploration for natural-language queries. |
| a8 | Knowledge Graph Based Repository-Level Code Generation | arXiv | 2025 | Repository graph representation to improve code search and retrieval for repo-level generation; evaluated on EvoCodeBench. |
| a9 | Beyond Function-Level Search: Repository-Aware Dual-Encoder Code Retrieval with Adversarial Verification | arXiv | 2025 | Defines RepoAlign-Bench for change-request-driven repo retrieval and proposes a dual-tower retriever with adversarial reflection. |
| a10 | Repository-level Code Search with Neural Retrieval Methods | arXiv | 2025 | Multi-stage retrieval/reranking for repository-level code search using commit histories plus BM25 and CodeBERT reranking. |
| a11 | RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph | arXiv | 2024 | Plug-in repository-level code graph that boosts SWE-bench and CrossCodeEval performance across multiple methods. |
| a12 | GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model | arXiv | 2024 | Code Context Graph with control/data/control-dependence edges and coarse-to-fine graph retrieval. |
| a13 | How and Why LLMs Use Deprecated APIs in Code | arXiv | 2024 | Empirical study showing LLMs rely on code search services and can be influenced by retrieval behavior when using deprecated APIs. |
| a14 | Improving Text Embeddings with Large Language Models | arXiv | 2024 | LLM-assisted embedding training that improves BEIR/MTEB performance; relevant to embedding-based retrieval quality. |
| a15 | Retrieval Augmented Code Generation and Summarization | arXiv | 2021 | Early retrieval-augmented code generation/summarization framework (REDCODER). |
| a16 | SCIP Code Intelligence Protocol / Sourcegraph SCIP | Sourcegraph documentation / GitHub | 2024 | Language-agnostic code indexing protocol for go-to-definition, references, and implementations. |
| a17 | Serena | Open-source MCP toolkit / GitHub | 2025 | MCP-based coding agent toolkit exposing semantic retrieval and symbol-level editing via LSP integration. |
| a18 | multilspy | GitHub | 2024 | Python LSP client library intended for applications around language servers. |
| a19 | MCP Bridge: A Lightweight, LLM-Agnostic RESTful Proxy for Model Context Protocol Servers | arXiv | 2025 | Proxy layer for MCP servers that can simplify access patterns and decouple clients from servers. |
| a20 | CodeSift | Practitioner tool/site | 2025 | MCP tools for code intelligence claiming reduced-token workflows for agents. |
| a21 | GitHub MCP Server | GitHub repository | 2025 | Official MCP server supporting repository and workflow intelligence across MCP hosts. |
| a22 | HCAST: Human-Calibrated Autonomy Software Tasks | METR / PDF | 2025 | Autonomy benchmark suite for software, ML engineering, cybersecurity, and research tasks. |
| a23 | METR preliminary evaluations of Claude 3.7, GPT-4.5, o3/o4-mini, and related frontier-model reports | METR evaluation reports | 2025 | Comparative agent evaluations on HCAST, SWAA, and RE-Bench, with time-horizon estimates and observations on reward hacking / cheating behaviors. |
| a24 | METR Time-Horizon and Frontier-Risk updates | METR blog / analysis | 2025 | Time-horizon analyses across software and research tasks; updates on frontier model behavior in task suites. |
| a25 | Context Engineering for AI Agents in Open-Source Software | arXiv | 2025 | Empirical study of AGENTS.md / AI config files across 466 OSS projects; shows no standard structure yet and strong variation in provided context. |
