A new benchmark spanning 1.2 million real products reveals that even state-of-the-art models struggle to remember what you actually want. A lightweight memory-augmented agent trained end-to-end beats them all.
[Figure: Shopping Companion benchmark results, adapted from Yu et al. (2026). Even frontier models fall short without long-term preference memory.]
[Figure: Decomposition of agent failure modes across recommendation and budgeting tasks.]
Shopping online with an AI assistant sounds simple — until you realize the agent has no idea what you actually want. You told it last week that you prefer noise-cancelling headphones under $200 from Japanese brands, but the next time you open the chat, it's starting from scratch. Researchers at Alibaba International set out to fix this with Shopping Companion, a memory-augmented LLM agent designed for realistic, long-horizon e-commerce tasks.
The team introduced a new benchmark built on top of 1.2 million real-world products, organized around two critical task categories: personalized product recommendation (where the agent must infer your preferences from prior conversations) and budgeting and bundle deals (where it must reason across multiple items to hit a spending goal). These tasks span multiple turns, requiring agents to maintain and query a preference memory across an entire shopping session — not just within a single exchange.
The results were sobering. Even GPT-5, one of the strongest available models, achieved success rates below 70% on the benchmark. The core problem was architectural: prior systems treated preference identification and shopping assistance as two separate modules. You'd retrieve memories in one step, then hand them to a shopping agent in another — and the two parts weren't optimized together, so errors compounded.
Shopping Companion addresses this with a unified end-to-end framework that jointly optimizes memory retrieval and task execution using a dual-reward reinforcement learning strategy. One reward signal measures how well the agent captures user preferences; the other measures task success. Optimizing both rewards simultaneously teaches the model that getting the memory right and completing the task are inseparable. A lightweight version of the Shopping Companion model, trained on this objective, consistently outperformed stronger baseline models — including the frontier LLMs — precisely because its memory integration was tuned for the task, not bolted on after the fact.
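To make the dual-reward idea concrete, here is a minimal sketch of how two such signals might be combined into one training objective. The function names, the recall-style preference metric, the binary task metric, and the weights are all illustrative assumptions — the paper does not publish this implementation.

```python
def preference_reward(retrieved: set, relevant: set) -> float:
    """Fraction of the user's true preferences the agent retrieved (a recall-style signal)."""
    if not relevant:
        return 1.0
    return len(retrieved & relevant) / len(relevant)

def task_reward(recommended: str, acceptable: set) -> float:
    """Binary task-success signal: 1.0 if the pick was acceptable to the user."""
    return 1.0 if recommended in acceptable else 0.0

def dual_reward(retrieved, relevant, recommended, acceptable,
                w_pref: float = 0.5, w_task: float = 0.5) -> float:
    """Weighted sum of both signals; the RL policy is optimized against this jointly."""
    return (w_pref * preference_reward(retrieved, relevant)
            + w_task * task_reward(recommended, acceptable))

# Example episode: the agent recalled 2 of 3 stored preferences
# and its final recommendation was accepted.
r = dual_reward(
    retrieved={"under_$200", "noise_cancelling"},
    relevant={"under_$200", "noise_cancelling", "japanese_brand"},
    recommended="sony_wh1000xm5",
    acceptable={"sony_wh1000xm5", "audio_technica_m50x"},
)
```

Because a single scalar reward blends both terms, a policy gradient update cannot improve task success while letting memory retrieval degrade unpunished — which is the coupling the unified framework is after.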
The benchmark also surfaces a practical challenge that pure retrieval systems miss: users don't always state preferences explicitly. They imply them through conversation history, past purchases, and the way they reject recommendations. Shopping Companion's training regime is built around inferring these implicit preferences, which is closer to how real shopping sessions unfold.
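A toy illustration of inferring preferences from implicit signals: nudge per-attribute weights up when the user accepts a recommendation and down when they reject one. The tag vocabulary and the additive update rule are assumptions for illustration only, not the paper's training procedure.

```python
from collections import defaultdict

def update_preferences(prefs: dict, product_tags: set, accepted: bool,
                       lr: float = 0.3) -> dict:
    """Shift weight toward tags of accepted products, away from rejected ones."""
    sign = 1.0 if accepted else -1.0
    for tag in product_tags:
        prefs[tag] += sign * lr
    return prefs

prefs = defaultdict(float)
# User rejects an expensive noise-cancelling model...
update_preferences(prefs, {"noise_cancelling", "over_$300"}, accepted=False)
# ...then accepts a cheaper one: the price signal separates from the feature signal.
update_preferences(prefs, {"noise_cancelling", "under_$200"}, accepted=True)
```

Even this crude update shows why rejections are informative: after both interactions, "under_$200" carries positive weight, "over_$300" negative, and "noise_cancelling" nets out to neutral rather than being blamed for the rejection.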
The shift to agentic commerce — where AI agents browse, compare, and purchase on your behalf — makes long-term preference memory a first-class technical problem. An agent that forgets your brand preferences, budget constraints, or past returns is worse than useless: it creates noise, wastes time, and erodes trust. Shopping Companion demonstrates that memory retrieval cannot be an afterthought appended to a general-purpose LLM; it needs to be co-trained with the task it serves.
More broadly, this work is a template for any domain where agents must serve users across multiple sessions: travel booking, healthcare scheduling, financial planning. The dual-reward RL approach is domain-agnostic, and the benchmark construction methodology — grounding evaluation in millions of real-world entities — sets a new bar for how e-commerce AI should be tested. As AI shopping assistants move from novelty to infrastructure, the question isn't whether they'll remember — it's whether they'll remember correctly.
Personalized product recommendation: the agent must infer user preferences from prior conversation turns and recommend relevant products from the 1.2M-item catalog. Tests long-term memory and preference drift across sessions.
Budgeting and bundle deals: the agent must reason across multiple products simultaneously to optimize a bundle within a spending limit. Combines memory retrieval with multi-step arithmetic and constraint satisfaction.
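The budgeting task is at heart a constrained selection problem. The sketch below brute-forces the highest-relevance bundle under a spending limit over a small candidate set; the catalog entries and relevance scores are invented for illustration, and a real agent would first narrow the 1.2M catalog to a handful of candidates before anything like this step.

```python
from itertools import combinations

def best_bundle(products: list, budget: float):
    """Exhaustively search subsets for the highest total relevance within budget.
    Each product is (name, price, relevance_score). Exponential in the number
    of candidates, so only viable after retrieval has pruned the catalog."""
    best, best_score = (), 0.0
    for r in range(1, len(products) + 1):
        for combo in combinations(products, r):
            price = sum(p for _, p, _ in combo)
            score = sum(s for _, _, s in combo)
            if price <= budget and score > best_score:
                best, best_score = combo, score
    return [name for name, _, _ in best], best_score

# Hypothetical shortlist produced by the memory-aware retrieval step.
catalog = [("headphones", 180.0, 0.9), ("case", 25.0, 0.4),
           ("cable", 15.0, 0.2), ("stand", 40.0, 0.5)]
names, score = best_bundle(catalog, budget=220.0)
```

Note how memory feeds directly into the arithmetic: the relevance scores come from retrieved preferences, so a retrieval error silently corrupts the optimization — exactly the compounding failure the end-to-end training is meant to prevent.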
Yu, Z., Xiao, K., Zhao, H., Luo, T., & Zeng, X. (2026). Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks. arXiv. https://arxiv.org/abs/2603.14864