We present HippoCamp, a new benchmark designed to evaluate agents' capabilities in multimodal file management.
Unlike existing agent benchmarks that focus on web interaction, tool use, or software automation in generic settings,
HippoCamp evaluates agents in user-centric environments, requiring them to model individual user profiles and search massive personal file collections for context-aware reasoning.
Our benchmark instantiates device-scale file systems from real-world user profiles spanning diverse modalities, comprising 42.4 GB of data across more than 2K files.
Building on these raw files, we construct 581 QA pairs that assess agents' capabilities in search, evidence perception, and multi-step reasoning.
To facilitate fine-grained analysis, we provide 46.1K densely annotated, structured trajectories for step-wise failure diagnosis.
We evaluate a wide range of state-of-the-art multimodal large language models and agentic methods on HippoCamp.
Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.0% accuracy in user profiling, struggling in particular with long-horizon retrieval and cross-modal reasoning within dense personal file systems.
Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks.
Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.