Project Page • Preprint • Multimodal Agents

HippoCamp: Benchmarking Contextual Agents on Personal Computers

Nanyang Technological University, Singapore

We construct three personal computing environments comprising 42.4 GB of data across over 2K heterogeneous files, 581 user-need-driven, evidence-grounded queries, and 46.1K fine-grained annotations. The benchmark evaluates agents' ability to search, perceive, and reason over realistic, multimodal personal file systems, where even the most advanced models achieve merely 48.0% accuracy in user profiling.

  • 3 archetypal profiles
  • 27 file types
  • 581 QA pairs
  • 48.3% best profiling accuracy
Overview of tasks in the HippoCamp benchmark. HippoCamp is a benchmark designed to evaluate agents' ability to search, perceive, and reason over long-term, realistic, large-scale personal file systems.
Abstract

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities in multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments, requiring them to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K fine-grained annotations organized into structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve merely 48.0% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.

Overview

Personalized Multimodal Memory Benchmark

HippoCamp is a benchmark designed to evaluate agents' ability to search, perceive, and reason over long-term, realistic, large-scale personal file systems. Through tasks that demand grounded retrieval, cross-file pattern inference, and long-horizon reasoning, the benchmark rigorously evaluates the personalized multimodal memory of contextual agents.

Realistic Personal Environments

We construct three distinct, file-intensive user profiles that faithfully simulate real-world digital ecosystems. Each profile captures long-term continuity, idiosyncratic folder structures, and cross-file interconnections.

Device-Scale Corpus and Supervision

The benchmark comprises over 2,000 heterogeneous files totaling 42.4 GB, paired with 581 evidence-grounded queries. We further provide 46.1K fine-grained annotations for step-wise failure diagnosis.

Search, Perception, and Reasoning

We design two core task categories — factual retention and profiling — that demand search, multimodal perception, and personalized reasoning. Each query requires grounding answers in evidence distributed across files and modalities.

Verifiable Personalization Remains Hard

Even the strongest systems struggle with entity disambiguation, multimodal grounding, and iterative evidence synthesis. Our results show that verifiable personalization on personal computers remains far from solved.

Construction

Pipeline, Task Distribution, and Capability Labels

Benchmark overview: (a) data collection and human-LLM collaborative annotation pipeline for grounded trajectory construction; (b) distribution of factual retention and profiling tasks; and (c) distribution of annotated agent capability labels over search, perception, and reasoning.

Benchmark overview. Data collection and human-LLM collaborative annotation pipeline, task distribution, and agent capability decomposition.
Leaderboard

HippoCamp Leaderboard

Which contextual agent performs best on realistic personal file systems? We benchmark a wide range of state-of-the-art multimodal models and agentic methods, reporting profiling accuracy, factual-retention accuracy, and a composite score.

Official • 581 QA pairs • 42.4 GB • 46.1K annotations
  • Two benchmark tracks: Profiling and factual retention are reported separately.
  • Three user environments: Scores are averaged across Bei, Adam, and Victoria.
  • Composite score: The leaderboard score is the mean of overall profiling accuracy and overall factual-retention accuracy.
  • Fine-grained diagnosis: Capability-wise performance is further decomposed into search, perception, and reasoning.
Best Overall

ChatGPT Agent Mode

Composite score 55.6, with the strongest overall balance across both benchmark tracks.

Best Profiling

48.3 Acc

ChatGPT Agent Mode leads the profiling track, substantially ahead of all other evaluated methods.

Best Factual Retention

62.8 Acc

Factual retention remains easier than profiling, but only a few systems achieve reliable performance.

Rank | Model Name | Date | Category | Profiling F1 | Profiling Acc | Factual F1 | Factual Retention Acc | Composite Score | Notes
1 | ChatGPT Agent Mode | Mar. 2026 | Agent | 21.0 | 48.3 | 35.3 | 62.8 | 55.6 | Most balanced overall
2 | Terminal Agent (GPT-5.2) | Mar. 2026 | Terminal Agent | 11.1 | 30.0 | 24.6 | 48.2 | 39.1 | Strong factual retention
3 | ReAct (Gemini-2.5-flash) | Mar. 2026 | ReAct | 18.5 | 20.0 | 26.5 | 38.7 | 29.4 | Best ReAct variant
4 | Standard RAG | Mar. 2026 | RAG | 18.4 | 26.7 | 30.0 | 30.2 | 28.5 | Stable retrieval baseline
5 | Terminal Agent (Gemini-2.5-flash) | Mar. 2026 | Terminal Agent | 15.0 | 25.0 | 26.4 | 23.3 | 24.2 | Mid-tier across both tracks
6 | ReAct (Qwen3-VL-8B-Instruct) | Mar. 2026 | ReAct | 11.8 | 13.5 | 43.1 | 28.5 | 21.0 | High factual F1, low accuracy
7 | Self RAG | Mar. 2026 | RAG | 15.2 | 10.0 | 31.9 | 27.5 | 18.8 | Weak profile grounding
8 | Search-R1 | Mar. 2026 | Search Agent | 10.8 | 5.0 | 41.0 | 25.3 | 15.2 | Search-heavy but weak inference
9 | Terminal Agent (Qwen3-VL-8B-Instruct) | Mar. 2026 | Terminal Agent | 11.6 | 16.7 | 17.3 | 11.5 | 14.1 | Lowest overall factual accuracy

Composite Score is computed as the arithmetic mean of the overall Profiling Accuracy and the overall Factual Retention Accuracy, so that both benchmark tracks contribute equally to the final ranking. We additionally report F1 scores to provide complementary information about answer quality and class-sensitive performance, but the leaderboard order is determined by this accuracy-based composite score rather than by F1.
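As a quick sanity check, the composite can be reproduced directly from the two overall accuracies in the table above. The short Python sketch below is purely illustrative; the function name is ours, not part of any released evaluation code.

```python
def composite_score(profiling_acc: float, factual_acc: float) -> float:
    """Arithmetic mean of overall profiling and factual-retention accuracy."""
    return (profiling_acc + factual_acc) / 2

# ChatGPT Agent Mode: 48.3 profiling acc, 62.8 factual-retention acc.
# (48.3 + 62.8) / 2 = 55.55, reported as 55.6 on the leaderboard.
print(f"{composite_score(48.3, 62.8):.2f}")  # 55.55
```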

Rank | Model Name | Date | Bei F1 / Acc | Adam F1 / Acc | Victoria F1 / Acc | Overall F1 | Overall Acc
1 | ChatGPT Agent Mode | Mar. 2026 | 23.8 / 35.0 | 22.7 / 55.0 | 16.7 / 55.0 | 21.0 | 48.3
2 | Terminal Agent (GPT-5.2) | Mar. 2026 | 8.1 / 15.0 | 14.9 / 45.0 | 10.5 / 30.0 | 11.1 | 30.0
3 | Standard RAG | Mar. 2026 | 13.7 / 10.0 | 20.8 / 35.0 | 20.6 / 35.0 | 18.4 | 26.7
4 | Terminal Agent (Gemini-2.5-flash) | Mar. 2026 | 9.0 / 5.0 | 17.0 / 45.0 | 19.0 / 25.0 | 15.0 | 25.0
5 | ReAct (Gemini-2.5-flash) | Mar. 2026 | 13.7 / 10.0 | 21.4 / 25.0 | 20.5 / 25.0 | 18.5 | 20.0
6 | Terminal Agent (Qwen3-VL-8B-Instruct) | Mar. 2026 | 5.4 / 0.0 | 16.3 / 25.0 | 13.2 / 25.0 | 11.6 | 16.7
7 | ReAct (Qwen3-VL-8B-Instruct) | Mar. 2026 | 5.5 / 5.6 | 17.8 / 25.0 | 12.2 / 10.0 | 11.8 | 13.5
8 | Self RAG | Mar. 2026 | 13.8 / 5.0 | 16.0 / 25.0 | 15.9 / 0.0 | 15.2 | 10.0
9 | Search-R1 | Mar. 2026 | 6.6 / 0.0 | 16.5 / 15.0 | 9.4 / 0.0 | 10.8 | 5.0

Each cell is reported as F1 / Accuracy. Ranking is based on overall profiling accuracy.
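The Overall columns appear to be unweighted means over the three profiles. The minimal sketch below (illustrative only) reproduces the overall profiling accuracy of ChatGPT Agent Mode from its per-profile cells in the table above.

```python
from statistics import mean

# Per-profile profiling accuracies for ChatGPT Agent Mode (from the table above).
per_profile_acc = {"Bei": 35.0, "Adam": 55.0, "Victoria": 55.0}

overall_acc = mean(per_profile_acc.values())
print(f"{overall_acc:.1f}")  # 48.3, matching the Overall Acc column
```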

Rank | Model Name | Date | Bei F1 / Acc | Adam F1 / Acc | Victoria F1 / Acc | Overall F1 | Overall Acc
1 | ChatGPT Agent Mode | Mar. 2026 | 20.4 / 31.2 | 56.2 / 90.3 | 29.3 / 67.0 | 35.3 | 62.8
2 | Terminal Agent (GPT-5.2) | Mar. 2026 | 13.0 / 29.8 | 31.6 / 59.2 | 29.0 / 55.7 | 24.6 | 48.2
3 | ReAct (Gemini-2.5-flash) | Mar. 2026 | 26.9 / 24.2 | 35.7 / 55.3 | 17.0 / 36.4 | 26.5 | 38.7
4 | Standard RAG | Mar. 2026 | 29.7 / 24.2 | 39.7 / 42.7 | 20.5 / 23.6 | 30.0 | 30.2
5 | ReAct (Qwen3-VL-8B-Instruct) | Mar. 2026 | 42.4 / 26.1 | 60.4 / 37.9 | 26.5 / 21.7 | 43.1 | 28.5
6 | Self RAG | Mar. 2026 | 33.9 / 26.1 | 41.5 / 38.8 | 20.2 / 17.7 | 31.9 | 27.5
7 | Search-R1 | Mar. 2026 | 38.7 / 23.7 | 58.0 / 28.2 | 26.4 / 24.1 | 41.0 | 25.3
8 | Terminal Agent (Gemini-2.5-flash) | Mar. 2026 | 21.8 / 18.1 | 33.1 / 31.1 | 24.4 / 20.7 | 26.4 | 23.3
9 | Terminal Agent (Qwen3-VL-8B-Instruct) | Mar. 2026 | 14.6 / 10.7 | 21.6 / 13.6 | 15.7 / 10.3 | 17.3 | 11.5

Each cell is reported as F1 / Accuracy. Ranking is based on overall factual-retention accuracy.

Model Name | Date | Profiling Search Acc | Profiling Perception Acc | Profiling Reasoning Acc | Factual Search Acc | Factual Perception Acc | Factual Reasoning Acc
ChatGPT Agent Mode | Mar. 2026 | 56.5 | 28.5 | 55.8 | 49.1 | 55.5 | 33.8
Terminal Agent (GPT-5.2) | Mar. 2026 | 46.3 | 27.3 | 44.1 | 27.2 | 29.7 | 36.4
ReAct (Gemini-2.5-flash) | Mar. 2026 | 34.9 | 20.4 | 33.0 | 19.1 | 23.4 | 14.8
Search-R1 | Mar. 2026 | 24.9 | 15.7 | 25.8 | 3.8 | 7.2 | 3.9
Standard RAG | Mar. 2026 | 26.2 | 13.8 | 25.5 | 26.2 | 28.7 | 19.1
Self RAG | Mar. 2026 | 23.1 | 13.2 | 22.4 | 8.9 | 12.2 | 7.0
Terminal Agent (Gemini-2.5-flash) | Mar. 2026 | 21.0 | 13.7 | 21.1 | 23.3 | 24.7 | 17.2
ReAct (Qwen3-VL-8B-Instruct) | Mar. 2026 | 26.1 | 16.7 | 23.9 | 13.4 | 18.7 | 10.6
Terminal Agent (Qwen3-VL-8B-Instruct) | Mar. 2026 | 11.1 | 5.7 | 11.4 | 18.6 | 24.2 | 12.2

Capability-wise accuracies reveal a consistent gap between search-heavy performance and downstream grounded perception or reasoning.
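For readers who want to replicate this kind of breakdown on their own traces, the sketch below shows one way a capability-wise accuracy could be computed from step-level annotations. The record format (a 'capability' label plus a boolean 'correct' flag) is a simplifying assumption for illustration, not HippoCamp's released annotation schema.

```python
from collections import defaultdict

def capability_accuracy(steps):
    """Group annotated steps by capability label and compute per-label accuracy (%)."""
    totals, correct = defaultdict(int), defaultdict(int)
    for step in steps:
        cap = step["capability"]  # e.g. "search", "perception", "reasoning"
        totals[cap] += 1
        correct[cap] += int(step["correct"])
    return {cap: 100.0 * correct[cap] / totals[cap] for cap in totals}

# Toy example with made-up step annotations:
steps = [
    {"capability": "search", "correct": True},
    {"capability": "search", "correct": False},
    {"capability": "perception", "correct": False},
    {"capability": "reasoning", "correct": True},
]
print(capability_accuracy(steps))  # {'search': 50.0, 'perception': 0.0, 'reasoning': 100.0}
```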

For leaderboard inclusion, please contact me via email: zhe012@e.ntu.edu.sg.

Profiles

Personal Computing Environments

Each profile instantiates a distinct personal-device environment characterized along multiple dimensions, and is paired with representative file-content statistics and factual-retention and profiling QA examples.

Archetypal user profiles in HippoCamp. Profile A represents a student and content-creator context, Profile B a legal-executive environment, and Profile C a senior-financial-analyst setting.
Benchmark Examples

Concrete Examples From HippoCamp

This section presents concrete benchmark examples from HippoCamp. Factual retention examples require retrieving and reasoning over verifiable file-grounded facts, whereas profiling examples synthesize many grounded facts across time into coherent user-level inferences such as preferences, behavioral patterns, scheduling information, retrospective reflections, and workflows.

Benchmark Task 1 Example

Factual Retention

These benchmark examples require producing precise answers that are fully supported by file-grounded evidence under realistic filesystem "haystack" conditions. Evaluation emphasizes high-fidelity factual recall, structured alignment, and evidence-backed answers with minimal hallucination.

Example 1: Cross-Modal Asset Retrieval

The first example asks the agent to locate a previously written vlog script and identify the corresponding cat photos required by the script. This instance tests precise file localization, structured fact extraction from documents, and cross-modal matching under explicit constraints.

Factual retention example. Given a vlog script, the agent extracts the required assets and identifies the matching photos, with ground-truth answer and evidence visualizations.

Example 2: Document-Video Compliance Verification

The second example evaluates rule-based factual verification under multimodal evidence. The query asks whether the "Saver Menu" logo placement in a user advertisement complies with McDonald's clearspace guidelines.

Factual retention example. The agent verifies logo clearspace by extracting the rule from the manual and checking it against video frames, with ground-truth answer and evidence visualizations.
Benchmark Task 2 Example

Profiling

These benchmark examples require inferring user-level attributes from device-resident evidence distributed across files, modalities, and time. Evaluation emphasizes profile consistency, correct temporal anchoring, executability of suggested actions when applicable, and traceability to grounded evidence.

Profiling Subtasks

Profiling queries are decomposed into five complementary subtasks: preferences, behavioral patterns, scheduling information, retrospective reflections, and workflows. These subtasks share the same evidence-grounding requirement but differ in the dominant abstraction operator, ranging from event-level reconstruction to trait-level generalization.

Profiling subtask distribution. Proportions of the five profiling subtasks across the three profiles.
Profiling Atlas

Preferences

Inferring stable photo-editing preferences from an email thread with annotated visual feedback and user confirmation.
Preferences profiling captures stable, trait-like choices that generalize across situations and requires evidence-backed abstraction from episodic traces.
Annotation Hierarchy

Hierarchical Annotation Schema

The pyramid organizes supervision from low-level atomic grounding and action traces to structured trajectories and QA tasks, with increasing abstraction and aggregation toward user-level memory. Profiling queries sit at the top, requiring long-horizon, cross-modal integration of multiple factual-retention facts into a coherent user model.

HippoCamp hierarchical annotation schema. The pyramid organizes supervision from low-level atomic grounding and action traces to structured trajectories and QA tasks, with increasing abstraction and aggregation toward user-level memory.
Analysis

Where Current Agents Fail

We complement the profile-wise results with finer-grained analysis. We decompose performance by agent capability, characterize systematic failure and success modes, and distill design principles for future agents.

Representative failure and success patterns on a cross-modal profiling query. The query requires aligning evidence across heterogeneous personal files and modalities.

Retrieval Mismatch

RAG-based systems often retrieve semantically similar but contextually irrelevant documents, failing to isolate user-relevant personal files.

Grounding Avoidance

Reasoning-centric agents can locate candidate evidence yet systematically avoid committing to concrete, evidence-grounded answers.

Hard Evidence Hallucination

Terminal-based agents in sandboxed environments frequently fabricate file paths, metadata, or non-existent content as supporting evidence.

Entity Misattribution

Even the strongest systems can retrieve correct evidence yet bind it to the wrong entity, producing plausible but incorrect answers.

Difficulty Axes

Query difficulty is organized along three complementary axes:
  • Evidence Breadth
  • Modality Breadth
  • Reasoning Depth

Difficulty

Difficulty Distribution