Project Page • Preprint • Multimodal Agents

HippoCamp: Benchmarking Contextual Agents on Personal Computers

Nanyang Technological University, Singapore

We construct three personal computing environments comprising 42.4 GB of data across over 2K heterogeneous files, 581 user-need-driven, evidence-grounded queries, and 46.1K fine-grained annotations. The benchmark evaluates agents' ability to search, perceive, and reason over realistic, multimodal personal file systems, where even the most advanced models achieve merely 48.0% accuracy in user profiling.

  • 3 archetypal profiles
  • 27 file types
  • 581 QA pairs
  • 48.3% best profiling accuracy
Overview of tasks in the HippoCamp benchmark. HippoCamp is a benchmark designed to evaluate agents' ability to search, perceive, and reason over long-term, realistic, large-scale personal file systems.
Abstract

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities in multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments, requiring them to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K fine-grained annotations organized into structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve merely 48.0% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.

Overview

Personalized Multimodal Memory Benchmark

HippoCamp is a benchmark designed to evaluate agents' ability to search, perceive, and reason over long-term, realistic, large-scale personal file systems. Through tasks that demand grounded retrieval, cross-file pattern inference, and long-horizon reasoning, the benchmark rigorously evaluates the personalized multimodal memory of contextual agents.

Realistic Personal Environments

We construct three distinct, file-intensive user profiles that faithfully simulate real-world digital ecosystems. Each profile captures long-term continuity, idiosyncratic folder structures, and cross-file interconnections.

Device-Scale Corpus and Supervision

The benchmark comprises over 2,000 heterogeneous files totaling 42.4 GB, paired with 581 evidence-grounded queries. We further provide 46.1K fine-grained annotations for step-wise failure diagnosis.

Search, Perception, and Reasoning

We design two core task categories — factual retention and profiling — that demand search, multimodal perception, and personalized reasoning. Each query requires grounding answers in evidence distributed across files and modalities.

Verifiable Personalization Remains Hard

Even the strongest systems struggle with entity disambiguation, multimodal grounding, and iterative evidence synthesis. Our results show that verifiable personalization on personal computers remains far from solved.

Construction

Pipeline, Task Distribution, and Capability Labels

Benchmark overview: (a) data collection and human-LLM collaborative annotation pipeline for grounded trajectory construction; (b) distribution of factual retention and profiling tasks; and (c) distribution of annotated agent capability labels over search, perception, and reasoning.

Benchmark overview. Data collection and human-LLM collaborative annotation pipeline, task distribution, and agent capability decomposition.
Leaderboard

HippoCamp Leaderboard

Which contextual agent performs best on realistic personal file systems? We benchmark a wide range of state-of-the-art multimodal models and agentic methods, reporting profiling accuracy, factual-retention accuracy, and a composite score.

Official • 581 QA pairs • 42.4 GB • 46.1K annotations
  • Two benchmark tracks: Profiling and factual retention are reported separately.
  • Three user environments: Scores are averaged across Bei, Adam, and Victoria.
  • Composite score: The leaderboard score is the mean of overall profiling accuracy and overall factual-retention accuracy.
  • Fine-grained diagnosis: Capability-wise performance is further decomposed into search, perception, and reasoning.
Best Overall

ChatGPT Agent Mode

Composite score 55.6, with the strongest overall balance across both benchmark tracks.

Best Profiling

48.3 Acc

ChatGPT Agent Mode leads the profiling track, substantially ahead of all other evaluated methods.

Best Factual Retention

62.8 Acc

Factual retention remains easier than profiling, but only a few systems achieve reliable performance.

Rank | Model Name | Date | Category | Profiling F1 | Profiling Acc | Factual F1 | Factual Retention Acc | Composite Score | Notes
1 | ChatGPT Agent Mode | Mar. 2026 | Agent | 21.0 | 48.3 | 35.3 | 62.8 | 55.6 | Most balanced overall
2 | Terminal Agent (GPT-5.2) | Mar. 2026 | Terminal Agent | 11.1 | 30.0 | 24.6 | 48.2 | 39.1 | Strong factual retention
3 | ReAct (Gemini-2.5-flash) | Mar. 2026 | ReAct | 18.5 | 20.0 | 26.5 | 38.7 | 29.4 | Best ReAct variant
4 | Standard RAG | Mar. 2026 | RAG | 18.4 | 26.7 | 30.0 | 30.2 | 28.5 | Stable retrieval baseline
5 | Terminal Agent (Gemini-2.5-flash) | Mar. 2026 | Terminal Agent | 15.0 | 25.0 | 26.4 | 23.3 | 24.2 | Mid-tier across both tracks
6 | ReAct (Qwen3-VL-8B-Instruct) | Mar. 2026 | ReAct | 11.8 | 13.5 | 43.1 | 28.5 | 21.0 | High factual F1, low accuracy
7 | Self RAG | Mar. 2026 | RAG | 15.2 | 10.0 | 31.9 | 27.5 | 18.8 | Weak profile grounding
8 | Search-R1 | Mar. 2026 | Search Agent | 10.8 | 5.0 | 41.0 | 25.3 | 15.2 | Search-heavy but weak inference
9 | Terminal Agent (Qwen3-VL-8B-Instruct) | Mar. 2026 | Terminal Agent | 11.6 | 16.7 | 17.3 | 11.5 | 14.1 | Lowest overall factual accuracy

Composite Score is computed as the arithmetic mean of the overall Profiling Accuracy and the overall Factual Retention Accuracy, so that both benchmark tracks contribute equally to the final ranking. We additionally report F1 scores to provide complementary information about answer quality and class-sensitive performance, but the leaderboard order is determined by this accuracy-based composite score rather than by F1.
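As a quick sanity check, the composite can be reproduced directly from the two overall accuracies in the table above. The short Python sketch below is purely illustrative; the function name is ours, not part of any released evaluation code.

```python
def composite_score(profiling_acc: float, factual_acc: float) -> float:
    """Arithmetic mean of overall profiling and factual-retention accuracy."""
    return (profiling_acc + factual_acc) / 2

# ChatGPT Agent Mode: 48.3 profiling acc, 62.8 factual-retention acc.
# (48.3 + 62.8) / 2 = 55.55, reported as 55.6 on the leaderboard.
print(f"{composite_score(48.3, 62.8):.2f}")  # 55.55
```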

Rank | Model Name | Date | Bei F1 / Acc | Adam F1 / Acc | Victoria F1 / Acc | Overall F1 | Overall Acc
1 | ChatGPT Agent Mode | Mar. 2026 | 23.8 / 35.0 | 22.7 / 55.0 | 16.7 / 55.0 | 21.0 | 48.3
2 | Terminal Agent (GPT-5.2) | Mar. 2026 | 8.1 / 15.0 | 14.9 / 45.0 | 10.5 / 30.0 | 11.1 | 30.0
3 | Standard RAG | Mar. 2026 | 13.7 / 10.0 | 20.8 / 35.0 | 20.6 / 35.0 | 18.4 | 26.7
4 | Terminal Agent (Gemini-2.5-flash) | Mar. 2026 | 9.0 / 5.0 | 17.0 / 45.0 | 19.0 / 25.0 | 15.0 | 25.0
5 | ReAct (Gemini-2.5-flash) | Mar. 2026 | 13.7 / 10.0 | 21.4 / 25.0 | 20.5 / 25.0 | 18.5 | 20.0
6 | Terminal Agent (Qwen3-VL-8B-Instruct) | Mar. 2026 | 5.4 / 0.0 | 16.3 / 25.0 | 13.2 / 25.0 | 11.6 | 16.7
7 | ReAct (Qwen3-VL-8B-Instruct) | Mar. 2026 | 5.5 / 5.6 | 17.8 / 25.0 | 12.2 / 10.0 | 11.8 | 13.5
8 | Self RAG | Mar. 2026 | 13.8 / 5.0 | 16.0 / 25.0 | 15.9 / 0.0 | 15.2 | 10.0
9 | Search-R1 | Mar. 2026 | 6.6 / 0.0 | 16.5 / 15.0 | 9.4 / 0.0 | 10.8 | 5.0

Each cell is reported as F1 / Accuracy. Ranking is based on overall profiling accuracy.
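The Overall columns appear to be unweighted means over the three profiles. The minimal sketch below (illustrative only) reproduces the overall profiling accuracy of ChatGPT Agent Mode from its per-profile cells in the table above.

```python
from statistics import mean

# Per-profile profiling accuracies for ChatGPT Agent Mode (from the table above).
per_profile_acc = {"Bei": 35.0, "Adam": 55.0, "Victoria": 55.0}

overall_acc = mean(per_profile_acc.values())
print(f"{overall_acc:.1f}")  # 48.3, matching the Overall Acc column
```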

Rank | Model Name | Date | Bei F1 / Acc | Adam F1 / Acc | Victoria F1 / Acc | Overall F1 | Overall Acc
1 | ChatGPT Agent Mode | Mar. 2026 | 20.4 / 31.2 | 56.2 / 90.3 | 29.3 / 67.0 | 35.3 | 62.8
2 | Terminal Agent (GPT-5.2) | Mar. 2026 | 13.0 / 29.8 | 31.6 / 59.2 | 29.0 / 55.7 | 24.6 | 48.2
3 | ReAct (Gemini-2.5-flash) | Mar. 2026 | 26.9 / 24.2 | 35.7 / 55.3 | 17.0 / 36.4 | 26.5 | 38.7
4 | Standard RAG | Mar. 2026 | 29.7 / 24.2 | 39.7 / 42.7 | 20.5 / 23.6 | 30.0 | 30.2
5 | ReAct (Qwen3-VL-8B-Instruct) | Mar. 2026 | 42.4 / 26.1 | 60.4 / 37.9 | 26.5 / 21.7 | 43.1 | 28.5
6 | Self RAG | Mar. 2026 | 33.9 / 26.1 | 41.5 / 38.8 | 20.2 / 17.7 | 31.9 | 27.5
7 | Search-R1 | Mar. 2026 | 38.7 / 23.7 | 58.0 / 28.2 | 26.4 / 24.1 | 41.0 | 25.3
8 | Terminal Agent (Gemini-2.5-flash) | Mar. 2026 | 21.8 / 18.1 | 33.1 / 31.1 | 24.4 / 20.7 | 26.4 | 23.3
9 | Terminal Agent (Qwen3-VL-8B-Instruct) | Mar. 2026 | 14.6 / 10.7 | 21.6 / 13.6 | 15.7 / 10.3 | 17.3 | 11.5

Each cell is reported as F1 / Accuracy. Ranking is based on overall factual-retention accuracy.

Model Name | Date | Profiling Search Acc | Profiling Perception Acc | Profiling Reasoning Acc | Factual Search Acc | Factual Perception Acc | Factual Reasoning Acc
ChatGPT Agent Mode | Mar. 2026 | 56.5 | 28.5 | 55.8 | 49.1 | 55.5 | 33.8
Terminal Agent (GPT-5.2) | Mar. 2026 | 46.3 | 27.3 | 44.1 | 27.2 | 29.7 | 36.4
ReAct (Gemini-2.5-flash) | Mar. 2026 | 34.9 | 20.4 | 33.0 | 19.1 | 23.4 | 14.8
Search-R1 | Mar. 2026 | 24.9 | 15.7 | 25.8 | 3.8 | 7.2 | 3.9
Standard RAG | Mar. 2026 | 26.2 | 13.8 | 25.5 | 26.2 | 28.7 | 19.1
Self RAG | Mar. 2026 | 23.1 | 13.2 | 22.4 | 8.9 | 12.2 | 7.0
Terminal Agent (Gemini-2.5-flash) | Mar. 2026 | 21.0 | 13.7 | 21.1 | 23.3 | 24.7 | 17.2
ReAct (Qwen3-VL-8B-Instruct) | Mar. 2026 | 26.1 | 16.7 | 23.9 | 13.4 | 18.7 | 10.6
Terminal Agent (Qwen3-VL-8B-Instruct) | Mar. 2026 | 11.1 | 5.7 | 11.4 | 18.6 | 24.2 | 12.2

Capability-wise accuracies reveal a consistent gap between search-heavy performance and downstream grounded perception or reasoning.
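For readers who want to replicate this kind of breakdown on their own traces, the sketch below shows one way a capability-wise accuracy could be computed from step-level annotations. The record format (a 'capability' label plus a boolean 'correct' flag) is a simplifying assumption for illustration, not HippoCamp's released annotation schema.

```python
from collections import defaultdict

def capability_accuracy(steps):
    """Group annotated steps by capability label and compute per-label accuracy (%)."""
    totals, correct = defaultdict(int), defaultdict(int)
    for step in steps:
        cap = step["capability"]  # e.g. "search", "perception", "reasoning"
        totals[cap] += 1
        correct[cap] += int(step["correct"])
    return {cap: 100.0 * correct[cap] / totals[cap] for cap in totals}

# Toy example with made-up step annotations:
steps = [
    {"capability": "search", "correct": True},
    {"capability": "search", "correct": False},
    {"capability": "perception", "correct": False},
    {"capability": "reasoning", "correct": True},
]
print(capability_accuracy(steps))  # {'search': 50.0, 'perception': 0.0, 'reasoning': 100.0}
```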

For leaderboard inclusion, please contact me via email: zhe012@e.ntu.edu.sg.

Profiles

Personal Computing Environments

Each profile instantiates a distinct personal-device environment characterized along multiple dimensions, and is paired with representative file-content statistics and factual-retention and profiling QA examples.

Archetypal user profiles in HippoCamp. Profile A represents a student and content-creator context, Profile B a legal-executive environment, and Profile C a senior-financial-analyst setting.
Benchmark Examples

Concrete Examples From HippoCamp

This section presents concrete benchmark examples from HippoCamp. Factual retention examples require retrieving and reasoning over verifiable file-grounded facts, whereas profiling examples synthesize many grounded facts across time into coherent user-level inferences such as preferences, behavioral patterns, scheduling information, retrospective reflections, and workflows.

Benchmark Task 1 Example

Factual Retention

These benchmark examples require producing precise answers that are fully supported by file-grounded evidence under realistic filesystem "haystack" conditions. Evaluation emphasizes high-fidelity factual recall, structured alignment, and evidence-backed answers with minimal hallucination.

Example 1: Cross-Modal Asset Retrieval

The first example asks the agent to locate a previously written vlog script and identify the corresponding cat photos required by the script. This instance tests precise file localization, structured fact extraction from documents, and cross-modal matching under explicit constraints.

Factual retention example. Given a vlog script, the agent extracts the required assets and identifies the matching photos, with ground-truth answer and evidence visualizations.

Example 2: Document-Video Compliance Verification

The second example evaluates rule-based factual verification under multimodal evidence. The query asks whether the "Saver Menu" logo placement in a user advertisement complies with McDonald's clearspace guidelines.

Factual retention example. The agent verifies logo clearspace by extracting the rule from the manual and checking it against video frames, with ground-truth answer and evidence visualizations.
Benchmark Task 2 Example

Profiling

These benchmark examples require inferring user-level attributes from device-resident evidence distributed across files, modalities, and time. Evaluation emphasizes profile consistency, correct temporal anchoring, executability of suggested actions when applicable, and traceability to grounded evidence.

Profiling Subtasks

Profiling queries are decomposed into five complementary subtasks: preferences, behavioral patterns, scheduling information, retrospective reflections, and workflows. These subtasks share the same evidence-grounding requirement but differ in the dominant abstraction operator, ranging from event-level reconstruction to trait-level generalization.

Profiling subtask distribution. Proportions of the five profiling subtasks across the three profiles.
Profiling Atlas

Preferences

Inferring stable photo-editing preferences from an email thread with annotated visual feedback and user confirmation.
Preferences profiling captures stable, trait-like choices that generalize across situations and requires evidence-backed abstraction from episodic traces.
Annotation Hierarchy

Hierarchical Annotation Schema

The pyramid organizes supervision from low-level atomic grounding and action traces to structured trajectories and QA tasks, with increasing abstraction and aggregation toward user-level memory. Profiling queries sit at the top, requiring long-horizon, cross-modal integration of multiple factual-retention facts into a coherent user model.

HippoCamp hierarchical annotation schema. The pyramid organizes supervision from low-level atomic grounding and action traces to structured trajectories and QA tasks, with increasing abstraction and aggregation toward user-level memory.
Analysis

Where Current Agents Fail

We complement the profile-wise results with finer-grained analysis. We decompose performance by agent capability, characterize systematic failure and success modes, and distill design principles for future agents.

Representative failure and success patterns on a cross-modal profiling query. The query requires aligning evidence across heterogeneous personal files and modalities.

Retrieval Mismatch

RAG-based systems often retrieve semantically similar but contextually irrelevant documents, failing to isolate user-relevant personal files.

Grounding Avoidance

Reasoning-centric agents can locate candidate evidence yet systematically avoid committing to concrete, evidence-grounded answers.

Hard Evidence Hallucination

Terminal-based agents in sandboxed environments frequently fabricate file paths, metadata, or non-existent content as supporting evidence.

Entity Misattribution

Even the strongest systems can retrieve correct evidence yet bind it to the wrong entity, producing plausible but incorrect answers.

Difficulty Axes

Query difficulty is organized along three complementary axes:
  • Evidence Breadth
  • Modality Breadth
  • Reasoning Depth

Difficulty

Difficulty Distribution