What is DeepImageSearch and DISBench?
DeepImageSearch marks a paradigm shift in image retrieval: from matching images independently to corpus-level contextual reasoning over visual histories. People capture thousands of photos over the years, forming rich episodic memories in which information is distributed across temporal sequences rather than confined to single snapshots. Many real-world queries over such memories cannot be resolved by evaluating each image in isolation; the target images can only be identified by exploring and reasoning over the entire corpus. This corpus-level reasoning makes agentic capabilities essential rather than auxiliary.
DISBench is the first benchmark designed for this task. Given a user's photo collection and a natural language query, agents must autonomously plan search trajectories, discover latent cross-image associations, and chain scattered visual evidence through multi-step exploration to return the exact set of qualifying images. The benchmark covers two reasoning patterns: Intra-Event queries that require locating a target event via contextual clues and then filtering within it, and Inter-Event queries that demand scanning across multiple events to find recurring elements under temporal or spatial constraints.
How to Read the Leaderboard
Champion List shows top results per track. Use the sub-tabs to switch between:
- Standard: Pre-processing is limited to encoding images into embeddings for building a retrieval index; no additional pre-computation (e.g., captioning, graph construction) is allowed. Tests agentic reasoning over raw visual data (a minimal indexing sketch follows this list).
- Open: Arbitrary pre-processing is permitted (captioning, knowledge graph construction, structured indexing, etc.). Tests the system-level upper bound with full engineering freedom.
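For orientation, here is a minimal sketch of the kind of pre-processing the Standard track permits: images are encoded into embeddings and stacked into a retrieval index, nothing more. The choice of CLIP via the Hugging Face transformers library, the model name, and the photo directory are illustrative assumptions, not part of DISBench.

```python
# Illustrative only: image -> embedding -> retrieval index, which is the extent
# of pre-processing allowed on the Standard track. Model and paths are assumptions.
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = sorted(Path("photos/").glob("*.jpg"))  # hypothetical corpus location
embeddings = []
with torch.no_grad():
    for path in image_paths:
        inputs = processor(images=Image.open(path), return_tensors="pt")
        feats = model.get_image_features(**inputs)            # (1, 512) image embedding
        embeddings.append(torch.nn.functional.normalize(feats, dim=-1))
index = torch.cat(embeddings).numpy()                          # the retrieval index: one row per image

def search(query_text: str, top_k: int = 10) -> list[str]:
    """Cosine-similarity lookup; an agent would call this repeatedly while exploring."""
    with torch.no_grad():
        q_inputs = processor(text=[query_text], return_tensors="pt", padding=True)
        q = torch.nn.functional.normalize(model.get_text_features(**q_inputs), dim=-1).numpy()
    scores = index @ q.T
    ranked = np.argsort(-scores[:, 0])[:top_k]
    return [image_paths[i].name for i in ranked]
```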
Full Analysis lets you compare across tracks and filter by agent framework, backbone model, or retriever. Click any score column header to sort and highlight.
Evaluation Metrics
All metrics are computed at the set level: models must predict the exact set of target images for each query.
- EM (Exact Match): the predicted set must be identical to the ground truth (no extra, no missing).
- F1 Score: harmonic mean of precision and recall over the predicted vs. ground-truth image sets.
Scores are reported across three dimensions: Overall (all queries), Intra-Event (locate a specific event, then filter targets within it), and Inter-Event (scan across multiple events to find recurring elements under temporal/spatial constraints).
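As a concrete illustration of the set-level scoring, the sketch below computes EM and F1 for a single query, assuming predictions and ground truth are given as sets of image identifiers (variable names and filenames are illustrative; the official evaluation script is the reference implementation).

```python
def exact_match(pred: set[str], gold: set[str]) -> float:
    """EM: 1.0 only if the predicted image set equals the ground-truth set exactly."""
    return float(pred == gold)

def set_f1(pred: set[str], gold: set[str]) -> float:
    """F1 over image sets: harmonic mean of precision and recall."""
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: one extra image and one missing image still earn partial F1 credit, but EM = 0.
pred = {"img_103.jpg", "img_107.jpg", "img_220.jpg"}
gold = {"img_103.jpg", "img_107.jpg", "img_355.jpg"}
print(exact_match(pred, gold))  # 0.0
print(set_f1(pred, gold))       # ~0.667 (precision 2/3, recall 2/3)
```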
How to Submit
Prepare a .json file with two fields: meta (your method info) and predictions (your model outputs).
Go to the Submit tab, upload the file, and the system will automatically create a Pull Request on the Space repository for review.
After maintainers merge your PR, the evaluation script will compute scores and update the leaderboard.
Required fields in meta:
- method_name: display name for your method
- agent_framework, backbone_model, retriever_model
- track: "Standard" or "Open"
See the Submit tab for the full JSON template and format details.
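As a rough guide, a submission file can be assembled as in the sketch below. The meta fields follow the list above; the shape of predictions (a mapping from query id to a list of image filenames) is an assumption here, so defer to the Submit tab's template for the authoritative schema.

```python
import json

submission = {
    "meta": {
        "method_name": "MyAgent + CLIP retriever",  # display name on the leaderboard
        "agent_framework": "ReAct",                  # illustrative values
        "backbone_model": "gpt-4o",
        "retriever_model": "clip-vit-base-patch32",
        "track": "Standard",                         # "Standard" or "Open"
    },
    # Assumed shape: query id -> predicted set of image identifiers.
    # Check the Submit tab's JSON template for the exact expected format.
    "predictions": {
        "query_0001": ["img_103.jpg", "img_107.jpg"],
        "query_0002": ["img_220.jpg"],
    },
}

with open("submission.json", "w") as f:
    json.dump(submission, f, indent=2)
```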
| Rank | Method | Agent | Backbone | Retriever | Overall EM ↓ | Overall F1 | Intra EM | Intra F1 | Inter EM | Inter F1 |
|---|---|---|---|---|---|---|---|---|---|---|