What is DeepImageSearch and DISBench?
DeepImageSearch marks a paradigm shift in image retrieval: from matching images independently to corpus-level contextual reasoning over visual histories. People capture thousands of photos over the years, forming rich episodic memories in which information is distributed across temporal sequences rather than confined to single snapshots. Many real-world queries over such memories cannot be resolved by evaluating each image in isolation; the target images can only be identified by exploring and reasoning over the entire corpus. This corpus-level reasoning makes agentic capabilities essential rather than auxiliary.
DISBench is the first benchmark designed for this task. Given a user's photo collection and a natural language query, agents must autonomously plan search trajectories, discover latent cross-image associations, and chain scattered visual evidence through multi-step exploration to return the exact set of qualifying images. The benchmark covers two reasoning patterns: Intra-Event queries that require locating a target event via contextual clues and then filtering within it, and Inter-Event queries that demand scanning across multiple events to find recurring elements under temporal or spatial constraints.
How to Read the Leaderboard
Champion List shows top results per track. Use the sub-tabs to switch between:
- Standard: Pre-processing is limited to encoding images into embeddings for building a retrieval index; no additional pre-computation (e.g., captioning, graph construction) is allowed. Tests agentic reasoning over raw visual data (a minimal indexing sketch follows this list).
- Open: Arbitrary pre-processing is permitted (captioning, knowledge graph construction, structured indexing, etc.). Tests the system-level upper bound with full engineering freedom.
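For orientation, here is a minimal sketch of the kind of pre-processing the Standard track permits: images are encoded into embeddings and stacked into a retrieval index, nothing more. The choice of CLIP via the Hugging Face transformers library, the model name, and the photo directory are illustrative assumptions, not part of DISBench.

```python
# Illustrative only: image -> embedding -> retrieval index, which is the extent
# of pre-processing allowed on the Standard track. Model and paths are assumptions.
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = sorted(Path("photos/").glob("*.jpg"))  # hypothetical corpus location
embeddings = []
with torch.no_grad():
    for path in image_paths:
        inputs = processor(images=Image.open(path), return_tensors="pt")
        feats = model.get_image_features(**inputs)            # (1, 512) image embedding
        embeddings.append(torch.nn.functional.normalize(feats, dim=-1))
index = torch.cat(embeddings).numpy()                          # the retrieval index: one row per image

def search(query_text: str, top_k: int = 10) -> list[str]:
    """Cosine-similarity lookup; an agent would call this repeatedly while exploring."""
    with torch.no_grad():
        q_inputs = processor(text=[query_text], return_tensors="pt", padding=True)
        q = torch.nn.functional.normalize(model.get_text_features(**q_inputs), dim=-1).numpy()
    scores = index @ q.T
    ranked = np.argsort(-scores[:, 0])[:top_k]
    return [image_paths[i].name for i in ranked]
```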
Full Analysis lets you compare across tracks and filter by agent framework, backbone model, or retriever. Click any score column header to sort and highlight.
Evaluation Metrics
All metrics are computed at the set level: models must predict the exact set of target images for each query.
- EM (Exact Match): the predicted set must be identical to the ground truth (no extra, no missing).
- F1 Score: harmonic mean of precision and recall over the predicted vs. ground-truth image sets.
Scores are reported across three dimensions: Overall (all queries), Intra-Event (locate a specific event, then filter targets within it), and Inter-Event (scan across multiple events to find recurring elements under temporal/spatial constraints).
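As a concrete illustration of the set-level scoring, the sketch below computes EM and F1 for a single query, assuming predictions and ground truth are given as sets of image identifiers (variable names and filenames are illustrative; the official evaluation script is the reference implementation).

```python
def exact_match(pred: set[str], gold: set[str]) -> float:
    """EM: 1.0 only if the predicted image set equals the ground-truth set exactly."""
    return float(pred == gold)

def set_f1(pred: set[str], gold: set[str]) -> float:
    """F1 over image sets: harmonic mean of precision and recall."""
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: one extra image and one missing image still earn partial F1 credit, but EM = 0.
pred = {"img_103.jpg", "img_107.jpg", "img_220.jpg"}
gold = {"img_103.jpg", "img_107.jpg", "img_355.jpg"}
print(exact_match(pred, gold))  # 0.0
print(set_f1(pred, gold))       # ~0.667 (precision 2/3, recall 2/3)
```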
How to Submit
Prepare a .json file with two fields: meta (your method info) and predictions (your model outputs).
Go to the Submit tab, upload the file, and the system will automatically create a Pull Request on the Space repository for review.
After maintainers merge your PR, the evaluation script will compute scores and update the leaderboard.
Required fields in meta:
- method_name: display name for your method
- agent_framework, backbone_model, retriever_model
- track: "Standard" or "Open"
See the Submit tab for the full JSON template and format details.
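As a rough guide, a submission file can be assembled as in the sketch below. The meta fields follow the list above; the shape of predictions (a mapping from query id to a list of image filenames) is an assumption here, so defer to the Submit tab's template for the authoritative schema.

```python
import json

submission = {
    "meta": {
        "method_name": "MyAgent + CLIP retriever",  # display name on the leaderboard
        "agent_framework": "ReAct",                  # illustrative values
        "backbone_model": "gpt-4o",
        "retriever_model": "clip-vit-base-patch32",
        "track": "Standard",                         # "Standard" or "Open"
    },
    # Assumed shape: query id -> predicted set of image identifiers.
    # Check the Submit tab's JSON template for the exact expected format.
    "predictions": {
        "query_0001": ["img_103.jpg", "img_107.jpg"],
        "query_0002": ["img_220.jpg"],
    },
}

with open("submission.json", "w") as f:
    json.dump(submission, f, indent=2)
```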
| Rank | Method | Agent | Backbone | Retriever | Overall EM ↓ | Overall F1 | Intra EM | Intra F1 | Inter EM | Inter F1 |
|---|---|---|---|---|---|---|---|---|---|---|