BAGEL-World: Towards High-Quality Visual Question–Visual Answering

Visual Question–Visual Answering (VQ-VA): Generating an image, rather than text, in response to a visual question.

*Equal Contribution
1Monash University, 2Tsinghua University, 3UC Santa Cruz, 4Bytedance Seed, 5University of Adelaide

Examples of Visual Question–Visual Answering (VQ-VA), highlighting the substantial gap between existing closed-source models and open-weight models. The rightmost column further shows that a model trained with BAGEL-World significantly improves VQ-VA performance.

Abstract

This paper studies Visual Question–Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question, an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To bring this capability to open-source models as well, we introduce BAGEL-World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls and curates ~1.8M high-quality, interleaved image–text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with BAGEL-World yields strong empirical gains: it helps LightBAGEL attain 45.0 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 6.81 from vanilla LightBAGEL and 1.94 from UniWorld-V1) and significantly narrowing the gap to leading proprietary systems (e.g., 81.67 from NanoBanana and 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to facilitate future research on VQ-VA.

BAGEL-World Data Framework

Illustration of the BAGEL-World framework for creating VQ-VA data. The framework consists of two stages: (1) preprocessing, which classifies and filters web-interleaved documents, and (2) an agentic pipeline that generates VQ-VA samples from the filtered documents. The agentic pipeline contains five sub-modules: retriever, filter, instruction generator, rewriter, and reasoner.
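
To make the data flow concrete, below is a minimal Python sketch of how the five sub-modules could be chained over one filtered interleaved document. All class names, method names, and the `llm` interface are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the agentic VQ-VA data pipeline described above.
# The `llm` object and its methods are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class VQVASample:
    question: str        # free-form visual question
    source_image: str    # path/URL of the question image
    answer_image: str    # path/URL of the visual answer
    reasoning: str       # reasoning trace produced by the reasoner


def build_samples(document: dict, llm) -> list[VQVASample]:
    """Run the five sub-modules on one filtered interleaved document."""
    samples = []
    # 1. Retriever: pick semantically linked image pairs from the document.
    for src, ans in llm.retrieve_image_pairs(document):
        # 2. Filter: drop pairs that are near-duplicates or weakly related.
        if not llm.pair_is_informative(src, ans, document["text"]):
            continue
        # 3. Instruction generator: draft a visual question answered by `ans`.
        question = llm.generate_instruction(src, ans, document["text"])
        # 4. Rewriter: polish the question and remove answer leakage.
        question = llm.rewrite(question)
        # 5. Reasoner: produce the reasoning that links question to answer.
        reasoning = llm.reason(question, src, ans)
        samples.append(VQVASample(question, src, ans, reasoning))
    return samples
```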

IntelligentBench: A Benchmark for VQ-VA

IntelligentBench evaluates the VQ-VA abilities of different models on questions that require knowledge and reasoning to answer. Specifically, it contains 360 human-curated examples divided into three domains: world knowledge (171), design knowledge (88), and reasoning (101).

The construction of IntelligentBench involves three main steps: (1) Document Review: Human experts examined about 3k classified interleaved web documents and, from each, selected the image pair that best represented the document's content and exhibited a strong semantic connection. (2) Question Design: For each selected image pair, experts designed free-form questions targeting world knowledge, design knowledge, or reasoning. (3) Expert Cross-Review: Every candidate item was independently reviewed by at least one other expert, and only items that received unanimous approval were retained.
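
For reference, a plausible record layout for one benchmark item is sketched below; the field names are assumptions (the released files may differ), while the domain counts come from the numbers above.

```python
# Hypothetical IntelligentBench item layout; field names are illustrative.
# Only the domain counts (171 / 88 / 101, totalling 360) come from the text.
from collections import Counter

item = {
    "question_image": "path/to/question_image.png",     # image the question refers to
    "question": "A free-form question requiring world/design knowledge or reasoning.",
    "reference_image": "path/to/reference_answer.png",  # human-selected visual answer
    "domain": "world_knowledge",  # "world_knowledge" | "design_knowledge" | "reasoning"
}

expected = Counter(world_knowledge=171, design_knowledge=88, reasoning=101)
assert sum(expected.values()) == 360  # matches the benchmark size stated above
```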

The accompanying figure shows examples from IntelligentBench.

Results
Quantitative Results

Results on VQ-VA: IntelligentBench. Fully open-source models (both training data and model weights) are shown without shading, open-weight models are shaded in light blue, and closed-source models are shaded in light gray for clarity.

| Model | World Knowledge | Design Knowledge | Reasoning | Overall |
|---|---|---|---|---|
| GPT-Image-1 (OpenAI, 2025) | 84.5 | 80.68 | 81.19 | 82.64 |
| Nano Banana (Nano Banana AI, 2025) | 81.6 | 82.95 | 80.69 | 81.67 |
| BAGEL-Think (Deng et al., 2025) | 61.99 | 55.11 | 62.38 | 60.42 |
| Qwen-Image (Wu et al., 2025a) | 38.07 | 33.66 | 32.75 | 34.31 |
| FLUX.1-Kontext-Dev (Labs et al., 2025) | 20.18 | 24.43 | 19.80 | 21.11 |
| OmniGen2 (Wu et al., 2025b) | 11.11 | 13.07 | 7.92 | 10.69 |
| Step1X-Edit (Liu et al., 2025) | 11.7 | 10.23 | 15.35 | 12.36 |
| UniWorld-V1 (Lin et al., 2025) | 2.92 | 0.57 | 1.49 | 1.94 |
| LightBAGEL | 6.14 | 7.39 | 7.43 | 6.81 |
| Ours | 43.57 | 46.02 | 46.53 | 45.00 |
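
The Overall column is consistent with an item-count-weighted average of the three domain scores (171, 88, and 101 items); a minimal check of this assumed weighting:

```python
# Sanity check: Overall as the item-count-weighted mean of the domain scores.
# The weighting is inferred from the reported numbers, not stated explicitly.
SIZES = {"world": 171, "design": 88, "reasoning": 101}

def overall(scores):
    return sum(scores[d] * SIZES[d] for d in SIZES) / sum(SIZES.values())

print(round(overall({"world": 43.57, "design": 46.02, "reasoning": 46.53}), 2))  # 45.0 (Ours)
print(round(overall({"world": 6.14, "design": 7.39, "reasoning": 7.43}), 2))     # 6.81 (LightBAGEL)
```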

Results on Reasoning-Based Image Editing Benchmark (a): RISEBench. Fully open-source models (both training data and model weights) are shown without shading, open-weight models are shaded in light blue, and closed-source models are shaded in light gray for clarity.

| Model | Temporal | Causal | Spatial | Logical | Overall |
|---|---|---|---|---|---|
| Nano Banana (Nano Banana AI, 2025) | 25.9 | 47.8 | 37.0 | 18.8 | 32.8 |
| GPT-Image (OpenAI, 2025) | 34.1 | 32.2 | 37.0 | 10.6 | 28.9 |
| Gemini-2.0-Flash-exp (Google, 2024) | 8.2 | 15.5 | 23.0 | 4.7 | 13.3 |
| Seedream-4.0 (Bytedance Seed, 2025) | 12.9 | 12.2 | 11.0 | 7.1 | 10.8 |
| BAGEL-Think (Deng et al., 2025) | 5.9 | 17.7 | 21.0 | 1.1 | 11.9 |
| Qwen-Image-Edit (Wu et al., 2025a) | 4.7 | 10.0 | 17.0 | 2.4 | 8.9 |
| FLUX.1-Kontext-Dev (Labs et al., 2025) | 2.3 | 5.5 | 13.0 | 1.2 | 5.8 |
| Step1X-Edit (Liu et al., 2025) | 0.0 | 2.2 | 2.0 | 3.5 | 1.9 |
| OmniGen (Xiao et al., 2025) | 1.2 | 1.0 | 0.0 | 1.2 | 0.8 |
| EMU2 (Sun et al., 2024) | 1.2 | 1.1 | 0.0 | 0.0 | 0.5 |
| HiDream-Edit (Cai et al., 2025) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| FLUX.1-Canny (Labs et al., 2025) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LightBAGEL | 1.1 | 1.1 | 3.0 | 1.1 | 1.6 |
| Ours | 14.1 | 21.1 | 14.0 | 1.1 | 12.7 |

Results on Reasoning-Based Image Editing Benchmark (b): KRIS-Bench. Fully open-source models (both training data and model weights) are shown without shading, open-weight models are shaded in light blue, and closed-source models are shaded in light gray for clarity.

| Model | Factual | Conceptual | Procedural | Overall Average |
|---|---|---|---|---|
| GPT-4o (OpenAI, 2025) | 86.99 | 80.08 | 78.61 | 82.18 |
| Gemini-2.0 (Google, 2024) | 73.03 | 61.92 | 67.76 | 67.24 |
| Doubao (ByteDance, 2025) | 72.02 | 64.99 | 62.94 | 67.00 |
| OmniGen (Xiao et al., 2025) | 44.79 | 34.23 | 34.37 | 38.00 |
| Emu2 (Sun et al., 2024) | 57.81 | 43.75 | 43.57 | 48.69 |
| BAGEL-Think (Deng et al., 2025) | 62.75 | 62.49 | 42.76 | 57.91 |
| Step1X-Edit (Liu et al., 2025) | 53.32 | 52.51 | 37.21 | 49.17 |
| AnyEdit (Yu et al., 2025) | 52.06 | 50.96 | 37.68 | 48.21 |
| MagicBrush (Zhang et al., 2023) | 54.22 | 47.30 | 34.60 | 46.74 |
| InsPix2Pix (Brooks et al., 2023) | 33.38 | 32.47 | 25.84 | 31.22 |
| LightBAGEL | 57.62 | 50.24 | 41.06 | 50.33 |
| Ours | 62.10 | 60.11 | 45.02 | 57.16 |

Results on Standard Image Editing Benchmarks: GEdit-Bench-EN and ImgEdit-Bench. Fully open-source models (both training data and model weights) are shown without shading, open-weight models are shaded in light blue, and closed-source models are shaded in light gray for clarity.

| Model | GEdit-Bench-EN SC | GEdit-Bench-EN PQ | GEdit-Bench-EN Overall | ImgEdit-Bench Overall |
|---|---|---|---|---|
| GPT-4o (OpenAI, 2025) | 7.85 | 7.62 | 7.53 | 4.20 |
| Gemini-2.0-flash (Google, 2024) | 6.73 | 6.61 | 6.32 | – |
| Instruct-Pix2Pix (Brooks et al., 2023) | 3.58 | 5.49 | 3.68 | 1.88 |
| MagicBrush (Zhang et al., 2023) | 4.68 | 5.66 | 4.52 | 1.90 |
| AnyEdit (Yu et al., 2025) | 3.18 | 5.82 | 3.21 | 2.45 |
| ICEdit (Zhang et al., 2025) | 5.11 | 6.85 | 4.84 | 3.05 |
| Step1X-Edit (Liu et al., 2025) | 7.09 | 6.76 | 6.70 | 3.06 |
| OmniGen2 (Wu et al., 2025b) | 7.16 | 6.77 | 6.41 | 3.43 |
| BAGEL (Deng et al., 2025) | 7.36 | 6.83 | 6.52 | 3.20 |
| Ovis-U1 (Wang et al., 2025a) | – | – | 6.42 | 3.98 |
| UniPic (Wang et al., 2025b) | 6.72 | 6.18 | 5.83 | 3.49 |
| UniPic 2.0 (Wei et al., 2025) | – | – | 7.10 | 4.06 |
| UniWorld-V1 (Lin et al., 2025) | 4.93 | 7.43 | 4.85 | 3.26 |
| LightBAGEL | 6.56 | 7.06 | 6.06 | 3.65 |
| Ours | 6.58 | 7.00 | 6.13 | 3.76 |
Qualitative Results

For complete results on IntelligentBench, please refer to the PDF, which also includes the results on RISEBench.

BibTeX

Coming soon.