BAGEL-World: Towards High-Quality Visual Question–Visual Answering
This paper studies Visual Question–Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question—an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To bring this capability to open-source models as well, we introduce BAGEL-World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Deployed at web scale, this pipeline crawls ~1.8M high-quality, interleaved image–text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with BAGEL-World yields strong empirical gains: it helps LightBAGEL attain 45.0 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 6.81 from vanilla LightBAGEL; 1.94 from UniWorld-V1) and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). We release the full suite of model weights, datasets, and pipelines, and hope it will facilitate future research on VQ-VA.
Illustration of the BAGEL-World framework for creating VQ-VA data. The framework consists of two stages: (1) preprocessing, which classifies and filters web-interleaved documents, and (2) an agentic pipeline that generates VQ-VA samples from the filtered documents. The agentic pipeline contains five sub-modules: retriever, filter, instruction generator, rewriter, and reasoner.
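Below is a minimal, hypothetical sketch in Python of how the second stage of the pipeline might be organized as five sub-modules chained over a filtered document. The stage and module names follow the caption above; the `Document` and `VQVASample` classes, the function signatures, and all internal logic are illustrative assumptions, not the released implementation.

```python
# Sketch of the agentic pipeline (stage 2): retriever -> filter -> instruction
# generator -> rewriter -> reasoner. All names and logic here are hypothetical.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Document:
    """A web-interleaved document that survived stage-1 classification/filtering."""
    doc_id: str
    images: List[str]   # image URLs or local paths
    texts: List[str]    # interleaved text segments


@dataclass
class VQVASample:
    """One VQ-VA training sample: a visual question paired with a visual answer."""
    question_image: str
    answer_image: str
    instruction: str
    reasoning: Optional[str] = None


def retrieve(doc: Document) -> List[Tuple[str, str]]:
    """Retriever: propose candidate (question image, answer image) pairs."""
    return list(zip(doc.images, doc.images[1:]))


def keep(pair: Tuple[str, str]) -> bool:
    """Filter: keep only pairs with a strong semantic connection (placeholder check)."""
    return pair[0] != pair[1]


def generate_instruction(pair: Tuple[str, str], doc: Document) -> str:
    """Instruction generator: draft a visual question answerable by the second image."""
    return f"Given the first image from {doc.doc_id}, generate the image that should follow."


def rewrite(instruction: str) -> str:
    """Rewriter: polish the drafted instruction for clarity and fluency."""
    return instruction.strip()


def reason(instruction: str) -> str:
    """Reasoner: attach an explanation linking the question image to the answer image."""
    return f"The answer image satisfies the request: {instruction}"


def build_samples(doc: Document) -> List[VQVASample]:
    """Run the five sub-modules in sequence over one filtered document."""
    samples = []
    for pair in retrieve(doc):
        if not keep(pair):
            continue
        instruction = rewrite(generate_instruction(pair, doc))
        samples.append(VQVASample(pair[0], pair[1], instruction, reason(instruction)))
    return samples
```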
IntelligentBench evaluates the VQ-VA abilities of different models with questions that require knowledge and reasoning to answer. It contains 360 human-curated examples divided into three domains: world knowledge (171), design knowledge (88), and reasoning (101).
The construction of IntelligentBench involved three main steps: (1) Document Review: human experts examined about 3k classified interleaved web documents and, from each, selected the image pair that best represented the document's content and exhibited a strong semantic connection. (2) Question Design: for each selected image pair, experts designed free-form questions targeting world knowledge, design knowledge, or reasoning. (3) Expert Cross-Review: every candidate item was independently reviewed by at least one other expert; only items receiving unanimous approval were retained.
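For concreteness, a hypothetical schema for a single IntelligentBench item is sketched below, mirroring the curation steps above; the field names and the `Literal` domain labels are illustrative assumptions, not the released data format.

```python
# Hypothetical per-item schema for IntelligentBench; field names are illustrative only.
from dataclasses import dataclass
from typing import Literal


@dataclass
class IntelligentBenchItem:
    question_image: str                     # image the free-form question refers to
    reference_answer_image: str             # expert-selected visual answer
    question: str                           # free-form question designed by experts
    domain: Literal["world_knowledge", "design_knowledge", "reasoning"]
    source_document_id: str                 # web document the image pair was selected from


# The 360 items split into 171 world-knowledge, 88 design-knowledge, and 101 reasoning examples.
```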
The figure on the right shows examples from IntelligentBench.
Results on VQ-VA: IntelligentBench. Fully open-source models (both training data and model weights) are shown without shading, open-weight models are shaded in light blue, and closed-source models are shaded in light gray for clarity.
Results on Reasoning-Based Image Editing Benchmark (a): RISEBench. Fully open-source models (both training data and model weights) are shown without shading, open-weight models are shaded in light blue, and closed-source models are shaded in light gray for clarity.
Results on Reasoning-Based Image Editing Benchmark (b): KRIS-Bench. Fully open-source models (both training data and model weights) are shown without shading, open-weight models are shaded in light blue, and closed-source models are shaded in light gray for clarity.
Results on Standard Image Editing Benchmarks: G-Edit-Benchmark-EN and Img-Edit. Fully open-source models (both training data and model weights) are shown without shading, open-weight models are shaded in light blue, and closed-source models are shaded in light gray for clarity.
For complete results on IntelligentBench and RISEBench, please refer to the PDF.