DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

ICLR 2026

Fangyu Lei^1,2,3,*,§, Jinxiang Meng^1,2,*, Yiming Huang^5, Junjie Zhao^3, Yitong Zhang^6,
Jianwen Luo^1,2, Xin Zou^3, Ruiyi Yang^3, Wenbo Shi^3, Yan Gao^3, Shizhu He^1,2,
Zuo Wang^3, Qian Liu^4, Yang Wang^3, Ke Wang^3,†, Jun Zhao^1,2, Kang Liu^1,2,†

^1 Institute of Automation, CAS  ^2 University of Chinese Academy of Sciences
^3 ByteDance Seed  ^4 TikTok  ^5 UC San Diego  ^6 NUS

^* Equal contribution, ^§ Work done at ByteDance Seed, ^† Corresponding authors

Overview

Real-world enterprise data intelligence workflows encompass data engineering (DE), which turns raw sources into analysis-ready tables, and data analysis (DA), which converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows.

Data Engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines. Data Analysis (DA) tasks pose open-ended business problems that demand strategic planning and insight synthesis. Our experiments reveal that even state-of-the-art agents falter, with DE success rates under 20% and DA scores averaging below 40%.

🧩 Case Studies

DE: Repository-level Data Engineering
DA: Open-ended Data Analysis

Scenario: Designing a Data Warehouse Blueprint

Given a vague business requirement, the agent must design a comprehensive Architecture Blueprint. This involves identifying necessary data sources, defining the grain for Staging/Intermediate/Marts layers, and specifying strict data contracts before writing any code.

Scenario: Building Pipelines from Scratch

The agent is tasked with building a multi-layer SQL-based data pipeline from zero. It must generate SQL files for Staging (cleaning), Intermediate (logic), and Marts (aggregation), ensuring all 30+ files pass compilation and data integrity tests.

Scenario: Evolving Business Logic

A new requirement demands evolving the existing pipeline. The agent must interpret the request, identify the impact scope, and add or modify specific SQL files to update the DAG without breaking existing dependencies.
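
Identifying the impact scope of a change amounts to finding every model downstream of the edited files in the pipeline DAG. The sketch below illustrates that idea with a breadth-first traversal. The model names and the `deps` mapping are hypothetical and are not taken from any DAComp task.

```python
from collections import defaultdict, deque


def downstream_impact(deps, changed):
    """Return every model that transitively depends on a changed model.

    deps maps each model name to the list of models it reads from.
    """
    # Invert the edges so we can walk from an upstream model to its consumers.
    children = defaultdict(set)
    for model, parents in deps.items():
        for parent in parents:
            children[parent].add(model)

    impacted = set()
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted


# Hypothetical three-layer pipeline: staging -> intermediate -> marts.
deps = {
    "stg_orders": [],
    "stg_customers": [],
    "int_order_items": ["stg_orders"],
    "int_customer_orders": ["stg_customers", "int_order_items"],
    "mart_revenue": ["int_order_items"],
    "mart_customer_ltv": ["int_customer_orders"],
}

impacted = downstream_impact(deps, ["stg_orders"])
print(sorted(impacted))
```

Any model in the returned set would need to be recompiled and retested after the change; models outside it can be left untouched.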

User Question: Credit risk and RMB 100M allocation with churn-aware limits/rates across SME ratings.

Executive summary

Deploy 100M RMB: A 70.97M @4.65%, B 13.32M @5.05%, C 15.70M @5.45%; D excluded. Offer 120.23M to land 100M after churn.
Default ladder: A 0.00%, B 2.63%, C 5.88%, D 100%; revenue capacity and stability favor overweighting A.
Expected net profit ≈ 3.44M RMB (LGD 45%, PD observed).

Data and method

Tables: company_info, sales_invoices, input_invoices, annual_rate_&_churn.
Metrics: revenue capacity (avg monthly sales), profit CV, dependency HHI, default by rating, churn curve by rate.

Key insights

Default risk escalates sharply by rating; exclude D or fully collateralize.
A-rated: strongest revenue and diversified supply chain → prioritize exposure.
Volatility and concentration amplify B/C risk → penalize HHI >0.50 or CV >2.0 with tighter covenants.

Credit allocation rules

Limit = 30% × Avg Monthly Sales × rating factor (A 1.20 / B 1.00 / C 0.70 / D 0) × stability factor (by CV) × dependency factor (HHI-based, floored at 0.60).
Rates: base A 4.65% / B 5.05% / C 5.45% with CV and HHI add-ons; churn-adjusted offers = deploy / (1 − churn).
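
The allocation rules above can be sketched in a few lines of Python. The rating factors, the 30% multiplier, the 0.60 floor, and the churn adjustment come from the report. The report does not say how the stability and dependency factors are derived from CV and HHI, so they are passed in as plain inputs here, and the function names are illustrative.

```python
# Rating factors as stated in the report.
RATING_FACTOR = {"A": 1.20, "B": 1.00, "C": 0.70, "D": 0.0}


def credit_limit(avg_monthly_sales, rating, stability_factor, dependency_factor):
    """Limit = 30% x avg monthly sales x rating x stability x dependency."""
    # The report floors the dependency factor at 0.60.
    dependency_factor = max(dependency_factor, 0.60)
    return (0.30 * avg_monthly_sales * RATING_FACTOR[rating]
            * stability_factor * dependency_factor)


def churn_adjusted_offer(deploy, churn):
    """Amount to offer so that `deploy` remains after a churn fraction."""
    if not 0 <= churn < 1:
        raise ValueError("churn must be in [0, 1)")
    return deploy / (1 - churn)


def implied_churn(deploy, offer):
    """Back out the aggregate churn implied by an offer/deploy pair."""
    return 1 - deploy / offer


# Portfolio-level check: offering 120.23M to land 100M implies about 16.8% churn.
print(round(implied_churn(100.0, 120.23), 4))
```

The back-calculated churn is a portfolio aggregate. The report's churn curve varies by rate, so per-rating churn values would differ from this single figure.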

Source: case-reports/dacomp-001/dacomp-001_gsb_ref_2.md

User Question: Identify ride booking peaks/troughs in 2024, analyze cancellation and cost per km, and give operational actions.

Uber Ride Booking Analysis: Peaks, Troughs & Actions

Executive Summary

150k rides (2024-01-01 to 09-09) show a 4.5x peak-to-trough swing (18:00 vs 23:00) with stable ~25% cancellation.

Hourly and daily patterns

Peak 18:00 (12,397 bookings); trough 23:00 (2,762). Business: supply-demand imbalance.

Peak day Nov 16 (306) vs trough May 2 (198). Weekend cost/km higher (43+) vs weekday (28-30).
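
The peak-to-trough figures can be checked with a small helper. The counts below are the only values quoted in the report, so each dictionary is a two-entry excerpt rather than the full distribution, and the helper name is illustrative.

```python
def peak_trough(counts):
    """Return (peak key, trough key, peak/trough ratio) for a count mapping."""
    peak = max(counts, key=counts.get)
    trough = min(counts, key=counts.get)
    return peak, trough, counts[peak] / counts[trough]


# Only the two extremes quoted in the report are included here.
hourly = {"18:00": 12397, "23:00": 2762}
daily = {"Nov 16": 306, "May 2": 198}

_, _, hourly_ratio = peak_trough(hourly)
_, _, daily_ratio = peak_trough(daily)
print(round(hourly_ratio, 1), round(daily_ratio, 2))
```

The hourly ratio rounds to the 4.5x swing cited in the executive summary, while the daily swing is much smaller, at roughly 1.5x.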

Vehicle performance

Auto/Go Mini dominate share; completion 61-63% across vehicle types; cancellations stable across hours.

Operational recommendations

Dynamic driver allocation: pre-position +40% for 17:00-19:00; trim 23:00-05:00.
Surge tuning: stronger 18:00-20:00 multipliers; keep weekday base steady, lean into weekend premium.
Cancellation management: predictive flags, better ETA accuracy, proactive alternatives.
Vehicle mix: raise Auto/Go Mini during peaks; reserve premium for weekend/high-value slots.

Source: case-reports/dacomp-021/dacomp-021_gsb_ref_3.md

User Question: Find >120-day campaigns with CAC +25% and LTV/CAC -20% last 30 days; analyze decay by lifecycle/channel/maturity and recommend fixes.

Customer Acquisition Efficiency Decay (Google Ads)

Data and scope

Tables: google_ads__campaign_report, google_ads__customer_acquisition_analysis.
Window: 2024-01-03 to 2024-12-31; runtime >120 days; decay if CAC growth >25% and LTV/CAC change < -20%.

Flagged set

10 campaigns (High 8, Medium 2); avg CAC growth +176.7%, avg LTV/CAC change -93.5%; composite risk mean 0.778.
Examples: CMP_ACC_FIN_001_008 (Shopping) CAC +636%, LTV/CAC -100%; CMP_ACC_GAME_001_008 (Video) CAC +72%, LTV/CAC -80.7%; etc.
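
The decay rule stated in the scope (runtime >120 days, CAC growth >25%, LTV/CAC change < -20%) can be expressed as a simple predicate. In this sketch the class and field names are illustrative, changes are stored as fractions, and the runtime values are placeholders because the report does not list per-campaign runtimes. The last two rows are hypothetical counter-examples.

```python
from dataclasses import dataclass


@dataclass
class CampaignWindow:
    campaign_id: str
    runtime_days: int
    cac_growth: float      # fractional change over the last 30 days, 0.25 = +25%
    ltv_cac_change: float  # fractional change over the last 30 days, -0.20 = -20%


def is_decaying(c, min_runtime_days=120, cac_threshold=0.25, ltv_cac_threshold=-0.20):
    """Apply the three conditions of the decay rule."""
    return (c.runtime_days > min_runtime_days
            and c.cac_growth > cac_threshold
            and c.ltv_cac_change < ltv_cac_threshold)


campaigns = [
    # CAC and LTV/CAC values from the report; runtime_days is a placeholder.
    CampaignWindow("CMP_ACC_FIN_001_008", 200, 6.36, -1.00),
    CampaignWindow("CMP_ACC_GAME_001_008", 200, 0.72, -0.807),
    # Hypothetical rows that fail the runtime or threshold checks.
    CampaignWindow("hypothetical_young_campaign", 90, 0.72, -0.807),
    CampaignWindow("hypothetical_stable_campaign", 200, 0.10, -0.05),
]

flagged = [c.campaign_id for c in campaigns if is_decaying(c)]
print(flagged)
```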

Decay patterns

Channels: Video/Display show steep inflation and LTV/CAC collapse; Shopping also impacted.
Lifecycle: Decline-stage worst; Unknown treated as Mature/Decline.
Maturity: Mature accounts decaying → audience exhaustion and saturation.

Recommendations

Budget triage: cut 20-35% from High-risk Video/Display with LTV/CAC <1; reallocate 15-25% to Growth/low-risk Search or stronger remarketing.
Lifecycle resets: new creatives + exclusions for Video; tighter placements and value-based bidding for Display; feed/query sculpting for Shopping.
Targeting: boost VIP/High Value B2B; suppress low-value cohorts; segment by intent/recency.
Guardrails: weekly decay dashboard; pause if CAC growth >50% WoW with LTV/CAC <1.2; alerts when efficiency percentile <35 two weeks.

Source: case-reports/dacomp-057/dacomp-057_gsb_ref_1.md

User Question: Filter payment_rate_percentage <75% and outstanding_balance >15,000; score risk, profit share, 12-month collection trend, distribution vs normal, 6-month warning, and tiered controls.

Customer Payment Risk Deep-Dive

Executive summary

High-risk cohort: 399 customers; avg risk score 53.65 (top >70).
Profit impact: 13.81% of total gross profit loss (cohort -$345.46M vs total -$2.50B).
Collections: 3-month deterioration, -5.78 pp; 12-month change -1.35 pp.
6-month warning: base/worst-case show $0 shortfall in provided forecasts (risk_adjusted_inflows == forecasted_inflows).

Key metrics

Filter: payment_rate_percentage <75% AND outstanding_balance >15,000.
Risk Score = (100 - payment_rate_percentage) ×0.4 + ((850 - credit_score)/850×100) ×0.4 + (100 - business_stability_score) ×0.2.
Median lifespan 894 vs 940 (normal); avg_invoice_amount 15,714 vs 15,365.
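
The risk score formula above translates directly into code. This sketch assumes that `payment_rate_percentage` and `business_stability_score` are on a 0-100 scale and that `credit_score` tops out at 850, as the formula implies. The example inputs are hypothetical.

```python
def risk_score(payment_rate_percentage, credit_score, business_stability_score):
    """Weighted risk score: 40% payment, 40% credit, 20% stability."""
    payment_component = (100 - payment_rate_percentage) * 0.4
    credit_component = ((850 - credit_score) / 850 * 100) * 0.4
    stability_component = (100 - business_stability_score) * 0.2
    return payment_component + credit_component + stability_component


def in_high_risk_cohort(payment_rate_percentage, outstanding_balance):
    """Cohort filter used in the report."""
    return payment_rate_percentage < 75 and outstanding_balance > 15_000


# Hypothetical customer: 60% payment rate, credit score 600, stability 50.
score = risk_score(60, 600, 50)
print(round(score, 2))
```

With these weights the score ranges from 0 (perfect payment, credit, and stability) to 100 (the worst value on all three inputs).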

Recommendations

Extreme (>=75): hold orders; require deposits/LoCs; shorten terms; enforce late fees; rapid outreach.
High (60-75): reduce limits; autopay; early-pay discounts; weekly monitoring.
Medium (45-60): proactive dunning; reminders; PO validation.
Low (<45): maintain terms; loyalty incentives.
Portfolio: prioritize collections by risk × outstanding_balance; stress-test cash at -5%/-10%/-20% collection shocks.
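
The tiering and portfolio rules above can be sketched as follows. The report gives the tier ranges as 60-75 and 45-60 without saying which side the boundaries fall on. This sketch assumes lower bounds are inclusive, and the customer records are hypothetical.

```python
def risk_tier(score):
    """Map a risk score to a control tier (lower bounds assumed inclusive)."""
    if score >= 75:
        return "Extreme"
    if score >= 60:
        return "High"
    if score >= 45:
        return "Medium"
    return "Low"


def collection_priority(customers):
    """Order customers by risk score x outstanding balance, highest first."""
    return sorted(
        customers,
        key=lambda c: c["risk_score"] * c["outstanding_balance"],
        reverse=True,
    )


def stressed_inflows(expected_inflows, shocks=(0.05, 0.10, 0.20)):
    """Apply the -5%/-10%/-20% collection shocks to expected inflows."""
    return {shock: expected_inflows * (1 - shock) for shock in shocks}


# Hypothetical customers for illustration only.
customers = [
    {"id": "C1", "risk_score": 80.0, "outstanding_balance": 16_000},
    {"id": "C2", "risk_score": 50.0, "outstanding_balance": 40_000},
    {"id": "C3", "risk_score": 62.0, "outstanding_balance": 20_000},
]
ranked = [c["id"] for c in collection_priority(customers)]
print(ranked, risk_tier(53.65))
```

Note that the cohort's average score of 53.65 lands in the Medium tier, and that a mid-risk customer with a large balance can outrank an Extreme-tier customer with a small one.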

Source: case-reports/dacomp-090/dacomp-090_gsb_ref_1.md

Rank	Model	Framework	Architecture Score	Type
1	GPT-5	DE-Agent	63.93	Proprietary
2	DeepSeek-V3.1	DE-Agent	52.66	Open
3	Gemini-2.5-Pro	DE-Agent	51.96	Proprietary
4	Qwen3-Coder	DE-Agent	51.43	Open
5	Qwen3-235B-A22B	DE-Agent	50.73	Open
6	o3	DE-Agent	48.32	Proprietary
7	Qwen3-8B	DE-Agent	45.12	Open

Rank	Model	Framework	Architecture Score	Type
1	GPT-5	DE-Agent	63.60	Proprietary
2	DeepSeek-V3.1	DE-Agent	53.08	Open
3	Gemini-2.5-Pro	DE-Agent	51.90	Proprietary
4	Qwen3-Coder	DE-Agent	51.11	Open
5	Qwen3-235B-A22B	DE-Agent	50.61	Open
6	o3	DE-Agent	48.02	Proprietary
7	Qwen3-8B	DE-Agent	46.22	Open

Rank	Model	Framework	CFS Score	CS Score	Type
1	GPT-5	DE-Agent	30.79	61.98	Proprietary
2	Gemini-2.5-Pro	DE-Agent	27.66	55.32	Proprietary
3	Qwen3-Coder	DE-Agent	23.64	54.21	Open
4	DeepSeek-V3.1	DE-Agent	22.33	50.04	Open
5	o3	DE-Agent	15.07	35.55	Proprietary
6	Qwen3-235B-A22B	DE-Agent	2.43	20.15	Open
7	Qwen3-8B	DE-Agent	1.31	15.33	Open

Rank	Model	Framework	CFS Score	CS Score	Type
1	GPT-5	DE-Agent	30.49	61.85	Proprietary
2	Gemini-2.5-Pro	DE-Agent	26.98	55.18	Proprietary
3	Qwen3-Coder	DE-Agent	23.23	54.59	Open
4	DeepSeek-V3.1	DE-Agent	22.62	50.22	Open
5	o3	DE-Agent	15.00	35.10	Proprietary
6	Qwen3-235B-A22B	DE-Agent	2.31	20.03	Open
7	Qwen3-8B	DE-Agent	1.21	15.78	Open

Rank	Model	Framework	CFS Score	Success Rate (SR)	Type
1	GPT-5	DE-Agent	38.75	20.00%	Proprietary
2	Qwen3-Coder	DE-Agent	27.12	12.00%	Open
3	o3	DE-Agent	24.42	6.00%	Proprietary
4	DeepSeek-V3.1	DE-Agent	24.11	10.00%	Open
5	Gemini-2.5-Pro	DE-Agent	23.97	8.00%	Proprietary
6	Qwen3-8B	DE-Agent	15.89	2.00%	Open
7	Qwen3-235B-A22B	DE-Agent	12.43	2.00%	Open

Rank	Model	Framework	CFS Score	Success Rate (SR)	Type
1	GPT-5	DE-Agent	37.88	20.00%	Proprietary
2	Qwen3-Coder	DE-Agent	26.59	12.00%	Open
3	DeepSeek-V3.1	DE-Agent	24.69	8.00%	Open
4	Gemini-2.5-Pro	DE-Agent	24.28	8.00%	Proprietary
5	o3	DE-Agent	24.23	6.00%	Proprietary
6	Qwen3-8B	DE-Agent	15.19	0.00%	Open
7	Qwen3-235B-A22B	DE-Agent	13.01	0.00%	Open

Rank	Model	Framework	DA Score	Type
1	GPT-5	DA-Agent	50.84	Proprietary
2	GPT-5	OpenHands	46.99	Proprietary
3	Kimi-K2	DA-Agent	41.89	Proprietary
4	Gemini-2.5-Pro	DA-Agent	34.70	Proprietary
5	DeepSeek-V3.1	DA-Agent	34.33	Open
6	DeepSeek-V3.1	OpenHands	33.87	Open
7	Gemini-2.5-Pro	OpenHands	33.38	Proprietary
8	o3	DA-Agent	28.20	Proprietary
9	o3	OpenHands	26.57	Proprietary
10	Qwen3-Coder	DA-Agent	25.13	Open
11	Qwen3-Coder	OpenHands	24.28	Open
12	Doubao-Seed-1.6	DA-Agent	20.74	Proprietary
13	Qwen3-235B-A22B	DA-Agent	13.25	Open
14	Qwen3-235B-A22B	OpenHands	12.43	Open
15	Qwen3-8B	DA-Agent	4.47	Open

Rank	Model	Framework	DA Score	Type
1	GPT-5	DA-Agent	49.49	Proprietary
2	GPT-5	OpenHands	43.69	Proprietary
3	Gemini-2.5-Pro	DA-Agent	33.75	Proprietary
4	Gemini-2.5-Pro	OpenHands	31.22	Proprietary
5	Kimi-K2	DA-Agent	31.22	Proprietary
6	o3	DA-Agent	28.70	Proprietary
7	o3	OpenHands	27.87	Proprietary
8	DeepSeek-V3.1	DA-Agent	27.75	Open
9	DeepSeek-V3.1	OpenHands	24.16	Open
10	Qwen3-Coder	DA-Agent	22.64	Open
11	Qwen3-Coder	OpenHands	21.84	Open
12	Doubao-Seed-1.6	DA-Agent	17.83	Proprietary
13	Qwen3-235B-A22B	DA-Agent	12.74	Open
14	Qwen3-235B-A22B	OpenHands	11.50	Open
15	Qwen3-8B	DA-Agent	6.33	Open

* DA Score is the aggregate of Completeness, Accuracy, Insightfulness, Readability, Analytical Depth, and Visualization scores.

BibTeX

@misc{lei2025dacompbenchmarkingdataagents,
      title={DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle}, 
      author={Fangyu Lei and Jinxiang Meng and Yiming Huang and Junjie Zhao and Yitong Zhang and Jianwen Luo and Xin Zou and Ruiyi Yang and Wenbo Shi and Yan Gao and Shizhu He and Zuo Wang and Qian Liu and Yang Wang and Ke Wang and Jun Zhao and Kang Liu},
      year={2025},
      eprint={2512.04324},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.04324}, 
}