How Much Detail Should a Training Data Explanation Include?
A factorial study that cracks open the depth-versus-accessibility tradeoff, finding the design pattern that resolves it.
Overview
Every explanation designer faces the same dilemma: give users enough information to reason critically, and you risk overwhelming them. Keep it simple, and you leave them without the tools to scrutinize what matters.
This project tackles that tension directly. Using a 2×2 factorial design, I examined how information depth (summary vs. detailed) and disclosure style (static vs. progressive) interact to shape user perceptions, cognitive load, and the quality of critiques users produce about AI systems. The findings reveal a clear, actionable design pattern and explain why one specific combination outperforms the others.
I designed and prototyped all four explanation conditions for the 2×2 study — a substantial design workload, since each condition had to be internally consistent while differing only on the intended variables. I piloted the materials with a separate group before the main study, recruited and ran all 32 participants, and conducted the statistical analysis including interaction effect testing.
The Problem
Rich, detailed explanations provide the information needed for informed evaluation. But they also impose cognitive demands that can impede the very reasoning they aim to support. Sparse summaries are easier to process but consistently leave users without the context needed to think critically or identify problems.
The Tradeoff
More information = better reasoning potential, but higher cognitive cost. The question is not which side to pick: it is whether a disclosure strategy can let users have both.
Research Questions
How does information depth (summary vs. detailed) affect user perceptions of the AI system and the quality of their critiques?
Does progressive disclosure reduce cognitive load associated with detailed explanations without sacrificing critique quality?
Are there interaction effects between information depth and disclosure style?
Research Process
Designing the 2×2
Choosing which two variables to cross — and how to operationalize each level — defined the study's ability to answer the question.
- Information depth addresses the content question: how much should an explanation say?
- Disclosure style addresses the delivery question: should all of it arrive at once, or be structured?
- They are independently manipulable — depth and delivery are orthogonal design decisions practitioners face separately
- A factorial design reveals not just main effects but whether the variables interact — which turned out to be the most important finding
- Progressive disclosure had to feel like natural pacing, not artificial gating — getting the trigger mechanism right took multiple pilot iterations
- The detailed condition had to be genuinely detailed, not just longer — content was carefully structured to add analytical depth, not repetition
- All four conditions used the same AI system and underlying data so that only format varied
- A pilot study with a separate group confirmed that each condition was experienced as intended before the main study
Study Design
Participants were randomly assigned to one of four conditions created by crossing two independent variables. All conditions used the same AI system (a real-estate recommendation tool) and the same underlying training data content.
- Summary · Static: high-level overview shown all at once. Lowest information load, lowest critique depth.
- Summary · Progressive: the same overview, revealed in steps. Added structure with minimal benefit at low depth.
- Detailed · Static: full information shown simultaneously. High load; gains partially offset by overload.
- Detailed · Progressive (best): full information revealed progressively. Highest critique quality, with cognitive load comparable to the summary conditions.
The 2×2 design. The Detailed · Progressive cell consistently outperformed all alternatives.
After reviewing their assigned explanation, participants completed the NASA Task Load Index (NASA-TLX) to measure cognitive load, then wrote an open-ended critique of the AI system. Critiques were coded for specificity, accuracy, and the identification of data-related concerns.
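The NASA-TLX is often scored in its raw (unweighted) form as the mean of its six subscale ratings. The sketch below shows that raw scoring; the study's exact scoring variant is not specified here, so treat this as one common convention rather than the procedure actually used.

```python
# Raw (unweighted) NASA-TLX: the mean of the six 0-100 subscale ratings.
SUBSCALES = ["mental_demand", "physical_demand", "temporal_demand",
             "performance", "effort", "frustration"]

def raw_tlx(ratings: dict) -> float:
    """Return the raw TLX score: the mean of six 0-100 subscale ratings."""
    missing = [s for s in SUBSCALES if s not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)

# Hypothetical ratings for one participant (illustrative values only).
example = {"mental_demand": 55, "physical_demand": 10, "temporal_demand": 30,
           "performance": 25, "effort": 50, "frustration": 20}
print(raw_tlx(example))  # ≈ 31.67
```

The weighted variant of the TLX additionally collects 15 pairwise subscale comparisons per participant; raw scoring is the simpler and widely used alternative.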
Methods
2×2 Factorial Design
Between-subjects design crossing information depth and disclosure style to test main effects and interaction.
NASA-TLX
Validated cognitive load measure capturing mental demand, effort, and frustration across the four explanation conditions.
Open-Ended Critique Task
Participants wrote free-form critiques of the AI system. Critiques were coded for depth, specificity, and data issue identification.
Likert-Scale Questionnaires
Measured perceived usefulness, satisfaction with the explanation, and trust in the AI system after each condition.
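For a balanced 2×2 between-subjects design like this one, the main-effect and interaction F statistics can be computed directly from cell means. The sketch below uses fabricated illustrative scores (not the study's data) to show how the interaction term is isolated from the two main effects.

```python
# Two-way ANOVA for a balanced 2x2 between-subjects design (pure Python).
# The cell scores below are fabricated for illustration, not the study's data.

def anova_2x2(cells):
    """cells: dict mapping (depth, disclosure) -> list of scores, equal n per cell.
    Returns F statistics for depth, disclosure, and their interaction (df = 1 each)."""
    n = len(next(iter(cells.values())))
    depths = sorted({k[0] for k in cells})
    styles = sorted({k[1] for k in cells})
    cell_mean = {k: sum(v) / n for k, v in cells.items()}
    grand = sum(cell_mean.values()) / 4
    depth_mean = {d: sum(cell_mean[(d, s)] for s in styles) / 2 for d in depths}
    style_mean = {s: sum(cell_mean[(d, s)] for d in depths) / 2 for s in styles}

    ss_depth = 2 * n * sum((m - grand) ** 2 for m in depth_mean.values())
    ss_style = 2 * n * sum((m - grand) ** 2 for m in style_mean.values())
    # Interaction: cell deviations left over after removing both main effects.
    ss_inter = n * sum(
        (cell_mean[(d, s)] - depth_mean[d] - style_mean[s] + grand) ** 2
        for d in depths for s in styles)
    ss_within = sum((y - cell_mean[k]) ** 2 for k, v in cells.items() for y in v)
    ms_within = ss_within / (4 * (n - 1))  # df_within = 4(n - 1)
    return ss_depth / ms_within, ss_style / ms_within, ss_inter / ms_within

scores = {  # hypothetical critique-quality ratings, n = 4 per cell
    ("summary", "static"): [5, 6, 5, 6],
    ("summary", "progressive"): [6, 5, 6, 7],
    ("detailed", "static"): [6, 7, 6, 5],
    ("detailed", "progressive"): [9, 8, 9, 8],
}
f_depth, f_style, f_inter = anova_2x2(scores)
print(f_depth, f_style, f_inter)  # a large interaction F flags the crossover
```

A large interaction F is what distinguishes "progressive disclosure helps everywhere" from "progressive disclosure helps mainly when the content is detailed"; in a real analysis the F values would be compared against the F(1, df_within) distribution for significance.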
"It felt like it was letting me go at my own pace. I didn't feel rushed or dumped on. I could just keep going deeper when I wanted to."
(Participant in the Detailed · Progressive condition)
Key Insights
Progressive disclosure is cognitive scaffolding
Progressive disclosure did not reduce information; it restructured how information arrived. Participants in the detailed-progressive condition processed significantly more content while reporting cognitive load levels comparable to the summary conditions.
Depth only pays off with the right delivery
Detailed-static explanations did not consistently outperform summary conditions on critique quality. The information was there, but the simultaneous delivery created overload that prevented users from fully engaging with it.
Ease of use does not mean adequacy
Users preferred summary explanations for how comfortable they felt. But that preference did not translate into better evaluation of the AI system. Perceived accessibility and actual reasoning quality are different things, and conflating them leads to under-designed transparency.
Interaction effects matter
Progressive disclosure had a significantly stronger benefit for detailed explanations than for summary ones. The mechanism is scaffolding: it only matters when there is enough complexity to scaffold. Designers should not apply progressive disclosure indiscriminately.
Design Implications
- Use detailed + progressive as the target pattern for critical evaluation tasks. When the goal is to support users in thoroughly evaluating an AI system, detailed information delivered progressively is the optimal design. It outperforms all alternatives on critique quality without the cognitive cost of static delivery.
- Reserve summaries for orientation, not evaluation. Summary explanations are appropriate when users need a quick overview before engaging more deeply, not as standalone explanations for high-stakes AI contexts where critical reasoning is needed.
- Do not optimize explanations for how comfortable they feel. User preference surveys can be misleading: participants consistently rate simpler explanations as better, even when richer ones produce superior outcomes. Explanation design should be evaluated against behavioral and reasoning measures, not just satisfaction.