Tree Testing · UI/UX Atlas

Most navigation problems are structural, not cosmetic. A user who can’t find “Return Policy” doesn’t need a prettier menu — they need the right label in the right place. Tree testing catches those structural failures early, before visual design locks in decisions that are expensive to reverse. It strips away the interface so you get clean, quantitative data about exactly where your hierarchy breaks down.

What Tree Testing Actually Measures

A tree test shows participants a plain-text, indented version of your site’s navigation (the “tree”), gives them a task, and asks them to find a specific piece of content. There are no visuals and no search bar. Participants click down through the hierarchy, moving back up a level if they change their mind, until they land on what they think is the right destination. The tool records every click, including the backtracks.

The output is behavioral, not attitudinal. You don’t ask “Is this label clear?” — you watch whether people find the right node, how long it takes, and how many wrong turns they make. The key metrics are:

Directness — percentage of participants who reached the correct answer without backtracking
Task success — percentage who landed on the correct node at all (direct + indirect)
First-click accuracy — whether the first branch chosen was the right branch (a strong predictor of overall findability)
Time on task — a proxy for cognitive effort and labeling clarity
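
As a rough illustration of how these metrics fall out of raw click logs, the sketch below assumes a hypothetical export in which each attempt is a list of full node paths (such as "Help/Returns") in click order, with the last click taken as the chosen answer. The `Attempt` class, the path format, and the function names are assumptions made for illustration, not any tool's actual export format. Time on task is left out because it needs timestamps.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    clicks: list[str]  # hypothetical format: full node paths in click order


def is_direct(clicks: list[str]) -> bool:
    """True if every click moves exactly one level below the previous click."""
    for prev, cur in zip(clicks, clicks[1:]):
        if not cur.startswith(prev + "/") or cur.count("/") != prev.count("/") + 1:
            return False
    return True


def leads_to(first_click: str, correct: set[str]) -> bool:
    """True if the first click is a correct node or an ancestor of one."""
    return any(c == first_click or c.startswith(first_click + "/") for c in correct)


def summarize(attempts: list[Attempt], correct: set[str]) -> dict[str, float]:
    """Compute per-task rates as fractions of all attempts."""
    n = len(attempts)
    answered = [a for a in attempts if a.clicks]
    successes = [a for a in answered if a.clicks[-1] in correct]
    direct = [a for a in successes if is_direct(a.clicks)]
    first_ok = [a for a in answered if leads_to(a.clicks[0], correct)]
    return {
        "task_success": len(successes) / n,
        "directness": len(direct) / n,
        "first_click_accuracy": len(first_ok) / n,
    }


attempts = [
    Attempt(["Help", "Help/Returns"]),          # direct success
    Attempt(["Shop", "Help", "Help/Returns"]),  # indirect success, wrong first click
    Attempt(["Shop", "Shop/Jackets"]),          # failure
    Attempt(["Help", "Help/Contact"]),          # right branch, wrong node
]
print(summarize(attempts, {"Help/Returns"}))
```

Passing the correct nodes as a set lets a task have more than one valid endpoint.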

First-click accuracy deserves special attention. Research by Bob Bailey and Cari Wolfson found that when users make the right first click, they succeed about 87% of the time. A wrong first click drops success to roughly 46%. That makes first-click data — in tree tests and click tests alike — one of the most actionable signals in IA research.

When to Run a Tree Test

Tree testing fits the evaluative phase of IA work. Use it:

Before redesigning navigation — establish a baseline success rate on the current structure so you can measure improvements later
After card sorting — you have a proposed taxonomy; now verify it’s actually findable before building it
After a merge or restructure — a product acquisition, new feature set, or org change has scrambled your existing IA
When analytics show navigation drop-off — you suspect a labeling or hierarchy problem and want quantitative confirmation

Do not use tree testing as a substitute for generative research. If you don’t yet know what categories users expect, start with card sorting or contextual inquiry. Tree testing validates a structure; it cannot generate one.

Sample Size: The 5-User Rule Does Not Apply Here

A common mistake is applying the “5 users is enough” heuristic to tree tests. That rule applies to qualitative studies. Tree testing produces quantitative data — success rates, directness percentages, first-click distributions — and those numbers need statistical integrity to drive confident decisions.

Practical guidance:

Goal	Minimum participants
Directional / exploratory	30–50
Statistically reliable benchmark (95% CI)	40–50 per tree version
Comparative test (two tree versions)	40–50 per variant

For most studies, 50 participants per tree version is a defensible target. It keeps confidence intervals manageable without over-recruiting. If you’re comparing two IA options, run them as separate parallel arms. Do not show both trees to the same participant — order effects will contaminate the results.
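
A back-of-envelope calculation shows why a handful of participants is not enough. The sketch below uses the normal approximation for a proportion at the worst case of 50% success. The function name is illustrative, and the approximation is rough for very small samples.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% half-width of a confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


for n in (5, 30, 50, 100):
    print(f"n={n:>3}: about +/-{margin_of_error(n) * 100:.0f} percentage points")
```

Under these assumptions, 5 participants give a margin of roughly 44 points either side, while 50 bring it down to about 14.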

Writing Effective Tree Test Tasks

The tasks you write determine the quality of your results. A poorly framed task measures reading comprehension, not findability.

Rules for good tree test tasks:

Use the participant’s language, not the product’s labels. The task cannot echo the label you’re testing. If the category is “Account Settings,” the task cannot say “Go to account settings.” Rephrase around the goal: “You want to change the email address on your account.”
Describe an outcome, not a navigation step. “Find information about returning a jacket you bought last week” tests the structure. “Go to the Returns section” gives away the answer.
Make the success state unambiguous. Before launching, document exactly which node counts as correct. Some tasks have multiple valid endpoints — a product accessible from two categories, for example. Decide in advance whether both count.
Vary task complexity. Include easy tasks (high expected success) to confirm the structure works where you think it does, alongside the harder tasks you’re most uncertain about.
Limit to 10–15 tasks per session. Beyond that, fatigue introduces noise and participants start guessing instead of thinking.
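
The label-echo rule can be partly checked automatically before launch. The sketch below is a naive check, assuming tasks and labels are plain strings: it flags a label only when all of its significant words appear in the task prompt. The stopword list and function names are illustrative, and the check does no stemming, so “return” will not match “Returns”.

```python
import re

# Small illustrative stopword list; extend it for real use.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "for", "in", "on", "your", "my"}


def significant_words(text: str) -> set[str]:
    """Lowercase the text and keep alphanumeric words that are not stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def echoed_labels(task: str, labels: list[str]) -> list[str]:
    """Return labels whose significant words all appear in the task prompt."""
    task_words = significant_words(task)
    return [
        label for label in labels
        if significant_words(label) and significant_words(label) <= task_words
    ]


labels = ["Account Settings", "Returns & Refunds", "Help"]
print(echoed_labels("Go to account settings", labels))
print(echoed_labels("You want to change the email address on your account.", labels))
```

The first prompt is flagged for “Account Settings”; the rephrased prompt passes. A human review is still needed for near-synonyms and partial echoes.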

Do

Write tasks as realistic user goals: “You forgot your password and need to reset it.”
Keep task language scenario-based and vendor-neutral.
Pre-define all valid correct nodes before launching.
Pilot the task set with 2–3 internal participants to surface ambiguous wording.

Don’t

Echo any label from the tree inside the task prompt; this defeats the test.
Write tasks that require specialized prior knowledge about your product.
Assume there is always a single correct answer; some items legitimately live in multiple valid locations.
Run the study without piloting, then realize tasks are broken after 40 responses.

Running a Tree Test: Tooling and Protocol

Widely used purpose-built tools include Optimal Workshop’s Treejack, UserZoom, and Maze. All three produce the standard directness and success metrics, first-click dendrograms, and path analysis visualizations. For budget-constrained teams, Maze has offered a free tier that supports basic tree tests; check the current plan limits before committing.

Setting up the tree:

Import your hierarchy as a flat outline — most tools accept tab-indented text or a spreadsheet
Do not include design or UI chrome; the tree is deliberately stripped of visual context
Depth matters: trees that are too shallow (1–2 levels) don’t reflect real navigation decisions; trees that are too deep (7+ levels) exhaust participants and produce noise
A practical sweet spot is 3–5 levels deep, with no node having more than 10–12 children
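
Before importing, the depth and breadth guidance above can be checked with a short script. This is a minimal sketch that assumes a well-formed outline with one tab per level and no skipped levels; the function names are illustrative.

```python
def parse_outline(text: str) -> list[tuple[int, str]]:
    """Turn a tab-indented outline into (depth, label) pairs; top level is depth 1."""
    nodes = []
    for line in text.splitlines():
        if line.strip():
            depth = len(line) - len(line.lstrip("\t")) + 1
            nodes.append((depth, line.strip()))
    return nodes


def tree_stats(nodes: list[tuple[int, str]]) -> tuple[int, int]:
    """Return (maximum depth, largest number of children under any one parent)."""
    child_counts: dict[str, int] = {}
    ancestors: list[str] = []  # ancestors[i] is the current node at depth i + 1
    for depth, label in nodes:
        del ancestors[depth - 1:]  # drop anything at this depth or deeper
        parent = "/".join(ancestors) or "(root)"
        child_counts[parent] = child_counts.get(parent, 0) + 1
        ancestors.append(label)
    return max(d for d, _ in nodes), max(child_counts.values())


outline = "Shop\n\tJackets\n\tShoes\nHelp\n\tReturns\n\t\tStart a return\n\tContact\n"
max_depth, max_children = tree_stats(parse_outline(outline))
print(max_depth, max_children)
if not 3 <= max_depth <= 5 or max_children > 12:
    print("Outside the suggested range; review before launching.")
```

The thresholds mirror the rule of thumb in the list above and are easy to adjust for a specific study.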

Recruitment:

Recruit participants who match your actual user population using screener surveys. Tree tests are well-suited to unmoderated remote delivery. The absence of a visual interface removes the need for a moderator and enables faster, cheaper fielding at scale.

Pilot run:

Always run 3–5 pilot sessions before full launch. This catches ambiguous tasks, broken tree structure, or unexpected disputes about correct answers.

Reading and Acting on Results

Once you have data, analysis works in two passes: macro (structure-level) first, then micro (node-level).

Macro pass — identify problem areas:

Flag any task with directness below 50%: most participants are not following the intended path
Task success below 70% signals a structural problem, not just a labeling edge case
Compare directness vs. success: a task with 40% directness but 80% success means people wander but eventually recover — the structure is navigable, but not intuitive
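
The macro pass amounts to applying a few thresholds per task. A minimal sketch, using the 50% and 70% cutoffs from the list above and made-up task results:

```python
def macro_flags(directness: float, success: float) -> list[str]:
    """Apply the rule-of-thumb thresholds from the macro pass to one task."""
    flags = []
    if success < 0.70:
        flags.append("success below 70%: likely structural problem")
    if directness < 0.50:
        flags.append("directness below 50%: intended path is not being followed")
    if directness < 0.50 and success >= 0.70:
        flags.append("people wander but recover: navigable, not intuitive")
    return flags


# Hypothetical results: task -> (directness, success)
results = {
    "Return a jacket": (0.40, 0.80),
    "Reset password": (0.85, 0.92),
    "Update billing": (0.35, 0.55),
}
for task, (directness, success) in results.items():
    print(task, macro_flags(directness, success))
```

Tasks that come back with no flags can be set aside; the flagged ones go forward to the micro pass.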

Micro pass — diagnose the cause:

Open the path-analysis view (a flow or dendrogram of clicks per task)
Find the branch where participants first go wrong — that is the IA failure point
Look for “gravity wells” — nodes that attract large numbers of wrong clicks, indicating a label mismatch or an expected location that differs from the actual one
Watch for “pogo-sticking”: participants who bounce repeatedly between siblings signal that labeling across that level is ambiguous
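
When the tool’s visualizations are not enough, the same diagnosis can be approximated from raw click paths. The sketch below takes one simplified reading of “gravity well”: for each participant it finds the first click that leaves the correct path, then counts how often each node is that first wrong turn. The path format (full node paths in click order) and the function names are assumptions for illustration.

```python
from collections import Counter
from typing import Optional


def first_divergence(clicks: list[str], correct_path: list[str]) -> Optional[str]:
    """Return the first clicked node that is not on the correct path, or None."""
    on_path = set(correct_path)
    for node in clicks:
        if node not in on_path:
            return node
    return None


def gravity_wells(all_clicks: list[list[str]], correct_path: list[str], top: int = 3):
    """Count first wrong turns across participants, most frequent first."""
    counts = Counter()
    for clicks in all_clicks:
        wrong = first_divergence(clicks, correct_path)
        if wrong is not None:
            counts[wrong] += 1
    return counts.most_common(top)


correct_path = ["Help", "Help/Returns"]
all_clicks = [
    ["Help", "Help/Returns"],
    ["Shop", "Help", "Help/Returns"],
    ["Shop", "Shop/Jackets"],
    ["Account", "Account/Orders"],
]
print(gravity_wells(all_clicks, correct_path))
```

In this made-up data, “Shop” is the most common first wrong turn, which would make it the first label to investigate.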

Translating findings into IA changes:

Tree test data tells you which nodes are failing, but not why. Before making changes, pair your findings with qualitative evidence:

Review any think-aloud notes from piloted or moderated sessions
Run a follow-up card sort focused only on the nodes that failed — this surfaces the mental model your labels are violating
If gravity wells point to a specific wrong node, investigate whether the content belongs there in addition to its current location. Polyhierarchy (placing content in more than one location) may be a better fix than relabeling

Benchmarking and Tracking Progress

Tree testing delivers maximum value when run iteratively, not as a one-off. After the initial diagnostic run, use the baseline metrics to set targets:

Commonly cited rules of thumb put a well-designed navigation structure at 70–80% task success and above 60% directness on representative tasks; treat these as rough targets rather than fixed standards
Track improvement across redesign iterations by running the same task set on the revised tree
Report confidence intervals alongside raw percentages: a 74% success rate from 50 participants (95% CI roughly 60–85%) is a very different claim from a similar rate from 8 participants (95% CI roughly 39–94%)
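
One common way to compute such an interval is the Wilson score interval, sketched below. The choice of method is an assumption here; other methods (exact or adjusted Wald, for example) give bounds that differ by a point or two, so the output will not necessarily match figures computed elsewhere.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


for successes, n in ((37, 50), (6, 8)):
    low, high = wilson_interval(successes, n)
    print(f"{successes}/{n}: {successes / n:.0%} success, 95% CI {low:.0%} to {high:.0%}")
```

The two cases have nearly the same headline rate, but the interval for 8 participants is more than twice as wide as the one for 50.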

Presenting confidence intervals to stakeholders shifts the conversation from “did we hit 70%?” to “what range of outcomes can we expect in production?” — a more honest and more useful framing for design decisions.

Tree Testing vs. Other IA Validation Methods

Knowing where tree testing fits in the broader toolkit prevents both over-reliance and under-use:

Method	Measures	Sample size	Phase
Card sorting (open)	How users group and label content	15–30	Generative
Card sorting (closed)	How users map content to existing categories	20–40	Evaluative
Tree testing	Whether users can find content in the hierarchy	30–50+	Evaluative
First-click testing	Whether navigation entry points are predictable	20–50	Evaluative
Moderated usability test	Why users succeed or fail in a real interface	5–8	Evaluative / qualitative
Analytics path analysis	How users actually navigate in production	All users	Continuous

A solid IA research program typically flows like this: open card sort to discover categories, closed card sort to validate them, tree test to confirm findability, then moderated sessions to catch edge cases before launch. Analytics close the loop post-launch.

Common Mistakes That Invalidate Results

Even well-intentioned teams routinely undermine tree test validity with a handful of avoidable errors:

Testing the sitemap, not the mental model — importing a navigation structure built from internal org logic rather than user language will confirm that the structure fails, but won’t point to the fix
Skipping the pilot — ambiguous tasks produce ambiguous data; a 15-minute pilot with 3 colleagues catches most problems
Using too small a sample — applying the qualitative 5-user heuristic to a quantitative method produces confidence intervals too wide to act on
Treating success rate as the only metric — directness, first-click accuracy, and gravity wells are often more actionable than the headline success percentage
Changing more than you intend between arms of a comparative study: the arms should differ only in the nodes being tested, because any other difference in the tree confounds the comparison