UX Metrics Frameworks: HEART, PULSE, GSM & CASTLE
Master four complementary UX measurement systems — HEART, PULSE, GSM, and CASTLE — and learn how to tie design decisions directly to outcomes that matter.
Shipping a redesign without measuring its impact is a guess dressed up as a decision. The four frameworks in this lesson, HEART (Google), PULSE (the traditional product-health metrics that HEART was designed to complement), GSM (Goals-Signals-Metrics), and CASTLE (an emerging enterprise model), give you structured vocabularies for turning fuzzy design goals into trackable signals. That makes conversations with product managers, engineers, and executives far more concrete.
Knowing when to use each framework, and how to combine them, is the difference between saying “we improved the experience” and saying “task success rose from 61% to 84%, reducing support contacts by 22%.”
Why Frameworks Beat Ad-Hoc Metrics
Most teams default to metrics that are already set up: page views, daily active users (DAU), NPS. These aren’t useless. But they’re incomplete as UX evidence because they conflate usage with satisfaction, and engagement with real value.
The deeper problem is the say/do gap: attitudinal surveys measure what users say, not what they do. A user can answer 8 out of 10 on the likelihood-to-recommend question behind NPS and still abandon your checkout every time the address form errors. Behavioral data (task success rates, time-on-task, error rates) shows what actually happened, so it is the more reliable evidence of usability. Good metrics frameworks push you to instrument both layers and compare them.
Frameworks also create alignment artifacts. When a designer, PM, and data analyst all sign off on the same GSM tree, you’ve ended the “we measure success differently” argument before it starts.
The HEART Framework
Google’s HEART framework was introduced by Kerry Rodden and colleagues in a 2010 CHI paper. It gives product teams five categories that together describe the full user experience.
| Category | What it captures | Example metric |
|---|---|---|
| Happiness | Subjective satisfaction, perceived ease | UMUX-Lite score, CSAT |
| Engagement | Depth and frequency of use | Sessions per user per week, features used |
| Adoption | New users reaching a key milestone | % of new users completing onboarding in 7 days |
| Retention | Users returning over time | 30-day retention, churn rate |
| Task Success | Objective usability — completion rate, time, errors | Task completion rate, time-on-task |
A few nuances practitioners often miss:
- Not all five categories apply to every project. A search-results redesign might focus on Task Success and Happiness. A notification-system redesign might focus on Engagement and Retention. Pick the categories that match the problem.
- Engagement is a double-edged signal. High engagement can mean users love the product, or that it’s confusing and they’re spending extra time recovering from errors. Always pair Engagement metrics with Task Success so you can tell which of the two you are looking at.
- Happiness metrics must be validated instruments. Home-grown satisfaction questions have no psychometric baseline, so comparisons over time become unreliable. Use UMUX-Lite (2 items, Likert-7) or SUS for standardized benchmarking.
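To make the Happiness point concrete, here is a minimal sketch of how the two standardized instruments are commonly scored. It assumes the usual conventions: UMUX-Lite as two 7-point items rescaled to 0-100, and SUS as ten 5-point items with alternating positive and negative wording. The function names and sample responses are illustrative, not taken from any particular survey tool.

```python
from statistics import mean


def umux_lite_score(item1: int, item2: int) -> float:
    """Raw UMUX-Lite score (0-100) from two 7-point agreement items."""
    for item in (item1, item2):
        if not 1 <= item <= 7:
            raise ValueError("UMUX-Lite items must be on a 1-7 scale")
    # Each item contributes 0-6 points; 12 is the maximum total.
    return ((item1 - 1) + (item2 - 1)) / 12 * 100


def sus_score(responses: list[int]) -> float:
    """SUS score (0-100) from ten 5-point items in questionnaire order."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses on a 1-5 scale")
    total = 0
    for index, response in enumerate(responses):
        # Items 1, 3, 5, ... are positively worded; 2, 4, 6, ... negatively.
        total += (response - 1) if index % 2 == 0 else (5 - response)
    return total * 2.5


# Hypothetical responses, one (item1, item2) tuple per participant.
umux_responses = [(6, 7), (5, 5), (7, 6), (3, 4)]
print(round(mean(umux_lite_score(a, b) for a, b in umux_responses), 1))
```

Because the scoring is fixed, a score collected this quarter can be compared with one collected next quarter, which is exactly what a home-grown question cannot offer.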
Goals-Signals-Metrics (GSM)
GSM is not a separate tool. It is the process for populating HEART (or any metrics framework) with meaningful numbers. The hierarchy has three levels:
- Goal: What is the user trying to accomplish? What is the product trying to achieve? Write goals in plain language: “Users should be able to find the right pricing plan without contacting support.”
- Signal: Observable user behavior that shows progress toward the goal. Signals can be positive (completed plan selection) or negative (rage-clicked the compare table). Define signals before asking what’s feasible to instrument.
- Metric: The specific, quantified measurement of a signal. For example: “% of users who complete plan selection within a single session without contacting support.”
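The example metric above can be sketched as a small calculation over an event log. This is one possible reading of the metric, not a canonical definition: it assumes three hypothetical event names (`pricing_viewed`, `plan_selected`, `support_contacted`), counts only users who viewed pricing in the denominator, and treats a user as successful if any single session contains a view and a selection with no support contact.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user_id: str
    session_id: str
    name: str  # hypothetical event name, e.g. "plan_selected"


def plan_selection_success_rate(events: list[Event]) -> float:
    """Share of pricing viewers who pick a plan in one session without support."""
    sessions: dict[tuple[str, str], set[str]] = defaultdict(set)
    for event in events:
        sessions[(event.user_id, event.session_id)].add(event.name)

    started: set[str] = set()
    succeeded: set[str] = set()
    for (user_id, _session_id), names in sessions.items():
        if "pricing_viewed" not in names:
            continue
        started.add(user_id)
        if "plan_selected" in names and "support_contacted" not in names:
            succeeded.add(user_id)

    if not started:
        raise ValueError("no user viewed the pricing page")
    return len(succeeded) / len(started)


# Hypothetical log: u1 succeeds, u2 needs support, u3 needs a second session.
sample_events = [
    Event("u1", "s1", "pricing_viewed"),
    Event("u1", "s1", "plan_selected"),
    Event("u2", "s1", "pricing_viewed"),
    Event("u2", "s1", "support_contacted"),
    Event("u2", "s1", "plan_selected"),
    Event("u3", "s1", "pricing_viewed"),
    Event("u3", "s2", "plan_selected"),
]
print(plan_selection_success_rate(sample_events))
```

Writing the metric as code forces the definitional choices (who is in the denominator, what counts as "a single session") that a one-line metric description leaves open.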
Building a GSM Tree
Build the GSM tree collaboratively in a workshop with design, PM, engineering, and data. The output is a table, not a paragraph. Here is a condensed example for an enterprise SaaS dashboard:
| Goal | Signal | Metric |
|---|---|---|
| Users find the right report quickly | Users reach the target report without backtracking | % of sessions reaching report with zero navigation backtrack |
| Dashboard feels trustworthy | Users don’t question data freshness | % of users who hover the last-updated timestamp (proxy for doubt) |
| Teams adopt sharing features | Users send a report link to a colleague | 30-day share-to-viewer conversion rate |
The signal step is where GSM enforces discipline. Most teams jump straight from goal to metric and skip the behavioral reasoning. The signal layer forces the question: “What would we actually observe if users were succeeding or failing?” That surfaces blind spots in your instrumentation plan before you’re stuck logging the wrong events.
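If your team keeps the GSM tree in a repository rather than a slide, a small data structure can flag the skipped steps automatically. The sketch below is illustrative only: the `Goal` and `Signal` classes and the `find_gaps` helper are hypothetical names, and the rows reuse the dashboard example from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    description: str
    metrics: list[str] = field(default_factory=list)


@dataclass
class Goal:
    description: str
    signals: list[Signal] = field(default_factory=list)


def find_gaps(goals: list[Goal]) -> list[str]:
    """List goals with no signal and signals with no metric."""
    gaps: list[str] = []
    for goal in goals:
        if not goal.signals:
            gaps.append(f"Goal without signal: {goal.description}")
        for signal in goal.signals:
            if not signal.metrics:
                gaps.append(f"Signal without metric: {signal.description}")
    return gaps


gsm_tree = [
    Goal(
        "Users find the right report quickly",
        [Signal(
            "Users reach the target report without backtracking",
            ["% of sessions reaching report with zero navigation backtrack"],
        )],
    ),
    # Signal defined, metric still to be agreed in the workshop.
    Goal(
        "Teams adopt sharing features",
        [Signal("Users send a report link to a colleague")],
    ),
    # Goal written down, behavioral reasoning not done yet.
    Goal("Dashboard feels trustworthy"),
]

for gap in find_gaps(gsm_tree):
    print(gap)
```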
The PULSE Framework
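PULSE is the name Rodden and colleagues gave, in the same 2010 paper that introduced HEART, to the traditional business and technical metrics that pre-date it. Its categories lean more toward product health and engineering observability.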
| Category | Meaning | Typical source |
|---|---|---|
| Page views | Volume of use and reach | Analytics |
| Uptime | Reliability and availability | SRE / monitoring |
| Latency | Perceived performance | RUM (Real User Monitoring) |
| Seven-day active users | Engagement breadth | Product analytics |
| Earnings | Business outcome proxy | Revenue data |
PULSE is primarily a product health dashboard, not a UX quality instrument. It tells you whether the product is working and being used. It does not tell you whether users are succeeding or satisfied. The classic mistake is treating PULSE metrics — especially page views and 7-day actives — as the main measure of success, equating engagement with value.
PULSE shines in joint OKR conversations with engineering and business. Designers often lack visibility into latency and uptime data, but both have direct UX consequences. A 200 ms vs. 2 s response time is a user-experience difference, not just an infrastructure difference. Including Latency in your PULSE dashboard gives you a hook for advocating performance budgets.
HEART + PULSE Together
Use them at different altitudes:
- PULSE as the operational baseline — the product must be fast, reliable, and used.
- HEART + GSM as the UX quality layer — within that working product, are users succeeding and satisfied?
If a regression in PULSE Latency shows up and your HEART Task Success data confirms it is causing abandonment, that multi-signal case is a compelling argument for a performance sprint.
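One way to build that multi-signal case is to split sessions by latency and compare Task Success across the two groups. The sketch below assumes each session record carries a latency measurement and a completion flag; the `Session` class and the 1000 ms threshold are illustrative choices, not recommended values. A gap between the groups is a correlation to investigate, not proof that latency caused the abandonment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Session:
    latency_ms: float      # PULSE side: measured response time
    task_completed: bool   # HEART side: Task Success


def completion_by_latency(
    sessions: list[Session], threshold_ms: float = 1000.0
) -> dict[str, Optional[float]]:
    """Task completion rate for sessions at or below the threshold vs. above it."""
    counts = {"fast": [0, 0], "slow": [0, 0]}  # [completed, total]
    for session in sessions:
        bucket = "fast" if session.latency_ms <= threshold_ms else "slow"
        counts[bucket][0] += int(session.task_completed)
        counts[bucket][1] += 1
    return {
        bucket: (completed / total if total else None)
        for bucket, (completed, total) in counts.items()
    }


# Hypothetical sessions for illustration.
sample_sessions = [
    Session(200, True), Session(300, True), Session(250, False),
    Session(2000, False), Session(2500, True),
    Session(1800, False), Session(2200, False),
]
print(completion_by_latency(sample_sessions))
```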
The CASTLE Framework
CASTLE is a more recent framework designed specifically for complex enterprise and B2B products, where the user, the buyer, and the business beneficiary are often different people.
| Category | What it captures | Example metric |
|---|---|---|
| Completion | Task and workflow completion | End-to-end process success, not just micro-task success |
| Adoption | Feature and platform uptake | % of licensed users actively using key capabilities |
| Satisfaction | Attitudinal quality | Validated survey scores (UMUX-Lite, CES) |
| Time efficiency | Effort and speed | Time-on-task, Customer Effort Score |
| Learnability | Onboarding and skill acquisition | Time-to-first-value, training hours required |
| Errors | Quality and reliability | Error rate, recovery rate, escalations to support |
CASTLE acknowledges a reality HEART underweights: in enterprise software, adoption and learnability are first-class UX outcomes. A feature that power users love but 80% of licensed seat holders have never touched is a UX failure — even if its HEART scores are excellent.
CASTLE in Practice
Enterprise UX teams typically run CASTLE measurement at three levels:
- Product level — aggregate scores across the full platform
- Workflow level — scores for specific end-to-end processes (for example, the invoice-approval workflow)
- Feature level — scores for individual components (for example, the bulk-edit panel)
This hierarchy maps cleanly onto enterprise user research programs. Product-level CASTLE informs roadmap prioritization. Workflow-level findings drive redesign scopes. Feature-level data validates individual design decisions.
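A short calculation shows why the workflow level matters alongside the product level. The sketch below uses CASTLE Completion only and made-up counts; the vendor-onboarding workflow and the 0.7 floor are hypothetical.

```python
def completion_rates(attempts: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Completion rate per workflow from (completed, attempted) counts."""
    return {
        name: completed / attempted
        for name, (completed, attempted) in attempts.items()
        if attempted > 0
    }


def product_completion_rate(attempts: dict[str, tuple[int, int]]) -> float:
    """Product-level rate: all completions over all attempts."""
    completed = sum(done for done, _ in attempts.values())
    attempted = sum(total for _, total in attempts.values())
    if attempted == 0:
        raise ValueError("no attempts recorded")
    return completed / attempted


def workflows_below(attempts: dict[str, tuple[int, int]], floor: float) -> list[str]:
    """Workflows whose completion rate falls under the chosen floor."""
    rates = completion_rates(attempts)
    return sorted(name for name, rate in rates.items() if rate < floor)


# Hypothetical counts: (completed, attempted) per workflow.
workflow_attempts = {
    "invoice-approval": (180, 200),
    "vendor-onboarding": (12, 40),
}
print(product_completion_rate(workflow_attempts))
print(completion_rates(workflow_attempts))
print(workflows_below(workflow_attempts, floor=0.7))
```

In this made-up data the product-level rate looks healthy because the high-volume workflow dominates it, while the low-volume workflow fails most of the time.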
Do
Define CASTLE metrics per workflow, not just per product. A single aggregate CASTLE score hides which workflows are dragging down the overall experience. Map Completion and Time Efficiency metrics to specific job-to-be-done flows so you know exactly where to invest.
Don't
Don’t apply CASTLE to consumer apps — it creates unnecessary overhead. CASTLE earns its complexity only when you have multi-role workflows, large licensed seat counts, formal onboarding programs, and an IT/admin buyer persona separate from the end user.
Choosing the Right Framework
| Scenario | Recommended approach |
|---|---|
| Consumer app, single user type | HEART + GSM |
| Enterprise SaaS, complex workflows | CASTLE + GSM |
| Cross-functional product health review | PULSE as baseline, HEART for UX layer |
| Usability study on a specific feature | HEART Task Success + SUS/UMUX-Lite |
| OKR-setting with leadership | GSM tree to back up HEART categories with specific metrics |
The most common mistake is treating these as competing methodologies. They are complementary lenses. PULSE tells you the engine is running. HEART tells you the driver is comfortable. CASTLE tells you the whole logistics fleet is moving cargo efficiently. GSM is the process you use to decide which gauge on the dashboard to add next.
Connecting Metrics to Outcomes
Frameworks only create value if they connect to decisions. Three practices close that loop.
1. Set directional baselines before a project starts. Without a pre-launch baseline, post-launch improvement is unverifiable. Even a two-week instrumentation sprint to capture current Task Success rates before a redesign gives you something to compare against.
2. Define a North Star metric per initiative, not per product. A single product can have multiple initiatives running at the same time, each with its own primary metric from the HEART or CASTLE vocabulary. Combining them into one product-level number makes it impossible to attribute cause.
3. Report outcome-tied metrics, not activity metrics. “We ran 12 usability tests” is activity. “Task success on the filter panel increased from 54% to 78% after redesign” is an outcome. Present the latter to leadership; keep the former in your team’s internal documentation.
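When you report a before/after change, it helps to attach an interval so leadership can see how much of the movement could be noise. The sketch below uses a simple normal-approximation (Wald) interval for the difference between two proportions, which is one common choice rather than the only valid one. The sample sizes of 50 participants per round are assumed for illustration; the success rates mirror the 54% and 78% example above.

```python
from math import sqrt
from statistics import NormalDist


def success_rate_change(
    before_successes: int,
    before_n: int,
    after_successes: int,
    after_n: int,
    confidence: float = 0.95,
) -> tuple[float, float, float]:
    """Return (difference, low, high) for after minus before, Wald interval."""
    if before_n <= 0 or after_n <= 0:
        raise ValueError("both samples need at least one participant")
    p_before = before_successes / before_n
    p_after = after_successes / after_n
    diff = p_after - p_before
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = sqrt(
        p_before * (1 - p_before) / before_n + p_after * (1 - p_after) / after_n
    )
    return diff, diff - z * standard_error, diff + z * standard_error


# Assumed data: 27 of 50 succeeded before the redesign, 39 of 50 after.
diff, low, high = success_rate_change(27, 50, 39, 50)
print(f"{diff:+.0%} (95% CI {low:+.0%} to {high:+.0%})")
```

With these assumed counts the interval runs from roughly +6 to +42 percentage points: the improvement is likely real, but its size is far less certain than the headline number suggests.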
Mixed-Method Triangulation
No single metric tells the full story. Modern UX measurement triangulates across three data types:
- Behavioral (what users do) — task completion rates, click paths, error logs, funnel drop-offs
- Attitudinal (what users say) — UMUX-Lite, CES, open-ended interview themes
- Physiological / biometric (what users show) — eye tracking, facial expression coding (less common; most valuable for high-stakes decisions)
HEART and CASTLE provide the vocabulary. GSM determines which signals to instrument. Mixed-method triangulation determines which data sources fill those signals. Relying on attitudinal surveys alone to predict behavior is the canonical say/do gap mistake: users often report higher ease and satisfaction than their behavior supports.
For quantitative benchmarking studies (comparing your product to a competitor, or to your own baseline from a year ago) a common guideline is at least 40 participants. At 95% confidence, that gives a margin of error of roughly ±15 percentage points for a binary measure such as task success. The “5-user rule” applies only to qualitative problem-finding studies. Applying it to quantitative benchmarking leaves a margin of error so wide that the numbers are close to meaningless.
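The sample-size guidance follows from the standard margin-of-error formula for a proportion. The sketch below uses the normal approximation and the worst case of a 50% success rate; it is a rough planning aid, and small samples in particular are better served by adjusted intervals.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n: int, p: float = 0.5, confidence: float = 0.95) -> float:
    """Half-width of the normal-approximation interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


for participants in (5, 20, 40, 100):
    points = margin_of_error(participants) * 100
    print(f"n={participants}: +/- {points:.1f} percentage points")
```

Under these assumptions, 5 participants give a margin of about 44 points and 40 participants about 15 points, which is why the two study types need such different sample sizes.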