---
title: "Stop A/B testing the colour of buttons · start A/B testing the size of the team"
description: "A/B testing button colour returns 0.3% lift on a good day. Changing the team's size, focus or interface to the rest of the company moves the needle by 30%. Why most teams optimise the wrong variable."
date: 2026-04-26T13:00:00.000Z
updated: 2026-04-26T13:00:00.000Z
author: "Dezso Mezo"
tags: "Startup, Experimentation, Engineering management, Product"
slug: stop-ab-testing-buttons-team-size
canonical: https://dfieldsolutions.com/blog/stop-ab-testing-buttons-team-size
---

# Stop A/B testing the colour of buttons · start A/B testing the size of the team

Most A/B tests are theatre. Here is what actually moves a startup's metrics, and why team-shape is the experiment your competitors are afraid to run.

Every product team I have audited in the last three years has the same shape of experiment backlog. Eighty percent of the queued tests are 'colour of CTA', 'copy on the headline', 'order of these three steps'. The expected lift on each, by the team's own pre-experiment estimate, is 0.5-2%. The actual lift, after running, is somewhere between -0.5% and +0.7%, with confidence intervals that touch zero on both sides.
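For concreteness, here is a minimal sketch of why those button-colour results are indistinguishable from zero: a two-proportion, normal-approximation confidence interval on the relative lift. The numbers are illustrative, not from any client.

```python
import math

def lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """95% CI on the relative lift of variant B over control A (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    # Express the interval as a lift relative to the control rate.
    return (diff - z * se) / p_a, (diff + z * se) / p_a

# Hypothetical button-colour test: 50k users per arm, ~4% baseline conversion.
lo, hi = lift_ci(2000, 50_000, 2010, 50_000)
print(f"relative lift CI: [{lo:+.1%}, {hi:+.1%}]")  # straddles zero
```

With traffic at this scale, a sub-1% true lift produces an interval that comfortably contains zero on both sides, which is exactly the backlog pattern described above.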

The remaining twenty percent of queued experiments · the ones about team shape, on-call rotation, who owns which feature · are never run. They are scary, they are political, and the team does not know how to measure them. So they ship colour swaps and feel productive.

This is a contrarian take. The experiments that move startup metrics by 10-30% are about team-shape, not pixel-shape. Here are four we have helped clients run. The numbers are real, blurred just enough to protect the clients.

## Experiment 1 · Halve the product team

A B2B SaaS client: 11 product engineers, weekly velocity 14 story points (their measure). We split them into two squads: a 4-person 'core product' squad and a 7-person 'platform / migration' squad. Six weeks later the 4-person squad had shipped twice as much customer-facing change as the full 11-person team had managed in the prior six weeks. The 7-person squad caught up on three years of accrued debt that had been silently throttling everything.

The lesson is not 'fire half the team'. The lesson is that a single 11-person team is almost always the wrong shape. Two specialised teams of 4-7 do more than one 'aligned' team of 11. The 'team A/B test' here is splitting the team in half by responsibility and seeing which side moves the metric you care about.

## Experiment 2 · Move the support team into the engineering org

A consumer app client. Support reported into ops. Engineers shipped a feature; support discovered three weeks later that 8% of users could not get past onboarding. Reproduction took two weeks because the support team was emotionally and organisationally distant from engineering. Total time to fix an onboarding bug: 9 weeks.

We moved support's reporting line into engineering, with a daily 15-minute sync between the support lead and the on-call engineer. Time to fix dropped to 4 days. NPS went from 31 to 47 in one quarter. Net product velocity did not change · the engineers were not 'distracted' by support; they were finally hearing reality.

## Experiment 3 · Kill the standing meeting

A 30-person engineering org. 14 weekly standing meetings on the calendar, totalling 9.5 hours per attendee per week. We killed all of them for one month and replaced them with a single 30-minute Monday kickoff per squad and a Friday async written update.

Output per engineer (PRs merged, story points, your favourite measure) went up 22% in the first month. After two months we reintroduced two of the meetings · the ones people genuinely missed. Net: 6 hours per week per engineer reclaimed, with no measurable downside.

## Experiment 4 · Pair every senior with a junior for 6 weeks

Most teams treat pair programming as a special occasion. We ran a 6-week experiment in which every senior had a pinned junior pair, working the same ticket queue 4 hours a day. Junior ramp time (time to ship a first independent feature) dropped from a median of 14 weeks to 6. Senior throughput dropped 12% during those 6 weeks but rose 18% afterwards, because they stopped fielding 'how do I do X' questions all day.

## Why nobody runs these tests

- They are political. Splitting a team or moving a reporting line touches egos and contracts. Button colour does not.
- They are slow to measure: 6-12 weeks per cycle versus 2 weeks for a UI A/B test. PMs prefer fast results.
- They do not fit the standard A/B testing tools. Optimizely will not run a team-structure experiment for you.
- They are scary. If team shape A wins by 30%, you have a hard conversation about why you ran shape B for 18 months.
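The 'slow to measure' objection is really a sample-size argument in reverse. A rough normal-approximation sketch (80% power, 5% two-sided alpha, hypothetical baseline rate · not client data) of how many users per arm each effect size demands:

```python
import math

def n_per_arm(base_rate, rel_lift, alpha_z=1.96, power_z=0.84):
    """Approximate users per arm for a two-proportion z-test to detect a
    relative lift at 80% power, 5% two-sided alpha. A sketch, not exact."""
    p1 = base_rate
    p2 = base_rate * (1 + rel_lift)
    pbar = (p1 + p2) / 2
    num = (alpha_z * math.sqrt(2 * pbar * (1 - pbar))
           + power_z * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

# Assumed 4% baseline conversion:
print(n_per_arm(0.04, 0.005))  # users per arm to detect a 0.5% relative lift
print(n_per_arm(0.04, 0.30))   # users per arm to detect a 30% relative lift
```

A 0.5% lift at a 4% baseline needs millions of users per arm; a 30% lift needs a few thousand. Team-shape effects are big enough that a 30-person org generates sufficient signal in weeks, even without an A/B testing tool.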

## How to actually run a team shape experiment

1. Pick one metric. Not 5. Customer-facing PRs per week, or NPS, or revenue.
2. Define the change in one sentence: 'We are merging the X and Y squads for 8 weeks' or 'support reports into engineering for 8 weeks'.
3. Pre-commit to the duration. No mid-experiment changes. 8-12 weeks is usually right.
4. Snapshot the metric before you start: the trailing 8-week average, not last week.
5. At the end · keep, revert, or escalate. Make the call in writing, with the data, within 48 hours.
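The five steps above amount to a few lines of bookkeeping. This is an illustrative sketch with made-up numbers, not a client's tracker; the `keep_threshold` is an assumed parameter you would pre-commit to in step 3:

```python
from statistics import mean

def trailing_avg(weekly_values, weeks=8):
    """Step 4: baseline is the trailing 8-week average, not last week."""
    return mean(weekly_values[-weeks:])

def make_the_call(baseline, experiment_weeks, keep_threshold=0.10):
    """Step 5: keep if the experiment period beats baseline by the threshold,
    revert if it is clearly worse, otherwise escalate to a human decision."""
    observed = mean(experiment_weeks)
    change = (observed - baseline) / baseline
    if change >= keep_threshold:
        return "keep", change
    if change <= -keep_threshold:
        return "revert", change
    return "escalate", change

# Hypothetical customer-facing-PRs-per-week series:
before = [12, 14, 11, 13, 12, 15, 13, 14, 12, 13]
during = [16, 18, 17, 19, 18, 17, 18, 19]
baseline = trailing_avg(before)
decision, change = make_the_call(baseline, during)
print(decision, f"{change:+.0%}")
```

Writing the decision rule down before the experiment starts is the point: it removes the temptation to re-litigate the threshold once the numbers are in.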

> **TIP:** If your last 10 experiments were all UI tweaks and the company metric did not move, the experiment to run is on yourself · stop optimising the wrong variable. Team shape is the variable.

---

Source: https://dfieldsolutions.com/blog/stop-ab-testing-buttons-team-size
Author: Dezso Mezo · Founder, DField Solutions
Site: https://dfieldsolutions.com
