Stop A/B testing the colour of buttons; start A/B testing the size of the team
Most A/B tests are theatre. Here is what actually moves a startup's metrics, and why team-shape is the experiment your competitors are afraid to run.
Every product team I have audited in the last three years has the same shape of experiment backlog. Eighty percent of the queued tests are 'colour of the CTA', 'copy on the headline', 'order of these three steps'. The expected lift on each, by the team's own pre-experiment estimate, is 0.5-2%. The actual lift, after running, lands somewhere between -0.5% and +0.7%, with confidence intervals that touch zero on both sides.
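To see why those intervals keep touching zero, it helps to run the arithmetic once. Here is a minimal sketch of a 95% confidence interval for the absolute lift between two variants; the conversion numbers are hypothetical, chosen to resemble a typical CTA test at a typical startup's traffic.

```python
import math

def lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """95% CI for the absolute lift of variant B over control A,
    using the normal approximation for two proportions."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    lift = p_b - p_a
    return lift - z * se, lift + z * se

# Hypothetical numbers: 5% baseline conversion, a 'winning' variant
# at 5.1%, 10,000 users per arm.
lo, hi = lift_ci(500, 10_000, 510, 10_000)
print(f"lift CI: [{lo:+.4f}, {hi:+.4f}]")  # the interval straddles zero
```

At this traffic level the measured +0.1 percentage point lift comes with an interval roughly ten times wider than the effect, which is exactly the "touches zero on both sides" pattern above.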
The remaining twenty percent of queued experiments, the ones about team shape, on-call rotation, and who owns which feature, are never run. They are scary, they are political, and the teams that queue them do not know how to measure them. So they ship colour swaps and feel productive.
This is the counter-take: the experiments that move startup metrics by 10-30% are about team shape, not pixel shape. Here are four we have helped clients run. The numbers are real, blurred just enough to protect the clients.
A B2B SaaS client: 11 product engineers, weekly velocity of 14 story points (their measure). We split them into two squads: a 4-person 'core product' squad and a 7-person 'platform / migration' squad. Six weeks later the 4-person squad had shipped twice as much customer-facing change as the full 11-person team had managed in the prior six weeks. The 7-person squad caught up on three years of accrued debt that had been silently throttling everything.
The lesson is not 'fire half the team'. The lesson is that a single 11-person team is almost always the wrong shape: two specialised teams of 4-7 do more than one 'aligned' team of 11. The team A/B test here is splitting the team in half by responsibility and seeing which side moves the metric you care about.
A consumer app client. Support reported into ops. Engineers would ship a feature, and support would discover three weeks later that 8% of users could not get past onboarding. Reproduction took two weeks because the support team was emotionally and organisationally distant from engineering. Total time to fix: nine weeks per onboarding bug.
We moved support reporting into engineering, with a daily 15-minute sync between the support lead and the on-call engineer. Time to fix dropped to four days. NPS went from 31 to 47 in one quarter. Net product velocity did not change: the engineers were not 'distracted' by support, they were finally hearing reality.
A 30-person engineering org with 14 weekly standing meetings on the calendar, totalling 9.5 hours per attendee per week. We killed all of them for one month and replaced them with a single 30-minute Monday kickoff per squad and a Friday async written update.
Output per engineer (PRs merged, story points, your favourite measure) went up 22% in the first month. After two months we reintroduced two of the meetings, the ones people genuinely missed. Net: six hours per week per engineer reclaimed, with no measurable downside.
Most teams treat pair programming as a special occasion. We ran a six-week experiment in which every senior had a pinned junior pair, working the same ticket queue four hours a day. Junior ramp time (time to ship a first independent feature) dropped from a 14-week median to six weeks. Senior throughput dropped 12% during those six weeks but rose 18% afterwards, because seniors stopped fielding 'how do I do X' questions all day.
If your last ten experiments were all UI tweaks and the company metric did not move, the experiment to run is on yourself: stop optimising the wrong variable. Team shape is the variable.

Founder, DField Solutions
I've shipped production products from fintech to creator tooling, for startups and enterprises, from Budapest to San Francisco.
Let's talk about your project. 30 minutes, no strings.