Growth optimization vs. product testing
Incremental product growth optimization is inherently different from traditional product work. That's also why companies such as Meta have dedicated growth orgs that are separate from the core product organization.
In “growth optimization”, testing velocity is key. It differs from classic feature testing in several ways:
- Smaller changes: Adjustments are minor and inexpensive, such as changing text or design elements.
- More subtle effects: Most changes are low-risk with relatively small effect sizes, yet the cumulative effect can be large.
- Lower success rates: Few changes are successful; most don't bring a statistically significant conversion-rate improvement.
As a result, success in growth optimization hinges on a team's ability to identify and accumulate many small wins, whereas core product work focuses more on de-risking bigger bets.
Limitations of classic A/B testing
In the context of growth optimization, A/B testing often falls short due to the large sample size required and the rigidity of the statistical approach:
- Detecting small uplifts takes too long: For example, imagine a website with 100k monthly visitors and a 1.5% baseline conversion rate (typical SMB eCommerce store). To detect a 10% incremental uplift with 95% confidence, an A/B test needs to run for 8 weeks (8 months for a 5% uplift).
- Multivariate tests aren't feasible: Given the traffic requirements, very few companies can test more than one variable at a time.
- Few ideas can be tested: Teams must make difficult prioritization decisions, and “success” often depends on the PM's experience and luck.
- Opportunity costs are high: A/B testing hurts overall conversion rates, as a fixed portion of traffic (typically 50%) remains allocated to the underperforming variant for the entire test duration.
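The duration figures in the first bullet can be reproduced with the standard two-proportion sample-size formula. A minimal sketch, assuming a two-sided 95% confidence level and the conventional 80% power (the text does not state its power assumption, so the results may differ slightly):

```python
import math

def ab_test_duration(monthly_visitors, baseline_cr, relative_uplift,
                     z_alpha=1.96, z_beta=0.84):
    """Approximate duration in months of a 50/50 A/B test that can
    detect the given relative uplift (z_alpha: two-sided 95% confidence,
    z_beta: 80% power)."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_uplift)
    p_bar = (p1 + p2) / 2
    # Standard two-proportion sample size per arm.
    n_per_arm = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                  + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
                 / (p2 - p1) ** 2)
    return 2 * n_per_arm / monthly_visitors

months_10pct = ab_test_duration(100_000, 0.015, 0.10)  # roughly 2 months
months_5pct = ab_test_duration(100_000, 0.015, 0.05)   # roughly 8 months
```

With 100k monthly visitors and a 1.5% baseline, this lands at roughly two months for a 10% uplift and over eight months for a 5% uplift, consistent with the figures above.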
Multi-Armed Bandits
Multi-armed bandits are not new. They have been used successfully in areas such as search or ads optimization. However, they are much less commonly used in product optimization compared to A/B testing.
MAB algorithms balance two competing goals: exploration and exploitation. Exploration aims to gather evidence about as many ideas as possible, while exploitation aims to maximize the overall conversion rate by showing the winning ideas to as many users as possible.
An effective technique for balancing exploitation and exploration is “Thompson Sampling.” Simply put, it works like this: start with a guess (assume all options are equally good), test one variation (sample a plausible conversion rate for each option from your current beliefs, show the option with the highest sample to a user, and see how they respond), then update your guess (adjust your belief about that option based on the observation), and repeat.
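The loop described above can be sketched with Beta-Bernoulli posteriors, a common choice for conversion events (the variant count and “true” rates below are made up for illustration):

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson Sampling over a set of variants."""

    def __init__(self, n_variants):
        # Beta(1, 1) prior: every variant starts out equally plausible.
        self.alpha = [1] * n_variants  # successes + 1
        self.beta = [1] * n_variants   # failures + 1

    def pick(self):
        # Sample a plausible conversion rate for each variant and show
        # the variant whose sample is highest.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return samples.index(max(samples))

    def update(self, variant, converted):
        # Update the posterior of the variant we just showed.
        if converted:
            self.alpha[variant] += 1
        else:
            self.beta[variant] += 1

# Illustrative simulation: the sampler does not know these rates.
random.seed(1)
true_rates = [0.02, 0.03, 0.04]
sampler = ThompsonSampler(len(true_rates))
for _ in range(20_000):
    v = sampler.pick()
    sampler.update(v, random.random() < true_rates[v])
pulls = [a + b - 2 for a, b in zip(sampler.alpha, sampler.beta)]
```

Because each user sees the variant with the best sampled rate, traffic shifts toward likely winners automatically, while uncertain variants still get occasional exposure.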
MAB algorithms are particularly data-efficient when paired with “hierarchical Bayesian” models. These models recognize that a product design is a function of multiple variables (“factors”) that may vary in importance. The hierarchical approach offers a significant advantage over A/B testing, since it helps avoid wasting traffic on finding the best levels of unimportant factors.
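One simple way to exploit this factor structure (a deliberate simplification for illustration, not Levered's actual model) is to keep one posterior per factor level instead of one per full variant, which assumes factor effects are roughly independent:

```python
import random

class FactoredThompsonSampler:
    """Thompson Sampling with one Beta posterior per factor level,
    rather than one per full variant combination. Assumes factor
    effects are roughly independent (a simplification of a true
    hierarchical Bayesian model)."""

    def __init__(self, levels_per_factor):
        self.alpha = [[1] * n for n in levels_per_factor]
        self.beta = [[1] * n for n in levels_per_factor]

    def pick(self):
        # Choose the best-sampled level for each factor independently.
        variant = []
        for a_row, b_row in zip(self.alpha, self.beta):
            samples = [random.betavariate(a, b)
                       for a, b in zip(a_row, b_row)]
            variant.append(samples.index(max(samples)))
        return tuple(variant)

    def update(self, variant, converted):
        # Credit the observed outcome to every level that was shown.
        for factor, level in enumerate(variant):
            if converted:
                self.alpha[factor][level] += 1
            else:
                self.beta[factor][level] += 1

# Illustrative run: factor 0 matters a lot, factors 1 and 2 barely do.
random.seed(2)
effects = [
    [0.000, 0.000, 0.000, 0.015],
    [0.000, 0.001, 0.001, 0.002],
    [0.000, 0.000, 0.001, 0.001],
]
sampler = FactoredThompsonSampler([4, 4, 4])
for _ in range(30_000):
    v = sampler.pick()
    rate = 0.02 + sum(e[l] for e, l in zip(effects, v))
    sampler.update(v, random.random() < rate)
```

With three factors of four levels each, this tracks 12 posteriors instead of 64, so evidence about, say, a headline accrues across every variant that uses it. The levels of an unimportant factor quickly converge to similar posteriors and stop consuming meaningful exploration traffic.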
Benchmarking algorithms
At Levered, we use custom hierarchical MAB algorithms for automated product growth optimization. The best way to compare this approach directly with classic A/B testing is a simulation: we define a typical product optimization scenario and observe how effective each algorithm is at finding “winners” and improving the overall conversion rate over time.
Defining the scenario
We are optimizing a UX across three variables (“factors”), e.g., the headline, hero image, and CTA copy of a landing page. For each factor, we want to explore four different levels, which makes 4×4×4 = 64 possible variants. The three factors vary in importance, and each variant has a “true” conversion rate between 2% and 4% (both ex-ante unknown).
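Such a scenario can be encoded as follows. The per-level effect sizes here are illustrative placeholders, since the true values used in the benchmark are not published:

```python
import itertools

# Illustrative per-level effects; factors vary in importance
# (headline > hero image > CTA copy). Placeholder values only.
factor_effects = [
    [0.000, 0.004, 0.008, 0.012],   # headline (most important)
    [0.000, 0.002, 0.004, 0.006],   # hero image
    [0.000, 0.0005, 0.001, 0.002],  # CTA copy (least important)
]
BASE_RATE = 0.02  # the all-worst-levels variant converts at 2%

# All 4 x 4 x 4 = 64 variants with their "true" conversion rates,
# ranging from 2% (all worst levels) to 4% (all best levels).
true_rates = {
    combo: BASE_RATE + sum(effects[level]
                           for effects, level in zip(factor_effects, combo))
    for combo in itertools.product(range(4), repeat=3)
}
```

In the simulation, the optimizer only observes conversions drawn from these rates; the rates themselves stay hidden, mirroring the ex-ante-unknown setup above.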
Quantifying performance
We focus on three characteristics:
- Average Conversion Rate: How well each method turns visitors into customers over time.
- Variance of Conversion Rates: Whether the results are steady or volatile.
- Chance of Beating the Other Method: The likelihood that one method outperforms the other.
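Given paired conversion rates from repeated simulation runs, the three metrics can be computed with a small helper (the function name and input format are hypothetical):

```python
import statistics

def compare_runs(mab_rates, ab_rates):
    """Summarize paired simulation runs: mean conversion rate, its
    spread across runs, and the share of runs in which MAB beats A/B."""
    return {
        "mab_mean_cr": statistics.mean(mab_rates),
        "ab_mean_cr": statistics.mean(ab_rates),
        "mab_stdev": statistics.stdev(mab_rates),
        "ab_stdev": statistics.stdev(ab_rates),
        # Fraction of paired runs where the MAB run converted better.
        "p_mab_beats_ab": statistics.mean(
            m > a for m, a in zip(mab_rates, ab_rates)),
    }
```

The "chance of beating" metric is simply the empirical win rate over paired runs, which is why it is reported at a fixed interaction count (e.g., after 15k interactions).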
Results
We observe that on average, MAB:
- Reaches 3x higher relative uplift
- Needs 7x less traffic to reach 10% uplift
- Produces roughly half the standard deviation in outcomes
- Outperforms A/B testing in 85% of simulations after 15k interactions
Discussion
When it comes to testing expensive changes, such as new feature launches, A/B testing is still a valid approach. The larger the company and the more traffic the product gets, the more likely it is that A/B testing will work out fine.
However, when the space of options is large, traffic is sparse, and the potential cost of experimentation is moderate, MAB is the superior alternative. This applies particularly in the context of CRO in small and medium enterprises.