A/B testing is a quantitative, behavioral research method used to optimize a product's conversion rate by showing one group of users the control version and another group a slight variation. The version that performs better is the winner. Sounds simple?

Quantitative research methods require a larger sample size in order to have statistical power; simply put, the more users in the test, the higher the chance of correctly identifying a winner. But what is statistical power, and how much is "more"?
Imagine a group of cat aficionados hires you to determine whether yellow cats are more numerous than grey cats (for whatever purpose). But you're lazy: you go out, see two tabbies and a Chat Noir poster, and conclude that greys are way more numerous. This sounds wrong, because there is no mathematics to back up the validity of your conclusion. Statistical power is the probability that research will correctly recognize an effect, and it is related to the sample size and the size of the effect. The more cats you count, the more statistical power you have, and if one group outnumbers the other by a large margin, you need to count fewer. To be convincingly sure of your results, power needs to be above 80 percent.
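To make this concrete in A/B terms, here is a minimal sketch in Python (using scipy) of how power grows with sample size for a fixed difference between two groups. The 55 percent vs. 45 percent split and the sample sizes are made-up numbers for illustration:

```python
from scipy.stats import norm

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions,
    with n_per_group observations in each group (normal approximation)."""
    se = (p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group) ** 0.5
    z_crit = norm.ppf(1 - alpha / 2)   # critical value for alpha = 0.05
    z_effect = abs(p1 - p2) / se       # how many standard errors apart the groups are
    return norm.cdf(z_effect - z_crit)

# Hypothetical cat census: 55% yellow vs. 45% grey. Counting more cats
# (larger n) pushes power up; a bigger color gap would get there faster.
for n in (20, 100, 500, 1000):
    print(n, round(power_two_proportions(0.55, 0.45, n), 2))
```

With only 20 cats per group the power is below 10 percent; around 500 per group it finally passes the 80 percent threshold mentioned above.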

Back to A/B testing: we know a power of 90 percent (0.9) is enough to ensure a valid test, but how many users do we need? To calculate the necessary number of users, we need to consider the performance of the control version, the expected uplift of the variation, and the probabilities of false negative and false positive outcomes. Say we decide to run an A/B test hoping to increase our conversion rate by 5 percent, and our current website (version A) has a conversion rate of 12 percent.

Every test can produce a false negative result, also known as a type-II error or β, which means the test winner is mistakenly identified as a loser. While this is not good, it's not a disaster either: you wasted time and discarded a possible uplift, but you keep your conversions. Statistical power is determined by the false negative rate, e.g. if the false negative rate is β=0.1, the power is 0.9. On the other hand, false positive results, type-I errors or α, are more dangerous, because they make you believe you have a winner when in reality there might be no uplift, or even worse. The usual value for the false positive probability is α=0.05.

If you're not fond of math, you can put these numbers into one of the many online calculators available. When we calculate the sample size with α=0.05 and β=0.1, it turns out we need about a thousand participants per variation in order to be 90 percent sure of the test outcome.
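For those who do want the math, here is a minimal sketch of that calculation in Python (using scipy). It assumes the 5 percent is an absolute uplift, i.e. from a 12 percent to a 17 percent conversion rate, a two-sided test, and two equally sized groups:

```python
from scipy.stats import norm

def sample_size_per_variation(p_control, p_variant, alpha=0.05, power=0.9):
    """Standard normal-approximation sample size for comparing two proportions
    with a two-sided z-test and equally sized groups."""
    z_alpha = norm.ppf(1 - alpha / 2)   # guards against false positives (type I)
    z_beta = norm.ppf(power)            # guards against false negatives (type II)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return (z_alpha + z_beta) ** 2 * variance / (p_variant - p_control) ** 2

# 12% baseline conversion, hoping for an uplift to 17%
print(round(sample_size_per_variation(0.12, 0.17)))  # roughly 1,000 users per variation
```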
What happens if we want more than 90 percent power, have a conversion rate under 10 percent, and the realistic improvement of the new version is a few percent at best? Science says we need thousands of participants to achieve statistical power, but if you're a startup you probably don't have these users, and it's unrealistic to run tests for months. Even if you're not a startup, this is impractical, especially in modern business where you need to produce results fast.
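Plugging a harder (hypothetical) scenario into the same formula shows why: with an 8 percent baseline and a realistic one-percentage-point uplift, the required sample balloons.

```python
from scipy.stats import norm

# Same normal-approximation formula as above, for a hypothetical 8% -> 9% uplift
alpha, power = 0.05, 0.9
p1, p2 = 0.08, 0.09
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
n = z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
print(round(n))  # on the order of 16,000 users per variation
```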
The folks at KissMetrics ran an interesting simulation showing that if you run back-to-back tests, constantly trying new solutions, you can cap tests at a maximum of 2,000 participants and achieve results similar to the scientific method, while cutting down testing times.
Still, if you can't produce new variations constantly, if there is no clear winner after 2,000 participants, or if you have no pressure to stop early, you should stick to the science and run the full test (within a reasonable time frame). You should also keep in mind that the probability of a false positive result is 5 percent, so one in 20 tests might actually be a false positive.
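A quick way to convince yourself of that last point is to simulate A/A tests, where both "versions" are identical, so any "winner" is by definition a false positive. The sketch below (Python, with an assumed 12 percent conversion rate and 2,000 users per group) flags a significant difference in roughly 5 percent of runs:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, n, p = 0.05, 2000, 0.12   # significance level, users per group, true conversion rate
trials, false_positives = 10_000, 0

for _ in range(trials):
    # Both groups draw from the exact same 12% conversion rate (an A/A test)
    conv_a = rng.binomial(n, p)
    conv_b = rng.binomial(n, p)
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    z = (conv_b / n - conv_a / n) / se
    if abs(z) > norm.ppf(1 - alpha / 2):
        false_positives += 1   # the test "found a winner" where none exists

print(false_positives / trials)  # hovers around 0.05, i.e. one test in 20
```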


Analysis from Google, Amazon, Qubit, and other companies suggests that only 10 out of 100 A/B tests produce a true effect. This means you can have a test winner, implement it expecting an uplift, and nothing will happen. Aside from sample size, the reason for this is probably the novelty effect during the test: the change disturbs the banner blindness of returning users and makes them switch from scan mode to reading. In most cases the actual uplift is lower than the one measured in the test. Let's be realistic: that slight variation you created for version B will not significantly change people's behavior. Moving that button a bit, trying out a different image, or rephrasing the copy might create some uplift, but unless you have a $100 product and start selling it for $1, you cannot change behavior or hypnotize people into buying your stuff. As a matter of fact, Qubit's simulation shows that tests with more than a 20 percent uplift are not valid.

With so many services available, A/B testing has become fashionable, but there's not enough talk about the science behind the tests. Anyone can set up and run a test, get an amazing uplift from statistically irrelevant data or a false positive, and be happy about it without even understanding there is a problem, and this is dangerous. You need traffic (possibly a lot of it) to run a meaningful A/B test. If you're an SME or a startup with 100 conversions per month, there are many other, more productive methods you can use to grow your business. A/B testing is not a growth hacking method, but it can be helpful when you want to introduce and validate a new option or make changes to the UI. When you think about an A/B test, make sure to consider statistical power and, if possible, run a second test with the same control and variation in order to confirm and validate the results.

