Crashing into growth marketing: a CEO journey. Part 3 — A/B testing.
This is the third post in a series documenting my journey as an early-stage startup CEO through the growth marketing minidegree by CXL Institute, a 12-week online program about the practicalities of growth marketing.
Disclaimers: I’ve done intense learning sprints before. I’ve never touched marketing from a practical perspective before this course. Consult a specialist before trying this at home.
Promo insert: my startup is called Zelos, an app that helps you manage tasks and distribute work across a very large group or community. You can sign up for free at getzelos.com
The science of A/B testing
Yep, it’s apparently really scientific, and the science is statistics. I go into the A/B testing chapters with some reluctance, because it’s not going to happen for our startup any time soon. We’re in the early stage, and 1000 or 10 000 monthly purchases aren’t milestones we’ll reach by tomorrow. And I’ve been told (and the chapter intro tells me again):
do not A/B test with low volume!
With 1000 conversions per month (and by conversions they mean actual purchases, subscriptions and upgrades) the maximum test capacity is 20 tests per year. Twenty! That’s less than two tests per month! With 1000 purchases! We have a lot to grow.
So you understand why I’m not extremely enthusiastic about a full chapter of something that I cannot use any time soon. However, I’m quickly convinced that there’s value in knowledge:
“Statistics aren’t necessarily fun to learn. It’s probably more fun to put up a test between a red and green button and wait until your testing tool tells you one of them has beaten the other. If this is your strategy, you’re ripe for disappointment. This approach isn’t much better than guessing. Often, it ends with a year’s worth of testing but the exact same conversion rate as when you started.”
Excuse me what? That’s EXACTLY what I thought A/B was about. Testing red buttons against green buttons and learning which one works better. I guess I’m listening now.
But listening isn’t easy, because statistics is very much a science, and the best teachers here are actual scientists. So instead of saying normal-sounding things like “you have proven that your idea performs better”, they say things like “you can now reject the null hypothesis”.
It also doesn’t help that the subtitles for the videos are auto-generated. English is a third language for me, so with complex matters I get more support from reading than from just listening. But the main lecturer has a foreign accent, and the scientific vocabulary isn’t making sense to the AI writing the subtitles. Why is life so hard?
I’m so glad I took the entry-level statistics course on Coursera that one time when I had the personal meltdown that made me want to apply for a degree in social sciences (in reality I ended up spending two years in anthropology and didn’t graduate). At least I know what probability means in a general sense.
Wikipedia and Google search are great crutches for getting through these chapters. At some points I do feel like skipping a video because it’s 15 minutes in and I still understand nothing. But pushing on rewards me every time, as there are small takeaways to pick up here and there. And by the end of the whole chapter, things have been repeated enough to paint a general picture.
I now understand the difference between Frequentist and Bayesian statistics, and am able to enjoy geeky jokes like this:
Frequentists vs. Bayesians
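The practical difference can be sketched with a toy example. A frequentist asks “how unlikely is this data if the two versions are equal?”, while a Bayesian asks “given the data, how likely is it that the challenger actually beats the control?”. Here’s a minimal Bayesian sketch in Python (the conversion numbers are made up for illustration), using Beta posteriors and Monte Carlo sampling:

```python
import random

random.seed(42)  # reproducible draws

# Hypothetical results: control 100/1000 conversions, challenger 120/1000
control_conv, control_n = 100, 1000
challenger_conv, challenger_n = 120, 1000

# With a uniform Beta(1, 1) prior, the posterior for a conversion rate
# is Beta(conversions + 1, non-conversions + 1).
def posterior_draw(conversions, n):
    return random.betavariate(conversions + 1, n - conversions + 1)

draws = 100_000
wins = sum(
    posterior_draw(challenger_conv, challenger_n)
    > posterior_draw(control_conv, control_n)
    for _ in range(draws)
)
prob_b_beats_a = wins / draws
print(f"P(challenger beats control) ≈ {prob_b_beats_a:.2f}")
```

A frequentist test on the same numbers would report a p-value instead; the Bayesian output reads more naturally as “there’s roughly a 9-in-10 chance the challenger really is better”.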
A/B testing — why and what?
There are two general reasons for testing:
- research: measuring impact and user behaviour (what do users do when elements are added or removed); and
- optimisation: deploying adjustments and changes for conversion lift
The goal for both is to determine the likelihood that a significant difference exists between two alternatives: a control version (usually the current state of things) and a challenger version (something added, removed or changed).
The goal metrics depend on the need and maturity of the company. From easy and less useful to complex and meaningful, some of the categories to set as testing goals are:
- Transactions (the first level of “actual” testing!)
- Revenue per user
- Potential lifetime value
The gist is to draw a correct conclusion from the measured results: can you be confident (and hopefully right) in assuming the challenger performs better, and will upgrade your metric long term, when it performed better in the short test? And there are two easy ways of being wrong:
- False positive: you measure the challenger to perform better, but in reality it’s no better (or even worse) than the control. This usually happens when the level of significance is set too low, and the good results are down to chance. To be sure the results aren’t incidental, the level of significance should be 90–95% when testing. This mistake is also easy to make with overpowered tests that have a far too large sample size: with enough data, even trivially small differences show up as significant.
- False negative: you measure the challenger to perform the same or worse, but in reality it’s significantly better than the control. This is caused by low statistical power, usually because the test doesn’t have enough volume to detect the difference. To make sure real effects can be detected, tests should be run at a power of at least 80%.
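To make the significance side concrete, here’s a minimal sketch (in Python, with made-up numbers) of the classic two-proportion z-test that A/B tools typically run under the hood. Note how a 12% relative lift on 10,000 visitors per arm still fails the 95% bar:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical results: 10,000 visitors per arm
control_conv, control_n = 500, 10_000        # 5.0% conversion
challenger_conv, challenger_n = 560, 10_000  # 5.6% conversion

p1 = control_conv / control_n
p2 = challenger_conv / challenger_n

# Pooled conversion rate under the null hypothesis (no real difference)
p_pool = (control_conv + challenger_conv) / (control_n + challenger_n)
se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / challenger_n))

z = (p2 - p1) / se
# Two-sided p-value: probability of seeing a difference this large by chance
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
# p-value > 0.05, so we cannot reject the null hypothesis at 95% confidence,
# even though the challenger looks 12% better.
```

Declaring the challenger a winner here would be exactly the false-positive trap: the result is plausible enough to be chance.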
The significance and power percentages are core inputs to the calculations that determine the other necessary components of a test:
- necessary volume (how many users need to be involved in the test)
- detectable lift (how much better does one version need to be to be really sure it’s better)
- length of test (how many weeks do you need to test to get a significant result)
There are scientific formulas, but luckily also a variety of tools and calculators for people who cannot do basic multiplication (like me).
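For illustration, here’s a minimal Python version of what those calculators compute: the sample size per variant for a two-proportion test, using the standard normal-approximation formula. The baseline rate, lift and traffic figures below are my own assumptions, not numbers from the course:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed in each variant to detect the given relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Hypothetical: 5% baseline conversion, hoping to detect a 10% relative lift
n = sample_size_per_arm(0.05, 0.10)
weekly_visitors = 10_000  # assumed traffic, split across both variants
weeks = ceil(2 * n / weekly_visitors)
print(f"{n} visitors per variant, roughly {weeks} weeks of testing")
```

Smaller detectable lifts blow the number up fast: since the required sample grows with the inverse square of the difference, halving the lift roughly quadruples the visitors needed, which is exactly why low-volume sites can’t test small changes.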
Research — what could be optimised?
The biggest part of the challenge is picking the right battles. What should be tested? Remember: even with the huge amount of traffic that already gets you 1000 conversions, you can test only one or two things per month!
Here’s a model called 6V for choosing what to optimise. It’s the first thing you should do when you’re contracted to optimise a site or product for a new client, but also every time you start a new optimisation project within your own company.
- Value. What’s actually important and delivers impact? What’s the mission, strategy and goals of the company, and what are the important metrics that we should support?
- Versus. What are the best practices among the competition? Are they changing things on their sites? There are many tools that let you track changes on other websites (and even detect their A/B tests with browser plugins) and draw inspiration from your competitors’ activities.
- View. Look at behaviour data, journeys, and sources. Measure the conversion rate (and time) at each step of the customer journey, and determine uplift possibilities between the journey segments:
- Users with enough time to take action
- Users with some interaction
- Users with heavy interaction
- Users with a direct intent to buy
- Users with a will to buy
- Users who successfully purchase
- Users who return to purchase
- Voice. Conduct surveys. Get access to customer service feedback. Research competitors’ public customer service: many consumers tweet their feedback to companies publicly, and I spent a good hour on Twitter reading through the good and the bad about some of the major players in the productivity space. Really insightful!
- Validated. What have you already tested, and what were the results of previous experiments? Always tag the experiment database with enough information that you can easily reference it later (or share it with other optimisation teams):
- What product / service was tested
- Customer journey phase
- User segment (device / type / source / etc)
- Template (front page / listing page / etc)
- Persuasion technique used (backfire / bandwagon / etc)
- Verified. What scientific research is available? I had never thought to go on Google Scholar and type in a stupid question such as “why is teamwork hard to manage”. I got a ton of exciting research on my own subject that we can already use as inspiration and validation for our blog!
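The tagging scheme under “Validated” could be sketched as a simple record; the field names and values here are hypothetical, just to show the shape of such a log:

```python
from dataclasses import dataclass

@dataclass
class ExperimentRecord:
    product: str               # what product / service was tested
    journey_phase: str         # e.g. "landing", "checkout"
    user_segment: str          # device / type / source
    template: str              # front page, listing page, ...
    persuasion_technique: str  # backfire, bandwagon, ...
    hypothesis: str
    result: str                # "winner", "loser", "inconclusive"

# One hypothetical entry in the shared experiment log
log = [
    ExperimentRecord(
        product="signup flow",
        journey_phase="landing",
        user_segment="mobile / new / organic",
        template="front page",
        persuasion_technique="bandwagon",
        hypothesis="A screenshot above the fold increases registrations",
        result="inconclusive",
    )
]
print(len(log), log[0].template)
```

Even a spreadsheet with these columns would do; the point is that every past test stays searchable by segment, template and technique.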
The first thing is to write the experiment hypothesis. The first thing. Before you experiment. Don’t come up with a good explanation after the test. The hypothesis should include the problem, the proposed solution, and the predicted outcome. Something like this:
If [I apply this], then [behaviour change] will happen, because of [this reason].
If we had any volume to test, my current hypothesis would be: if we add a product screenshot above the fold of our landing page, users will register more free accounts, because they now understand the value proposition clearly. But instead of testing, we’ll probably just implement the change and see if it helps.
A lot of the technicalities of actually setting up the experiment in testing tools went beyond my understanding (as I have zero experience with these), but I managed to take away these general recommendations:
- Measure things yourself and send data to your own analytics, don’t trust the A/B tools 100%
- Add custom code to both the A and B versions of the test, so your control version doesn’t win simply because it loads a little quicker.
- Set up tests for new visitors only, and make sure they keep the same experience later once your testing time is done (don’t delete the tested assets for that set of users).
- Make sure the order of events makes sense: users see the test first, then purchase. Don’t count users who purchase and then accidentally see your test later.
But again: none of this makes sense unless you already have a large amount of traffic with a lot of conversions. Statistics gives adequate results only if there’s enough data. If there’s not enough data, you’re better off guessing.