Have you ever implemented the top-performing variation from a PPC ad copy A/B test, only to see no actual improvement?
This happens more often than you’d think.
A/B testing works – you just need to avoid some common pitfalls.
This article tackles the top mistakes that cause PPC A/B tests to fail, plus practical tips to ensure your tests deliver meaningful results. We’ll cover issues like:
- Chasing statistical significance at the expense of business impact.
- Not running tests long enough to get sufficient data.
- Failing to segment traffic sources and other critical factors.
Aiming for 95% statistical significance is often overkill
When running A/B tests, general best practices say you want to start with a strong hypothesis. Something that goes along the lines of:
- “By adding urgency to my ecommerce ad copy, we anticipate CTR to increase by four percentage points.”
That’s a great way to start. Having a proper description of the test’s scope, its control and experiment cells, the main KPI (and potentially secondary KPIs, too), and the estimated results helps structure tests and subsequent analysis.
However, when marketers start using such a methodology, they often start geeking out and hear about the “Holy Grail” of valid results: reaching statistical significance (or “stat sig”). This is when things get confusing quickly.
(I’ll assume you know what stat sig is, but if that’s not the case, then you want to start here and play with this tool to better understand the remainder of this article.)
If you’ve been in the PPC business for some time, you’ve noticed common patterns such as:
- What usually works: Urgency, limited stock and exclusive deals messages.
- Doesn’t necessarily work: Environmental and societal messages (sorry, Earth!).
- What usually works: Placing that lead form above the fold on your landing page.
- Doesn’t necessarily work: Complex, long lead forms.
So if you’re 99% confident you can have those quick wins right now, just do it. You don’t need to prove everything using A/B tests and stat sig results.
You might be thinking, “OK, but how do I convince my client we can simply roll out that change without even testing it before?”
To address this, I’d recommend:
- Documenting your tests in a structured way so you can present relevant case studies down the road.
- Benchmarking competitors (and players outside of your target industry). If they all do just about the same, there may be a valid reason.
- Sharing results from roundup articles titled “Top 50 tests every marketer should know about” (e.g., A/B Tasty, Kameleoon).
Your goal here should be to skip the line and save time. And we all know time is money, so your clients (or CMO and CFO) will thank you for that.
Don’t let statistical significance stop your test
We’ve heard some marketers say, “You should only end a test when you have enough information for it to be statistically significant.” Caution here: this is only partly true!
Don’t get me wrong, having a test reach 95% statistical significance is good. Unfortunately, it doesn’t mean you can trust your test results quite yet.
When your A/B test tool tells you that you reached stat sig, it only means your control and experiment cells behave differently. That’s it.
How is it useful when you already know that? After all, you designed your test to be an A/B test, not an A/A test (unless you’re a stat researcher).
In other words, reaching stat sig doesn’t mean your experiment cell performed better (or worse) than the control one.
How do you know your test results correctly identify the top-performing asset? Your results may read that cell B outperforms cell A by five percentage points. What else do you need?
As mentioned above, reaching 95% acknowledges that your control and experiment cells behave differently. But your top performer could switch from cell A to B and then from cell B to A even after reaching 95% stat sig.
Now that’s a problem: stopping your A/B test the moment it reaches 95% stat sig makes its results unreliable. How unreliable, you ask? The false positive rate can climb to 26.1%. Whoops…
If you want to dive into more details, here is a deeper analysis from Evan Miller (and a broader perspective on Harvard Business Review).
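You can see this “peeking” problem for yourself with a minimal simulation (a sketch using only Python’s standard library; the 5% conversion rate, 500-visitor peek interval, and trial counts are illustrative, not from any real account). It runs A/A tests, where both cells are identical by construction, and stops the first time a standard two-proportion z-test crosses 95% significance. The share of “winners” it declares is the false positive rate you get from stopping at the first sign of stat sig:

```python
import math
import random

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-statistic, the test behind most A/B tools."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (conv_b / n_b - conv_a / n_a) / se

def peeking_false_positive_rate(trials=500, peeks=20, step=500, p=0.05):
    """Run A/A tests (both cells convert at the same rate p) and stop the
    first time |z| > 1.96 (95% stat sig). Returns how often we wrongly
    declare a winner when there is, by construction, nothing to find."""
    random.seed(42)
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        n = 0
        for _ in range(peeks):
            for _ in range(step):
                conv_a += random.random() < p
                conv_b += random.random() < p
            n += step
            if abs(two_proportion_z(conv_a, n, conv_b, n)) > 1.96:
                false_positives += 1
                break
    return false_positives / trials

rate = peeking_false_positive_rate()
print(f"False positives when stopping at first 95% stat sig: {rate:.1%}")
```

With 20 peeks per test, the rate lands far above the 5% you’d expect from a single, pre-planned significance check.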
How do you know your results are actually reliable? First, you don’t want to stop your tests the moment they reach 95%. And you also want to design your A/B tests differently. Here’s how.
Evaluate your target audience
If you’re not a math person, you want to read Bradd Libby’s article first.
TL;DR: Tossing a coin 10 times will hardly prove the coin is perfectly balanced. One hundred tosses is better, and 1 million is great. An infinite number would be perfect. Seriously, try tossing coins and see for yourself.
In PPC terms, designing A/B tests should start with knowing your audience. Is it 10 people or 1 million? That tells you where you stand: in A/B testing, more data means higher accuracy.
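If you’d rather simulate than toss coins by hand, here’s a sketch (the 10 / 1,000 / 100,000 toss counts are arbitrary choices) measuring how far the observed heads rate typically lands from the true 50% at each sample size:

```python
import random

def mean_abs_error(n_tosses, repeats=100):
    """Toss a fair coin n_tosses times, over many repeats, and return the
    average distance between the observed heads rate and the true 50%."""
    total_error = 0.0
    for _ in range(repeats):
        heads = sum(random.random() < 0.5 for _ in range(n_tosses))
        total_error += abs(heads / n_tosses - 0.5)
    return total_error / repeats

random.seed(7)
errors = {n: mean_abs_error(n) for n in (10, 1_000, 100_000)}
for n, err in errors.items():
    print(f"{n:>7} tosses -> typical error {err:.4f}")
```

The typical error shrinks by roughly 10x for every 100x more tosses, which is exactly why audience size dictates how small an uplift your test can reliably detect.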
Size matters in A/B testing
Not all projects or clients have high-volume platforms (be it sessions, clicks, conversions, etc.).
But you only need a big audience if you anticipate small incremental changes. Hence my first point in this article: don’t run tests that state the obvious.
What’s the ideal audience size for an estimated uplift of just a few percentage points?
Good news: A/B Tasty developed a sample size calculator. I’m not affiliated with A/B Tasty in any way, but I find their tool easier to understand. Here are other tools if you’d like to compare: Optimizely, Adobe and Evan Miller.
Using such tools, look at your historical data to see whether your test can reach a state where its results are reliable.
But wait, you’re not done yet!
Customer journey is critical, too
For example, let’s say you observe a 5% conversion rate for a 7,000-visitor pool (your average weekly visitor volume).
The above sample size calculators will tell you that you need less than 8 days if you anticipate your conversion rate to increase by 1.5 percentage points (so from 5% to 6.5%).
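If you’d rather see the math behind those calculators, here’s a sketch of the classic two-proportion sample size formula applied to this exact scenario (assuming 95% confidence and 80% power, the usual calculator defaults):

```python
import math

def sample_size_per_variant(p1, p2, alpha_z=1.96, power_z=0.8416):
    """Classic two-proportion sample size formula: alpha_z and power_z are
    the z-scores for 95% confidence and 80% power, respectively."""
    p_bar = (p1 + p2) / 2
    numerator = (alpha_z * math.sqrt(2 * p_bar * (1 - p_bar))
                 + power_z * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

n = sample_size_per_variant(0.05, 0.065)  # 5% -> 6.5% conversion rate
visitors_per_day = 7_000 / 7              # 7,000 weekly visitors
days = 2 * n / visitors_per_day           # the two variants share the traffic
print(f"{n} visitors per variant, ~{days:.1f} days at 7,000 visitors/week")
```

Run it and you land just under the 8 days those calculators report for this scenario.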
Eight days to increase your conversion rate by 1.5 percentage points?! Now that’s a bargain if you ask me. Too bad you fell into the other trap!
The metric you wanted to review first was those 8 days. Do they cover at least one (if not two) full customer journeys?
Otherwise, two cohorts will have entered your A/B test results (generating clicks, for example), but only one will have had time to go through the entire customer journey (and thus the possibility to generate a conversion).
And that skews your results dramatically.
Again, this highlights that the longer your test runs, the more accurate its results will be, which can be especially challenging in B2B, where purchasing cycles can last years.
In that case, you probably want to review process milestones before the purchase and ensure conversion rate variations are somewhat flat. That will indicate your results are getting accurate.
As you can see, reaching stat sig is far from enough to decide whether your test results are accurate. You need to plan your audience first and let your test run long enough.
Other common A/B testing mistakes in PPC
While the above is critical in my mind, I can’t help but point out other mistakes just for the “fun” of it.
Not segmenting traffic sources
PPC pros know this by heart: Branded search traffic is worth much more than cold, non-retargeting Facebook Ads audiences.
Imagine a test where, for some reason, your branded search traffic share inflates relative to that cold Facebook Ads traffic share (thanks to a PR stunt, let’s say).
Your results would look so much better! But would those results be accurate? Probably not.
Bottom line: you want to segment your test by traffic source as much as possible.
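Here’s a toy illustration of the PR-stunt scenario with made-up numbers (a hypothetical 8% branded search conversion rate and 1% for cold Facebook traffic): a mix shift alone inflates the blended conversion rate, even though no segment actually improved.

```python
def blended_cvr(mix, cvr):
    """Blended conversion rate given a traffic mix and per-segment CVRs."""
    return sum(mix[segment] * cvr[segment] for segment in mix)

# Hypothetical per-segment conversion rates that never change during the test
cvr = {"branded_search": 0.08, "cold_facebook": 0.01}

mix_before = {"branded_search": 0.30, "cold_facebook": 0.70}
mix_after = {"branded_search": 0.60, "cold_facebook": 0.40}  # PR stunt shifts mix

print(f"Blended CVR before: {blended_cvr(mix_before, cvr):.1%}")  # 3.1%
print(f"Blended CVR after:  {blended_cvr(mix_after, cvr):.1%}")   # 5.2%
```

Segment-level reporting would immediately show that both CVRs were flat the whole time; the blended number alone would have credited your test variation with a lift it never produced.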
Sources I’d recommend looking into before launching your test:
- SEO (oftentimes, that’s 90% branded traffic).
- Emailing and SMS (existing clients overperform most of the time).
- Retargeting (those people know you already; they’re not your average Joe).
- Branded paid search.
Make sure you’re comparing similar things in your tests.
For instance, despite Google suggesting that doing a Performance Max vs. Shopping experiment “helps you determine which campaign type drives better results for your business,” it’s not an apples-to-apples comparison.
They don’t mention that Performance Max covers a broader range of ad placements than Shopping campaigns. This makes the A/B test ineffective from the start.
To get accurate results, compare Performance Max with your entire Google Ads setup, unless you use brand exclusions, in which case you’ll want to compare Performance Max with everything in Google Ads except branded Search and Shopping campaigns.
Not taking critical segments into account
Again, most marketers know that mobile devices perform very differently from their desktop counterparts. So why would you blend desktop and mobile data in your A/B test?
Same with geos – you shouldn’t compare U.S. data with data from France or India. Why?
- Competition isn’t the same.
- CPMs vary widely.
- Product-market fit isn’t identical.
Make sure to “localize” your tests as much as possible.
Final segment: seasonality.
Unless you’re working on that always-on-promo type of business, your average customer isn’t the same as your Black Friday / Summer / Mother’s Day customer. Don’t cram all those A/B tests into one.
Avoid A/B testing traps for better PPC results
Understanding these key issues helps you design rigorous A/B tests that truly move the needle on your most important metrics.
With some tweaks to your process, your tests will start paying dividends.
Contributing authors are invited to create content for Search Engine Land and are chosen for their expertise and contribution to the search community. Our contributors work under the oversight of the editorial staff and contributions are checked for quality and relevance to our readers. The opinions they express are their own.