Results: Identical Ad Sets, a Split Test, and Chaos

I was going to hold off on sharing the fact that I tested completely identical ad sets as a big reveal, but I decided to spoil the surprise by putting it in the title. I don’t want you to miss what I did here.

The fact that I tested identical ad sets won’t be the surprise. But, there is plenty to be found here that will raise eyebrows.

It’s kinda crazy. It’s ridiculous. Some may consider it a waste of money. And there are so many lessons found within it.

Let’s get to it…

The Inspiration

Testing stuff is my favorite thing to do. There’s always something to learn.

Several of my recent tests have me questioning whether targeting even matters anymore (read this and this). It’s not that it’s somehow unimportant that you reach the right people. It’s that, because of audience expansion when optimizing for conversions, the algorithm is going to reach who the algorithm is going to reach.

It’s this “mirage of control” that sticks with me. But, there’s something else: If the algorithm is going to do what the algorithm is going to do, what does that say about the impact of randomness?

For example, let’s say you are testing four different targeting methods while optimizing for conversions:

  • Advantage+ Audience without suggestions
  • Advantage+ Audience with suggestions
  • Original audiences w/ detailed targeting (Advantage Detailed Targeting is on and can’t be turned off)
  • Original audiences w/ lookalike audiences (Advantage Lookalike is on and can’t be turned off)

In three of these options, you have the ability to provide some inputs. But in all of them, targeting is ultimately algorithmically controlled. Expansion is going to happen.

If that’s the case, what can we make of the test results? Are they meaningful? How much of the result was due to your inputs, and how much to expansion? Are they completely random? Might we see a different result if we ran the same test four times?

Once I started to consider the contributions of randomness, it made me question every test we run that’s based on reasonably small sample sizes. And, let’s be honest, advertisers make big decisions on small sample sizes all the time.

But, maybe I’m losing my mind here. Maybe I’m taking all of this too far. I wanted to test it.

The Test

I created a Sales campaign that consisted of three ad sets. All three had identical settings in every way.

1. Performance Goal: Maximize number of conversions.

2. Conversion Event: Complete Registration.

Note that the reason I used a Sales campaign was to get more visibility into how the ads were delivered to remarketing and prospecting audiences. You can do this using Audience Segments. I used Complete Registration so that we could generate somewhat meaningful results without spending thousands of dollars on duplicate ad sets.

3. Attribution Setting: 1-day click.

I didn’t want results for a free registration to be skewed or inflated, particularly by view-through conversions.

4. Targeting: Advantage+ Audience without suggestions.

5. Countries: US, Canada, and Australia.

I didn’t include the UK because it isn’t allowed when running an A/B test.

6. Placements: Advantage+ Placements.

7. Ads: Identical.

The ads were customized identically in each case: no differences in copy or creative, whether by placement or through Advantage+ Creative. The ads were also created from scratch, so they didn’t leverage engagement from a prior campaign.

Surface-Level Results

First, let’s take a look at whether the delivery of these three ad sets was mostly the same. The focus in this case would first be on CPM, which would impact Reach and Impressions.

It’s close. CPM was within about $1 across the three ad sets, with Ad Set C the cheapest. That’s not a significant advantage, but it could lead to more results.

I’m also curious about the distribution to remarketing and prospecting audiences. Since we used the Sales objective, we can view this information with Audience Segments.

Remarketing spend falls within a range of about $9 across the three ad sets, but we can’t ignore that the most budget was spent on remarketing for Ad Set B. That could mean an advantage for more conversions. Keep in mind that results won’t be inflated by view-through conversions since we’re using 1-day click attribution only.

Conversion Results

Let’s cut to the chase. Three identical ad sets spent a total of more than $1,300. Which would lead to the most conversions? And how close is it?

Ad Set B generated the most conversions, and it wasn’t particularly close.

  • Ad Set B: 100 conversions ($4.45/conversion)
  • Ad Set C: 86 conversions ($5.18/conversion)
  • Ad Set A: 80 conversions ($5.56/conversion)

Recall that Ad Set B didn’t have the cheapest CPM, so cheaper delivery doesn’t explain the win. Ad Set B generated 25% more conversions than Ad Set A, and Ad Set A’s cost per conversion was more than a dollar higher.
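
For a rough sense of whether that spread is even statistically meaningful, here’s a quick sanity check. It assumes spend was roughly equal across the three ad sets (which the reported costs per conversion imply) and uses a simple chi-square goodness-of-fit test. This is my own back-of-the-envelope math, not anything Meta reports.

```python
# Back-of-the-envelope check: could three ad sets with the SAME underlying
# conversion rate plausibly produce these counts? (Assumes roughly equal
# spend per ad set, which the reported CPAs suggest.)
from scipy.stats import chisquare

conversions = [80, 100, 86]  # Ad Sets A, B, C

# Null hypothesis: one shared conversion rate, so expected counts are equal.
stat, p_value = chisquare(conversions)

print(f"chi-square = {stat:.2f}, p-value = {p_value:.2f}")
# The p-value lands around 0.3, nowhere near conventional significance,
# which is consistent with the spread being ordinary randomness.
```

In other words, a spread like 80 vs. 100 conversions is comfortably within what you’d expect from chance alone.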

Did Ad Set B generate more conversions because of that additional $9 spent on remarketing? No, I don’t think you’d have a particularly strong argument there…

Ad Set C generated, by far, the most remarketing conversions with 16, compared to only 7 from Ad Set B (and 5 from Ad Set A).

Split Test Results

Keep in mind that this was an A/B Test, so Meta was actively looking to find a winner. A winner was found quickly (I didn’t allow Meta to end the test early once a winner was found), and Meta even provides a percentage confidence that the winner would stay the same if the test were run again.

Let’s break down what this craziness means…

Based on a statistical simulation of the test data, Meta gives Ad Set B a 59% chance of winning if the test were run again. While that’s not overwhelming support, it’s more than twice the confidence given to Ad Set C (27%). Ad Set A, meanwhile, is a clear loser at 14%.

Meta’s statistical simulation clearly has no idea that these ad sets and ads were completely identical.

Maybe the projected performance isn’t undermined by the fact that everything about each ad set is identical. Maybe Ad Set B’s initial engagement and momentum give it a real statistical advantage going forward.

I don’t know. I wasn’t a Statistics major in college, but that feels like a reach.
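
For the curious, here’s roughly how a “chance to win” number can be produced from results like these. Meta doesn’t publish its exact method, so treat this as a generic Bayesian-style Monte Carlo sketch that uses only the observed conversions and an assumed, roughly equal spend per ad set; the output won’t match Meta’s 59/27/14 exactly.

```python
# Rough sketch of how a "chance to win" percentage can be simulated.
# Meta doesn't publish its exact method; this is a generic Bayesian-style
# Monte Carlo using the observed conversions and an ASSUMED equal spend.
import numpy as np

rng = np.random.default_rng(0)

conversions = np.array([80, 100, 86])    # Ad Sets A, B, C
spend = np.array([445.0, 445.0, 445.0])  # approximate, inferred from the CPAs

# Sample plausible conversions-per-dollar rates from a Gamma posterior
# (flat prior) for each ad set, 100,000 times.
samples = rng.gamma(shape=conversions + 1, scale=1 / spend, size=(100_000, 3))

# "Chance to win" = how often each ad set has the best sampled rate.
win_share = np.bincount(samples.argmax(axis=1), minlength=3) / len(samples)
print({name: round(float(share), 2) for name, share in zip("ABC", win_share)})
```

The point isn’t the exact percentages; it’s that a simulation like this will happily crown a “winner” even when the true performance of every option is identical.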

Lessons Learned

This entire test could seem like a weird exercise and a waste of money. But, it may be one of the more important tests I’ve ever run.

Unlike other tests, we know that the variance in performance has nothing to do with the ad set settings, the ad copy, or the creative. We shrug off the 25% difference because we know the label “Ad Set B” didn’t provide some kind of delivery enhancement that generated 25% more conversions.

Doesn’t this say something about how we view test results when things weren’t set up identically?

YES!!

Let’s say that you are testing different ads. You create three different ad sets and spend $1,300 to test those three ads. One generates 25% more conversions than another. It’s the winner, right? Do you turn the others off?

Those who actually were Statistics majors in college are likely itching to scream at me in the comments about small sample sizes. YES! This is a key point!

Randomness is natural, but it should even out with time. In the case of this test, what results would come from the next $1,300 spent? And then the next? More than likely, the results will continue to fluctuate and we’ll see different ad sets take the lead in a race that will never be truly decided.

It is incredibly unlikely that, if we spent $130,000 on this test rather than $1,300, we’d see the winning ad set hold a 25% advantage over the bottom performer. And that is an important theme of this test, and of randomness in general.
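
To make that concrete, here’s a toy simulation. It assumes three truly identical ad sets that all convert at the same $5 cost per result (a made-up number in the ballpark of this test) and compares a budget of roughly $433 per ad set against 100 times that.

```python
# Toy simulation: three identical ad sets with the SAME true cost per
# result, compared at two budget levels. The $5 CPA is a made-up
# assumption; only the pattern matters, not the exact numbers.
import numpy as np

rng = np.random.default_rng(1)
TRUE_CPA = 5.0  # assume every ad set really converts at $5.00 per result

def average_gap(budget_per_ad_set: float, trials: int = 10_000) -> float:
    """Average % gap between the best and worst of three identical ad sets."""
    expected = budget_per_ad_set / TRUE_CPA
    conversions = rng.poisson(expected, size=(trials, 3))
    return (conversions.max(axis=1) / conversions.min(axis=1) - 1).mean()

print(f"~$433 per ad set:    'winner' beats 'loser' by {average_gap(433):.0%}")
print(f"~$43,333 per ad set: 'winner' beats 'loser' by {average_gap(43_333):.0%}")
# At the smaller budget, the "winner" typically looks about 20% better,
# even though nothing is actually different. At 100x the budget, the gap
# shrinks to roughly 2%.
```

Same identical ad sets, very different conclusions, depending on how much data you let accumulate.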

What does a $1,300 snapshot of ad spend mean? About 266 total conversions? Can you make decisions about a winning ad set? A winning ad creative? Winning text?

Do not underestimate the contribution of randomness to your results.

Now, I don’t want the takeaway to be that all results are random and mean nothing. Instead, I ask you to limit your obsession over test results and finding winners if you can’t generate the volume needed to be confident the trends would continue.

Some advertisers test everything. And if you have the budget to generate the volume that will give you meaningful results, great!

But, we need to stop this small sample size obsession with testing. If you’re unlikely to generate a meaningful difference, you don’t need to “find a winner.”

That’s not paralyzing. It’s freeing.

A Smaller Sample Size Approach

How much you need to spend to get meaningful results will vary depending on several factors. But, for typical advertisers who don’t have access to large budgets, I suggest taking more of a “light test” approach.

First, consolidate whatever budget you have. Part of the issue with testing on a smaller budget is that splitting it up fragments your spend even further. Meaningful results become even less likely when you split a $100 budget five ways.

You should still test things, but it doesn’t always need to be with a desire to find a winner.

If what you’re doing isn’t working, do something else. Use a different optimization. A different targeting approach. Different ad copy and creative. Try that out for a few weeks and see if results improve.

If they don’t? Try something else.

I know this will frustrate those who feel like they need to run split tests all the time for the purpose of finding “winners.” But when you understand that randomness drives a reasonable chunk of your results, that obsession weakens.

Your Turn

Have you seen a similar contribution of randomness to your results? How do you approach that realization?

Let me know in the comments below!