Overtracking and trigger analysis: reducing sample sizes while INCREASING the sensitivity of experiments

Pablo Estevez
Booking.com Data Science
14 min read · Apr 20, 2022


Pablo Estevez, Or Levkovich

For product development teams, setting up A/B experiments involves balancing trade-offs between learning goals, experiment design complexity, and development work. One such trade-off appears when making sure that the experiment tracks the behavior of the right users. Tracking users we cannot treat (called overtracking) affects the variance of our metrics and dilutes the treatment effect, making its detection harder.

In this article, we make the trade-off between overtracking, detectable effects and sample size explicit, highlighting how the impact of overtracking on sample size and false negatives can be quite high. This justifies larger investment by product development teams to get experiment tracking right, or to apply trigger analysis when studying the experiment data. Correcting overtracking can be one of the most cost-efficient ways to reduce the minimum detectable effect of experiments, all while reducing sample sizes and subject-recruitment times.

Over… what did you say?

Overtracking!

In a randomized controlled trial (RCT) or A/B test, we randomly separate our population into treated and control groups, apply the treatment to everyone in the treatment group, and observe the difference in outcomes through the lens of statistics. But sometimes we end up considering subjects that are not treatable regardless of which group they end up in.

Think about a change that occurs at the bottom of a search results page. We can assign users who reach this page to either the control or treatment group, but only those who scroll far enough to see the change will be treatable (namely, eligible to be treated: treated in the treatment group, not treated in the control group). Or think of a new ranking model which will often return the same recommendations as the old model: there would only be a “treatment” when the two systems disagree on the selection, order, or number of accommodation alternatives. When they agree, users in both the control and treatment groups get the same user experience. And what about when the two ranking models only start disagreeing towards the bottom of the page? If we were to study the original population in all three of these cases, we would be overtracking, as some of the units tracked in the analysis would not be treatable.

Two sets of users are exposed to two different ranking models in an experiment. They only start to disagree below the view-port, so only users who scroll past that point could be affected by the treatment. If users who do not scroll are included in the experiment, we say that the experiment is “overtracking”.

We can of course try to solve this issue, but it comes at a cost: time and effort. In the above case we would need to call both ranking models to detect when they disagree, and introduce client-side tracking code in an invisible pixel to detect when the first disagreeing result enters the viewport (often called “counterfactual triggering”). We could also conduct “trigger analysis”, which achieves the same result by ignoring in the analysis phase those users who did not get treated (Deng and Hu, 2015). Given that solving this issue requires quite a lot of work, it only makes sense to do so if overtracking is a big issue.

So… what’s the issue with overtracking?

Any observed difference in outcome between the control and treatment groups is affected by the natural variance of the outcome in each group (the baseline noise of the measurement), and by the effect of the treatment. The hope is for the effect to be so much bigger than the noise that we are able to detect it. Intuitively, if we add users to the experiment that are not treatable, they will also contribute to the variance. But they cannot contribute to the measurable effect of the treatment as they are effectively not treated. Since the experiment now has more users, the total effect will also dilute, making it harder to detect it (Deng and Hu, 2015).

More formally, overtracking reduces the experiment power, thus requiring larger sample sizes to recover the ability to detect effects, or increasing the minimum size of the effects that can be detected.

But is it that bad?

Reducing power is never desirable, but improving tracking can take a lot of work. To justify that effort we need to get an idea of the size of the issue. For that, we need to look into the math of an experiment’s power.

If you prefer to skip the math section, here is the essence: we show that when overtracking by a factor k, the required sample size for the experiment must increase by a factor of k-squared, scaled by the ratio between the variances in the overtracked and treatable populations.

The math

Let us start by defining three populations. The treatable population consists of those subjects that can be treated: those in variant get treated, and those in base could have been treated. The untreatable population are those who are not treatable but get included in an overtracked experiment. And the overtracked population is the combination of treatable and untreatable samples in the experiment. We will use subscripts t, u and o to indicate parameters of each of these populations, respectively. We also introduce the overtracking factor k, such that for every k subjects in the overtracked population only 1 is treatable and (k−1) are untreatable. Note that the treatable and untreatable populations can differ in other aspects, in particular in their mean outcomes and their variances.

Different populations involved in an overtracked experiment. Subjects in green and red have different experiences and are thus treatable. Subjects in yellow have the same experience in control and treatment and are thus untreatable. The combination of both is the overtracked population.

Since only treatable users are affected by the treatment, we could run an experiment where only treatable subjects are included. In this illustration, we use a 2-sample t-test with equal sample sizes for the treatment and control groups. The sample size nₜₒₑ required to measure an effect dₜ in a treatable-only experiment is (Sullivan, 2022):
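nₜₒₑ = 2 · sₜ² · (z₁₋α/₂ + z₁₋β)² / dₜ²        (1)

where z₁₋α/₂ and z₁₋β are the standard normal quantiles corresponding to the chosen significance level and power.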

When this treatable population is joined by untreatable subjects, we get a new required sample size computation for the overtracked experiment, nₒₑ:
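nₒₑ = 2 · sₒ² · (z₁₋α/₂ + z₁₋β)² / dₒ²        (2)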

What is worth noting is that the effect we are trying to detect is still the same: only the treatable population can be affected by the treatment, and thus the effect to be detected is still the effect dₜ on that treatable population. So we need to understand how dₜ dilutes with overtracking and express dₒ as a function of dₜ and k. Note also that the mean outcomes of the control and treatment group (μ and μ*) on the overtracked experiment are the pooled means of the outcomes in the treatable population μₜ and untreatable population μᵤ, and that the mean outcomes on the untreatable population are the same for control and treatment groups (as there is effectively no treatment).
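Writing μₜ* for the mean outcome of treated subjects in the treatable population, this gives:

μ = (μₜ + (k−1) · μᵤ) / k        (3)

μ* = (μₜ* + (k−1) · μᵤ) / k        (4)

dₒ = μ* − μ = (μₜ* − μₜ) / k = dₜ / k        (5)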

Based on equation (5), we find that to detect a change dₜ in the treatable population, we need to detect a change dₜ/k in the overtracked population (k times smaller than the original effect).

The increase in sample size required to detect dₜ, comparing the treatable-only experiment to the overtracked experiment, is then:
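nₒₑ / nₜₒₑ = (sₒ² / sₜ²) · (dₜ / dₒ)² = k² · sₒ² / sₜ²        (6)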

Equation (6) shows that when overtracking by a factor k, the required sample size to detect the same effect in the treatable population increases by a factor k², scaled by the ratio between the variances in the overtracked and treatable populations.

An interesting remark is that experimenters often find themselves dealing with a population where untreatable subjects do not replace treatable ones, but instead are added to the population of treatable subjects. Since this overtracked population is already k times larger than the treatable-only one, the experimenter only needs to increase it by another factor k to recover sensitivity to effects on the treatable population. If instead the experimenter decides to leave the population size as it is, the detectable effect size on the treatable population increases by the square root of k, making it harder to detect small effects.

We can also observe these effects through numerical simulations. As shown below in Figure 1, when introducing overtracking to a series of simulated experiments with equal variances (sₜ² = sₒ²), the distributions of the outcome metric become closer and increasingly overlapping due to the effect dilution. While an increase in sample size by a factor k makes the spread of the distributions narrower, it is not enough to offset their overlap. When increasing the population size by k², the decrease in the detectable effect (shown by the decreasing distance between base and variant sample means) occurs at the same pace as the drop in variance of the two groups (observed in the narrowing of the distributions’ bell curves). As a result, the overlap between the distributions of base and variant remains unchanged, which indicates that the experiment’s power and ability to detect an effect are also unchanged.

Figure 1: Hypothesis testing under varying levels of experiment overtracking

Similarly, Figure 2 shows a bootstrapped simulation computing the power of an experiment, by computing the p-value on each bootstrap and comparing it to the threshold defined by the chosen significance level. The simulation introduces a fixed treatment effect, and the sample size at k=1 is selected to achieve 80% power, as observed on the left side of the graph. As k increases, we observe how increasing the sample size by k² maintains the same power, while increasing it only by a factor k results in a steadily rising false negative rate (FNR), effectively reducing the power of the experiment. A modest value of k=2 is enough to reduce power to about 50%, even after increasing the population size by k.

Figure 2: Experiment power for a fixed effect, under varying levels of overtracking factor
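As a rough sketch of this kind of check (not the code behind Figure 2), the simulation below assumes a binary outcome with a 10% base rate and a 1.5 pp lift on the treatable population, and estimates power with a two-proportion z-test:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def power(n_per_group, k, p=0.10, lift=0.015, alpha=0.05, n_sims=5000):
    """Estimated power of a two-proportion z-test when only 1 in k tracked users is treatable.
    In the tracked (overtracked) variant group the lift is diluted to lift / k."""
    successes_v = rng.binomial(n_per_group, p + lift / k, size=n_sims)  # variant conversions
    successes_b = rng.binomial(n_per_group, p, size=n_sims)             # base conversions
    p_pool = (successes_v + successes_b) / (2 * n_per_group)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n_per_group)
    z = (successes_v - successes_b) / n_per_group / se
    return np.mean(2 * stats.norm.sf(np.abs(z)) < alpha)               # fraction of significant results

n1 = 6300  # roughly 80% power per group at k=1 for a 1.5 pp lift on a 10% base rate
for k in (1, 2, 4):
    print(f"k={k}: n*k -> power {power(n1 * k, k):.2f}, n*k^2 -> power {power(n1 * k**2, k):.2f}")

Growing the sample size only by k lets the estimated power drop towards roughly 50% at k=2 and keep falling, while growing it by k² keeps it at about 80%, in line with Figure 2.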

Binary outcomes

For the case of binary outcome variables, we can further express the variances as a function of k and the distributions’ parameters:
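sₜ² = pₜ · (1 − pₜ) ,   sᵤ² = pᵤ · (1 − pᵤ)        (7)

pₒ = (pₜ + (k−1) · pᵤ) / k ,   sₒ² = pₒ · (1 − pₒ)        (8)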

where pₜ, pᵤ and pₒ are the respective probabilities of success of the treatable, untreatable and overtracked populations.

Thus, for the binary case the increase in sample size becomes:
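nₒₑ / nₜₒₑ = k² · pₒ · (1 − pₒ) / (pₜ · (1 − pₜ))        (9)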

A similar exercise could be developed for the case of continuous or discrete numeric metrics. This will generally be much more complex, and will depend on the specific distributions of the outcome on the treatable and untreatable populations and their parameters.

What can we make of these math formulas?

Let us take this back to our opening example, where a team is testing the value of a new ranking model on the Booking.com website. The model is used on the search results page, which about 50% of users reach. Moreover, about 25% of users reaching that point are eligible for the new model (e.g. because it requires them to be logged in to extract new features). Finally, the old and new models frequently agree on their top-ranked items, and only about 80% of users scroll past the point where they start to differ. Combining these three fractions indicates that only 10% of the total population is treatable by our experiment, equivalent to an overtracking factor k=10. For the sake of the example, let us assume that the goal is to improve conversion from a baseline rate of 10% (the same in the total and the treatable populations), with a daily traffic of ten thousand users on our website (thus one thousand in the treatable population).

The hypothesis of our experiment is that we will impact the conversion of the treatable population, since that is the only one that will effectively receive a treatment. In 10 days we would recruit 10 thousand subjects from that population, allowing us to detect effects of at least 1.5 percentage points (nₜₒₑ=10000 for dₜ=1.5 pp). If instead we ran the overtracked experiment for the same time, the increased traffic of 100 thousand subjects would bring the detectable effect on the overtracked population down to 0.5 pp. But since this is a diluted effect, it would be equivalent to detecting an effect of 5 pp on the treatable population (nₒₑ=k·nₜₒₑ=100000; dₒ=0.5 pp; dₜ=k·dₒ=5 pp). We would now miss effects between 1.5 and 5 pp in the treatable population that were previously detectable! Alternatively, we could run the experiment for 10 times longer to obtain a sample size of 1 million subjects, and again guarantee detecting effects of 1.5 pp in the treatable population (nₒₑ=k²·nₜₒₑ=1M; dₒ=0.15 pp; dₜ=k·dₒ=1.5 pp). But the required duration would increase from 10 days to over 3 months! Either way, we pay a high price for not fixing overtracking, either in increased false negatives or in running time.
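As a quick back-of-the-envelope check (assuming a two-sided test at 5% significance and 80% power; the exact figures shift slightly with different test settings):

from scipy import stats

z = stats.norm.ppf(0.975) + stats.norm.ppf(0.80)  # ≈ 2.8
p = 0.10                                          # baseline conversion rate
k = 10                                            # overtracking factor

def mde(n_total, variance):
    """Minimum detectable absolute effect for a 2-sample test with n_total subjects in total."""
    n_per_group = n_total / 2
    return z * (2 * variance / n_per_group) ** 0.5

n_treatable = 10_000                              # 10 days of treatable traffic
print(mde(n_treatable, p * (1 - p)))              # ≈ 0.017: effects of roughly 1.5-2 pp are detectable
print(k * mde(k * n_treatable, p * (1 - p)))      # same 10 days, overtracked: ≈ 5 pp on the treatable population
print(k * mde(k**2 * n_treatable, p * (1 - p)))   # 100 days of overtracked traffic: back to ≈ 1.5-2 pp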

What about dissimilar treatable and untreatable populations?

The above example assumes the same variance in the treatable and overtracked populations, but what happens when these differ? For binary outcomes, we can use the closed formula obtained above to explore this more general case. Figure 3 below shows that when the proportion parameters of the treatable and untreatable populations match, the required increase in sample size is k². When they differ, a larger variance in the untreatable population generally results in faster than k² growth.

Figure 3: Growth in required sample size under overtracking conditions for a binary metric, for different values of the proportion parameter p on the treatable and untreatable populations.
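Equation (9) makes this easy to explore numerically; a minimal sketch (the values below are illustrative, not those behind Figure 3):

def inflation(k, p_t, p_u):
    """Sample size inflation n_oe / n_toe of equation (9) for a binary outcome."""
    p_o = (p_t + (k - 1) * p_u) / k              # pooled success probability of the tracked population
    return k**2 * p_o * (1 - p_o) / (p_t * (1 - p_t))

for p_u in (0.05, 0.10, 0.30):                   # untreatable success probability below, equal to, above p_t
    print(p_u, [round(inflation(k, p_t=0.10, p_u=p_u), 1) for k in (1, 2, 4, 10)])

With pᵤ = pₜ the inflation is exactly k²; a noisier untreatable population (here pᵤ=0.30) pushes it well above k², while a quieter one stays below it.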

For continuous outcomes, the effect of overtracking depends on the specific outcome distributions. The general process remains the same though: the change in effect size imposes a growth of sample size of k² when the treatable and overtracked variances are the same, while when they differ, the relation between these variances may accelerate or slow down that growth. To exemplify this, we can resort to numerical simulations. The plot below (Figure 4) shows the situation for normal distributions with different variances and means between the treatable and untreatable populations. We can observe how the required increase of sample size accelerates with increased variance in the untreatable population, and with an increasing difference of means between the treatable and untreatable populations, as both situations increase the variance of the overtracked population. Similar results can be obtained when sampling Poisson or negative binomial distributions, often used to model count outcomes such as page views.

Figure 4: Growth in required sample size under overtracking conditions for a normally distributed metric, for different values of the mean and variance of the untreatable population.

Bots as a source of overtracking

One interesting case in e-commerce is the effect of bot traffic. Bots are often unaffected by experiments while performing their tasks automatically, and thus can constitute a form of overtracking for which the above results also apply. One caveat though is that for some important e-commerce metrics such as conversion, their probability of success and variance are close to 0 (as they rarely perform transactions). Replacing pᵤ=0 in Equation 9 gives:
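nₒₑ / nₜₒₑ = k² · (pₜ/k) · (1 − pₜ/k) / (pₜ · (1 − pₜ)) = k · (1 − pₜ/k) / (1 − pₜ)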

And since the traffic is already increased by k due to the presence of bots (compared to the treatable population), the increase in running time becomes:
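nₒₑ / (k · nₜₒₑ) = (1 − pₜ/k) / (1 − pₜ)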

This ratio is close to 1 if pₜ is small, as is usually the case for e-commerce conversion, meaning that the required running time to detect effects on the treatable population barely changes. It rises towards 1/(1−pₜ) for larger values of pₜ and k, which is still a limited increase in required running time except for very high-conversion funnels. Do note that for metrics which bots do trigger, or for numeric metrics where the mean outcome of real users differs from that of bots (e.g. profit metrics), tracking bots into experiments can have a sizeable impact on power.

So, what do we do?

Back to our example and its multiple sources of untreatable users. We now have a tool to drive the discussion within the product development team and identify where to focus our tracking efforts. Limiting the tracking of our experiment to only the search results page is trivial, and would already cut the overtracking factor, and thus the running time of the experiment, in half. Eligibility conditions are trickier, as they also need to be checked in our control group, introducing potential changes to our reference population. But given the large overtracking factor associated with this step, we decide to at least verify the most stringent and least harmful eligibility conditions (e.g. whether the user is logged in), increasing the proportion of treatable subjects at this step from 25% to 85%. Finally, calling both models in each group and tracking only on-view of the first differing result would be hard to code and error-prone while correcting only a small amount of overtracking, so we may choose not to address this issue. This leaves us with an overtracking factor k=1.5, about 7 times smaller than what we started with.

At this point, the team will be able to detect effects about 2.6 (the square root of 7) times smaller than at the start, while avoiding the almost 7 times longer running time it would have taken to reach the same sensitivity in the overtracked setup.
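A minimal sketch of that arithmetic, using the fractions from the example above:

reach_search = 0.50                              # fraction of users reaching the search results page
eligible_before, eligible_after = 0.25, 0.85     # eligible users among those tracked, before/after the fix
scroll = 0.80                                    # fraction scrolling past the first differing result

k_before = 1 / (reach_search * eligible_before * scroll)   # ≈ 10: everyone is tracked
k_after = 1 / (eligible_after * scroll)                    # ≈ 1.5: tracking only on the search page,
                                                           # after checking the cheap eligibility conditions
print(k_before, k_after, k_before / k_after)               # ≈ 10, 1.5 and a ~7x reduction in overtracking
print((k_before / k_after) ** 0.5)                         # ≈ 2.6x smaller detectable effects for the same runtime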

Key takeaways

We hope to have convinced you that overtracking dilutes the effects that we intend to detect in an experiment, and makes it harder to test the hypothesis in question. And not just a little, but a lot: if overtracking exists with proportion k, it is often necessary to recruit k² times as many visitors into the experiment in order to discover the same expected effect with the pre-specified statistical uncertainties. The alternative of leaving the sample size unchanged increases the minimum detectable effect size by a factor k, increasing the chance of false negatives. Overtracking often already increases the sample size or recruitment rate by k, but this only solves part of the problem: to reach the desired power we still need to recruit k times more subjects, or accept an increase of the minimum detectable effect by the square root of k.

In view of this, the effort required to reduce overtracking by experiment design or by trigger analysis often pays off. Even though fixing overtracking takes some work, it can in many cases be the most efficient way to increase power, reducing running time and allowing detection of even smaller effects.

Next time you run an experiment, remember to:

  • Check if you are overtracking by identifying your targetable population and comparing it with your tracked population.
  • Approximate the size of the overtracking. If most of your tracked population is targetable, then k is small and you are probably OK. If k approaches 2 (where around 50% of your population is targetable), then you start having an issue. And a k over 4 is likely a red flag.
  • Use the results in this article to assess the impact of overtracking. How much is the minimum detectable effect increasing due to overtracking? How much shorter could you run the experiment if overtracking is addressed? How much longer would it need to run to recover from the lost sensitivity if you do not address overtracking?
  • Use this information to decide together with your team if and where to address overtracking.

References

Deng, A., & Hu, V. (2015). “Diluted treatment effect estimation for trigger analysis in online controlled experiments”. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (pp. 349–358).

Sullivan, L. (2022). “Power and Sample Size Determination”. [online] sphweb.bumc.bu.edu. Available at: <https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_power/bs704_power_print.html> [Accessed 5 April 2022].

Acknowledgments

Previous work by Nils Skotara and Christina Katsimerou inspired us to look into the case of bot traffic as an instance of overtracking.

Lin Jia, Nils Skotara, Guy Taylor, James Hiam, and Tanja Matic provided important feedback during the writing of this article.
