Leveraging proxy variables for causal inference

Christina Katsimerou
Booking.com Data Science
14 min read · Jun 24, 2021


Authors: Christina Katsimerou, Devini Senaratna, Camille Strasser

Customer behaviour is a key driver of business outcomes. For instance, at Booking.com we have launched transportation and attractions because we believe that using one app for all aspects of the trip makes the overall travel experience easier and smoother for our customers and, as a result, increases their engagement with our platform. Similarly, we encourage flexibility via bookings with a free-cancellation option, to bring additional peace of mind to the travel planning process for customers and to reduce operational workload for us and our partners in uncertain times. Being able to measure the impact of customer behaviour on downstream business metrics can help teams prioritise ideas that drive business growth and design better products around them. Conversely, failing to understand the business drivers might stall product development around local optima.

While A/B testing is the standard tool for measuring the impact of exposing a new product to the customer, it falls short in measuring the impact of the product usage on the business outcomes. For instance, we can measure the impact of encouraging free cancellation bookings by displaying free-cancellation options more prominently, but we can’t control which customer books with free cancellation.

On the other hand, establishing causal relationships from observational data is extremely difficult due to spurious correlations; for example, a loyal customer is more likely to use our new products, because they like us in the first place; or a customer that booked with free cancellation is more likely to cancel their booking, as their uncertain travel plans were the reason to choose free cancellation. Ignoring the factors that drive both product usage and the business outcomes would likely result in overestimation of the causal effect of the product usage itself. And typically, our customers’ motives cannot be observed, making it impossible to estimate the causal effect with standard causal inference techniques.

In this post, we use a working example to demonstrate the need for causal inference methods at Booking.com. We summarise the limitations of existing methods and borrow a relatively new technique relying on negative control exposures and outcomes to estimate the causal effect of customer behaviour on business outcomes from observational data. We use simulated data and an A/B experiment to compare the method against other causal inference methods. We include sensitivity analysis to quantify the robustness of the estimates against assumption violations.

What is the impact of free-cancellation bookings on cancellations?

Refundable or free-cancellation bookings accommodate flexibility around a reservation for our customers and make it easier for them to manage their booking without the need to reach out to customer service. Especially in times of high uncertainty around travelling, for example due to a pandemic, our accommodation partners might want to offer free-cancellation on their rooms to attract customers that seek flexibility, even if there is some risk of cancellations.

It is therefore important for us to understand whether and to what extent increasing refundable bookings increases the number of customers that end up staying at the accommodation. We will refer to booking with a free cancellation option as treatment (X) and whether the customer stays at their accommodation (referred to as ‘stayed booking’ or ‘stay’) as the outcome (Y).

We call the effect of the treatment on the outcome the average causal effect (ACE) and, using Pearl's notation [1], we denote it as:

ACE = P(Y=1 | do(X=1)) − P(Y=1 | do(X=0))    (1)

Where:

  • X=1 when the accommodation has been booked with a free cancellation option
  • Y=1 when the customer books and stays at the accommodation

We employ the do-operator to express the idea that we are interested in the causal effect of the treatment, which is the result of an intervention, and to distinguish it from the observed difference between the refundable and non-refundable stays, or more formally*:

P(Y=1 | do(X=1)) − P(Y=1 | do(X=0)) ≠ P(Y=1 | X=1) − P(Y=1 | X=0)    (2)

The reason these quantities differ is that both the choice to book with free cancellation and the choice to stay at the accommodation are dictated by the customer's intent: customers with high intent to stay and certainty about the details of their trip are likely to opt for a, typically cheaper, non-refundable room, whereas customers with high uncertainty around their trip will probably opt for a refundable room, if available. In causal terminology, we call the customer intent a confounder and represent it in a causal graph as a common parent of the treatment and the outcome.

Fig. 1. A graphical representation of a confounder driving both the treatment and the outcome

To estimate the causal effect in the presence of the confounder U, all we need to do is stratify our customers into low- and high-intent groups and compute the difference between refundable and non-refundable stays within these groups. The ACE will be the weighted average of these sub-differences, or, as the backdoor criterion states:

ACE = Σ_u [P(Y=1 | X=1, u) − P(Y=1 | X=0, u)] P(u)    (3)

Randomised trials are sometimes impossible to run

However, we can’t observe our customers’ intent directly from the data, which is why we call U a hidden confounder. As such, we can’t use expression (3) to estimate the ACE, as it requires access to the hidden confounder. Normally, the best way to circumvent the hidden-confounder problem is to control which customers book with the free-cancellation option by means of a randomised controlled trial or A/B experiment. Such an intervention would remove the link between customer intent and the treatment, since we, as experimenters, decide which customers book with free cancellation. But of course, we can’t directly control customer behaviour.

Instead, we could use product experiments around free cancellation that impact free-cancellation bookings. With the travel restrictions due to the Covid-19 pandemic, the room-logic team proposed prioritising the refundable over the non-refundable rate in the room recommendation block, in order to facilitate flexibility and prevent situations where customers who booked non-refundable options are unable to cancel. To understand whether the idea is indeed beneficial for the customers, it was wrapped in a controlled experiment, in which the refundable rate, when available, was displayed above the non-refundable rate for half the traffic. Such an intervention increases the visibility of free-cancellation options and looks as follows in a causal graph:

Fig. 2. A graphical representation of an experiment affecting the treatment in the presence of a confounder.

In this graph, T represents the random exposure of visitors to the encouragement (T=0 means the visitor is in base and T=1 in the variant). This experiment can measure the effect of the increased free-cancellation exposure on the stayed bookings (T → Y), but not the effect of booking with free cancellation (X → Y). The encouragement will likely increase the free-cancellation bookings in the variant, as more customers are nudged towards booking with free cancellation. Despite the encouragement, a customer can still choose not to book with free cancellation, depending on their travel plans. This means that the hidden intent to stay remains a confounder between X and Y. The change in stayed bookings between base and variant comes from the customers who booked with free cancellation due to the encouragement but wouldn’t have done so otherwise (previously no-bookers and non-refundable bookers), also known as the compliers. For the compliers we can estimate the effect of free-cancellation bookings on stayed bookings as the compliers’ average causal effect (CACE):

CACE = [P(Y=1 | T=1) − P(Y=1 | T=0)] / [P(X=1 | T=1) − P(X=1 | T=0)]    (4)

This quantity can be estimated from the experiment data as the ratio of the difference in stayed bookings between variant and base to the difference in free-cancellation bookings between variant and base. But we cannot generalise the causal effect from the compliers to all our customers, because the compliers are not a random sample of our customer population. In other words, CACE is not equal to ACE. To estimate ACE across the entire population, we need to know:

  1. The effect of booking with free cancellation for the customers that didn’t, in particular:
  • the customers that always book non-refundable (non-refundable bookers in the variant). Had we “forced” them to book with free cancellation, they would either cancel, in which case booking with free cancellation would decrease the number of stays, or not cancel and stay, in which case the stays would be the same with or without the free-cancellation option.
  • the customers that never book (no bookers in the variant). Had they booked with free cancellation, they would either cancel or end up staying, similarly to the previous segment. Hence, for the never-bookers, booking with free cancellation would either increase the stays or keep them the same as without the free-cancellation option, respectively.

2. The effect of not booking with free cancellation for the customers that always book with free cancellation (free-cancellation bookers in base). Had they not had the free-cancellation option, they would either book the non-refundable option and stay, or not book at all. So, adding the option of free cancellation for these customers can result in fewer stays if they would have booked non-refundable anyway, or more stays if they wouldn’t have booked at all.

Using the ranges of values the effects can take in each of these customer segments gives us natural bounds on the ACE. Over the past decades there has been a lot of effort in tightening these bounds, e.g. [2]–[4]. Elaborating on this work goes beyond the purpose of this post; we will simply rely on standard results to obtain bounds for the ACE in the empirical validation section.
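For illustration, the complier ratio described above can be computed directly from experiment-level aggregates. The numbers below are made up purely for the example and are not real Booking.com data:

```python
# Hypothetical aggregates from an encouragement experiment (illustrative only).
base = {"visitors": 100_000, "fc_bookings": 8_000, "stays": 20_000}
variant = {"visitors": 100_000, "fc_bookings": 11_000, "stays": 20_450}

# Intention-to-treat effects of the encouragement T on outcome Y and treatment X.
itt_y = variant["stays"] / variant["visitors"] - base["stays"] / base["visitors"]
itt_x = variant["fc_bookings"] / variant["visitors"] - base["fc_bookings"] / base["visitors"]

# Compliers' average causal effect: ratio of the two intention-to-treat effects.
cace = itt_y / itt_x
print(round(cace, 3))  # prints 0.15
```

In words: the encouragement shifted 3% of visitors into free-cancellation bookings and lifted stays by 0.45 percentage points, so each marginal free-cancellation booking is associated with a 0.15 increase in stay probability for the compliers.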

Finding proxies for unobserved confounders

Since we cannot get a point estimate of the average causal effect from an experiment like the one mentioned above, we have to come up with different identification methods that use auxiliary information to overcome the problem of the hidden confounding. Specifically, we could use observed descendants of the unobserved confounding variable, or proxies. In our working example, proxies of the customer intent can be:

  • Booking window (days between the reservation date and check-in date), as shorter booking windows reveal stronger commitment to travel.
  • Logged-in status, as a logged-in visitor reveals higher intent to travel than one who is not logged in and is more likely to be just browsing.

We can now extend Fig. 1 as follows:

Fig. 3. A graphical representation of proxy variables of the customer’s intent to travel

In Fig. 3, logged-in status (Z) could also relate directly to X, as logged-in users get more notifications about free cancellations. Booking window (W) could also impact directly whether a customer will stay at the accommodation, because there is a higher chance of unexpected change of plans at larger booking windows.

Miao et al. [5] (building on prior work [6]) express the unknown E[Y|x, u]P(u), which appears in the backdoor criterion expression (3), through the observed P(W|Z, x) and P(Y|Z, x), exploiting the following conditional independencies derived from the graph of Fig. 3:

Z ⟂ Y | (U, X)    and    W ⟂ (Z, X) | U

Then, for categorical variables with k levels, the causal effect of X on Y can be identified, provided the k×k matrix P(W|Z, x) is invertible for every x, as:

P(Y=1 | do(x)) = P(Y=1 | Z, x) P(W|Z, x)⁻¹ P(W)    (5)

Here P(Y=1 | Z, x) is the row vector with entries P(Y=1 | Z=z, x), P(W|Z, x) is the matrix whose (w, z) entry is P(W=w | Z=z, x), and P(W) is the column vector with entries P(W=w).

This can be viewed as an adjusted version of the backdoor criterion, where we replace P(Z) with P(W|Z, x)⁻¹ P(W) to account for the fact that Z is not a simple confounder but a proxy of the hidden confounder. In fact, the same expression identifies the causal effect in simpler graphs, where either Z is not related to X or W is not related to Y, as long as both Z and W are associated with U and they confound X and Y only via U. We will refer to this identification as proxy adjustment.
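To see why the adjustment works, one can factor the observed quantities through the hidden confounder; the following is a compressed sketch of the argument in [5] and [6], with matrix notation of our own choosing:

```latex
% Let A_{zu} = P(u \mid z, x), \; B_{uw} = P(w \mid u), \; c_u = E[Y \mid x, u], \; \pi_u = P(u).
% Using Z \perp Y \mid (U, X) and W \perp (Z, X) \mid U:
P(Y{=}1 \mid Z, x) = A c, \qquad
P(W \mid Z, x) = (A B)^{\top}, \qquad
P(W) = B^{\top}\pi .
% Hence, when P(W \mid Z, x) is invertible,
P(Y{=}1 \mid Z, x)^{\top} \, P(W \mid Z, x)^{-1} \, P(W)
  = c^{\top} A^{\top} (A^{\top})^{-1} (B^{\top})^{-1} B^{\top} \pi
  = c^{\top} \pi
  = \sum_u E[Y \mid x, u] \, P(u),
```

which is exactly the backdoor expression over the unobserved U.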

The condition of invertibility of P(W|Z, x) implies that both W and Z are associated with the hidden confounder U and that they have the same number of levels. Since the condition involves observed variables, it can be verified empirically.

When all variables are binary, the expression simplifies to plain 2×2 matrix algebra:

P(Y=1 | do(x)) = [P(Y=1 | Z=0, x)  P(Y=1 | Z=1, x)] P(W|Z, x)⁻¹ [P(W=0)  P(W=1)]ᵀ    (6)

In the rest of this post, we will focus on the case of all variables in the graph being binary. This means that categorical variables with more than two levels, such as the booking window, will be reduced to two levels, namely short and long booking-window.
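To make the binary case concrete, here is a minimal plug-in sketch of the estimator in Python with NumPy. This is our own illustration, not the post's implementation: the function name and the estimation of each probability by sample frequency are assumptions.

```python
import numpy as np

def proxy_adjustment_binary(x, y, z, w):
    """Plug-in estimate of ACE = P(Y=1|do(X=1)) - P(Y=1|do(X=0)) from
    binary arrays, via the proxy formula P(Y=1|Z,x) P(W|Z,x)^-1 P(W)."""
    x, y, z, w = (np.asarray(a) for a in (x, y, z, w))
    p_w = np.array([np.mean(w == 0), np.mean(w == 1)])  # marginal P(W)
    p_do = {}
    for xv in (0, 1):
        m = x == xv
        # Matrix with rows indexed by w, columns by z: P(W=w | Z=z, X=xv)
        M = np.array([[np.mean(w[m & (z == zv)] == wv) for zv in (0, 1)]
                      for wv in (0, 1)])
        # Vector over z: P(Y=1 | Z=z, X=xv)
        p_y = np.array([np.mean(y[m & (z == zv)]) for zv in (0, 1)])
        p_do[xv] = float(p_y @ np.linalg.inv(M) @ p_w)
    return p_do[1] - p_do[0]
```

Before trusting the output, one should check that the estimated P(W|Z, x) matrices are well away from singular, in line with the invertibility condition above; a near-zero determinant makes the inverse, and hence the estimate, extremely noisy.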

Testing the proxy adjustment with simulated data

We first look at simulations to understand the estimation bias and variance of the proxy adjustment method, and compare it against other causal identification methods. For the simulations, we generated 1000 data points from the graph of Fig. 3 according to the following structural model:

Variables of the structural model

The ground-truth effect is computed by controlling for U via the backdoor criterion, expression (3).

We compare the results against no adjustment, namely:

P(Y=1 | X=1) − P(Y=1 | X=0)

and an “incorrect” backdoor adjustment that treats Z as a confounder rather than a proxy of the hidden confounder, that is:

Σ_z [P(Y=1 | X=1, z) − P(Y=1 | X=0, z)] P(z)

As seen in Fig. 4, the proxy adjustment estimate is an unbiased estimator of the causal effect, as the estimation bias is centred around zero. The other two methods are biased, as they do not account correctly for the hidden confounding.

Fig. 4. Estimation bias
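The comparison can be reproduced in miniature with a simulation. The binary structural model below is our own illustrative choice (the post's exact parameters are not shown), with a known ACE of 0.3 built in; the proxy adjustment itself is omitted here for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300_000

# Hidden intent U drives proxy Z, treatment X and outcome Y (cf. Fig. 3).
u = rng.binomial(1, 0.5, n)
z = rng.binomial(1, 0.2 + 0.6 * u)
x = rng.binomial(1, 0.3 + 0.4 * u)
y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * u)  # true ACE = 0.3 by construction

def stratified_diff(stratum):
    # Difference in stay rates between treated and untreated within a stratum.
    return y[stratum & (x == 1)].mean() - y[stratum & (x == 0)].mean()

# (a) No adjustment: P(Y=1|X=1) - P(Y=1|X=0).
naive = y[x == 1].mean() - y[x == 0].mean()

# (b) "Incorrect" backdoor adjustment on the proxy Z.
backdoor_z = sum(stratified_diff(z == v) * np.mean(z == v) for v in (0, 1))

# (c) Oracle backdoor adjustment on U itself, possible only in simulation.
backdoor_u = sum(stratified_diff(u == v) * np.mean(u == v) for v in (0, 1))

# (a) and (b) both overestimate the true effect; (c) recovers it.
print(round(naive, 2), round(backdoor_z, 2), round(backdoor_u, 2))
```

Adjusting on the proxy Z removes some, but not all, of the confounding bias, which is exactly why the full proxy-adjustment formula with both W and Z is needed.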

Sensitivity analysis

The success of the method in identifying an unbiased effect depends on two essential assumptions:

  1. P(W|Z, x) is invertible for every x. This implies that W and Z remain associated with each other within each level of X. Since the condition involves only observed variables, it can be verified empirically.
  2. The graph is correctly specified. Specifically, the proxy adjustment will result in a biased estimate of the effect if Z and/or W are only weakly associated with the hidden confounder (which assumption 1 guards against), or if either Z or W is itself a confounder of X and Y. The latter case is unverifiable from the observed data. (However, this assumption is arguably weaker than the ignorability assumption [7] often used in the literature, namely that there is no hidden confounder at all, which is equally unverifiable and often unreasonable.)

Sensitivity analysis can help us understand how robust the point estimates of the ACE are with respect to misspecification of the graph. We are particularly interested in the cases where (i) the proxy W is only weakly correlated with U, and (ii) Z or W are confounders of X and Y. All graphs in this section are obtained by using the structural model of the previous section and adjusting its parameters.

(i) Varying correlation of proxies (only for W)

In the extreme case where W is not correlated with U at all, the estimation bias is larger than in the case of weak correlation (Fig. 5), which in turn is larger than in the case of high correlation (Fig. 4). Similarly, as the correlation between W and U decreases, the precision of the proxy adjustment estimator decreases, as seen from the widening confidence intervals (Fig. 4 & 5). The other methods are also biased, because they don’t account for the direct confounding of U.

Fig. 5. Bias and variance when varying the strength of correlation of W and U

(ii) Z or W are confounders of X and Y

Fig. 6 shows that in all cases of misspecification of the graph, the proxy adjustment yields a biased estimate. Depending on the misspecification and the structural relationship between the variables, the other methods yield less biased estimates.

Fig. 6. Bias and variance when Z and/or W are confounders of X and Y

Empirical validation

Evaluating the method beyond simulations is difficult because, as explained earlier, we do not have access to the true average causal effect. But we do have bounds on the true effect, estimated from the encouragement experiment described earlier. So, we can at least check whether the proxy adjustment yields a point estimate within the ACE bounds we obtain from the experiment**.

Our observational dataset was the base group of the experiment, augmented with the two proxies: booking window and logged-in status of the visitor (Fig. 7, right). In both the experimental and the observational dataset, we kept only searches with booking windows longer than two weeks, to break as much as possible the association between W and X (as free cancellation is typically available and relevant for larger booking windows).

The figures below juxtapose the causal graphs used to estimate the bounds of the ACE in the experimental setting (left) and the ACE in the observational setting (right).

Fig. 7. Left, the causal graph of the A/B experiment. Right, the causal graph of the observational study.

From the experiment, we obtained that the relative effect of booking with free cancellation (in logarithmic scale) lies within [-1.8, 7.66], while the relative proxy adjustment estimate was 3.17, well within the true effect bounds. This does not necessarily validate the proxy adjustment method, but the fact that the estimates agree is certainly reassuring.

The upper bound of the treatment effect from the experimental data is quite high, as it assumes that none of the no-bookers would have cancelled their reservation had they booked with free cancellation. However, even under the more realistic scenario that only a small fraction of the no-bookers would have stayed, the upper bound still includes the proxy-adjustment estimate.

Conclusion

In summary, we showed how to leverage proxies of a hidden confounder for causal inference from observational data. This method can be very useful when we have no experimental data, for example when we are interested in the causal effect of customer behaviour on a business outcome. However, the method can introduce bias if the graph is misspecified, for example if one of the proxies is also a confounder of the treatment and the outcome. We encourage data scientists to be critical of the assumptions they make and, if possible, to combine multiple studies, such as an experiment that allows partial identification of the effect, to avoid major biases in the estimation of the causal effect.

Acknowledgements

Thanks to Onno Zoeter and Rafael Mourao for improving this post with their comments.

References

[1] Pearl, J. Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge University Press, 2009; Pearl, J., Glymour, M. and Jewell, N. P. Causal Inference in Statistics: A Primer. Wiley, 2016.

[2] Manski, C. Nonparametric bounds on treatment effects. American Economic Review, Papers and Proceedings, 80:319–323, 1990.

[3] Balke, A. and Pearl, J. Counterfactual probabilities: computational methods, bounds and applications. In Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence, UAI'94, pp. 46–54, 1994.

[4] Balke, A. and Pearl, J. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439):1172–1176, 1997.

[5] Miao, W., Geng, Z. and Tchetgen Tchetgen, E. Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika, 105(4):987–993, 2018.

[6] Kuroki, M. and Pearl, J. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.

[7] Rubin, D. B. Bayesian inference for causal effects: the role of randomization. The Annals of Statistics, 6(1):34–58, 1978.

Footnotes

* This is the formal way of saying that association is not causation.

** Note that for the computation of the ACE bounds from the experimental data, as well as the point estimate of the proxy adjustment, we so far assumed that we observe the true probabilities at the population level. In reality, since we are working with a sample of the population, we also need to account for the statistical uncertainty of all our estimates. In our analyses, we included confidence intervals and the conclusions remain the same, but we excluded the computations to keep this already lengthy post shorter.
