Encouraged to comply

Nils Skotara
Booking.com Data Science
14 min read · Aug 4, 2022


Improving bounds with Instrumental Variables

Authors: Marko Bokulić, Polina Popova, Nils Skotara, Xiaowei Zhang (in alphabetical order)

Introduction

Correlation is not causation. This popular maxim is well known and widely repeated. Its implications, however, are less well understood, and they can be severe. AI algorithms have been flagged as racially biased, or they fail spectacularly when the context changes. This is because current models learn mere associations with no notion of causal relationships (Sgaier et al, 2020). Techniques of causal inference have only recently gained broader attention, and many valuable insights into why current models fail have been obtained within the last few years (Pearl & Mackenzie, 2017). A well-known method for studying the causal effects of interventions is the randomized controlled experiment. Conducting an experiment, though, is not always possible, and often causal inference on observational data is the only viable option. Methods commonly applied to achieve this (e.g. inverse propensity weighting) rely on assumptions which are known to be overly optimistic and are rarely justifiable. Chiefly, these methods assume that no relevant variables are missing from the analysis. Particularly in a business context, accepting results and basing important financial decisions on the hope that missing variables have no influence seems overly optimistic at best and extremely risky at worst. Businesses often lack important information, for example their customers’ intent, which likely influences the outcome of interest, so using these methods could lead to wrong and costly decisions.

This article therefore provides a solution that uses prior knowledge instead of unjustifiable assumptions. For example: instead of pretending that the socio-demographics of a website’s visitors have no effect on their conversion, it is possible to use a known fact, such as “Our conversion rate has never been higher than a certain number”, to justify estimates in the analysis.

The technique enabling this is the so-called encouragement design: a variant of the method of Instrumental Variables. Examples of an encouragement are suggestions or discounts: the subject has an incentive to take the treatment (for example, “buy the upgrade now and get a discount”) but they can refuse. Since not everyone who is encouraged takes the treatment, the Average Treatment Effect (ATE) cannot be precisely estimated and can only be given within some bounds (a range of possible values). This article explains how to obtain those bounds and how to make them narrow enough, by using prior knowledge about the business context, to yield valuable business insights. The article proceeds as follows: it first introduces the concepts of instrumental variables and the ATE and explains under which conditions bounds can be given instead of a point-estimate (i.e., a single value). After that, the logic behind these bounds is explained. Finally, it is shown how to make these bounds tighter, and therefore more useful for decision making, using prior knowledge.

The Instrumental Variable Approach

Booking.com is a platform that connects travellers with memorable experiences, a range of transportation options, and incredible places to stay — from homes to hotels and much more. Booking.com constantly utilises AB testing to make data-driven decisions. The following business case serves as an example:

Property partners sometimes leave our platform. Thus, dedicated Account Managers (AMs) contact those partners in order to try to reactivate them. The impact of their work can be measured in an AB test with the following setup:

  • Treatment X
    X = 1: Partner is contacted by AM
    X = 0: Partner is not contacted by AM
  • Outcome Y
    Y = 1: Partner converts (reactivates account)
    Y = 0: Partner does not convert

The Average Treatment Effect is defined as the average difference in the probability that a partner converts, P(Y = 1) had the partner been treated (X = 1) vs had the partner not been treated (X = 0):
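
ATE = P(Y = 1 | do(X = 1)) − P(Y = 1 | do(X = 0))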

It equals the difference between the mean of Y in the treatment group and the mean of Y in the non-treatment group if the treatment was randomized, because all confounding variables are in expectation the same in both groups. Although this AB test setup appears to be a proper randomized controlled experiment, it misses a crucial component. While hotels can be randomly assigned to treatment, the AMs can still deviate from this assignment: they might not call a partner in the treatment group, for example because they believe this partner is unlikely to reactivate, or vice versa. Thus, this setup (see figure 1) requires a different approach with the following variables:

  • Instrument Z
    Z = 1: Partner assigned for contact
    Z = 0: Partner not assigned for contact
  • Treatment X
    X = 1: Partner is contacted by an AM
    X = 0: Partner is not contacted by an AM
  • Outcome Y
    Y = 1: Partner converts
    Y = 0: Partner does not convert

This so-called encouragement design differentiates between the randomized instrument Z and the treatment X. It relies on three conditions (Pearl & Mackenzie, 2017):

  1. Z and X are not independent.
  2. Z does not directly affect Y except through its effect on X.
  3. Z and Y do not share common causes.

The first condition can be tested with data. It says that assigning a partner to be contacted by an AM increases the chance that they actually will be contacted. The second is known as the exclusion restriction: a statistical assumption that can be refuted but not verified. The third holds by design if the instrument was randomly assigned, since randomization rules out common causes of Z and Y.
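
To illustrate how the first condition can be checked, here is a minimal sketch with hypothetical assignment and contact logs (the arrays below are made up for illustration; in practice z and x would come from the experiment):

```python
import numpy as np

# Hypothetical experiment logs: was the partner assigned for contact (z),
# and was the partner actually contacted (x)?
z = np.array([1, 1, 1, 1, 0, 0, 0, 0])
x = np.array([1, 1, 1, 0, 0, 0, 0, 1])

# Condition 1 holds if assignment visibly moves the contact rate.
print(x[z == 1].mean(), x[z == 0].mean())  # here: 0.75 vs 0.25
```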

Figure 1. Necessary conditions for the IV setup illustrated with an example. The instrument Z influences the treatment X but not the outcome Y directly, and the outcome and instrument do not share common causes.

Crucially, these three assumptions are not sufficient to obtain a point-estimate for the ATE. Getting a point estimate requires additional assumptions described in the Addendum. These assumptions are hard to justify. But even without them it is still possible to derive bounds for the ATE.

Compliance Types

Since subjects can choose if they take the treatment they were assigned to (i.e. call the partner that they were supposed to, or not), they can be classified into four compliance types:

  1. never-takers — don’t take the treatment, assigned to it or not
  2. compliers — take the treatment if and only if assigned to it
  3. defiers — take the treatment if and only if not assigned to it
  4. always-takers — take the treatment, assigned to it or not

In practice, it is impossible to know the compliance type of any given subject (subject in our example refers to an AM-partner pair), since the behaviour can only be observed in one of the two possible cases, assigned or not assigned. Similarly, the outcome under treatment and the outcome under non-treatment, the so-called potential outcomes, can only be partially observed. The hypothetical knowledge of all possibilities is illustrated in table 1.

Table 1. Hypothetical case of complete knowledge about the factual and counterfactual cases for all subjects. Here for every subject, it is known how they behaved under assignment (X | Z = 1) and under non-assignment (X | Z = 0), as well as their potential outcomes after treatment (Y | X = 1) and under non-treatment (Y | X = 0). Complete knowledge of both potential outcomes enables the identification of the ATE. White cells illustrate what is actually observed and known, whereas grey cells (including the ATE) represent the information that is unobserved and unknown and cannot be obtained with these data.

Actual data reveals a much less complete picture. Counterfactual outcomes and compliance types are unobserved and unknown (grey cells in table 1). The fact that only one of the two potential outcomes is observable is known as the fundamental problem of causal inference (Rubin, 1974; Holland, 1986). Thus, individual treatment effects cannot be observed, and in the Instrumental Variable approach the ATE cannot be identified due to the unknown compliance type of any given subject. Only if all subjects were compliers would the encouragement-design IV setup be identical to a randomized controlled experiment and the ATE be identifiable. If that is not the case, the observed data can still provide bounds for the ATE that depend on the distribution of the compliance types.

Bounds on the Average Treatment Effect

Prior to the data collection, the ATE can potentially take on any value between -1 (if all partners convert unless contacted by AMs) and 1 (if no partner converts unless contacted by AMs). With data, the width of this interval can be reduced, from a width of 2 to a width of 1. This is because only one of the potential values can be observed: each subject was either treated or not. The unobserved one can be imputed, in two different and opposing ways for the two bounds:

  1. Lower bound: assume that for the unobserved cases the treatment would have had a negative impact
  2. Upper bound: assume that for the unobserved cases the treatment would have had a positive impact
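
A brief sketch of why the width shrinks to exactly 1 (ignoring the instrument for the moment): the conversion rate under treatment splits into an observed and an unobserved part, P(Y = 1 | do(X = 1)) = P(Y = 1, X = 1) + an unobserved term between 0 and P(X = 0), and analogously P(Y = 1 | do(X = 0)) = P(Y = 1, X = 0) + a term between 0 and P(X = 1). The pessimistic imputation gives the lower bound P(Y = 1, X = 1) − P(Y = 1, X = 0) − P(X = 1), the optimistic one gives the upper bound P(Y = 1, X = 1) + P(X = 0) − P(Y = 1, X = 0), and the difference between the two is P(X = 0) + P(X = 1) = 1.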

Moreover, these bounds can be further tightened by introducing four possible treatment-response types:

  1. never-converters — do not convert, whether treated or not
  2. convert-compliers — convert if and only if treated
  3. convert-defiers — convert if and only if not treated
  4. always-converters — convert, whether treated or not

Combining those with the four compliance groups forms 16 mutually exclusive groups with relative frequencies qij (see table 2).

Table 2. Based on the relationships between Z and X (4 compliance types in the rows) and X and Y (4 treatment-response types in the columns), subjects can be classified into 16 disjoint sets.

For each subject, the observed data narrows down the possible compliance and treatment-response types from four to two. For example, given that for 20% of subjects the observed values were Z = 1, X = 0, and Y = 0, the following can be deduced:

  1. Z = 1 with X = 0 can only occur for never-takers or defiers.
  2. X = 0 with Y = 0 can only occur for never-converters or convert-compliers.

Thus, q₀₀ + q₀₁ + q₂₀ + q₂₁ = 0.2.

Similarly, eight such equations can be derived, one for each observed combination of Z, X, and Y. Using the notation pij.k := P(Y = i, X = j | Z = k), and applying the same logic as in the example above, they read:

p₀₀.₀ = q₀₀ + q₀₁ + q₁₀ + q₁₁
p₁₀.₀ = q₀₂ + q₀₃ + q₁₂ + q₁₃
p₀₁.₀ = q₂₀ + q₂₂ + q₃₀ + q₃₂
p₁₁.₀ = q₂₁ + q₂₃ + q₃₁ + q₃₃
p₀₀.₁ = q₀₀ + q₀₁ + q₂₀ + q₂₁
p₁₀.₁ = q₀₂ + q₀₃ + q₂₂ + q₂₃
p₀₁.₁ = q₁₀ + q₁₂ + q₃₀ + q₃₂
p₁₁.₁ = q₁₁ + q₁₃ + q₃₁ + q₃₃

The ATE can be expressed in terms of the qij as well, since

  • If everyone was treated, exactly the convert-compliers and always-converters would convert: P(Y = 1 | do(X = 1)) = Σᵢ (qᵢ₁ + qᵢ₃)
  • If no one was treated, exactly the convert-defiers and always-converters would convert: P(Y = 1 | do(X = 0)) = Σᵢ (qᵢ₂ + qᵢ₃)

(with the sums running over the four compliance types i). Thus

ATE = Σᵢ (qᵢ₁ − qᵢ₂)

Bounds for the ATE can be obtained using linear programming, with the ATE as the objective function and the eight equations above as constraints; the resulting symbolic bounds are derived in Balke & Pearl (1997).
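
As a concrete illustration of this optimisation step, here is a minimal Python sketch using scipy.optimize.linprog. The helper functions and the example probabilities are ours (the probabilities are made up for illustration and are not the article’s data); the sketch simply minimises and maximises the ATE subject to the eight cell equations:

```python
import numpy as np
from scipy.optimize import linprog

# Compliance rows i:  0 = never-taker, 1 = complier, 2 = defier, 3 = always-taker
# Response columns j: 0 = never-converter, 1 = convert-complier,
#                     2 = convert-defier,  3 = always-converter
# q is the flattened vector of the 16 group frequencies, q[4*i + j] = q_ij.

def x_of(i, z):
    """Treatment taken by compliance type i under assignment z."""
    return (0, z, 1 - z, 1)[i]

def y_of(j, x):
    """Outcome of response type j under treatment x."""
    return (0, x, 1 - x, 1)[j]

def ate_bounds(p, A_ub=None, b_ub=None):
    """Bounds on the ATE given p[(y, x, z)] = P(Y=y, X=x | Z=z),
    optionally with extra linear constraints A_ub @ q <= b_ub."""
    # Objective: ATE = sum_i (q_i1 - q_i2)
    c = np.zeros(16)
    for i in range(4):
        c[4 * i + 1], c[4 * i + 2] = 1.0, -1.0

    # The eight observed-cell equations as equality constraints.
    A_eq, b_eq = [], []
    for z in (0, 1):
        for x in (0, 1):
            for y in (0, 1):
                row = np.zeros(16)
                for i in range(4):
                    for j in range(4):
                        if x_of(i, z) == x and y_of(j, x) == y:
                            row[4 * i + j] = 1.0
                A_eq.append(row)
                b_eq.append(p[(y, x, z)])

    kw = dict(A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 16)
    lower = linprog(c, **kw).fun    # minimise the ATE
    upper = -linprog(-c, **kw).fun  # maximise the ATE
    return lower, upper

# Hypothetical observed cell probabilities P(Y=y, X=x | Z=z); each Z-slice sums to 1.
p = {(0, 0, 0): 0.72, (0, 1, 0): 0.18, (1, 0, 0): 0.03, (1, 1, 0): 0.07,
     (0, 0, 1): 0.22, (0, 1, 1): 0.43, (1, 0, 1): 0.03, (1, 1, 1): 0.32}
print(ate_bounds(p))  # Balke-Pearl bounds, no additional constraints
```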

Sampling variation in the estimates of these bounds (Imbens & Manski, 2004) is not taken into account here. It is interesting to note that the width of the bounds is equal to the rate of non-compliance (under monotonicity, i.e. without defiers; Balke & Pearl, 1997, p. 5). The bounds above act as if nothing were known about the 16 groups. This, however, is not necessarily the case: prior knowledge can often be used to constrain their behaviour. The next section shows how to tighten the bounds using this approach.

Tightening the Balke-Pearl Bounds

In many situations, additional information beyond the eight elementary equations is available and can be represented as constraints in terms of the 16 groups to further tighten the bounds. Using linear programming in Python, new bounds have been generated under different constraints and compared with the Balke-Pearl bounds. An overview of these constraints and their impact is laid out in table 3. Particularly useful are constraints that raise the lower bound, since businesses are usually after positive treatment effects.

Table 3. Constraints that can be applied on top of the three IV core conditions. Feeding these constraints into the linear program (expressed in the language of the qij) yields symbolic bounds. Monte-Carlo simulations were used to test whether or not the resulting bounds tighten the Balke-Pearl bounds.
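
To make this concrete, here is how one such constraint could be fed into the linear-programming sketch above. The example uses constraint 2a, a cap on the conversion rate had no one been treated; the cap value is purely illustrative:

```python
# Constraint 2a (sketch): P(Y = 1 | do(X = 0)) = sum_i (q_i2 + q_i3) <= y0_max,
# where y0_max is a ceiling taken e.g. from historical data; 0.12 is illustrative.
y0_max = 0.12

row = np.zeros(16)
for i in range(4):
    row[4 * i + 2] = 1.0  # convert-defiers
    row[4 * i + 3] = 1.0  # always-converters

print(ate_bounds(p, A_ub=[row], b_ub=[y0_max]))  # tightened bounds
```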

These constraints should be applied only after a well-founded assessment of the specific business case. To illustrate this, here is a numeric example using fabricated data generated for the partner reactivation scenario, which was used throughout the article. In this example, half of all partners were assigned to be called by an Account Manager (AM), and half were not: a randomly assigned instrument of the encouragement type. The data was simulated using these parameters:

  • Compliance: for half of the partners, the AMs behave like compliers, i.e., they call the partners assigned for calls and do not call those who are not assigned. For the other half, they act in equal parts as never-takers or always-takers, but never as defiers (i.e., calling a partner only if they are not scheduled for a call)
  • True Effect: calling a partner is beneficial; it increases the chance of the partner reactivating by 20 percentage points. The effect is not the same for all compliance types (it is not homogeneous), rather it is 20 percentage points on average.
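
For readers who want to set up a similar simulation, here is a minimal sketch. The compliance mix follows the description above, but the per-type baseline and effect parameters are hypothetical (the article does not state them), chosen only so that the average effect of a call is +0.20; the resulting reactivation rates will therefore differ somewhat from table 4:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Compliance mix: 50% compliers, 25% never-takers, 25% always-takers, no defiers.
ctype = rng.choice(["complier", "never-taker", "always-taker"],
                   size=n, p=[0.50, 0.25, 0.25])

z = rng.integers(0, 2, size=n)  # randomly assigned instrument (encouragement)
x = np.where(ctype == "complier", z, (ctype == "always-taker").astype(int))

# Hypothetical per-type baselines and call effects, averaging to a +0.20 ATE:
# 0.5 * 0.25 + 0.25 * 0.10 + 0.25 * 0.20 = 0.20
base = {"complier": 0.10, "never-taker": 0.05, "always-taker": 0.15}
lift = {"complier": 0.25, "never-taker": 0.10, "always-taker": 0.20}
p_convert = (np.array([base[c] for c in ctype])
             + x * np.array([lift[c] for c in ctype]))
y = rng.binomial(1, p_convert)  # observed reactivation outcome
```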

These parameters would, of course, not be known to the researcher. The observable data is shown in table 4:

  • Partners that are supposed to be called are (unsurprisingly) more likely to be called (75% vs 25%), but many that were supposed to be called were not called (25%) and vice versa for those not supposed to be called (25%).
  • Partners in the assignment group were more likely to reactivate (35% vs 10%), which amounts to a +25 percentage-point effect of the encouragement (as opposed to the effect of the treatment). However, calculating this effect is not very useful in most cases. The business wants to know the effect of the calls themselves, not of the encouragement, so it can choose between different strategies, such as giving bonuses to AMs for calling partners, or hiring more AMs.

Table 4. Hypothetical numerical example of an encouragement design experiment, where AMs call properties for reactivation. The table shows the observable experimental data, generated from the simulated parameters.

Four sets of bounds were calculated on the simulated data: the original Balke-Pearl bounds and three additional ones. The Balke-Pearl bounds tend to be wide and therefore uninformative. They led to a lower bound of -0.1 and an upper bound of 0.4, providing no information about the sign of the effect. The three additional bounds were calculated with the following constraints, respectively:

  • 1c. A minimum amount of never-takers and never-converters. Those are partners who closed permanently and thus will never reactivate.
  • 2a. Historical data provides a baseline for the reactivation rate of partners. A conservative upper limit for reactivation, if no partners were contacted, could be set 2 or 3 standard deviations above this estimate.
  • 3c. AMs know the partners well. They intentionally deviate from the assigned treatment to achieve better reactivation rates. This strong assumption can be justified by a stable long-term relationship between AMs and partners. A similar approach in the domain of law enforcement is described in Siddique (2013).

The impact on bounds using each of those additional constraints is demonstrated in figure 2. Compared with the Balke-Pearl bounds, constraints 2a and 3c raise the lower bound above zero, providing evidence for a positive treatment effect.

Figure 2. ATE bounds under assumptions 1c, 2a, and 3c compared to the Balke-Pearl bounds. All bounds were calculated on a dataset generated conditional on the assumptions.

In general, these constraints, as well as others from table 3, can be justified by prior research or by knowledge about the research context. For some this is easier than for others: 2a, for example, limits the conversion rate of the untreated population, and the minimum number of never-takers and never-converters (constraint group 1c) could be argued for by pointing to the fact that a certain percentage of website visitors are bots. Another common case is one-sided compliance, where the treatment is only accessible to subjects assigned to it (for example a personalised discount); in this situation, no one can be an always-taker (constraint 1b). Other constraints are harder to justify: 3b assumes no negative impact (i.e., the intervention is known to be neutral or positive), and 3c assumes that the subjects can, to some degree, foresee the outcome of their actions. In any case, these constraints can and should be justified based on evidence, and supported by continuously collecting more data, which provides increasing confidence.

Summary and Conclusions

Estimating the Average Treatment Effect is crucially important for a wide variety of Data Science scenarios, from informing business decisions to improving existing Artificial Intelligence systems, and it has to be done in a way that avoids bias. One method of doing so is the method of Instrumental Variables. Its three core assumptions, however, are not sufficient to identify the ATE. A hypothetical stratification of the sample into 16 different compliance and treatment-response types provides bounds around the ATE. By using additional constraints based on prior knowledge, these bounds can be tightened and applied to particular business cases.

Prior knowledge can be justified or tested, so researchers can engage in meaningful discussions about its plausibility and validity. This is the main advantage of the method introduced here over many other methods of causal inference, such as matching, weighting or doubly robust estimators. Those methods rely on untestable and often implausible assumptions such as the “no unmeasured confounding” assumption, which (in simple words) states that nothing unmeasured influenced the results. Rather than basing impactful business decisions on such wishful thinking, it can be better to accept a lower level of precision and obtain informed bounds instead of point-estimates, bounds that are still narrow enough to meaningfully inform decisions. Wrong assumptions can lead to serious bias, whereas bounds based on prior knowledge mitigate this issue.

Addendum

Assumptions that allow for a point-estimate

There are two main additional conditions that allow for identification in the case of Instrumental Variables, either of which allows for a point-estimate (Hernán & Robins, 2020):

  1. Monotonicity: There are no defiers (see section “Compliance Types”). Assuming monotonicity, however, only allows for a point-estimate for compliers (see the worked example after this list). To extend this to everyone, it is necessary to additionally assume that non-compliers (i.e., always-takers and never-takers) experience the same effect as compliers.
  2. Homogeneity: Assumes the ATE is the same for encouraged and non-encouraged subjects, both treated and untreated (this is one version of homogeneity, named additive homogeneity; see Hernán & Robins, 2020, p. 198). More precisely, for x = 0, 1:

E[Y₁ − Y₀ | Z = 1, X = x] = E[Y₁ − Y₀ | Z = 0, X = x]

where X stands for the treatment, Z for the assigned instrument, and Y₁, Y₀ for the potential outcomes under treatment and non-treatment.
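
As a concrete illustration of the monotonicity route (a standard result, see e.g. Angrist, Imbens & Rubin, 1996, rather than a statement about the article’s simulation): under monotonicity, the effect for compliers is identified by the Wald ratio

(P(Y = 1 | Z = 1) − P(Y = 1 | Z = 0)) / (P(X = 1 | Z = 1) − P(X = 1 | Z = 0))

With the observed rates from table 4 this ratio is (0.35 − 0.10) / (0.75 − 0.25) = 0.5, an estimate for compliers only, which shows how far the complier effect can lie from the population ATE (+0.20 in this simulation) when effects are heterogeneous.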

Both of these assumptions are hard to justify, since they assume knowledge about the treatment effects on those that did not take the treatment, i.e., unavailable information.

Literature

Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996), “Identification of Causal Effects Using Instrumental Variables”, Journal of the American Statistical Association, 91: 444–455.

Baiocchi, M., Cheng, J., & Small, D. S. (2014), “Instrumental Variable Methods for Causal Inference”, Statistics in Medicine, 33, 2297–2340.

Balke, A., & Pearl, J. (1994), “Counterfactual Probabilities: Computational Methods, Bounds and Applications”, Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, Morgan Kauffman: 46–54.

Balke, A., & Pearl, J. (1997), “Bounds on Treatment Effects From Studies With Imperfect Compliance”, Journal of the American Statistical Association, 92: 1171–1176.

Hernán, M. A., & Robins, J. M. (2020), “Causal Inference: What If”, Boca Raton: Chapman & Hall/CRC.

Holland, P. W. (1986), “Statistics and Causal Inference”, Journal of the American Statistical Association, 81 (396): 945–60.

Imbens, G. W., & Manski, C. F. (2004), “Confidence Intervals for Partially Identified Parameters”, Econometrica, 72 (6): 1845–1857.

Pearl, J., & Mackenzie, D. (2017), “The Book of Why”.

Rubin, D. B. (1974), “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies”, Journal of Educational Psychology, 66 (5): 688–701.

Siddique, Z. (2013), “Partially Identified Treatment Effects Under Imperfect Compliance: The Case of Domestic Violence”, Journal of the American Statistical Association, 108: 504–513.

Sgaier, S. K., Huang, V., & Charles, G. (2020), “The Case for Causal AI”, Stanford Social Innovation Review, Summer 2020, 18 (3): 50–55.

Swanson, S. A., Robins, J. M., Miller, M., & Hernán, M. A. (2018), “Partial Identification of the Average Treatment Effect Using Instrumental Variables: Review of Methods for Binary Instruments, Treatments, and Outcomes”, Journal of the American Statistical Association, 113 (522): 933–94.
