In the smoking study: **treatment group = smokers**, **control group = non-smokers**.

In a vaccine study: **treatment group = patients who are vaccinated**, **control group = non-vaccinated**.

Ideally, the only difference between the **treatment group** and the **control group** is whether or not they receive the **treatment**.

- Wealthy families whose children were more vulnerable to polio were also more likely to volunteer for vaccination.

If **treatment** is assigned based on whoever volunteers, this could bias the experiment against the vaccine, i.e. its apparent effectiveness is diminished. This means there will be differences between the **treatment** and **control** groups other than just the vaccine.

New study: both treatment and control groups are composed of children whose parents consented to vaccination; each child is assigned to one of the two groups with a 50-50 chance.
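The 50-50 assignment can be sketched in a few lines. This is a minimal illustration, not the procedure used in the actual trial: the roster `consented_ids`, its size, and the fixed seed are all assumptions made for the example.

```python
import numpy as np

# Fixed seed only so that the sketch is reproducible.
rng = np.random.default_rng(seed=0)

# Hypothetical roster: IDs of children whose parents consented.
consented_ids = np.arange(1000)

# One independent fair coin flip per child: True -> vaccine, False -> placebo.
gets_vaccine = rng.random(consented_ids.size) < 0.5

treatment_ids = consented_ids[gets_vaccine]
control_ids = consented_ids[~gets_vaccine]

print(len(treatment_ids), len(control_ids))
```

Because the coin flip ignores everything about the child, traits such as family wealth should be balanced between the two groups on average.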

- Subjects in the **control** group should be given a "treatment" with no effect. That is, they should ideally be *blinded*.
- Why? So the response is not due to the idea of a vaccine, but the vaccine itself.
- In the vaccination example, children were given an injection of salt and water.
- This treatment is called a **placebo**.

- If the doctors know who receives treatment and who receives placebo, they may also bias the results by their reporting.
- For example, polio diagnosis is not perfect: a doctor with an interest in the success of the vaccine may declare a treated child with mild polio to be healthy, or an untreated (placebo) child who is close to healthy to have mild polio.
- This bias may be conscious or unconscious on the doctors’ part.

- Unlike in randomized controlled experiments, in *observational studies* the subjects are assigned to **treatment** or **control** by an *uncontrolled* mechanism.
- In a smoking / lung cancer study, subjects choose to smoke or not.
- Very often the **treatment** and **control** groups differ by more than just the treatment.

It is generally impossible to rule out all possible confounding variables.

Therefore, establishing a causal link between two observed variables, e.g. smoking and lung cancer, can be difficult.
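A small simulation can illustrate why. In the sketch below, a hidden trait `z` drives both the choice of treatment and the outcome, while the treatment itself has no effect at all. All probabilities and names are invented for the illustration and are not taken from any real study.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 200_000

# Hypothetical hidden trait Z that nobody recorded.
z = rng.random(n) < 0.5

# Subjects choose "treatment" themselves; the choice depends on Z.
treated = rng.random(n) < np.where(z, 0.8, 0.2)

# The outcome depends on Z only: the treatment has no effect by construction.
outcome = rng.random(n) < np.where(z, 0.6, 0.1)

# Naive comparison of outcome rates, ignoring Z.
naive_gap = outcome[treated].mean() - outcome[~treated].mean()

# The same comparison within each level of Z.
gap_z1 = outcome[treated & z].mean() - outcome[~treated & z].mean()
gap_z0 = outcome[treated & ~z].mean() - outcome[~treated & ~z].mean()

print(f"naive gap: {naive_gap:.3f}")
print(f"gap within Z=1: {gap_z1:.3f}, within Z=0: {gap_z0:.3f}")
```

The naive gap is large even though the treatment does nothing, and it shrinks to roughly zero once subjects are compared within the same level of `z`. In a real observational study the difficulty is that `z` may never have been measured.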

Fisher, one of the greatest statisticians, believed there was a confounding variable in the case of lung cancer and cigarettes.

Modern medicine would say he was wrong...

Female admission rates are almost as good as or better than male rates in each of these majors, but the overall female rate is lower.

**Is this evidence of bias against female applicants?**

No, but it is confusing. It is a phenomenon known as Simpson's paradox.

In [1]:

```
%%capture
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from code.probability import Multinomial
table = np.zeros((6,2,2))
table[0,0] = [825*.62,825*.38]
table[0,1] = [108*.82,108*.18]
table[1,0] = [560*.63,560*.37]
table[1,1] = [25*.68,25*.32]
table[2,0] = [325*.37,325*.63]
table[2,1] = [593*.34,593*.66]
table[3,0]= [417*.33,417*.67]
table[3,1]= [375*.35,375*.65]
table[4,0] = [191*.28,191*.72]
table[4,1] = [393*.24,393*.76]
table[5,0] = [373*.06,373*.94]
table[5,1] = [341*.07,341*.93]
UCB = Multinomial(table,
labels=[['A','B','C','D','E','F'],
['Male','Female'],
['Accept', 'Deny']])
UCB_female = UCB.condition_margin(1, 'Female')
UCB_male = UCB.condition_margin(1, 'Male')
UCB_male.sample(10000)
male = np.squeeze(UCB_male.prob)
UCB_female.sample(1000)
female = np.squeeze(UCB_female.prob)
male_major = np.sum(male, 1)
female_major = np.sum(female, 1)
major_fig = plt.figure(figsize=(10,10))
major_ax = major_fig.gca()
major_ax.bar(range(female_major.shape[0]), 100 * female_major, alpha=0.5, facecolor='blue', label='Female')
major_ax.bar(range(male_major.shape[0]), 100 * male_major, alpha=0.5, facecolor='yellow', label='Male')
major_ax.legend()
major_ax.set_xticks(range(6))  # pin one tick per bar so the labels line up
major_ax.set_xticklabels(['A','B','C','D','E','F'])
major_ax.set_xlabel('Major', fontsize=20)
major_ax.set_ylabel('Percentage', fontsize=20)
major_ax.set_title('Breakdown of Major by Gender', fontsize=20)
overall_rate = table.sum(1)
accept_rate = overall_rate[:,0] / overall_rate.sum(1)
accept_fig = plt.figure(figsize=(10,10))
accept_ax = accept_fig.gca()
accept_ax.bar(range(accept_rate.shape[0]), 100 * accept_rate, alpha=0.5, facecolor='red', label='Acceptance')
accept_ax.set_xticks(range(6))  # pin one tick per bar so the labels line up
accept_ax.set_xticklabels(['A','B','C','D','E','F'])
accept_ax.set_xlabel('Major', fontsize=20)
accept_ax.set_ylabel('Percentage', fontsize=20)
accept_ax.set_title('Acceptance Rate by Major', fontsize=20)
```

In [2]:

```
major_fig
```

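Out[2]: [Figure: bar chart "Breakdown of Major by Gender"; x-axis: Major (A-F), y-axis: Percentage; bars for Female and Male]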

In [3]:

```
accept_fig
```

Out[3]: [Figure: bar chart "Acceptance Rate by Major"; x-axis: Major (A-F), y-axis: Percentage]

$$ \begin{aligned} \text{Female} &= \frac{0.82 \times 108 + 0.68 \times 25 + 0.34 \times 593}{1835} \\ & \qquad + \ \frac{0.35 \times 375 + 0.24 \times 393 + 0.07 \times 341}{1835} \\ &\approx 30 \% \end{aligned}$$
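The overall (unadjusted) rate for each gender can be checked directly from the per-major applicant counts and admission rates used to build `table` above. This is a standalone sketch in plain NumPy; the array names are chosen here for illustration and are not part of the course's `code.probability` module.

```python
import numpy as np

majors = ["A", "B", "C", "D", "E", "F"]

# Applicants per major and per-major admission rates, as entered in `table`.
male_applicants = np.array([825, 560, 325, 417, 191, 373])
female_applicants = np.array([108, 25, 593, 375, 393, 341])
male_rate = np.array([0.62, 0.63, 0.37, 0.33, 0.28, 0.06])
female_rate = np.array([0.82, 0.68, 0.34, 0.35, 0.24, 0.07])

# Overall rate = total admitted / total applicants, separately for each gender.
male_overall = (male_rate * male_applicants).sum() / male_applicants.sum()
female_overall = (female_rate * female_applicants).sum() / female_applicants.sum()

print(f"Male overall:   {100 * male_overall:.1f}%")
print(f"Female overall: {100 * female_overall:.1f}%")
```

Each overall rate is an average of the per-major rates weighted by that gender's own applicant counts, so it depends heavily on which majors each gender tends to apply to.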

- In this example, *Major* was a *confounder* for the relationship between *Gender* and *Admission status*.

A clearer picture can be found by computing, for each gender, a weighted average of the per-major admission rates.

Both averages use the same weights: the total number of people (male and female) applying to each major.

$$\begin{aligned} \text{Male} &= \frac{0.62 \times 933 + 0.63 \times 585 + 0.37 \times 918}{4526} \\ & \qquad + \ \frac{0.33 \times 792 + 0.28 \times 584 + 0.06 \times 714}{4526} \\ &= 39 \% \\ \end{aligned} $$

$$ \begin{aligned} \text{Female} &= \frac{0.82 \times 933 + 0.68 \times 585 + 0.34 \times 918}{4526} \\ & \qquad + \ \frac{0.35 \times 792 + 0.24 \times 584 + 0.07 \times 714}{4526} \\ &= 43 \% \end{aligned}$$
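The two weighted averages above can be reproduced with `np.average`. As before, this is a standalone sketch with illustrative array names; the counts and rates are the ones entered in `table`.

```python
import numpy as np

male_applicants = np.array([825, 560, 325, 417, 191, 373])
female_applicants = np.array([108, 25, 593, 375, 393, 341])
male_rate = np.array([0.62, 0.63, 0.37, 0.33, 0.28, 0.06])
female_rate = np.array([0.82, 0.68, 0.34, 0.35, 0.24, 0.07])

# Common weights: total applicants per major, regardless of gender.
total_applicants = male_applicants + female_applicants

# Weighted average of the per-major rates, same weights for both genders.
male_adjusted = np.average(male_rate, weights=total_applicants)
female_adjusted = np.average(female_rate, weights=total_applicants)

print(f"Male adjusted:   {100 * male_adjusted:.0f}%")
print(f"Female adjusted: {100 * female_adjusted:.0f}%")
```

Using one set of weights for both genders removes the effect of men and women applying to different majors, which is why the comparison reverses relative to the unadjusted rates.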