## Experiments (Chapter 1)

• Does smoking cause lung cancer?
• How can this be established – by comparing smokers to non-smokers?
• Is a vaccine effective against some infectious disease?

### Treatment and control groups

• In the smoking study: treatment group = smokers, control group = non-smokers.

• In a vaccine study: treatment group = patients who are vaccinated, control group = non-vaccinated.

• Ideally, the only difference between the treatment group and the control group is whether or not they receive the treatment.

### Randomized controlled experiments

• The best way to establish that smoking causes lung cancer is a randomized controlled experiment.
• In such a study, patients would be assigned randomly to a smoking or a non-smoking group.
• These experiments are not always possible: we can't force people to smoke.
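Random assignment itself is simple to carry out. The following is a minimal Python sketch, not part of the original notes: the function name `randomize`, the seed, and the subject identifiers are all hypothetical and chosen only for illustration.

```python
import random

def randomize(subjects, seed=0):
    """Randomly split subjects into two groups of (nearly) equal size."""
    rng = random.Random(seed)      # fixed seed only so the example is reproducible
    shuffled = list(subjects)      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

subject_ids = list(range(20))      # hypothetical subject identifiers
treatment, control = randomize(subject_ids)
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because chance alone decides who lands in which group, no characteristic of the subjects can systematically favor one group over the other.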

## 1950s NFIP study: polio vaccine example

| Group | Size | Rate (per 100,000) |
|---|---|---|
| Treatment (Grade 2, consent) | 225,000 | 25 |
| Control (Grades 1 and 3) | 725,000 | 54 |
| No consent (Grade 2) | 125,000 | 44 |

### Why randomize?

• Children from wealthier families were more vulnerable to polio, and their parents were also more likely to volunteer them for vaccination.
• If treatment is assigned based on whoever volunteers, this could bias the experiment against the vaccine, i.e. its apparent effectiveness is diminished.

• This means there will be differences between treatment and control groups other than just the vaccine.

• New study: both treatment and control groups are composed of children whose parents consented to the vaccine; each child is assigned to one of the two groups with a 50-50 chance.
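The direction of the volunteer bias can be illustrated with a small calculation. This is a sketch under loudly stated assumptions: every number below (base rates, consent fractions, a vaccine that halves the risk) is hypothetical and is not taken from the NFIP study.

```python
# All numbers are hypothetical, chosen only to show the direction of the bias.
base_rate = {"wealthy": 80.0, "other": 40.0}   # cases per 100,000 without vaccine
consent = {"wealthy": 0.7, "other": 0.3}       # fraction of parents who consent
share = {"wealthy": 0.5, "other": 0.5}         # share of each stratum in the population
vaccine_factor = 0.5                           # assumed: the vaccine halves the risk

def mix_rate(weights):
    """Unvaccinated polio rate in a group with the given stratum weights."""
    total = sum(weights.values())
    return sum(weights[s] * base_rate[s] for s in weights) / total

# Composition of the children whose parents consent.
consenters = {s: share[s] * consent[s] for s in share}

# Volunteer design: vaccinated consenters vs. an unselected comparison group.
volunteer_ratio = vaccine_factor * mix_rate(consenters) / mix_rate(share)

# Randomized design: treatment and control are both drawn from the consenters.
randomized_ratio = vaccine_factor * mix_rate(consenters) / mix_rate(consenters)

print(f"volunteer design risk ratio:  {volunteer_ratio:.2f}")
print(f"randomized design risk ratio: {randomized_ratio:.2f}")
```

Under these assumptions the randomized design recovers the true risk ratio, while the volunteer design gives a ratio closer to 1, so the vaccine looks less effective than it is.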

### Placebo effect and blinding

• Subjects in the control group should be given a "treatment" with no effect, so that, ideally, no subject knows which group they are in (the subjects are blinded).
• Why? So the response is not due to the idea of a vaccine, but the vaccine itself.
• In the vaccination example, children were given an injection of salt and water.
• This treatment is called a placebo.

## Double blinding

• If the doctors know who receives treatment and who receives placebo, they may also bias the results by their reporting.
• For example, polio diagnosis is not perfect. A doctor with an interest in the success of the vaccine may declare a treated child with mild polio healthy, or declare an untreated (placebo) child who is close to healthy as having mild polio.
• This bias may be conscious or unconscious on the doctors’ part.

## Double-blind study

| Group | Size | Rate (per 100,000) |
|---|---|---|
| Treatment | 200,000 | 28 |
| Control | 200,000 | 71 |
| No consent | 350,000 | 46 |

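• Because assignment was random, the treatment and control groups differ systematically only in the vaccine; if the vaccine had no effect, any difference in rates would be due to chance alone.
• Later, we will compute the chance of seeing a difference in rates as large as (71 - 28) per 100,000, assuming the vaccine has no effect.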
• The chances will be very small.
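One simple way to get a feel for how small this chance is (not necessarily the method used later in the course) is to condition on the total number of cases. With two groups of 200,000, the rates above correspond to about 56 and 142 cases. If the vaccine had no effect, each of the 198 cases would be roughly equally likely to fall in either group, so the number of treatment-group cases is approximately Binomial(198, 1/2). The sketch below computes the chance of 56 or fewer; the variable names are illustrative.

```python
from math import comb

n_per_group = 200_000
treatment_cases = round(28 * n_per_group / 100_000)   # about 56 cases
control_cases = round(71 * n_per_group / 100_000)     # about 142 cases
total_cases = treatment_cases + control_cases

# Approximation: under "no effect" with equal group sizes, each case falls in
# the treatment group with probability 1/2, independently of the others.
# P(X <= treatment_cases) for X ~ Binomial(total_cases, 1/2):
favorable = sum(comb(total_cases, k) for k in range(treatment_cases + 1))
p_value = favorable / 2 ** total_cases

print(f"P(at most {treatment_cases} of {total_cases} cases in treatment) = {p_value:.2e}")
```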

## Observational studies (Chapter 2)

• Unlike in randomized controlled experiments, in observational studies the subjects are assigned to treatment or control by a mechanism the investigator does not control.
• In a smoking / lung cancer study, subjects choose to smoke or not.
• Very often treatment or control groups differ by more than just the treatment.

## Smoking & socio-economic status

Smoking is related to socio-economic status.

Smokers:

• tend to be in lower socio-economic status groups with less access to medical care;
• will tend to have higher incidence of some diseases based on this fact alone.

## Association is not causation

• In children, shoe size is associated with reading ability.
• However, having big feet does not cause children to score high on reading tests.

## Confounding

The big problem with observational studies is confounding.

• Confounding means there is a difference between the treatment and control groups, other than the treatment, which affects the response being studied.

• A confounder is a third variable, associated with exposure and with disease.

In our example with shoe size and reading ability, age is a confounder.

• Both shoe size and reading ability are associated with age.
• As children age, their feet grow.
• As children age, their reading improves.
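The shoe size example can be mimicked with synthetic data. In the sketch below, every number (age range, slopes, noise levels) is made up for illustration, and `pearson` is a small helper written for this example. Shoe size and reading score each depend on age only, yet they are strongly correlated overall; within a single age the correlation is near zero.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)                                  # fixed seed for reproducibility
ages = [rng.randint(6, 12) for _ in range(20_000)]      # synthetic ages in years
shoe = [0.8 * a + rng.gauss(0.0, 1.0) for a in ages]    # depends on age only
read = [10.0 * a + rng.gauss(0.0, 10.0) for a in ages]  # depends on age only

overall_corr = pearson(shoe, read)

# Hold the confounder fixed: look at nine-year-olds only.
nine = [i for i, a in enumerate(ages) if a == 9]
within_corr = pearson([shoe[i] for i in nine], [read[i] for i in nine])

print(f"all ages: {overall_corr:.2f}   nine-year-olds only: {within_corr:.2f}")
```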

### The problem with confounding

• It is generally impossible to rule out all possible confounding variables.

• Therefore, establishing a causal link between two observed variables, e.g. smoking and lung cancer, can be difficult.

• Fisher, one of the greatest statisticians, believed there was a confounding variable in the case of lung cancer and cigarettes.

• Modern medicine would say he was wrong...

A study from UC Berkeley:

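| Major | Male applicants | Male % admitted | Female applicants | Female % admitted |
|---|---|---|---|---|
| A | 825 | 62 | 108 | 82 |
| B | 560 | 63 | 25 | 68 |
| C | 325 | 37 | 593 | 34 |
| D | 417 | 33 | 375 | 35 |
| E | 191 | 28 | 393 | 24 |
| F | 373 | 6 | 341 | 7 |
| Total | 2691 | 44 | 1835 | 30 |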

In each of these majors, the female admission rate is close to or higher than the male rate, yet the overall female admission rate is lower.

Is this evidence of bias against female applicants?

No, but it is confusing. It is a phenomenon known as Simpson's paradox.

### Females were applying to more competitive majors

```python
%%capture

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Multinomial comes from the local code.probability module used by these
# notes (not a public package).
from code.probability import Multinomial

# table[major, gender, outcome]: applicant counts rebuilt from the percentages
# in the table above. gender 0 = Male, 1 = Female; outcome 0 = Accept, 1 = Deny.
table = np.zeros((6,2,2))
table[0,0] = [825*.62,825*.38]
table[0,1] = [108*.82,108*.18]
table[1,0] = [560*.63,560*.37]
table[1,1] = [25*.68,25*.32]
table[2,0] = [325*.37,325*.63]
table[2,1] = [593*.34,593*.66]
table[3,0] = [417*.33,417*.67]
table[3,1] = [375*.35,375*.65]
table[4,0] = [191*.28,191*.72]
table[4,1] = [393*.24,393*.76]
table[5,0] = [373*.06,373*.94]
table[5,1] = [341*.07,341*.93]

UCB = Multinomial(table,
                  labels=[['A','B','C','D','E','F'],
                          ['Male','Female'],
                          ['Accept', 'Deny']])

UCB_female = UCB.condition_margin(1, 'Female')

UCB_male = UCB.condition_margin(1, 'Male')
UCB_male.sample(10000)
male = np.squeeze(UCB_male.prob)
UCB_female.sample(1000)
female = np.squeeze(UCB_female.prob)
# Sum over the outcome axis: share of each gender's applicants in each major.
male_major = np.sum(male, 1)
female_major = np.sum(female, 1)

major_fig = plt.figure(figsize=(10,10))
major_ax = major_fig.gca()
major_ax.bar(range(female_major.shape[0]), 100 * female_major, alpha=0.5, facecolor='blue', label='Female')
major_ax.bar(range(male_major.shape[0]), 100 * male_major, alpha=0.5, facecolor='yellow', label='Male')
major_ax.legend()
# Fix the tick positions first so that each label lines up with its bar.
major_ax.set_xticks(range(6))
major_ax.set_xticklabels(['A','B','C','D','E','F'])
major_ax.set_xlabel('Major', fontsize=20)
major_ax.set_ylabel('Percentage', fontsize=20)
major_ax.set_title('Breakdown of Major by Gender', fontsize=20)

# Sum over gender, then divide accepted by total to get each major's rate.
overall_rate = table.sum(1)
accept_rate = overall_rate[:,0] / overall_rate.sum(1)

accept_fig = plt.figure(figsize=(10,10))
accept_ax = accept_fig.gca()
accept_ax.bar(range(accept_rate.shape[0]), 100 * accept_rate, alpha=0.5, facecolor='red', label='Acceptance')
accept_ax.set_xticks(range(6))
accept_ax.set_xticklabels(['A','B','C','D','E','F'])
accept_ax.set_xlabel('Major', fontsize=20)
accept_ax.set_ylabel('Percentage', fontsize=20)
accept_ax.set_title('Acceptance Rate by Major', fontsize=20)
```
Figure: Breakdown of Major by Gender (bar chart; x-axis: Major, y-axis: Percentage; series: Female, Male).

Figure: Acceptance Rate by Major (bar chart; x-axis: Major, y-axis: Percentage).

## Weighted acceptance rate

$$
\begin{aligned}
\text{Male} &= \frac{0.62 \times 825 + 0.63 \times 560 + 0.37 \times 325}{2691} \\
& \qquad + \ \frac{0.33 \times 417 + 0.28 \times 191 + 0.06 \times 373}{2691} \\
&\approx 44 \%
\end{aligned}
$$

$$
\begin{aligned}
\text{Female} &= \frac{0.82 \times 108 + 0.68 \times 25 + 0.34 \times 593}{1835} \\
& \qquad + \ \frac{0.35 \times 375 + 0.24 \times 393 + 0.07 \times 341}{1835} \\
&\approx 30 \%
\end{aligned}
$$
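As a check on the arithmetic, the pooled rates can be recomputed from the per-major numbers in the UC Berkeley table. This is a minimal sketch; the names `male`, `female`, and `pooled_rate` are illustrative and do not come from the original notebook. Because the per-major percentages are rounded, the results are approximate (about 44.5% for male and 30.3% for female applicants).

```python
# (applicants, percent admitted) for majors A-F, copied from the UC Berkeley table
male = [(825, 62), (560, 63), (325, 37), (417, 33), (191, 28), (373, 6)]
female = [(108, 82), (25, 68), (593, 34), (375, 35), (393, 24), (341, 7)]

def pooled_rate(rows):
    """Overall admission percentage: total admitted / total applicants."""
    admitted = sum(n * pct / 100 for n, pct in rows)
    applicants = sum(n for n, _ in rows)
    return 100 * admitted / applicants

print(f"male:   {pooled_rate(male):.1f}%")
print(f"female: {pooled_rate(female):.1f}%")
```

Each gender's rate is weighted by where that gender applied, which is why the pooled female rate is pulled toward the low admission rates of majors C to F.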

### Confounder

• In this example, Major was a confounder for the relationship between Gender and Admission status.

### Weighted average

A clearer picture can be found by computing a weighted average of the admission rate.

The average will be weighted by the total number of people applying to each major.

$$
\begin{aligned}
\text{Male} &= \frac{0.62 \times 933 + 0.63 \times 585 + 0.37 \times 918}{4526} \\
& \qquad + \ \frac{0.33 \times 792 + 0.28 \times 584 + 0.06 \times 714}{4526} \\
&= 39 \%
\end{aligned}
$$

$$
\begin{aligned}
\text{Female} &= \frac{0.82 \times 933 + 0.68 \times 585 + 0.34 \times 918}{4526} \\
& \qquad + \ \frac{0.35 \times 792 + 0.24 \times 584 + 0.07 \times 714}{4526} \\
&= 43 \%
\end{aligned}
$$
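The same weighted average can be sketched in Python. The weights are the total number of applicants (male plus female) to each major, so both genders are evaluated against the same mix of majors. The names `total_applicants` and `standardized_rate` are illustrative, not from the original notebook.

```python
# (applicants, percent admitted) for majors A-F, copied from the UC Berkeley table
male = [(825, 62), (560, 63), (325, 37), (417, 33), (191, 28), (373, 6)]
female = [(108, 82), (25, 68), (593, 34), (375, 35), (393, 24), (341, 7)]

# Weight for each major: male plus female applicants to that major.
total_applicants = [m[0] + f[0] for m, f in zip(male, female)]

def standardized_rate(rows, weights):
    """Average of the per-major percentages, weighted by the given weights."""
    return sum(w * pct for w, (_, pct) in zip(weights, rows)) / sum(weights)

print(f"weights: {total_applicants}")
print(f"male:    {standardized_rate(male, total_applicants):.1f}%")
print(f"female:  {standardized_rate(female, total_applicants):.1f}%")
```

With the mix of majors held fixed, the comparison reverses: the weighted female rate is higher than the weighted male rate.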