I decided that today I would just do a fun little post about math, and the topic is Simpson's Paradox. I like this topic because it is counter-intuitive at first, yet simple enough that anyone can understand what is going on with a reasonably short explanation.
In 1973, UC Berkeley was sued for gender discrimination in regard to its graduate admissions. The numbers looked pretty cut and dried: 44% of the men who applied were accepted, whereas only 35% of the women got in. This is a pretty big difference, and it certainly seems a reasonable guess that it is not just by chance.
However, we find something interesting if we look at the individual departments. In many departments, it actually goes the other way: a larger percentage of the female applicants were admitted compared to the percentage of male applicants. This is the apparent paradox. When we look at the individual departments, it seems that the women have an advantage, but when we aggregate the data, it appears that the men have an advantage. At first this sounds like an impossibility. If the women have the advantage in each department, why wouldn't that advantage also be present when the data is combined?
The answer to this puzzle lies in the fact that, in this group of applicants, women tended to apply to the competitive departments, while men tended to apply where it was easier to get in. If a large number of men apply to a department where most will get in, and relatively few apply to a competitive department, the low admission rate of the men in the competitive department will be hidden in the aggregate data. On the other hand, if few women apply to the noncompetitive department and a huge number apply to the competitive one, it will seem in the combined data that women did poorly overall.
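For anyone who likes to see this in symbols, here is a rough way to write it down. The notation is my own and not from any source: for a group g (women or men) and a department d, let n_{g,d} be the number of applicants and a_{g,d} the number admitted. The combined admission rate is then a weighted average of the department rates, with the weights being the share of the group's applicants who applied to each department.

```latex
r_g \;=\; \frac{\sum_d a_{g,d}}{\sum_d n_{g,d}}
    \;=\; \sum_d w_{g,d}\, r_{g,d},
\qquad
w_{g,d} = \frac{n_{g,d}}{\sum_{d'} n_{g,d'}},
\quad
r_{g,d} = \frac{a_{g,d}}{n_{g,d}}
```

Since the weights w_{g,d} differ between women and men, one group can have the higher rate r_{g,d} in every single department and still end up with the lower combined rate r_g.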
If you click on the Wikipedia link at the top of this post, you can see the top 6 departments and how the admissions break down. I decided to focus on just 2 departments from their list, because simply looking at the 2 most extreme cases really demonstrates how this effect works.
In the competitive department, 341 women applied but only 24 got in, while 272 men applied with 16 earning admission. That gives the women about a 7% admission rate and the men about 6%, so the women did slightly better.
In the less competitive department, 108 women applied and 89 got in, while 825 men applied and 511 got in. Here the men only had a 62% success rate while the women's was 82%, so the women did much better than the men.
But what happens when we combine the data? There were 449 women applicants, of whom only 113 were admitted, while there were 1097 men, of whom 527 were admitted. That is only 25% of women compared to 48% of men. If we only look at this number, it would seem that the women did much worse, and it is not unreasonable to think that something unfair is going on here. But if we look at where the data came from, the opposite picture actually emerges.
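If you want to check the arithmetic yourself, here is a small Python sketch using only the applicant and admission counts quoted in this post. The names data, rate, and combined are just ones I made up for illustration.

```python
# (applicants, admitted) for the two departments quoted in this post
data = {
    "competitive": {"women": (341, 24), "men": (272, 16)},
    "less competitive": {"women": (108, 89), "men": (825, 511)},
}


def rate(applicants, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applicants


def combined(group):
    """Total (applicants, admitted) for one group across both departments."""
    applicants = sum(dept[group][0] for dept in data.values())
    admitted = sum(dept[group][1] for dept in data.values())
    return applicants, admitted


# Rates inside each department: women come out ahead in both
for name, dept in data.items():
    for group in ("women", "men"):
        print(f"{name}, {group}: {rate(*dept[group]):.0%}")

# Rates after pooling the two departments: the comparison flips
for group in ("women", "men"):
    print(f"combined, {group}: {rate(*combined(group)):.0%}")
```

The per-department lines show the women ahead both times, and the combined lines show the men ahead, which is the whole paradox in a few lines of output.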
I originally saw this example years ago in some class, and it really stuck with me. Statistics is a wonderful thing, but it can be dangerous too. A careless or malicious person might get the data to tell us the opposite of what is really going on. I think of this example whenever I hear a salesman or politician rattling off statistics.