**Statistics Done Wrong**

Alex Reinhart

https://www.statisticsdonewrong.com/

**P Values**

- A measure of ‘surprise’: the smaller the p value, the greater the surprise. P values work by assuming that there is no difference between the two samples (the null hypothesis) and asking how likely data at least as extreme as yours would be under that assumption. To show that a drug works, you counterintuitively show that “the data is inconsistent with the drug not working”
- P values say nothing about the magnitude of the effect. A small p can come from a massive effect, or from a tiny effect measured with great certainty (say, because you collected a massive amount of data). Statistical significance does not mean practical significance! Similarly, statistical insignificance does not mean the effect is zero; it only means the trial data you studied could not rule out zero.
- Neyman-Pearson uses p values in a conceptually different way. The experimenter chooses an acceptable false positive rate, called ‘alpha’, in advance. A result is statistically significant, and the null hypothesis is rejected, when p falls below alpha. The choice of alpha is informed by the experimenter’s understanding of the procedure.
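To make the gap between statistical and practical significance concrete, here is a minimal sketch using only the Python standard library. It assumes a two-sample z-test with a known, common standard deviation; the function name and all the numbers are made up for illustration and do not come from the book.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p value for a difference in means (known common sd assumed)."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    # Probability of a z statistic at least this extreme if there is no difference
    return 2 * NormalDist().cdf(-abs(z))


# Hypothetical large effect (a full standard deviation) in a small study
p_large_effect = two_sample_z_p_value(mean_diff=10.0, sd=10.0, n_per_group=10)

# Hypothetical tiny effect (1% of a standard deviation) in an enormous study
p_tiny_effect = two_sample_z_p_value(mean_diff=0.1, sd=10.0, n_per_group=1_000_000)

print(f"large effect, n=10 per group:        p = {p_large_effect:.4f}")
print(f"tiny effect,  n=1,000,000 per group: p = {p_tiny_effect:.2e}")
```

Both results fall below 0.05, and the practically negligible effect gets the smaller p value simply because the sample is so large.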

**Confidence Intervals**

- These are preferable to p values because they provide both a point estimate and a measure of uncertainty. Where a confidence interval can be reported instead of just a p value, it should be.
- They are less common in the literature, perhaps because they can be very wide
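As a rough illustration of what an interval adds over a bare p value, the sketch below computes a 95% interval for a proportion. It assumes the simple normal-approximation (Wald) interval, which is only one of several possible methods, and the counts are invented.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Same observed proportion (60%), very different amounts of data
small_study = proportion_confidence_interval(successes=6, n=10)
large_study = proportion_confidence_interval(successes=600, n=1000)

print(f"n=10:   {small_study[0]:.3f} to {small_study[1]:.3f}")
print(f"n=1000: {large_study[0]:.3f} to {large_study[1]:.3f}")
```

The small study's interval is wide enough to include 50%, which shows at a glance how little the data pin down; the large study's interval excludes it.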

**Statistical Power**

- The probability that a study with a given amount of data will detect an effect of a given size as statistically significant. For example, how many coin flips do you need to be 95% sure that your experiment can reveal a biased coin that is 60% weighted towards heads? Statistical power is a function of:
- The size of the effect; a smaller effect requires more data
- Sample size; more data means the study has higher statistical power
- Measurement error; noisier (for example, more subjective) measurements mean less power
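The coin-flip question can be explored with an exact binomial calculation. This is a sketch under stated assumptions: a two-sided test of "the coin is fair" at alpha = 0.05 with a symmetric rejection region, and helper names that are purely illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips with heads probability p."""
    # Note: converting comb(n, k) to float overflows for n much above 1000
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def power_of_coin_test(n, true_p, alpha=0.05):
    """Power of a two-sided exact test of H0: p = 0.5 using n flips."""
    # Find the largest `lo` such that rejecting when heads <= lo or
    # heads >= n - lo keeps the false positive rate at or below alpha.
    tail = 0.0
    lo = -1
    for k in range(n // 2 + 1):
        tail += binom_pmf(k, n, 0.5)
        if 2 * tail > alpha:
            break
        lo = k
    if lo < 0:
        return 0.0  # no outcome is extreme enough to reject
    # Power: probability of landing in the rejection region under the true bias
    return sum(
        binom_pmf(k, n, true_p) for k in range(n + 1) if k <= lo or k >= n - lo
    )


for flips in (10, 100, 500):
    print(flips, round(power_of_coin_test(flips, true_p=0.6), 3))
```

Under these assumptions, 100 flips detect the 60% coin less than half the time, while 500 flips detect it almost always.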

- Many studies are underpowered because there is not enough data. This often occurs because collecting more is expensive or risky (e.g. drug studies) or ethically constrained (e.g. studies on animals)
- While an antidote for multiple comparison problems can be to require lower p values, the trade-off is that studies will become underpowered
- The concept of power is often forgotten because it is not taught in intro stats and is not readily intuitive. A wide confidence interval can reveal a lack of statistical power, which is another reason to prefer confidence intervals over p values.
- Truth inflation, or type M (‘magnitude’) error, arises when many experimenters ‘compete’ to publish extreme results: an underpowered study reaches significance only when it happens to overestimate the effect, so the significant results that get published exaggerate the true effect size.
- Small samples have more variance; be careful when drawing conclusions from them, since they are more likely to be underpowered
- Rural states have counties with both the lowest AND highest kidney cancer rates; this is likely due to small populations, not a real effect. The same is true for test scores in smaller schools: we may judge them ‘better’ because they top the rankings, but small schools have more variable averages, so they are over-represented at both the top and the bottom.
- Remedies for this include *shrinkage*: combining the average from a small sample with the average from a larger population, weighted by how much data each has (e.g. pulling a small county’s rate towards the national average). Unfortunately, this biases truly abnormal cases too far towards normal. The best remedy is to find a larger sample (e.g. use congressional districts instead of counties). Shrinkage is a good technique for averaging product reviews (products with few reviews are shrunk towards a generic version of the product).
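One simple way to sketch shrinkage for the product-review case is a weighted average in which the population mean counts as a fixed number of "pseudo-reviews". The `prior_weight` of 10 and the ratings below are assumptions chosen for illustration, not values from the book.

```python
def shrunk_mean(sample_mean, sample_size, population_mean, prior_weight=10):
    """Pull a sample mean towards the population mean.

    `prior_weight` acts like a number of pseudo-observations at the
    population mean; it is a tuning choice, assumed here to be 10.
    """
    total = sample_size * sample_mean + prior_weight * population_mean
    return total / (sample_size + prior_weight)


overall_average = 3.5  # hypothetical average rating across all products

# Two 5-star reviews: heavily shrunk towards the overall average
few_reviews = shrunk_mean(5.0, sample_size=2, population_mean=overall_average)

# 500 reviews averaging 4.6: barely moved
many_reviews = shrunk_mean(4.6, sample_size=500, population_mean=overall_average)

print(round(few_reviews, 2), round(many_reviews, 2))
```

With these numbers the product with two perfect reviews ranks below the well-reviewed one, which is the intended effect. The cost is the one noted above: a genuinely exceptional product with few reviews is also pulled towards average.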

**Pseudoreplication**

- Using additional measurements that are highly dependent on, or highly correlated with, previous data. This form of replication doesn’t allow you to generalize inferences.
- If you cannot eliminate hidden sources of correlation between variables you must try to statistically adjust for confounding factors.

**Base rate fallacy**

- A low p value is often touted as evidence that an effect is real, but how likely a significant result is to be true also depends on the base rate. Consider Bayesian examples like mammogram testing. Suppose mammograms have a false positive rate of 5% and a 90% chance of accurately identifying cancer, and you test 1000 people. Even for someone who tests positive, it is still quite unlikely that they have cancer. Why? Because the base rate is a mere 1%. Only 10 people in that sample have cancer, and we expect 9 of them to be accurately identified, but about 50 of the 990 healthy people will also test positive. Of the roughly 59 positives, only 9 (about 15%) are real. When testing for conditions with very low base rates, false positives will swamp true positives.
- In an extreme example, suppose you are looking for an effect which definitely does not exist. No matter how low you set your p threshold, every so-called significant result is a false positive, and you are bound to record some significant results if you run enough tests.
- Combatting false discoveries when making multiple comparisons is challenging but important, since you should expect many of them. Tips include:
- Remember p < .05 doesn’t mean there’s a 5% chance your result is false
- When making multiple comparisons, using a procedure such as Bonferroni or Benjamini-Hochberg will make your required p values much more conservative by accounting for the number of tests
- Be aware of statistical techniques specific to your field for testing data
- Have an idea of base rates to estimate how prevalent false positives are likely to be
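The mammogram arithmetic and the two correction procedures named above can be sketched in a few lines. The function names are illustrative, and the list of p values is invented purely to show how the two procedures differ.

```python
def positive_predictive_value(base_rate, sensitivity, false_positive_rate):
    """Fraction of positive results that are true positives."""
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Mammogram figures from the notes: 1% base rate, 90% detection, 5% false positives
ppv = positive_predictive_value(0.01, 0.90, 0.05)
print(f"chance a positive result is real: {ppv:.1%}")


def bonferroni(p_values, alpha=0.05):
    """Reject only p values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure that controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p value satisfies p_(k) <= (k / m) * alpha
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for index in order[:cutoff_rank]:
        rejected[index] = True
    return rejected


# Hypothetical p values from eight separate tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print("uncorrected:       ", sum(p < 0.05 for p in p_values))
print("Bonferroni:        ", sum(bonferroni(p_values)))
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)))
```

Five of the eight invented results pass an uncorrected 0.05 threshold, but only one survives Bonferroni and two survive Benjamini-Hochberg, which is the less strict of the two because it controls the false discovery rate rather than the chance of any false positive at all.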

**Confounding variables**

- Correlation is not causation

- Because you do not know whether there is a confounding variable. If you create a model that predicts heart attack rates based on weight, exercise, and diet, it’s tempting to say that if you change one of them by x%, the heart attack rate will change by y%. However, that is not what you tested. You didn’t change the variables in a real experiment and measure the outcomes, so you cannot rule out that a confounding variable is actually influencing the heart attack rate.
- Also, to say a variable changes with all else equal is a fantasy. In reality, it is unusual for a single variable to change in a vacuum.

- Simpson’s paradox

- When a trend in the data disappears or reverses once the data is divided into natural groups. It tends to occur in observational studies with biased samples, where the aggregate obscures a confounding variable.
- Examples:
- Berkeley admission bias against women in 1973. In aggregate, women looked discriminated against, but at the department level the opposite was true. The apparent bias arose because women applied in higher numbers to competitive, underfunded departments. The bias happened earlier in the process: women were systematically pushed towards these fields
- Penicillin appeared to improve outcomes for meningitis cases in the UK. On closer inspection, the sample was biased: penicillin was only administered to children who were not rushed to the hospital, so they were the milder cases. Restricting the sample to those who visited a general practitioner first, we find that penicillin in fact seemed to correlate with worse outcomes (there are theories about the breakdown of the contagion causing shock, but there are no experiments testing whether penicillin actually causes meningitis patients to die)
- Looking at aggregate data, United flights are delayed more frequently than Continental flights. But at individual airports, the trend reverses. The aggregate data doesn’t account for the fact that United flies out of more airports with bad weather.
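The reversal in the airline example is easy to reproduce with made-up numbers. The counts below are entirely hypothetical (they are not real airline data) and the airline and airport names are placeholders; the only point is the mechanism, in which one airline flies mostly out of the airport with bad weather.

```python
# (delayed flights, total flights) per airport; all counts are invented
flights = {
    "airline_x": {"sunny_airport": (5, 100), "stormy_airport": (180, 900)},
    "airline_y": {"sunny_airport": (72, 900), "stormy_airport": (25, 100)},
}


def aggregate_delay_rate(by_airport):
    """Delay rate pooled over all airports."""
    delayed = sum(d for d, _ in by_airport.values())
    total = sum(t for _, t in by_airport.values())
    return delayed / total


aggregate = {name: aggregate_delay_rate(data) for name, data in flights.items()}
per_airport = {
    name: {airport: d / t for airport, (d, t) in data.items()}
    for name, data in flights.items()
}

print("aggregate:", aggregate)
for airport in ("sunny_airport", "stormy_airport"):
    print(airport, {name: per_airport[name][airport] for name in flights})
```

In aggregate, `airline_x` looks far worse, yet it has the lower delay rate at each individual airport. Airport choice is the confounding variable that the pooled figures hide.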