Thursday, September 17, 2015

Why are handwashing studies and their reporting so broken?



For around ten years, I've had my introductory biology students perform experiments to attempt to determine the effectiveness of soaps and hand cleansing agents.  This is really a great exercise to get students thinking about the importance of good experimental design, because it is very difficult to do an experiment that is good enough to show differences caused by their experimental treatments.  The bacterial counts they measure are very variable and it's difficult to control the conditions of the experiment.  Since there is no predetermined outcome, the students have to grapple with drawing appropriate conclusions from the statistical tests they conduct - they don't know what the "right" answer is for their experiment.

We are just about to start in on the project in my class again this year, so I was excited to discover that a new paper had just come out that purports to show that triclosan, the most common antibacterial agent in soap, has no effect under conditions similar to normal hand washing:

Kim, S.A., H. Moon, K. Lee, and M.S. Rhee. 2015. Bactericidal effects of triclosan in soap both in vitro and in vivo. Journal of Antimicrobial Chemotherapy. http://dx.doi.org/10.1093/jac/dkv275 (the DOI doesn't currently dereference, but the paper is at http://m.jac.oxfordjournals.org/content/early/2015/09/14/jac.dkv275.short?rss=1)

The authors exposed 20 recommended bacterial strains to soap with and without triclosan at two different temperatures.  They also exposed bacteria to regular and antibacterial soap for varying lengths of time.  In a second experiment, the authors artificially contaminated the hands of volunteers who then washed with one of two kinds of soap.  The bacteria remaining on the hands were then sampled.

The authors stated that there was no difference in the effect of soap with and without triclosan.  They concluded that this was because the bacteria were not in contact with the triclosan long enough for it to have an effect.  Based on what I've read and on the various experiments my students have run over the years, I think this conclusion is correct.  So what's my problem with the paper?

Why do we like to show that things are different and not that they are the same?


When I talk with my beginning students, they often wonder why experimental scientists are so intent on showing that things are significantly different.  Why not show that they are the same?  Sometimes that's what we actually want to know anyway.

When we analyze the results of an experiment statistically, we evaluate them by calculating "P".  P is the probability that we would get results at least this different by chance if the things we were comparing were actually the same.  If P is high, then it's likely that the differences are due to random variation.  If P is low, it's unlikely that the differences are due to chance variation; more likely they were caused by a real effect of the thing we are testing.  The typical cutoff for statistical significance is P<0.05.  If P<0.05, then we say that we have shown that the results are significantly different.
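
To make this concrete, here is a minimal sketch (in Python, with hypothetical numbers that are not from any real experiment) of the kind of test involved: comparing bacterial counts from two soap treatments and calculating P.

from scipy import stats

# Hypothetical log bacterial counts after washing with each soap (not real data)
plain_soap = [5.2, 4.8, 5.5, 5.1, 4.9, 5.3]
triclosan_soap = [5.0, 4.7, 5.4, 5.2, 4.8, 5.1]

# Two-sample t-test: P is the probability of differences at least this large
# arising by chance if the two soaps really had the same effect
t_stat, p_value = stats.ttest_ind(plain_soap, triclosan_soap)
print("t =", round(t_stat, 2), " P =", round(p_value, 3))

# If P < 0.05 we call the difference statistically significant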

The problem lies in our conclusion when P>0.05.  A common (and wrong) conclusion is that when P>0.05 we have shown that the results are not different (i.e., the same).  Actually, what has happened is that we have failed to show that the results are different.  Isn't that the same thing?

Absolutely not.  In simple terms, I put it this way: if P<0.05, that is probably because the things we are measuring are different.  If P>0.05, that is either because the things we are measuring are the same OR because our experiment stinks!  When differences are small, it may be very difficult to perform an experiment good enough to show that P<0.05.  On the other hand, any bumbling idiot can do an experiment that produces P>0.05 through any number of poor practices: too few samples, poorly controlled experimental conditions, or the wrong kind of statistical test.

So there is a special burden placed on a scientist who wants to show that two things are the same.  It is not good enough to run a statistical test and get P>0.05.  The scientist must also show that the experiment and analysis were capable of detecting differences of a certain size if they existed.  This is called a "power analysis".  A power analysis shows that the test has enough statistical power to uncover differences when they are actually there.  Before claiming that there is no effect of the treatment (no significant difference), the scientist has to show that his or her experiment doesn't stink.
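
Here is a minimal sketch of what such a power analysis looks like, assuming (hypothetically) that we want to detect a moderate effect size (Cohen's d = 0.5) with 80% power at the usual significance cutoff; none of these numbers come from the Kim et al. paper.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many samples per group would be needed to detect a moderate effect?
n_required = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print("Samples needed per group:", round(n_required))

# And how much power does a tiny experiment actually have to detect that effect?
power_n3 = analysis.power(effect_size=0.5, nobs1=3, alpha=0.05)
print("Power with only 3 samples per group:", round(power_n3, 2))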


So what's wrong with the Kim et al. 2015 paper???


The problem with the paper is that it doesn't actually provide evidence that supports its conclusions.

If we look at the Kim et al. paper, we can find the problem buried on the third page.  Normally in a study, one reports "N", the sample size, a.k.a. the number of times the experiment was repeated.  Repeating the experiment is the only way to find out whether the differences you see are due to real differences or just to bad luck in sampling.  In the Kim et al. paper, all that is said about the in vitro part of the study is "All treatments were performed in triplicate."  Are you joking?????!!!  Three replicates is a terrible sample size for this kind of experiment, where results tend to be highly variable.  I guess N=2 would have been worse, but this is pretty bad.
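
To see why three replicates is so risky, here is a small simulation (with invented numbers, not the paper's data) showing how much the mean of a triplicate experiment bounces around even when nothing at all differs between treatments.

import numpy as np

rng = np.random.default_rng(42)
true_mean, true_sd = 5.0, 0.8   # hypothetical log bacterial counts

# Simulate 10,000 experiments drawn from the SAME population, with N=3 and N=10
means_n3 = rng.normal(true_mean, true_sd, size=(10000, 3)).mean(axis=1)
means_n10 = rng.normal(true_mean, true_sd, size=(10000, 10)).mean(axis=1)

print("Spread (SD) of triplicate means:", round(means_n3.std(), 2))
print("Spread (SD) of N=10 means:      ", round(means_n10.std(), 2))

# With only three replicates, two identical treatments can easily produce means
# that look quite different purely by sampling luck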


My next gripe with the paper is in the graphs.  It is a fairly typical practice in reporting results to show a bar graph where the height represents the mean value of the experimental treatment and the error bars show some kind of measure of how well that mean value is known.  The amount of overlap (if any) provides a visual way of assessing how different the means are.

Typically, 95% confidence intervals or standard errors of the mean are used to set the size of the error bars.  But Kim et al. used standard deviation.  Standard deviation measures the variability of the data, but it does NOT provide an assessment of how well the mean value is known.  Both 95% confidence intervals and standard errors of the mean are influenced by the sample size as well as by the variability of the data, so they take into account all of the factors that affect how well we know our mean value.  Because the error bars on these graphs are based on standard deviation, they really don't provide any useful information about how different the mean values are.*
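
Here is a quick sketch of the difference, using a made-up sample of three measurements (the same N as the in vitro experiment); only the standard error and the confidence interval respond to sample size.

import numpy as np
from scipy import stats

data = np.array([5.1, 4.6, 5.8])               # hypothetical log counts, N = 3
n = len(data)

sd = data.std(ddof=1)                          # standard deviation: spread of the data
sem = sd / np.sqrt(n)                          # standard error: precision of the mean
ci_half = stats.t.ppf(0.975, df=n - 1) * sem   # half-width of the 95% confidence interval

print("mean =", round(data.mean(), 2))
print("SD error bars:     +/-", round(sd, 2))
print("SEM error bars:    +/-", round(sem, 2))
print("95% CI error bars: +/-", round(ci_half, 2))

# Only the SEM and the confidence interval shrink as N grows; the standard
# deviation does not, so SD error bars say nothing about how well the mean is known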

The in vivo experiment was better.  In that experiment, 16 volunteers participated, so the sample size is better than 3.  But there are other problems.


First of all, it appears that all 16 volunteers washed their hands using all three treatments.  There is nothing wrong with that, but apparently the data were analyzed using a one-factor ANOVA.  The statistical test would have been much more powerful if the analysis had been blocked by participant, since some of the variability may be caused by the participants themselves rather than by the applied treatment.
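
For readers who want to see what blocking looks like in practice, here is a minimal sketch with invented data (four hypothetical participants, three treatments), comparing a one-factor ANOVA with one that includes participant as a blocking factor.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical log bacterial counts: participants p2 and p3 differ consistently
df = pd.DataFrame({
    "participant": ["p1", "p2", "p3", "p4"] * 3,
    "treatment": ["baseline"] * 4 + ["plain"] * 4 + ["triclosan"] * 4,
    "log_count": [6.1, 6.9, 5.8, 6.5,
                  4.9, 5.8, 4.6, 5.3,
                  4.8, 5.7, 4.5, 5.2],
})

one_factor = smf.ols("log_count ~ treatment", data=df).fit()
blocked = smf.ols("log_count ~ treatment + participant", data=df).fit()

# In the one-factor model, participant-to-participant variability inflates the
# error term; blocking removes it and gives the treatment test more power
print(anova_lm(one_factor))
print(anova_lm(blocked))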

Secondly, the researchers applied an a posteriori Tukey's multiple range test to determine which pairwise comparisons were significantly different.  Tukey's test is appropriate in cases where there is no a priori rationale for comparing particular pairs of treatments.  However, in this case it is perfectly clear which pair of treatments the researchers are interested in: the comparison of regular and antibacterial soap!  Just look at the title of the paper!  The comparison of the soap treatments with the baseline is irrelevant to the hypothesis being tested, so its presence does not create a requirement for a test for unplanned comparisons.  Tukey's test controls the experiment-wise error rate to account for multiple unplanned comparisons, effectively raising the bar and making it harder to show that P<0.05; in this case it rigs the test to make it more likely that the effects will NOT appear different.
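
A planned comparison of just the two soaps, using each participant as his or her own control, is the more direct test of the question in the paper's title.  A minimal sketch, again with made-up numbers:

from scipy import stats

# Hypothetical log counts for the same participants washing with each soap
plain = [4.9, 5.8, 4.6, 5.3, 5.1, 4.8]
triclosan = [4.8, 5.6, 4.6, 5.1, 5.0, 4.8]

# A paired t-test on this single planned comparison pays no multiple-comparison
# penalty, unlike Tukey's a posteriori test
t_stat, p_value = stats.ttest_rel(plain, triclosan)
print("paired t =", round(t_stat, 2), " P =", round(p_value, 3))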

Both the failure to block by participant and the use of an inappropriate a posteriori test make the statistical analysis weaker, not stronger, and a strong test is what you need if you want to argue that you failed to find differences because they weren't there.

The graph is also misleading, for the reasons I mentioned about the first graph.  The error bars here apparently bracket the range within which the middle 80% of the data fall.  Again, this is a measure of the dispersion of the data, not a measure of how well the mean values are known.  We can draw no conclusions from the degree of overlap of the error bars, because the error bars represent the wrong thing.  They should have been 95% confidence intervals if the authors wanted the amount of overlap to mean anything.

Is N=16 an adequate sample size?  We have no idea, because no power test was reported.  This kind of sloppy experimental design and analysis seems to be par for the course in experiments involving hand cleansing.  I usually suggest that my students read the scathing rebuke by Paulson (2005) [1] of the Sickbert-Bennett et al. (2005) [2] paper, which bears some similarities to the Kim et al. paper.  Sickbert-Bennett et al. claimed that it made little difference what kind of hand cleansing agent one used, or whether one used any agent at all.  However, Paulson pointed out that the sample size used by Sickbert-Bennett et al. (N=5) would have needed to be as much as 20 times larger (i.e. N=100) to have made their results conclusive.  Their experiment was way too weak to support the conclusion that the factors had the same effect.  This is probably also true for Kim et al., although to know for sure, somebody needs to run a power test on their data.
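
As a sketch of the kind of power calculation that should have been reported, here is how one could estimate the smallest effect a paired design with 16 participants could detect with 80% power; the power target and significance level are my assumptions, not the paper's.

from statsmodels.stats.power import TTestPower

paired = TTestPower()

# Smallest standardized effect (Cohen's d) detectable with N = 16, alpha = 0.05,
# and 80% power in a paired design
detectable_d = paired.solve_power(nobs=16, alpha=0.05, power=0.8)
print("Minimum detectable effect size with N = 16: d =", round(detectable_d, 2))

# Any real difference smaller than this would likely be reported as
# "no significant difference" even though the experiment simply couldn't see it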

What is wrong here???

There are so many things wrong here, I hardly know where to start.

1. Scientists who plan to engage in experimental science need to have a basic understanding of experimental design and statistical analysis.  Something is really wrong with our training of future scientists if we don't teach them to avoid basic mistakes like this.

2. Something is seriously wrong with the scientific review process if papers like this get published with really fundamental problems in their analyses and in the conclusions that are drawn from those analyses.  The fact that this paper got published means not just that four co-authors don't know basic experimental design and analysis, but also that two or more peer reviewers and an editor couldn't recognize problems with experimental design and analysis.

3. Something is seriously wrong with science reporting.  This paper has been picked up and reported online by newsweek.com, cbsnews.com, webmd, time.com, huffingtonpost.com, theguardian.com, and probably more.  Did any of these news outlets read the paper?  Did any of them consult with somebody who knows how to assess the quality of research and get a second opinion on this paper?  SHAME ON YOU, news media!!!!!

--------------
Steve Baskauf is a Senior Lecturer in the Biological Sciences Department at Vanderbilt University, where he introduces students to elementary statistical analysis in the context of the biology curriculum. 

* Error bars that represent standard deviation will always span a larger range than those that represent standard error of the mean, since the standard error of the mean is estimated by s/√N.  The 95% confidence interval is approximately plus or minus two times the standard error of the mean.  So when N=3, the 95% confidence interval will be approximately +/- 2s/√3, or +/- 1.15s.  In that case, standard deviation error bars (+/- s) span a range that is slightly smaller than 95% confidence interval error bars would span, which makes it slightly easier to have error bars that don't overlap than it would be if they represented 95% confidence intervals.  When N=16, the 95% confidence interval would be approximately +/- 2s/√16, or +/- s/2.  In this case, the standard deviation error bars are twice the size of the 95% confidence intervals, making it much easier to have error bars that overlap than if the bars represented 95% confidence intervals.  In a case like this, where we are trying to show that things are the same, making the error bars twice as big as they should be makes the sample means look more similar than they actually are, which is misleading.  The point is that using standard deviations for error bars is the wrong thing to do when comparing means.
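
A quick numeric check of the arithmetic in this footnote (using the rough approximation that a 95% confidence interval is about plus or minus two standard errors):

import math

for n in (3, 16):
    sem = 1 / math.sqrt(n)        # standard error in units of s
    ci = 2 * sem                  # approximate 95% CI half-width in units of s
    print("N =", n, ": SD bars = +/- 1.00 s, approx. 95% CI = +/-", round(ci, 2), "s")

# N = 3 gives a CI of about +/- 1.15 s (slightly wider than the SD bars);
# N = 16 gives +/- 0.50 s (half the width of the SD bars)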

[1] Paulson, D.S. 2005. Response: comparative efficacy of hand hygiene agents. American Journal of Infection Control 33:431-434. http://dx.doi.org/10.1016/j.ajic.2005.04.248
[2] Sickbert-Bennett E.E., D.J. Weber, M.F. Gergen-Teague, M.D. Sobsey, G.P. Samsa, W.A. Rutala. 2005. Comparative efficacy of hand hygiene agents in the reduction of bacteria and viruses.  American Journal of Infection Control 33:67-77.  http://dx.doi.org/10.1016/j.ajic.2004.08.005

Saturday, September 5, 2015

Date of Toilet Paper Apocalypse Now Known


Introduction


Fig. 1. Charmin toilet paper roll in 2015

Have you noticed how rolls of toilet paper fit into their holders more loosely than they did in the past?  Take a look the next time you are doing your Charmin paperwork, and you will see the first signs of the impending Toilet Paper Apocalypse.

Methods and data


Like the proverbial frog in boiling water, the first signs of the Toilet Paper Apocalypse were not obvious.  I first became aware that it was coming in 2009, when I happened to buy two 8-packs of Charmin Giant rolls and noticed that one of them was noticeably shorter than the other:

Fig. 2. Roll size change observed in 2009.
Careful examination of the quantity details showed that the width of each roll had been decreased from the standard width of 11.4 cm to 10.8 cm.  Although this decrease of 0.6 cm wasn't that apparent when comparing single rolls, it was pretty obvious when the rolls were stacked four high.  When I noticed this, I had a sinking feeling that I was seeing the beginning of a nefarious plot being carried out by faceless bureaucrats at Procter and Gamble.  But without additional data, it was hard to know whether this 5% decrease in the size of rolls was a fluke or part of a pattern.

Fast-forward to 2015 and another trip to the store.  This time it was a purchase of two 6-packs of Charmin Mega rolls:
Fig. 3.a. Charmin roll size change in 2015.

Close examination of the details tells the story:
Fig. 3.b. Charmin roll size change in 2015 (detail).

Sometime between 2009 and 2015, P&G decreased the width of their rolls again, from 10.8 to 9.9 cm, another 8% decrease in the size per sheet.  Now in 2015 the number of sheets per Mega roll was reduced from 330 to 308, a further 7% decline in the amount of toilet paper per roll and a corresponding increase in P&G's profits (since the price of the product has stayed the same with the size decreases).  

What first seemed an idle conspiracy theory is now a known fact.  Being an analysis nerd, I had to search for additional data from the years between 2009 and 2015.  Google did not disappoint.  I was able to find this 2013 post from the Consumerist, which documented another decline in roll width, from 10.8 cm to 10.0 cm (a  7% decrease in sheet size):

Fig. 4. Charmin roll size change in 2013.  Image from http://consumerist.com/2013/10/31/charmin-deploys-toilet-paper-sheet-shrink-ray-slims-rolls-down/

Analysis

Unfortunately, not all of the changes involved rolls of the same size.  I also did not have exact dates for each change.  However, by estimating the dates and doing some size conversions, I was able to assemble the following data:
Fig. 5. Data showing relationship between year and roll size.

 I conducted a regression analysis on the data and the trend was clear:
Fig. 6. Regression analysis predicting the relationship between year and roll size.
The regression had P=0.0203, so there is no question about the significance of the relationship.  Note also that the regression has an R² value of 0.96, which means we can safely extrapolate into the future to predict the course of events leading up to the apocalypse.  Rearranging the regression equation and solving for the year when the toilet paper surface area goes to zero gives the year 2046.  So we can now confidently predict the date when the apocalypse will come to pass.
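
For anyone who wants to reproduce the extrapolation, here is a minimal sketch in Python.  The roll widths come from the figures above and the 10.1 cm sheet length from the discussion below, but the dates assigned to each width are my rough guesses, so the printed values will only approximate the P, R², and 2046 figures reported here.

from scipy import stats

years = [2008, 2009, 2013, 2015]                 # assumed dates for each roll width
widths_cm = [11.4, 10.8, 10.0, 9.9]              # widths from Figs. 2-4
sheet_area_cm2 = [w * 10.1 for w in widths_cm]   # area per sheet at 10.1 cm length

slope, intercept, r_value, p_value, std_err = stats.linregress(years, sheet_area_cm2)
print("P =", round(p_value, 4), " R^2 =", round(r_value ** 2, 2))

# Solve the fitted line (area = slope * year + intercept) for area = 0 to get
# the predicted year of the Toilet Paper Apocalypse
print("Predicted year of zero sheet area:", round(-intercept / slope))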

Discussion

This analysis has three important implications. 

1. Charmin toilet paper will disappear on or about the year 2046.  For those of us who are devoted Charmin users, that means shortages, hoarding, and chaos in the grocery stores during the 2040's.  Remember the horror that was Y2K?  We have that to look forward to all over again.

2. Since each Charmin roll size decrease was accomplished without a corresponding decrease in price, the profits for Procter and Gamble will increase as an inverse function of the area of paper per roll.  A little elementary math will prove that a linear decrease in the roll area to zero results in Procter and Gamble's profits increasing to infinity in the year 2046.  The implications are clear: by taking advantage of their customers' addiction to their product, P&G intends to dominate the world economy by mid-century.

3. As the amount of toilet paper per roll decreases, other effects will begin to appear.  For example, if we assume a constant number of sheets per roll and a length per sheet of 10.1 cm, we can project based on the regression line from Fig. 6 that the width of sheets will reach 2.0 cm in approximately 2040 (Fig. 7).

Fig. 7. Simulated Charmin roll size in 2040.
When the sheet size reaches that shown in Fig. 7, it becomes questionable whether the toilet paper can actually perform its intended function.  In that case, we can expect cascades of related effects caused by the increased transmission of fecal-borne pathogens, such as pandemics of cholera, hepatitis A, and typhoid fever.  It is possible that the human population may collapse due to these pandemics several years before the actual disappearance of toilet paper altogether.

Conclusions

This paper should be considered a clarion call for action.  The most likely solution to the problem is probably some form of government regulation of roll and sheet size.  However, given the widespread gridlock over lesser issues such as climate change and refugee crises, it is likely that government inaction on this issue will continue.  In that case, we should expect widespread hoarding, followed by rationing, as 2046 approaches.

--------------------------------------------------------------------------------------------------------------

Steve Baskauf is a Senior Lecturer in the Biological Sciences Department at Vanderbilt University, where he introduces students to elementary statistical analysis in the context of the biology curriculum.