I recently had the opportunity to hear Dr. Mark Lipsey, Director of the Peabody Research Institute at Vanderbilt University, speak at an Atlanta Area Evaluation Association event. One of the main themes of his talk was reconsidering the statistical decision criteria we use in evaluation. Today, the industry standard in evaluation, and in social science generally, is to use p < .05 to determine whether an effect exists. The “p” there relates to the probability of Type I error: the statistical chance of mistakenly identifying an effect where there is none. In other words, we typically accept no more than a 5% chance of a false positive before recognizing a program as “working.” The problem, of course, is that this criterion does not properly account for the potential cost of Type II error, the chance of failing to recognize a program that is working.
Dr. Lipsey noted that relaxing the decision criterion doesn’t actually help statistical power all that much compared to other strategies. My thinking, then, is that since “the state of the art” in statistics may not always be up to the task, perhaps increasing our use of qualitative methods could help fill the gap. At the same time, as evaluators we will of course continue to ensure that our statistical analyses are as accurate and credible as possible to best support the needs of our clients. To that end, p < .05 remains a necessary evil for now, as the field does not currently view more relaxed decision criteria as credible. Given that limitation, it’s important to find other ways to increase statistical power whenever possible.
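To see Dr. Lipsey’s point concretely, here is a minimal sketch of my own (not from the talk) comparing two ways to raise power: relaxing alpha versus increasing the sample size. It uses a normal approximation to a two-sided, two-sample test, and the effect size (d = 0.3) and sample sizes are hypothetical illustrations, not figures from any real evaluation:

```python
# Approximate power of a two-sided, two-sample z-test via a normal
# approximation, using only the Python standard library.
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power(d, n_per_group, alpha):
    """Approximate power for standardized effect size d,
    n participants per group, and two-sided significance level alpha."""
    z_crit = Z.inv_cdf(1 - alpha / 2)          # critical value for alpha
    noncentrality = d * (n_per_group / 2) ** 0.5  # expected z under the effect
    return 1 - Z.cdf(z_crit - noncentrality)


base = power(0.3, 100, 0.05)       # conventional criterion  -> ~0.56
relaxed = power(0.3, 100, 0.10)    # relaxed criterion       -> ~0.68
bigger_n = power(0.3, 200, 0.05)   # doubled sample instead  -> ~0.85

print(f"alpha=.05, n=100 per group: {base:.2f}")
print(f"alpha=.10, n=100 per group: {relaxed:.2f}")
print(f"alpha=.05, n=200 per group: {bigger_n:.2f}")
```

Under these illustrative numbers, relaxing alpha from .05 to .10 buys only about 12 points of power, while doubling the sample buys roughly 29, which is consistent with the observation that other strategies do more for power than loosening the decision criterion.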
Do you agree that we should be more flexible about our statistical decision standards? What other methods do you use to avoid Type II error? Share your thoughts in the comments section below.