
A method to represent data with more balance: Winsorize it

To lessen the influence of outliers, a set of numbers can be Winsorized. The technique is named after Charles Winsor. To Winsorize data, you set the tail values at each extreme equal to some specified percentile of the data. (A variant caps values at a set number of standard deviations from the mean, such as plus and minus four.) Here’s how it’s done. For a 90 percent Winsorization, the bottom 5 percent of the values are set equal to the value at the 5th percentile, while the top 5 percent of the values are set equal to the value at the 95th percentile. This adjustment is not equivalent to throwing some of the extreme data away, which happens with trimmed means (see my post of May 28, 2007: five percent trimmed mean). Once you Winsorize your data by the same percentage at both ends, your median will not change but your average usually will.
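For readers who want to try this themselves, here is a minimal Python sketch of a 90 percent Winsorization. The function name `winsorize`, the `tail_fraction` parameter, and the 20 sample values are all made up for illustration; they are not the survey data. The cutoffs are picked by counting positions in the sorted list, which is only one of several reasonable ways to define a percentile.

```python
from statistics import mean, median


def winsorize(values, tail_fraction=0.05):
    """Return a copy of values with each tail clamped to its cutoff value.

    tail_fraction=0.05 on each side gives a 90 percent Winsorization.
    (Hypothetical helper for illustration, not a library function.)
    """
    if not 0 <= tail_fraction < 0.5:
        raise ValueError("tail_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    k = int(n * tail_fraction)  # how many values to replace in each tail
    low, high = ordered[k], ordered[n - 1 - k]  # the two cutoff values
    # Clamp every value into [low, high]; nothing is discarded.
    return [min(max(v, low), high) for v in values]


# Made-up sample of 20 department sizes (assumption, not survey data).
lawyers = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 10, 12, 15, 20, 35, 80, 400]
adjusted = winsorize(lawyers)

print("median:", median(lawyers), "->", median(adjusted))
print("mean:  ", mean(lawyers), "->", mean(adjusted))
```

In this sample the single largest value (400) is pulled down to 80, the list stays the same length, the median stays at 6, and the mean falls from 31.25 to 15.25.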

Using the data from the 652 participants so far in the GC Metrics benchmark survey, I Winsorized the figures for number of lawyers. With a 90 percent process, the small end meant changing 9 of the departments to 1 lawyer, which was the 5th percentile value (the other 23 in the bottom 5 percent were already at 1). At the high end, I changed the 32 departments with the most lawyers to the 95th percentile figure, 110 lawyers. After Winsorization, the median of 6 lawyers stayed the same but the average dropped from 25.96 lawyers to 18.03, a decline of about 30 percent, because several very large law departments were drastically Winsorized.
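The arithmetic behind those figures can be checked in a few lines. This is only a sketch: the variable names are made up, and rounding 5 percent of 652 down to a whole number of departments is an assumption about how the tails were counted.

```python
participants = 652
tail_fraction = 0.05  # 5 percent in each tail for a 90 percent Winsorization

# 5 percent of 652 is 32.6; rounding down gives the count per tail (assumption).
per_tail = int(participants * tail_fraction)

mean_before, mean_after = 25.96, 18.03
decline_pct = (mean_before - mean_after) / mean_before * 100

print("departments adjusted per tail:", per_tail)
print("decline in the average (percent):", round(decline_pct, 1))
```

The count of 32 per tail matches both the 9 plus 23 departments at the low end and the 32 at the high end, and the decline works out to roughly 30.5 percent.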