Stephen Watts (University of Manchester): A Goldilocks statistic for histograms - is your histogram just right?
https://zoom.us/j/97041428849
The histogram is a key method for visualizing data and estimating the underlying probability density function (pdf). Under-binning or over-binning leads to incorrect conclusions about the data, and software that automatically selects a bin width can fool the unwary user. Scientists often adjust the bin size until the plot looks visually appealing, which is subjective and liable to bias. Many algorithms exist to choose the bin size; the number of entries, N, and some measure of the variance are key inputs. These algorithms either apply a formula (e.g. Scott's Rule) or minimise a risk or cost function, and some are of dubious utility. The optimal histogram is the one that minimises the Mean Integrated Square Error (MISE) between the binned distribution and the actual pdf. However, the actual pdf is unknown, which is precisely why one makes a histogram in the first place. Information theory will be used to show that a histogram has a Shannon entropy of the form (1/M) log N. This leads directly to a new binning formula based on the Shannon entropy, using the differential entropy estimated from nearest-neighbour distances in the data. M is a "Goldilocks statistic": it will be shown that for M less than 2 the histogram is over-binned, for M greater than 3 it is under-binned, and for M between 2 and 3 the MISE is minimal and the histogram is just right. M can either be fixed using the new algorithm, or estimated from a histogram binned by some other technique. Shannon's source coding theorem is used to show that the optimal choice is equi-probable binning with the number of bins set to the square root of N. The new algorithm is compared to other methods and its performance demonstrated on real data. Comments are made on the application of these ideas to higher-dimensional data.
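As a rough illustration of the "estimated from a histogram binned by some other technique" route, the sketch below takes one plausible reading of the (1/M) log N form quoted above: estimate M as ln N divided by the Shannon entropy of the empirical bin probabilities, then apply the 2-to-3 Goldilocks check. The function name goldilocks_m, the use of natural logarithms, and the choice of NumPy's built-in Scott's-rule binning for the initial histogram are assumptions made here for illustration; the precise definitions are in the arXiv paper, not reproduced here.

    import numpy as np

    def goldilocks_m(data, bins="scott"):
        """Estimate a Goldilocks-style M from a binned histogram.

        Assumes the relation H ~ (1/M) * ln(N) from the abstract, with H the
        Shannon entropy of the empirical bin probabilities (nats). This is an
        illustrative reading, not the paper's exact prescription.
        """
        counts, _ = np.histogram(data, bins=bins)
        n = counts.sum()
        p = counts[counts > 0] / n          # empirical bin probabilities
        h = -np.sum(p * np.log(p))          # Shannon entropy of the histogram
        return np.log(n) / h

    # Example: a standard normal sample binned by Scott's rule
    rng = np.random.default_rng(0)
    x = rng.normal(size=10_000)
    m = goldilocks_m(x)
    if m < 2:
        verdict = "over-binned"
    elif m > 3:
        verdict = "under-binned"
    else:
        verdict = "just right"
    print(f"M = {m:.2f} -> {verdict}")

With this reading the direction of the check is self-consistent: too many bins inflate the histogram entropy and push M below 2, while too few bins shrink the entropy and push M above 3.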
For more details, see The Shannon Entropy of a Histogram, https://arxiv.org/abs/2210.02848
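For the equi-probable, square-root-of-N prescription mentioned in the abstract, a minimal sketch is given below. Building the equi-probable edges from sample quantiles, and the function name equiprobable_sqrt_n_hist, are assumptions made for illustration and not necessarily the construction used in the paper.

    import numpy as np

    def equiprobable_sqrt_n_hist(data):
        """Equi-probable binning with roughly sqrt(N) bins (quantile edges assumed)."""
        data = np.asarray(data)
        n_bins = max(1, int(round(np.sqrt(data.size))))
        edges = np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))
        edges = np.unique(edges)            # guard against repeated quantiles
        counts, edges = np.histogram(data, bins=edges)
        return counts, edges

    # Example: an exponential sample of 2,500 entries gives ~50 bins of ~50 entries each
    sample = np.random.default_rng(1).exponential(size=2_500)
    counts, edges = equiprobable_sqrt_n_hist(sample)
    print(len(counts), "bins; mean entries per bin:", counts.mean())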