
A synthetic variance designed for Hadoop and big data


Originally posted on Hadoop360, by Dr. Granville. Click here to read the original article and comments.

The new variance introduced in this article fixes two big data problems associated with the traditional variance and the way it is computed in Hadoop, which relies on a numerically unstable formula.

Synthetic Metrics

This new metric is synthetic: it was not derived naturally from mathematics like the variance taught in any statistics 101 course, or the variance currently implemented in Hadoop. By synthetic, I mean that it was built to address issues with big data (outliers) and the way many big data computations are now done: the MapReduce framework, of which Hadoop is an implementation. It is a top-down approach to metric design (from data to theory), rather than the traditional bottom-up approach (from theory to data).

Other synthetic metrics designed in our research laboratory include:

Hadoop, numerical and statistical stability

There are two issues with the formula used for computing the variance in Hadoop. First, the formula used, namely Var(x1, ..., xn) = SUM(xi^2)/n - (SUM(xi)/n)^2, is notoriously unstable. For large n, each of the two terms can be huge because of the squares aggregated over billions of observations, while their difference is comparatively tiny; subtracting two nearly equal large numbers in floating point magnifies round-off error. The result is numerical inaccuracies, with people having reported negative variances. Read the comments attached to my article The curse of Big Data for details. Besides, there are variance formulas that do not require two passes over the entire data set and that are numerically stable.
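The original article does not show code, but a short sketch illustrates the point. The snippet below (my own illustration, not taken from Hadoop's source) compares the unstable one-pass formula above with Welford's online algorithm, one of the well-known single-pass, numerically stable alternatives; Welford's running statistics can also be merged pairwise, which makes it a natural fit for MapReduce-style combiners.

```python
import random

def naive_variance(xs):
    """Unstable textbook formula: Var = E[x^2] - (E[x])^2.
    Subtracting two huge, nearly equal terms loses precision
    and can even produce a negative result."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return sum_x2 / n - (sum_x / n) ** 2

def welford_variance(xs):
    """Welford's online algorithm: one pass, numerically stable."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n if n else float("nan")

if __name__ == "__main__":
    # Data with a large offset: the true variance is about 1, but the
    # naive formula subtracts two terms of order 1e18 and loses precision.
    random.seed(0)
    data = [1e9 + random.gauss(0, 1) for _ in range(1_000_000)]
    print("naive  :", naive_variance(data))
    print("welford:", welford_variance(data))
```

On data like this, the naive formula typically returns a wildly wrong value (sometimes zero or negative), while Welford's estimate stays close to the true variance of roughly 1.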

Read full article.

Source: http://www.datasciencecentral.com/xn/detail/6448529:Topic:172299
