Stats aggregator

The aggregator uses the same algorithm as Apache Hive. The following description is taken from GenericUDAFVariance in Hive.

Evaluate the variance using the algorithm described by Chan, Golub, and LeVeque in “Algorithms for computing the sample variance: analysis and recommendations”, The American Statistician, 37 (1983), pp. 242–247.

variance = variance1 + variance2 + n/(m*(m+n)) * pow(((m/n)*t1 - t2),2)

variance is sum((x - avg)^2) (this is actually n times the variance) and is updated at every step.
n is the count of elements in chunk1
m is the count of elements in chunk2
t1 is the sum of elements in chunk1
t2 is the sum of elements in chunk2

This algorithm was proven to be numerically stable by J.L. Barlow in “Error analysis of a pairwise summation algorithm to compute sample variance”, Numer. Math., 58 (1991), pp. 583–590.
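The merge step described above can be sketched in Python as follows. This is only an illustration of the formula, not the Hive or aggregator source code; the names `VarianceState`, `from_values`, and `merge` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class VarianceState:
    count: int        # number of elements seen so far
    total: float      # sum of the elements
    nvariance: float  # sum((x - avg)^2), i.e. n times the variance


def from_values(values):
    """Build a state directly from one chunk (two-pass, for illustration only)."""
    n = len(values)
    if n == 0:
        return VarianceState(0, 0.0, 0.0)
    total = float(sum(values))
    avg = total / n
    return VarianceState(n, total, sum((x - avg) ** 2 for x in values))


def merge(a, b):
    """Combine two partial states using the pairwise update formula."""
    if a.count == 0:
        return b
    if b.count == 0:
        return a
    n, m = a.count, b.count   # element counts of chunk1 and chunk2
    t1, t2 = a.total, b.total  # sums of chunk1 and chunk2
    correction = n / (m * (m + n)) * ((m / n) * t1 - t2) ** 2
    return VarianceState(n + m, t1 + t2, a.nvariance + b.nvariance + correction)


chunk1 = [1.0, 2.0, 3.0, 4.0]
chunk2 = [10.0, 20.0]
merged = merge(from_values(chunk1), from_values(chunk2))
direct = from_values(chunk1 + chunk2)
```

Merging the two partial states should give the same sum of squared deviations as computing it over the concatenated data, up to floating-point rounding. This is what allows partial results to be combined without revisiting the raw values.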

Users can specify the expected input type for ingestion as one of “float”, “double”, “long”, or “variance”; the default is “float”.

To query for results, the query must include either a “variance” aggregator with the “variance” input type or simply a “varianceFold” aggregator.

{
  "type" : "varianceFold",
  "name" : <output_name>,
  "fieldName" : <metric_name>,
  "estimator" : <string>
}

To obtain the standard deviation from the variance, use the “stddev” post-aggregator.
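As a rough sketch of the arithmetic involved: the standard deviation is the square root of the variance, and the variance itself is the running sum of squared deviations divided by either n - 1 (sample) or n (population). The function names below are hypothetical, and treating "population" as the alternative estimator value is an assumption about the “estimator” parameter rather than something this page states.

```python
import math


def variance_from_state(count, nvariance, estimator="sample"):
    """Turn sum((x - avg)^2) into a variance.

    Assumption: "population" divides by n; anything else divides by n - 1.
    """
    if estimator == "population":
        return nvariance / count if count > 0 else float("nan")
    return nvariance / (count - 1) if count > 1 else float("nan")


def stddev(variance):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance)


# Four values whose squared deviations from the mean sum to 5.0
sample_var = variance_from_state(4, 5.0)
population_var = variance_from_state(4, 5.0, estimator="population")
sample_std = stddev(sample_var)
```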

Query examples:

{
  "queryType": "timeseries",
  "dataSource": "testing",
  "granularity": "day",
  "aggregations": [
    {
      "type": "variance",
      "name": "index_var",
      "fieldName": "index_var"
    }
  ],
  "intervals": [
    "2016-03-01T00:00:00.000/2016-03-20T00:00:00.000"
  ]
}

{
  "queryType": "groupBy",
  "dataSource": "testing",
  "dimensions": ["alias"],
  "aggregations": [
    {
      "type": "variance",
      "name": "index_var",
      "fieldName": "index"
    }
  ],
  "postAggregations": [
    {
      "type": "stddev",
      "name": "index_stddev",
      "fieldName": "index_var"
    }
  ],
  "intervals": [
    "2016-03-06T00:00:00/2016-03-06T23:59:59"
  ]