Stats aggregator
The aggregator uses the same algorithm as Apache Hive. The following description is taken from GenericUDAFVariance in Hive.
Evaluate the variance using the algorithm described by Chan, Golub, and LeVeque in “Algorithms for computing the sample variance: analysis and recommendations”, The American Statistician, 37 (1983), pp. 242-247.
variance = variance1 + variance2 + n/(m*(m+n)) * pow(((m/n)*t1 - t2), 2)
where:
- variance is sum[(x - avg)^2] (this is actually n times the variance) and is updated at every step
- n is the count of elements in chunk1
- m is the count of elements in chunk2
- t1 is the sum of elements in chunk1
- t2 is the sum of elements in chunk2
This algorithm was proven to be numerically stable by J.L. Barlow in “Error analysis of a pairwise summation algorithm to compute sample variance”, Numer. Math, 58 (1991), pp. 583-590.
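The merge step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Hive or aggregator implementation: the `Chunk` class and the `summarize` and `merge` functions are hypothetical names introduced here, and `nvar` stands for the running "variance" value from the text (the sum of squared deviations, that is, n times the variance).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Chunk:
    """Hypothetical summary of one chunk of values."""
    count: int    # number of elements (n or m in the text)
    total: float  # sum of elements (t1 or t2 in the text)
    nvar: float   # sum of squared deviations from the chunk mean


def summarize(values: Iterable[float]) -> Chunk:
    """Build a summary directly from raw values (two-pass, used as a reference)."""
    vals = list(values)
    count = len(vals)
    total = sum(vals)
    mean = total / count if count else 0.0
    nvar = sum((x - mean) ** 2 for x in vals)
    return Chunk(count, total, nvar)


def merge(c1: Chunk, c2: Chunk) -> Chunk:
    """Combine two summaries using the pairwise update formula."""
    # An empty chunk contributes nothing, and would divide by zero below.
    if c1.count == 0:
        return c2
    if c2.count == 0:
        return c1
    n, m = c1.count, c2.count
    t1, t2 = c1.total, c2.total
    correction = n / (m * (m + n)) * ((m / n) * t1 - t2) ** 2
    return Chunk(n + m, t1 + t2, c1.nvar + c2.nvar + correction)


left = summarize([1.0, 2.0, 3.0, 4.0])
right = summarize([10.0, 20.0])
merged = merge(left, right)
direct = summarize([1.0, 2.0, 3.0, 4.0, 10.0, 20.0])
print(merged.nvar, direct.nvar)  # the two values agree up to rounding
```

Because only the count, the sum, and the running sum of squared deviations are needed, chunks can be summarized independently and merged later without revisiting the raw values.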
The expected input type for ingestion can be specified as one of “float”, “double”, “long”, or “variance”; the default is “float”.
To query for results, include either a “variance” aggregator with the “variance” input type or simply a “varianceFold” aggregator in the query.
{
  "type" : "varianceFold",
  "name" : <output_name>,
  "fieldName" : <metric_name>,
  "estimator" : <string>
}
To obtain the standard deviation from the variance, use the “stddev” post-aggregator.
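The relationship between the running value, the variance, and the standard deviation can be sketched as below. This is an illustration only: `finalize_variance` and `stddev` are hypothetical helpers, and the sketch assumes that the estimator selects between the sample variance (divide by n - 1) and the population variance (divide by n), which this section does not spell out. The estimator names used here are placeholders, not confirmed option values.

```python
import math


def finalize_variance(nvar: float, count: int, estimator: str = "sample") -> float:
    """Turn the running sum of squared deviations into a variance.

    Assumption: "population" divides by n, anything else divides by n - 1.
    """
    if estimator == "population":
        return nvar / count if count > 0 else 0.0
    # The sample variance is undefined for fewer than two elements; return 0.0 here.
    return nvar / (count - 1) if count > 1 else 0.0


def stddev(variance: float) -> float:
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance)


sample_var = finalize_variance(10.0, 5)                    # 10 / 4
population_var = finalize_variance(10.0, 5, "population")  # 10 / 5
print(sample_var, population_var, stddev(sample_var))
```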
Query examples:
{
  "queryType": "timeseries",
  "dataSource": "testing",
  "granularity": "day",
  "aggregations": [
    {
      "type": "variance",
      "name": "index_var",
      "fieldName": "index_var"
    }
  ],
  "intervals": [
    "2016-03-01T00:00:00.000/2016-03-20T00:00:00.000"
  ]
}
{
  "queryType": "groupBy",
  "dataSource": "testing",
  "dimensions": ["alias"],
  "aggregations": [
    {
      "type": "variance",
      "name": "index_var",
      "fieldName": "index"
    }
  ],
  "postAggregations": [
    {
      "type": "stddev",
      "name": "index_stddev",
      "fieldName": "index_var"
    }
  ],
  "intervals": [
    "2016-03-06T00:00:00/2016-03-06T23:59:59"
  ]