HyperLogLog Functions

    Presto implements HyperLogLog data sketches as a set of 32-bit buckets which store a maximum hash. They can be stored sparsely (as a map from bucket ID to bucket), or densely (as a contiguous memory block). The HyperLogLog data structure starts as the sparse representation, switching to dense when it is more efficient. The P4HyperLogLog structure is initialized densely and remains dense for its lifetime.
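
    The sparse-to-dense switch described above can be illustrated with a toy model. This is a minimal sketch, not Presto's implementation: the class name ToySketch, the bucket count, the hash function (SHA-256), and the dense_after threshold are all assumptions chosen for readability. It only shows that the same bucket contents can live in a map or in a contiguous list.

```python
import hashlib


class ToySketch:
    """Toy HyperLogLog-style sketch (illustrative only, not Presto's code)."""

    def __init__(self, index_bits=4, dense_after=8):
        self.index_bits = index_bits        # 2**index_bits buckets
        self.num_buckets = 1 << index_bits
        self.dense_after = dense_after      # hypothetical switch-over point
        self.sparse = {}                    # sparse form: bucket id -> bucket value
        self.dense = None                   # dense form: list of bucket values

    def _hash64(self, value):
        # Assumed hash: first 8 bytes of SHA-256 as a 64-bit integer.
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        h = self._hash64(value)
        rest_bits = 64 - self.index_bits
        bucket = h >> rest_bits                     # top bits pick the bucket
        rest = h & ((1 << rest_bits) - 1)
        # Rank: 1-based position of the first 1 bit in the remaining bits.
        rank = rest_bits - rest.bit_length() + 1
        if self.dense is not None:
            self.dense[bucket] = max(self.dense[bucket], rank)
            return
        self.sparse[bucket] = max(self.sparse.get(bucket, 0), rank)
        if len(self.sparse) > self.dense_after:
            self._to_dense()

    def _to_dense(self):
        # Copy the map into a contiguous list; bucket contents are unchanged.
        self.dense = [0] * self.num_buckets
        for bucket, rank in self.sparse.items():
            self.dense[bucket] = rank
        self.sparse = None

    def is_dense(self):
        return self.dense is not None

    def buckets(self):
        """Return all bucket values as a list, whatever the representation."""
        if self.dense is not None:
            return list(self.dense)
        return [self.sparse.get(i, 0) for i in range(self.num_buckets)]


small = ToySketch()
small.add("user-1")          # one value: stays sparse

large = ToySketch()
for user_id in range(1000):  # many values: switches to dense
    large.add(user_id)
```

    In this toy model a sketch that always starts dense, as P4HyperLogLog does, would simply skip the map and allocate the list up front.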

    HyperLogLog implicitly casts to P4HyperLogLog, and one can also cast explicitly with cast(hll AS P4HyperLogLog).

    Data sketches can be serialized to and deserialized from varbinary. This allows them to be stored for later use. Combined with the ability to merge multiple sketches, this allows one to calculate approx_distinct() of the elements of a partition of a query, then for the entirety of a query with very little cost.

    For example, calculating the HyperLogLog for daily unique users will allow weekly or monthly unique users to be calculated incrementally by combining the dailies. This is similar to computing weekly revenue by summing daily revenue. Uses of approx_distinct() with GROUPING SETS can be converted to use HyperLogLog. Examples:

    CREATE TABLE visit_summaries (
      visit_date date,
      hll varbinary
    );

    -- user_id is an assumed column name in user_visits
    INSERT INTO visit_summaries
    SELECT visit_date, cast(approx_set(user_id) AS varbinary)
    FROM user_visits
    GROUP BY visit_date;

    SELECT cardinality(merge(cast(hll AS HyperLogLog))) AS weekly_unique_users
    FROM visit_summaries
    WHERE visit_date >= current_date - interval '7' day;
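
    Why combining the dailies gives the same answer as sketching the whole week can be seen in a toy model. The following is a minimal sketch under stated assumptions, not Presto's implementation: the bucket count, the SHA-256 hash, the helper names sketch and merge, and the visitor ID ranges are all hypothetical. Each bucket keeps a maximum, so a union of sketches is a bucket-wise maximum, and a user who appears on several days is not counted twice.

```python
import hashlib

INDEX_BITS = 6                     # 64 buckets; a hypothetical toy size
NUM_BUCKETS = 1 << INDEX_BITS
REST_BITS = 64 - INDEX_BITS


def sketch(values):
    """Build a dense toy sketch: one maximum 'rank' per bucket."""
    buckets = [0] * NUM_BUCKETS
    for value in values:
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        bucket = h >> REST_BITS                    # top bits pick the bucket
        rest = h & ((1 << REST_BITS) - 1)
        rank = REST_BITS - rest.bit_length() + 1   # position of first 1 bit
        buckets[bucket] = max(buckets[bucket], rank)
    return buckets


def merge(sketches):
    """Union of sketches: bucket-wise maximum."""
    merged = [0] * NUM_BUCKETS
    for s in sketches:
        merged = [max(a, b) for a, b in zip(merged, s)]
    return merged


# Hypothetical daily visitor IDs; some users visit on more than one day.
daily_users = {
    "mon": range(0, 500),
    "tue": range(300, 900),
    "wed": range(800, 1200),
}
daily_sketches = {day: sketch(users) for day, users in daily_users.items()}

weekly_from_dailies = merge(daily_sketches.values())
weekly_from_raw = sketch(range(0, 1200))   # sketch of the union of all days
```

    Because the cardinality estimate depends only on the bucket contents, identical buckets give identical estimates, which is what makes storing one small sketch per day sufficient.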

    approx_set(x) → HyperLogLog

    approx_set(x, e) → HyperLogLog

    Returns the HyperLogLog sketch of the input data set of x. When e is given, the sketch has a maximum standard error of e; the current implementation requires e to be in the range [0.0040625, 0.26000]. This data sketch underlies approx_distinct() and can be stored and used later by calling cardinality().
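
    The bounds on e are easier to read as bucket counts. The sketch below assumes the textbook HyperLogLog relation, a standard error of roughly 1.04 / sqrt(m) for m buckets, with m restricted to powers of two; that Presto maps e to a bucket count in exactly this way is an assumption, and buckets_for_error is a hypothetical helper name. Under that assumption the two ends of the allowed range correspond to 16 and 65536 buckets.

```python
import math


def buckets_for_error(max_standard_error, tolerance=1e-9):
    """Smallest power-of-two bucket count m with 1.04 / sqrt(m) <= e.

    The small relative tolerance guards against floating-point rounding
    when e sits exactly on a boundary.
    """
    index_bits = 0
    while 1.04 / math.sqrt(1 << index_bits) > max_standard_error * (1 + tolerance):
        index_bits += 1
    return 1 << index_bits


loosest = buckets_for_error(0.26)         # upper end of the allowed range
tightest = buckets_for_error(0.0040625)   # lower end of the allowed range
```

    A smaller e therefore costs memory: halving the error requires about four times as many buckets.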

    cardinality(hll) → bigint

    Returns the approximate number of distinct values, equivalent to running approx_distinct() on the data summarized by the hll HyperLogLog data sketch.

    empty_approx_set() → HyperLogLog

    empty_approx_set(e) → HyperLogLog

    Returns an empty HyperLogLog with a maximum standard error of e. The current implementation of this function requires that e be in the range of [0.0040625, 0.26000].

    merge(HyperLogLog) → HyperLogLog

    Returns the HyperLogLog of the aggregate union of the individual hll HyperLogLog structures.

    merge_hll(array(HyperLogLog)) → HyperLogLog