Approximate count distinct overview

Approximate count distinct functions estimate the number of distinct values in a dataset. For large datasets and datasets with high cardinality (many distinct values), the estimate can be much more efficient in both CPU and memory than an exact count using count(DISTINCT). The estimation uses the HyperLogLog++ algorithm.

This function group uses the two-step aggregation pattern. In addition to the usual aggregate function, hyperloglog, it includes the alternate aggregate function approx_count_distinct, which sets reasonable default values; try it if you aren't sure what parameters to set for the hyperloglog. Both produce a hyperloglog aggregate, which can then be used with the accessor and rollup functions in this group.
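As a rough illustration of the trade-off described above, the following sketch compares an exact count with an estimate built by approx_count_distinct. It assumes the extension that provides these functions is installed, and it casts the values to text only to mirror the sample later on this page. The input is synthetic data from generate_series.

```sql
-- Compare an exact distinct count with a hyperloglog-based estimate.
-- approx_count_distinct() builds the hyperloglog with default parameters;
-- distinct_count() reads the estimate back out of it.
SELECT
    count(DISTINCT v)                              AS exact_count,
    distinct_count(approx_count_distinct(v::text)) AS approx_count
FROM generate_series(1, 100000) AS v;
```

The two columns should be close but will generally not be identical, because the second is an estimate.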

Two-step aggregation

This group of functions uses the two-step aggregation pattern. Rather than calculating the final result in one step, you first create an intermediate aggregate by using the aggregate function. Then, use any of the accessors on the intermediate aggregate to calculate a final result. You can also roll up multiple intermediate aggregates with the rollup functions. The two-step aggregation pattern has several advantages:

More efficient because multiple accessors can reuse the same aggregate
Easier to reason about performance, because aggregation is separate from final computation
Easier to understand when calculations can be rolled up into larger intervals, especially in window functions and continuous aggregates
Enables retrospective analysis even when the underlying data is dropped, because the intermediate aggregate stores extra information not available in the final result
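The pattern can be sketched with this group's own functions. The example below is a minimal illustration, not a recommended schema: the events data, the user_id column, and the ten-day window are all hypothetical and generated inline so that the query is self-contained.

```sql
-- Hypothetical input: 100,000 events spread over 10 days.
WITH events AS (
    SELECT
        '2024-01-01'::timestamptz + (i % 10) * interval '1 day' AS ts,
        (i % 20000)::text                                       AS user_id
    FROM generate_series(1, 100000) AS i
),
daily AS (
    -- Step 1: build one intermediate hyperloglog per day.
    SELECT
        date_trunc('day', ts)      AS day,
        hyperloglog(4096, user_id) AS hll
    FROM events
    GROUP BY 1
)
-- Step 2: roll the daily aggregates up and apply an accessor.
SELECT distinct_count(rollup(hll)) AS approx_users_all_days
FROM daily;
```

Because the daily aggregates are kept, the same rows could also be queried one day at a time with distinct_count(hll), without rescanning the raw events.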

To learn more, see the blog post on two-step aggregates.

Samples

Roll up two hyperloglogs

This example rolls up two hyperloglogs. The first hyperloglog buckets the integers from 1 to 100,000, and the second buckets the integers from 50,000 to 150,000. Accounting for the overlap between the two ranges, the exact number of distinct values in the combined set is 150,000. Calling distinct_count on the rolled-up hyperloglog yields 150,552, so the approximation is off by only 0.368%:

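-- Build one hyperloglog per input range, then merge them with rollup()
-- and read the estimate with distinct_count().
SELECT distinct_count(rollup(logs))
FROM (
    -- 4096 buckets; values 1 to 100,000
    (SELECT hyperloglog(4096, v::text) logs FROM generate_series(1, 100000) v)
    UNION ALL
    -- 4096 buckets; values 50,000 to 150,000 (overlaps the first range)
    (SELECT hyperloglog(4096, v::text) FROM generate_series(50000, 150000) v)
) hll;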

Output:

 distinct_count
----------------
         150552

Approximate relative errors

These are the approximate relative errors for each number of buckets. The number of registers (buckets) is 2 raised to the precision:

precision	registers (number of buckets)	relative error	column size (bytes)
4	16	0.2600	12
5	32	0.1838	24
6	64	0.1300	48
7	128	0.0919	96
8	256	0.0650	192
9	512	0.0460	384
10	1024	0.0325	768
11	2048	0.0230	1536
12	4096	0.0163	3072
13	8192	0.0115	6144
14	16384	0.0081	12288
15	32768	0.0057	24576
16	65536	0.0041	49152
17	131072	0.0029	98304
18	262144	0.0020	196608
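The values in this table appear consistent with the commonly cited HyperLogLog error estimate of about 1.04 / sqrt(registers), with 6 bits of storage per register. Both the 1.04 constant and the 6-bit register width are assumptions inferred from the table, not statements from this page. Under those assumptions, a plain PostgreSQL query can regenerate the rows:

```sql
-- Regenerate the error table from the precision alone.
-- Assumptions: relative error ~ 1.04 / sqrt(registers), 6 bits per register.
SELECT
    p                                        AS precision,
    (1 << p)                                 AS registers,
    round((1.04 / sqrt(1 << p))::numeric, 4) AS relative_error,
    (1 << p) * 6 / 8                         AS column_size_bytes
FROM generate_series(4, 18) AS p;
```

This can help when choosing a bucket count: each step up in precision doubles the column size while reducing the expected error by a factor of about 1.41.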

Available functions

Aggregate

hyperloglog(): aggregate data into a hyperloglog for approximate counting

Alternate aggregate

approx_count_distinct(): aggregate data into a hyperloglog without specifying the number of buckets

Accessors

distinct_count(): estimate the number of distinct values from a hyperloglog
stderror(): estimate the relative standard error of a hyperloglog
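Both accessors can be applied to the same hyperloglog, so the aggregate only needs to be built once. The sketch below uses synthetic input from generate_series and a bucket count of 4096 chosen purely for illustration.

```sql
-- Build the hyperloglog once, then read two results from it.
WITH agg AS (
    SELECT hyperloglog(4096, v::text) AS hll
    FROM generate_series(1, 100000) AS v
)
SELECT
    distinct_count(hll) AS approx_distinct,   -- estimated number of distinct values
    stderror(hll)       AS relative_stderror  -- estimated relative standard error
FROM agg;
```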

Rollup

rollup(): combine multiple hyperloglogs
