RedisBloom

Bloom filters and other probabilistic data structures for Redis

RedisBloom contains a set of useful probabilistic data structures. Probabilistic data structures allow developers to control the accuracy of returned results while gaining performance and reducing memory. These data structures are ideal for analyzing streaming data and large datasets.

Use these data structures to answer a set of common questions concerning data streams:

HyperLogLog: How many unique values appeared so far in the data stream?
Bloom filter and Cuckoo filter: Did the value v already appear in the data stream?
Count-min sketch: How many times did the value v appear in the data stream?
Top-K: What are the k most frequent values in the data stream?
t-digest can be used to answer these questions:
- What fraction of the values in the data stream are smaller than a given value?
- How many values in the data stream are smaller than a given value?
- Which value is smaller than p percent of the values in the data stream? (what is the p-percentile value)?
- What is the mean value between the p1-percentile value and the p2-percentile value?
- What is the value of the n-th smallest / largest value in the data stream? (what is the value with [reverse] rank n)?

Answering each of these questions accurately can require a huge amount of memory, but if you are willing to sacrifice accuracy, you can reduce the memory requirements drastically. Each of these data structures allows you to set a controllable tradeoff between accuracy and memory consumption.

Bloom vs. Cuckoo filters

Bloom filters typically exhibit better performance and scalability when inserting items (so if you're often adding items to your dataset, then a Bloom filter may be ideal). Cuckoo filters are quicker on check operations and also allow deletions.

About t-digest

Using t-digest is simple and straightforward:

Creating a sketch and adding observations

TDIGEST.CREATE key [COMPRESSION compression] initializes a new t-digest sketch (and error if such key already exists). The COMPRESSION argument is used to specify the tradeoff between accuracy and memory consumption. The default is 100. Higher values mean more accuracy.

TDIGEST.ADD key value... adds new observations (floating-point values) to the sketch. You can repeat calling TDIGEST.ADD whenever new observations are available.
Estimating fractions or ranks by values

Use TDIGEST.CDF key value... to retrieve, for each input value, an estimation of the fraction of (observations smaller than the given value + half the observations equal to the given value).

TDIGEST.RANK key value... is similar to TDIGEST.CDF, but used for estimating the number of observations instead of the fraction of observations. More accurately it returns, for each input value, an estimation of the number of (observations smaller than a given value + half the observations equal to the given value).

And lastly, TDIGEST.REVRANK key value... is similar to TDIGEST.RANK, but returns, for each input value, an estimation of the number of (observations larger than a given value + half the observations equal to the given value).
Estimating values by fractions or ranks

TDIGEST.QUANTILE key fraction... returns, for each input fraction, an estimation of the value (floating point) that is smaller than the given fraction of observations.

TDIGEST.BYRANK key rank... returns, for each input rank, an estimation of the value (floating point) with that rank.

TDIGEST.BYREVRANK key rank... returns, for each input reverse rank, an estimation of the value (floating point) with that reverse rank.
Estimating trimmed mean

Use TDIGEST.TRIMMED_MEAN key lowFraction highFraction to retrieve an estimation of the mean value between the specified fractions.

This is especially useful for calculating the average value ignoring outliers. For example - calculating the average value between the 20th percentile and the 80th percentile.
Merging sketches

Sometimes it is useful to merge sketches. For example, suppose we measure latencies for 3 servers, and we want to calculate the 90%, 95%, and 99% latencies for all the servers combined.

TDIGEST.MERGE destKey numKeys sourceKey... [COMPRESSION compression] [OVERRIDE] merges multiple sketches into a single sketch.

If destKey does not exist - a new sketch is created.

If destKey is an existing sketch, its values are merged with the values of the source keys. To override the destination key contents, use OVERRIDE.
Retrieving sketch information

Use TDIGEST.MIN key and TDIGEST.MAX key to retrieve the minimal and maximal values in the sketch, respectively.

Both return nan when the sketch is empty.

Both commands return accurate results and are equivalent to TDIGEST.BYRANK key 0 and TDIGEST.BYREVRANK key 0 respectively.

Use TDIGEST.INFO key to retrieve some additional information about the sketch.
Resetting a sketch

TDIGEST.RESET key empties the sketch and re-initializes it.

Academic sources

Bloom Filter

Space/Time Trade-offs in Hash Coding with Allowable Errors by Burton H. Bloom.
Scalable Bloom Filters

Cuckoo Filter

Cuckoo Filter: Practically Better Than Bloom

Count-Min Sketch

An Improved Data Stream Summary: The Count-Min Sketch and its Applications

Top-K

HeavyKeeper: An Accurate Algorithm for Finding Top-k Elephant Flows.

t-digest

The t-digest: Efficient estimates of distributions

References

Webinars

Probabilistic Data Structures - The most useful thing in Redis you probably aren't using

Blog posts

Mailing List / Forum

Got questions? Feel free to ask at the RedisBloom forum.

License

RedisBloom is licensed under the Redis Source Available License 2.0 (RSALv2) or the Server Side Public License v1 (SSPLv1).

Commands

Commands Overview

Quick start

"Quick start guide"

Client Libraries

RedisBloom has several client libraries, written by the module authors and community members - abstracting the API in different programming languages. While it is possible and simple to use the raw Redis commands API, in most cases it's easier to just use a client library abstracting it.

Configuration Parameters

RedisBloom supports multiple module configuration parameters. All of these parameters can only be set at load-time.