How we use Bloom filters to check for profanity
Do you worry about randomly generated strings containing not-so-savory language? We do! So, we had some fun over-engineering a solution to guard our strings from profanity.
By the Offsyte Team, January 12, 2022
Imagine this: you’ve built a data model that relies on some form of randomly generated strings. Think: coupon codes or reservation IDs. At Offsyte, we have many of these types of strings, but among the most important are our booking IDs – a randomly generated 6-character string.
Everything goes smoothly until one of these strings contains a bad word. Now, your customer is sharing links containing this ID with others in reference to your product. As funny as this could be – depending on the gravity of the word – it has the potential to set a bad tone for customers and the brand.
We wanted to tackle this problem before anything offensive could sneak its way into the system, and we settled on a technically interesting and efficient solution.
In pseudocode, this is a very simple problem to solve:
def generate() -> str:
    while True:
        candidate = random_string()
        if not contains_profanity(candidate):
            return candidate
We’re going to focus on the contains_profanity() function in this blog post. The naive implementation looks like this:
- Download a list of bad words and check it into the codebase
- Load this list of bad words into memory
- Check if any substring in the candidate matches one of the bad words
This is a perfectly acceptable solution. It works, and even though memory usage would scale linearly with the size of the list, it would be manageable.
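As a sketch, the naive set-based check described above might look like this (the word list and function name here are illustrative, not our production code):

```python
# Stand-in for a real downloaded word list; "dog" matches the example
# used later in this post.
BAD_WORDS = {"dog", "cat"}

def contains_profanity_naive(candidate: str) -> bool:
    lowered = candidate.lower()
    n = len(lowered)
    # Check every substring of length >= 3 against the set.
    for start in range(n):
        for end in range(start + 3, n + 1):
            if lowered[start:end] in BAD_WORDS:
                return True
    return False
```

Memory usage is proportional to the size of the word list, which is what motivates the Bloom filter approach below.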
Here’s where we decided to have a bit of fun to expand our engineering knowledge: how could we solve this without checking in the list of bad words, and with better than O(n) space efficiency? The answer involves a Bloom filter.
What is a Bloom filter?
To summarize from Wikipedia: A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
A Bloom filter can tell you if a given element is possibly a member of the set, or is definitely not a member of the set. We interact with Bloom filter implementations all the time, for example:
- CDN caches, for only storing keys that are accessed more than once
- Databases, for reducing disk lookups for non-existent rows
- Web browsers, for checking against a set of malicious URLs
At its core, a Bloom filter is a bit array. Items added to the set are hashed multiple times, and each hash output maps to a position in the bit array.
To check an element, we run it through the same hash functions and inspect the corresponding positions in the bit array. Since multiple items can hash to the same positions, we can only know that an element is possibly in the set. However, if any of the bits are not set when checking, then we know for certain that the element is not in the set. For a deeper explanation, we recommend reading this amazing tutorial.
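The mechanics above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation we adapted: the size and hash count are arbitrary, and real implementations derive them from a target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit array plus k salted hashes."""

    def __init__(self, size: int = 8192, num_hashes: int = 4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per bit, for simplicity

    def _positions(self, item: str):
        # Derive k positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def check(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))
```

Note the asymmetry: a fresh filter with no bits set always answers False, while a True answer only means the element might be present.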
First, we needed an implementation of the Bloom filter data structure. After looking at libraries and other approaches, we decided to adapt a reference implementation from a blog post about Bloom filters in Python. This gave us the flexibility to extend it as needed.
bloom_filter = BloomFilter()
bloom_filter.add("word")
assert bloom_filter.check("word")       # => True
assert not bloom_filter.check("other")  # => False
Now, we need to populate this Bloom filter. We wrote a small script that builds a Bloom filter containing all of the bad words. The script does the following:
- Download a profane words text file from GitHub. We use this list (warning: this list has a lot of bad words), and mix in our own internal list
- Load all the words into our new Bloom filter data structure
- Base64 encode the bit array and include the parameters (e.g., item count and false positive probability) in a JSON blob, which is checked in to the codebase
At runtime, we simply load the JSON blob back into a Bloom filter instance (once), and then we can use it in our contains_profanity() function.
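The round trip can be sketched as follows. The field names here are our own guesses, not the exact blob format we use, and the functions assume the filter exposes its raw bit array and parameters:

```python
import base64
import json

def dump_filter(bits: bytearray, item_count: int, fp_rate: float) -> str:
    """Serialize the filter's bit array and parameters to a JSON blob."""
    return json.dumps({
        "bits": base64.b64encode(bytes(bits)).decode("ascii"),
        "item_count": item_count,
        "fp_rate": fp_rate,
    })

def load_filter(blob: str):
    """Restore the bit array and parameters from the JSON blob."""
    data = json.loads(blob)
    bits = bytearray(base64.b64decode(data["bits"]))
    return bits, data["item_count"], data["fp_rate"]
```

Checking the blob into the codebase means the raw word list never has to live in the repository, and loading it is a single decode at startup.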
The final piece of the puzzle is to use the Bloom filter to implement our contains_profanity() function.
We have a function that tells us whether a word is definitely not a bad word, but we can't just check the whole string once; we have to check every substring, since a bad word can be embedded anywhere in the candidate. (The simple set-based approach would require this too.)
If we assume that DOG is a bad word, then for the string LUVDOG, we need to check: LUVDO, UVDOG, LUVD, … and so on. We created a helper function, outlined in this gist, to do the substring generation. With that helper in place, our contains_profanity() function looks like this:
def contains_profanity(string) -> bool:
    for substr in gen_substrings(string, min_length=3):
        if bloom_filter.check(substr):
            # Even if it _might_ be contained, we assume it is, and move on
            return True
    return False
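A generator like the one below would satisfy the gen_substrings() signature used above. This is only a plausible sketch of the helper, not the gist itself; in particular, the lowercasing step is our assumption about how case is normalized:

```python
def gen_substrings(string: str, min_length: int = 3):
    """Yield every substring of at least min_length characters.

    Lowercases the input first (an assumption here), so matching
    against a lowercase word list is case-insensitive.
    """
    lowered = string.lower()
    for start in range(len(lowered)):
        for end in range(start + min_length, len(lowered) + 1):
            yield lowered[start:end]
```

For a 6-character booking ID with min_length=3, this yields only ten substrings, so the per-ID cost stays small.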
This seems like a lot to do for something that could have been a regex or a check against a set of words. However, it's worth thinking beyond the easiest approach and making deliberate trade-offs about how to spend engineering time. We decided to do this for the following reasons:
Our own learning and understanding
Investing time into learning a new data structure or approach to engineering a solution is valuable. This exercise allowed us to think outside normal conventions and create a solution that fit our needs better while also being more performant. Plus, it's maintainable: the Bloom filter clocked in at less than 100 lines of code, and is sectioned off behind a nice interface.
Scalability and Adaptability
This solution scales. As our ID generation use cases grow, the code path that checks for profanity will remain fast, memory efficient, and predictable.
Additionally, we can adapt this to our needs – maybe we want to mix in other words, or include multiple languages, or include derivations of words. This is trivially possible using our Bloom filter generation script.
Engineering can be fun! At Offsyte, we believe that pursuing interesting and technically expansive solutions is a great way to grow and stay engaged at work. Not to mention, it makes for great blog content to share.