1133. Data Structure - Bloom Filter
Bloom Filter
Bloom filter and its implementation.
1. Bloom Filter
Bloom filter
is a data structure designed to rapidly and memory-efficiently tell whether an element is present in a set.
The cost paid for this efficiency is that a Bloom filter is a probabilistic data structure
: it tells us that the element either definitely is not in the set or may be in the set.
The base data structure of a Bloom filter is a Bit Vector
.
2. Implementation
Use an integer array as bit vector. Each cell has the value either 0 or 1. One key will be hashed by three hash functions, then it will be stored into the array.
public class BloomFilter { private int capacity; private int[] array; public BloomFilter(int capacity) { this.capacity = capacity; array = new int[capacity]; } public void add(String key) { int first = hash_function1(key); int second = hash_function2(key); int third = hash_function3(key); array[first%capacity] = 1; array[second%capacity] = 1; array[third%capacity] = 1; } public boolean contains(String key) { int first = hash_function1(key); int second = hash_function2(key); int third = hash_function3(key); int firstIndex = array[first % capacity]; if (firstIndex == 0) { return false; } int secondIndex = array[second % capacity]; if (secondIndex == 0) { return false; } int thirdIndex = array[third % capacity]; if (thirdIndex == 0) { return false; } return true; } private int hash_function1(String key) { int hash = 0; for (int i = 0; i < key.length(); ++i) { hash = 33 * hash + key.charAt(i); } return Math.abs(hash); } private int hash_function2(String key) { final int p = 16777619; int hash = (int) 2166136261L; for (int i = 0; i < key.length(); ++i) { hash = (hash ^ key.charAt(i)) * p; } hash += hash << 13; hash ^= hash >> 7; hash += hash << 3; hash ^= hash >> 17; hash += hash << 5; return Math.abs(hash); } private int hash_function3(String key) { int hash, i; for (hash = 0, i = 0; i < key.length(); ++i) { hash += key.charAt(i); hash += (hash << 10); hash ^= (hash >> 6); } hash += hash << 13; hash ^= hash >> 11; hash += hash << 15; return Math.abs(hash); } }
3. Performance Test
3.1 HashSet
We define a hashset with capacity 100,000,000 and insert 10,000,000 entries into the set. Then, we assert several elements to test whether they are in the set.
public class HashSetTest { @Test public void testHashSet() { int capacity = 100000000; int count = capacity / 10; long start = System.currentTimeMillis(); Set<Integer> set = new HashSet<>(capacity); for (int i = 0; i < count; i++) { set.add(i); } assertTrue(set.contains(1)); assertTrue(set.contains(2)); assertTrue(set.contains(3)); assertTrue(set.contains(999999)); assertFalse(set.contains(10000001)); assertFalse(set.contains(400230340)); long end = System.currentTimeMillis(); System.out.println("Executed Time: " + (end - start)); } }
It takes more than 6 seconds to execute the test.
Executed Time: 6442
3.2 Bloom Filter
Create a similar test with the Bloom Filter structure we built previously.
class BloomFilterTest { @Test public void testBloomFilter() { int capacity = 100000000; int count = capacity / 10; long start = System.currentTimeMillis(); BloomFilter bloomFilter = new BloomFilter(capacity); for (int i = 0; i < count; i++) { bloomFilter.add(i + ""); } assertTrue(bloomFilter.contains(1 + "")); assertTrue(bloomFilter.contains(2 + "")); assertTrue(bloomFilter.contains(3 + "")); assertTrue(bloomFilter.contains(999999 + "")); assertFalse(bloomFilter.contains(10000001 + "")); assertFalse(bloomFilter.contains(400230340 + "")); long end = System.currentTimeMillis(); System.out.println("Executed Time: " + (end - start)); } }
It takes only about 1 second to complete the test, which is much faster than the hash set.
Executed Time: 1171