I am working on a scale-out graph rules engine project that requires us to check if an item is NOT on one of many long lists of items (hundreds of thousands of items on thousands of lists). Other NoSQL systems use a probabilistic hashing algorithm called a Bloom filter to do this. Has anyone created a Bloom Filter User Defined Function (UDF) for GSQL yet?
If anyone else is interested in graph rule engine design, please let me know.
As a background, this article is very well written
Doesn’t look like it, but should be very doable as a UDF.
There is some internal conversation!
My first thoughts:
Architecturally, we’d need a ListOfStuff vertex to contain the Bloom state. That would need to be an encoded representation (base64 string or whatever).
Keeping the functions simple is a great idea. I like just having two queries, one for addValue and one for checkValue. One other option is to pass a collection Set attribute to an addValue since this is really the datatype we are working with. I wonder if that would be faster than calling addValue N times?
Yes it should be a little quicker, particularly the problematic bit in this design would be the serialization and deserialization of the filter state on every call.
The filter would be some small multiple of the bloom filter size which would depend on the number of elements etc. Parameterising those would require a createFilter function.
If the ruleset is static, i.e. we generate the filter once, and then test against it a lot, then more succinct structures can be used e.g. cuckoo or xor filters.
Then there wouldn’t be an addValue operation, just a createFilter operation taking all of your rules as a parameter and returning the serialized filter state.