Without support for NULL, what is the recommended way to deal with properties that are… nullable (that is may be missing according to business logic). There are many scenarios where using a sentinel value is inappropriate, particularly when aggregating values. For example, if looking for average “age” of all “person” vertices, there is no value that may be used that would not distort the results. While there may be some hacks available in that particular situation (i.e. an “age” cannot be negative, so use a sentinel -1 and exclude it from the aggregation), this will not always work. Another option would be to duplicate every property with a shadow “*_is_present” boolean property, but this seems horribly hacky as well. What’s the best move?
The flag is the best way. This may seem a bit hacky, but in my experience, there are some compensating advantages.
You’ll generally be NULL testing anyway, which may as well be against a flag, and it means the underlying functionality is super-quick as it eliminates a lot of redundant NULL testing and NULL return handling, with all the special types that implies.
The graph’y approach is to materialise a vertex per attribute, and rely on the presence or absence of an edge to that vertex to signify NULL. Probably overkill in a lot of cases, but really useful for ordinal types.
I see. Using a flag to indicate presence is certainly doable, but it raises a performance concern for me. For use cases where a lot of aggregation needs to be done across many attributes it becomes impossible to do it in a single pass. Now a query must be defined for every attribute we want to aggregate in order to exclude unset values. I’m not clear enough on the internals of TigerGraph to know whether or not this would have any impact (maybe multiple queries on the same vertex set with varying predicates can be executed in a single pass underneath?), but perhaps you could enlighten me. The concern isn’t exactly academic, it’s common for us to do 5 aggregates (min, max, sum, avg, percentiles[1,5,50,95,99]) on each of a dozen or more attributes over large portions of the primary vertices.
Materializing a vertex per attribute value doesn’t seem feasible for numeric values, though would be a solid solution for enums as you say.
You can always do it in a single pass.
The POST-ACCUM clause is ideal for that, though you have to watch out for the alternative command separator if you want to do multiple things (which is a comma inside the ACCUM and POST-ACCUM clauses).
Each sub-clause can perform an IF myattr_flag THEN my_avg_accum += myattr END to treat the NULL equivalents correctly.
I’ve never had a performance issue with respect to vertex processing. If you hit any specific concerns, let us know! If you are hitting a lot of the graph, then there is the optimize option: https://docs.tigergraph.com/dev/gsql-ref/querying/query-operations#optimize-option . And for queries that comply, there is also the fully distributed query mode: https://docs.tigergraph.com/dev/gsql-ref/querying/distributed-query-mode . This has some restrictions, but they can often be worked around.
Thanks Rik! Didn’t realize there was logic allowed in POST-ACCUM.