Hi @samzy,
I have a couple of questions:
- May I ask which TigerGraph version you are currently using? E.g. 3.10.1 or 3.9.3?
- What are the rough runtimes of the queries you are mentioning? E.g. 3 minutes, 30 minutes…?
- What is your cluster setup? What are your cluster's partition number and replica number?
- Are you currently using TG Cloud? TG On-prem? Or TG in Docker?
Here is a list of some of the optimization options you could look more into:
- Follow the Best Practices:
- E.g. hard-batching, avoiding hub nodes, using fewer container-type accumulators, …
- Best Practice videos can be found here:
- GSQL SCHEMA DESIGN & QUERY WRITING BEST PRACTICES: Part 2: Query Writing Best Practices
- Graph Gurus 33: GSQL Writing Best Practices - Part 2 Design Optimal Traversal Plan
- Graph Gurus 36: GSQL Writing Best Practices - Part 3 Memory Optimization
- Graph Gurus 38: GSQL Writing Best Practices - Part 4 Parallelization And Preprocessing
- Graph Gurus 40: GSQL Writing Best Practices - Part 5 Data Structure
- Try using a distributed-mode query (a query created with the DISTRIBUTED keyword), or, if you are already using one, try turning it into a single-server (non-distributed) query. A minimal distributed-query sketch is included after this list.
- Try to modify the part(s) or SELECT statements that take a considerably longer amount of time compared to the other parts. This can be especially helpful if there are a lot of SELECT statements.
- Use batching for queries when applicable. While this can result in longer runtimes, it will use less memory per batch (i.e. leave more free memory). A small batching sketch follows this list.
- Consider moving the subquery logic into the main query, especially if the subquery contains SELECT statements and is called in ACCUM/POST-ACCUM clauses. A before/after sketch is shown after this list.
- Small subqueries that are called outside of ACCUM clauses are generally fine.
- Try not to use too many SELECT statements in your query, especially inside loop iterations (if you can avoid it).
- E.g. if you have a loop with 10k iterations and each iteration runs something like 3 SELECT statements, the query could end up executing 30k SELECT statements. Each SELECT statement has a bit of overhead, so in such a case it can be beneficial to invert the logic so that the iterations are done inside the SELECT statements instead (see the last sketch after this list).
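To illustrate the distributed-mode point above, here is a minimal sketch of a query created with the DISTRIBUTED keyword. The graph name MyGraph, vertex type Person, and edge type Knows are placeholder names, not from your schema:

CREATE DISTRIBUTED QUERY friend_count_demo() FOR GRAPH MyGraph {
  // per-vertex accumulator counting each Person's neighbors over Knows edges
  SumAccum<INT> @friend_count;
  start = {Person.*};
  result = SELECT s
           FROM start:s -(Knows:e)-> Person:t
           ACCUM s.@friend_count += 1;
  PRINT result[result.@friend_count];
}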
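Here is a minimal sketch of the batching idea, again with placeholder names (MyGraph, Person, Knows). It splits the Person vertices into num_batches groups by internal vertex ID, so the heavier traversal only holds one batch worth of intermediate data at a time:

CREATE QUERY batched_demo(INT num_batches = 10) FOR GRAPH MyGraph {
  SumAccum<INT> @neighbor_count;
  FOREACH batch_id IN RANGE[0, num_batches - 1] DO
    start = {Person.*};
    // keep only the vertices that fall into the current batch
    batch = SELECT s FROM start:s
            WHERE getvid(s) % num_batches == batch_id;
    // the traversal now only processes one batch per iteration
    result = SELECT s FROM batch:s -(Knows:e)-> Person:t
             ACCUM s.@neighbor_count += 1;
  END;
  PRINT "done";
}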
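For the subquery point, here is a hedged before/after sketch. neighbor_degree_subquery, Person, and Knows are hypothetical names; the assumption is that the subquery only computes something simple (here, the out-degree of the target vertex), so the same value can be accumulated directly in the main query without the per-call overhead:

CREATE QUERY inline_subquery_demo() FOR GRAPH MyGraph {
  SumAccum<INT> @score;
  start = {Person.*};
  // Before (slower): one subquery call per traversed edge
  // result = SELECT s FROM start:s -(Knows:e)-> Person:t
  //          ACCUM s.@score += neighbor_degree_subquery(t);

  // After (faster): the same value computed directly in the main query
  result = SELECT s FROM start:s -(Knows:e)-> Person:t
           ACCUM s.@score += t.outdegree("Knows");
  PRINT result[result.@score];
}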
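Finally, a sketch of inverting a loop of SELECT statements into a single SELECT. The assumption here is a Person vertex with an INT attribute community_id (a placeholder name); instead of looping over every community ID and running one filtered SELECT per community, one pass over all vertices fills a MapAccum keyed by that ID:

CREATE QUERY per_community_count_demo() FOR GRAPH MyGraph {
  // one global map: community_id -> number of Person vertices in it
  MapAccum<INT, SumAccum<INT>> @@count_per_community;
  start = {Person.*};
  // a single SELECT replaces a FOREACH over all community ids,
  // each of which would otherwise run its own filtered SELECT
  result = SELECT s FROM start:s
           ACCUM @@count_per_community += (s.community_id -> 1);
  PRINT @@count_per_community;
}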
Measuring the time of each portion of your GSQL code:
- Using datetime functions (already available) - precise up to 1 second: How to time the graph algorithm execution - #3 by Szilard_Barany
- Using the timestamp() function (only for TG 3.10 and above) - precise up to 1 millisecond:
INT curr_part_start_time = timestamp();
some_vertex_set = SELECT s FROM ...
INT curr_part_end_time = timestamp();
INT curr_part_time_ms = curr_part_end_time - curr_part_start_time;
PRINT curr_part_time_ms;
I hope this helps to some degree!
Best,
Supawish Limprasert (Jim)
Solution Engineer, TigerGraph