TigerGraph data distribution in a cluster

We are working on a graph that has 15 vertex types and 20 edge types (about 1.5 billion vertices on an 8-machine cluster). Whenever we run a query that touches more than 2 vertex types and 1 edge type, it ends with an out-of-memory error. My assumption is that TigerGraph brings all the data to the machine where it decided to execute, fetching all the missing vertices (data that doesn't exist on that machine) before query execution. More vertex and edge types in a single SELECT statement means more shuffling, and more data gets pushed into memory, which causes the OOM error. My understanding is that TigerGraph distributes the data across all the machines in a cluster, but I don't know how it is distributed. Can someone walk me through how TigerGraph would distribute data for an 8-machine cluster, 15 vertex types, 20 edge types, and 300 GB of data?

We have tried distributed mode, but we are facing the same issue.

Example query:

source = select t
from V1:s - (e1) - V2:i - (e2) - V3:t
where s.city == "jersey" and i.project == "abc" and t.fund > 10000;

One quick comment: graph queries that rely heavily on filtering in the WHERE clause are something of an anti-pattern and should be avoided. That's more of a relational-database style of approach. For instance, I would have a City node and/or a State node and traverse from the City to whatever V1 is. This is especially true when a vertex type has a high vertex count. Basically, any attribute that is commonly used in a WHERE clause should be pulled out and made into its own vertex, as in the sketch below.
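For example, a sketch of the refactored traversal. The City vertex type, its name attribute, and the has_city edge are hypothetical stand-ins, not part of the original schema; this assumes an undirected (or reverse-traversable) has_city edge between V1 and City:

// Seed on the small City set instead of scanning every V1 vertex
// and filtering on s.city.
Start = SELECT c FROM City:c WHERE c.name == "jersey";
// One hop from the matching city to its V1 neighbors.
Source = SELECT s FROM Start:c -(has_city)- V1:s;

The filter now touches only the City vertices (a handful) rather than every V1 vertex (potentially hundreds of millions), so far less data needs to be loaded just to be discarded.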


Suppose the WHERE clause is not there. How is TigerGraph data distributed in an 8-node cluster? I am more interested in learning about data distribution, and whether there is a way to avoid shuffling for more than 2 vertex types. If I can keep my target dataset on one machine, the query will run more efficiently.

TigerGraph by default distributes graph data evenly across all nodes. There are advanced configurations that can be used to specify a different graph partitioning strategy, such as placing specific types on specific nodes.
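As a rough illustration of the default behavior: vertices are hashed by ID into segments and the segments are spread evenly across the cluster, so with 300 GB on 8 machines each machine holds roughly 300 GB / 8 ≈ 37.5 GB, on the order of 1.5 B / 8 ≈ 190 M vertices. In other words, every vertex type is spread across all 8 machines; no single machine holds a complete copy of any type unless you use the advanced placement configuration.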

If the GSQL query is created with the normal signature 'create query xxx', it will be executed in pull mode, which, as you mentioned, pulls all data involved in each graph traversal step onto the chosen worker machine. There is also a distributed query mode, created with the signature 'create distributed query xxx', which runs each graph traversal step distributed across all nodes in the cluster. However, a few GSQL features are not supported in distributed queries. For more detailed documentation, see: Distributed Query Mode :: GSQL Language Reference
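For example, a distributed version of the earlier query might look like this. This is a sketch: the query name fund_filter and graph name MyGraph are placeholders, and it assumes pattern-matching syntax (SYNTAX v2) for the multi-hop FROM clause:

CREATE DISTRIBUTED QUERY fund_filter() FOR GRAPH MyGraph SYNTAX v2 {
  source = SELECT t
           FROM V1:s -(e1)- V2:i -(e2)- V3:t
           WHERE s.city == "jersey" AND i.project == "abc" AND t.fund > 10000;
  PRINT source;
}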

Sorry, I just saw your comment that the distributed query still hit OOM. We will ask the GSQL team to take a look.
A workaround we can try is to break the query into multiple single-traversal-step queries, like this:

V1 = SELECT b FROM A:a -(e1)- B:b WHERE a.xxx == xxx;
V2 = SELECT c FROM V1:b -(e2)- C:c WHERE c.xxx == xxx;
V3 = SELECT d FROM V2:c -(e3)- D:d WHERE d.xxx == xxx;

PRINT V3;
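Applied to the original example query, the decomposition might look like this (a sketch; S, I, and T are just illustrative vertex-set names):

S = SELECT s FROM V1:s WHERE s.city == "jersey";
I = SELECT i FROM S:s -(e1)- V2:i WHERE i.project == "abc";
T = SELECT t FROM I:i -(e2)- V3:t WHERE t.fund > 10000;

PRINT T;

Each hop then only carries forward the vertices that survive that step's filter, which keeps the working set that must be shuffled between machines much smaller.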
