Subgraph Matching in TigerGraph

Hi, I am new to TigerGraph but I feel interested and excited in it. I am looking for materials about how to implement subgraph pattern matching in TigerGraph. For example, here is a query which is a fully connected subgraph with four query vertices:

Customer:u1 -> buy -> Product:p
Customer:u2 -> buy -> Product:p
Customer:u3 -> buy -> Product:p
Customer:u1 <- friend -> Customer: u2
Customer:u1 <- friend -> Customer: u3
Customer:u2 <- friend -> Customer: u3

The query describes that three customers buy the same product and they are friends of each other. I wonder how to write a query with GSQL such that if there is a new event that a customer u1 buys a product p, I can find or count all the [u1, p, u2, u3] tuples that match the query subgraph above.

In fact, I also have other subgraphs with different subgraph structures that may not be fully connected, so I want to find a general approach that can express pattern matching for any subgraph using GSQL.

I read the GSQL102 Pattern Matching but the tutorial mainly talks about pattern matching on path-shape subgraph. I am still not sure how to extend from path-shape to any other shape. I wonder if there is some advanced tutorials for this.

I also read Graph Pattern Matching in GSQL which seems to be more relevant to what I want. But I still find difficulty to extend from this blog to general subgraphs. For example, how should I modify the GSQL query in the blog if I add an edge from A to D in the Fig 1 in the blog? As this blog is posted two years ago, I also wonder if TigerGraph has better support for pattern matching on general subgraphs.

Thanks for your time for reading this. Any help is appreciated.

Couple points here to discuss:

  1. when person u1 buys a new product, are you saying that you want to recommend this new product to person u1’s friends? or are you saying that once person u1 buys product p then find friends that share the purchase history of this new product?
  2. how many hops do you want to allow? i.e. direct friend, friend of a friend,etc.

If you just want the direct friends that purchased product p, here is how I would do it, although there may be other ways:

assume the inputs are cust and prod:
buyer = {cust};
FRIENDS = SELECT c FROM buyer -(friend>)- Customer:c;
item = {prod};
BUYERS = SELECT c FROM item -(<buy) - Customer:c;

MATCHES = FRIENDS INTERSECT BUYERS;

Hi markmegerian, thanks for your reply.

For point 1, I want to find friends of u1 that share the purchase history of the product.
For point 2, I have no idea, so I would say I have no restriction on the number of hops in the query.

As to your GSQL query, if I understand it correctly, the query first looks for (1) the friends of u1 and (2) the buyers that bought the product, and then intersect the two results. But I think the final results does not guarantee the friends of u1 are friends, i.e., Customer:u2 <- friend -> Customer: u3? In fact, that is a question I am interested to know the answer.

You are correct - my example only shows direct friends. But I am not sure why you want to guarantee that u2 and u3 are friends. Think about it, if u1 has 5 friends, would you want to ensure that all 5 friends returned are also direct friends with everyone else? This is certainly possible, but I am not sure if that is what you want. Do you want a closed network where you only get back the number of people that make up a fully interconnected cluster where every direct connection is present?

if you do, this is the general idea, which builds on the previous example

SetAccum @friendCount;
INT groupsize;
groupsize = MATCHES.size();

CLUSTER = SELECT f FROM MATCHES:f
ACCUM f.@friendCount += (f.neighbors() INTERSECT MATCHES)
HAVING f.@friendCount.size() == groupsize - 1;

Thanks @markmegerian!

why you want to guarantee that u2 and u3 are friends

Actually my reason is naive. It is because there is an edge connecting u2 with u3 in the query subgraph. I just want to follow the query subgraph exactly when translating the subgraph to a GSQL query.

Do you want a closed network where you only get back the number of people that make up a fully interconnected cluster where every direct connection is present?

Nope, I am not looking for the size of the closed network among the friends of u1. I am sorry that I did not make myself clear. Supposed u1 has 5 friends, i.e., a, b, c, d and e. All the 5 friends bought the same product p as u1. What I want to find is the matchings of the query subgraph, where there will be at least 10 matchings for this case:

[u1, p, a, b]
[u1, p, a, c]
[u1, p, a, d]
[u1, p, a, e]
[u1, p, b, c]
[u1, p, b, d]
[u1, p, b, e]
[u1, p, c, d]
[u1, p, c, e]
[u1, p, d, e]

It is ok but better not to produce duplicate matchings like [u1, p, b, a] which is actually the same as [u1, p, a, b].

Hi @markmegerian, after spending some time searching and thinking, I think one possible solution for my case is to use subqueries, where the job for each subqueries is to extend one new query vertex and we use select xxx from a-(b)-c to get the neighbors of old query vertices then intersect their neighbors to get the data vertices for the new query vertex.

In fact, I am not 100% sure as I am new to TigerGraph, so I want to show you an example, as given below. Would you please take a look and see if the idea (instead of syntax) is correct?

CREATE QUERY step_3(Customer u1, Product p, Customer u2, Customer u3) FOR GRAPH G returns (INT) {
   return 1;
}

CREATE QUERY step_2(Customer u1, Product p, Customer u2) FOR GRAPH G returns (INT) {
   SumAccum<INT> @@n_matchings;
   U1 = {u1};
   P = {p};
   U2 = {u2};

   u3s = select f from U1:u1 -(Friend:a)- Customer:f -(  Buy>:b)- P:p;
   u3s = select f from U2:u2 -(Friend:a)- Customer:f -(Friend:b)- u3s:u3;
   
   FOREACH u3 IN u3s DO
       @@n_matchings += step_3(u1, p, u2, u3)
   END;

   return @@n_matchings;
}

CREATE QUERY step_1(Customer u1, Product p) FOR GRAPH G returns (INT) {
   SumAccum<INT> @@n_matchings;
   U1 = {u1};
   P = {p};

   u2s = select f from U1:u1 -(Friend:a)- Customer:f -(Buy>:b)- P:p;
   
   FOREACH u2 IN u2s DO
       @@n_matchings += step_2(u1, p, u2s)
   END;

   return @@n_matchings;
}

If the above is correct, I actually more questions about the execution of the subqueries, i.e., to understand how the subqueries are executed and how I can optimize the execution for the best performance. I have read Distributed Query Mode.

My questions are:

  1. If I mark all the subqueries above with DISTRIBUTED keyword, is it the case that the execution of each subquery will be moved across the cluster when and only when processing the select xxx from a-(b)-c statements?
  2. Is there a mechanism that I can make the execution of a subquery to be independent from the subquery who invokes it? E.g., supposed step_2 will invoke step_3 10 times, are the 10 invocations of step_3 processed one after one sequentially inside step_2? Can I process the 10 invocations of step_3 simutaneously while step_2 will wait for the results of all these 10 invocations?

I think the one below is more appropriate. It seems I cannot update my previous reply so I have to make a new reply here.

CREATE QUERY step_2(Customer u1, Product p, Customer u2) FOR GRAPH G returns (INT) {
    SumAccum<INT> @@n_matchings;
    U1 = {u1};
    P = {p};
    U2 = {u2};

    u1_friends = select f from U1:u1 -(Friend:e)- Customer:f;
    p_buyers   = select b from P:p -(<Buy:e)- Customer:b;
    u2_friends = select f from U2:u2 -(Friend:e)- Customer:f;
    
    FOREACH u3 IN u1_friends INTERSECT u2_friends INTERSECT p_buyers DO
        @@n_matchings += step_3(u1, p, u2, u3)
    END;

    return @@n_matchings;
}

CREATE QUERY step_1(Customer u1, Product p) FOR GRAPH G returns (INT) {
    SumAccum<INT> @@n_matchings;
    U1 = {u1};
    P = {p};

    u1_friends = select f from U1:u1 -(Friend:e)- Customer:f;
    p_buyers   = select b from P:p -(<Buy:e)- Customer:b;
    
    FOREACH u2 IN u1_friends INTERSECT p_buyers DO
        @@n_matchings += step_2(u1, p, u2s)
    END;

    return @@n_matchings;
}

I love subqueries and I am currently working on a tigergraph project that makes great use of subqueries and recursion, but as far as I can tell, this will not support DISTRIBUTED. The tigergraph experts can chime in here, since I havent actually seen that explicitly in the docs, but my query installs fail with the DISTRIBUTED keyword.

Can someone give the official statement in distributed queries and subqueries?

Hi @markmegerian, I am so sorry that I thought you are one of the members or staffs of the Tigergraph team and I may have raised too many tough questions to you. XD I still want to thank you for your help.

Just to share some of my experience so far.

Actually I tried marking the subqueries with DISTRIBUTED, but similar to your case, I failed to install my queries and I got the following error (where xxx is a subquery):

Query 'xxx' is also a batch query, a batch query is not allowed to call another batch query.

Then I remove the DISTRIBUTED keyword from all the subqueries and only mark the top root query with DISTRIBUTED. When I run the queries, it seems all the machines are used for computation with high cpu usage and obvious network usage, but the computation runs for an hour and does not seem to be finishing, while the computation is supposed to finish within 10 minutes.