Create a fully connected subgraphs

Hi Gurus,

I have an interesting scenario that I am trying to solve in TigerGraph as Neo4j is unable to process data at such high volume. This is my 2nd day in TigerGraph so pardon my limited knowledge.

I have about a million nodes and even more edges; Nodes exist in densely connected clusters where each Node in a cluster is connected to almost every other Node within the Cluster. There are many such Clusters in the graph data and my requirement is to find missing edges that should be created to fully connect the clusters. Ex. If one of the many clusters has 4 Nodes A,B,C,D ; my data is as shown below
A -> B
A -> C
A -> D
B -> C
B -> D

All edges are directed with lower id on from side and higher id on to side. In the above example C -> D is missing to fully connect all nodes in the cluster and I am struggling to write a query that will identify the missing edge and write the result to file so I can use it to load back into Graph. Once all clusters have their Nodes fully connected to each other, Finally I need to write a query that will return all Node Id’s with the Id of the Node that has the highest Id value connected to it in 1st Hop as shown below

A -> D
B -> D
C -> D

The hunt is to find All Nodes in each cluster that do not have the Max Id value along with the Max Id value in a row & column format. The simplest approach is to connect up all Nodes in Clusters and for each Node and then find the the Node that does not have an outgoing edge using shortest path (as shown above)


Here is how I would structure such a process:

  • First, create a subquery that correctly connects all the nodes of a single cluster. You will need to form a list of all the cluster IDs yourself (in a main query). You will also need to declare reverse edges for your directed edge type between nodes (which I called CONNECTION and reverse_CONNECTION), as well as ensure that the PRIMARY ID (I called it Node_id) of all nodes is saved as an attribute. The body of this subquery should look something like this:

    SumAccum<INT> @@count;
    ListAccum<VERTEX<Node>> @@clusterMembers;
    INT targetID; // should be an argument of your subquery
    all = {Node.*};
    currentCluster = SELECT s from all:s where s.clusterId == targetID 
                     POST-ACCUM @@count += 1, @@clusterMembers += s;
    createEdges = SELECT s from currentCluster:s
                  WHERE (s.outdegree("CONNECTION") + s.outdegree("reverse_CONNECTION") < @@count - 1)
                  POST-ACCUM FOREACH member in @@clusterMembers DO
                    IF s.Node_id > member.Node_id THEN
                      INSERT INTO CONNECTION VALUES (member.Node_id, s.Node_id)
                    ELSE IF s.Node_id < member.Node_id THEN
                      INSERT INTO CONNECTION VALUES (s.Node_id, member.Node_id)
  • Now, at the end of your main query (after you have called the subquery for all clusters), you can make the following request to get the “top” Node (highest ID) for each cluster. This is done as you suggested by checking for Nodes that do not have any outward facing directed edges (only inward facing ones):

    all = {Node.*};
    getTopOfEachCluster = SELECT s from all:s where s.outdegree("CONNECTION") == 0;
  • Make sure that this second code block is not called in the same query as each of the first blocks. You should call the first blocks in subqueries called by the main query, which holds getTopOfEachCluster. Otherwise, getTopOfEachCluster will not be able to read the new edge modifications that have been made through this entire process.

Firstly thanks for responding to the question. I am trying to interpret the response and first query I have is about forming list of Cluster IDs. Since I am running this over very large dataset I don’t know how many clusters exist in my data. Additionally, it looks like the subquery should be executed in loops for each cluster which is also not feasible based on my prior statement. Now I could attempt to build query to get me the clusters but this is seeming to be more complex that it ought to be. Here is how I am doing in Neo4j which is a very simple query and does the job for small datasets. I was super interested in a distributed large scale graph database and Tigergraph caught my interest but I am learning that the query language is quite different and complex for new beginners unlike Neo4j. I did take an Udemy course that helped me get some speed but not sure how best to learn the GSQL language.

MATCH (b:address)
with collect(b) as items
CALL apoc.cypher.mapParallel2("
	MATCH (_:address)-[r1:matched]-(a:address)-[r2:matched]-(c:address)
    where Not(_.add_id=c.add_id) and NOT ((_:address)-[]-(c:address))
    WITH a,case when toInteger(_.add_id)>toInteger(c.add_id) then _ else c end as id_2,case when toInteger(_.add_id)<toInteger(c.add_id) then _ else c end as id_1
    return distinct id_2.add_id as id_2,id_1.add_id as id_1;
    {parallel:True, batchSize:10000, concurrency:8}, items, 6) YIELD value 
RETURN value.id_2 as id_2, value.id_1 as id_1

The reason your request is difficult in TG is because it is a challenge to (pairwise) match 2 vertex sets to each other in a single SELECT statement if they are not already connected by edges. This is the reason for the FOREACH looping structure within createEdges.

To assemble a set of all your clusterIDs, you can do the following:

SetAccum<INT> @@clusterIDs;
all = {Node.*};

fillSet = SELECT s from all:s POST-ACCUM @@clusterIDs += s.clusterId;

You can then iterate over this SetAccum (does not allow duplicates) using FOREACH.