How to identify duplicates based on an attribute?

swarnasravan · January 18, 2024, 2:26am

We have a Vertex named “Connection”, we are saving to this vertex based on JSON we get from an external system. This vertex has an attribute called “DRNID”.

is there a way to check if there are more than one instance of this Vertex with same DRNID? I want to do something like selfjoin like below sql

select * from Connection con1, Connection con2 where con1.DRNID = con2.DRNID and con1.primarykey<>con2.primarykey

Jim_Limprasert · January 18, 2024, 10:40pm

Hi @swarnasravan !

This is a good question - this question shows the flexibility that GSQL provides. I believe there’s multiple ways to achieve this, and this would be one way to do it. Note that this might take some time to perform in a very large graph.

You can see the code attached below. The key thing to note here is the usage of accumulators, and you can find out more information about that in this document.

CREATE OR REPLACE QUERY find_connection_with_duplicate_drnid() FOR GRAPH <graph-name> {
    // Utilize MapAccum to gather all these information
    MapAccum<STRING, INT> @@drnid_count_map;
    MapAccum<STRING, SetAccum<VERTEX<Connection>>> @@dup_drnid_vertex_map;
    
    start = {Connection.*};

    // In the first pass, gathers the count of each drnid
    selected_connections = SELECT s FROM start:s
        ACCUM @@drnid_count_map += (s.DRNID -> 1);

    // In the second pass, gathers the vertices with duplicate drnid
    // (the count is greater than 1)
    selected_connections = SELECT s FROM start:s
        ACCUM IF @@drnid_count_map.get(s.DRNID) > 1 THEN
            @@dup_drnid_vertex_map += (s.DRNID -> s)
        END;

    // By doing this, it'll print out a map mapping from
    // the DRNID to a list/set of vertex ids with duplicate DRNID.
    PRINT @@dup_drnid_vertex_map;
}

I hope this helps! Please let us know if there is any further questions.

swarnasravan · January 27, 2024, 11:47am

Thank you so much for your answer. I have generated some reports based on your code and shared with management. Now they are asking me to merge the duplicates and delete one of the Vertex.

Is there an easy way to do it in GSQL? basically if the attribute “source_system” is set to NMSA then merge this vertex with the duplicated vertex whose “source_system” is SOCS. Update those attributes that are empty and move the edges and then delete the vertex.

I know this can be done in Java, but I would prefer to do it in GSQL directly.

Jim_Limprasert · February 1, 2024, 10:19pm

Hi @swarnasravan ,

I think there would be a way to do that in GSQL, but it would involve several steps which is not that simple. The method as follow would be 1 way to do it.

Find vertex where source_system is set to NMSA that has a duplicated vertex (I assume based on DRNID) where the source_system is SOCS.
- you can base the code here off the GSQL code seen above, though there’ll need to be some modifications done
“Link” those two vertices together. You can perhaps use a local accumulator for SOCS vertex so that it’ll store the duplicated NMSA vertex.
For each attribute in a vertex where source_system is SOCS, update the attributes that are empty with the attribute from the NMSA vertex stored in the local accumulator. Vertex functions methods getAttr and setAttr would be helpful here, see this doc for more information. Note that, you’ll need a list of attribute names of SOCS vertices if you don’t want to write the getAttr and setAttr for each of the vertex’s attribute.
For each vertex where source_system is SOCS and it has a linked NMSA vertex, look at all the edges that the NMSA vertex got.
- This would probably be the most complicated step out of the 5-step process.
- You’d want to collect every edge going out from NMSA vertex here.Then, for each of those edge, generate an edge where one of the endpoint is the SOCS vertex, and the other endpoint is the target of the corresponding edge (from NMSA vertex).
- Some of the edges might have attributes. You’d want to copy the attributes over as well. See this doc for more details on the Edge’s getAttr and setAttr. Some sort of data structure (like a nested Map) would be needed here if you don’t want to write the getAttr and setAttr code for each of the edge’s attribute for each edge.
Remove the NMSA vertex that is linked with SOCS vertex. (I assume you won’t remove the NMSA vertex that doesn’t have a SOCS duplicate). Or, you can keep the NMSA vertex for reference, but not use it in the query, if memory usage is not a concern.

A simpler way to have a linkage between NMSA and SOCS vertex would be to have an edge (new edge type) from NMSA to SOCS vertex. This would be relatively simpler and it won’t involve data deletion. However, in this approach, you’ll need more code to traverse from NMSA to SOCS vertex (or vice versa) if you were to do some query and you want to treat the NMSA and SOCS vertices with the sane DRNID as “the same vertex”.

Based on how complex merging the attributes would be, I think it would depends on your use case what you want to do. For some use cases, the merging of the vertices might be more reasonable, for others, just having a link between those two vertices would be enough.

And just like you said - you can also do the merging in another programming language, like Java, and then load in the merged data.

I hope this helps you in making a decision!

demit733 · April 23, 2024, 4:11pm

Hi @swarnasravan,

Another approach to your problem is that you should consider redesigning your schema. You could “elevate” the attribute “DRNID” to its own Vertex. (i.e. remove the attribute from your “Connection” vertex and create a new Vertex with a primary key called DRNID).

You would then create an edge between the new Vertex DRNID and your Connection Vertex. So, in this design, you would only have one DRNID, but it would have multiple edges to your Connection Vertex (where the edges would represent your duplications).

Under this design approach, it would be much easier to report and resolve duplications. Once you start to “Think” in graph terms, you will discover that there is less code that you need to write

Kind Regards,

Mitch DeFelice