Creating relations based on multiple properties

Phoenix9 · August 12, 2021, 1:34pm

Hi All,

I am trying to create relations based on multiple property values. Consider the following example:

nodes.csv
id,property1,property2,property3
1,34,4,5
2,35,6,1
3,67,8,2
4,56,8,9
5,673,8,3

relations.csv
id1,id2,property1,property3
1,2,34,1
4,5,56,3
3,4,99,9

The relations must be created based on ids and properties. The last relation from 3 to 4 must not be created because property1 for id 3 doesn’t match.

Thanks & Regards,
Vinayak

Leo_Shestakov · August 12, 2021, 2:46pm

Hello,

The conditional loading you describe is in fact a functionality of GSQL, but it will not be possible to do with your CSV files because of the locality of load statements. In a single load statement, you can only access information about the very file you are loading from.

Therefore, when loading relationships.csv, you will not be able to reference the attributes from nodes.csv to check whether the connection is valid by comparing the values of “property1”. A solution is to condense the data into a single file, so that you can reference both “property1” attributes from a single file:

nodesAndRelations.csv
id1, property1, property2, property3, R_id2, R_property1, R_property3

There may be more vertices than edges (or vice versa), which means that there will probably be empty cells in the dataset. This can be accounted for by checking whether key attributes for a vertex/edge load are empty or not.

CREATE LOADING JOB load_name FOR GRAPH name {
    DEFINE FILENAME nodesAndRelations = "~/<exact_path>/nodesAndRelations.csv";

    LOAD nodesAndRelations to VERTEX node VALUES ($"id",$"property1",$"property2",$"property3") WHERE NOT ($"property1" IS EMPTY) USING header="true", separator=",";
    LOAD nodesAndRelations to EDGE relation VALUES ($0, $"R_id2", $"R_property1", $"R_property3") WHERE NOT ($"R_id2" IS EMPTY) AND ($"R_property1" == $"property1") USING header="true", separator=",";
}
RUN LOADING JOB load_name

Here is more documentation about conditional loading using the WHERE statement

Phoenix9 · August 13, 2021, 5:27am

Let’s consider an example in which we have 100 nodes and 10K relations both are stored in two different CSV files. The ideal way to load them can be:

Load nodes first
Extract the nodes which satisfy the condition given in relation CSV file.
Create relations between extracted nodes
We can’t combine these CSV files due to a mismatch in the number of rows. How can we solve such types of scenarios?

markmegerian · August 13, 2021, 1:28pm

I can think of a few ways to solve this:

do the validation check prior to the load. since you only want to load edges where the properties match, you could filter these out ahead of time, then do the load. For instance, loop through the relation CSV file, and for both id1 and id2, look up the values from the node CSV file and filter out mismatches
as suggested above, create a more detailed relations CSV file where you have a column for the original value and a new column for the looked up value. So 3,4,99,9 would become 3,4,99,67,9,9 and you use the WHERE clause on the load to enforce the match. unlike the example, you would have two files
a slightly unusual, but also simple approach, would be to simply load them all, then run a post process GSQL query that removes the EDGES where the property on the edge does not match the property on the VERTEX

Leo_Shestakov · August 13, 2021, 1:33pm

Vinayak,

I think I understand your plan, but in step 2 when you say “Extract nodes which satisfy the condition”, shouldn’t that be all of them? After all, information about all the nodes is required in order to validate any single edge. So, this is starting to sound like combining the CSV files again.

If you look at my example, a mismatch in the number of rows is not a problem because for every vertex/edge load, the WHERE clause makes sure that the given row is valid for loading that portion of the graph (for instance, if a critical attribute for loading a vertex is missing on a line, it will not be loaded)

I also like Mark’s idea of creating all relations first and then using a SELECT statement to remove invalid ones (where the edge attribute != source vertex attribute). I would personally do this.