Choices for PRIMARY_ID

jimwu · January 24, 2021, 10:02pm

My dataset doesn’t have an attribute or a combination of attributes that lends itself to being a PRIMARY_ID. Since one is required, what do you suggest I use as the primary ID? I looked it up and I don’t see an equivalent of SQL’s auto-increment ID.

Szilard_Barany · January 25, 2021, 3:11pm

I am not aware of any auto-generated/auto-incremented ID or sequence-like feature in TigerGraph.
But there should be something that differentiates the nodes/vertices, or else how can you create edges/relationships among them? 32 identical red dots cannot be turned into a (meaningful) graph.
If nothing else, the concatenation of all attributes into a big string should provide a unique value, I would think.

jimwu · January 25, 2021, 7:14pm

Thanks!

I am trying to load the netflow and host event dataset from Los Alamos National Lab. Below is an example of a network transaction:

epoch_time,duration,src_device,dst_device,protocol,src_port,dst_port,src_packets,dst_packets,src_bytes,dst_bytes
118781,5580,Comp364445,Comp547245,17,Port05507,Port46272,0,755065,0,1042329018

As far as I can see, nothing in the header guarantees uniqueness. If I concat all columns, I may get a unique identifier, but that is by chance, not by design.

Szilard_Barany · January 25, 2021, 8:22pm

I think

gsql_concat($0,"|",$2,":",$5,"-",$3,":",$6)

i.e.

epoch_time|src_device:src_port-dst_device:dst_port

would effectively be a unique ID, as at any one time there can be only one communication/transmission between two devices on the specific ports.
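To make the column mapping concrete, here is a minimal Python sketch of the same concatenation applied to the sample row posted above. The function name `flow_key` is only an illustrative helper, not part of TigerGraph; it simply mirrors what the `gsql_concat` expression would produce for columns $0, $2, $5, $3 and $6.

```python
import csv
import io

# Header and sample row as posted in this thread.
SAMPLE = """epoch_time,duration,src_device,dst_device,protocol,src_port,dst_port,src_packets,dst_packets,src_bytes,dst_bytes
118781,5580,Comp364445,Comp547245,17,Port05507,Port46272,0,755065,0,1042329018
"""


def flow_key(row):
    """Build epoch_time|src_device:src_port-dst_device:dst_port from one parsed row.

    Hypothetical helper: mirrors gsql_concat($0,"|",$2,":",$5,"-",$3,":",$6).
    """
    return f"{row[0]}|{row[2]}:{row[5]}-{row[3]}:{row[6]}"


reader = csv.reader(io.StringIO(SAMPLE))
next(reader)  # skip the header line
sample_key = flow_key(next(reader))
print(sample_key)  # 118781|Comp364445:Port05507-Comp547245:Port46272
```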

But I suppose you would not load the entire row into a single vertex. I see at least device, transmission and protocol vertex types.

jimwu · January 25, 2021, 8:33pm

As a small test, I loaded 10k rows out of the dataset’s 3 billion, concatenating all 11 columns. There are already 24 duplicated rows.
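A check like this can be sketched in a few lines of Python before loading anything. This is only an illustration, not the script I actually ran: `count_duplicate_rows` is a made-up helper, and it counts every extra copy of a row whose columns all match an earlier row. The second and third input rows below are invented for the demo.

```python
import csv
from collections import Counter


def count_duplicate_rows(lines):
    """Count rows that repeat another row exactly across all columns.

    Each extra copy counts once, so a row seen three times contributes 2.
    """
    counts = Counter(tuple(row) for row in csv.reader(lines))
    return sum(n - 1 for n in counts.values() if n > 1)


rows = [
    # Sample row from the dataset, as posted above.
    "118781,5580,Comp364445,Comp547245,17,Port05507,Port46272,0,755065,0,1042329018",
    # Invented exact copy, to show a duplicate being detected.
    "118781,5580,Comp364445,Comp547245,17,Port05507,Port46272,0,755065,0,1042329018",
    # Invented distinct row.
    "118782,1,CompA,CompB,6,Port1,Port2,1,1,10,10",
]
duplicates = count_duplicate_rows(rows)
print(duplicates)  # 1
```

For a real file, passing the open file object (after skipping the header) in place of `rows` should work the same way, since `csv.reader` accepts any iterable of lines.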

What I am trying to do at the moment is to reconstruct this exercise https://datasets.trovares.com/cyber/LANL/index.html in TG. As far as I understand their script, they load one row into a vertex. I have to admit I am very new to graph databases, so I don’t yet know the best way to construct a schema.

Also, I am not a network security expert, so I don’t know whether exactly the same event showing up multiple times has any significance in detecting a security breach. For now I just want to load all records as they are.

jimwu · January 25, 2021, 10:29pm

Now I have read the tutorial more closely. You are right: they use only devices as vertices, and all other information goes on the edges. Thanks very much for your advice. @Szilard_Barany
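For anyone following along, the device-as-vertex idea can be sketched in plain Python, independent of any TigerGraph API. The helper `rows_to_graph` and the attribute names it picks are my own assumptions for illustration: the device name serves as the vertex ID, and every row becomes one edge, so repeated rows are kept rather than collapsed.

```python
import csv


def rows_to_graph(lines):
    """Split netflow rows into device vertices and per-row edges.

    Hypothetical helper: vertices are keyed by device name (columns 2 and 3);
    each row becomes one (src, dst, attributes) edge, duplicates included.
    """
    devices = set()
    edges = []
    for row in csv.reader(lines):
        src, dst = row[2], row[3]
        devices.update((src, dst))
        attrs = {
            "epoch_time": int(row[0]),
            "duration": int(row[1]),
            "protocol": row[4],
            "src_port": row[5],
            "dst_port": row[6],
        }
        edges.append((src, dst, attrs))
    return devices, edges


sample = "118781,5580,Comp364445,Comp547245,17,Port05507,Port46272,0,755065,0,1042329018"
# The same row twice: two edges, but still only two device vertices.
devices, edges = rows_to_graph([sample, sample])
print(sorted(devices))  # ['Comp364445', 'Comp547245']
print(len(edges))       # 2
```

With this shape, the vertex ID no longer has to be unique per row, only per device, which is why the missing row-level key stops being a problem for the vertices.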