Local CSV File: Select Subset of Columns to Load?

Ultea99 · August 13, 2024, 6:12pm

Hello. Apologies if this question already exists, I searched thoroughly!

Am new to TigerGraph, and have several CSV files, with various columns, that I’m creating a loading job for. The following works just fine in the GSQL console:

BEGIN
CREATE LOADING JOB load_data FOR GRAPH MyGraph {
DEFINE FILENAME my_file = “ANY:/home/…”;
LOAD my_file TO VERTEX theVertex VALUES("Val1", “Val2”, $“Val3”) USING HEADER = “TRUE”;
}
END

My question is: if the vertex has e.g. 5 attributes, this JOB will fail as there aren’t enough columns in the CSV file to load data in to all of the vertex’s attributes. Is there a way to specify which columns from the CSV file to load? The reason for doing it this way is that the vertex actually needs data from several CSV files, with different columns, to be loaded. Thanks in advance for any suggestions!

I did look here Loading file with column names specified instead of positional parameters - GSQL / Loading - TigerGraph but couldn’t get the example to work in my case.

Jon_Herke · August 14, 2024, 6:23pm

@Ultea99 You can use the underscore to signify that you won’t be loading anything to that attribute. Below I have an example Person with a CSV file. The CSV file is missing dob. When creating my loading job I will use the _ to signify I’m not loading anything to that vertex attribute.

Schema:
Person(vertex): ssn, full_name, first_name, last_name, dob

CSV Headers:
my_CSV: ssn | full_name | first_name | last_name

Sample Code:

LOAD file2 TO EDGE Visit VALUES ($0, $1 $2, $3, _) USING SEPARATOR=",";

Modifying your code:

LOAD my_file TO VERTEX theVertex VALUES("val1", "val2", "val3", _, _) USING HEADER = “TRUE”;

Skipping an attribute:

A LOAD statement can specify that a particular attribute should not be loaded by using the special character _ (underscore) as its attribute expression (attr_expr). You can not skip the primary key attributes for vertices or discriminator attributes for edges.

For example, the following statement skips the next-to-last attribute. This technique is used when it is known that the input data file does not contain data for every attribute.

LOAD TO VERTEX Person VALUES ($0, $1, _, $2)

If the load operation is creating a new vertex or edge, then the skipped attribute will be assigned the default value.
If the load operation is overwriting an existing vertex or edge, then the skipped attribute will retain its existing value.

Docs: Creating a Loading Job :: GSQL Language Reference

Ultea99 · August 15, 2024, 8:43am

Many, many thanks Jon - greatly appreciated, I’ll give that a try! All the best.

Ultea99 · August 15, 2024, 3:56pm

Hi again Jon,

I’m really sorry: still having issues with getting this very simple example to work. I know I’m doing something wrong, but cannot see my error - and have been going over it so much it feels like my eyeballs are spinning in my head!

Here’s an example of the local CSV file I’m using, test_df.csv:

theID code the_date other_id last_id gender DoB age
aaa bbb 15/08/2024 ccc ddd Female 01/01/1999 25
… etc.

Here’s the vertex that I’ve defined:

GSQL > ls
---- Graph MyGraph
Vertex Types:

VERTEX theVertex(PRIMARY_ID theID STRING, gender STRING, DoB DATETIME, age INT) WITH STATS=“OUTDEGREE_BY_EDGETYPE”, PRIMARY_ID_AS_ATTRIBUTE=“true”
VERTEX … (Etc.)

Here’s the loading job I’m trying to run, based upon your very helpful answer from before:

GSQL > USE GRAPH MyGraph
GSQL > BEGIN
GSQL > CREATE LOADING JOB load_data FOR GRAPH MyGraph {
GSQL > DEFINE FILENAME my_file = “test_df.csv”;
GSQL > LOAD my_file TO VERTEX theVertex VALUES($0, _, _, _, _, $5, $6, $7) USING HEADER = “TRUE”, SEPARATOR = “,”;
GSQL > }
GSQL > END

This gives the following error message:

Semantic Check Fails: values() index number is 8 while vertex ‘Patient’ has 4 columns!
Failed to create loading jobs: [load_data].

I thought I’d managed to get this right at last, but it seems I’m still doing something wrong. Any suggestions most gratefully received!

Jim_Limprasert · August 15, 2024, 11:41pm

Hello! @Ultea99

Based on your schema:

VERTEX theVertex(PRIMARY_ID theID STRING, gender STRING, DoB DATETIME, age INT)

theVertex accepts exactly 4 attributes for ID, gender, DoB, and age.

You cannot load in more than 4 attributes in your loading job for theVertex.

To make it work, the number of attributes in the loading job for each vertex and edge has to match the number of attributes of the corresponding vertex and edge. You should be able to run the below loading job

BEGIN
CREATE LOADING JOB load_data FOR GRAPH MyGraph {
    DEFINE FILENAME my_file = "test_df.csv";
    LOAD my_file TO VERTEX theVertex VALUES($0, $5, _, $7) USING HEADER = "TRUE", SEPARATOR = ",";
}
END

Note - I didn’t try to load in the date of birth, because it was not in the DATETIME format that GSQL accepts. The acceptable DATETIME formats are seen here:

So, you’ll have to convert your date format, like 01/01/1999, into something like 1999-01-01 first in the CSV file before loading it in.

I hope this helps!

Best,
Supawish Limprasert (Jim)
Solution Engineer, TigerGraph

Ultea99 · August 16, 2024, 3:57pm

Thanks so much Jim - really appreciate both your’s and Jon’s help with this. Have a great weekend and thanks again to you both!

Ultea99 · August 19, 2024, 12:46pm

Sorry - one last question, please: regarding how to define a cumulative loading job.

Similar example to the above: the vertex I’m trying to load in to is
VERTEX theVertex(PRIMARY_ID theID STRING, gender STRING, DoB DATETIME, age INT)

I have two CSV files, for example:

Two CSV files, one.csv and two.csv.

test_df1.csv has a header, with columns:
$"theID", $"gender", $"City", $"age"

test_df2.csv has a header, with columns:
$"Title", $"theID", $"DoB", $"Salary"

So I’d like to try and define a cumulative loading job for specific columns from both CSV files. Something like this:

CREATE LOADING JOB load_data FOR GRAPH MyGraph {
    DEFINE FILENAME my_first_file = "test_df1.csv";
    DEFINE FILENAME my_second_file = "test_df2.csv";
    LOAD my_first_file TO VERTEX theVertex VALUES($"theID", $"gender", _, $"age") USING HEADER = "TRUE", SEPARATOR = ",";
    LOAD my_second_file TO VERTEX theVertex VALUES(_, _ $"DoB", _) USING HEADER = "TRUE", SEPARATOR = ",";
}
END

As Jon mentioned, it’s not possible to skip loading the primary ID using an underscore, but is there a way to only load the DoB column (I’ve now fixed the DATETIME format, thanks!) from the second load statement?

With many thanks again and all the best!

Jim_Limprasert · August 19, 2024, 3:20pm

Hi @Ultea99 ,

There is no way to load in just the DoB column from the second load statement without the primary id. I saw that the test_df2.csv file has the theID column, so I’d assume that is something like a foreign key to theID column of test_df1.csv.

In that case, you can do the following for my_second_file:

LOAD my_second_file TO VERTEX theVertex VALUES($"theID", _, $"DoB", _) USING HEADER = "TRUE", SEPARATOR = ",";

Now, if theID column in test_df2.csv can be an empty string, you can use the WHERE clause and a token function to just include the rows of test_df2.csv with non-empty theID.

LOAD my_second_file TO VERTEX theVertex VALUES($"theID", _, $"DoB", _) WHERE gsql_is_not_empty($"theID") USING HEADER = "TRUE", SEPARATOR = ",";

See more details about gsql_is_not_empty function here. The loading job will be able to load in the data either way (with vs without the WHERE clause). However, without the WHERE clause, it might display that some rows failed to load (due to empty PRIMARY ID) while other rows are successful.

Note that, if you try to use a _ for a PRIMARY_ID field of a vertex, you’ll get the following error:

Semantic Check Fails: The place holder "_" cannot appear in primary_id when loading vertex [VERTEX_NAME]

Hope this answers your question!

Best,
Supawish Limprasert (Jim)
Solution Engineer, TigerGraph

Ultea99 · August 20, 2024, 8:29am

Many thanks again for all your help with this!