S3 Loader not working

Hi,

I am trying to follow the guidance here: https://docs.tigergraph.com/dev/data-loader-user-guides/s3-loader-user-guide to load CSV data stored in an S3 bucket. I’m using the Developer Edition locally inside a Docker container.

For now I am just using the CSVs from GSQL101, but I cannot successfully create a loading job.

I’ve defined two data sources and pointed them to my s3.config JSON file:
GSQL-Dev > SHOW DATA_SOURCE *

Data Sources:

 - s3 friendship_src ({"file.reader.settings.fs.s3a.access.key":"AKIA********","file.reader.settings.fs.s3a.secret.key":"0exj********"})

 - s3 person_src ({"file.reader.settings.fs.s3a.access.key":"AKIA********","file.reader.settings.fs.s3a.secret.key":"0exj********"})
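
For completeness, I created these with something like the following (the exact path to my s3.config is from my setup; syntax per the S3 loader guide):

```
CREATE DATA_SOURCE S3 person_src = "/home/tigergraph/data/s3.config" FOR GRAPH social
CREATE DATA_SOURCE S3 friendship_src = "/home/tigergraph/data/s3.config" FOR GRAPH social
```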

I’ve also created the S3 file config files, one for each CSV (bucket path replaced): files_person.config.json and files_friendship.config.json. For example:

{
  "tasks.max": 1,
  "file.uris": "s3://path/to/csv/person.csv",
  "file.regexp": ".*",
  "file.recursive": false,
  "file.scan.interval.ms": 60000,
  "file.reader.type": "text",
  "file.reader.batch.size": 10000,
  "file.reader.text.archive.type": "auto",
  "file.reader.text.archive.extensions.tar": "tar",
  "file.reader.text.archive.extensions.zip": "zip",
  "file.reader.text.archive.extensions.gzip": "tar.gz,tgz"
}

When I try to create the loading job using both files simultaneously:

**GSQL-Dev >**  USE GRAPH social

Using graph 'social'

**GSQL-Dev >**  BEGIN

**GSQL-Dev >**  CREATE LOADING JOB load_social FOR GRAPH social {

**GSQL-Dev >**  DEFINE FILENAME f1 = "$person_src:/home/tigergraph/data/files_person.config.json";

**GSQL-Dev >**  DEFINE FILENAME f2 = "$friendship_src:/home/tigergraph/data/files_friendship.config.json";

**GSQL-Dev >**  LOAD f1 TO VERTEX person VALUES ($"name", $"name", $"age", $"gender", $"state") USING header="true", separator=",";

**GSQL-Dev >**  LOAD f2 TO EDGE friendship VALUES ($0, $1, $2) USING header="true", separator=",";

**GSQL-Dev >**  }

**GSQL-Dev >**  END

I get the error:

Semantic Check Fails: The data source in the CREATE LOADING JOB is not unique, or mixed with regular file loading.

The job load_social could not be created!

And when I try to load only f1 on its own, I get a different error:

Semantic Check Fails: Exception occurred trying to read $person_src:/home/tigergraph/data/files_person.config.json, error: java.lang.RuntimeException: The loading job used file HEADER but there is no local path in the provided "$person_src:/home/tigergraph/data/files_person.config.json"

The job load_social could not be created!

Could someone help point me in the right direction here? It seems like it isn’t recognising that it should go to look for the data in S3, but otherwise I’m not sure where I’m going wrong.

Thanks!

Hi Matthew,

Currently HEADER is not supported in S3 loading jobs, since the mapping of header names to column numbers is resolved at the time the loading job is created, and the header info is not available then for S3 sources. You can use a user-defined header or column indexes to work around this.

Best Wishes,

Dan

Thanks Dan.

This got me further, and I can now define two loading jobs, one for each csv.

I also had to change my file.uris to point to s3a://… (both s3:// and s3n:// result in connection errors).

Now the loading job will run, and instantly loads the 7 lines of the person.csv.

However, it does not complete the process, and instead slowly counts up more loaded lines than there are in the file.


Any idea what is causing this behaviour? I haven’t changed the CSV from what is provided in the Docker image, other than to remove the column headers.

Thanks!

Actually, I didn’t specify the EOF="true" parameter when running my job, so it was running in streaming mode!
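
For reference, this is the command that runs it as a one-off batch load rather than a continuous stream:

```
RUN LOADING JOB load_social_s3_vertex USING EOF="true"
```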

Hi Matthew,

Thanks for pointing out the wrong line-count issue. This is a bug we fixed in our latest 2.4.1 release. It only happens when loading very small files. Please download our latest 2.4.1 package at: http://dl.tigergraph.com/tigergraph-developer-latest.tar.gz

Best,
Renchu

Thanks for the update!

I think I am nearly there now, as the load from S3 happens and reads in the correct number of rows when I use EOF="true".

However, no vertices are being created after the load.

This is my loading job:
CREATE LOADING JOB load_social_s3_vertex FOR GRAPH social {
    DEFINE HEADER person_header = "name", "age", "gender", "state";
    DEFINE FILENAME file1 = "$person_s3:/home/tigergraph/data/files_person.config.json";
    LOAD file1 TO VERTEX person VALUES($"name", $"name", $"age", $"gender", $"state") USING USER_DEFINED_HEADER="person_header", SEPARATOR=",", HEADER="false", EOL="\n";
}

As you can see, I have defined a header and tried to use it to load the vertex, but nothing gets created. I’ve also tried omitting the user-defined header and just assigning columns by position using $0, $0, $1, $2, $3, but this also did not create any vertices.

Do you have any guidance on where I am going wrong here?

Many thanks!

Hi Matthew,

Can you show more about the loading status? Does it report any errors?

Best Wishes,

Dan

Hi Dan,

The loading status looks normal to me; I get no error messages in GSQL.


However, no lines are loaded into the person vertex:
GSQL-Dev > select count() from person

[{

 "count": 0,

 "v_type": "person"

}]

This also happens using the column-number indexing to assign values, and it happens whether I use an S3 file or a local file. However, if I use a local file with a header and use HEADER="true", then vertices are created, so I think it must be to do with the LOAD statement.

This is what I’m currently using:

CREATE LOADING JOB load_social_s3_vertex FOR GRAPH social {

Hi Matthew,

Can you check the loading summary from the following link? It will provide more detailed info about the loading status

curl -X GET "localhost:8123/gsql/loadingjob?graph=xxx&jobId=xxx&action=checkprogress"

Please replace the graph with your graph name and jobid with your previous loading jobid.

Best Wishes,

Dan

Hi Dan,

Unfortunately that request is returning an empty response. However, I did find the log trace in case that helps show what is going on:

I@20190730 15:02:33.055  (BaseHandler.java:16) BaseHandler: f

I@20190730 15:02:33.056 tigergraph|127.0.0.1:37650|249863801 (Util.java:314) GSHELL_TEST is empty

I@20190730 15:02:33.056 tigergraph|127.0.0.1:37650|249863801 (Util.java:335) COMPILE_THREADS is empty

I@20190730 15:02:33.056 tigergraph|127.0.0.1:37650|249863801 (Util.java:349) FromGraphStudio is empty

I@20190730 15:02:33.057 tigergraph|127.0.0.1:37650|249863801 (CommandHandler.java:55) RUN LOADING JOB load_social_s3_vertex USING EOF="true"

I@20190730 15:02:33.059 tigergraph|127.0.0.1:37650|249863801 (CatalogLock.java:223) Lock: Short && Read

I@20190730 15:02:33.137 tigergraph|127.0.0.1:37650|249863801 (CatalogManager.java:811) getCatalog: social

I@20190730 15:02:33.138 tigergraph|127.0.0.1:37650|249863801 (Catalog.java:12916) RunLoadingJobs

I@20190730 15:02:33.138 tigergraph|127.0.0.1:37650|249863801 (Catalog.java:14391) Start CheckLoadingJob

I@20190730 15:02:33.144 tigergraph|127.0.0.1:37650|249863801 (LoadingUtils.java:46) FILENAME: $person_s3:/home/tigergraph/data/files_person.config.json configFile is not JSON string

I@20190730 15:02:33.148 tigergraph|127.0.0.1:37650|249863801 (LoadingUtils.java:46) JsonFile: /home/tigergraph/data/files_person.config.json configFile is not JSON string

I@20190730 15:02:33.152 tigergraph|127.0.0.1:37650|249863801 (BaseLoadingJob.java:215) Start Loading job load_social_s3_vertex

I@20190730 15:02:33.153 tigergraph|127.0.0.1:37650|249863801 (LoadingUtils.java:46) JsonFile: /home/tigergraph/data/files_person.config.json configFile is not JSON string

I@20190730 15:02:33.154 tigergraph|127.0.0.1:37650|249863801 (S3LoadingJob.java:208) Start s3 loading ...

I@20190730 15:02:33.154 tigergraph|127.0.0.1:37650|249863801 (LoadingJobRunTimeConfig.java:110) The created s3 job ID: social.load_social_s3_vertex.s3.person_s3.1564498953154

I@20190730 15:02:33.155 tigergraph|127.0.0.1:37650|249863801 (S3LoadingJob.java:70) Sending s3 loading job

I@20190730 15:02:33.175 tigergraph|127.0.0.1:37650|249863801 (BaseLoadingJob.java:128) start url = http://127.0.0.1:30003/connectors

I@20190730 15:02:33.175 tigergraph|127.0.0.1:37650|249863801 (BaseLoadingJob.java:129) payload = {"name":"social.load_social_s3_vertex.s3.person_s3.1564498953154.file1","config":{"file.uris":"s3a://path/to/csv/person.csv","file.reader.text.eol":"\\n","connector.class":"com.tigergraph.kafka.connect.wrapper.WrapperSourceConnector","tg.loading.job.filename":"file1","file.regexp":".*","tg.loading.job.sep":",","file.reader.batch.size":5000,"tg.loading.job.eol":"\\n","tasks.max":10,"tg.topic":"loading-log","file.reader.settings.fs.s3a.secret.key":"****","tg.mode":"eof","file.reader.type":"text","mode":"eof","tg.record.key":"social.load_social_s3_vertex.s3.person_s3.1564498953154","tg.reader.class":"com.tigergraph.kafka.connect.fs.FsSourceConnector","file.reader.settings.fs.s3a.access.key":"****","file.reader.text.header":"false","tg.loading.job.graph.name":"social","file.recursive":false,"tg.loading.job.name":"load_social_s3_vertex","topic":"s3","tg.connection.servers":"127.0.0.1:9000","tg.connection.security":false}}

I@20190730 15:02:34.058 tigergraph|127.0.0.1:37650|249863801 (BaseLoadingJob.java:131) ret = {"name":"social.load_social_s3_vertex.s3.person_s3.1564498953154.file1","config":{"file.uris":"s3a://path/to/csv/person.csv","file.reader.text.eol":"\\n","connector.class":"com.tigergraph.kafka.connect.wrapper.WrapperSourceConnector","tg.loading.job.filename":"file1","file.regexp":".*","tg.loading.job.sep":",","file.reader.batch.size":"5000","tg.loading.job.eol":"\\n","tasks.max":"10","tg.topic":"loading-log","file.reader.settings.fs.s3a.secret.key":"****","tg.mode":"eof","file.reader.type":"text","mode":"eof","tg.record.key":"social.load_social_s3_vertex.s3.person_s3.1564498953154","tg.reader.class":"com.tigergraph.kafka.connect.fs.FsSourceConnector","file.reader.settings.fs.s3a.access.key":"****","file.reader.text.header":"false","tg.loading.job.graph.name":"social","file.recursive":"false","tg.loading.job.name":"load_social_s3_vertex","topic":"s3","tg.connection.servers":"127.0.0.1:9000","tg.connection.security":"false","name":"social.load_social_s3_vertex.s3.person_s3.1564498953154.file1"},"tasks":[]}

I@20190730 15:02:34.068 tigergraph|127.0.0.1:37650|249863801 (BaseLoadingProgress.java:126) DisplayProgress, session = 249863801

I@20190730 15:02:34.300 tigergraph|127.0.0.1:37650|249863801 (LoadingUtils.java:764) Get s3 progress exception:

java.lang.RuntimeException: Failed to GET http://127.0.0.1:30004/log-aggregation/loading-progress/social.load_social_s3_vertex.s3.person_s3.1564498953154. HTTP error code : 404

I@20190730 15:02:36.319 tigergraph|127.0.0.1:37650|249863801 (LoadingUtils.java:764) Get s3 progress exception:

java.lang.RuntimeException: Failed to GET http://127.0.0.1:30004/log-aggregation/loading-progress/social.load_social_s3_vertex.s3.person_s3.1564498953154. HTTP error code : 404

I@20190730 15:02:38.332 tigergraph|127.0.0.1:37650|249863801 (LoadingUtils.java:764) Get s3 progress exception:

java.lang.RuntimeException: Failed to GET http://127.0.0.1:30004/log-aggregation/loading-progress/social.load_social_s3_vertex.s3.person_s3.1564498953154. HTTP error code : 404

I@20190730 15:02:40.988 tigergraph|127.0.0.1:37650|249863801 (S3LoadingProgress.java:276) Delete connector 'social.load_social_s3_vertex.s3.person_s3.1564498953154.file1' failed: java.lang.RuntimeException: Failed to DELETE http://127.0.0.1:30003/connectors/social.load_social_s3_vertex.s3.person_s3.1564498953154.file1. HTTP error code : 204

I@20190730 15:02:40.991 tigergraph|127.0.0.1:37650|249863801 (CatalogLock.java:273) Unlock: Short && Read

Hi Matthew,

Thanks for sending us the logs. Can you also run the following command?

curl -X GET http://localhost:30004/log-aggregation/loading-progress/job-id

Best Wishes,

Dan

{  
   "overall":{  
      "id":"overall",
      "totalTask":1,
      "completedTask":1,
      "completed":true,
      "size":119,
      "totalSize":119,
      "progress":1.0,
      "startTime":1564563195882,
      "endTime":1564563196306,
      "duration":424,
      "currentSpeed":33,
      "averageSpeed":17,
      "statistics":{  
         "validLine":7,
         "rejectLine":0,
         "invalidJson":0,
         "oversizeToken":0,
         "notEnoughToken":0,
         "failedConditionLine":0,
         "edge":[  

         ],
         "vertex":[  
            {  
               "typeName":"person",
               "validObject":0,
               "noIdFound":0,
               "invalidAttribute":7,
               "invalidPrimaryId":0,
               "invalidSecondaryId":0,
               "incorrectFixedBinaryLength":0
            }
         ],
         "deleteEdge":[  

         ],
         "deleteVertex":[  

         ],
         "tempTables":[  

         ]
      },
      "currentValidLine":0,
      "lastCurrentSpeedUpdate":1564563196521
   },
   "workers":[  
      {  
         "overall":{  
            "id":"social.load_social_s3_vertex.s3.person_s3.1564563191764.file1",
            "totalTask":1,
            "completedTask":1,
            "completed":true,
            "size":119,
            "totalSize":119,
            "progress":1.0,
            "startTime":1564563195882,
            "endTime":1564563196282,
            "duration":400,
            "currentSpeed":18,
            "averageSpeed":18,
            "statistics":{  
               "validLine":7,
               "rejectLine":0,
               "invalidJson":0,
               "oversizeToken":0,
               "notEnoughToken":0,
               "failedConditionLine":0,
               "edge":[  

               ],
               "vertex":[  
                  {  
                     "typeName":"person",
                     "validObject":0,
                     "noIdFound":0,
                     "invalidAttribute":7,
                     "invalidPrimaryId":0,
                     "invalidSecondaryId":0,
                     "incorrectFixedBinaryLength":0
                  }
               ],
               "deleteEdge":[  

               ],
               "deleteVertex":[  

               ],
               "tempTables":[  

               ]
            },
            "currentValidLine":0,
            "lastCurrentSpeedUpdate":1564563196521
         },
         "tasks":[  
            {  
               "id":"social.load_social_s3_vertex.s3.person_s3.1564563191764.file1.0",
               "totalTask":1,
               "completedTask":1,
               "completed":true,
               "size":119,
               "totalSize":119,
               "progress":1.0,
               "startTime":1564563195882,
               "endTime":1564563196282,
               "duration":400,
               "currentSpeed":18,
               "averageSpeed":18,
               "statistics":{  
                  "validLine":7,
                  "rejectLine":0,
                  "invalidJson":0,
                  "oversizeToken":0,
                  "notEnoughToken":0,
                  "failedConditionLine":0,
                  "edge":[  

                  ],
                  "vertex":[  
                     {  
                        "typeName":"person",
                        "validObject":0,
                        "noIdFound":0,
                        "invalidAttribute":7,
                        "invalidPrimaryId":0,
                        "invalidSecondaryId":0,
                        "incorrectFixedBinaryLength":0
                     }
                  ],
                  "deleteEdge":[  

                  ],
                  "deleteVertex":[  

                  ],
                  "tempTables":[  

                  ]
               },
               "currentValidLine":0,
               "lastCurrentSpeedUpdate":1564563196513
            }
         ]
      }
   ]
}

So it appears that all the attributes are invalid. I was able to fix this by converting the age field to an int using gsql_to_int when defining my loading job. Now it works!
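
For anyone following along, the fixed LOAD line looks something like this (same job as above, with the age column converted explicitly):

```
LOAD file1 TO VERTEX person VALUES($"name", $"name", gsql_to_int($"age"), $"gender", $"state")
    USING USER_DEFINED_HEADER="person_header", SEPARATOR=",", HEADER="false", EOL="\n";
```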

Thanks for the help on this, much appreciated.

Hopefully a final question on this - now I’ve confirmed these S3 loading jobs work via the GSQL terminal, am I able to run them via the REST API?

Hi Matthew,

We have a couple of APIs that can help you with this. All of them can be accessed through http://localhost:14240/gsqlserver .

START

URL: POST /gsql/loadingjobs?graph=GRAPH_NAME&action=start

Payload:

{
  "jobs": [
    {
      "name": JOB_NAME,
      "streaming": true|false,
      "dataSources": [{
        "filename": FILENAME,
        "name": DATA_SOURCE,
        "path": PATH_TO_S3
      }]
    }
  ]
}

PAUSE

URL: GET /gsql/loadingjobs?graph=GRAPH_NAME&action=pause&jobId=ID1&jobId=ID2

RESUME

URL: GET /gsql/loadingjobs?graph=GRAPH_NAME&action=resume&jobId=ID1&jobId=ID2

STOP

URL: GET /gsql/loadingjobs?graph=GRAPH_NAME&action=stop&jobId=ID1&jobId=ID2

Example: curl -u tigergraph:tigergraph "http://localhost:14240/gsqlserver/gsql/loadingjobs?graph=MyGraph&action=pause&jobId=abcdef" (note the URL must be quoted so the shell does not interpret the ampersands)
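
Putting the START call together for the job discussed in this thread, the request would look roughly like this (the graph, job, data-source, and path values are taken from the earlier posts; adjust them to your setup):

```
curl -u tigergraph:tigergraph -X POST \
  "http://localhost:14240/gsqlserver/gsql/loadingjobs?graph=social&action=start" \
  -d '{
    "jobs": [{
      "name": "load_social_s3_vertex",
      "streaming": false,
      "dataSources": [{
        "filename": "file1",
        "name": "person_s3",
        "path": "s3a://path/to/csv/person.csv"
      }]
    }]
  }'
```

With "streaming": false the job behaves like EOF="true" in GSQL, i.e. a one-off batch load rather than a continuous stream.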