TigerGraph server unresponsive when doing large scale ingest

I have been trying to use the JDBC library to write 16 million vertices to TigerGraph. I have configured the job every which way, while keeping to 2 executor instances and 2 cores per executor. I have managed to get 6 or 8 million vertices written, but no matter what, I always end up in a state where the job seems stuck and the server seems unresponsive. I can see that only two executor nodes are being used, but beyond that the Spark logs don’t give me any feedback. It seems almost like the Spark job is blocked by something. Any suggestions?

@nquinn To find a resolution you would most likely need to look into the log files.

To get the log, you need to:

  • Set "debug" -> "2" (2 = INFO)
  • Set "logFilePattern" -> "path/to/log/file"

https://github.com/tigergraph/ecosys/tree/master/tools/etl/tg-jdbc-driver#connection-properties
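For reference, those two properties can be passed to the driver like any other JDBC connection property. The sketch below only builds the Properties object (no connection is opened); the URL, credentials, graph name, and log path are placeholders, and the meaning of debug level 2 is taken from the note above:

```java
import java.util.Properties;

public class TgJdbcLogging {
    // Build tg-jdbc-driver connection properties with driver logging enabled.
    public static Properties buildProperties() {
        Properties props = new Properties();
        props.setProperty("username", "tigergraph");   // placeholder credentials
        props.setProperty("password", "tigergraph");
        props.setProperty("graph", "MyGraph");         // placeholder graph name
        props.setProperty("debug", "2");               // 2 = INFO, per the note above
        props.setProperty("logFilePattern", "/tmp/tg-jdbc.log"); // placeholder path
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProperties();
        System.out.println("debug=" + props.getProperty("debug"));
        // The properties would then go to DriverManager.getConnection(url, props).
    }
}
```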

I’m not sure if you’re a customer, if you are we could get someone from TigerGraph to help analyze.

My email is jon.herke at tigergraph.com if you want to reply privately.

I am noticing that it writes fine until about 80K vertices. I do see a max of 1K writes per POST request, but once it gets to roughly the 80K-vertex mark, it seems to hit a wall.
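For a sense of scale, at 1K vertices per POST the full load works out to a lot of requests hitting RESTPP. This is just arithmetic on the figures from this thread (16M vertices, 2 executors × 2 cores):

```java
public class LoadMath {
    // Number of POST requests needed at a given batch size.
    public static long requests(long totalVertices, int batchSize) {
        return totalVertices / batchSize;
    }

    public static void main(String[] args) {
        long total = 16_000_000L;  // vertices in this thread's load
        int batch = 1_000;         // observed max writes per POST
        int tasks = 2 * 2;         // 2 executors x 2 cores
        long posts = requests(total, batch);
        System.out.println(posts + " POSTs total, ~" + posts / tasks + " per task");
        // prints: 16000 POSTs total, ~4000 per task
    }
}
```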

INFO: Accepted vertices: 80,666, accepted edges: 0

I am noticing that even with a large client timeout like 30 seconds, the RESTPP server will eventually become unresponsive after multiple large batches. It looks something like this, then spins through the retry cycle (5, 10, 20, 40, … seconds) and dies.

2023-10-19 22:00:43 [ERROR] RESTPP busy, failed to execute query. Retrying in 5 seconds… Failed to send request: 408: Timeout.
Oct 19, 2023 10:00:43 PM com.tigergraph.jdbc.log.JULAdapter error
SEVERE: RESTPP busy, failed to execute query. Retrying in 5 seconds… Failed to send request: 408: Timeout.
2023-10-19 22:00:47 [ERROR] RESTPP busy, failed to execute query. Retrying in 5 seconds… Failed to send request: 408: Timeout.
Oct 19, 2023 10:00:47 PM com.tigergraph.jdbc.log.JULAdapter error
SEVERE: RESTPP busy, failed to execute query. Retrying in 5 seconds… Failed to send request: 408: Timeout.
2023-10-19 22:01:04 [ERROR] RESTPP busy, failed to execute query. Retrying in 10 seconds… Failed to send request: 408: Timeout.
Oct 19, 2023 10:01:04 PM com.tigergraph.jdbc.log.JULAdapter error
SEVERE: RESTPP busy, failed to execute query. Retrying in 10 seconds… Failed to send request: 408: Timeout.
2023-10-19 22:01:08 [ERROR] RESTPP busy, failed to execute query. Retrying in 10 seconds… Failed to send request: 408: Timeout.
Oct 19, 2023 10:01:08 PM com.tigergraph.jdbc.log.JULAdapter error
SEVERE: RESTPP busy, failed to execute query. Retrying in 10 seconds… Failed to send request: 408: Timeout.
2023-10-19 22:01:31 [ERROR] RESTPP busy, failed to execute query. Retrying in 20 seconds… Failed to send request: 408: Timeout.
Oct 19, 2023 10:01:31 PM com.tigergraph.jdbc.log.JULAdapter error
SEVERE: RESTPP busy, failed to execute query. Retrying in 20 seconds… Failed to send request: 408: Timeout.
2023-10-19 22:01:35 [ERROR] RESTPP busy, failed to execute query. Retrying in 20 seconds… Failed to send request: 408: Timeout.
Oct 19, 2023 10:01:35 PM com.tigergraph.jdbc.log.JULAdapter error
SEVERE: RESTPP busy, failed to execute query. Retrying in 20 seconds… Failed to send request: 408: Timeout.
2023-10-19 22:02:07 [ERROR] RESTPP busy, failed to execute query. Retrying in 40 seconds… Failed to send request: 408: Timeout.
Oct 19, 2023 10:02:07 PM com.tigergraph.jdbc.log.JULAdapter error
SEVERE: RESTPP busy, failed to execute query. Retrying in 40 seconds… Failed to send request: 408: Timeout.
2023-10-19 22:02:11 [ERROR] RESTPP busy, failed to execute query. Retrying in 40 seconds… Failed to send request: 408: Timeout.
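The retry intervals in that log (5, 10, 20, 40 seconds) follow a doubling pattern. Here is a minimal sketch of that kind of exponential backoff — my own illustration of the behavior, not the driver's actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class RetryBackoff {
    // Produce the doubling retry delays seen in the log: 5, 10, 20, 40, ...
    public static List<Integer> delays(int initialSeconds, int maxRetries) {
        List<Integer> out = new ArrayList<>();
        int delay = initialSeconds;
        for (int i = 0; i < maxRetries; i++) {
            out.add(delay);
            delay *= 2; // double the wait after each failed attempt
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(delays(5, 4)); // prints: [5, 10, 20, 40]
    }
}
```

Once the retries are exhausted, the job gives up — which matches the "spin in cycle and then die" behavior described above.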

I increased the GSQL timeout to 60 seconds and now I am getting the same error, but instead of 408: Timeout I am getting 503: Kafka protection enabled. Do you have any idea what this means or how to work around it?

A 408 status code indicates that the client (usually a web browser or application) did not produce a complete request within the time the server was prepared to wait. In simpler terms, the client took too long to send the request, and the server’s patience ran out. This can happen if the client’s network connection is slow, or if there’s a problem with the client’s request.

This seems to have been resolved when you adjusted the GSQL timeout.

A 503 status code indicates that the server is temporarily unable to handle the request. This could be due to various reasons, such as server overload, maintenance, or some other temporary condition. It tells the client that the server is aware of the request but cannot fulfill it at the moment; the client can usually retry later.

@nquinn Can you run gadmin status to confirm all processes are online?

I ran gadmin status and all services are online and running. If 503 can mean all of those different things, why does it say “Kafka protection enabled”?

The only way we were able to resolve this was with TigerGraph support. We found that the Kafka queue was larger than the remaining EBS storage; even though we thought we had plenty, we still had to increase the volume to accommodate the Kafka queue size. There was also an issue where it was picking up empty Kafka topics, which required a restart of the GPE server. After that, we were able to complete our ingest.
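For anyone hitting the same 503, a rough sanity check is to compare the on-disk size of the Kafka queue against the remaining free space on the volume. This is a generic sketch; the Kafka data path below is a guess at a default install layout, so adjust it to your own data root:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DiskCheck {
    // Sum the size of all regular files under a directory.
    public static long dirSizeBytes(Path dir) throws IOException {
        try (var files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical Kafka data directory; check your install's data root.
        Path kafkaDir = Paths.get("/home/tigergraph/tigergraph/data/kafka");
        if (!Files.exists(kafkaDir)) {
            System.out.println("Adjust kafkaDir to your install's Kafka data root.");
            return;
        }
        long queueSize = dirSizeBytes(kafkaDir);
        long freeSpace = Files.getFileStore(kafkaDir).getUsableSpace();
        System.out.printf("Kafka queue: %d bytes, free: %d bytes%n", queueSize, freeSpace);
        if (queueSize > freeSpace) {
            System.out.println("WARNING: queue larger than remaining space; grow the volume.");
        }
    }
}
```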

@nquinn Thank you for coming back and updating the community on the resolution steps you took!