GSE Down and restart not working

shawnngtq · April 20, 2021, 7:17am

I am testing enterprise version 3.1.1.

When I run ./gadmin status, I see that GSE is Down for Service Status and Stopped for Process State.

So I tried to restart all services via ./gadmin restart all, but I encountered this error:

[Error] ExternalError (Failed to send ServiceCommand to executor, spanId [invoker]@1618902544901995240:service-stop; Failed to send rpc /tigergraph.tutopia.common.pb.ExecutorService/StopExecutables to 127.0.0.1:9177; rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:9177: connect: connection refused"; Incomplete (Failed to execute the StopExecutable cmd in instance EXE_1))

So I killed all tigergraph processes with the Linux kill command and restarted all the services, but GSE was still down. I then checked the log in tigergraph/log/gse/GSE_1#1.out, where I see:

MessageQueue|ZMQ|Context_New
GraphSQL 3.0 ID Service:
 --- Version ---
...

TigerGraph 3.0 ID Service:
 --- Version ---
...

System hard limit of file descriptors is 4096. The expected is no less than 65535
MessageQueue|ZMQ|Context_Destory

I am not sure how to fix this, since stopping and restarting the services doesn't work.

Do let me know if you need additional info. Thanks!

Jon_Herke · April 22, 2021, 5:01pm

Error message: The system hard limit for file descriptors is too low. Can you increase that limit?

Docs: https://docs.tigergraph.com/start/get-started/docker#3-run-tigergraph-docker-image-as-a-daemon

Run TigerGraph Docker image as a daemon

Run the following command to pull the TigerGraph docker image, bind ports, map a shared data folder, and start a container from the image. Note: this command is very long; please make sure you copy the whole command by dragging the scroll bar to the end:

$ docker run -d -p 14022:22 -p 9000:9000 -p 14240:14240 --name tigergraph --ulimit nofile=1000000:1000000 -v ~/data:/home/tigergraph/mydata -t docker.tigergraph.com/tigergraph:latest

Here is a breakdown of the options and arguments in the command:

-d : make the container run in the background.
-p : map docker 22 port to host OS 14022 port, 9000 port to host OS 9000 port, 14240 port to host OS 14240 port.
--name : name the container tigergraph.
--ulimit : set the ulimit (the number of open file descriptors per process) to 1 million.
-v : mount the host OS ~/data folder to the docker /home/tigergraph/mydata folder using the -v option. If you are using Windows, change the above ~/data to something using windows file system convention, e.g. c:\data
-t : allocate a pseudo-TTY
docker.tigergraph.com/tigergraph:latest : download the latest docker image from the TigerGraph docker registry URL docker.tigergraph.com/tigergraph.

shawnngtq · April 23, 2021, 9:11am

Thanks!

FYI for future reference.

I am using a native installation on Red Hat 7.9 instead of the Docker image, due to environment restrictions. Here is what my server reports for ulimit:

$ su - tigergraph
$ ulimit
unlimited
$ ulimit -a -H
core file size          (blocks, -c) 5000000
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514933
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1000000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 102400
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
$ ulimit -a -S
core file size          (blocks, -c) 5000000
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514933
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1000000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 102400
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
$ cat /proc/sys/fs/file-max
1000000
$ sysctl fs.file-max
fs.file-max = 1000000
$ vim /etc/security/limits.conf
...
gpadmin soft nofile 65536
gpadmin hard nofile 65536
gpadmin soft nproc 131072
gpadmin hard nproc 131072
* - nofile 65536

The tigergraph user nofile value is 1000000, so I thought that it shouldn’t be a ulimit issue. But after checking, I realized the system hard limit is 4096. The resolution is to change that to a value >= 65535.
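For anyone hitting the same log line, here is a minimal sketch of that check, assuming bash on Linux. The function names and the REQUIRED_NOFILE variable are hypothetical helpers for illustration, not part of TigerGraph; the 65535 threshold is the one quoted in the GSE log. Note that an interactive ulimit only reports the limit of that shell. A process started another way can see a different value, which is what /proc/<pid>/limits shows.

```shell
#!/usr/bin/env bash
# Minimum hard limit named in the GSE log message (taken from the thread).
REQUIRED_NOFILE=65535

# Print the hard "Max open files" value from a /proc/<pid>/limits style file.
# Fields on that line are: Max open files <soft> <hard> <units>.
hard_nofile_from_limits() {
  awk '/^Max open files/ { print $5 }' "$1"
}

# Succeed when a limit value ("unlimited" or a number) meets the requirement.
# The second argument is optional and defaults to REQUIRED_NOFILE.
nofile_ok() {
  [ "$1" = "unlimited" ] && return 0
  [ "$1" -ge "${2:-$REQUIRED_NOFILE}" ] 2>/dev/null
}

# Hard limit as seen by the current shell.
current=$(ulimit -Hn)
if nofile_ok "$current"; then
  echo "hard nofile limit $current is sufficient"
else
  echo "hard nofile limit $current is below $REQUIRED_NOFILE"
fi
```

To inspect a process that is already running, pass its limits file instead, for example hard_nofile_from_limits /proc/self/limits for the current shell, or replace self with the PID of the service in question.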

shawnngtq · May 10, 2021, 10:45am

@Jon_Herke, the sshd_config on my test server was recently changed, and port 22 is no longer allowed.

If I try to start the server, here is what I see:

$ ./tigergraph/app/3.1.1/cmd/gadmin start all
[Info] Starting EXE
[Error] ExternalError (Failed to start executor(s); Failed to ssh to 127.0.0.1 with given credential; dial tcp 127.0.0.1:22: connect: connection refused)

So I reinstalled it. This time, when asked for the SSH port number, I used 22222 (for example) instead of the default 22. But I am still seeing the same error, with one additional line:

[Error]: Failed to initialize the cluster, please check the error message and initialize again.

How can I resolve this?

Chengbiao_Jin · May 10, 2021, 5:50pm

@shawnngtq are you able to connect over SSH with the command ssh -i ~/.ssh/tigergraph_rsa tigergraph@127.0.0.1 -p 22222 ?

Chengbiao_Jin · May 10, 2021, 6:30pm

@shawnngtq if port 22222 works, you can also try gadmin config entry System.SSH.Port --file ~/.tg.cfg to configure the port, and then restart.
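Before testing credentials, it can help to confirm that something is listening on the port at all, since "connection refused" in the errors above points to the port rather than the key. A rough sketch, assuming bash and the coreutils timeout command are available; port_open is a hypothetical helper name, and the ports are the ones mentioned in this thread:

```shell
#!/usr/bin/env bash
# Succeed if a TCP connection to host:port can be opened within 3 seconds.
# Relies on bash's /dev/tcp redirection, so it needs bash rather than plain sh.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Check the default SSH port and the alternative port used in the thread.
for port in 22 22222; do
  if port_open 127.0.0.1 "$port"; then
    echo "127.0.0.1:$port accepts connections"
  else
    echo "127.0.0.1:$port refused or timed out"
  fi
done
```

If the port accepts connections but the ssh command still fails, the problem is more likely the key, the user, or the sshd configuration than the port number.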

shawnngtq · May 24, 2021, 8:19am

@Chengbiao_Jin,

When I install the software using the following input:

accept license: y
custom license: 
default user: tigergraph
default app:
default data:
default log:
default tmp:
default port:
node: 1
default ip:

At the end of the installation, when the services are started, I see:

[Error] ExternalError (Failed to start executor(s); Failed to ssh to 127.0.0.1 with given credential; ssh: handshake failed: read tcp 127.0.0.1:14232->127.0.0.1:22: read: connection reset by peer.

So I changed Hostname, SignatureAlgorithm, ConfigFileRelativePath and Port in the /home/tigergraph/.tg.cfg file.

When I run ./gadmin start all, I now see this new error instead of the SSH handshake failure:

[  Error] Timeout (The StartExecutable cmd execution gets error in instance EXE_1: Tmeout(1m0s) when waiting executable ZK#1:check_ready to finish)

So I checked the log at path/tigergraph/log/zk/ZK#1.out:

...
Using config: path/tigergraph/data/configs/zk/conf/zoo.cfg
grep: path/tigergraph/data/configs/zk/conf/zoo.cfg: No such file or directory
mkdir: cannot create directory `': No such file or directory
...

Then I checked path/tigergraph/data/configs and saw only tg.cfg*. The etcd, kafka, nginx and zk directories that are supposed to be there are missing.
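The directory check described above can be scripted roughly as follows. This is only a sketch: missing_config_dirs is a hypothetical helper, and the list of expected directories is taken from this post rather than from TigerGraph documentation.

```shell
#!/usr/bin/env bash
# Print each expected service directory that is missing under a configs root.
# The names (etcd, kafka, nginx, zk) are the ones listed in the post.
missing_config_dirs() {
  local root=$1 name
  for name in etcd kafka nginx zk; do
    [ -d "$root/$name" ] || echo "$name"
  done
}

# Check the directory given as the first argument, or the current directory.
missing_config_dirs "${1:-.}"
```

Run against the configs directory from the post (path/tigergraph/data/configs), it should list all four names when only tg.cfg* is present, and print nothing once the directories have been generated.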

Could you share installation steps that avoid problems related to sudo user permissions, IP address, or port number under these server restrictions?

alonharell · September 4, 2022, 2:21pm

Hey @shawnngtq, did you ever solve the problem?
I'm encountering the same Timeout error.

chris · October 14, 2022, 11:18am

If you are using a config file, take a look at it: after the installation (whether it failed or succeeded), the password is edited by TigerGraph and is no longer valid, so you will need to type it again.