When I run ./gadmin status, I see that GSE is Down for Service Status and Stopped for Processed State.
So I tried to restart all services via ./gadmin restart all, but I encountered this error:
[Error] ExternalError (Failed to send ServiceCommand to executor, spanId [invoker]@1618902544901995240:service-stop; Failed to send rpc /tigergraph.tutopia.common.pb.ExecutorService/StopExecutables to 127.0.0.1:9177; rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:9177: connect: connection refused"; Incomplete (Failed to execute the StopExecutable cmd in instance EXE_1))
So I killed all TigerGraph processes with the Linux kill command and restarted all the services, but GSE is still down. Checking the log in tigergraph/log/gse/GSE_1#1.out, I see:
MessageQueue|ZMQ|Context_New
GraphSQL 3.0 ID Service:
--- Version ---
...
TigerGraph 3.0 ID Service:
--- Version ---
...
System hard limit of file descriptors is 4096. The expected is no less than 65535
MessageQueue|ZMQ|Context_Destory
I am not sure how to fix this, since stopping and restarting the services doesn’t work.
Do let me know if you need additional info. Thanks!
Run the following command to pull the TigerGraph docker image, bind ports, map a shared data folder, and start a container from the image. Note: this command is very long; please make sure you copy the whole command by dragging the scroll bar to the end:
Here is a breakdown of the options and arguments in the command:
-d : make the container run in the background.
-p : map docker 22 port to host OS 14022 port, 9000 port to host OS 9000 port, 14240 port to host OS 14240 port.
--name : name the container tigergraph.
--ulimit : set the ulimit (the number of open file descriptors per process) to 1 million.
-v : mount the host OS ~/data folder to the docker /home/tigergraph/mydata folder. If you are using Windows, change ~/data to a path that follows the Windows file system convention, e.g. c:\data
-t : allocate a pseudo-TTY
docker.tigergraph.com/tigergraph:latest : download the latest docker image from the TigerGraph docker registry URL docker.tigergraph.com/tigergraph.
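The command itself is missing from the quote above. Going only by the option breakdown, it would look roughly like the sketch below. The nofile=1000000:1000000 form of --ulimit and the wrapper function name run_tigergraph_container are assumptions, not copied from the TigerGraph docs.

```shell
# Reconstructed from the option breakdown above; treat it as a sketch.
# run_tigergraph_container is a hypothetical wrapper name used for illustration.
run_tigergraph_container() {
  docker run -d \
    -p 14022:22 -p 9000:9000 -p 14240:14240 \
    --name tigergraph \
    --ulimit nofile=1000000:1000000 \
    -v ~/data:/home/tigergraph/mydata \
    -t docker.tigergraph.com/tigergraph:latest
}
```

Calling run_tigergraph_container would then start the container in the background with the port mappings, file descriptor limit and shared folder described above.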
I am using a native installation on Red Hat 7.9 instead of the docker image due to environment restrictions. Here is what my server has for ulimit:
$ su - tigergraph
$ ulimit
unlimited
$ ulimit -a -H
core file size (blocks, -c) 5000000
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 514933
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1000000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 102400
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
$ ulimit -a -S
core file size (blocks, -c) 5000000
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 514933
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1000000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 102400
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
$ cat /proc/sys/fs/file-max
1000000
$ sysctl fs.file-max
fs.file-max = 1000000
$ vim /etc/security/limits.conf
...
gpadmin soft nofile 65536
gpadmin hard nofile 65536
gpadmin soft nproc 131072
gpadmin hard nproc 131072
* - nofile 65536
The tigergraph user’s nofile value is 1000000, so I thought it shouldn’t be a ulimit issue. But after checking, I realized the system hard limit is 4096. The resolution is to raise it to a value >= 65535.
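For anyone hitting the same message, here is a minimal sketch of the check, assuming the 65535 threshold from the GSE log. The function name check_nofile is made up for illustration.

```shell
# Compare a hard open-file limit against the minimum the GSE log asks for (65535).
# check_nofile is a hypothetical helper, not part of TigerGraph.
check_nofile() {
  hard=$1
  required=${2:-65535}
  if [ "$hard" = "unlimited" ] || [ "$hard" -ge "$required" ]; then
    echo "ok: $hard"
  else
    echo "too low: $hard (expected >= $required)"
  fi
}

# Hard limit seen by the current login shell.
check_nofile "$(ulimit -H -n)"
```

A process that is already running can see a different limit than your login shell does. On Linux you can read the limit it actually has with grep 'open files' /proc/<pid>/limits, where <pid> is the process ID of the service you are checking.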
@Jon_Herke, the sshd_config on my testing server was recently changed, and port 22 is no longer allowed.
If I try to start the server, here is what I see:
$ ./tigergraph/app/3.1.1/cmd/gadmin start all
[Info] Starting EXE
[Error] ExternalError (Failed to start executor(s); Failed to ssh to 127.0.0.1 with given credential; dial tcp 127.0.0.1:22: connect: connection refused)
So I reinstalled it. This time, when asked for the SSH port number, I entered 22222 (for example) instead of the default 22. But I still see the same error, with one additional line:
[Error]: Failed to initialize the cluster, please check the error message and initialize again.
At the end of the installation, when the services start, I see:
[Error] ExternalError (Failed to start executor(s); Failed to ssh to 127.0.0.1 with given credential; ssh: handshake failed: read tcp 127.0.0.1:14232->127.0.0.1:22: read: connection reset by peer.
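The error above still dials 127.0.0.1:22, so it may help to confirm which port sshd is actually configured for before rerunning the installer. A minimal sketch, assuming the config lives at /etc/ssh/sshd_config (reading it may need sudo); the function name sshd_port is made up for illustration.

```shell
# Print the port sshd is configured to listen on, falling back to the default 22
# when the config file has no uncommented Port line.
# sshd_port is a hypothetical helper name.
sshd_port() {
  config_file=${1:-/etc/ssh/sshd_config}
  port=$(awk 'tolower($1) == "port" { print $2; exit }' "$config_file")
  echo "${port:-22}"
}
```

You could then test a login on that port with something like ssh -p "$(sshd_port)" tigergraph@127.0.0.1 true, assuming tigergraph is the user the installation runs as.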
So I changed Hostname, SignatureAlgorithm, ConfigFileRelativePath and Port in the /home/tigergraph/.tg.cfg file.
Now when I run ./gadmin start all, I see this new error instead of the SSH handshake failure:
[ Error] Timeout (The StartExecutable cmd execution gets error in instance EXE_1: Tmeout(1m0s) when waiting executable ZK#1:check_ready to finish)
So I checked the log at path/tigergraph/log/zk/ZK#1.out:
...
Using config: path/tigergraph/data/configs/zk/conf/zoo.cfg
grep: path/tigergraph/data/configs/zk/conf/zoo.cfg: No such file or directory
mkdir: cannot create directory `': No such file or directory
...
Then I checked path/tigergraph/data/configs and saw only tg.cfg*; the etcd, kafka, nginx and zk directories that are supposed to be there are missing.
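A quick way to repeat that check after each install attempt is sketched below. The directory names are the ones listed in this post, and the function name missing_config_dirs is made up for illustration.

```shell
# List which of the expected config subdirectories are missing under a configs directory.
# missing_config_dirs is a hypothetical helper; pass your own configs path.
missing_config_dirs() {
  configs_dir=$1
  for name in etcd kafka nginx zk; do
    [ -d "$configs_dir/$name" ] || echo "$name"
  done
}
```

For example, missing_config_dirs path/tigergraph/data/configs prints one missing directory name per line, and prints nothing when all four exist.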
Could you share installation steps that avoid problems with sudo user permissions, IP, or port number caused by server restrictions?
If you are using a config file, take a look at it: after the installation (whether it failed or succeeded), the password in it is rewritten by TigerGraph and is no longer valid, so you will need to type it in again.