GPE Crashes Unexpectedly

Hi, I tried to use Conjunctive Pattern Matching (Beta) to find unlabeled subgraphs with 7 vertices and 16 edges. The queries were successfully installed and optimized, but during execution GPE always crashes unexpectedly with the following exception:

tigergraph@ib31:~$ tail -n 20 ~/tigergraph/log/gpe/log.INFO
I0501 22:59:45.196050 71955 workerinstance.cpp:165] Request|WorkerManager,6551.GPE_4_1.1619881185195.N,NNN,2224,0,0|Start worker
I0501 22:59:45.196559 65532 enginejoblistener.cpp:62] Request|WorkerManager,6552.GPE_4_1.1619881185196.N,NNN,2224,0,0|Received
I0501 22:59:45.196571 65532 enginejobrunner.cpp:653] Request|WorkerManager,6552.GPE_4_1.1619881185196.N,NNN,2224,0,0|queryID: livej::default,16777221.RESTPP_1_1.1619879810180.N,NNN,3600,0,0, action: init, functionName: queryDispatcher_worker, source partition: 4, libudf_name_: libudf_livej
E0501 22:59:45.196848 71955 glogging.cpp:130] ============ Crashed with stacktrace ============
  0# 0x000000003E382F58 in /home/tigergraph/tigergraph/app/3.1.1/bin/tg_dbs_gped
 1# 0x00007F46080D10E0 in /home/tigergraph/tigergraph/app/3.1.1/.syspre/usr/lib_ld3/libpthread.so.0
 2# 0x000000003E930820 in /home/tigergraph/tigergraph/app/3.1.1/bin/tg_dbs_gped
 3# 0x000000003E931F48 in /home/tigergraph/tigergraph/app/3.1.1/bin/tg_dbs_gped
 4# 0x000000003E93220D in /home/tigergraph/tigergraph/app/3.1.1/bin/tg_dbs_gped
 5# 0x000000003E9322F9 in /home/tigergraph/tigergraph/app/3.1.1/bin/tg_dbs_gped
 6# std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() at /opt/rh/devtoolset-2/root/usr/include/c++/4.8.2/bits/shared_ptr_base.h:158
 7# 0x000000003E691DD1 in /home/tigergraph/tigergraph/app/3.1.1/bin/tg_dbs_gped
 8# 0x000000003E464DB2 in /home/tigergraph/tigergraph/app/3.1.1/bin/tg_dbs_gped
 9# 0x000000003E46878F in /home/tigergraph/tigergraph/app/3.1.1/bin/tg_dbs_gped
10# 0x000000003E46F14B in /home/tigergraph/tigergraph/app/3.1.1/bin/tg_dbs_gped
11# 0x000000003F075D69 in /home/tigergraph/tigergraph/app/3.1.1/bin/tg_dbs_gped
12# start_thread at /build/glibc-77giwP/glibc-2.24/nptl/pthread_create.c:456
13# 0x00007F46071DED0F at ../sysdeps/unix/sysv/linux/x86_64/clone.S:99

============ End of stacktrace ============

tigergraph@ib31:~$ gadmin status GPE
+--------------------+-------------------------+-------------------------+
|    Service Name    |     Service Status      |      Process State      |
+--------------------+-------------------------+-------------------------+
|        GPE         |          Down           |         Stopped         |
+--------------------+-------------------------+-------------------------+

Any idea what the root cause of the exception might be, or how to track it down? Any help is appreciated. Here is one of the queries I used:

create distributed query node_275(int batch_id) for graph livej returns (int) syntax v2 {
  SumAccum<int> @@n_matchings;
  S = select n0 from
      node:n0 -(forlink>:e)- node:n1,
      node:n0 -(revlink>)- node:n2,
      node:n1 -(revlink>)- node:n2,
      node:n0 -(forlink>)- node:n3,
      node:n1 -(forlink>)- node:n3,
      node:n0 -(forlink>)- node:n4,
      node:n1 -(revlink>)- node:n4,
      node:n2 -(revlink>)- node:n4,
      node:n2 -(revlink>)- node:n3,
      node:n4 -(forlink>)- node:n5,
      node:n0 -(forlink>)- node:n5,
      node:n2 -(forlink>)- node:n5,
      node:n3 -(forlink>)- node:n5,
      node:n4 -(revlink>)- node:n6,
      node:n2 -(revlink>)- node:n6,
      node:n0 -(forlink>)- node:n6
      where e.batch_id == batch_id
      accum @@n_matchings += 1;

  return @@n_matchings;
}
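
For reference, after writing each query I install it and run it from the GSQL shell, roughly like this (the parameter value below is just an example):

INSTALL QUERY node_275
RUN QUERY node_275(0)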

Is your query running in a cluster environment? How do you invoke this query?

Thanks @mingxiwu~ Here I just want to try a subgraph-matching method that is native to TG.

I run the queries in a cluster environment with 10 machines. I have a number of distributed subqueries node_??? that are similar to the example above, plus one root query start_batch, which is not distributed and invokes the distributed subqueries sequentially, one after another. The organization of the queries is shown below.

create distributed query node_xxx(int batch_id) for graph livej returns (int) syntax v2 {
  ...
}

create distributed query node_yyy(int batch_id) for graph livej returns (int) syntax v2 {
  ...
}

create distributed query node_zzz(int batch_id) for graph livej returns (int) syntax v2 {
  ...
}

...

create query start_batch (int batch_id) for graph livej {
  SumAccum<int> @@n_matchings;
  @@n_matchings += node_xxx(batch_id);
  @@n_matchings += node_yyy(batch_id);
  @@n_matchings += node_zzz(batch_id);
  ...
  print @@n_matchings;
}

I start the execution by calling start_batch with a batch_id. The execution crashed after running for some time, so some node_??? subqueries may have been processed successfully before the crash. I tried to print a message between the invocations of the subqueries in start_batch to see which subqueries had finished, but it seems the print results are only returned after the entire start_batch finishes.
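
The instrumented start_batch looked roughly like this (a sketch; the log() line is an assumption I have not verified on 3.1.1: it is supposed to write to the engine log files while the query runs, unlike print, whose output is only returned once the whole query finishes):

create query start_batch (int batch_id) for graph livej {
  SumAccum<int> @@n_matchings;

  @@n_matchings += node_xxx(batch_id);
  print "node_xxx done";                      // buffered: only shown after start_batch returns
  log(true, "node_xxx done", @@n_matchings);  // assumption: written to the GPE/GSE log during execution

  @@n_matchings += node_yyy(batch_id);
  ...

  print @@n_matchings;
}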

I posted the question here just because I am not sure whether it is the way I used TG that caused the crash.

In fact, I found that CPU usage was not high enough when running the node_??? subqueries sequentially on 10 machines, so I am changing the invocation strategy from “running each subquery in a data-parallel manner” to “running the subqueries in a query-parallel manner”: the distributed keyword is removed from the node_??? subqueries, and I ensure that at any time there are 10 queries running simultaneously in the cluster, launching a new query only when an old one finishes, so that each machine is always busy processing one query. Graph data are pulled across machines during execution, but I think that is acceptable because the graph is small. I expect the query-parallel approach to give much higher CPU usage and better overall performance, and I am working on it right now.
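
Concretely, the only change to each subquery is dropping the distributed keyword (a sketch; the body is unchanged from before):

create query node_xxx(int batch_id) for graph livej returns (int) syntax v2 {
  // same body as before; each invocation now runs on a single GPE,
  // pulling remote graph data as needed
  ...
}

The 10-way concurrency is then driven from outside GSQL, e.g. by issuing concurrent HTTP calls to the installed queries through REST++ (something like GET http://<host>:9000/query/livej/node_xxx?batch_id=0) and launching a new call whenever one returns.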