Loading Data with Massive Edge Types and Schema-less Data

Hi,

I am trying to load data from DBpedia. Since DBpedia has no fixed schema, I parsed the dataset and generated the schema myself.

Because there is no clear classification for vertices, I generated a single vertex type, which ends up containing a very large number of properties (3,726). A large number of edge types were also generated (491). The schema was defined in TigerGraph successfully, but the loading job CANNOT be created. The loading job is as follows:

USE GRAPH dbpedia
CREATE LOADING JOB load_dbpedia FOR GRAPH dbpedia {
    // define vertex
    DEFINE FILENAME v_link_file;

    // define edge
    DEFINE FILENAME accessdate_file;
    DEFINE FILENAME accessDate_file;
    DEFINE FILENAME action_file;
    DEFINE FILENAME agency_file;
    DEFINE FILENAME agg1_file;
    ... // 491 definitions in total

    // Load Vertex
    LOAD v_link_file 
    TO VERTEX Link VALUES ($0, $0, $1, ..., $3724) USING header="false", separator="|";

    // Load Edge
    LOAD accessdate_file TO EDGE accessdate VALUES ($0, $1, $2) USING header="false", separator="|";
    LOAD accessDate_file TO EDGE accessDate VALUES ($0, $1, $2) USING header="false", separator="|";
    LOAD action_file TO EDGE action VALUES ($0, $1, $2) USING header="false", separator="|";
    LOAD agency_file TO EDGE agency VALUES ($0, $1, $2) USING header="false", separator="|";
    LOAD agg1_file TO EDGE agg1 VALUES ($0, $1, $2) USING header="false", separator="|";
    ... // 491 LOAD statements in total
}

I used ‘gsql dbpedia_load.gsql’ to create the job and got the following output:

[aaron@w23 queries]$ gsql dbpedia_load.gsql 

Using graph 'dbpedia'

null

I have several questions:

  1. Are there any hints about why this happens?

  2. How can I load data without a pre-defined schema? Is there an API that can do this automatically?

My setup:

  1. TigerGraph Developer Edition

  2. CentOS 6.8

  3. 128 GB memory

Thanks!

Sorry for the inconvenience. Could you please provide your TigerGraph version (gadmin version) and the GSQL log (gadmin log gsql)?

All data has to be ingested according to the defined schema, but a loading job is optional: you can also upsert data directly through the REST++ endpoint documented here:

https://docs.tigergraph.com/dev/restpp-api/built-in-endpoints#post-graph-graph_name-upsert-the-given-data
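For reference, that endpoint takes a JSON payload keyed by vertex/edge type and ID. Below is a minimal sketch against your schema, assuming the REST++ server on its default port 9000, a hypothetical name attribute on the Link vertex, and a hypothetical val attribute on the accessdate edge (substitute your real attribute names; the vertex IDs are just illustrative DBpedia resource names):

    curl -X POST "http://localhost:9000/graph/dbpedia" -d '
    {
      "vertices": {
        "Link": {
          "Autism": { "name": { "value": "Autism" } }
        }
      },
      "edges": {
        "Link": {
          "Autism": {
            "accessdate": {
              "Link": {
                "Asperger_syndrome": { "val": { "value": "2004-01-01" } }
              }
            }
          }
        }
      }
    }'

This upserts one Link vertex and one accessdate edge in a single request; by default, vertices referenced by an edge are created if they do not already exist.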

Thanks.

Hi Xinyu,

Thank you for your reply! I have checked my TigerGraph version: it is tg_2.4.0_dev. The GSQL log is attached below (I skipped part of the loading job for readability).

It's clear that the problem is a JVM OutOfMemoryError. How should I configure the JVM memory?

Thanks!

GSQL Shell log

2019-10-23 18:24:12.265

2.4,tg_2.4.0_dev,f422d403dab4db5c96bd22822e79e7fd5a581283,f6b4892ad3be8e805d49ffd05ee2bc7e7be10dff

DEVELOPER_EDITION

PATH:/home/aaron/.venv/bin:/home/aaron/.syspre/usr/bin:/home/aaron/.syspre/opt/rh/devtoolset-2/root/usr/bin:/home/aaron/.syspre/bin:/home/aaron/.syspre/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.201.b09-2.el6_10.x86_64/jre/bin:/data/cghuan/anaconda3/bin:/home/aaron/.gium:/data/aaron/rust/bin:/data/aaron/curl-7.61.1/src:/data/aaron/wukong/deps/hwloc-1.11.7-install/bin:/data/opt/brew/bin:/data/opt/brew/sbin:/data/opt/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/data/opt/hadoop-2.6.0/bin:/usr/local/scala-2.10.4/bin:/data/opt/hadoop-2.6.0/hive/bin:/data/opt/hadoop-2.6.0/hbase/bin:/data/opt/spark-2.2.0/bin:/data/opt/flink-0.9.1/bin:/home/aaron/.gium/:/home/aaron/bin:/home/aaron/.gium/

BuildTime: Tue Jun 11 10:53:31 PDT 2019

I@20191023 18:24:12.407  (Util.java:262) gcc path: /home/aaron/.syspre/opt/rh/devtoolset-2/root/usr/bin/g++

I@20191023 18:24:12.412  (Util.java:2064) /home/aaron/tigergraph/.license/lic_endpoint exists false

I@20191023 18:24:12.415  (Util.java:2091) license expires at 01/01/2266

I@20191023 18:24:12.415  (Util.java:2101) lic: 9340876800 cur: 1571826252

I@20191023 18:24:12.419  (AdminServiceClient.java:127) establish connection w/ admin server

I@20191023 18:24:12.468  (AdminServiceClient.java:158) connected to admin server

I@20191023 18:24:12.503  (ZkClient.java:80) Connected to zk server.

I@20191023 18:24:13.041  (ZkClient.java:80) Connected to zk server.

I@20191023 18:24:21.019  (ZkClient.java:80) Connected to zk server.

I@20191023 18:24:24.293  (Driver.java:145) START SERVER!

I@20191023 18:24:25.436  (BaseHandler.java:16) BaseHandler: A

I@20191023 18:24:25.438 tigergraph|127.0.0.1:48436| (VersionHandler.java:34) v

I@20191023 18:26:43.452  (BaseHandler.java:16) BaseHandler: o

I@20191023 18:26:43.453 tigergraph|127.0.0.1:56584|249641201 (Util.java:314) GSHELL_TEST is empty

I@20191023 18:26:43.453 tigergraph|127.0.0.1:56584|249641201 (Util.java:321) session parameter graph is empty

I@20191023 18:26:43.453 tigergraph|127.0.0.1:56584|249641201 (Util.java:335) COMPILE_THREADS is empty

I@20191023 18:26:43.453 tigergraph|127.0.0.1:56584|249641201 (Util.java:349) FromGraphStudio is empty

I@20191023 18:26:43.454 tigergraph|127.0.0.1:56584|249641201 (LoginHandler.java:62) The gsql client is started on the server, and the working directory is /data/aaron/tigergraph/queries

I@20191023 18:26:43.455 tigergraph|127.0.0.1:56584|249641201 (Util.java:2101) lic: 9340876800 cur: 1571826403

I@20191023 18:26:43.455 tigergraph|127.0.0.1:56584|249641201 (LoginHandler.java:89) tigergraph login successfully.

I@20191023 18:26:43.722 tigergraph|127.0.0.1:56584|249641201 (BaseHandler.java:16) BaseHandler: i

I@20191023 18:26:43.722 tigergraph|127.0.0.1:56584|249641201 (Util.java:314) GSHELL_TEST is empty

I@20191023 18:26:43.722 tigergraph|127.0.0.1:56584|249641201 (Util.java:321) session parameter graph is empty

I@20191023 18:26:43.927 tigergraph|127.0.0.1:56584|249641201 (Util.java:335) COMPILE_THREADS is empty

I@20191023 18:26:43.927 tigergraph|127.0.0.1:56584|249641201 (Util.java:349) FromGraphStudio is empty

I@20191023 18:26:43.927 tigergraph|127.0.0.1:56584|249641201 (Util.java:291) New session parameters.

I@20191023 18:26:43.943 tigergraph|127.0.0.1:56584|249641201 (FileHandler.java:45) USE GRAPH dbpedia

DROP JOB load_dbpedia

CREATE LOADING JOB load_dbpedia FOR GRAPH dbpedia {

    // define vertex

    DEFINE FILENAME v_link_file;

    // define edge

    DEFINE FILENAME accessdate_file;

    DEFINE FILENAME accessDate_file;

    DEFINE FILENAME action_file;

    DEFINE FILENAME agency_file;

    DEFINE FILENAME agg1_file;

    ...

    // load vertex

    LOAD v_link_file 

    TO VERTEX Link VALUES ($0, $0, $1, $2, $3, ..., $3724) USING header="false", separator="|";

    // load edge

    LOAD accessdate_file

        TO EDGE accessdate VALUES ($0, $1, $2) USING header="false", separator="|";

    LOAD accessDate_file

        TO EDGE accessDate VALUES ($0, $1, $2) USING header="false", separator="|";

    LOAD action_file

        TO EDGE action VALUES ($0, $1, $2) USING header="false", separator="|";

    LOAD agency_file

        TO EDGE agency VALUES ($0, $1, $2) USING header="false", separator="|";

    LOAD agg1_file

        TO EDGE agg1 VALUES ($0, $1, $2) USING header="false", separator="|";

    ...

}

I@20191023 18:26:43.976 tigergraph|127.0.0.1:56584|249641201 (CatalogLock.java:223) Lock: Short && Read

I@20191023 18:26:53.346 tigergraph|127.0.0.1:56584|249641201 (CatalogManager.java:811) getCatalog: null

I@20191023 18:26:53.346 tigergraph|127.0.0.1:56584|249641201 (CatalogManager.java:768) switch to graph 'dbpedia'.

I@20191023 18:26:53.347 tigergraph|127.0.0.1:56584|249641201 (CatalogLock.java:273) Unlock: Short && Read

I@20191023 18:26:53.348 tigergraph|127.0.0.1:56584|249641201 (CatalogLock.java:208) Lock: Short && Write

I@20191023 18:27:01.965 tigergraph|127.0.0.1:56584|249641201 (CatalogManager.java:811) getCatalog: dbpedia

I@20191023 18:27:01.968 tigergraph|127.0.0.1:56584|249641201 (CatalogLock.java:270) Unlock: Short && Write

I@20191023 18:27:01.969 tigergraph|127.0.0.1:56584|249641201 (CatalogLock.java:208) Lock: Short && Write

I@20191023 18:27:10.486 tigergraph|127.0.0.1:56584|249641201 (CatalogManager.java:811) getCatalog: dbpedia

I@20191023 18:27:24.430 tigergraph|127.0.0.1:56584|249641201 (Util.java:1878) Run bash command: rm -f /home/aaron/tigergraph/dev/gdk/gsql/.catalog/0/codegen//GSQL_UDT.cpp

I@20191023 18:27:24.435 tigergraph|127.0.0.1:56584|249641201 (Util.java:1894) Finished

find token udf source files, compile if any one is newer than .so

skip token.so

I@20191023 18:27:24.499 tigergraph|127.0.0.1:56584|249641201 (AdminServiceClient.java:231) send /home/aaron/tigergraph/dev/gdk/gsql/.tmp/TokenBank.so to /home/aaron/tigergraph/bin

E@20191023 18:31:12.558 tigergraph|127.0.0.1:56584|249641201 (QueryBlockHandler.java:132) null

java.lang.OutOfMemoryError

at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source)

at java.lang.AbstractStringBuilder.ensureCapacityInternal(Unknown Source)

at java.lang.AbstractStringBuilder.append(Unknown Source)

at java.lang.StringBuffer.append(Unknown Source)

at java.io.StringWriter.write(Unknown Source)

at com.google.gson.stream.JsonWriter.newline(JsonWriter.java:603)

at com.google.gson.stream.JsonWriter.beforeName(JsonWriter.java:618)

at com.google.gson.stream.JsonWriter.writeDeferredName(JsonWriter.java:401)

at com.google.gson.stream.JsonWriter.value(JsonWriter.java:480)

at com.google.gson.internal.bind.TypeAdapters$3.write(TypeAdapters.java:148)

at com.google.gson.internal.bind.TypeAdapters$3.write(TypeAdapters.java:133)

at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69)

at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.write(ReflectiveTypeAdapterFactory.java:127)

at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.write(ReflectiveTypeAdapterFactory.java:245)

at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69)

at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:97)

at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:61)

at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69)

at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.write(ReflectiveTypeAdapterFactory.java:127)

at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.write(ReflectiveTypeAdapterFactory.java:245)

at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69)

at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:97)

at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:61)

at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69)

at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.write(ReflectiveTypeAdapterFactory.java:127)

at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.write(ReflectiveTypeAdapterFactory.java:245)

at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69)

at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:97)

at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:61)

at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69)

at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.write(ReflectiveTypeAdapterFactory.java:127)

at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.write(ReflectiveTypeAdapterFactory.java:245)

at com.google.gson.Gson.toJson(Gson.java:704)

at com.google.gson.Gson.toJson(Gson.java:683)

at com.google.gson.Gson.toJson(Gson.java:658)

at com.tigergraph.schema.Catalog.e(Catalog.java:12431)

at com.tigergraph.schema.Catalog.iy(Catalog.java:15125)

at com.tigergraph.schema.Catalog.k(Catalog.java:15203)

at com.tigergraph.schema.Catalog.iz(Catalog.java:15195)

at com.tigergraph.schema.Catalog.l(Catalog.java:10546)

at com.tigergraph.schema.b.p.a(QueryBlockHandler.java:544)

at com.tigergraph.schema.b.p.a(QueryBlockHandler.java:179)

at com.tigergraph.schema.b.p.a(QueryBlockHandler.java:122)

at com.tigergraph.schema.b.i.a(FileHandler.java:50)

at com.tigergraph.schema.b.c.handle(BaseHandler.java:33)

at com.sun.net.httpserver.Filter$Chain.doFilter(Unknown Source)

at sun.net.httpserver.AuthFilter.doFilter(Unknown Source)

at com.sun.net.httpserver.Filter$Chain.doFilter(Unknown Source)

at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(Unknown Source)

at com.sun.net.httpserver.Filter$Chain.doFilter(Unknown Source)

at sun.net.httpserver.ServerImpl$Exchange.run(Unknown Source)

at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

at java.lang.Thread.run(Unknown Source)

I@20191023 18:31:12.561 tigergraph|127.0.0.1:56584|249641201 (CatalogLock.java:270) Unlock: Short && Write

I@20191023 18:31:12.585 tigergraph|127.0.0.1:56584|249641201 (BaseHandler.java:16) BaseHandler: a

I@20191023 18:31:12.586 tigergraph|127.0.0.1:56584|249641201 (AbortClientSessionHandler.java:29) AbortSession added for session = 249641201

I@20191023 18:31:12.586 tigergraph|127.0.0.1:56584|249641201 (AbortClientSessionHandler.java:30) AbortLoadingProgress added for session = 249641201

This is a known issue when a loading job is too big.

Could you either reduce the schema size or split the loading job into several smaller jobs?
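For example, you could keep the vertex load in its own job and group the edge loads into batches of a manageable size. A sketch using the types from your job (the job names and the two-edge-types-per-job grouping are just illustrative, and the vertex column list is elided as in your original; repeat the pattern until all 491 edge types are covered):

    USE GRAPH dbpedia

    // Vertex load in its own job
    CREATE LOADING JOB load_dbpedia_vertices FOR GRAPH dbpedia {
        DEFINE FILENAME v_link_file;
        LOAD v_link_file
            TO VERTEX Link VALUES ($0, $0, $1, ..., $3724) USING header="false", separator="|";
    }

    // Edge loads split across several small jobs
    CREATE LOADING JOB load_dbpedia_edges_01 FOR GRAPH dbpedia {
        DEFINE FILENAME accessdate_file;
        DEFINE FILENAME accessDate_file;
        LOAD accessdate_file TO EDGE accessdate VALUES ($0, $1, $2) USING header="false", separator="|";
        LOAD accessDate_file TO EDGE accessDate VALUES ($0, $1, $2) USING header="false", separator="|";
    }

    CREATE LOADING JOB load_dbpedia_edges_02 FOR GRAPH dbpedia {
        DEFINE FILENAME action_file;
        DEFINE FILENAME agency_file;
        LOAD action_file TO EDGE action VALUES ($0, $1, $2) USING header="false", separator="|";
        LOAD agency_file TO EDGE agency VALUES ($0, $1, $2) USING header="false", separator="|";
    }

Each job is then created and run independently, so the GSQL server only has to serialize one small job at a time instead of the whole 491-LOAD job that triggered the OutOfMemoryError.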