Preferred way of loading streaming data into TigerGraph?

Jon_Herke · April 8, 2020, 8:53pm

Hi,

We currently have TigerGraph connected to Kafka to consume an infinite stream of events via a Loading Job. We are now trying to work out the optimal loading mechanism for these continuous events. For example, when shutting down the loading job we sometimes see exceptions in the log files. We assume these exceptions occur because the Loading Job does not stop consuming new events and let running queries finish before it shuts down.

Is the Loading Job the optimal solution for an infinite stream of data, or would the additional tooling provided, such as Kafka Connect, be recommended as an alternative?

Best Regards,
Anders Lauri
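
To make the shutdown hypothesis above concrete, here is a minimal Python sketch of the "stop intake first, then drain" ordering that the post assumes is missing. It is not TigerGraph's or Kafka's implementation; the `DrainingLoader` class, its `submit`/`shutdown` methods, and the `handle_event` callback are all hypothetical names used only for illustration.

```python
import queue
import threading

# Sentinel placed on the queue to tell the worker to exit (hypothetical design).
_STOP = object()


class DrainingLoader:
    """Hypothetical loader that refuses new events before finishing accepted ones."""

    def __init__(self, handle_event):
        self._handle_event = handle_event
        self._pending = queue.Queue()
        self._accepting = True
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, event):
        """Accept an event unless shutdown has begun; return True if accepted."""
        with self._lock:
            if not self._accepting:
                return False
            self._pending.put(event)
            return True

    def _run(self):
        # Process events in arrival order until the stop sentinel is reached.
        while True:
            event = self._pending.get()
            if event is _STOP:
                break
            self._handle_event(event)

    def shutdown(self):
        """Stop intake, then block until every accepted event has been handled."""
        with self._lock:
            self._accepting = False
            # Enqueued under the same lock as submit(), so it follows all accepted events.
            self._pending.put(_STOP)
        self._worker.join()


loaded = []
loader = DrainingLoader(loaded.append)
for i in range(5):
    loader.submit(i)
loader.shutdown()
print(loaded)
```

Because the sentinel is queued behind every accepted event, nothing already in flight is cut off at shutdown, and anything submitted afterwards is rejected rather than half-processed.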

Jon_Herke · April 8, 2020, 8:54pm

The Kafka Loader is the recommended way of loading stream data.

To load data from a Kafka topic, a loading job is needed.

Could you please show us the exception log so that we can identify the issue?

Thanks.

Jon_Herke · April 8, 2020, 8:54pm

Hej,

The Kafka Loading Job with back pressure consumes large amounts of memory. We have seen that there is a disk option we could use. How can we enable this via the config file?

Cheerio,

Anders Lauri

Jon_Herke · April 8, 2020, 8:54pm

Hi Anders,

Could you please be more specific about what the disk option is? Where is it in the documentation?

Also, to reduce the memory usage of the Kafka loader, you can try configuring the batch size.

https://docs.tigergraph.com/dev/gsql-ref/ddl-and-loading/running-a-loading-job#options
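
As a rough illustration of the batch size suggestion, the Python sketch below composes a `RUN LOADING JOB` statement with `BATCH_SIZE` and `CONCURRENCY` given in the `USING` clause, which is how the options page linked above appears to expose them. Treat the exact syntax as an assumption to check against that page for your version; the job name, file variable, and path are hypothetical placeholders, and `build_run_statement` is a helper invented for this example.

```python
def build_run_statement(job, file_var, path, batch_size=None, concurrency=None):
    """Compose a GSQL RUN LOADING JOB statement (syntax assumed from the linked docs)."""
    using = [f'{file_var}="{path}"']
    for name, value in (("CONCURRENCY", concurrency), ("BATCH_SIZE", batch_size)):
        if value is None:
            continue
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
        # Option values are written as quoted strings in the USING clause.
        using.append(f'{name}="{value}"')
    return f"RUN LOADING JOB {job} USING " + ", ".join(using)


# Hypothetical job name, file variable, and Kafka data source path.
statement = build_run_statement(
    "load_events",
    "f1",
    "$k1:/path/to/topic_partition_config.json",
    batch_size=1024,
)
print(statement)
```

A smaller batch size should mean fewer lines held per request, at the cost of more requests; whether it takes effect for the Kafka loader is exactly what needs confirming here.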

Jon_Herke · April 8, 2020, 8:54pm

Hej,

There is an option for the kafka_loader binary, which is executed when a loading job is triggered via GSQL; we can see it reflected in a disk_mode property in the configuration file. Given SSD disks, we believe the reduced throughput in exchange for reduced memory would be acceptable.

We have attempted to configure the batch size; however, the value was not changed.

Thanks,

Anders Lauri