Batch processing on sliced data with Loop

Hi support team,

I have a graph with large datasets, and the node type I am running the scan query against has 1.5 billion elements. When I run the distributed query, it throws a memory error. Is there a way to run batch processing (data sliced by id) with a loop? A SQL equivalent example is given below.

declare @min int = 0
declare @max int = 10000
declare @batchSize int = 1000

while ( @min < @max )
begin

    -- this could be any query, but I want to use "@min" and "@min + @batchSize - 1" in GSQL
    update e
        set e.salary = e.salary * 1.2
    from employee as e
    where e.empid between @min and @min + @batchSize - 1

    set @min = @min + @batchSize

end

I have done something similar: I use the modulo (remainder) operator to achieve this. The nice thing about this approach is that it's very flexible in terms of how big or small to make each batch. For instance, you can specify mod 10 and then loop through all 10 slices:

EMPS = SELECT e FROM employee:e WHERE e.empid % mod == slice;

The two arguments are INT mod and INT slice
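To make that concrete, here is a minimal sketch of what such a query might look like, assuming empid is an integer attribute on the employee vertex type. The query name, graph name, and parameter names (mod_count instead of mod, to avoid any keyword clash) are placeholders, not tested against your schema:

CREATE QUERY update_salary_slice(INT mod_count, INT slice) FOR GRAPH MyGraph {
  // hypothetical sketch: assumes empid is an INT attribute of employee
  Emps = {employee.*};

  // only touch the vertices whose empid falls into the requested slice
  Batch = SELECT e
          FROM Emps:e
          WHERE e.empid % mod_count == slice
          POST-ACCUM e.salary = e.salary * 1.2;

  PRINT Batch.size();
}

You would then call the query once per slice (slice = 0 .. mod_count - 1), for example from a driver script, so that each run only has to process one slice of the 1.5 billion vertices.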

This is not what I was looking for. I want to use the internal id in the WHERE clause, since empid can be alphanumeric. Here is the logical flow: I want to find the node count first, determine the batch size, and then apply a range in the WHERE clause. Please help me with logic that will work for a character value in the node id.

But I appreciate your reply…

OK, that makes sense, but your example that talked about min, max, and batch size clearly implied that empid is a numeric value, which is why I provided that example.

If this is a character value, you may want to consider an approach where you combine the ORDER BY and LIMIT clauses to process a batch, then use the last value of the previous batch as the starting value for the next batch. That isn't an ideal usage pattern for a graph, however, and I encourage you to look for alternative ways to use the graph schema to execute batches against some other logical partition of your data.
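As a rough sketch of that cursor-style pattern, assuming empid is a STRING attribute that supports ordered (lexicographic) comparison, and with placeholder names throughout:

CREATE QUERY update_salary_batch(STRING last_empid, INT batch_size) FOR GRAPH MyGraph {
  // hypothetical sketch: pick the next batch_size employees in empid order,
  // starting just after the last id handled by the previous call
  Emps = {employee.*};

  Batch = SELECT e
          FROM Emps:e
          WHERE e.empid > last_empid
          ORDER BY e.empid ASC
          LIMIT batch_size;

  // apply the update only to that batch
  Updated = SELECT e
            FROM Batch:e
            POST-ACCUM e.salary = e.salary * 1.2;

  // return the batch's empids; the caller records the largest one
  // and passes it back as last_empid on the next call
  PRINT Batch[Batch.empid];
}

An external driver (a shell script calling the query's REST endpoint, for example) would start with an empty last_empid and keep calling until the returned batch is empty.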