UI Query Performance

Hi Everyone,

I apologize for the lengthy post but I wanted to make sure the issue and desired results were clear. I’m open to any and all suggestions and would happily jump on a call to discuss in detail. I’m tried a few different schema designs and countless queries but I haven’t found anything that’s acceptable. Here we go…

Attachment

Contains the graph schema and the query.

Screen Shots

First, is the graph schema. Second, is the UI parameter section. Third, is the query. Fourth, is the current size of the graph.

NOTE: This graph has many more vertices and edges but I don’t believe they are needed for this issue. This graph currently contains 3 months of data (2019 Jan-Mar). We plan on loading at least all of 2018 and 2019 YTD. Assume the graph will be at least 6 times it’s current size if fully populated excluding the vertices and edges not mentioned in this post.

Background

This query is used in a UI where the user has 4 filtering options (YearMonth, LOB, Site and HealthSystem). If a user selects a LOB then only the YearMonth, HealthSystem and Site for that LOB are available for that user to continue their filtering. Very similar to excel-like filtering where the other filters will update based on your selection. This is what lines 2-44 achieve. The end results of this is a PRINT statement (line 46) that contains all available options in the UI filters. Every time the user selects a parameter value and clicks off, the UI calls the query and updates the available values. The UI can send no parameters to the query or a list for each one. We have not had any issues with this and if all values are selected or no parameters are passed to the query, the query will typically execute in 40 ms or under. However, I’m always open to suggestions.

When the user is finished selecting their filters, they will click the Apply Filters button in the UI. This button will execute the query with the apply parameter of TRUE. This will run the coding between line 48 and 71. The end result of this IF statement is a series of values that the UI will use for charts and other visualizations. To keep this straightforward, I am only printing the size of the Call vertex. The Call vertex is a list of values in the x_CallSummaryID vertex for the values the UI passed to the query. Another way to put it, once I know what values (YearMonth, LOB, Site & HealthSystem) I need to perform various calculations on, I need to filter the x_CallSummaryID vertex. From here I can begin my calculations because all the data points I need are connected to x_CallSummaryID.

Problem

On line 46, I know what values to execute the calculations (line 48 - 71) for. The next step is for me to filter the x_CallSummaryID vertex so I can start calculating data points. This is the slowest part of the query. If the query runs with no parameters then it will run very fast because it skips lines 50-68. For every IF statement that has to run, the execution time drastically increases. I have been unable to figure out how to filter x_CallSummaryID within a reasonable execution time. For the query below, if I run the query with only apply=TRUE, it will execute in around 75ms. If I run the query with 1 parameter, the best execute times I’ve gotten is around 1500 ms (1 small HealthSystem) but filtering for 1 YearMonth typically takes around 5000 ms. When adding 3 or more filters the query will take between 10000 and 15000 ms. For what I’m trying to accomplish in this query, I would imagine execution under 500 ms would be reasonable. With expectations of significant data growth I would prefer the execution time to be under 100 ms not matter what parameters are being used.

We are currently on v2.4.1 on a standalone server. I am using SYNTAX v2 because I’ve been attempting to use multi hops but haven’t had any luck.

1 CREATE QUERY Metrics(SET HS,SET YM, SET LOB,SET SITE, BOOL apply=False) FOR GRAPH SDD SYNTAX v2{


2 SetAccum @@hs, @@ym, @@lob, @@site;
3
4 ym = {x_YearMonth.};
5 hs = {x_HealthSystem.
};
6 lob = {x_LOB.};
7 site = {x_Site.
};
8
9 IF COUNT(YM) > 0 THEN ym = SELECT t FROM ym:t WHERE t.yearMonth IN YM ACCUM @@ym += t.yearMonth; END;
10 IF COUNT(HS) > 0 THEN hs = SELECT t FROM hs:t WHERE t.healthSystem IN HS ACCUM @@hs += t.healthSystem; END;
11 IF COUNT(LOB) > 0 THEN lob = SELECT t FROM lob:t WHERE t.lob IN LOB ACCUM @@lob += t.lob; END;
12 IF COUNT(SITE) > 0 THEN site = SELECT t FROM site:t WHERE t.site IN SITE ACCUM @@site += t.site; END;
13
14 IF COUNT(YM) == 0 THEN
15 ym_hs = SELECT x FROM hs:t-(x_HEALTHSYSTEM_YEARMONTH>:e)-x_YearMonth:x;
16 ym_lob = SELECT x FROM lob:t-(x_LOB_YEARMONTH>:e)-x_YearMonth:x;
17 ym_site = SELECT x FROM site:t-(x_SITE_YEARMONTH>:e)-x_YearMonth:x;
18 ym = ym_hs INTERSECT ym_lob INTERSECT ym_site;
19 ym = SELECT t FROM ym:t ACCUM @@ym += t.yearMonth;
20 END;
21
22 IF COUNT(HS) == 0 THEN
23 hs_ym = SELECT x FROM ym:t-(x_YEARMONTH_HEALTHSYSTEM>:e)-x_HealthSystem:x;
24 hs_lob = SELECT x FROM lob:t-(x_LOB_HEALTHSYSTEM>:e)-x_HealthSystem:x;
25 hs_site = SELECT x FROM site:t-(x_SITE_HEALTHSYSTEM>:e)-x_HealthSystem:x;
26 hs = hs_ym INTERSECT hs_lob INTERSECT hs_site;
27 hs = SELECT t FROM hs:t ACCUM @@hs += t.healthSystem;
28 END;
29
30 IF COUNT(LOB) == 0 THEN
31 lob_hs = SELECT x FROM hs:t-(x_HEALTHSYSTEM_LOB>:e)-x_LOB:x;
32 lob_ym = SELECT x FROM ym:t-(x_YEARMONTH_LOB>:e)-x_LOB:x;
33 lob_site = SELECT x FROM site:t-(x_SITE_LOB>:e)-x_LOB:x;
34 lob = lob_hs INTERSECT lob_ym INTERSECT lob_site;
35 lob = SELECT t FROM lob:t ACCUM @@lob += t.lob;
36 END;
37
38 IF COUNT(SITE) == 0 THEN
39 site_hs = SELECT x FROM hs:t-(x_HEALTHSYSTEM_SITE>:e)-x_Site:x;
40 site_ym = SELECT x FROM ym:t-(x_YEARMONTH_SITE>:e)-x_Site:x;
41 site_lob = SELECT x FROM lob:t-(x_LOB_SITE>:e)-x_Site:x;
42 site = site_hs INTERSECT site_ym INTERSECT site_lob;
43 site = SELECT t FROM site:t ACCUM @@site += t.site;
44 END;
45
46 PRINT @@hs, @@ym, @@lob, @@site;
47
48 IF apply == TRUE THEN
49 Calls = {x_CallSummaryID.*};
50 IF COUNT(HS) > 0 THEN
51 x_Calls = SELECT c FROM hs:s-(x_HEALTHSYSTEM_CALLSUMMARYID>:e)-x_CallSummaryID:c;
52 Calls = Calls INTERSECT x_Calls;
53 END;
54
55 IF COUNT(YM) > 0 THEN
56 x_Calls = SELECT c FROM ym:s-(x_YEARMONTH_CALLSUMMARYID>:e)-x_CallSummaryID:c;
57 Calls = Calls INTERSECT x_Calls;
58 END;
59
60 IF COUNT(LOB) > 0 THEN
61 x_Calls = SELECT c FROM lob:s-(x_LOB_CALLSUMMARYID>:e)-x_CallSummaryID:c;
62 Calls = Calls INTERSECT x_Calls;
63 END;
64
65 IF COUNT(SITE) > 0 THEN
66 x_Calls = SELECT c FROM site:s-(x_SITE_CALLSUMMARYID>:e)-x_CallSummaryID:c;
67 Calls = Calls INTERSECT x_Calls;
68 END;
69
70 PRINT Calls.size();
71 END;
72 }

Graph Size

Thank You,

Nick

Hello Nicholas.

We are currently hosting the Developer Office Hour. If you are available right now, please feel free to join in for a further discussion on optimizing this query.

Thanks,

Adil

Hi Nicholas,

Thanks for your detailed description. After reading through the query. I see great potential in improving performance.

  1. Changing the syntax version from v2 to v1 will significantly boost the performance.
  2. Since line 52, 57, 62, 67 are performance intersection logic. We can do it this way:

OrAccum @hs, @ym, @lob, @site;
// assign those bool to true when you populate the SetAccum
// e.g. site = SELECT t FROM site:t ACCUM @@site += t.site, t.@site += true;

conditions = hs union ym union lob union site;

Calls = SELECT c FROM conditions:s-((x_HEALTHSYSTEM_CALLSUMMARYID|x_YEARMONTH_CALLSUMMARYID|x_LOB_CALLSUMMARYID|x_SITE_CALLSUMMARYID):e)-x_CallSummaryID:c
ACCUM t.@hs += s.@hs, t.@ym += s.@ym, t.@lob += s.@lob, t.@site += s.@site
HAVING (COUNT(HS) == 0 OR t.@hs == true)
and (COUNT(LOB) == 0 OR t.@lob == true)
and (COUNT(site) == 0 OR t.@ym == true)
and (COUNT(SITE) == 0 OR t.@site == true) ;

This way all edge types are executed in parallel and we can avoid doing interset with the entire x_CallSummaryID set;

Also, I have attached the schema design and query writing best practice.

Please let me know if you have further questions.

Thanks,

Should the OrAccums for t be global accums? I’m getting an error message that says the t alias cannot be determined.

Thanks,

Nick

My bad it was my typo. Now I have changed the alias of target vertex to t. Please try.

FYI, I would think to change it to v1 will make a larger impact on the performance.

Calls = SELECT c FROM conditions:s-((x_HEALTHSYSTEM_CALLSUMMARYID|x_YEARMONTH_CALLSUMMARYID|x_LOB_CALLSUMMARYID|x_SITE_CALLSUMMARYID):e)-x_CallSummaryID:t
ACCUM t.@hs += s.@hs, t.@ym += s.@ym, t.@lob += s.@lob, t.@site += s.@site
HAVING (COUNT(HS) == 0 OR t.@hs == true)
and (COUNT(LOB) == 0 OR t.@lob == true)
and (COUNT(site) == 0 OR t.@ym == true)
and (COUNT(SITE) == 0 OR t.@site == true) ;

Here’s what I got so far. Updating Calls and setting the syntax to v1 has really helped the performance not matter what combination of parameters are selected. It’s executing between 1500 and 2000 ms.

  1. I added a CASE statement in the ACCUM clause to filter on the edges, recommended by Adil from a previous question. It seemed to help performance but what are your thoughts?
  1. Having c.@ym += s.@ym wasn’t working so I removed that and updated it to c.@ym = TRUE. Thoughts?
  1. I don’t understand the benefit of site = SELECT t FROM site:t ACCUM @@site += t.site, t.@site += true; if @site is a local accum.
  1. Is there anything else worth trying to improve performance?

FinalParms = ym UNION hs UNION lob UNION site;

Calls =

SELECT c

FROM FinalParms:s-((x_HEALTHSYSTEM_CALLSUMMARYID|x_YEARMONTH_CALLSUMMARYID|x_LOB_CALLSUMMARYID|x_SITE_CALLSUMMARYID):e)-x_CallSummaryID:c
ACCUM

CASE

WHEN e.type == “x_HEALTHSYSTEM_CALLSUMMARYID” THEN c.@hs += TRUE
WHEN e.type == “x_YEARMONTH_CALLSUMMARYID” THEN c.@ym += TRUE
WHEN e.type == “x_LOB_CALLSUMMARYID” THEN c.@lob += TRUE
WHEN e.type == “x_SITE_CALLSUMMARYID” THEN c.@site += TRUE

END

HAVING

(COUNT(HS) == 0 OR c.@hs == TRUE) AND
(COUNT(LOB) == 0 OR c.@lob == TRUE) AND
(COUNT(YM) == 0 OR c.@ym == TRUE) AND
(COUNT(SITE) == 0 OR c.@site == TRUE)

Could you show me your execution log, please?

grep UDF:: /home/tigergraph/tigergraph/logs/GPE_1_1/log.INFO

The latest log period starts with UDF::Start and end with UDF::End is your most recent.

The benefit of using OrAccum is:

  1. avoid doing INTERSECT

  2. reduce the number of SELECT statement.

  3. involve more edges in one SELECT statement can improve the concurrency.

However, this change may not be that significant. By looking at the execution log, we may be able to have more ideas.

Thanks.

I emailed you the LOG file the most recent query. Let me know if you have any questions.

I see currently 90% of the time was spent in the last iteration. Which processed about 27 million edges in about 1.5s.

Based on the number of edges per second, the query performance is already very good. This way, further improvement can be done by reducing the number of edges traversed.

For the four types of filtering conditions. If you start with the one that can traverse the least edges, then start from the result of filter2, see if it is connected with filter2. so on and so forth.

Assume we have # of neighbors HS < LOB < YM < SITE. Please find the example code below:

Calls = {};

if COUNT(HS) != 0 then

Calls = select t from hs-(x_HEALTHSYSTEM_CALLSUMMARYID)-:t;

end;

if COUNT(LOB) != 0 then

if Calls.size() > 0 then // if other previous filter conditions exist. Start from the result vertex set.

Calls = select s from Calls:s-(x_LOB_CALLSUMMARYID)-:t;

else // otherwise start from the filter condition vertex set.

Calls = select t from lob:s-(x_LOB_CALLSUMMARYID)-:t;

end;

end;

if COUNT(YM) != 0 then

if Calls.size() > 0 then

Calls = select s from Calls:s-(x_YEARMONTH_CALLSUMMARYID)-:t;

else

Calls = select t from ym:s-(x_YEARMONTH_CALLSUMMARYID)-:t;

end;

end

if COUNT(SITE) != 0 then

if Calls.size() > 0 then

Calls = select s from Calls:s-(x_SITE_CALLSUMMARYID)-:t;

else

Calls = select t from site:-(x_SITE_CALLSUMMARYID)-:t;

end;

end