Requesting sample GSQL queries for User Engagement scenario

tigerfan · July 7, 2021, 12:31am

Hi,

I have a very simple graph.

Vertex - User - (user_id, age, gender, zip code)
Vertex - Email - (user_id, #email_received_weekly , #email_opened_weekly , #email_received_monthly , #email_opened_monthly)
Vertex - SMS - (user_id, #sms_received_weekly , #sms_received_monthly)
Edge User->Email
Edge User->SMS

Can someone please suggest what type of queries can be written to find the segments of users by age and gender groups and prioritize such segments based on #email , #sms counts ?

It will be also great if someone can point to python code to consume the output of such query and then visualize such groups and also plot by geographic regions (leveraging zipcode)

Thanks much in advance !

Ken

Leo_Shestakov · July 7, 2021, 2:20pm

Hello,

I would first like to make some comments about your graph schema, as slightly modifying it will make writing these queries much easier. I would recommend consolidating your three vertex types into a single user type, because currently, the email and SMS vertices are really just providing additional attributes to be associated with a single user (specifically, a single user_id).

Your new schema would look like this:

CREATE VERTEX user (

PRIMARY_ID user_id INT,
age INT,
gender STRING,
zip_code INT,
email_received_weekly INT,
email_opened_weekly INT,
email_received_monthly INT,
email_opened_monthly INT,
SMS_received_weekly INT,
SMS_received_monthly INT
)

CREATE DIRECTED EDGE email (FROM user, TO user)
CREATE DIRECTED EDGE sms (FROM user, TO user)

Resulting in simpler queries:

CREATE QUERY segment_by_age(INT target_age) {

start = {user.*};

age_group =
SELECT s
FROM start:s
WHERE s.age == target_age
ORDER BY s.email_received_monthly DESC, s.SMS_received_monthly DESC;

PRINT age_group;
}

CREATE QUERY segment_by_gender(STRING target_gender) {

start = {user.*};

gender_group =
SELECT s
FROM start:s
WHERE s.gender == target_gender
ORDER BY s.email_received_monthly DESC, s.SMS_received_monthly DESC;

PRINT gender_group;
}

Leo_Shestakov · July 7, 2021, 3:29pm

If you choose to continue with your original schema, you will not be able to efficiently utilize the ORDER BY feature of SELECT statements (because you are trying to order a set of one type of vertex by the attributes of another type). This will lead to much less intuitive and more clunky queries.

You will also need to declare user_id as a standard attribute for each vertex type. This is because a PRIMARY_ID cannot directly be used in a query body.

Example:

CREATE VERTEX user (

PRIMARY_ID user_id INT,
user_id INT,
age INT,
gender STRING,
zip_code INT
)

Next, you will need to create reverse edges for all of your directed edge types, so that we can travel backwards from a user_to_sms/user_to_email directed edge back to the originating user.

CREATE DIRECTED EDGE user_to_email (FROM user, TO email)
WITH REVERSE_EDGE=“reverse_email”

CREATE DIRECTED EDGE user_to_SMS (FROM user, TO sms)
WITH REVERSE_EDGE=“reverse_sms”

Now, here is the resulting style of query, which is much more complicated and cannot order the resulting user set by BOTH email and sms counts (in this case, I chose to use ordering by email counts).

CREATE QUERY segment_by_age(INT target_age) {
ListAccum @@id_list;

start = {user.*};

age_group =
SELECT s
FROM start:s
WHERE s.age == target_age
ACCUM @@id_list += s.user_id;

start = {email.*};

email_ordering =
SELECT s
FROM start:s
WHERE @@id_list.contains(s.user_id)
ORDER BY s.email_received_monthly DESC, s.email_received_weekly DESC;

user_list =
SELECT t
FROM email_ordering:s - (reverse_email:e) → user:t;

PRINT user_list;
}

Effectively, with the less optimized schema, you need to first identify a list of user IDs that meet the requirement for age (first SELECT statement). Next, you construct a list of email vertices that match these user IDs (second SELECT) and order them as required. Then, you map every one of these email vertices to its corresponding user source vertex with the help of reverse edges (third SELECT).

These steps are extremely redundant and inefficient, so I would highly recommend considering my optimization from the first reply.

tigerfan · July 7, 2021, 4:58pm

Hi @Leo_Shestakov,
Much appreciate your in-depth analysis and showing me the right direction. I am feeling more confident about converting existing table schema into Graph structure and use correct GSQL. Will continue to learn.

Couple of quick questions on using ML Algo on this data:
(1) How can I leverage Similarity algorithm to automatically find the groups of users based on age who exhibit similar engagement metrics either using Jaccard/Cosine/ Clustering ?

I referred to the example of Jacard similarity which is counting the connections. But in my case I have to count the values of the attributes of user nodes (like #emails_sent etc.)

(2) Is there any other type of ML Algo that can be applied to this single-vertex dataset ?
Will add more nodes in coming days. Just trying get something very basic stuff work first.

It would be great if you can share your valued inputs.

Thanks
Ken

Leo_Shestakov · July 7, 2021, 5:52pm

Hey Ken,

I’m going to assume that we are now working with the updated schema.

For the first question, your thinking is on the right track. If you want to use the more advanced algorithms to process similarity/community detection, they will rely on counting connections (edges) instead of attributes like “email_received_monthly”. Examples include Jaccard/Cosine Similarity (as you noted) and Louvain Clustering. This is because vertex and edge counts are easy to abstract, while unique attributes like “sms_received_monthly” are much harder to generalize into a community algorithm that can be applied uniformly to various graph schemas. Thus, you should ensure that every email/sms that is counted in your attributes also exists as an edge in your graph.

It would theoretically be possible to apply such algorithms to your quantitative vertex attributes, but you would need to personally modify complicated sections of the community algorithms to work with your schema. After all, you are just compiling the number of edges that point to a given user during a set time frame.

A simpler approach to accomplish what you desire could also be to manually set your own grouping rules (select rates of email/sms reception that act as boundaries between groups), and then manually filter the users that belong to each group using a SELECT statement that only grabs vertices WHERE the attributes fall into a certain predetermined range (e.g. more than 100 emails received monthly).

Some of the community algorithms may require weighted edges. This can easily be implemented in your schema by allowing a correspondence between 2 users to occur multiple times. For example, an email edge weight of 3 means that one user emailed another during 3 separate occasions. This can be implemented through an edge attribute referring to this weight. If you want to avoid this concept and an algorithm requires it, just set the weight of all edges to the same number.

Another algorithm idea I thought of was to use K-Core Clustering to identify the most influential/active users in your population (those who sent out sms/email to the greatest number of other users). This could be interesting to look at because your vertex attributes currently only relate to the reception of messages (instead of the rate at which they are sent).

It would be interesting to compare those users that receive the most contacts to those that send the most.

Feel free to brainstorm more applications using https://github.com/tigergraph/gsql-graph-algorithms