Graph Embeddings in Neo4j with GraphSAGE

Facial recognition, reverse image search and natural language processing are all based on vector embeddings. Graphs are a powerful way to create embeddings as well. The Neo4j graph database comes with an out-of-the-box embedding generation feature. In this post, we are going to apply a graph embedding algorithm to a pre-built graph.

Graph of Thrones (Credit: Lia Petrono)


Pre-built graph: Game of Thrones

We covered Cypher queries in Neo4j on a pre-built Game of Thrones graph in a recent post. We are going to work with the same pre-built graph in this tutorial, so you should read that post to have the same graph. I skip the initial graph-building step here because that post already covers it. The following video will help you as well.



Why do we need graph embeddings?

The embedding concept might seem very abstract. You may ask why we need graph embeddings. An embedding represents an entity with a vector that summarizes it, so similar entities end up with nearby vectors. For example, if the transactions closest to an online transaction were fraudulent, then that transaction would be evaluated as fraud as well. So, we can find similar entities and take action on them.
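The fraud example can be sketched in a few lines of Python. This is only a toy illustration: the 2-dimensional vectors, the labels and the helper name label_by_nearest are all made up, and real embeddings would have many more dimensions.

```python
import math

def label_by_nearest(query, labeled):
    """Return the label of the labeled embedding that is closest to `query`."""
    # math.dist computes the Euclidean distance between two points (Python 3.8+).
    _, nearest_label = min(labeled, key=lambda item: math.dist(query, item[0]))
    return nearest_label

# Hypothetical embeddings of past transactions with known outcomes.
past_transactions = [
    ([0.9, 0.1], "fraud"),
    ([0.1, 0.8], "legit"),
    ([0.2, 0.9], "legit"),
]

# A new transaction whose embedding lands next to the known fraud case.
new_transaction = [0.85, 0.2]
print(label_by_nearest(new_transaction, past_transactions))  # fraud
```

The new transaction inherits the label of its closest neighbor, which is the idea behind acting on similar entities.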

Embeddings in Neo4j

Neo4j wraps 3 common graph embedding algorithms: FastRP, node2vec and GraphSAGE. You should read this amazing blog post: Getting Started with Graph Embeddings in Neo4j by CJ Sullivan. I learnt a lot from that tutorial. It covers FastRP in production on the same GOT graph. We will cover the GraphSAGE algorithm on the same graph.

GraphSAGE

We are going to cover the GraphSAGE algorithm wrapped in Neo4j in this post. This algorithm was developed by researchers at Stanford University. Firstly, it is based on neural networks, whereas FastRP is based on a linear model. That’s why its representations are more powerful, but it is much slower than FastRP.

Node embeddings

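GraphSAGE finds embeddings in a way that resembles autoencoders. It learns to aggregate the features of a node’s neighbors, and it is trained so that nodes with similar neighborhood structures get similar embeddings.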

Autoencoders in graph analytics

In the default settings, the algorithm considers relationships up to a depth of 5 to find embeddings.

Relationship depth
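To make relationship depth concrete, here is a small Python sketch that collects every node within a given number of hops of a start node. It only illustrates what "up to a depth of 5" means; it is not how Neo4j implements the algorithm, and the chain graph and the function name nodes_within_depth are hypothetical.

```python
from collections import deque

def nodes_within_depth(adjacency, start, max_depth):
    """Breadth-first search returning {node: hops} for nodes at most max_depth hops from start."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths

# Hypothetical chain graph: a - b - c - d - e - f - g
chain = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"],
    "e": ["d", "f"], "f": ["e", "g"], "g": ["f"],
}

reachable = nodes_within_depth(chain, "a", 5)
print(sorted(reachable))  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Node g sits 6 hops away from a, so it falls outside the depth limit.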

Secondly, if your graph has a high velocity, meaning that you add new nodes day by day, this method does not require re-producing the embeddings of the existing nodes. You just need to find the embeddings of the new nodes. On the other hand, FastRP requires finding the embeddings of all nodes when new ones are added to the graph.

Thirdly, we often add properties to nodes and edges. For example, if you represent persons as nodes, then you may add age as a property. GraphSAGE considers the node properties whereas FastRP discards them.
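The last two points can be illustrated with a heavily simplified Python sketch. It embeds a node by joining its own properties with the mean of its neighbors' properties. Real GraphSAGE additionally applies trained weights and a non-linearity, which are left out here. The node names, the property values and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]

def embed(node, adjacency, features):
    """Concatenate a node's own features with the mean of its neighbors' features."""
    neighbor_features = [features[n] for n in adjacency[node]]
    return features[node] + mean_vector(neighbor_features)

# Hypothetical person nodes with [age, pageRank] as properties.
features = {
    "person_a": [11.0, 2.0],
    "person_b": [13.0, 1.0],
    "person_c": [17.0, 3.0],
}
adjacency = {
    "person_a": ["person_b", "person_c"],
    "person_b": ["person_a"],
    "person_c": ["person_a"],
}
print(embed("person_a", adjacency, features))  # [11.0, 2.0, 15.0, 2.0]

# A new node arrives later. The same function embeds it from its own
# properties and its neighbors; nothing has to be fitted again.
features["person_d"] = [8.0, 0.5]
adjacency["person_d"] = ["person_a", "person_b"]
new_embedding = embed("person_d", adjacency, features)
print(new_embedding)  # [8.0, 0.5, 12.0, 1.5]
```

Because the embedding is a function of properties and neighborhood rather than a per-node lookup table, it can be applied to nodes that were not in the graph during training.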

Game of Thrones Graph

This pre-built graph represents houses, persons (including knights, kings and the dead) and battles as nodes. You can see the giant graph below.





Game of Thrones Graph (Credit: CJ Sullivan)

The following queries show that the graph has 352 houses (e.g. Stark, Lannister, Targaryen); 38 battles (e.g. The Red Wedding); 384 knights (e.g. Jaime Lannister); 2166 persons (e.g. Arya Stark, Jon Snow); 7 regions (e.g. The North, Beyond the Wall); and 28 locations (e.g. King’s Landing, Winterfell).

MATCH (n:House) RETURN count(n) as num_houses;
MATCH (n:Battle) RETURN count(n) as num_battles;
MATCH (n:Knight) RETURN count(n) as num_knights;
MATCH (n:Person) RETURN count(n) as num_persons;
MATCH (n:Region) RETURN count(n) as num_regions;
MATCH (n:Location) RETURN count(n) as num_locations;

Creating the graph

Firstly, the name of the graph will be defined. It’s going to be game_of_thrones.

Secondly, nodes and properties will be declared. We will use the battle node with the defender_size and attacker_size properties; the person node with the age and pageRank properties; and finally the house node with no properties.

Thirdly, edges between nodes will be declared. If you run the CALL db.schema.visualization command, it will summarize nodes and edges. BTW, the trick is setting the orientation of the edges to undirected for GraphSAGE.

Visualizing the graph
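What the undirected orientation buys us can be shown with a small Python sketch: every directed edge becomes traversable in both directions, so a node that only has incoming edges still gets neighbors to learn from. The edge list and the function name undirected_adjacency are hypothetical, and this is not the library's internal representation.

```python
def undirected_adjacency(directed_edges):
    """Build an adjacency map in which every directed edge can be walked both ways."""
    adjacency = {}
    for source, target in directed_edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency

# Hypothetical directed edges pointing into a battle node.
edges = [
    ("house_x", "battle_y"),   # e.g. an ATTACKER edge
    ("person_z", "battle_y"),  # e.g. an ATTACKER_COMMANDER edge
]

adjacency = undirected_adjacency(edges)
print(sorted(adjacency["battle_y"]))  # ['house_x', 'person_z']
```

With directed edges only, battle_y would have no outgoing neighbors at all in this example.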

We will create the graph in Cypher as shown below.

CALL gds.graph.create(
'game_of_thrones',
{
  Battle: {
    label: 'Battle'
    , properties: {
      defender_size: {
        property: 'defender_size'
        , defaultValue: 0
      }, 
      attacker_size: {
        property: 'attacker_size'
        , defaultValue: 0
      }
    }
  },
  Person: {
    label: 'Person'
    , properties: {
      age: {
        property: 'age'
        , defaultValue: 36
      },
      pageRank: {
        property: 'pageRank'
        , defaultValue: 0
      }
    }
  },
  House: {
    label: 'House'
  }
}, 
{
  DEFENDER_COMMANDER: {type: 'DEFENDER_COMMANDER', orientation: 'UNDIRECTED'}
  , ATTACKER_COMMANDER: {type: 'ATTACKER_COMMANDER', orientation: 'UNDIRECTED'}
  , ATTACKER: {type: 'ATTACKER', orientation: 'UNDIRECTED'}
  , DEFENDER: {type: 'DEFENDER', orientation: 'UNDIRECTED'}
}
)

This will create the graph with 2556 nodes and 444 edges.

Creating the graph

Training the embedding model

Remember that the GraphSAGE algorithm is based on neural networks. That’s why it requires a training stage. We will start training as shown below.

CALL gds.beta.graphSage.train(
  'game_of_thrones',
  {
    modelName: 'battle_model'
    , featureProperties: ['defender_size', 'attacker_size', 'age', 'pageRank']
    , projectedFeatureDimension: 4
  }
)

The train function expects at least one item in the feature properties. Here, we feed the properties of the battle and person nodes we defined in the previous step. By default, the function expects every feature property to exist on all of the declared node labels, which is not the case here. We get around this with the projectedFeatureDimension argument; I set it to the number of items in the feature properties. BTW, I spent hours to find this argument :/ If you don’t set it, training fails with the following error.

Failed to invoke procedure gds.beta.graphSage.train: Caused by: java.lang.IllegalArgumentException: The following node properties are not present for each label in the graph: [defender_size, attacker_size, age, pageRank]. Properties that exist for each label are []
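The error message becomes easier to read with a small Python sketch. The first function reproduces the check behind the message: no property is shared by all three labels. The second one shows the general idea of mapping label-specific feature vectors into one common dimension. The fixed weight matrices and the property values are made up for illustration (a real model would learn its weights), and this is an assumption about the idea, not the library's actual implementation.

```python
# Properties declared per label in the graph projection above.
label_properties = {
    "Battle": ["defender_size", "attacker_size"],
    "Person": ["age", "pageRank"],
    "House": [],
}

def common_properties(label_properties):
    """Properties that are declared on every label."""
    property_sets = [set(props) for props in label_properties.values()]
    return sorted(set.intersection(*property_sets))

def project(vector, weights):
    """Multiply a feature vector by a (target_dim x len(vector)) weight matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

# Hypothetical label-specific weights mapping 2 properties to 4 dimensions.
weights = {
    "Battle": [[1, 0], [0, 1], [0, 0], [0, 0]],
    "Person": [[0, 0], [0, 0], [1, 0], [0, 1]],
}

battle_vector = project([1000.0, 3500.0], weights["Battle"])  # made-up sizes
person_vector = project([36.0, 0.0], weights["Person"])       # the default values

print(common_properties(label_properties))     # []
print(len(battle_vector), len(person_vector))  # 4 4
```

The empty list matches the "Properties that exist for each label are []" part of the error, and after projection both node types are described by vectors of the same length.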

Graph training

Notice that the community edition lets you store just one model in the catalog. If you need to store multiple models, then you should switch your Neo4j server to the enterprise edition.





Writing embeddings into the nodes

Finally, we use the trained model to create embeddings and store them as properties on the nodes. We will see the embedding property on the nodes when the write function finishes.

CALL gds.beta.graphSage.write(
  'game_of_thrones',
  {
    writeProperty: 'embedding',
    modelName: 'battle_model'
  }
)

Writing embeddings into nodes

We can see the embedding property now.

Embeddings in nodes

Similarity search

Embeddings now exist in the battle, person and house nodes because we declared those nodes when we created the graph. The Red Wedding impressed me a lot when I watched Game of Thrones. Let’s find the most similar battles in the series.

We will first find The Red Wedding battle as p1, match it to the rest of the battles on the condition that they have different names, and compute the euclidean distance between the embeddings of the battles. Finally, we return the 10 battles with the lowest distances.

MATCH (p1:Battle)
MATCH (p2:Battle)
WHERE p1.name <> p2.name and p1.name = "The Red Wedding"
WITH p1, p2, gds.alpha.similarity.euclideanDistance(p1.embedding, p2.embedding) as distance
RETURN p1.name, p2.name, distance
ORDER BY distance
LIMIT 10

Battles similar to The Red Wedding

Siege of Riverrun, Retaking of Deepwood Motte, Siege of Winterfell and Battle of Deepwood Motte have embeddings almost identical to that of The Red Wedding. Battle of the Camps comes after them.
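The logic of the query above can be mirrored in plain Python, which may help if you export the embeddings and want to run the search outside of Neo4j. This is a minimal sketch: the battle names and the 2-dimensional vectors are hypothetical, and most_similar is a helper name I made up.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(target_name, embeddings, limit=10):
    """Rank all other entries by their distance to the target's embedding."""
    target = embeddings[target_name]
    scored = [
        (name, euclidean_distance(target, vector))
        for name, vector in embeddings.items()
        if name != target_name  # same role as p1.name <> p2.name
    ]
    # Lowest distance first, like ORDER BY distance LIMIT n.
    return sorted(scored, key=lambda pair: pair[1])[:limit]

# Hypothetical embeddings keyed by battle name.
embeddings = {
    "battle_a": [0.0, 0.0],
    "battle_b": [3.0, 4.0],
    "battle_c": [0.0, 1.0],
    "battle_d": [6.0, 8.0],
}

print(most_similar("battle_a", embeddings, limit=2))
# [('battle_c', 1.0), ('battle_b', 5.0)]
```

A distance of 0 would mean identical embeddings, so the lower the score, the more similar the two nodes are.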

Now, let’s find the characters similar to Arya Stark. We will run a query similar to the previous example.

MATCH (p1:Person)
MATCH (p2:Person)
WHERE p1.name <> p2.name and p1.name = "Arya Stark"
WITH p1, p2, gds.alpha.similarity.euclideanDistance(p1.embedding, p2.embedding) as distance
RETURN p1.name, p2.name, distance
ORDER BY distance
LIMIT 10

Similar characters to Arya Stark

Her sister Sansa seems to be the most similar one to Arya.

Conclusion

So, we’ve covered how to run the GraphSAGE algorithm in the Neo4j graph database. Graphs are a powerful way to create embeddings and Neo4j is a powerful tool for this task. We ran a use case that finds similar characters and battles in the pre-built Game of Thrones graph.

Even if you create embeddings outside of Neo4j, it is still a powerful tool for analytics. See its implementation for facial recognition. There, we use the deepface library for Python to create embeddings, but Neo4j can find relations that human eyes could not find easily.

I hope you enjoyed this post. If you like it, you can support it by commenting on the videos I shared in this post and subscribing to my channel 🙏.





