Gremlin Snippets are typically short and fun dissections of some aspect of the Gremlin language. For a full list of all steps in the Gremlin language see the Reference Documentation of Apache TinkerPop™. This snippet is based on Gremlin 3.4.8.Please consider bringing any discussion or questions about this snippet to the Gremlin Users Mailing List.
From the moment I saw this question in StackOverflow, I could mostly envision how it would work if I were to solve the problem with a custom VertexProgram in OLAP style. Unfortunately, this was not going to help the person who was asking the question because they were using Python and Neptune. It is not possible to write VertexProgram implementations in Python and Neptune doesn’t yet directly support a GraphComputer implementation to support it even if the implementation had been done on the JVM. Despite a VertexProgram not quite being the right answer, my mind was so locked into it, I had trouble considering the issue any other way and the Gremlin I was writing to try to get an answer reflected that.
Before progressing too much further, let’s cast some attention to the question. It presented an algorithm that involved calculating a “favor score” based on “weight” properties assigned to “favor” edges between “person” vertices. The sample graph looked like this:
From a starting vertex, sum the outgoing edge weights as totalFavor and accept the value of an “incoming” currentFavor or use “1” if this is the starting vertex.
Traverse each outgoing edge and for each adjacent vertex calculate a “proportional favor” by multiplying the currentFavor by the current edge weight divided by totalFavor. This “proportional favor” is then added to the total for the adjacent vertex.
For each adjacent vertex recalculate favor starting from step 1 until there are no more paths to take.
The algorithm is a lot to take in as text and is more clearly expressed as code, so hopefully the examples to follow will help clarify what is happening. I’ve explained that my first instinct after reading this question pointed toward a VertexProgram, but knowing the answer would not satisfy the problem, my second instinct was that this algorithm was best implemented with sack()-step. While I couldn’t quite get that approach to work right, Kelvin Lawrence was good enough to produce the answer which was suprisingly elegant in its base form:
Unable to write the above query in the time I’d allowed myself, I figured I could at least get the VertexProgram approach out of my head and implement it as an alternative answer. As a first step to understanding this solution, it is important to gather an understanding of what a VertexProgram is and what its relation is to a GraphComputer. A GraphComputer is the heart of OLAP-style processing in TinkerPop, where unlike OLTP-style workloads which focus on processing a localized subset of the graph, OLAP instead focuses on large subsets of the graph or the entire graph. The GraphComputer is meant to help orchestrate the work of processing graph in bulk synchronous parallel style. The “work” is the VertexProgram in action, where we can think of the GraphComputer as copying the VertexProgram to each vertex in the graph for execution in parallel until the termination condition of the VertexProgram is met.
We tend to think of OLAP for large-scale graphs (on the order of billions of edges or more) which would typically imply the use of Spark and its GraphComputerimplementation, but for learning purposes and perhaps some use cases this approach might make sense for TinkerGraphComputer. TinkerGraphComputer is designed to work with TinkerGraph and uses a thread pool to distribute work for parallel processing of the graph. In relation to the StackOverflow question, I’d envisioned a solution that would use Java to extract a subgraph() from Neptune and then with that subgraph as a TinkerGraph on the client, TinkerGraphComputer could be used to process the custom VertexProgram. There would potentially be some expense in the serialization cost for the subgraph, but depending on the size of that subgraph the execution of the TinkerGraphComputer running all in-memory would likely be quite fast.
There is a fair bit of boilerplate code to building a VertexProgram and perhaps that points to an area for improvement. With that concern aside, a VertexProgram begins as follows:
The <Double> generic declaration is meant to specify the type of the message that is passed from one vertex to the next. In this case, the Double refers to the “currentFavor” described in step 1 of the algorithm. Recall that VertexProgram is executed over and over again in parallel on each vertex until the termination condition is met. With that context in mind, think of the message passing as a way for those programs on each vertex to communicate with each other. Vertices send messages and receive messages by way of a Messenger and that object is made available in the VertexProgram by way of this method:
The execute() method is the core part of the VertexProgram lifecycle and typically where most of the logic for the VertexProgram resides. It is called by the GraphComputer worker for each iteration of the process until the termination requirements are met. In addition to the Messenger, the execute() method also supplies the current Vertex on which the program is running and an instance of the Memory of the GraphComputer which is a global data structure where vertices can share information. My implementation of this method looked like this:
While the complete code for the FavorVertexProgram will be shown below, the above code is basically the translation of that Gremlin sack() implementation, thus taking us from OLTP to OLAP. Writing VertexProgram implementations requires a different style of thinking than simply writing a bit of Gremlin, yet both can encode the same algorithm. We see this notion evident in a number of the Gremlin Recipes, where we demonstrate both an OLTP and an OLAP approach to getting an answer and is best exemplified in the Connected Components and Shortest Path sections. It is important to figure out whether or not there is an advantage to taking one approach or the other given your use case.
In the context of the original question, I would think that the OLTP approach is best because it works with Neptune in the programming language of the user’s choice, does much of the heavy lifting with sack() which is typically quite efficient and I suspect the traversal itself is not global in nature. If this last point is not the case (issues with Neptune and Python aside), then OLAP may prove to be a better choice and we could likely adapt this FavorVertexProgram to calculate more than one favor set at a time, so rather than just consider “jane” we might do the same calculation in parallel for all the people in the graph.
The full FavorVertexProgram can be found below:
This FavorVertexProgram can then be utilized as follows, where graph is a TinkerGraph with the sample data from the script at the start of this post:
which would print:
While most users won’t require a custom VertexProgram in the first days of days of their work with their TinkerPop graph system, the possibility for needing to use OLAP ends up looming quite quickly as a graph can grow at a suprising rate. In many cases, users can be satisfied by writing the Gremlin that they’ve grown accustomed to and executing it as a distributed traversal in OLAP.
Perhaps users are not aware but when they do the above, their Gremlin is being executed by way of the TraversalVertexProgram which processes Gremlin steps in such a way so as to allow them to be executed in the GraphComputer model. This general purpose VertexProgram with the power of Gremlin can cover a massive set of large scale traversal use cases. For all other cases, it may be necessary to develop your own VertexProgram to solve your problems.