
[28 Feb 2014] Use Collate to Batch Gremlin Results

Groovy offers many functions and syntax features that extend the capabilities and flexibility of the Gremlin programming language. A recent polyglot query I came across highlighted how important it can be to know Groovy as well as you know Gremlin.

Consider a situation where you want to write some Gremlin that pulls data from the TinkerPop toy graph and passes it to a query against a MySQL database (the sql variable is assumed to be a groovy.sql.Sql instance already connected to that database). It might look something like this:

gremlin> g  = TinkerGraphFactory.createTinkerGraph()
==>tinkergraph[vertices:6 edges:6]
gremlin> g.V.name.gather.transform{
             sql.rows("SELECT * FROM toy WHERE n IN ('" + it.join("','") + "')")}.next()
==> ...

The code above gets the name property from each vertex, uses gather to pull the names into a List, and then transforms that List by using its elements as the values of an IN clause in a SQL statement. That works nicely with the toy graph because it is small (only six vertices). In a larger graph, the number of returned values might exceed what the IN expression can handle or otherwise cause the query to perform poorly. In that case, it might be better to batch the values passed to the IN clause so that the queries execute in smaller pieces.
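To make the string concatenation inside the transform closure concrete, here is a minimal plain-Groovy sketch that builds the same IN clause text without a graph or a database. The three-name list is just sample input for illustration.

```groovy
// Sample input standing in for the List that gather produces.
def names = ['lop', 'vadas', 'marko']

// Same concatenation as in the traversal: join the names with ',' between
// them and wrap the result in ('...').
def query = "SELECT * FROM toy WHERE n IN ('" + names.join("','") + "')"

assert query == "SELECT * FROM toy WHERE n IN ('lop','vadas','marko')"
println query
```

Every name in the List ends up in a single statement, which is why the size of the List matters as the graph grows.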

Groovy provides a very succinct way to batch a List in the form of the collate method. Consider the following Gremlin session, which still uses the toy graph for demonstration purposes:

gremlin> g.V.name.gather                                             
==>[lop, vadas, marko, peter, ripple, josh]

The output above shows that a List of names is being passed to the transform function shown in the first block of code. To batch that List, replace gather with toList() followed by collate, as follows:

gremlin> g.V.name.toList().collate(2)
==>[lop, vadas]
==>[marko, peter]
==>[ripple, josh]
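The toy graph happens to yield an even number of names, so it does not show what collate does with a remainder. The plain-Groovy sketch below uses a five-name list (sample data, not taken from the graph) to illustrate the default behavior and the optional keepRemainder argument.

```groovy
// Five names, so the last batch cannot be filled.
def names = ['lop', 'vadas', 'marko', 'peter', 'ripple']

// By default the trailing partial batch is kept.
def batches = names.collate(2)
assert batches == [['lop', 'vadas'], ['marko', 'peter'], ['ripple']]

// Passing false as the second argument drops the partial batch instead.
def fullOnly = names.collate(2, false)
assert fullOnly == [['lop', 'vadas'], ['marko', 'peter']]
```

For batching query parameters the default is the one to use, since dropping the remainder would silently skip values.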

Using collate, the names are placed in batches of two. Each batch is passed to transform in turn, which effectively creates a batched approach to issuing the SQL query. In this case, three separate queries would be executed, each retrieving one small chunk of the total data. The following shows the modified traversal using collate:

gremlin> g.V.name.toList().collate(2).transform{
             sql.rows("SELECT * FROM toy WHERE n IN ('" + it.join("','") + "')")}.scatter.toList()
==> ...

Note that the end of the statement changed as well: scatter unrolls the rows returned by each query, and toList() collects them. This reassembles the result data into a single List, just as the single SQL request in the original traversal would have returned.
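The batch-then-reassemble pattern can also be sketched in plain Groovy, outside of a Gremlin pipeline. In the sketch below, fetchRows is a hypothetical stand-in for sql.rows(...) that fabricates one row per name so the example runs without a database, and collectMany plays the flattening role that scatter plays in the traversal.

```groovy
// Hypothetical stand-in for sql.rows(...): returns one row (a Map) per name.
def fetchRows = { List<String> batch ->
    batch.collect { [n: it] }
}

def names = ['lop', 'vadas', 'marko', 'peter', 'ripple', 'josh']

// One "query" per batch of two; collectMany flattens the per-batch
// results into a single List.
def rows = names.collate(2).collectMany { batch -> fetchRows(batch) }

assert rows.size() == 6
assert rows*.n == names
```

The result has the same shape as a single unbatched query would produce, so code that consumes the rows should not need to know that batching took place.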

Groovy has many useful functions. Some are better known than others, but every once in a while a lesser-known one might be just what is needed to get the job done.
