Home | All Posts

[20 Nov 2013] Use has() Over filter()

There’s been some discussion on the Aurelius mailing list regarding the advantages of using a single `filter` with a complex closure over usage of multiple `has` steps. You can read more about the discussion here. The core of the discussion is about realizing that Gremlin will compile `has` steps down to query objects that can be used by the underlying graph store to optimize the traversal, whereas a `filter` is merely a function that evaluates an item in the pipeline. Therefore, using `has` is generally better than using `filter`.

In midst of the discussion, I decided to try a small experiment. I wanted to show that two `has` steps were no slower than a single `filter` operation that performed the same function, even when no query optimizations were performed by the underlying store. Titan does perform these optimizations, so I decided to utilize a TinkerGraph instead. TinkerGraph utilizes DefaultQuery to process `has` operations. `DefaultQuery` is used by all Blueprints implementations that do not implement their own optimizations, so the results of this experiment should carry across all Blueprints implementations that don’t optimize.

To do the experiment, I opened a Gremlin REPL and constructed a graph with a half-million vertices with some random property values.

gremlin> g = new TinkerGraph()                                                                        
==>tinkergraph[vertices:0 edges:0]
gremlin> r = new Random()                                                                             
==>java.util.Random@4841a34
gremlin> (0..<500000).each{g.addVertex([i:it,r:r.nextLong().toString().padLeft(21,"0")])};null      
==>null
gremlin> g.v(0).map
==>{r=001842268616820437679, i=0}

I then wrote two queries. The first used `filter` with two property evaluations to determine whether they met the restriction or not and the second used two `has` operations one for each property being evaluated. I ensured that both queries returned the same number of vertices.

gremlin> g.V.filter{it.r>="00000000000" && it.i>=200000}.count()                               
==>150211
gremlin> g.V.has("r",T.gte,"00000000000").has("i",T.gte,200000).count()                        
==>150211

Once assured that I was returning the same data for both traversals, I timed the results.

gremlin> s=System.currentTimeMillis();g.V.has("r",T.gte,"00000000000").has("i",T.gte,200000)
         .iterate();System.currentTimeMillis()-s
==>95
gremlin> s=System.currentTimeMillis();g.V.filter{it.r>="00000000000" && it.i>=200000}
         .iterate();System.currentTimeMillis()-s       
==>5521

Of course, I was quickly reminded that I wasn’t using `getProperty` in my `filter` which balanced the numbers considerably:

gremlin> s=System.currentTimeMillis();g.V.has("r",T.gte,"00000000000").has("i",T.gte,200000)
         .iterate();System.currentTimeMillis()-s
==>95
gremlin> s=System.currentTimeMillis();g.V.filter{
            it.getProperty("r")>="00000000000" && it.getProperty("i")>=200000
          }.iterate();System.currentTimeMillis()-s       
==>150

The use of reflection to get the property value is very expensive and something I don’t use in production code (I shouldn’t have missed that).

The results show a traversal with two `has` steps performs somewhat better than a single `filter` step. What’s interesting about this finding, is the fact that TinkerGraph is not optimizing the query when processing the `has`. The results basically mean that finding a way to use `has` instead of `filter` will generally result in better traversal performance even for Blueprints implementations that don’t support query optimizations.

Home | All Posts