I have to apologize in advance. I'm in the middle of squeezing out a book chapter and haven't had time to tackle any of the BRE articles on my writing TODO list. So today I'm going to have to do what I usually hate to do -- become a blog echo chamber. The post I want to alert you to was published almost a year ago by Charles Young and is entitled "Microsoft's Rule Engine Scalability Results - A comparison with Jess and Drools". The article argues that MS's published test results do not justify the criticism that Microsoft's BRE fails to implement RETE properly. That's a long way to go for such an ephemeral point.
I recommend the article not because of the claims it makes, but because it demonstrates the use of three different BREs through actual code samples and provides an approach to benchmarking and comparing BREs for scalability and performance. I'm a firm believer in benchmarking BREs based on your actual use cases rather than on abstract worst-case scenarios like Miss Manners. Read through this blog entry. Even if his approach is flawed, as I believe it is, he deserves full credit for publishing the full code to his test and opening it up to public scrutiny.
Topics: Benchmarking, Business Rules Engines, Drools, JBoss Rules, Jess
Thanks for your comments, and also for an interesting blog site in general.
I really, really don’t want to open up, again, the debate I was engaged in last year. Suffice to say that there was context and background to why the article was published, and that I was tackling claims (since graciously retracted) about Microsoft’s engine based on some rather poorly conceived performance figures contained in a Microsoft whitepaper. Because the only other approach I could have taken was the rather indefensible “I’ve used Reflector to inspect Microsoft’s code and it really, really is Rete”, my approach was to say “you can’t infer it’s not Rete just because the graph is a straight line”.
A long time ago, now, I implemented a variant of Miss Manners for MS BRE (MS BRE doesn’t support negated conjunctions, so we are forced to ‘cheat’). I’ve been challenged a couple of times to publish my findings, and being the person I am, could not do this by halves. So, hopefully in the next few weeks now, we should be publishing an overly long and complex paper on this. One of the main points we make in the paper is that Miss Manners is absolutely useless (we use slightly more moderated language in the paper) as a general-purpose benchmark for comparative performance testing. I won’t go into why, here, but I note the Drools guys have made some similar arguments public. We have gone further, and have some additional, rather compelling (I think), reasons to add to what they have already said.
I’ve spent far more time than I should looking at rules engine benchmarking. One conclusion is that any claims based on a single, or a few, benchmark tests can never be taken seriously. The features and behaviour of these engines are far too varied to be described in such simplistic terms. If I compare CLIPS, Jess and MS BRE, for example, I know how to make each one of them appear to be the ‘winner’ or the ‘loser’ in performance comparisons. I also know how to exaggerate differences which in most scenarios would be near-undetectable in order to get the results I want. CLIPS is often the fastest for small data sets. Jess has excellent memory indexing features which make it the best in many highly combinatorial scenarios. MS BRE performs automatic Rete optimisation and can execute certain types of condition evaluation faster than the other two engines.
Oh, and the Microsoft engine really, really, really is Rete! The paper we plan to publish demonstrates this fairly conclusively. We even show the MS BRE Rete for our Miss Manners variant which is almost identical to the Jess Rete.
Comment by Charles Young, Saturday, August 19, 2006 @ 10:22 am
I look forward to the paper with the full source and output for manners if you’re given permission by Microsoft. I can say pretty conclusively that a rule engine can get significantly better results if the rule engine “short-cuts” and does not really do full cross product matching + propagation. There are several ways of doing this. Forgy mentions making a subgoal, which switches the runtime to the equivalent of iLog’s sequential. Ignoring the well-known hack of changing the order of the NOT CE in the find seating rule, the other way is to run in backward chaining mode like LEAPS. There are several ways of detecting when and if a rule engine is taking a short-cut. The number of activations added/retracted to the agenda should be identical to Jess or Drools3. If it isn’t, it means the engine is taking a short-cut. I just posted some results today for Sumatra: http://woolfel.blogspot.com/2006/09/lazy-agenda-with-linkedactivationwrapp.html
The code is all on SourceForge, so anyone can run the test for themselves and turn on (watch all) to see exactly what is happening. Although many people consider manners a bad benchmark, I disagree. It is a useful benchmark. People just shouldn’t use it to predict how their application will perform in production. It does measure how efficient the implementation is and points to potential weak areas. As a developer, having the information is useful and allows me to work around those weaknesses. If I don’t know how it performs under stress, what are my chances of building an efficient application?
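The activation-count check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AgendaStats` shape, the engine names and the numbers are all made up for the example (they are not measurements), and collecting the counters from each engine's trace output is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgendaStats:
    """Agenda counters taken from an engine's trace (hypothetical shape)."""
    added: int      # activations added to the agenda
    retracted: int  # activations removed from the agenda without firing


def engines_taking_shortcuts(reference, candidates):
    """Return the names of engines whose counters differ from the reference."""
    return sorted(name for name, stats in candidates.items() if stats != reference)


# Placeholder numbers for illustration only; they are not real results.
reference = AgendaStats(added=1000, retracted=400)
candidates = {
    "engine_a": AgendaStats(added=1000, retracted=400),
    "engine_b": AgendaStats(added=620, retracted=35),
}

suspects = engines_taking_shortcuts(reference, candidates)
print(suspects)
```

With these placeholder values only `engine_b` is flagged, because its agenda traffic differs from the reference run on the same data set.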
Comment by Peter Lin, Wednesday, September 6, 2006 @ 12:24 am
It is not Microsoft who are holding things up – the paper was written entirely independently of MS, and they have no say in the matter.
In our implementation, we did our very best, despite the lack of support for negated conjunction, to keep within the spirit of Miss Manners. However, of course, without built-in support for negated conjunction, we were forced to implement our own custom approach to performing tests for non-existence of Path and Chosen facts. We use a rather brittle approach in which, each time a new Path or Chosen fact is asserted, a record of this new fact is maintained in memory (outside of the knowledge of the engine). In order to test for non-existence, we then look up these records to see if we can find a match.
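A rough sketch of this kind of side table, in Python rather than the .NET code the paper uses, might look like the following. Everything here is an assumption made for illustration: the class and method names are hypothetical, and the key fields (seating id, guest name, hobby) follow the usual Miss Manners formulation rather than anything stated about the actual implementation.

```python
class NonExistenceRecords:
    """Records of asserted Path and Chosen facts, kept outside the engine."""

    def __init__(self):
        self._paths = set()   # keys: (seating_id, guest_name)
        self._chosen = set()  # keys: (seating_id, guest_name, hobby)

    def record_path(self, seating_id, guest_name):
        # Must be called by hand whenever a rule action asserts a Path fact.
        self._paths.add((seating_id, guest_name))

    def record_chosen(self, seating_id, guest_name, hobby):
        # Must be called by hand whenever a rule action asserts a Chosen fact.
        self._chosen.add((seating_id, guest_name, hobby))

    def no_path(self, seating_id, guest_name):
        # Stands in for a negated Path condition in the find_seating rule.
        return (seating_id, guest_name) not in self._paths

    def no_chosen(self, seating_id, guest_name, hobby):
        # Stands in for a negated Chosen condition in the find_seating rule.
        return (seating_id, guest_name, hobby) not in self._chosen


records = NonExistenceRecords()
before = records.no_path(1, "guest_2")   # nothing recorded yet
records.record_path(1, "guest_2")
after = records.no_path(1, "guest_2")    # the record now blocks the test
```

The brittleness is visible in the sketch: the engine knows nothing about these records, so any assertion that bypasses `record_path` or `record_chosen` silently leaves the table out of step with working memory.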
We had to depart from Miss Manners in two other respects. Firstly, MS BRE does not support an equivalent conflict resolution strategy to OPS5 LEX or Jess/CLIPS depth-first. We therefore needed to add an additional constraint to the find_seating rule to force the intended depth-first approach. Our first attempt failed because, although we introduced a condition that should have made almost no change to the amount of evaluation done, Microsoft’s Rete compiler effectively optimised the Rete, causing a huge decrease in evaluation. We then hit on an approach in which we implemented the same logical constraint, but in custom code, hence making it ‘invisible’ to the Rete compiler. This worked very nicely.
The other change was that, again because of the lack of support for conflict resolution strategies similar to other engines, we had to introduce priority (salience) settings on three rules. This made no significant difference to the amount of evaluation.
The introduction of a custom mechanism to check for non-existence almost certainly reduced the amount of work done overall in comparison to how things would work if negated conjunction were supported. However, we are still a very long way away from the kind of outrageous cheats that some have used. More than that, we provide very full disclosure and make it repeatedly clear that our results cannot be used to determine performance comparisons between MS BRE and other engines.
I believe that Miss Manners is a very poor choice of benchmark *for general purpose performance comparison between different engines*. There are several reasons for this, and in any case, I believe that you cannot adequately measure general comparative performance between two engines using a single benchmark. The other problem is that, as the number of guests is increased, the benchmark quickly becomes almost totally dominated by the work done at a single NotCE node. Even with reasonable beta memory indexing (e.g. as per Jess, Drools), approx. 95% of all evaluations occur at this one node for the 128 guest run. The benchmark is therefore good for comparing performance of single NotCE nodes working under stress, but not a lot else. I admit that, for most engines, the major underlying factor here is really the effectiveness of beta memory indexing. However, NotCE nodes are not strictly mandated by Rete, and not supported by all Rete engines. They can also potentially be implemented in slightly different ways, and might well not, for all engines, exhibit the same efficiency as join nodes. Hence, I maintain that Manners is a really poor choice for making general performance comparisons between different engines.
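To see why beta memory indexing matters so much at a node like this, here is a toy Python model of a single not-node. It is not how Jess, Drools or MS BRE implement anything: the function names, the synthetic data and the choice to count one hash probe as one evaluation are all assumptions made purely for illustration.

```python
def not_node_linear(left_tokens, right_facts):
    """Unindexed not-node: scan the right memory for every left token."""
    evaluations = 0
    passed = []
    for token in left_tokens:
        blocked = False
        for fact in right_facts:
            evaluations += 1          # one join test per fact examined
            if fact == token:
                blocked = True
                break
        if not blocked:
            passed.append(token)      # no matching fact, so the token passes
    return passed, evaluations


def not_node_indexed(left_tokens, right_facts):
    """Indexed not-node: hash the right memory, then probe once per token."""
    index = set(right_facts)
    evaluations = 0
    passed = []
    for token in left_tokens:
        evaluations += 1              # one probe, counted as one evaluation
        if token not in index:
            passed.append(token)
    return passed, evaluations


# Synthetic (seating_id, guest) pairs; these are not Miss Manners data.
left = [(s, g) for s in range(50) for g in range(20)]
right = [(s, g) for s in range(50) for g in range(0, 20, 2)]

linear_passed, linear_evals = not_node_linear(left, right)
indexed_passed, indexed_evals = not_node_indexed(left, right)
print(linear_evals, indexed_evals)
```

Both versions let the same tokens through, but the unindexed one does far more join tests, and that gap widens as the memories grow. This is the sense in which a benchmark dominated by one such node ends up measuring indexing more than anything else.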
Comment by Charles Young, Thursday, October 19, 2006 @ 2:32 pm