Tuesday 16 February 2010

Pirate testing

This is an example of pirate testing. It's using a test-suite consisting of language-neutral data. There's also a ruby implementation.

So what's pirate testing? It's a form of data-driven testing. The tests are specified in a language neutral format (XML, JSON, YAML, whatever) and then various test harnesses for different languages are written. These ensure that the different implementations all provide the same functionality.

Let me rephrase that in dictionary-speak:
Pirate Testing is a family of techniques for creating functional tests that are independent of a language or implementation. This usually means that the tests are data-driven or they're written in a neutral language that can be invoked from multiple other languages. The easiest way to think about this is that there is a DSL used for writing the tests (this can be a data format or a programming language) and one or more general purpose languages used for implementing functionality that satisfies the tests.
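
To make that definition concrete, here is a minimal sketch in Python. The record layout (`description`, `input`, `expected`) and the `slugify` function are hypothetical, invented purely for illustration; in practice the records would live in a shared file that every implementation's harness reads.

```python
import json

# Hypothetical language-neutral test data. Normally this would be a shared
# file (JSON, YAML, XML...) consumed by a harness in each language.
TEST_DATA = json.loads("""
[
  {"description": "lowercases input", "input": "Hello", "expected": "hello"},
  {"description": "replaces spaces",  "input": "a b c", "expected": "a-b-c"}
]
""")


def slugify(text):
    """Hypothetical implementation under test."""
    return text.lower().replace(" ", "-")


def run_records(records, func):
    """Return one (description, passed) pair per record in the data."""
    return [(r["description"], func(r["input"]) == r["expected"])
            for r in records]


results = run_records(TEST_DATA, slugify)
for description, passed in results:
    print("PASS" if passed else "FAIL", description)
```

A Ruby or Java harness would read the same records and call its own implementation, which is the whole point: the data is the specification.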

This is an idea that goes back to the days of Jon Bentley's 'little languages' but the name originated in this post by Sam Ruby. In his case he was literally dealing with the tests for a virtual machine called Pirate.

This: http://code.google.com/p/mimeparse/source/detail?r=14 is an implementation of pirate testing. Some things you should note. I dynamically build a PyUnit TestSuite which has one TestCase per record in the testdata.json file. This ensures that the tests have all the standard benefits (shared setup, shared teardown, independence, reporting, etc) that come from using a *Unit testing framework.
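
The general shape of that approach looks roughly like the sketch below. This is not the mimeparse code itself: the record fields, `function_under_test` and `build_suite` are assumptions made up for the example, and the data is inlined rather than read from a testdata.json file.

```python
import json
import unittest

# Stand-in for the contents of a shared data file; the field names are
# hypothetical.
RECORDS = json.loads("""
[
  {"name": "upper-cases letters", "input": "abc", "expected": "ABC"},
  {"name": "handles empty input", "input": "",    "expected": ""}
]
""")


def function_under_test(text):
    """Hypothetical implementation being checked against the data."""
    return text.upper()


class RecordTestCase(unittest.TestCase):
    """One TestCase instance per data record."""

    def __init__(self, record):
        # Tell unittest which method to run for this instance.
        super().__init__("run_record")
        self.record = record

    def setUp(self):
        # Shared setup would go here; it runs before every record.
        pass

    def run_record(self):
        actual = function_under_test(self.record["input"])
        self.assertEqual(actual, self.record["expected"])

    def shortDescription(self):
        # Report the record's name instead of the generic method name.
        return self.record["name"]


def build_suite(records):
    """Build a TestSuite containing one TestCase per record."""
    suite = unittest.TestSuite()
    for record in records:
        suite.addTest(RecordTestCase(record))
    return suite


if __name__ == "__main__":
    unittest.TextTestRunner(verbosity=2).run(build_suite(RECORDS))
```

Because each record becomes a real TestCase, the framework handles isolation and reporting for you.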

It's vital to make sure that your pirate tests are isolated from each other, that the test runner keeps going after it encounters a failure, and that you emit detailed diagnostic information when a test fails. One of my patches to the Feedparser project makes it dump out the entire environment when a test fails because it was proving difficult to debug problems. The implementation by Matt Sanford of the Twitter conformance tests currently doesn't do this.
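
As a rough illustration of those points (this is not the Feedparser patch, and `parse_number`, `DiagnosticCase` and `run_all` are hypothetical names), the sketch below attaches the whole record and the actual result to the failure message, and shows that one broken record doesn't stop the others from running.

```python
import json
import unittest

# Hypothetical records; the middle one is deliberately wrong.
RECORDS = [
    {"name": "ok", "input": "2", "expected": 2},
    {"name": "broken", "input": "3", "expected": 4},
    {"name": "also ok", "input": "5", "expected": 5},
]


def parse_number(text):
    """Hypothetical implementation under test."""
    return int(text)


class DiagnosticCase(unittest.TestCase):
    def __init__(self, record):
        super().__init__("check")
        self.record = record

    def check(self):
        actual = parse_number(self.record["input"])
        # Dump everything we know about this record into the failure
        # message so the report can be debugged without rerunning.
        details = json.dumps({"record": self.record, "actual": actual},
                             indent=2, sort_keys=True)
        self.assertEqual(actual, self.record["expected"], msg=details)


def run_all(records):
    """Run every record; a failure is recorded and the run continues."""
    suite = unittest.TestSuite(DiagnosticCase(r) for r in records)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_all(RECORDS)
print("ran", result.testsRun, "tests with", len(result.failures), "failure(s)")
for _test, report in result.failures:
    print(report)
```

The same idea carries over to any harness: whatever the framework, make the failure report include the input data, the expected value and what you actually got.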

Like any technique, pirate testing has benefits but it also has downsides. As it can really only be done with functional tests, low-level bugs can hide in individual implementations. To make matters worse, these kinds of test suites tend to grow very large and eventually take a long time to run. This dissuades people from running them very often, and as such they can easily end up in a state where a large percentage of the tests are always broken. However, the most insidious danger is that when 7 tests fail you may just fix each one on its own rather than spotting the common element responsible for the break.

The error messages from pirate tests tend to be generic and unhelpful unless you take extra steps when writing the test harness. They don't replace the need for unit testing, and keeping the test data readable is a challenge. Then there are the problems caused when someone decides to 'refactor' the test data to eliminate the inevitable duplication and they break several different implementations... That's assuming you can even get the various implementors to agree on which set of pirate tests is the canonical one.

So if they have all these problems, why do people bother? The obvious reason is that it makes it easier to bootstrap new implementations of a tool. But that isn't the biggest reason. The really big benefit of this technique is that it aggregates the lessons learned by all the implementations in one machine-readable format. This means that compatibility, compliance with a specification and interoperability aren't topics for debate but empirical matters which can be settled by a carefully crafted test case.

Monday 1 February 2010

New Frontiers: TDD and Refactoring Workshop at Brunel University

At the first Software Craftsmanship Conference in London I met Steve Counsell. He's an academic at Brunel University with an interest in Object Orientation, Metrics and Refactoring.

A while ago Steve invited me to give a short presentation at one of a series of workshops he's running. These 'Reftest' workshops are aimed at bringing academia and industry together to share our insights and experiences with both Test Driven Development and Refactoring.

There were too many interesting presentations at last week's event for me to describe them all. But they should all eventually end up on the Reftest website so you can read them there. That site doesn't have a feed yet so I've set up this: http://www.google.com/notificationservice/webchanges/webfeeds/11897478376600010732 which basically notifies you of any changes to the website.

Anyway, there was a lovely presentation by Charles Tolman from Quantel about some of the issues you run into when you have a 12-year-old codebase containing 17 million lines of C++ and Python. This led into an interesting discussion about detecting duplicate code. I maintain Same, but nowadays I find that Eric Raymond's Comparator is a better solution for most people, so I recommended that.

One of my favourite asides from Charles's 30 years in software (20 of them spent at Quantel) was "tools should enable the intelligence of the programmer rather than attempt to encapsulate the intelligence of the programmer."

The folks from the University of Kent showed some impressive results for their tools which provide:
  • model checking
  • property-based testing
  • interactive clone elimination

Best of all, many of the principles behind them aren't restricted to Erlang, so these kinds of techniques and tools can be applied to mainstream languages. They just chose Erlang because the tools were built as part of the ProTest project they're doing with Ericsson.

My own presentation looked at some of the problems with TDD and Refactoring.

I discussed problems with teaching the cluster of tacit knowledge that we call TDD and covered issues like the convenient fictions we use, the misleading tutorials and the overly simplistic examples that involve starting with a blank slate. I also threw in a brief digression about the Norvig-Jeffries kerfuffle.

The section of my presentation on refactoring mostly involved going back to Opdyke's thesis and looking at some of the consequences of his ideas. One thing that struck me was how insightful Opdyke's work was when it came to issues like preserving class invariants and the subtle side-effects of most refactorings. Somewhere along the line, though, as we automated refactoring we seem to have lost the heart of his ideas.

I'd expected the last section, with its references to model checking, property-based testing tools like ScalaCheck, mutation testing and fuzz testing, to be controversial, but the discussion ended up being a very productive look at ways in which we could fix the problems. The whole day made me very hopeful about the prospects for significant progress in the tools and techniques for TDD and refactoring.

Updated 2010/02/06: Ben Stopford has written up his perspective on the workshop and uploaded his PDF slides.