Tuesday, 16 February 2010

Pirate testing

This is an example of pirate testing. It uses a test suite consisting of language-neutral data. There's also a Ruby implementation.

So what's pirate testing? It's a form of data-driven testing. The tests are specified in a language-neutral format (XML, JSON, YAML, whatever), and a test harness is then written for each language. These harnesses ensure that the different implementations all provide the same functionality.

Let me rephrase that in dictionary-speak:
Pirate Testing is a family of techniques for creating functional tests that are independent of any one language or implementation. This usually means that the tests are data-driven or that they're written in a neutral language that can be invoked from multiple other languages. The easiest way to think about this is that there is a DSL used for writing the tests (this can be a data format or a programming language) and one or more general-purpose languages used for implementing functionality that satisfies the tests.

This is an idea that goes back to the days of Jon Bentley's 'little languages' but the name originated in this post by Sam Ruby. In his case he was literally dealing with the tests for a virtual machine called Pirate.

This: http://code.google.com/p/mimeparse/source/detail?r=14 is an implementation of pirate testing. One thing you should note: I dynamically build a PyUnit TestSuite which has one TestCase per record in the testdata.json file. This ensures that the tests get all the standard benefits (shared setup, shared teardown, independence, reporting, etc.) that come from using a *Unit testing framework.
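To give a feel for the shape of that approach, here is a minimal sketch. It is not the actual mimeparse harness: the record layout, the inlined TEST_DATA (standing in for a shared file such as testdata.json), the collapse_spaces function under test, and the RecordTestCase and build_suite names are all assumptions made up for illustration.

```python
import json
import unittest

# Hypothetical language-neutral test data. In a real project this would live
# in a shared file (e.g. testdata.json) that every implementation reads.
TEST_DATA = json.loads("""
[
  {"name": "single_space", "input": "a b", "expected": "a b"},
  {"name": "repeated_spaces", "input": "a   b", "expected": "a b"},
  {"name": "leading_and_trailing", "input": "  a b  ", "expected": "a b"}
]
""")


def collapse_spaces(text):
    """Hypothetical function under test: squeeze runs of whitespace."""
    return " ".join(text.split())


class RecordTestCase(unittest.TestCase):
    """One TestCase instance is created per record in the test data."""

    def __init__(self, record):
        # Tell unittest which method to run for this instance.
        super().__init__("run_record")
        self.record = record

    def id(self):
        # Report the record's name so a failure points back at the data.
        return "pirate." + self.record["name"]

    def run_record(self):
        actual = collapse_spaces(self.record["input"])
        self.assertEqual(actual, self.record["expected"])


def build_suite(records):
    """Build a TestSuite dynamically, one TestCase per record."""
    suite = unittest.TestSuite()
    for record in records:
        suite.addTest(RecordTestCase(record))
    return suite
```

Running unittest.TextTestRunner(verbosity=2).run(build_suite(TEST_DATA)) then reports each record as a separate test, so one bad record doesn't hide the results of the others.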

It's vital to make sure that your pirate tests are isolated from each other, that the test runner keeps going after it encounters a failure, and that you emit detailed diagnostic information when a test fails. One of my patches to the Feedparser project makes it dump out the entire environment when a test fails, because problems were proving difficult to debug. The implementation by Matt Sanford of the Twitter conformance tests currently doesn't do this.

Like any technique, pirate testing has benefits but it also has downsides. Because it can really only be done with functional tests, low-level bugs can hide in individual implementations. To make matters worse, these kinds of test suites tend to grow very large and eventually take a long time to run. This dissuades people from running them very often, so they can easily end up in a state where a large percentage of the tests are always broken. However, the most insidious danger is that when 7 tests fail you may just fix each one on its own rather than spotting the common element responsible for the break.

The error messages from pirate tests tend to be generic and unhelpful unless you take extra steps when writing the test harness. Pirate tests don't replace the need for unit testing, and keeping the test data readable is a challenge. Then there are the problems caused when someone decides to 'refactor' the test data to eliminate the inevitable duplication and breaks several different implementations... That's assuming you can even get the various implementors to agree on which set of pirate tests is the canonical one.

So if they have all these problems, why do people bother? The obvious reason is that it makes it easier to bootstrap new implementations of a tool. But that isn't the biggest reason. The really big benefit of this technique is that it aggregates the lessons learned by all the implementations in one machine-readable format. This means that compatibility, compliance with a specification, and interoperability aren't topics for debate but empirical matters that can be settled by a carefully crafted test case.