Narrative tests are lousy unit tests

I want to stop people abusing Python's doctest format. Many of the tests I've seen written as doctest files would have been better off as plain unittest files. I'm going to try to explain why. I have many gripes about how people use doctests, but probably the biggest is that narrative tests are lousy unit tests.

Narratives tell a story. Something happens, then another thing, and another thing, one after the other, in sequence. Earlier events influence later ones as the story gradually assembles a complete picture. Humans like stories; our brains are used to telling them and receiving them.

Technical documentation is often written with a narrative. Tutorials are an obvious case, but not the only one. A guide to an API may show a series of different examples, each contrasting with the others in ways that explain to the reader what they need to understand.

Automated tests can have narratives too, of course. A narrative test is quite easy to write: write some code that does something (and check the result), then do something else (and check that result), and so on until you've done (and checked) everything you want to do (and check). Doctests make this particularly easy. Here's a toy example of a doctest:

   Instantiate a Frobber.

     >>> frobber = Frobber()
     >>> frobber.has_frobbed()
     False

   Now frob it.

     >>> frobber.frob()
     >>> frobber.has_frobbed()
     True

   It can't be frobbed twice.

     >>> frobber.frob()
     Traceback (most recent call last):
     ...
     AlreadyFrobbedError: ...

Narrative tests can be good acceptance tests. An acceptance test often takes the form of a story; an example might be: “A user who is not logged in visits a web page. They click a particular link that needs a logged-in user, so they get taken to a login screen. The user has no account yet, so they walk through the account creation wizard. Once the wizard is completed, the account is created and they are logged in, and they are taken to the link they originally clicked on.”

So, having shown how they are easy to write, and appropriate for some tests, I'll now explain why narratives make lousy unit tests.

A typical unit test has four phases:

  1. Set up a fixture
  2. Interact with the system-under-test
  3. Verify the outcome
  4. Tear down the fixture

Or, to phrase it the way Behaviour-driven Development people might, each unit test says: “Given situation X, when I do Y, then Z happens.”
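
To make the four phases concrete, here is a minimal sketch using the standard unittest module. The scratch-directory fixture and the names WriteFileTest and note.txt are made up for illustration; the comments map each step to the list above and to the Given/When/Then phrasing.

```python
import os
import shutil
import tempfile
import unittest


class WriteFileTest(unittest.TestCase):
    def setUp(self):
        # 1. Set up a fixture: "given an empty scratch directory".
        self.scratch_dir = tempfile.mkdtemp()

    def tearDown(self):
        # 4. Tear down the fixture, whether the test passed or failed.
        shutil.rmtree(self.scratch_dir)

    def test_written_text_can_be_read_back(self):
        path = os.path.join(self.scratch_dir, "note.txt")
        # 2. Interact with the system under test: "when I write a file".
        with open(path, "w") as f:
            f.write("hello")
        # 3. Verify the outcome: "then reading it gives the same text".
        with open(path) as f:
            self.assertEqual(f.read(), "hello")
```

Because setUp and tearDown run around every test method, each test starts from a fresh fixture and cannot be affected by whatever an earlier test did.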

Good unit tests are small and specific: they will test just one condition per test method, i.e. the X and the Y will be as minimal as reasonably possible. There's considerable benefit to this style:

  • Every individual test has a name. I can refer to a failing test precisely by name when communicating with my fellow developers. I can communicate the name to the test runner too: when I am trying to focus on just one problem, it's extremely useful to be able to easily and precisely specify the subset of the full suite I want to run, down to just one test if necessary. I can even jump straight to a test method definition with ctags. Compare that with doctests, where you have to say things like “about line 300 of foo-bar.txt” or “Just after where it says ...”. That's awkward and imprecise, especially when developers are often looking at slightly different versions of the same file.
  • Specific tests give clearer failures, and are easier to debug. Good unit tests keep the context of the test fairly minimal (Meszaros' xUnit Test Patterns book explicitly describes “General Fixture” as a cause of the “Obscure Test”). Narratives inherently accumulate context with every line, whether it's relevant to anything else or not. You have to be aware of everything that happened earlier in the story to understand and debug a failure (and if this isn't the case, then what's the point in having a narrative?). Unit tests also tend to generate more relevant failures, because only tests that are actually affected by the problem fail, rather than everything after line 100 because that's where the first failure was (and if you suppress the secondary failures, you may be suppressing interesting ones along with the irrelevant ones).
  • Specific, narrow tests are better at communicating intent and ensuring coverage. If each test is there to verify just one condition, then you can't accidentally lose test coverage just by “tidying” the code (automated coverage analysis tools won't necessarily notice either; there's more to coverage than just tracking lines executed). If you have long, rambling tests, there's a tendency to have a bunch of stuff that's exercised only implicitly, as a side-effect of doing it all in one big eager narrative... so changes to that narrative can easily lose that coverage. Simple, specific code is easier to maintain than a single meandering story that tries to hit as many cases as possible. Make single-condition unit tests an explicit part of your coding standard!
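
As a rough illustration of the points above, the toy Frobber doctest could be split into three named, single-condition tests. The Frobber and AlreadyFrobbedError definitions here are hypothetical stand-ins (the doctest never shows them), and run_one is just a small helper invented for this sketch to show that a test can be selected by name.

```python
import unittest


class AlreadyFrobbedError(Exception):
    """Raised when frob() is called on a Frobber that has already frobbed."""


class Frobber(object):
    # Hypothetical stand-in for the class used in the toy doctest.
    def __init__(self):
        self._frobbed = False

    def has_frobbed(self):
        return self._frobbed

    def frob(self):
        if self._frobbed:
            raise AlreadyFrobbedError("already frobbed")
        self._frobbed = True


class FrobberTests(unittest.TestCase):
    def test_new_frobber_has_not_frobbed(self):
        self.assertFalse(Frobber().has_frobbed())

    def test_frob_marks_frobber_as_frobbed(self):
        frobber = Frobber()
        frobber.frob()
        self.assertTrue(frobber.has_frobbed())

    def test_frobbing_twice_raises(self):
        frobber = Frobber()
        frobber.frob()
        self.assertRaises(AlreadyFrobbedError, frobber.frob)


def run_one(test_name):
    """Run a single named test method and return its TestResult."""
    result = unittest.TestResult()
    FrobberTests(test_name).run(result)
    return result
```

Each test builds only the context it needs, so a bug in frob() fails the tests that depend on frobbing and leaves test_new_frobber_has_not_frobbed alone. Calling run_one("test_frobbing_twice_raises") runs just that condition, which is the kind of precise selection the narrative version cannot offer.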

So that's why I think narrative tests are poor unit tests. And I think unit tests ought to be the bulk of most automated test suites.

Tomorrow I'll post about some other problems with the doctest format.

Tests are code, doctests aren't

In my last post I explained why I think narrative-style tests make poor unit tests. That alone is a good reason not to write unit tests in Python's doctest format. Here are more reasons why I don't like doctest for writing tests.

  • Writing test infrastructure becomes harder (any multi-line statement, like defining a class or even a function, becomes awkward), but test code benefits from factoring logic out just as much as any other code — and that means classes and functions.
  • Doctests require contortions to fit the way they compare output, like using sorted(...) when comparing dictionaries to get a deterministic comparison. This detracts from readability. In xUnit, a simple, obvious, and clear assertEqual would just work. In doctests, if this fails:
    >>> foo == bar
    True
    then you get a completely unhelpful error, but doctest leaves you with little choice if you have dynamic values that vary between test runs. Again, this Just Works in xUnit with assertEqual. In general, xUnit custom assertions are more flexible and readable than doctest's output matching. As Guido said on python-dev in July:
    This is an example of the problem with doctest -- it's easy to overspecify the tests. I don't think that whether the repr() of a Decimal uses single or double quotes should be considered a spec cast in stone by doctests.
  • It's hard to see an overview of the tests at a glance. With a doctest file, individual tests are typically introduced by a sentence or three. Conventions vary from file to file. There's no tool I know of that can give me an outline of the unit tests in a doctest file. In contrast, almost every code editor I know of has at least one way to display an outline of the classes and methods of a Python file, which gives a good overview of unit tests written in the xUnit framework. (And if your editor can't do it, there's always the amusingly named testdoc.) This sort of outline is useful as it gives you a summary of all the conditions being explicitly tested. This helps you spot gaps in coverage, understand what the code being tested can do, and know where the most appropriate place to add a particular new test is (if you can't easily browse the existing tests, people will just add them in arbitrary places like the end, making the test file a disorganised, unnavigable swamp).
  • Doctest is a mini-language with ugly corners and outright bugs. You cannot start expected output with an ellipsis. The syntax for blank lines in expected output (“<BLANKLINE>”) is ugly. The syntax for toggling various doctest features inline (“#doctest: +IGNORE_EXCEPTION_DETAIL”) is worse. The language is outright buggy in places. The following doctest passes:
    >>> print 'hello'
    ... print 'world'
    hello
    This one passes too:
    >>> assert True
    ... garbage
    >>> print 1
    1
    Testing APIs like pyunit can and do have ugly corners and bugs too, but the scope for problems is larger with a mini-language. I've never heard of an outright syntax error being silently ignored by pyunit! I might be more forgiving of doctest's quirks if it wasn't almost 10 years old already.
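
To illustrate the assertEqual point from the list above, here is a small sketch. The helper describe_mismatch is invented for this example; it only captures the message that assertEqual raises, so you can see that a mismatch reports both values rather than a bare True/False.

```python
import unittest


def describe_mismatch(expected, actual):
    """Return assertEqual's failure message, or None if the values are equal."""
    case = unittest.TestCase()
    try:
        case.assertEqual(expected, actual)
    except AssertionError as error:
        return str(error)
    return None


# Dictionaries compare by content; no sorted(...) contortion is needed.
same = describe_mismatch({"a": 1, "b": 2}, {"b": 2, "a": 1})

# On a mismatch, the message includes both dictionaries.
different = describe_mismatch({"a": 1, "b": 2}, {"a": 1, "b": 3})
```

Here same is None because the two dictionaries are equal, while different is a message that names both dictionaries, which is far more useful when debugging than an expected True that came out False.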

But that's not all. A more fundamental reason why I dislike doctests is that tests are code, and code works better in a .py file than a .txt file. There are a couple of reasons for this:

  • Tool support. Text editors already know how to syntax highlight .py files correctly. Pdb works better with normal code (in doctests the capturing of stdout confuses the prompting). I can use standard profiling tools. I can run PyChecker and Pyflakes on .py files. I can use ctags. I can use bicyclerepairman. I can use pydoctor or epydoc. There are many more examples.
  • Tests are code, and code needs organisation. Test suites in many ways are just like any other code: logic gets reused. Normal Python modules provide well-known, effective ways to manage this: you can make classes that inherit from other classes, you can create modules for storing common utility functions, etc. But you can't import code from a doctest. Defining a function, let alone a class, in a doctest just plain looks weird. And because code is code even inside a doctest, sometimes you want to refactor it. Gerard Meszaros' xUnit Test Patterns book is subtitled “Refactoring Test Code” because tests need refactoring too.
  • Prose isn't always a good substitute for comments in the code. A commonly stated benefit of doctests is that they make prose easier to write — but equally they make code comments and docstrings harder to write. In a Python file you can write:
    class Thing(object):
        """Docstring."""
        # Comment.
    In doctests, you have to write
    >>> class Thing(object):
    ...     """Docstring."""
    ...     # Comment.
    Those tedious “... ” mean that almost every single code snippet I've seen in a doctest has lacked even a single comment or docstring, even when they really needed it. A prose preamble isn't always the best place to explain code.
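
As a sketch of the organisation point above, shared test logic can live in an ordinary base class that other test classes inherit from. All the names here (Frobber, FrobberTestCase, make_frobbed_frobber, assertFrobbed, ReportTests) are hypothetical; in a real suite the base class would sit in its own module and be imported by each test file.

```python
import unittest


class Frobber(object):
    """Hypothetical stand-in for the class under test."""

    def __init__(self):
        self._frobbed = False

    def frob(self):
        self._frobbed = True

    def has_frobbed(self):
        return self._frobbed


class FrobberTestCase(unittest.TestCase):
    """Shared base class; in a real suite it would live in its own module."""

    def make_frobbed_frobber(self):
        # Fixture helper: written once, reused by every subclass.
        frobber = Frobber()
        frobber.frob()
        return frobber

    def assertFrobbed(self, frobber):
        # Custom assertion with a failure message specific to the domain.
        if not frobber.has_frobbed():
            self.fail("%r has not been frobbed" % (frobber,))


class ReportTests(FrobberTestCase):
    def test_helper_returns_frobbed_frobber(self):
        self.assertFrobbed(self.make_frobbed_frobber())
```

The helpers carry ordinary docstrings and comments, and any other test class can reuse them by inheriting from FrobberTestCase instead of copying them.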

Tools can be improved to cope with doctest (for instance I heard that my pdb problems may be solved in Python 2.5), but new tools are continually being invented, and I want to be able to use those too. For instance, the 2to3 tool for converting Python 2.6 code to the upcoming Python 3.0 doesn't fix code in doctest files. And I still can't do “set filetype=doctest” in vim, which is hardly a new tool.

With sufficiently improved tool support and infrastructure many (but not all) of my concerns would be reduced. For instance, it would help if there were a way to easily reset all state during a long doctest, so that different parts of the same file could be independent. It would also be good to have a convenient way to put names on these independent sections. But you'd still be left with a design that gently encourages people to do things a worse way (write a big story), and you'd be reinventing the wheel: xUnit already gives you those things.

In my experience many developers with the best of intentions will produce poor unit tests with doctest because of the way it subtly encourages bad practices. One bad habit I've seen over and over again is copying-and-pasting helper functions, even large, complicated ones, from doctest to doctest. Is it because it's not “real” code, so the instinct to organise it and avoid duplication doesn't trigger? Is it because there's no obvious home for helper functions, because a doctest is not a module? I wish I knew.

I do not think doctests are evil. The doctest format is fine for some things. For “page tests” (e.g. using zope.testbrowser, as demonstrated here), where there's a narrative of a user story driving them, doctests are a pretty good fit. They can be good for writing testable documentation (which is not the same as tests and documentation mixed together!) too. But those things aren't unit tests.

I've mentioned this book a couple of times, and I do recommend it:

Title: xUnit Test Patterns: Refactoring Test Code
Author: Gerard Meszaros
Website: http://xunitpatterns.com/

You can find it on Amazon here.

If nothing else, reading it encourages thinking about the way you write tests, and ways you could do it better.

So despite the hype, I don't think doctest has an advantage over xUnit in producing readable tests. Code needs to be clear (including an appropriate amount of docstrings and comments) whether or not it's test code. If your developers aren't writing clear code, you have a serious problem: you are sure to have difficulty maintaining that code. It is just as possible to write incomprehensible tests using doctest as it is using TestCase classes with test methods. I know this because, unfortunately, I've seen plenty of both. Writing good tests is a skill that takes time and practice to learn. Using doctest is obviously not a silver bullet. Not using doctest isn't a silver bullet either, but I do think it's usually the better choice.

Last modified: 23 October 2008
