The Value of Tests: It's more than just testing!

Since I have some free time, I’ve been catching up on things I’ve been meaning to read. I’ve got a reading list of books written by other authors with my publisher. Yesterday, I started looking at Cucumber, which is an interesting behavior-driven development tool. This post isn’t really about Cucumber, but about something that Cucumber reminded me of.

When a competent programmer builds software, they write tests. That’s just
a given. But why do we do it? It seems like the answer is obvious: to make sure that our software works. But I’d argue that there’s another reason, which in the long run is as important as the functional one. It’s to describe what the software does. A well-written test doesn’t just make sure that the software does the right thing – it tells other programmers what the code is supposed to do.

A test is an executable specification. Specifications are a really good thing; executable specifications are even better.

In general, in software, we have a tendency to think that we can do things by the seat of our pants. In my experience, most programmers don’t do much planning before they sit down and start coding. In their heads, they think that they know what they’re building – they’ve got a model of the system, and how they’re going to put it together. But a model like that, which has never been written down, is quite likely to have holes in it – because until you work through the details, you don’t necessarily even know where the hard parts are.

Let me give you an example. A while back, when I worked for IBM, I was
working on an open-source SCM (aka “version control”) system called
“Stellation”. In Stellation, we tracked the version history of all of the artifacts that made up a system: we didn’t just version text files, but directories, binaries, etc. And we supported lightweight branches and (of course) branch/change merging.

When you do versioning of directories, merging directories gets pretty complicated. To give you a sense, here are some of the cases. Suppose you have a directory D containing a file F, and you’re merging two branches:

  1. Branch 1 edits F, branch 2 doesn’t. No problem – this is easy.
  2. Branch 1 edits F, branch 2 edits F. Standard file merge.
  3. Branch 1 edits F, branch 2 renames F. Still no problem.
  4. Branch 1 edits F, branch 2 deletes F. Tricky.
  5. Both branch 1 and branch 2 rename F.
  6. Branch 1 deletes F, and branch 2 adds a new file named F.
  7. Branch 1 renames F, branch 2 adds a new file named F.

There are a bunch more cases – in fact, there are a total of 12 cases.
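
To give a feel for how the case space multiplies, here’s a minimal sketch in Python. The action set and the symmetry-collapsing rule are a toy model invented for illustration, not Stellation’s actual analysis; the toy count (15 unordered pairs) differs from the real 12 because the real analysis collapses and splits cases differently.

```python
from itertools import combinations_with_replacement

# Toy model (illustrative only): each branch does one thing to F, and
# by symmetry, (X on branch 1, Y on branch 2) merges the same way as
# (Y on branch 1, X on branch 2), so we enumerate unordered pairs.
ACTIONS = ["unchanged", "edit", "rename", "delete", "replace-with-new"]

cases = list(combinations_with_replacement(ACTIONS, 2))
for branch1, branch2 in cases:
    print(f"branch 1: {branch1:16} branch 2: {branch2}")
print(len(cases), "unordered cases in this toy model")
```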

We built the first version of the system, and thought we’d worked out how to merge directory changes. People started using it, and found a case we’d missed, which caused Stellation to do the wrong thing. I worked out the problem, and fixed it, and released a new version. And another bug report came in – another case I’d missed. And another. And another. Directory merging was just a whole lot harder than I’d expected.

What I ended up doing was taking a software specification tool called Alloy that I’d recently seen a talk on, and spending a couple of days building a thorough, formal specification of the merge component. It helped me understand what I’d missed, and figure out a genuine, exhaustive list of all of the cases that I needed to cover. It also provided a clear, concise, thorough description of the correct behavior of the system. That specification turned out to be incredibly valuable.

Specifications are valuable beyond words. They give you clarity about what your system should do. They force you to think through what you’re building, and how it should behave. And they provide an invaluable tool for communicating with other programmers.

There are two problems with specifications. First, they’re hard to write. I love Alloy – but it’s not an easy system to use. It takes a lot of time and effort to write a specification. It’s another language that you need to learn – and you need to learn it just to be able to read a specification. Second – and more important – the specification is usually a distinct artifact. It’s not part of your source code. It’s something else – usually a separate file, written in a different syntax. And as a distinct artifact, it’s one more thing that needs to be kept in sync as you make changes. Unfortunately, that doesn’t tend to happen: when you’re in a hurry, you change the code without changing the spec, and then they get out of sync; once they’re out of sync, the spec becomes useless.

That’s where testing comes in. If you write your tests correctly, they are a specification of your system. In particular, if you use a good BDD tool, then they read as a specification! But even if you aren’t using any testing tool at all, but just writing standalone testing code, you can write your tests so that in addition to testing your system, they describe your system. So you get the advantages of specification, while also having something executable. Every time you build your system and run your tests, you’re verifying that the specification is up to date, and that your system conforms to the specification.
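
To make that concrete, here’s a minimal sketch using nothing but Python’s standard unittest module. The `merge` function and its outcomes are invented stand-ins, not Stellation’s real API; the point is that the test names, read top to bottom, state the specification of the directory-merge cases above.

```python
import unittest

def merge(branch1_action, branch2_action):
    """Toy merge decision for a single file; invented for illustration."""
    if branch2_action == "none":
        return branch1_action
    if branch1_action == "none":
        return branch2_action
    if branch1_action == branch2_action == "edit":
        return "file-merge"
    if "delete" in (branch1_action, branch2_action):
        return "conflict"
    raise NotImplementedError(f"unmodeled case: {branch1_action}/{branch2_action}")

class DirectoryMergeSpec(unittest.TestCase):
    # Each test name states one clause of the specification.
    def test_edit_on_one_branch_only_keeps_the_edit(self):
        self.assertEqual(merge("edit", "none"), "edit")

    def test_edits_on_both_branches_trigger_a_file_merge(self):
        self.assertEqual(merge("edit", "edit"), "file-merge")

    def test_edit_against_delete_is_flagged_as_a_conflict(self):
        self.assertEqual(merge("edit", "delete"), "conflict")

if __name__ == "__main__":
    unittest.main()
```

Run it, and you verify the specification; read it, and you learn the intended behavior.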

This is something that really fascinates me, so I’m probably going to write a series of posts about this. I’ll probably do one or two about Alloy, to give you a sense of what a real formal specification looks like, and what that can buy you – and then a few posts talking about how to write specification-tests using BDD tools.

16 thoughts on “The Value of Tests: It's more than just testing!”

  1. Justin

    This illustrates a major gap in computer science education. Students are graded on the correctness of their projects, but professors never teach the students testing methodology, how to design test cases, how to build infrastructure for testing, or (as you mentioned) test cases as a specification. Testing is very rarely a part of a computer scientist’s education, so it is no wonder that it gets neglected in the professional world.

  2. Ofer

    You can take a look at Specs2 or ScalaTest for Scala if you haven’t yet. They provide a DSL for BDD which is easy and fun to use.

  3. Matthias Felleisen

    The idea of exploring examples before coding is an integral part of our “How to Design Programs” (aka Program by Design) curriculum. We have run this curriculum for high schools and college freshman courses for nearly 16 years with great and measured success. So Justin — some colleges not only pay attention to this idea, they teach it to freshmen. And then we reinforce it in our second semester, where we scale it to OOP, and where we show students how to explore the idea via automated logics (ACL2, to be precise; it fits our curriculum better than Alloy, but one could use Alloy, too).

    The problem is that most colleges wish to teach currently fashionable programming languages in the first semester, plus ‘glitz’ that attracts students to the discipline. Program by Design instead puts program design first, glitz second. We also reject the idea that novice courses must use a currently fashionable language. Instead, we use teaching languages specifically created for the purpose of teaching design.

    Some more details are available here:

    http://www.ccs.neu.edu/scheme/pubs/#jfp2004-fffk

    http://www.ccs.neu.edu/scheme/pubs/#icfp09-fffk

    The textbook is also online [1st and 2nd editions, listed below]:

    http://www.ccs.neu.edu/home/matthias/HtDP2e/

    1. stigant

      I got about half way through your post and was thinking, “wow, this guy sounds just like my freshman year comp sci prof at Rice.” Then I looked up and realized that you WERE my freshman year comp sci prof at Rice!

      WRT the article, I’m intrigued that the number of cases for the version control program is 12. That’s a very strange number of cases to come out of what appears to be a combinatorial problem. At first I would have thought that the number of cases must be a square (you can do any one of n things on branch 1 and the same number of things on branch 2 ==> n^2 possible cases). Of course, if A and B are actions on files and F(x,y) is the code control action when x is done on branch 1 and y on branch 2, then F(A,B) == F(B, A) (probably), so maybe we should have nC2 possible cases, but 12 is not a solution to x = nC2 for any integer value of n.

      I guess there must be some additional collapsing of cases, but if your intent is to make sure you cover all your bases, isn’t it worth not collapsing those cases in the code?

      1. Mark C. Chu-Carroll

        Sorry, but I’ve never even been to Rice, so I’m certainly not your professor :-).

        In terms of the number of cases – there is a lot of case collapse. In particular, there are a lot of symmetries to this. Changed in A, unchanged in B is really the same as changed in B, unchanged in A. You get a ton of collapse as a result.

        1. stigant

          >> Sorry, but I’ve never even been to Rice, so I’m certainly not your professor 🙂

          I was talking about MF’s comment. 😉

          >> In terms of the number of cases – there is a lot of case collapse. In particular, there are a lot of symmetries to this. Changed in A, unchanged in B is really the same as changed in B, unchanged in A. You get a ton of collapse as a result.

          Yeah, but I noted that F(A,B) = F(B,A) which didn’t seem to lead to 12 cases. I guess my back of the envelope calculation isn’t going to be adequate without the total details (don’t provide them, I’m just conceding defeat).

          1. ScentOfViolets

            Sorry for the late reply – I haven’t been able to get to the Scientopia site for some reason. But anyway, combinatorially at least, you can get 12 = 4*4 - 4. You could interpret this as looking at all the cases F(A,B), where the operation is non-commuting, and then subtracting off the diagonal to eliminate the cases F(A,A). Whether this is the case for these particular operations I don’t know. But in the abstract, 12 can indeed come up as a matter of simple combinatorics.
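
            (Just to make the arithmetic concrete, here is a tiny Python illustration; the action names are invented for the example:)

            ```python
            from itertools import product

            # 4*4 ordered pairs, minus the 4 diagonal pairs F(A, A), leaves 12.
            actions = ["A", "B", "C", "D"]
            cases = [(x, y) for x, y in product(actions, repeat=2) if x != y]
            print(len(cases))  # prints 12
            ```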

  4. Grant

    Mark,

    I’m looking forward to reading what you have to say. I agree that describing & planning what you are to code is invaluable. My current approach merges literate programming and test-driven development (with some behind-the-scenes planning work), so that the code planning and tests become an integral part of the coding. I’ll be interested to read what you have to write about other tools to assist this general type of approach.

    Justin,

    I have similar feelings when I read papers in my field (bioinformatics or computational biology). I increasingly worry about the lack of clear evidence of testing of the scientific code… Maybe I’m just turning into a grumpy older scientist! – but I think there is something to my concerns… (I hope)

    1. Medivh

      Grant, I can tell you from my experience in marine science that you’re not alone in feeling that scientists don’t test their programs. I’ve read a few papers to that effect about scientists in general, too. Half the problem, I think, is the Dunning–Kruger effect (http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect). Most of the scientists I interact with seem to think that scientific computing (using FORTRAN and Python to do stats, as far as I can tell) is equivalent to software engineering. Not to mention the problems with science education and communication that most scientists completely ignore.

  5. Julian Frost

    Justin:

    Testing is very rarely a part of a computer scientist’s education, so it is no wonder that it gets neglected in the professional world.

    Not entirely true. I’m a Test Analyst and my employer’s clients, typically banks and SARS, make heavy use of our services. Businesses (particularly financial institutions) have learnt the value of testing.

  6. Divya

    Hi, really liked your posts.
    Could you post a link to this site directly from goodmath.blogspot.com?
    I found that in your book and felt it would be nice to have a direct link.

  7. Carsten Führmann

    Nice post! It’s probably worth noting that a static type system is a basis for writing specifications. As a (slightly degenerate) example, consider a function of type integer. We know it will never return anything but an integer (if it returns at all). The compiler proves that for us without ever running code of ours, and for all possible arguments to our function! By contrast, without a static type system (Lisp, Javascript, Python, whatever), strictly speaking we’d have to write a unit test feeding our function random arguments to “ensure” that it always returns an integer! So far, my example is a bit degenerate. But there are actually type systems (mostly dependent type systems in the style of the “Calculus of Constructions”) that allow us to write down complete specifications and have the compiler check them. These type systems are already fully researched, but bringing them to everyday programming is still work in progress. The buzzword is “dependently-typed programming”. In a dependently-typed programming language, the specification is not a beast from another world – it is simply a type in the language!
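
    (A quick sketch of what that random-argument “test” looks like in Python; the function `f` is an invented stand-in:)

    ```python
    import random

    # Without a static type system, the property "f always returns an
    # integer" can only be sampled by testing, never proven:
    def f(x):
        return x * x + 1

    for _ in range(10_000):
        arg = random.randint(-10**9, 10**9)
        assert isinstance(f(arg), int), f"f({arg}) returned a non-integer"
    print("sampled 10,000 arguments; found no counterexample (not a proof!)")
    ```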

  8. Jonathan Thornburg

    We usually conceptualize testing as seeing if the software-under-test gives the right answer. But what if we don’t know what the right answer is (indeed, we’re writing the software to try to find it)?

    I’m a computational astrophysicist, and most of the software I write (e.g., to simulate black-hole collisions) is of the don’t-know-the-right-answer variety. This makes testing “interesting”. 😉

    I usually write lots of unit-test drivers for lower-level parts of the software for which I can easily figure out “right answers”. These are great for the parts of the code (maybe 2/3 or 3/4 of my total SLOC) which they cover.
    And I’ve caught one or two system-math-library bugs that way (e.g., a long-double std::sqrt() which was actually only accurate to float accuracy).

    But then we get to the tricky parts of my code — probably some recursive algorithm which is beautifully efficient in practice, but really hard to debug or verify (or visualize unless I have really good software tools).
    The abstract algorithm may fit in one page of pseudocode… but the implementation is a big chunk of C++ (with control flags & debugging options interspersed with the main logic), with lots of complications to avoid copying big data structures unnecessarily.

    The next level up in difficulty is in finding mistakes (bugs) in my 150+ pages of pen-and-paper notes and equations for this project. It’s easy (albeit tedious) to do each piece of algebra twice, but what about conceptual misunderstandings, implicit assumptions which I later unknowingly violate, etc etc? Problems from these may not show up until I get to the final “the code runs fine, but the answers are wrong” stage. 🙁

    Life gets worse when I’m working in a collaboration — some of my colleagues are a bit behind the state-of-the-art in software engineering skills. 🙁 Seriously, though, if I’m doing a project in collaboration with someone else, it’s usually because I can’t (don’t know enough, and/or don’t have enough waking-hours-in-a-year) do it myself. I.e., my software *needs* their software. So I try to be as paranoid as possible in using their code (lots of test drivers to check that I’ve correctly intuited the ill-documented parts of their APIs, lots of sanity checks & validation). But it’s still hard to test someone else’s computations which you don’t fully understand.

    Another tricky area is symbolic algebra systems (Maple, Mathematica, Maxima, Reduce, etc etc). We use them a lot in our work… but almost by definition, we use them for algebraic manipulations which are too complicated to do by hand, so it’s really hard to check the answers. What we usually do is hand-check simple cases (carefully chosen to exercise all of the key logic). But that leaves open the chance of bugs which only show up in complicated situations. And the usual output of these computations is a big blob of (almost-unreadable) machine-generated C code, which invariably needs a bit of “massaging” by some Perl/Awk/etc (e.g. to ensure that temp variables are declared, to convert slow pow(x,N) calls into variables that I can precompute) before it’s ready to #include into one of my C++ classes. This sort of code spans several different levels of abstraction, and can be fertile bug-breeding territory.

    Finally, once I have a “running” code that produces answers that look vaguely plausible, there’s the tricky question of trying to decide if the answers are actually correct. Or if they’re not, where does the problem (problems?) lie? Or is it a deeper conceptual misunderstanding that’s been lurking for N years? Usually I have some invariants or conservation laws or consistency checks (or if I’m really lucky, results from someone else who has solved this same problem) that I can use to try to test the final output… but if these checks fail, I often have little idea where the problem is (the intermediate results of the computation are usually not known a priori, and don’t obey any interesting invariants etc).

    To be clear, though, none of this is meant to argue against Mark’s point about good specifications — if these exist, and if they correctly operationalize the intent of what the code was supposed to do, then they’re great.

  9. Robert

    “I’m a computational astrophysicist, and most of the software I write (e.g., to simulate black-hole collisions) is of the don’t-know-the-right-answer variety.”

    During my Master’s thesis I worked with a (pretty huge) stellar evolution code. At some point in time I started distrusting the results I was getting and wrote a very simple test program that checked if the law of energy conservation still held. It did not, and I knew there was a bug somewhere.

    My point is that even if you don’t know what the answer should be, especially in physics/astronomy you usually have conserved quantities or other information about the properties of your answer that is very easy to test. This is, I think, similar to what the CS guys call types or specifications.
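
    (As a hedged sketch of that kind of check in Python; `total_energy` and the state format are invented placeholders, not the actual stellar evolution code:)

    ```python
    def total_energy(state):
        # Placeholder: sum whatever energy components the model tracks.
        return state["kinetic"] + state["potential"] + state["internal"]

    def check_energy_conservation(states, rel_tol=1e-9):
        """Assert that total energy stays constant, within tolerance."""
        e0 = total_energy(states[0])
        for step, state in enumerate(states[1:], start=1):
            e = total_energy(state)
            assert abs(e - e0) <= rel_tol * abs(e0), (
                f"energy drifted at step {step}: {e} vs {e0}"
            )

    # Identical states, so the check passes trivially:
    states = [{"kinetic": 1.0, "potential": -2.0, "internal": 0.5}] * 3
    check_energy_conservation(states)
    ```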

  10. Jonathan Thornburg

    @Robert: Very very right!

    But the best end-to-end test is still comparing with the results from someone else’s independently-written code (ideally one using different analytical & numerical methods). When these agree to many decimal places, then the chances that both are correct are usually fairly good. But when they don’t agree, then finding out why can be hard…
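
    (A minimal sketch of such a comparison; the two result lists are invented placeholders:)

    ```python
    def worst_relative_error(ours, theirs):
        """Largest relative disagreement between two codes' outputs."""
        worst = 0.0
        for a, b in zip(ours, theirs):
            scale = max(abs(a), abs(b), 1e-300)
            worst = max(worst, abs(a - b) / scale)
        return worst

    # Placeholder outputs from two independently-written codes:
    ours = [1.00000000012, 2.71828182846, 3.14159265359]
    theirs = [1.00000000011, 2.71828182846, 3.14159265358]
    print(f"worst relative error: {worst_relative_error(ours, theirs):.2e}")
    ```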

  11. Robert

    That’s exactly where my distrust came from (although the model I was comparing to did not contain exactly the same physics). However, when they don’t agree, all you know is that one of the two codes is incorrect; figuring out which one can be tricky as well!

