In a thoughtful comment on my blog post about writing maintainable automated acceptance tests, Chris Falter suggested a different way to name the variables in my test cases. He mentions that our two naming styles present a tradeoff, and that set me on a long trail of thought.
I’m fascinated by tradeoffs, and I often drive myself nuts making them — I go back and forth and back and forth and… And at some point, I’ll identify the qualities that I’m trying to trade off (expressiveness, test speed, number of places I’d have to change if the code or the requirements change) and move on. Until the next time I visit that file. Then I’ll go back and forth and back and forth… I am very good at revisiting decisions, and not so good at sticking with a decision I made in the past — even minutes in the past.
Chris’s suggestion points out that there are two pieces of information we’re trying to encode into the name: The idea that passwords have a minimum and maximum valid length, and the specific minimum and maximum (6 and 16 characters). I went with one of those pieces of information, Chris went with the other.
As Chris points out, each style leaves readers to infer something important. With Chris’s style, readers must infer what’s special about any given length. With my style, readers must infer what specific lengths form the boundaries. Neither style expresses both pieces of information explicitly — e.g. that the maximum legal length is 16 characters. There’s a tradeoff here: Which piece of information to express? And by making that tradeoff differently, each style not only expresses one piece of information, but also emphasizes it. My style emphasizes the idea of minimum and maximum lengths; Chris’s style emphasizes the specific lengths themselves.
Also, Chris points out (and I agree) that my style requires readers to count string lengths, a tedious, error-prone chore.
Given all of that: Which style do you prefer? More importantly: What do you prefer about it?
Sometimes I can easily see how to trade off possibilities. Other times I can’t see a clear winner. For those times, I recommend experimenting. Try each possibility. Then pay attention to what happens.
In this case, there’s another criterion I can apply: With Chris’s tests, the length of each password appears three times: Once implicitly in the password itself, once in the declaration of the variable, and once in the test that references each variable. Expressing that specific datum three times is potentially troublesome: If we increase the maximum length of a password to 20, we’d have to change six places (three places for a max length password, three places for a password that’s too long). With my style, we’d have to change only two places: the passwords themselves. The variable names would remain the same.
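To make the comparison concrete, here is a minimal Python sketch of the two naming styles. The original tests are written in Robot Framework; the validator `is_valid_password`, the variable names, and the sample passwords below are hypothetical stand-ins. Only the 6 and 16 character limits come from the discussion above.

```python
MIN_LENGTH = 6
MAX_LENGTH = 16


def is_valid_password(password: str) -> bool:
    """Hypothetical validator: checks length only, for illustration."""
    return MIN_LENGTH <= len(password) <= MAX_LENGTH


# Style 1 (the article's): names say why each length is special.
# The specific length lives only in the password literal itself.
MAXIMUM_LENGTH_PASSWORD = "abcdefghijklmnop"
TOO_LONG_PASSWORD = "abcdefghijklmnopq"

# Style 2 (Chris's): names state the specific length.
# The length appears in the literal, in the variable name,
# and in every test that references the variable.
PASSWORD_WITH_16_CHARACTERS = "abcdefghijklmnop"
PASSWORD_WITH_17_CHARACTERS = "abcdefghijklmnopq"


def test_accepts_maximum_length_password():
    assert is_valid_password(MAXIMUM_LENGTH_PASSWORD)


def test_rejects_too_long_password():
    assert not is_valid_password(TOO_LONG_PASSWORD)


def test_accepts_password_with_16_characters():
    assert is_valid_password(PASSWORD_WITH_16_CHARACTERS)


def test_rejects_password_with_17_characters():
    assert not is_valid_password(PASSWORD_WITH_17_CHARACTERS)
```

If the maximum length changed to 20, the first pair of tests would need only new password literals, while the second pair would also need new variable names and new test names.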
Though I’m not entirely sure which style emphasizes the more important bit of information, the criterion of “how many places I’d have to change” leaves me preferring the style I used in the article. Chris might still reasonably prefer his style, if the extra expressiveness he perceives is valuable enough to outweigh the extra cost of change.
So far, I still prefer my original variable names to Chris’s. And yet his suggestion, his thoughtful explanation of why he prefers it, and especially the contrast it provides with my original tests, make me wonder: Now that we know what we’re trading off, can we find a way to eliminate the tradeoff altogether? Is there another style that allows us to express all of the information we want to express, and without increasing the cost of change?
Uncle Bob, in his video, offers a third style: Instead of conveying information through variable names, express it through comments. His comments express something similar to my variable names: maximumness and minimumness. It would be easy enough to add the information that Chris’s variable names express and that mine lack: “16 characters is just short enough” and “6 characters is just long enough”. I’ve unfortunately trained myself to feel queasy whenever I start to type a comment into code. I’m going to have to get over that. Writing a comment is not necessarily evil; it’s just a tradeoff.
Contrasting my tests to Uncle Bob’s, I notice yet another tradeoff: How to organize tests into suites? I organized my tests around specific validity criteria: One set of tests for character content requirements, another for length requirements. Uncle Bob organized the same tests differently: One set of tests for valid passwords, another for invalid passwords. And each way of organizing requires us to name our groupings, which offers an opportunity to subtly highlight one piece of information or the other. My organization emphasizes that there are two classes of validity criteria, content and length. Uncle Bob’s emphasizes that passwords may be valid or invalid.
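The two ways of grouping can be sketched with Python's `unittest` module. This is an illustration only: the validator, its content rule (at least one letter and one digit), the class names, and the sample passwords are all assumptions, not the rules or names from either set of original tests.

```python
import unittest


def is_valid_password(password: str) -> bool:
    """Hypothetical rules: 6 to 16 characters, at least one letter and one digit."""
    has_letter = any(c.isalpha() for c in password)
    has_digit = any(c.isdigit() for c in password)
    return 6 <= len(password) <= 16 and has_letter and has_digit


# Grouping 1: by validity criterion (length rules, content rules).
class PasswordLengthRequirements(unittest.TestCase):
    def test_accepts_minimum_length_password(self):
        self.assertTrue(is_valid_password("abc123"))

    def test_rejects_too_short_password(self):
        self.assertFalse(is_valid_password("abc12"))


class PasswordContentRequirements(unittest.TestCase):
    def test_rejects_password_without_digit(self):
        self.assertFalse(is_valid_password("abcdefgh"))


# Grouping 2: by outcome (valid passwords, invalid passwords).
class ValidPasswords(unittest.TestCase):
    def test_minimum_length_password(self):
        self.assertTrue(is_valid_password("abc123"))


class InvalidPasswords(unittest.TestCase):
    def test_too_short_password(self):
        self.assertFalse(is_valid_password("abc12"))

    def test_password_without_digit(self):
        self.assertFalse(is_valid_password("abcdefgh"))
```

The checks are the same in both groupings; only the names of the groups, and therefore the emphasis a reader sees first, differ.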
Which emphasis do you prefer? More importantly: What do you prefer about it?
A few final points about tradeoffs. If you want to get better at making tradeoffs such as these, step one is to notice what tradeoffs you’re making. And a great way to do that is to pair with someone. I wrote the tests on my own, and in the article I mentioned the tradeoffs I was aware of making. But I made other tradeoffs implicitly, without noticing I was making them. It was only when Chris and Bob offered alternatives that I noticed I was making those tradeoffs at all.
Thanks Chris and Bob for inviting me to explore the tradeoffs I make, and how I make them!
The article demonstrates how to make automated acceptance tests more maintainable by:
Hiding incidental details
Eliminating duplication
Naming essential ideas
Though the examples in the article use a very nice testing framework called Robot Framework, the ideas work just as well with other popular open-source testing frameworks, such as FitNesse and Cucumber.
You will be able to follow the article even if you don’t know Robot Framework. But don’t be surprised if it inspires you to give Robot Framework a try.
Because the concept of system responsibility is so foundational to how I develop and test software, I want to expand on my earlier description. Recall that I defined a system responsibility as a system’s obligation to respond to each notification of a specified kind of event under specified circumstances by producing a specified set of planned results.
A system responsibility includes three parts:
A stimulus that triggers the system to respond to an event.
A context in which the system is required to respond to the stimulus.
A set of results that the system is obligated to realize in response to that stimulus in that context.
Stimulus. A stimulus is a message, sent by someone or something outside the boundary of the system, that informs the system of an event to which it is obligated to respond. The stimulus has a name, which may identify either the event that it represents or the planned response that the system must carry out. The stimulus may include additional information about the event.
Stimuli are delivered to a system through its interfaces. An interface defines a set of messages to which a system responds, and the mechanisms by which those messages are delivered. For GUI systems, the interface includes a suite of windows, forms, buttons, text fields, and other mechanisms that translate user gestures (mouse clicks, key presses) into messages. Web-based systems receive stimuli through HTTP requests and other interfaces. Smaller scale systems, such as objects inside a software application, expose Application Programming Interfaces (APIs) that define the set of methods to which internal objects and subsystems respond.
Result. A result is an effect that the system realizes in response to a specified stimulus in a specified context. A result may be either a message delivered to someone or something outside the boundary of the system or a change in the system’s internal state.
GUI systems deliver messages through forms, windows, screens, audio devices, and other output devices. Web-based systems deliver messages through HTTP responses and requests. An application’s internal objects and subsystems deliver messages through method calls and method return values.
In addition to delivering messages to external entities, systems also respond to events by recording information internally, and by making changes to that internal information. The information may be stored inside the running application, in a database, in files on the computer’s file system, or other storage mechanisms. The information that a system stores in order to guide its responses to future events makes up the system state.
Context. Sometimes a system’s planned response depends not just on information delivered through the stimulus, but other information as well. The context for a given responsibility is all of the information other than that delivered in the stimulus that influences the results that the system is obligated to realize in response to an event. The context may include information about the state of the system itself–that is, information that the system previously recorded in its internal memory about prior events. The context may also include information that the system can observe across its boundary–information that the system must request from external entities in order to fulfill the responsibility.
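The three parts can be written down as plain data plus a function. The following Python sketch is one possible shape, not a prescribed one: the `Stimulus` and `Result` types, the door keypad, and its three-attempt lockout rule are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Stimulus:
    name: str                                  # names the event or the planned response
    data: dict = field(default_factory=dict)   # additional information about the event


@dataclass(frozen=True)
class Result:
    messages: tuple        # messages delivered across the system boundary
    state_changes: dict    # changes to the system's internal state


def respond_to_code_entered(stimulus: Stimulus, context: dict) -> Result:
    """One hypothetical responsibility of a door keypad.

    The context is everything other than the stimulus that
    influences the results: here, previously recorded state.
    """
    if context["locked_out"]:
        return Result(messages=("keypad disabled",), state_changes={})
    if stimulus.data["code"] == context["stored_code"]:
        return Result(messages=("unlock door",), state_changes={"failed_attempts": 0})
    attempts = context["failed_attempts"] + 1
    return Result(
        messages=("access denied",),
        state_changes={"failed_attempts": attempts, "locked_out": attempts >= 3},
    )
```

The same stimulus produces different results in different contexts, which is why a responsibility has to name all three parts.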
I first learned about the idea of planned response systems from III, a colleague and friend of mine. I later read about the idea in depth in McMenamin and Palmer’s profound book Essential Systems Analysis.
The idea of planned response systems is fundamental to how I think about programming and testing. I’m posting my thoughts here so that I can refer to these terms and ideas in later blog posts. Until I write those posts, I encourage you to notice what happens when you think about software systems as planned response systems.
A planned response system is a system that responds in planned ways to events in its environment.
For example, a software system is a planned response system: it responds in planned ways to users’ actions.
In an object-oriented software system, each object is a planned response system: it responds in planned ways to messages sent by other objects.
Planned response systems produce two general kinds of results: They send messages to entities outside of the system boundary, and they make changes to the essential memory of the system.
An event is a significant change in the system’s environment. A change is significant to the system if the system is obligated to respond to the change in a planned way.
Events fall into two broad categories: Changes initiated by entities in the system’s environment (e.g. users or other systems), and temporal events caused by the passage of time.
For example, an ATM is obligated to respond in a planned way to a user’s request to withdraw cash. The user’s request is an event.
A system responsibility is a system’s obligation to respond to each notification of a specified kind of event under specified circumstances by producing a specified set of planned results.
The specification of a system responsibility consists of three parts: A specification of a kind of event, a specification of a set of circumstances, and a specification of the set of planned results that the system is obligated to produce in response to being notified of an event of that kind under those circumstances.
A system becomes obligated to respond to an event when a system designer allocates that responsibility to the system.
The essence of a planned response system is the set of responsibilities allocated to the system, independent of the choice of technology used to implement the system.
The definition of a system’s essence makes no mention whatever of technology inside the system, because the system’s essential responsibilities would be the same whether it were implemented using software, magical fairies, a horde of trained monkeys, or my brothers Glenn and Gregg wielding pencils and stacks of index cards.
One way to identify the essence of a system is to indulge in The Fantasy of Perfect Technology. Imagine a system implemented using perfect technology. Then ask yourself some questions about the quality attributes of the system.
How fast would it respond? If it were made of perfect technology, of course it would respond instantly, with zero delay. How many users could use it at once? An infinite number of users. How much information could it store? An infinite amount. How often would it break? It would never break. How long does it take to start up? None, because it’s always on and always available. How much energy would it use? It would use no energy; heck, it might even generate energy for free.
The one glaring flaw of perfect technology is that it does not exist. Real-world technology is imperfect. That’s what makes this exercise a fantasy. But it’s a useful fantasy, because it helps us to separate the system’s essential responsibilities from the temporary constraints of current technology.
Note that we apply the Fantasy of Perfect Technology only inside the boundary of the system. Even in our fantasy, the world outside of the system is made of real, imperfect stuff, with which the system will have to interact.
Now apply the fantasy to your own system. What responsibilities would your system have even if you could implement it using perfect technology? That set of responsibilities is your system’s essence.
The essential memory of a system is the set of data that the system must remember in order to fulfill its obligations—that is, in order to respond as planned to future events.
For example, an ATM must remember users’ account balances in order to determine whether to satisfy users’ requests to withdraw money.
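As a rough illustration, the ATM example might look like this in Python. The `Atm` class, its method name, and its message strings are hypothetical, and a real ATM has many more responsibilities. The sketch shows only one event, the two kinds of results, and the essential memory that connects one response to the next.

```python
class Atm:
    """Hypothetical ATM reduced to one responsibility: responding to withdrawal requests."""

    def __init__(self, balances: dict):
        # Essential memory: what the system must remember
        # in order to respond as planned to future events.
        self.balances = dict(balances)

    def withdraw(self, account: str, amount: int) -> str:
        """Respond to the event 'a user requests a withdrawal'."""
        if amount > self.balances[account]:
            # Result: a message across the boundary; memory is unchanged.
            return "insufficient funds"
        # Result: a change to essential memory...
        self.balances[account] -= amount
        # ...and a message across the boundary.
        return f"dispense {amount}"
```

With an opening balance of 100, a first request for 60 is satisfied and a second request for 60 is refused, because the system remembered the first response.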
Testing is an information service. The point of testing is to inform stakeholders about the system. This is not a new sentiment, nor does it originate with me. But I’ve found that many testers have not considered their role from this perspective.
I teach classes about how to test software. Early in each class I describe testing as an information service. Even in classes filled with experienced testers, there are always a few people for whom this is a new idea.
In one class, just as I finished saying that testing is an information service, a man in the back of the room said, “Oh, no!”
“You disagree?” I asked.
“No, no, I agree,” he said. “It’s just that I’ve never thought of it that way before.” He paused and frowned. “And I think I’ve been doing it all wrong.”
I thought it was unlikely that he’d been doing it *all* wrong, so I asked, “How have you been doing it?”
“I just try to break stuff. When I can break it, it’s like I win. And if I can’t break it, I feel like I’m failing.”
“Trying to break stuff,” I said, “is an important part of testing.” I mentioned James A. Whittaker’s excellent book How to Break Software, which teaches testers how to find the kinds of defects that arise from common programming errors.
“I know,” he said, “but that’s all I’ve been doing. And when I find a nice, nasty bug, I run over to the developers and rub it in their faces.”
“Oh, no,” I said.
He laughed and nodded. “Now you understand.”
“How does that work out?” I asked. (I know what you’re thinking, but you’ve got it backwards. Doctor Phil channels me.)
“They hate it. And hate to see me coming. They keep telling me to bring them some good news once in a while.”
“But if your job is only to break stuff…”
“Then I never tell them what’s working. But that’s information, too, and that’s what I just realized. And that’s what they’ve been asking for.”
I’ve had numerous similar conversations with testers who had found themselves mired in unproductive relationships with developers. Shifting your focus from breaking stuff to informing stakeholders (including developers) can help with that.
I’ll say more later about testing as an information service. In the meantime, I’d love to hear your questions and comments about it.
A code coverage tool watches your program executing and reports which lines of code were executed and which were not. Testers are sometimes tempted to use code coverage tools to assess test coverage. And some testers are tempted to set code coverage goals. If you feel these temptations, be careful how you interpret the code coverage tool’s reports.
You can be sure that if a line of code was not executed during a test run, then it certainly was not tested by that run.
But what of a line of code that was executed by the tests? Unfortunately, you can’t tell, just from the fact that it was executed, whether the line was tested.
Elisabeth Hendrickson and I developed a workshop on unit testing. The work of the workshop centered on a small application we had written, a rudimentary HTTP server. Our initial code had exactly thirteen tests, just enough to illustrate a few basic tools and techniques that we’d be teaching in the workshop.
When we ran a code coverage tool called NCover to watch our test suite, it reported that our thirteen tests executed 65 percent of the server’s code. Does that mean that we achieved 65 percent test coverage? Not on your life. Our thirteen tests barely scratched the surface of the responsibilities of even our very simple HTTP server.
If our tests tested so little, why was code coverage so high? Because though our suite tested little of the code, it executed a lot of the code.
For example, one of our tests sent a GET request to the server and evaluated the response. As the server executed the request, it called a logging function to log information about the request and its response to a file. The logging function was minimal, and did not deal with any of the zillions of possible file system errors it might encounter. It expected the happy path, and nothing but the happy path. So this one test, which did not in any way assess the logging feature, executed all of the logging code. The logging code was 100 percent executed and zero percent tested.
Code coverage does not imply test coverage. If you use code coverage tools to help assess your test coverage, keep that in mind.
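The logging example is easy to reproduce in miniature. The Python sketch below is not the workshop's server; `handle_get`, `log_request`, and the log format are hypothetical stand-ins that show the same effect.

```python
import os
import tempfile


def log_request(log_path: str, line: str) -> None:
    """Minimal happy-path logger: no handling of file system errors."""
    with open(log_path, "a") as log_file:
        log_file.write(line + "\n")


def handle_get(path: str, log_path: str) -> str:
    """Hypothetical stand-in for the server: answers a GET and logs it."""
    response = "200 OK" if path == "/" else "404 Not Found"
    log_request(log_path, f"GET {path} -> {response}")
    return response


def test_get_root_returns_200():
    with tempfile.TemporaryDirectory() as directory:
        log_path = os.path.join(directory, "server.log")
        # Every line of log_request runs here, yet nothing below checks
        # what was logged. The logging code is executed but not tested.
        assert handle_get("/", log_path) == "200 OK"
```

A line coverage tool watching this test would report `log_request` as fully executed, even though the test would still pass if the logger wrote the wrong text.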
Last year I read Brian Button’s wonderful article “Double Duty” in Better Software magazine (the February, 2005 issue). One of the things I learned is that Brian is the world’s best namer of unit tests. I visited Brian’s web site for more of his ideas and found an article called “TDD Defeats Programmer’s Block—Film at 11.” In this article, Brian describes using the Test Driven Development process to write a “continuous integration system” (a tool that automatically (re)builds software systems when programmers change the source code). Here are some examples of his unit test names:
Starting Build With No Previous State Only Starts Build For Last Change
Previous Build Number Is Incremented After Successful Started Build
Last Build Failing Leaves Last Build Set To Previous Build
What makes these names so good? I analyzed a few dozen of Brian’s test names and found this pattern: stimulus and result in context. Let’s examine these names to identify the parts.
Starting Build With No Previous State Only Starts Build For Last Change:
Context: There is no previous state (i.e. no previous builds were done).
Stimulus: Start a build.
Result: A build was started for only the last change.
Previous Build Number Is Incremented After Successful Started Build:
Context: There were zero or more previous builds.
Stimulus: Request a build that will succeed.
Result: The build number is one more than before the build.
Last Build Failing Leaves Last Build Set To Previous Build:
Context: There were previous builds, the most recent of which is recorded in the system as the last build.
Stimulus: Request a build that will fail.
Result: The previously identified last build is still identified as the last build.
One of Brian’s tests from a different system—an “animal factory” (a concept better left unexplained)—is called Default Animal Is Cow.
Context: No animal type has been identified as the desired type of animal for the system to manufacture.
Stimulus: Request that the system manufacture an animal.
Result: A new cow exists.
Now that I’ve learned the pattern that makes Brian’s test names so useful, I can use it deliberately. Using the context-stimulus-result scheme increases the value of tests as documentation. The resulting names make clear what specifically is being tested and under what specific conditions. This helps the reader to understand quickly what each test does, and what is covered by each set of tests.
Another benefit is that the context-stimulus-result naming scheme encourages you to clarify your thinking about each test. Each unit test establishes some set of starting conditions, or context. Each stimulates the system. Each compares the result to a desired result. In order to name these elements you will have to think about the specifics of each and clarify them well enough that you can describe each in a few words.
If you’re having difficulty naming a test using this scheme, that may indicate a problem in your test. Perhaps the test is doing too much work, or your test suite is doing too little. For example, suppose you’re testing software to manage bank accounts, and one test is called Withdrawal Test. We can tell from this name that the test tests the withdrawal feature in some way. But we don’t know what specific aspects of withdrawals this test is testing.
Does Withdrawal Test test only that a withdrawal of less than the account balance reduces the balance by the proper amount? If so, calling this test “Withdrawal Test” may indicate that your suite of tests for the withdrawal feature is missing many important test cases. The name of the test gives readers an overly broad sense of what the test actually tests.
Does Withdrawal Test test a score of different stimuli under a dozen different conditions? If so, it’s probably doing too much work. The name of the test does not quickly tell readers what is being tested.
Whether Withdrawal Test is doing too much work or too little, we can improve the test by applying the context-stimulus-result scheme. If Withdrawal Test is doing too much, we can use the scheme to identify how to break the test into smaller, more focused tests with more descriptive names. If Withdrawal Test tests only one tiny aspect of withdrawals and leaves other aspects untested, we can use the scheme to create a better name for the test and to identify other tests to write.
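Here is a sketch of what a Withdrawal Test might become after applying the scheme, written as plain Python test functions. The `Account` class and `InsufficientFunds` exception are hypothetical and exist only to give the test names something to describe.

```python
class InsufficientFunds(Exception):
    pass


class Account:
    """Hypothetical bank account, just enough to hang test names on."""

    def __init__(self, balance: int):
        self.balance = balance

    def withdraw(self, amount: int) -> None:
        if amount > self.balance:
            raise InsufficientFunds()
        self.balance -= amount


def test_withdrawing_less_than_balance_reduces_balance_by_amount():
    account = Account(balance=100)   # context
    account.withdraw(30)             # stimulus
    assert account.balance == 70     # result


def test_withdrawing_entire_balance_leaves_zero_balance():
    account = Account(balance=100)
    account.withdraw(100)
    assert account.balance == 0


def test_withdrawing_more_than_balance_is_refused_and_leaves_balance_unchanged():
    account = Account(balance=100)
    try:
        account.withdraw(101)
        refused = False
    except InsufficientFunds:
        refused = True
    assert refused
    assert account.balance == 100
```

Each name states a stimulus, a context, and a result, so a reader can see from the names alone which aspects of withdrawals are covered and which are still missing.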
If you want to test a class in isolation, but the class works with a collaborator, you may need to provide a fake collaborator for the class to work with. A fake collaborator provides useful isolation in two directions:
It isolates the test from the quirks of the real collaborators. This makes failures more informative: If the test fails, the fault is likely in the test subject, and not in the collaborator.
It isolates the real collaborators from the test. This is important if the real collaborator is, say, the corporate accounts receivable database. You don’t want your tests messing with that.
Fake collaborators often provide other benefits over real collaborators. One benefit is that fake collaborators increase testability by increasing your control over the test subject’s environment. It’s usually easier to set up a fake collaborator to feed your test subject a particular data value than to set up the real collaborator to do the same thing. And if the real collaborator takes a long time to do its work, you can gain control over the speed of the test by writing a fake collaborator that takes essentially no time at all.
Fake collaborators also increase testability in another way: They give you greater visibility into the results produced by the test subject. Sometimes it’s difficult or time consuming to observe what data the test subject delivered to a real collaborator. If you write a fake collaborator, it’s easy to instruct it to remember the data that the test subject delivered. And it’s easy to gain access to that information so that you can compare it to your expectations.
I’ve identified a number of jobs that I often want fake collaborators to do for me when I’m writing tests. Each of these jobs helps me to gain control over the test environment or visibility into the test results.
1. Fill in an argument to a method call. Suppose the test subject requires me to pass an argument to it (either through the constructor or through the method I’m testing), but the argument is never used during the test. In this case, all I need the “collaborator” to do is to fill in a value in the method call. If that’s all I need, I can pass null.
2. Accept calls from the test subject. If the test subject calls the collaborator’s methods, but the test doesn’t care what the collaborator does, I can write a fake collaborator with dummy methods. If the interface specifies that a method doesn’t need to return anything, I can simply write a dummy method with an empty body. If the method must return a value, I can write the dummy method to return a simple default value, such as 0, null, or false. Objects like this, and similar objects with very simple default behavior, are often called Null Objects.
3. Provide inputs to the test subject. Sometimes the test subject requires a value other than 0, null, or false in order to run. And sometimes I’m writing a test to determine whether the test subject responds appropriately when it receives specific interesting values from its collaborators. In either case, I enhance the fake collaborator to store an appropriate value and deliver it to the test subject when called.
4. Record outputs from the test subject. Sometimes I want to know whether the test subject sends the right information to the collaborator. I can write the fake collaborator’s methods to store the inputs it receives from the test subject. And I can write accessor methods in the fake collaborator, if necessary, so that the test method can retrieve them.
5. Verify outputs from the test subject. Sometimes it’s useful to have the collaborator do the verification itself, rather than having the test retrieve values from the collaborator and verify them. When I want this, I can create a mock object, an object that has expectations and can verify them. I can either write my own mock objects, including the verification methods, or I can use one of the numerous mock object libraries that make mocking easier. I use the simple mock features that come with NUnit.
6. Verify what methods the test subject calls. Sometimes I want to verify not only whether the collaborator received the right values, but also whether the test subject called all of the right methods. And sometimes I want to make sure the test subject does not call certain methods. Mock object libraries typically provide ways to verify function calls.
7. Verify the sequence in which the test subject calls methods. Every now and then, I want to verify that the test subject not only called the right methods on the collaborator, but also called them in a specific order. This can be useful for testing protocols. Some mock libraries provide a way to verify the order of method calls. The NUnit mock library does not. When I need this feature, I often write a logging collaborator that simply writes each expected method call to a string and each actual call to another string. To verify whether the actual calls matched expectations, my test can direct the logging collaborator to compare the two strings.
8. Collaborate fully. If the test somehow requires the full behavior of a real collaborator, I can use a real collaborator. So far, I haven’t found a need for this when I’m trying to test classes in isolation. I do use real collaborators when my intention is to test the collaboration, and not just one class or another.
I’ve numbered these features in order of lightness. The lighter features are easier to create; the heavier features take more work. null is the lightest collaborator of all, and the real collaborator is the heaviest.
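Several of these jobs can be sketched in Python. Everything in this example is hypothetical: the `OrderProcessor` test subject, its three collaborators, and their method names. Python's standard `unittest.mock` stands in for the NUnit mocks mentioned above, and its recorded `mock_calls` list takes the place of a hand-written logging collaborator for checking call order.

```python
from unittest.mock import Mock, call


class NullLogger:
    """Null Object: accepts calls from the test subject and does nothing."""

    def log(self, message: str) -> None:
        pass


class StubPriceSource:
    """Provides a chosen input to the test subject."""

    def __init__(self, price: int):
        self.price = price

    def price_of(self, item: str) -> int:
        return self.price


class RecordingMailer:
    """Records outputs from the test subject so the test can inspect them."""

    def __init__(self):
        self.sent = []

    def send(self, recipient: str, body: str) -> None:
        self.sent.append((recipient, body))


class OrderProcessor:
    """Hypothetical test subject with three collaborators."""

    def __init__(self, prices, mailer, logger):
        self.prices, self.mailer, self.logger = prices, mailer, logger

    def place_order(self, customer: str, item: str, quantity: int) -> int:
        total = self.prices.price_of(item) * quantity
        self.logger.log(f"order placed: {item} x {quantity}")
        self.mailer.send(customer, f"Your total is {total}")
        return total


def test_placing_order_mails_total_to_customer():
    mailer = RecordingMailer()
    processor = OrderProcessor(StubPriceSource(price=5), mailer, NullLogger())
    processor.place_order("pat@example.com", "widget", 3)
    assert mailer.sent == [("pat@example.com", "Your total is 15")]


def test_placing_order_logs_before_mailing():
    # Child mocks report their calls to the parent, in the order they happen.
    collaborators = Mock()
    processor = OrderProcessor(
        StubPriceSource(price=5), collaborators.mailer, collaborators.logger
    )
    processor.place_order("pat@example.com", "widget", 3)
    assert collaborators.mock_calls == [
        call.logger.log("order placed: widget x 3"),
        call.mailer.send("pat@example.com", "Your total is 15"),
    ]
```

The first test uses the lightest fakes that give it the control and visibility it needs. The second reaches for a mock only because it cares about which methods were called and in what order.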
My preference when writing tests is to use the lightest fake collaborator that gives me the visibility and control that I need for the purposes of my test. This keeps my tests as light and flexible as they can be.
Often I start by passing the lightest collaborator of all, null, to the test subject, and then wait for the test to tell me when I need to add more behavior to the collaborator. If the test subject needs something other than null, I’ll find out when I try to run the test and get a null reference exception. Then I’ll move to a Null Object. If the default values returned from the Null Object don’t satisfy the test subject, the test usually signals that with an exception or failure of some kind, and I’ll move to a heavier collaborator.
I call this approach The Unbearable Lightness of Faking: start with the lightest possible collaborator, and use it until the lightness becomes unbearable and I absolutely must switch to something heavier.
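The first step of that progression might look like this in Python, where `None` plays the role of null and an `AttributeError` plays the role of the null reference exception. The `ReportFormatter` class and its `auditor` collaborator are invented for the example.

```python
class ReportFormatter:
    """Hypothetical test subject; `auditor` is its collaborator."""

    def __init__(self, auditor):
        self.auditor = auditor

    def title(self, text: str) -> str:
        return text.upper()               # never touches the auditor

    def footer(self, text: str) -> str:
        self.auditor.record("footer")     # calls the collaborator
        return f"-- {text} --"


class NullAuditor:
    """Null Object: the next lightest collaborator after None."""

    def record(self, event: str) -> None:
        pass


def test_title_is_upper_cased():
    # None is enough here: the collaborator is never used.
    assert ReportFormatter(None).title("summary") == "SUMMARY"


def test_footer_is_wrapped_in_dashes():
    # Passing None here raises AttributeError when footer() runs.
    # That failure is the signal to move up to a Null Object.
    assert ReportFormatter(NullAuditor()).footer("end") == "-- end --"
```

Neither test needs anything heavier, so neither gets anything heavier.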
When I’m talking to programmers about writing tests for their own code, one of the questions that comes up often is: Should we test classes in isolation from each other, or in collaboration with each other?
I like both kinds of tests. Here’s why.
I like tests that isolate classes. When a failure occurs, the tests tell me specifically what class failed, and what method failed. That guides me more directly to the fault—the specific code that is broken—and saves a ton of debugging.
I like tests that exercise collaborations. When a failure occurs, the tests tell me that:
one class or the other is not fulfilling its responsibilities, or
the collaborators disagree about each other’s responsibilities, or
some other class (the “electrician” class that connects the collaborators with each other) has wired the collaborators together improperly.
If the individual classes are well tested, I can focus my collaboration testing specifically on wiring and agreements. And if the individual classes are tested well, collaboration test failures tell me about disagreements and improper wiring.
When I test classes in isolation, failures guide me quickly to faults.
When I test classes in collaboration, failures tell me where the classes disagree about each other’s responsibilities.
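A small Python sketch of the two kinds of tests follows. The thermostat, the sensor, the tenths-of-a-degree raw reading, and the `build_thermostat` function (the “electrician”) are all hypothetical.

```python
class Sensor:
    """Real collaborator: converts a raw reading in tenths of a degree to degrees."""

    def __init__(self, raw_reading: int):
        self.raw_reading = raw_reading

    def degrees(self) -> float:
        return self.raw_reading / 10


class Thermostat:
    def __init__(self, sensor, target: float):
        self.sensor, self.target = sensor, target

    def heating_needed(self) -> bool:
        return self.sensor.degrees() < self.target


def build_thermostat(raw_reading: int, target: float) -> Thermostat:
    """The 'electrician': wires the collaborators together."""
    return Thermostat(Sensor(raw_reading), target)


class FakeSensor:
    """Fake collaborator that reports whatever the test tells it to."""

    def __init__(self, degrees_value: float):
        self.degrees_value = degrees_value

    def degrees(self) -> float:
        return self.degrees_value


# Isolation: a failure here points at one class.
def test_thermostat_calls_for_heat_when_below_target():
    assert Thermostat(FakeSensor(18.0), target=20.0).heating_needed()


def test_sensor_converts_tenths_to_degrees():
    assert Sensor(185).degrees() == 18.5


# Collaboration: a failure here, with the isolation tests passing,
# points at the wiring or at a disagreement (for example, about units).
def test_wired_thermostat_calls_for_heat_when_sensor_reads_below_target():
    assert build_thermostat(raw_reading=185, target=20.0).heating_needed()
```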
Automated tests are software. At first glance, this seems like a non-blinding non-flash of non-insight. But I’m learning a lot about testing by applying this non-insight mindfully.
One thing I’m learning is how often I forget that automated tests are software. When I’m writing tests, I often neglect to apply all of the principles that help me to write software well. What if I were to apply some of those principles mindfully?
A key principle is that we write software in order to serve some specific set of needs for some specific set of people. When I’m trying to understand what software to write, I apply this principle in the form of a few questions: Whose needs will the software serve? What needs will trigger those people to interact with the software? What roles will the software play in satisfying those needs?
Let’s apply this principle to the tests we write: Whose needs will these tests serve? What needs would trigger those people to interact with the tests? What roles will the tests play in satisfying those needs?
These days, I write software mostly for my own needs. And mostly I write the software alone. So the “whose needs” question is an easy one: When I write tests, I’m writing them mostly for me, for my own needs.
More enlightening for me—as a solo software developer writing tests solely for my own needs—are other questions. What needs trigger me to interact with the tests, either by running them or by reading test code? What roles do the tests play in satisfying those needs? Here’s a partial list of answers:
I want to know whether my software is ready to deliver.
I want test code to help me understand which parts of the system are tested and which are not.
I want to know whether there are defects in the software I’m writing.
I want tests to expose defects.
I want to know how to correct defects.
I want tests to direct me to the defective part of the software.
I want to understand the meaning of the test results.
I want each test’s code to indicate clearly how the test stimulates the software, and in what conditions.
I want test reports to describe the test stimulus, the relevant test conditions, and the software’s response.
When I’m adding a feature, I want to know when I’m done.
I want tests to tell me which of the feature’s responsibilities the software fulfills, and which it does not.
When I’m editing software, I want to know whether my edits are having unintended effects.
I want tests to detect changes in the behavior of the surrounding software.
When I’m preparing to edit software, I want to know what the existing code does, so that I don’t inadvertently break it.
I want test code to describe clearly what the existing software does.
That’s a partial list of needs for a single stakeholder. I’m sure you can think of additional needs that you have when you run tests or read test code, and additional ways that you want tests to help you satisfy those needs. And if we were to consider other people who might interact with our tests, we would discover even more needs. And then there are all of the people who do not interact with the tests and yet are affected by them.
That’s a lot of stakeholders, and a lot of needs. I’m more likely to satisfy all of these people’s needs (including my own) when I’m aware of what the needs are. And I’m more likely to be aware of the needs when I ask questions like the ones I’ve used here. And I’m more likely to ask these questions when I remember that tests are software.