Mutation Testing: Diagnose your Test Quality

I recently introduced mutation testing to my team, and we used it to diagnose the quality of our tests. In this article, we’ll talk about why you should not trust code coverage to measure the quality and effectiveness of your tests. Then, we’ll learn about mutants and mutation testing. Next, we’ll see how to automate the mutation testing process and see an example of it.

Finally, I will share my experience with mutation testing and some of the difficulties we faced when we tried to integrate it into our dev workflow.

An analogy

Vaccine developers develop vaccines for viruses like COVID-19. As you know, viruses evolve and mutate (think of the Alpha, Delta, and Delta Plus variants). So, in their research and experiments, we want scientists to find ways to assess a vaccine’s quality against mutants of the virus.

Photo by Braňo on Unsplash

What do we mean by virus mutant? A mutation refers to a single change in a virus’s genome. A genome is basically a set of instructions that contains the information that the virus needs to function.

As a scientist, it would be awesome if I could make tiny changes to the virus’s genome sequence to create many COVID mutants, and then test the vaccine’s effectiveness against those mutants.

This is what virus mutation testing is all about: testing the effectiveness of the vaccine against the mutated virus.
The same idea applies to software testing: the mutated virus is your mutated code, and the tests are your vaccine.

Code Coverage

Before talking about mutation testing, I’d like to talk about code coverage and why you should not trust code coverage when evaluating the effectiveness of your tests.

Uncle Bob has tweeted that test coverage is a horrible code quality metric from management’s perspective, but that it can be helpful in development when you mix it with mutation testing.

There’s a misconception that code coverage measures how much code is tested. In fact, it measures how much code is executed. For example, a test that calls a method but checks nothing produces exactly the same coverage as a test that verifies the result: coverage stays constant with or without test assertions or verifications.
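Here is a minimal sketch of that effect. The class and method names are hypothetical and are not the code from my project; the two "tests" are plain methods so the example runs without a test framework.

```java
// Hypothetical production code: orders of 100 or more get a 10% discount.
class DiscountCalculator {
    static int priceAfterDiscount(int amount) {
        if (amount >= 100) {
            return amount - amount / 10;
        }
        return amount;
    }
}

class CoverageDemo {
    // Executes both branches but checks nothing.
    // A coverage tool would still report every line as covered.
    static void testWithoutAssertions() {
        DiscountCalculator.priceAfterDiscount(50);
        DiscountCalculator.priceAfterDiscount(200);
    }

    // Executes exactly the same lines and also verifies the results.
    static void testWithAssertions() {
        check(DiscountCalculator.priceAfterDiscount(50) == 50, "no discount below 100");
        check(DiscountCalculator.priceAfterDiscount(200) == 180, "10% discount at 200");
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        testWithoutAssertions();
        testWithAssertions();
        System.out.println("Both tests pass and execute the same lines.");
    }
}
```

Both tests execute the same lines, so a coverage report cannot tell them apart, even though only the second one would notice a broken discount.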

With that said, we can’t rely on code coverage to measure the effectiveness of our tests.

Now you may ask: How should we measure the quality of our tests?

We can answer this question by asking ourselves another question: If we make a semantic change to our production code, will our unit tests fail?

If the answer is yes, our tests are good. Otherwise, we need to improve our existing tests, and maybe even write new test cases.

That leads us to another question: how do we know which tests need improvement?

Intro to Mutants

Well, we’ll need to mutate the semantics of our code (e.g., negate an if-statement or remove a method call), one small change at a time. After each change, we run our unit tests and check whether at least one of them fails against it.

If we don’t get a failing test, then we need to work on our test suite.

Now, each semantic change is called a “mutant”.

You may ask: what is a mutant? It’s a slightly changed version of our production code. Think of it as a bug injected into your code.

Mutants are said to be killed when at least one of our tests in the test suite fails against the mutated program.
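To make "killed" and "survived" concrete, here is a small hand-written sketch. Everything in it is hypothetical: the pricing rule, the mutant, and the two tests are made up for illustration, and a real tool would generate the mutant for you.

```java
import java.util.function.IntUnaryOperator;

class MutantDemo {
    // Original production code (hypothetical): 10% off for orders of 100 or more.
    static int original(int amount) {
        return amount >= 100 ? amount - amount / 10 : amount;
    }

    // Mutant: the boundary `>=` was changed to `>`.
    static int boundaryMutant(int amount) {
        return amount > 100 ? amount - amount / 10 : amount;
    }

    // A test without a boundary case; returns true when it passes.
    static boolean weakTest(IntUnaryOperator price) {
        return price.applyAsInt(50) == 50 && price.applyAsInt(200) == 180;
    }

    // The same test plus a check at the boundary value 100.
    static boolean strongTest(IntUnaryOperator price) {
        return weakTest(price) && price.applyAsInt(100) == 90;
    }

    public static void main(String[] args) {
        // A mutant is killed when a test that passes on the original fails on the mutant.
        System.out.println("weak test kills mutant:   " + !weakTest(MutantDemo::boundaryMutant));
        System.out.println("strong test kills mutant: " + !strongTest(MutantDemo::boundaryMutant));
    }
}
```

The weak test passes against both versions, so the mutant survives. Adding the boundary check at 100 makes the test fail against the mutant, which kills it.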

The test suite can then be scored by what’s called the mutation score: the number of killed mutants divided by the total number of mutants, multiplied by 100.
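As a quick worked example of that formula, here is a tiny sketch. The helper name and the numbers are made up for illustration.

```java
class MutationScore {
    // Mutation score as a percentage: killed mutants / total mutants * 100.
    static double percent(int killed, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return 100.0 * killed / total;
    }

    public static void main(String[] args) {
        // Hypothetical run: 4 mutants generated, only 1 killed by the tests.
        System.out.println(percent(1, 4) + "%"); // prints 25.0%
    }
}
```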

Mutation testing hypotheses

Now, the process of mutation testing is based on two assumptions.

The competent programmer hypothesis

This hypothesis states that program faults are syntactically small, and the programmer can fix them with a few keystrokes. The consequence is that mutants simulate the likely effect of real faults. Thus, if our test suite is good at catching artificial mutants, it will also be good at catching the real ones, which are the faults in our program.

The Coupling Effect Hypothesis

The second assumption states that tests that detect simple, small errors should be capable of detecting more complex ones derived from the combination of other errors. This way, by explicitly testing for simple errors, we’re implicitly testing for more complicated ones. That’s why mutation testing doesn’t need to make complex changes to the program. According to this assumption, simple changes will be enough.

So with these assumptions in mind, let’s see what mutation testing is.

Mutation Testing

Mutation testing is a form of white-box testing. It assesses the quality of your tests by inserting mutants into the code and running your tests against them. If the generated mutants are killed, our test suite is solid. If they survive, we’ll need to work on our test suite.

This diagram illustrates how mutation testing works in general:

First, we create the mutants (the slightly changed versions of our production code). After creating the mutants, we run our unit tests against these mutants and, of course, against our production code. Then, we compare the results of running the tests against the mutants and against the production code, and generate a comparison report.
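The steps above can be sketched as a toy loop. This is only an illustration under simplifying assumptions: the mutants are written by hand as lambdas, the "test suite" is a single method, and all names are hypothetical. A real tool generates the mutants and the report automatically.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;
import java.util.function.Predicate;

class MiniMutationRun {
    // Production code (hypothetical): 10% off for orders of 100 or more.
    static int original(int amount) {
        return amount >= 100 ? amount - amount / 10 : amount;
    }

    // Step 1: create the mutants (hand-written here).
    static Map<String, IntUnaryOperator> mutants() {
        Map<String, IntUnaryOperator> m = new LinkedHashMap<>();
        m.put("changed >= to >", a -> a > 100 ? a - a / 10 : a);
        m.put("negated condition", a -> a < 100 ? a - a / 10 : a);
        m.put("replaced - with +", a -> a >= 100 ? a + a / 10 : a);
        m.put("removed discount", a -> a);
        return m;
    }

    // The test suite under evaluation; returns true when all checks pass.
    static boolean testSuite(IntUnaryOperator price) {
        return price.applyAsInt(50) == 50 && price.applyAsInt(200) == 180;
    }

    // Steps 2 and 3: run the suite against every mutant and report the outcome.
    static int countKilled(Predicate<IntUnaryOperator> suite) {
        int killed = 0;
        for (Map.Entry<String, IntUnaryOperator> e : mutants().entrySet()) {
            boolean dead = !suite.test(e.getValue());
            System.out.println((dead ? "KILLED   " : "SURVIVED ") + e.getKey());
            if (dead) {
                killed++;
            }
        }
        return killed;
    }

    public static void main(String[] args) {
        // The suite must pass on the unmodified code, otherwise the comparison is meaningless.
        if (!testSuite(MiniMutationRun::original)) {
            throw new IllegalStateException("suite must pass on the original code");
        }
        int killed = countKilled(MiniMutationRun::testSuite);
        System.out.println(killed + " of " + mutants().size() + " mutants killed");
    }
}
```

In this sketch, three of the four mutants are killed. The boundary mutant survives because the suite never checks the value 100.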

Now, we don’t want to create mutants manually and run each of our tests against them; that’s time-consuming, not to mention all the possibilities we’d need to cover. That’s why libraries like Pitest exist.

Pitest automates the mutation testing process: it alters the bytecode, creates the mutants using the instrumentation API, runs the corresponding unit tests against those mutants, and generates a report showing weakly tested pieces of code.

Pitest first calculates the code coverage and uses it, together with some bytecode manipulation magic, to match each mutant with the tests that exercise the mutated code. This way, only the tests that could kill a mutant are run against it. This makes it relatively fast compared to other libraries.

Note that Pitest can generate reports in multiple formats: HTML for local viewing, CSV, and XML for SonarQube. The following is an HTML report generated by Pitest:

We can see that we have 100% code coverage. However, the mutation score is very low: only 25%. We can click on the class name to see why:

Here, Pitest shows me weakly tested pieces of code. It does that by applying mutators to the code and running tests against the generated mutants. 

We can see three mutations that survived, the ones shown in red. They survived because the test that we saw earlier was missing a verification and a boundary test.

After adding the forgotten verification and the boundary test, I ran the mutation coverage goal again, and you can see that the mutation coverage went up:

This is the example code we were testing, and you can see that all mutations are killed. That’s good, since the quality of a test suite is measured by the percentage of mutants that it kills. So to kill mutants, we can add new tests or change existing ones.

Pitest supports a variety of configuration options. Let’s see how to configure Pitest and some common config options.

To get started with Pitest, all you have to do is add this plugin in your pom.xml file:
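A minimal plugin declaration might look like the following sketch. The version is left as a placeholder property, which is my assumption; pick the Pitest release you actually want to use.

```xml
<plugin>
    <groupId>org.pitest</groupId>
    <artifactId>pitest-maven</artifactId>
    <!-- Assumption: pitest.version is a property you define yourself -->
    <version>${pitest.version}</version>
</plugin>
```

With the plugin in place, the mutation coverage goal can be run with mvn org.pitest:pitest-maven:mutationCoverage. If your tests use JUnit 5, you will most likely also need to add the pitest-junit5-plugin dependency to this plugin declaration.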


Pitest, by default, will mutate all of the classes in your project. But you can specify the targetClasses config option to tell Pitest which packages need to be mutated, and targetTests to specify which test classes to run. These two options are useful if you want to speed up the mutation testing process by running it only on the packages or classes that you care about.

Another config option is mutationThreshold, the mutation score below which the build fails. And last but not least, we have coverageThreshold, the line coverage below which the build fails.
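Put together, these options could be configured roughly as follows. The package names and threshold values are placeholders I made up for illustration, and the version property is an assumption.

```xml
<plugin>
    <groupId>org.pitest</groupId>
    <artifactId>pitest-maven</artifactId>
    <version>${pitest.version}</version>
    <configuration>
        <!-- Hypothetical package: only mutate classes under com.example.billing -->
        <targetClasses>
            <param>com.example.billing.*</param>
        </targetClasses>
        <!-- Hypothetical pattern: only run the matching test classes -->
        <targetTests>
            <param>com.example.billing.*Test</param>
        </targetTests>
        <!-- Example values: fail the build below 80% mutation score or 90% line coverage -->
        <mutationThreshold>80</mutationThreshold>
        <coverageThreshold>90</coverageThreshold>
    </configuration>
</plugin>
```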

Challenges with Mutation Testing

Not everything in mutation testing is perfect. One thing to keep in mind is that mutation testing can be highly time-consuming to run, so if you’re working on a huge codebase, you’ll want to break the mutation tests into small parts. Or you can use features like incremental analysis. Now, let’s see an example of incremental analysis.

Incremental analysis

Here, I’ve run mutation testing on a relatively large codebase, the popular Apache commons-lang library. It took more than 2 hours to complete the mutation testing process:

You can see how time-consuming this is. Performing mutation testing on all classes in a large codebase is not always necessary. Pitest provides a somewhat experimental feature called Incremental Analysis to only analyze the code that has changed. This is useful for large codebases because eventually, due to the amount of code that will have to be mutated, the process will take a long time to finish.

So, using the withHistory configuration option when running the mutation coverage goal tells Pitest to generate a binary history file, where it stores information about previous mutation testing runs. This makes subsequent runs much faster.
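As a sketch, the option can be switched on in the plugin configuration like this. Only the relevant fragment of the pom.xml is shown.

```xml
<configuration>
    <!-- Enables incremental analysis by keeping a history file between runs -->
    <withHistory>true</withHistory>
</configuration>
```

Alternatively, it can be passed on the command line with mvn -DwithHistory org.pitest:pitest-maven:mutationCoverage.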

After running the Pitest mutation coverage goal with incremental analysis enabled, you can see it took us two minutes and 28 seconds instead of two hours to complete the mutation testing process.

So with incremental analysis enabled, you can run mutation testing regularly to diagnose the quality of the tests for your newly written functionality.

Pitest with SCM

Another option to speed up the mutation testing process is to run the scmMutationCoverage goal:

mvn org.pitest:pitest-maven:scmMutationCoverage

By default, it will only analyze files that are ADDED or MODIFIED in your project’s VCS. This way, Pitest checks the mutation coverage of only the classes that have changed, before you push the code to the repository. This goal uses the Maven SCM plugin, which allows you to interact with your VCS. By default, this goal works on local changes. If you want to analyze the files that were added and modified in the last commit, you’ll need to set the option analyseLastCommit to true. This can be useful when Pitest is run on a continuous integration machine.

Pitest also provides originBranch and destinationBranch options if you want to analyze the changes between two branches.

Maven multi-module support

SCM mutation testing and incremental analysis in Pitest aren’t supported in a Maven multi-module project.

I can run the process on a Maven multi-module project (with a very dirty workaround), but then all I can do is wait days for it to complete, which is definitely not practical. Yes, that’s not a typo! It literally takes days to complete the process on the very large enterprise codebase I work on!

Other ecosystems

Now, you can run mutation testing in other ecosystems as well. For Python, we have a library called MutPy. Another library is Stryker. Stryker started out as a pure JavaScript mutation testing library, and it now also supports TypeScript, C#, and Scala.

I didn’t try out these tools, but they’re known in the tech community.

Wrap Up

In this post, I attempted to clarify how code coverage and mutation testing complement each other to create a solid test suite. A high code coverage percentage is not enough: it doesn’t mean the code is well tested. However, mutation testing won’t work properly if you don’t have a high level of coverage, because the tests won’t detect changes made to code they never execute. We’ve also seen some of the challenges that you may face when trying to integrate mutation testing in the real world.

If you happen to find these articles useful, you can buy me a coffee.
