Item 30: Write more than unit tests
"All companies have test environments.
The lucky ones have production environments separate from the test environment." – @FearlessSon
Like most other modern languages, Rust includes features that make it easy to write tests that live alongside your code, and which give confidence that the code is working correctly.
This isn't the place to expound on the importance of tests; suffice it to say that if code isn't tested, it probably doesn't work the way you think it does. So this Item assumes that you're already signed up to write tests for your code.
Unit tests and integration tests, described in the next two sections, are the key forms of test. However, the Rust toolchain and extensions to it allow for various other types of test; this Item describes their distinct logistics and rationales.
Unit Tests
The most common form of test for Rust code is a unit test, which might look something like:
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_nat_subtract() {
        assert_eq!(nat_subtract(4, 3).unwrap(), 1);
        assert_eq!(nat_subtract(4, 5), None);
    }

    #[should_panic]
    #[test]
    fn test_something_that_panics() {
        nat_subtract_unchecked(4, 5);
    }
}
Some aspects of this example will appear in every unit test:
- a collection of unit test functions, which are…
- marked with the #[test] attribute, and included within…
- a module marked with the #[cfg(test)] attribute, so the code only gets built in test configurations.
Other aspects of this example illustrate things that are optional, and may only be relevant for particular tests:
- The test code here is held in a separate module, conventionally called tests or test. This module may be inline (as here), or held in a separate tests.rs file.
- The test module may have a wildcard use super::* to pull in everything from the parent module under test. This makes it more convenient to add tests (and is an exception to the general advice of Item 23 to avoid wildcard imports).
- A unit test has the ability to use anything from the parent module, whether it is pub or not. This allows for "whitebox" testing of the code, where the unit tests exercise internal features that aren't visible to normal users.
- The test code makes use of unwrap() for its expected results; the advice of Item 18 isn't really relevant for test-only code, where panic! is used to signal a failing test. Similarly, the test code also checks expected results with assert_eq!, which will panic on failure.
- The code under test includes a function that panics on some kinds of invalid input, and the tests exercise that in a test that's marked with the #[should_panic] attribute. This might be an internal function that normally expects the rest of the code to respect its invariants and preconditions, or it might be a public function that has some reason to ignore the advice of Item 18. (Such a function should have a "Panics" section in its doc comment, as described in Item 27.)
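For reference, the functions exercised by these tests are not shown in this Item; a minimal sketch of what nat_subtract and nat_subtract_unchecked might look like (an assumption for illustration, not necessarily the real definitions):

/// Subtract `b` from `a`, returning `None` if the result would be negative.
pub fn nat_subtract(a: u64, b: u64) -> Option<u64> {
    // `checked_sub` returns `None` on underflow, matching the "natural number"
    // semantics that the tests above expect.
    a.checked_sub(b)
}

/// Subtract `b` from `a`, panicking if the result would be negative.
pub fn nat_subtract_unchecked(a: u64, b: u64) -> u64 {
    nat_subtract(a, b).expect("underflow in natural number subtraction")
}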
Item 27 suggests not documenting things that are already expressed by the type system; similarly, there's no need to test things that are guaranteed by the type system. If your enum types start holding values that aren't in the list of allowed variants, you've got bigger problems than a failing unit test!
However, if your code relies on specific functionality from your dependencies, it can be helpful to include basic tests of that functionality. The aim here is not to repeat testing that's already done by the dependency itself, but instead to have an early warning system that indicates whether it's safe to include a new version of that dependency in practice – separately from whether the semantic version number (Item 21) indicates that the new version is safe in theory.
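As an illustration, the following sketch assumes a dependency on the serde_json crate (an assumption for this example, not something the Item itself uses) and pins down the one behaviour the surrounding code relies on:

#[cfg(test)]
mod dependency_tests {
    // Check the specific dependency behaviour we rely on: large integers
    // survive JSON parsing as u64 values rather than losing precision.
    #[test]
    fn test_serde_json_keeps_u64_precision() {
        let value: serde_json::Value =
            serde_json::from_str("18446744073709551615").expect("valid JSON");
        assert_eq!(value.as_u64(), Some(u64::MAX));
    }
}

If a new version of the dependency changes this behaviour, the test fails immediately, regardless of what the version number promises.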
Integration Tests
The other common form of test included with a Rust project is integration tests, held under tests/. Each file in that directory is run as a separate test program that executes all of the functions marked with #[test].
Integration tests do not have access to crate internals, and so act as black-box tests that can only exercise the public API of the crate.
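For example, a file such as tests/subtract.rs (assuming the crate is named mycrate and publicly exposes nat_subtract; both names are placeholders) can only go through the front door:

// tests/subtract.rs: only the public API of the crate is visible here.
use mycrate::nat_subtract;

#[test]
fn test_nat_subtract_public() {
    assert_eq!(nat_subtract(4, 3), Some(1));
    assert_eq!(nat_subtract(4, 5), None);
}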
Doc Tests
Item 27 described the inclusion of short code samples in documentation comments, to illustrate the use of a particular public API item. Each such chunk of code is enclosed in an implicit fn main() { ... } and run as part of cargo test, effectively making it an additional test case for your code, known as a doc test. Individual doc tests can also be executed selectively by running cargo test --doc <item-name>.
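A hypothetical doc comment for nat_subtract (again assuming a crate named mycrate) might therefore carry its own test:

/// Subtract `b` from `a`, returning `None` if the result would be negative.
///
/// ```
/// # use mycrate::nat_subtract;
/// assert_eq!(nat_subtract(4, 3), Some(1));
/// assert_eq!(nat_subtract(4, 5), None);
/// ```
pub fn nat_subtract(a: u64, b: u64) -> Option<u64> {
    a.checked_sub(b)
}

The code between the ``` markers is compiled and executed by cargo test; the line starting with # is compiled too, but hidden from the rendered documentation.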
Assuming that you regularly run tests as part of your continuous integration environment (Item 32), this ensures that your code samples don't drift too far from the current reality of your API.
Examples
Item 27 also described the ability to provide example programs that exercise your public API. Each Rust file under examples/ (or each subdirectory under examples/ that includes a main.rs) can be run as a standalone binary with cargo run --example <name> or cargo test --example <name>.
These programs only have access to the public API of your crate, and are intended to illustrate the use of your API as a whole. Examples are not specifically designated as test code (no #[test], no #[cfg(test)]), and they're a poor place to put code that exercises obscure nooks and crannies of your crate – particularly as examples are not run by cargo test by default.
Nevertheless, it's a good idea to ensure that your continuous integration system (Item 32) builds and runs all the associated examples for a crate (with cargo test --examples), because it can act as a good early warning system for regressions that are likely to affect lots of users. As noted above, if your examples demonstrate mainline use of your API, then a failure in the examples implies that something significant is wrong.
- If it's a genuine bug, then it's likely to affect lots of users – the very nature of example code means that users are likely to have copied, pasted and adapted the example.
- If it's an intended change to the API, then the examples need to be updated to match. An API change that breaks the examples is also a backwards-incompatible change, so if the crate is published then the semantic version number needs a corresponding update to indicate this (Item 21).
The likelihood of users copying and pasting example code means that it should have a different style than test code. In line with Item 18, you should set a good example for your users by avoiding unwrap() calls for Results. Instead, make each example's main() function return something like Result<(), Box<dyn Error>>, and then use the question mark operator throughout (Item 3).
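A minimal sketch of such an example, say examples/subtract.rs (the file name, the mycrate crate name, and the command-line handling are all assumptions for illustration), might look like:

// examples/subtract.rs: run with `cargo run --example subtract 4 3`.
use std::error::Error;

use mycrate::nat_subtract;

fn main() -> Result<(), Box<dyn Error>> {
    // Read a pair of numbers from the command line, using `?` to propagate
    // any missing-argument or parse failure rather than panicking.
    let mut args = std::env::args().skip(1);
    let a: u64 = args.next().ok_or("missing first argument")?.parse()?;
    let b: u64 = args.next().ok_or("missing second argument")?.parse()?;

    match nat_subtract(a, b) {
        Some(v) => println!("{a} - {b} = {v}"),
        None => println!("{a} - {b} would underflow"),
    }
    Ok(())
}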
Benchmarks
Item 20 attempts to persuade you that fully optimizing the performance of your code isn't always necessary. Nevertheless, there are definitely still times when performance is critical, and if that's the case then it's a good idea to measure and track that performance. Having benchmarks that are run regularly (e.g. as part of continuous integration, Item 32) allows you to detect when changes to the code or the toolchains adversely affect that performance.
The cargo bench command¹ runs special test cases that repeatedly perform an operation, and emits average timing information for the operation.
However, there's a danger that compiler optimizations may give misleading results, particularly if you restrict the operation that's being performed to a small subset of the real code. Consider a simple arithmetic function:
pub fn factorial(n: u128) -> u128 {
    match n {
        0 => 1,
        n => n * factorial(n - 1),
    }
}
A naïve benchmark for this code:
// Requires the nightly-only test feature: `#![feature(test)]` and
// `extern crate test;` at the crate root.
use test::Bencher;

#[bench]
fn bench_factorial(b: &mut Bencher) {
    b.iter(|| {
        let result = factorial(15);
        assert_eq!(result, 1_307_674_368_000);
    });
}
gives incredibly positive results:
test naive::bench_factorial ... bench: 0 ns/iter (+/- 0)
With fixed inputs and a small amount of code under test, the compiler is able to optimize away the iteration and directly emit the result, leading to an unrealistically optimistic result.
The (experimental) std::hint::black_box function can help with this; it's an identity function whose implementation the compiler is "encouraged, but not required" (their italics) to pessimize.
Moving the code under test to use this hint:
#![feature(bench_black_box)] // nightly-only

pub fn factorial(n: u128) -> u128 {
    match n {
        0 => 1,
        n => n * std::hint::black_box(factorial(n - 1)),
    }
}
gives more realistic results:
test bench_factorial ... bench: 42 ns/iter (+/- 6)
The Godbolt compiler explorer can also help by showing the actual machine code emitted by the compiler, which may make it obvious when the compiler has performed optimizations that would be unrealistic for code running a real scenario.
Finally, if you are including benchmarks for your Rust code, the Criterion crate (sketched below) may provide an alternative to the standard test::Bencher functionality which is:

- more convenient (it runs with stable Rust)
- more fully featured (it has support for statistics and graphs).
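As a sketch of the Criterion style: the following assumes a [dev-dependencies] entry for criterion, a [[bench]] section in Cargo.toml with harness = false, and that the factorial function above is exposed from a crate named mycrate (the file and crate names are placeholders).

// benches/factorial.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

use mycrate::factorial;

fn bench_factorial(c: &mut Criterion) {
    c.bench_function("factorial 15", |b| {
        // `black_box` stops the compiler from constant-folding the input.
        b.iter(|| factorial(black_box(15)))
    });
}

criterion_group!(benches, bench_factorial);
criterion_main!(benches);

Because the macros generate their own main function, this runs under cargo bench on stable Rust, with no need for the nightly-only test::Bencher machinery.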
Fuzz Testing
Fuzz testing is the process of exposing code to randomized inputs in the hope of finding bugs, particularly crashes that result from those inputs. Although this can be a useful technique in general, it becomes much more important when your code is exposed to inputs that may be controlled by someone who is deliberately trying to attack the code – so you should run fuzz tests if your code is exposed to potential attackers.
Historically, the majority of defects in C/C++ code that have been exposed by fuzzers have been memory safety problems, typically found by combining fuzz testing with runtime instrumentation (e.g. AddressSanitizer or ThreadSanitizer) of memory access patterns.
Rust is immune to some (but not all) of these memory safety problems, particularly when there is no unsafe code involved (Item 16). However, Rust does not prevent bugs in general, and a code path that triggers a panic! (cf. Item 18) can still result in a denial-of-service (DoS) attack on the codebase as a whole.
The most effective forms of fuzz testing are coverage-guided: the test infrastructure monitors which parts of the code are executed, and favours random mutations of the inputs that explore new code paths. "American fuzzy lop" (AFL) was the original heavyweight champion of this technique, but in more recent years equivalent functionality has been included in the LLVM toolchain as libFuzzer.
The Rust compiler is built on LLVM, and so the cargo-fuzz sub-command exposes libFuzzer functionality for Rust (albeit only for a limited number of platforms).
To set up a fuzz test, first identify an entrypoint of your code that takes (or can be adapted to take) arbitrary bytes of data as input:
/// Determine if the input starts with "FUZZ".
fn is_fuzz(data: &[u8]) -> bool {
    if data.len() >= 3 /* oops */
        && data[0] == b'F'
        && data[1] == b'U'
        && data[2] == b'Z'
        && data[3] == b'Z'
    {
        true
    } else {
        false
    }
}
Next, write a small driver that connects this entrypoint to the fuzzing infrastructure:
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    let _ = is_fuzz(data);
});
Running cargo +nightly fuzz run target1 continuously executes the fuzz target with random data, only stopping if a crash is found. In this case, a failure is found almost immediately:
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 1139733386
INFO: Loaded 1 modules (1596 inline 8-bit counters): 1596 [0x10cba9c60, 0x10cbaa29c),
INFO: Loaded 1 PC tables (1596 PCs): 1596 [0x10cbaa2a0,0x10cbb0660),
INFO: 7 files found in /Users/dmd/src/effective-rust/examples/testing/fuzz/corpus/target1
INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
INFO: seed corpus: files: 7 min: 1b max: 8b total: 34b rss: 38Mb
#8 INITED cov: 22 ft: 22 corp: 6/26b exec/s: 0 rss: 38Mb
thread '<unnamed>' panicked at 'index out of bounds: the len is 3 but the index is 3', fuzz_targets/target1.rs:11:12
stack backtrace:
0: rust_begin_unwind
at /rustc/f77bfb7336f21bfe6a5fb5f7358d4406e2597289/library/std/src/panicking.rs:579:5
1: core::panicking::panic_fmt
at /rustc/f77bfb7336f21bfe6a5fb5f7358d4406e2597289/library/core/src/panicking.rs:64:14
2: core::panicking::panic_bounds_check
at /rustc/f77bfb7336f21bfe6a5fb5f7358d4406e2597289/library/core/src/panicking.rs:159:5
3: _rust_fuzzer_test_input
4: ___rust_try
5: _LLVMFuzzerTestOneInput
6: __ZN6fuzzer6Fuzzer15ExecuteCallbackEPKhm
7: __ZN6fuzzer6Fuzzer6RunOneEPKhmbPNS_9InputInfoEbPb
8: __ZN6fuzzer6Fuzzer16MutateAndTestOneEv
9: __ZN6fuzzer6Fuzzer4LoopERNSt3__16vectorINS_9SizedFileENS_16fuzzer_allocatorIS3_EEEE
10: __ZN6fuzzer12FuzzerDriverEPiPPPcPFiPKhmE
11: _main
and the input that triggered the failure is emitted.
Normally, fuzz testing does not find failures so quickly, and so it does not make sense to run fuzz tests as part of your continuous integration. The open-ended nature of the testing, and the consequent compute costs, mean that you need to consider how and when to run fuzz tests – perhaps only for new releases or major changes, or perhaps for a limited period of time².
You can also make the fuzzing infrastructure more efficient by storing and re-using a corpus of inputs that the fuzzer previously found to explore new code paths; this helps subsequent runs of the fuzzer explore new ground, rather than re-testing code paths previously visited.
Testing Advice
An Item about testing wouldn't be complete without repeating some common advice (which is mostly not Rust-specific):
- As this Item has endlessly repeated, run all your tests in continuous integration on every change (with the exception of fuzz tests).
- When you're fixing a bug, write a test that exhibits the bug before fixing the bug. That way you can be sure that the bug is fixed, and that it won't be accidentally re-introduced in future.
- If your crate has features (Item 26), run tests over every possible combination of available features.
- More generally, if your crate includes any config-specific code (e.g. #[cfg(target_os = "windows")], as in the sketch below), run tests for every platform that has distinct code.
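For instance, the following sketch (with hypothetical function and path names) compiles a different body per platform, so the accompanying test only says something about whichever target it has actually been run on:

// Hypothetical config-specific code: each target compiles exactly one variant.
#[cfg(target_os = "windows")]
pub fn default_config_dir() -> &'static str {
    r"C:\ProgramData\myapp"
}

#[cfg(not(target_os = "windows"))]
pub fn default_config_dir() -> &'static str {
    "/etc/myapp"
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_default_config_dir_is_absolute() {
        // Only exercises the variant built for the current target, so it
        // needs to be run on every platform with distinct code.
        assert!(std::path::Path::new(default_config_dir()).is_absolute());
    }
}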
Summary
This Item has covered a lot of different types of test, so a summary is in order:
- Write unit tests for comprehensive testing that includes testing of internal-only code; run with cargo test.
- Write integration tests to exercise your public API; run with cargo test.
- Write doc tests that exemplify how to use individual items in your public API; run with cargo test.
- Write example programs that show how to use your public API as a whole; run with cargo test --examples or cargo run --example <name>.
- Write benchmarks if your code has significant performance requirements; run with cargo bench.
- Write fuzz tests if your code is exposed to untrusted inputs; run (continuously) with cargo fuzz.
That's a lot of different types of test, so it's up to you how much each of them is relevant and worthwhile for your project.
If you have a lot of test code and you are publishing your crate to crates.io, then you might need to consider which of the tests make sense to include in the published crate. By default, cargo will include unit tests, integration tests, benchmarks and examples (but not fuzz tests), which may be more than end users need. If that's the case, you can either exclude some of the files, or (for black-box tests) move the tests out of the crate and into a separate test crate.
1: Support for benchmarks is not stable, so the command may need to be cargo +nightly bench.
2: If your code is a widely-used open-source crate, the Google OSS-Fuzz program may be willing to run fuzzing on your behalf.