Generating Software Tests: Breaking Software for Fun and Profit (fuzzingbook.org)
179 points by ingve on Nov 5, 2018 | 24 comments


Disclaimer: I only took a quick glance at the material, not a deep dive. So feel free to discount my criticism as severely as you like. Or whack me with a clue-bat, if needed.

1. In the fuzzing chapter, the author seems to be showing how to create a fuzzer in Python. OK, so that is great if you want to learn the basics of creating a fuzzer. HOWEVER... there is a great Python module called "hypothesis" that is well developed, well maintained, and 100% production ready. If your goal is to do fuzz testing of Python, I urge you to check out hypothesis. It integrates nicely with drivers like unittest, and does automatic test case reduction (a minimal sketch follows point 2 below). In other words, if it discovers a failing case, it does a pretty darn decent job of trimming the failing case down to the minimal inputs necessary to provoke the failure. In fact, I urge you to check out hypothesis, period. You might be surprised at how much your current testing leaves out.

2. The first chapter on testing covers the basics of writing a unit test and a test driver. Again, test drivers that are well maintained and have a lot of mileage on them already exist for Python. But moreover, the author seems to focus on tactical testing. Unit testing is good, yes, but there is a difference between "testing" and "validation". Testing is tactical. Validation is strategic. I didn't see anything there about analyzing the product's test space, or about how to make judgment calls about what to test and how to measure coverage in a way that ensures your customers will be well served. For most real products, the entirety of the test space is intractably large, so it is very important to know how to target tests that tickle edge and corner cases, because just throwing random crap at the SUT, while necessary, is unlikely to sensitize the interesting test points. (Imagine each test case as throwing darts at a dart board: you want as many darts to land on the lines between the rings as land cleanly within the rings.)
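
Since point 1 mentions it: here is a minimal sketch of what a hypothesis property test looks like. The run-length encode/decode functions are made-up examples for illustration, not from the book:

    # pip install hypothesis
    from hypothesis import given, strategies as st

    def run_length_encode(s):
        # Toy function under test (hypothetical): "aaab" -> [("a", 3), ("b", 1)]
        out = []
        for ch in s:
            if out and out[-1][0] == ch:
                out[-1] = (ch, out[-1][1] + 1)
            else:
                out.append((ch, 1))
        return out

    def run_length_decode(pairs):
        return "".join(ch * n for ch, n in pairs)

    # The property: decoding an encoding returns the original string.
    # hypothesis generates the inputs and shrinks any failing case it finds.
    @given(st.text())
    def test_round_trip(s):
        assert run_length_decode(run_length_encode(s)) == s

    if __name__ == "__main__":
        test_round_trip()  # the @given decorator turns this call into a full run

If the decoder were buggy, hypothesis would report a shrunken, minimal failing string rather than a random 200-character blob.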


I work on a Google Test-like unit testing framework called DeepState [1] that gives you access to fuzzing and symbolic execution from your C or C++ unit tests. We have a tutorial [2] describing how to use it. It's new and still under development, but shows massive potential, and really brings these software security tools into a more developer-centric workflow.

[1] https://github.com/trailofbits/deepstate [2] http://www.petergoodman.me/docs/secdev-2018-slides.pdf


Another approach to automated generation of test cases is turning traces into tests. Valid uses of the code already exist: they are in the application itself, and for a library you can find published code that makes use of it.

A recent paper [1] covers this subject and demonstrates, for R, that the approach achieves better code coverage than hand-written tests, and does so for a large portion of published libraries.

I think this is a very interesting direction for automated test case generation. It reminds me very much of snapshot testing [2].

[1] http://janvitek.org/pubs/issta18.pdf [2] https://blog.kentcdodds.com/effective-snapshot-testing-e0d1a...
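
To make the trace-to-test idea concrete, here is a rough Python sketch of my own (not the paper's tool, which targets R): record the arguments and results of real calls, then replay them as regression assertions. The slugify function is a hypothetical example.

    import functools

    TRACE = []  # (function name, args, kwargs, result) captured at runtime

    def record(fn):
        # Wrap a function so real uses of it are harvested as future test cases.
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            TRACE.append((fn.__name__, args, kwargs, result))
            return result
        return wrapper

    @record
    def slugify(title):
        # Library function whose real-world uses we want to turn into tests.
        return "-".join(title.lower().split())

    # "Application" code exercising the library as usual:
    slugify("Generating Software Tests")
    slugify("Breaking Software for Fun and Profit")

    def emit_tests(trace):
        # Turn the captured trace into simple regression assertions.
        for name, args, _kwargs, expected in trace:
            arglist = ", ".join(map(repr, args))
            print(f"assert {name}({arglist}) == {expected!r}")

    emit_tests(TRACE)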


If you're interested in this topic, here is a survey that tells you about various ways to do automatic test-case generation:

https://cs.stanford.edu/people/saswat/research/ASTJSS.pdf

My baseline recommendation is to use one of every type, if you can just let them run overnight on a server dedicated to verification or testing. I also think hybrid methods that combine static analysis, dynamic analysis, and the above methods of test generation have a ton of promise. Here's an example:

https://homes.cs.washington.edu/~mernst/pubs/palus-testgen-i...


Slightly off topic: I remember reading on HN about a tool that explored all code paths in a binary automatically and used it to effectively find bugs and even generate valid data. For example it could generate valid jpeg images just by exploring the valid code path of a jpeg decoder. Does anyone remember that tool's name?



See also for in-process fuzzing: https://llvm.org/docs/LibFuzzer.html


That is genius.


That's it! Thanks


One thing that is rarely mentioned is that fuzzing finds crashes, but not wrong code - you still need an oracle (something that tells you whether the output is correct), and those can be very hard to write.

A useful tool, sure, but not a silver bullet.


In my side projects I use simpler solutions as oracles for more complex ones when I'm fuzzing. E.g. if I've implemented some sophisticated `fac` with secret sauce and mathematical yada yada, I use `def fac(n): return 1 if n==0 else n * fac(n-1)` as the oracle, which works but is slow. I've never used something like this at work, so I don't know what the industry-standard way to write an oracle is, but this works fairly well. Other examples: I implemented polynomial multiplication with FFT and then tested it against the canonical O(N^2) algorithm. I implemented the Strassen algorithm and tested it against the canonical O(N^3) algorithm. You can even have multiple levels of oracles: if the canonical polynomial multiplication is too slow, use the FFT version to test an even more sophisticated polynomial multiplication algorithm.
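
As a concrete sketch of that workflow (`fancy_fac` is a hypothetical stand-in for the clever implementation under test):

    import random

    def fac(n):
        # Dead-simple reference oracle: obviously correct, possibly slow.
        return 1 if n == 0 else n * fac(n - 1)

    def fancy_fac(n):
        # Stand-in for the optimized implementation being tested (hypothetical).
        result = 1
        for k in range(2, n + 1):
            result *= k
        return result

    # Fuzz loop: random inputs, compare against the oracle.
    for _ in range(1000):
        n = random.randint(0, 500)
        assert fancy_fac(n) == fac(n), f"mismatch at n={n}"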

In college, my OS professor said something like: on space missions, the algorithms were implemented several times by different teams, and at runtime the implementations voted on the output. You can use something like this to build an oracle too. Implement the same function five times, have them vote on an output, take the agreed-upon output as the oracle's output, and use that to test your real function.

It goes without saying that if the problem is in NP, you can build a function that checks the output. Of course, now you need to test that function...
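
A toy sketch of that voting idea (the three `sort_*` functions stand in for independently written implementations):

    import random
    from collections import Counter

    def sort_a(xs): return sorted(xs)                                # team A
    def sort_b(xs): return list(reversed(sorted(xs, reverse=True)))  # team B
    def sort_c(xs): return sorted(xs, key=lambda x: x)               # team C

    def voting_oracle(xs):
        # Majority vote over the independent implementations.
        outputs = [tuple(f(list(xs))) for f in (sort_a, sort_b, sort_c)]
        winner, votes = Counter(outputs).most_common(1)[0]
        assert votes >= 2, "implementations disagree; no majority to act as oracle"
        return list(winner)

    def my_real_sort(xs):
        # The function actually being tested (hypothetical).
        return sorted(xs)

    for _ in range(1000):
        data = [random.randint(-10, 10) for _ in range(random.randint(0, 20))]
        assert my_real_sort(list(data)) == voting_oracle(data)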


Fuzzing works best if you sprinkle assertions generously throughout your code. If you do algorithm-heavy stuff, you can choose a certifying algorithm [1] for your problem. That makes it a lot easier to get the oracle that you need.

[1] https://en.wikipedia.org/wiki/Certifying_algorithm
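
A classic small example of a certifying algorithm is the extended Euclidean algorithm: along with gcd(a, b) it returns Bezout coefficients, so a much simpler checker can confirm the answer. A sketch (my illustration, not from the linked article):

    import random

    def certified_gcd(a, b):
        # Extended Euclid: returns (g, x, y) with a*x + b*y == g.
        if b == 0:
            return a, 1, 0
        g, x, y = certified_gcd(b, a % b)
        return g, y, x - (a // b) * y

    def check_gcd_certificate(a, b, g, x, y):
        # The checker is far simpler than the algorithm: g divides both inputs,
        # and the Bezout identity forces g to be the greatest common divisor.
        return a % g == 0 and b % g == 0 and a * x + b * y == g

    for _ in range(1000):
        a, b = random.randint(1, 10**6), random.randint(1, 10**6)
        g, x, y = certified_gcd(a, b)
        assert check_gcd_certificate(a, b, g, x, y)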


That's not entirely true. If you have multiple implementations, you can compare fuzz results of the implementations against each other rather than against ground truth. There's a project that does this with several popular C compilers that's found some good bugs, but I can't remember the name right now.

This approach does still have limitations, e.g. it assumes errors are independent. And, of course, it requires independent implementations. But it is sometimes more accessible than an oracle.


You're probably referring to Csmith, the random C generator from John Regehr's group at U. of Utah.

https://embed.cs.utah.edu/csmith/

Fuzzing C compilers this way is difficult, because you can't generate programs that have undefined behaviors, and C has so many of those. The complexity and challenge of their project was to avoid generating bogus programs without throwing away too much diversity.

The general technique is called differential testing. For compilers, it was pioneered by William McKeeman at DEC in the 1990s. He also addressed C compilers, with a more limited set of inputs than Csmith produces.

https://www.cs.dartmouth.edu/~mckeeman/references/Differenti...

Personally, I've applied (for the last 15 years) this kind of structured random testing to Common Lisp compilers. It's my contribution to the informal development process for SBCL. In this case, the comparison is between code compiled at different optimization settings, and with functions declared NOTINLINE, and with or without type declarations. Any source of diversity that preserves program meaning can be used to expose bugs. Many of the bug reports I've submitted to SBCL come from the random tester (and some other randomized testing techniques); you can get an idea of the kind of thing the tester finds by looking at them:

https://bit.ly/2PLZDuK
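
The same differential setup is easy to script for any compiler you can drive from the command line. A rough Python harness (my sketch; it assumes gcc on the PATH and a ./corpus directory of small, deterministic, UB-free C programs) comparing -O0 against -O2:

    import glob, os, subprocess, tempfile

    def run_at_opt_level(src, level):
        # Compile one C file at the given -O level and return its stdout.
        with tempfile.TemporaryDirectory() as tmp:
            exe = os.path.join(tmp, "a.out")
            subprocess.run(["gcc", level, src, "-o", exe], check=True)
            return subprocess.run([exe], capture_output=True, timeout=10).stdout

    for src in glob.glob("corpus/*.c"):
        if run_at_opt_level(src, "-O0") != run_at_opt_level(src, "-O2"):
            print(f"possible miscompilation (or lurking UB) in {src}")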


Even there, you need a fuzzer that can generate code that actually executes and does something you can observe.

Otherwise the output of a C compiler (as in the binary itself) is expected to differ even between different versions of the same compiler.


Besides what the others have said,

1. Wrong code doesn't necessarily crash on every input, but there's usually at least one input that makes it crash, so crashing is not a terrible heuristic.

2. Writing oracles along the lines of "the output should not be negative", or "the output should contain this value", or "this string should only contain printable ASCII" gets you much closer very cheaply.
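
Those cheap, partial oracles are easy to wire into a fuzz loop. A sketch, with a hypothetical `render_username` function as the target:

    import random, string

    def render_username(raw):
        # Hypothetical function under test: sanitize arbitrary input for display.
        return "".join(ch for ch in raw if ch in string.printable).strip()[:32]

    for _ in range(10_000):
        raw = "".join(chr(random.randint(0, 0x10FFFF))
                      for _ in range(random.randint(0, 64)))
        out = render_username(raw)
        # Partial oracles: no full specification needed, just properties that must hold.
        assert len(out) <= 32
        assert all(c in string.printable for c in out)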


OK, that's fair. But one interesting case in that space is where the production code is highly optimized to be performant at the expense of simplicity. Writing an easy-to-understand but stupid-slow oracle is often a worthwhile strategy.


At the risk of sounding ignorant: has anyone fuzzed something like APIs before? I would love to get some anecdotes from fuzzing more abstracted systems.


The Rust ecosystem has a team that tries to fuzz as many crates (rust packages) as possible [1].

Unlike with C/C++, and much as with Python, fuzzing Rust code is not really about finding memory bugs but more about finding logic errors [2].

To do this, a project has been set up with 83 (so far) targets fuzzing the public API of 48 (so far) important crates [3].

All those targets can be fuzzed using any of the three major native code feedback-based fuzzers (AFL, LibFuzzer, and Honggfuzz).

[1] https://github.com/rust-fuzz/targets

[2] see the trophy case: https://github.com/rust-fuzz/trophy-case

[3] https://github.com/rust-fuzz/targets/blob/master/common/src/...

Disclaimer: I'm a member of this team and the author of the honggfuzz crate that makes honggfuzz work with Rust code.


And Cargo has good support for integrating Rust fuzzers into one's own projects:

https://medium.com/@seasoned_sw/fuzz-testing-in-rust-with-ca...

btw, I'm impressed that the rust-fuzz trophy list includes only one UAF, one uninitialized memory read, and no segfaults. :)

https://github.com/rust-fuzz/trophy-case


I'm an engineer at Tinfoil Security, working on a security scanner for REST APIs. At a high level, it ingests Swagger specifications, which it uses to build property-testing generators that test the implementation against the schema and run various security checks.

It isn't ready for public consumption yet, but we have a few customers using a private beta.

We're still exploring the problem space, but so far it has been very successful at quickly confirming an API conforms to its specification, and automating the detection of most "input validation" bugs.
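
A rough illustration of the general approach (not our actual scanner): derive hypothesis strategies from the parameter schema and assert that responses stay within what the spec promises. The endpoint and field names here are hypothetical.

    # pip install hypothesis requests
    import requests
    from hypothesis import given, settings, strategies as st

    BASE = "https://api.example.com"  # hypothetical API under test

    # Strategy written by hand here, but derivable from a Swagger/OpenAPI schema:
    # { "name": string, "quantity": integer >= 0 }
    order_payload = st.fixed_dictionaries({
        "name": st.text(max_size=200),
        "quantity": st.integers(min_value=0, max_value=10**6),
    })

    @settings(max_examples=200, deadline=None)
    @given(order_payload)
    def test_create_order_conforms_to_schema(payload):
        resp = requests.post(f"{BASE}/orders", json=payload, timeout=10)
        # The spec only allows these status codes; a 500 is an automatic bug.
        assert resp.status_code in (200, 201, 400, 422)
        if resp.status_code in (200, 201):
            assert isinstance(resp.json().get("id"), int)  # schema: id is an integer

    if __name__ == "__main__":
        test_create_order_conforms_to_schema()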


Also, for a toy Rust project, I wrote an implementation of a copy-on-write B+tree and used a fuzzer (honggfuzz) to generate all the test cases automatically.

It was insanely effective; there were so many more edge cases than I thought. The fuzzer found all of them and got me to 100% code coverage in no time.

All I had to write as test code was a function mapping an array of random data to a series of B+tree instructions, applying those to both my implementation and a reference, and then checking that the two structures contained the same data.
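
The shape of that harness translates directly to Python. A sketch (mine was in Rust driven by honggfuzz; here `BPlusTree` is just a dict-backed placeholder for the structure under test, and plain `random` stands in for the coverage-guided fuzzer):

    import random

    class BPlusTree:
        # Placeholder for the structure under test; the real B+tree goes here.
        def __init__(self):
            self._data = {}
        def insert(self, key, value):
            self._data[key] = value
        def delete(self, key):
            self._data.pop(key, None)
        def items(self):
            return sorted(self._data.items())

    def apply_ops(data):
        # Interpret a blob of random bytes as a series of map instructions,
        # apply them to both the implementation and a reference dict, and compare.
        tree, reference = BPlusTree(), {}
        for i in range(0, len(data) - 1, 2):
            op, key = data[i] % 3, data[i + 1]
            if op == 0:
                tree.insert(key, i)
                reference[key] = i
            elif op == 1:
                tree.delete(key)
                reference.pop(key, None)
            else:
                assert tree.items() == sorted(reference.items())
        assert tree.items() == sorted(reference.items())

    for _ in range(10_000):
        blob = bytes(random.getrandbits(8) for _ in range(random.randint(0, 64)))
        apply_ops(blob)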






