I was happily tooling around on my MacBook at the command line, poking around in the MAME source code as you do, and then this happened:
Record scratching sound. WTF.
grep just segfaulted. grep.
I have used grep hundreds of times a day for over 15 years. I have used it in countless scripts and shovelled many terabytes of data through it. It has never failed on me before. It is one of the quintessential reliable Unix tools. To say I was surprised at its failure is an understatement.
After recovering from my shock, I felt a pressing need to find out why it failed.
First things first - which grep am I using?
Huh. A BSD grep. Interesting. It is also worth noting that I turn color on by default because I like to see my patterns highlighted.
I did a bit of playing around to see if I could create a smaller test case:
After a few back and forths of bisecting the input, I pared it down to a fairly minimal case that consisted of four patterns and one input line:
So, does the debugger offer any help? It probably won't since this is a stripped binary, but let's give it a whirl:
Hmm, no crash. After a head scratch, I realise the debugger is not invoking grep with --color.
Let's try again:
Huh, yep - that's a segfault alright. And the bug is definitely related to --color.
At this stage it is hard to figure out what is going on, and all this shows is that a bogus pointer is eventually handed to fwrite. The crash is probably some way off from the original bug. Joy.
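As an aside, here is a rough sketch of why a highlighting code path can end up handing fwrite a bad pointer. It assumes the colored output is written in three pieces (text before the match, the match wrapped in escape codes, text after the match), which is my guess at the general shape and not grep's actual code. The function name write_highlighted, the COLOR_ON/COLOR_OFF codes and the match span in the example are all made up for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

#define COLOR_ON  "\33[01;31m"  /* illustrative ANSI codes, not grep's defaults */
#define COLOR_OFF "\33[00m"

/*
 * Hypothetical highlighter: writes line[0..len) to out, wrapping the span
 * [m->rm_so, m->rm_eo) in color codes. Nothing here validates the offsets,
 * so a span that lies outside the line makes fwrite read outside the buffer.
 */
static void write_highlighted(FILE *out, const char *line, size_t len,
                              const regmatch_t *m)
{
    assert(out != NULL && line != NULL && m != NULL);

    size_t so = (size_t)m->rm_so;
    size_t eo = (size_t)m->rm_eo;

    fwrite(line, 1, so, out);            /* text before the match */
    fputs(COLOR_ON, out);
    fwrite(line + so, 1, eo - so, out);  /* the match itself */
    fputs(COLOR_OFF, out);
    fwrite(line + eo, 1, len - eo, out); /* text after the match */
}
```

Without color there is no need to slice the line by match offsets at all, which would fit with the crash only showing up under --color.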
Let's make our life a bit easier and see if we can track down the source code at Apple.
AFAICT, there doesn't seem to be a nice way to match an Apple binary to the source code on their website.
Does the binary tell us anything?
After a bunch of digging around on Apple's website, I came up empty handed.
Time to swim upstream to FreeBSD.
According to FreeBSD, the current stable release is 10.1. To save myself some time, I looked around for a suitable Vagrant VM image. Thankfully, wunki has done all the hard work:
Let's test it out:
Well, that is interesting: they ship GNU grep as the system grep on FreeBSD, and it works just fine.
Any other greps on the system?
Ah, that looks closer to our mac version. How well does it work?
Hmm, no output, and an exit code of 1.
That indicates no match, which is incorrect. So, no crash - but not working properly either.
Maybe the distributed version is weird. Let's pull down a version from ports:
Cool, let's try to blow it up:
Great! - how about a debug build?
Sweet - we are starting to home in on the problem.
Let's check the behaviour without color:
This version of grep is showing the same behaviour as the Apple version, and we have a nice debuggable binary.
What next? Well, it's time to get up close and personal with the source code, and fire up valgrind [1].
At this point, I consider that I may need to do a port over to Linux, since that is valgrind's native platform.
Let's try the FreeBSD valgrind anyway:
Well - it got halfway there - and it did give us a bit more insight into the problem.
The benefit of valgrind is that it lets you stop closer to the original error - in this case it looks like a read past the end of a buffer.
Ideally, if you understand the source code well enough, you can prod around with gdb at this point and inspect things.
Unfortunately, this appears to not work in FreeBSD.
Thankfully, we can just do a quick and dirty port to Linux, and go from there.
Looks like the main problems are related to:
__FBSDID macro
missing mmap stuff
OFF_MAX
fgetln
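For a flavour of what these fixes can look like, here is a minimal sketch of Linux-side shims for three of the items above. It is not the patch I actually used. The names compat_fgetln and compat_fbsdid_unused are hypothetical, the getline-based replacement is just one common approach, and the mmap changes are not covered.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for getline() and ssize_t */
#endif
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* FreeBSD tags each source file with __FBSDID("..."); elsewhere it can be
 * turned into a harmless forward declaration that swallows the semicolon. */
#ifndef __FBSDID
#define __FBSDID(s) struct compat_fbsdid_unused
#endif
__FBSDID("$FreeBSD$");

/* OFF_MAX is a FreeBSD limit: the largest value a (signed) off_t can hold. */
#ifndef OFF_MAX
#define OFF_MAX ((off_t)((UINTMAX_C(1) << (sizeof(off_t) * CHAR_BIT - 1)) - 1))
#endif

/*
 * Hypothetical stand-in for fgetln(3): returns a pointer to the next line
 * (including its trailing newline, if any) and stores its length in *len.
 * The buffer is reused between calls, and NULL is returned at EOF or on error.
 */
static char *compat_fgetln(FILE *fp, size_t *len)
{
    static char *buf = NULL;
    static size_t cap = 0;

    assert(fp != NULL && len != NULL);

    ssize_t n = getline(&buf, &cap, fp);
    if (n < 0)
        return NULL;
    *len = (size_t)n;
    return buf;
}
```

One difference worth remembering: the real fgetln does not promise a NUL-terminated buffer, so code written against it should rely on the returned length only.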
After fixing that, bmake was a little finicky about building a debug version, so I had to hack up a build script.
What does the Linux version have for us?
Success - we have replicated the same bug under Linux too. We can start debugging in earnest.
It is complaining about line 1031 - we are reading uninitialised data from either pat_byte or str_byte:
str_byte looks more suss than pat_byte. Why?
According to this, fastexec is operating on l->dat, which is a pointer to a line read from the listing file. The program thinks it has a length of 10, which matches the file and the string "192\t./i860\n", which is 10 characters (without the newline).
So why is fastcmp wandering around at offset 20?
That smells like an error, considering the listing file is smaller than 20 bytes.
It looks like the invariant here should be that the valid range for fastexec to access is
After a quick dig through the code, I determined that access to the string was being controlled by the offsets pmatch.rm_so and pmatch.rm_eo, and added a bit of debug output to clarify.
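The debug output was along these lines. This is a sketch and not the exact code I added: match_in_bounds and debug_match are hypothetical names, and the check assumes the match is a half-open range that must fit inside a line of len bytes.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical check: a match is usable only if 0 <= rm_so <= rm_eo <= len. */
static bool match_in_bounds(const regmatch_t *m, size_t len)
{
    assert(m != NULL);
    return m->rm_so >= 0 && m->rm_so <= m->rm_eo && (size_t)m->rm_eo <= len;
}

/* Hypothetical debug helper: print the offsets next to the line length. */
static void debug_match(FILE *out, const regmatch_t *m, size_t len)
{
    fprintf(out, "rm_so=%lld rm_eo=%lld len=%zu %s\n",
            (long long)m->rm_so, (long long)m->rm_eo, len,
            match_in_bounds(m, len) ? "ok" : "OUT OF BOUNDS");
}
```

Calling something like debug_match(stderr, &pmatch, l->len) just before the string is touched is enough to see whether the offsets ever leave the line; the l->len field name is assumed here.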
In the non color case:
With color:
Bingo, the offsets are being incremented past the end of the buffer. This incrementing occurs inside tre_fastnexec inside this macro:
It is not clear whether the callee or the caller is in error here, as the intent of the code isn't straightforward.
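To make the ambiguity concrete, here is a small sketch of an offset adjustment of this general kind. It is a hypothetical stand-in and not the macro from the grep source: shift_matches is a made-up name and the numbers in the usage note are illustrative.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for an offset adjustment: shift every match by
 * `offset`, turning offsets relative to where the search started into
 * offsets relative to the start of the line.
 */
static void shift_matches(regmatch_t *pmatch, size_t nmatch, regoff_t offset)
{
    assert(pmatch != NULL || nmatch == 0);

    for (size_t i = 0; i < nmatch; i++) {
        pmatch[i].rm_so += offset;
        pmatch[i].rm_eo += offset;
    }
}
```

The shift is only correct if the offsets really are relative to the search start. If they are already relative to the start of a 10 byte line, a match of [0, 10) shifted by 10 becomes [10, 20), which is entirely outside the buffer. Whether the callee should skip the shift or the caller should pass different arguments depends on a contract the code does not spell out.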
At this point I had collected enough data to send a bug report - 197531 - to FreeBSD.
Something was still bugging me though - why was the behaviour of the distributed bsdgrep binary different?
I decided to do some more digging - to be continued in part 2.
Footnotes
[1] There is a great treatment of valgrind and Linux debugging in general in Fusco2007.