Digging deeper into the grep segfault

In the last article, I discussed tracking down a segfault in grep. This article continues down the rabbit hole.

At the end of the last instalment, I was left with a puzzle. Why was the version of grep from textproc/bsdgrep acting in a different manner to /usr/bin/bsdgrep? Time to put on my debugging cap, and dig down a little further.

Let's start off with trying a few variants of the input set:

$ printf "192\t./i860\n"  | bsdgrep --color -e i860 -e sharc
192 ./i860
$ printf "192\t./i860\n"  | bsdgrep --color -e i860 -e g65816
192 ./i860
$ printf "192\t./i860\n"  | bsdgrep --color -e i860 -e i86
192 ./i860
192 ./i860
Segmentation fault (core dumped)

Excellent, we made it crash! Can we slim it down even more?

[vagrant]$ echo i860  | bsdgrep --color -e i860 -e i86
i860
i860
Segmentation fault (core dumped)

What about the behaviour on mac?

$ echo i860  | grep --color -e i860 -e i86
i860
Segmentation fault: 11

It was around this time that I realised I had been looking at the wrong binary. D'oh!

After actually reading TFM, I determined that FreeBSD builds its system binaries from SVN and not from ports. The binary I had been looking at previously was a different version!

A bit of poking around at freebsd.org and googling for grep.c revealed this as the likely repo path:

base/stable/10/usr.bin/grep/

Lets pull it down from svn:

[vagrant]$ svn checkout http://svn.freebsd.org/base/head --depth=immediates stable
<snip/>
[vagrant]$ svn up stable/usr.bin/ --set-depth=immediates
<snip/>
[vagrant]$ svn up stable/usr.bin/grep --set-depth=infinity
<snip/>
[vagrant]$ cd stable/usr.bin/grep
[vagrant]$ make
<snip/>compiling a non stripped release version</snip>

[vagrant]$ file ./bsdgrep
bsdgrep: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked (uses shared libs), for FreeBSD 10.1, not stripped

[vagrant]$ objdump --syms bsdgrep |grep debug_info
[vagrant]$

Note - I use set depth with svn to avoid pulling down the whole FreeBSD source tree, which is quite large.

Lets check the behaviour:


[vagrant]$ ./bsdgrep --version
bsdgrep (BSD grep) 2.5.1-FreeBSD

[vagrant]$ echo i860 | bsdgrep --color -e i860 -e i86
i860
i860
Segmentation fault (core dumped)

So, what does the debugger have for us?

[vagrant]$ echo i860 > /tmp/l
[vagrant]$ gdb --args ./bsdgrep --color -e i860 -e i86 /tmp/l

(gdb) r
Starting program: /usr/home/vagrant/testsvn/stable/usr.bin/grep/bsdgrep --color -e i860 -e i86 /tmp/l
(no debugging symbols found)

Program received signal SIGSEGV, Segmentation fault.
0x00000008011cca60 in memchr () from /lib/libc.so.7
(gdb) bt
#0  0x00000008011cca60 in memchr () from /lib/libc.so.7
#1  0x00000008011cc599 in fwrite () from /lib/libc.so.7
#2  0x00000008011cc43d in fwrite () from /lib/libc.so.7
#3  0x0000000000404df3 in printline ()
#4  0x000000000040485a in procfile ()
#5  0x0000000000403935 in main ()
(gdb)

That looks very familiar indeed. In my last article, that was the backtrace I got on my macbook when I first started to inspect the bug.

For comparison purposes, here it is again:

* thread #1: tid = 0x14d28, 0x00007fff8cd9dc64 libsystem_platform.dylib`_platform_memchr + 84, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x101000000)
  * frame #0: 0x00007fff8cd9dc64 libsystem_platform.dylib`_platform_memchr + 84
    frame #1: 0x00007fff85789be0 libsystem_c.dylib`__sfvwrite + 272
    frame #2: 0x00007fff8578a102 libsystem_c.dylib`fwrite + 136
    frame #3: 0x0000000100003517 grep`___lldb_unnamed_function21$$grep + 313
    frame #4: 0x00000001000030bf grep`___lldb_unnamed_function16$$grep + 1707
    frame #5: 0x0000000100002039 grep`___lldb_unnamed_function5$$grep + 2781
(lldb)

Lets help ourselves out by making a debug build:

[vagrant]$ cd ../../
[vagrant]$ svn up share/mk --set-depth=infinity
<snip/>
[vagrant]$ cp share/mk/sys.mk share/mk/sys.mk.orig
<snip> hack build flags </snip>

[vagrant]$ diff share/mk/sys.mk.orig share/mk/sys.mk
61c61
< CFLAGS        ?=  -O2 -pipe
---
> CFLAGS        ?=  -g -pipe

[vagrant]$ cd usr.bin/grep
[vagrant]$ export MAKESYSPATH=../../share/mk
[vagrant]$ make clean
[vagrant]$ make
</snip>

[vagrant]$ objdump --syms bsdgrep |grep debug
0000000000000000 l    d  .debug_info    0000000000000000              .debug_info

What does valgrind have for us?

[vagrant]$ valgrind --db-attach=yes --track-origins=yes ./bsdgrep --color -e i860 -e i86 /tmp/l
==68354== Memcheck, a memory error detector
==68354== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==68354== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==68354== Command: ./bsdgrep --color -e i860 -e i86 /tmp/l
==68354== 
i860
==68354== Conditional jump or move depends on uninitialised value(s)
==68354==    at 0x1020E54: memchr (in /usr/local/lib/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==68354==    by 0x1BC8598: ??? (in /lib/libc.so.7)
==68354==    by 0x1BC843C: fwrite (in /lib/libc.so.7)
==68354==    by 0x4062FB: printline (util.c:469)
==68354==    by 0x405EEE: procline (util.c:366)
==68354==    by 0x405587: procfile (util.c:231)
==68354==    by 0x404291: main (grep.c:738)
==68354==  Uninitialised value was created by a heap allocation
==68354==    at 0x101D2B3: malloc (in /usr/local/lib/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==68354==    by 0x405814: grep_malloc (util.c:390)
==68354==    by 0x402CBA: grep_open (file.c:285)
==68354==    by 0x40535A: procfile (util.c:194)
==68354==    by 0x404291: main (grep.c:738)
==68354== 
==68354== 
==68354== ---- Attach to debugger ? --- [Return/N/n/Y/y/C/c] ---- y

<snip> valgrind stacking it </snip>

As in the previous article, valgrind crashes when I try to attach the debugger. At least it gives up some some useful info before doing so. The badness starts in the printline function, in util.c on line 469.

With valgrind down for the count, I decide to soldier on with gdb alone:

[vagrant]$ gdb --args ./bsdgrep --color -e i860 -e i86 /tmp/l
<snip>copyright</snip>
(gdb) r
Starting program: /usr/home/vagrant/testsvn/stable/usr.bin/grep/bsdgrep --color -e i860 -e i86 /tmp/l
i860
i860

Program received signal SIGSEGV, Segmentation fault.
0x00000008011d1a60 in memchr () from /lib/libc.so.7
(gdb) bt
#0  0x00000008011d1a60 in memchr () from /lib/libc.so.7
#1  0x00000008011d1599 in fwrite () from /lib/libc.so.7
#2  0x00000008011d143d in fwrite () from /lib/libc.so.7
#3  0x00000000004062fc in printline (line=0x7fffffffe878, sep=58, matches=0x7fffffffe710, m=2) at util.c:469
#4  0x0000000000405eef in procline (l=0x7fffffffe878, nottext=0) at util.c:366
#5  0x0000000000405588 in procfile (fn=0x7fffffffed65 "/tmp/l") at util.c:231
#6  0x0000000000404292 in main (argc=7, argv=0x7fffffffeab0) at grep.c:738

(gdb) up 4
(gdb) list

464         putchar(sep);
465     /* --color and -o */
466     if ((oflag || color) && m > 0) {
467         for (i = 0; i < m; i++) {
468             if (!oflag)
469                 fwrite(line->dat + a, matches[i].rm_so - a, 1,
470                     stdout);
471             if (color) 
472                 fprintf(stdout, "\33[%sm\33[K", color);

(gdb) p line->dat 
$1 = 0x801c30000 "i860\n"

(gdb) p a
$2 = 4

(gdb) p matches[i].rm_so
$3 = 0

(gdb) p (matches[i].rm_so - a)
$4 = 18446744073709551612

Ouch - that would explain the segfault. A huge value is being passed as the size param to fwrite. Why?

(gdb) p sizeof(a)
$10 = 8

(gdb) p sizeof(matches[i].rm_so)
$11 = 4

(gdb) ptype matches[i].rm_so
type = int

(gdb) ptype a
type = long unsigned int

(gdb) print (unsigned long)(-4)
$5 = 18446744073709551612

I'm not going to pretend I know all the C conversion rules down to the letter, but it seems fairly obvious that we are ending up with a -4 in a place where it shouldn't be, and it is getting converted into an unsigned long in the process.

As in the previous article, this bug comes down to the offsets rm_so and rm_eo. These are offsets that point to the beginning and end of the matching parts of the line.

I read through the code a few times, and added some debug output to get an idea of what it was trying to do.

Here is the problematic loop inside util.c written in pythonic pseudo-code:

for match in matches:
    if not print_only_matching_part:
        print non_matching_part

    if print_in_color:
        print color_escape_sequence_start

    print matching_part

    if print_in_color:
        print color_escape_sequence_end

    start_of_next_non_matching_part = match.end_offset

    if print_only_matching_part:
        print end_line

This is the c version with some debugging output added to the relevant parts:

for (i = 0; i < m; i++) {

    if (!oflag)
    {
        fprintf(stderr, "%s:%d match %d fwrite rm_so %d rm_eo %d a %zu\n",
                __FILE__, __LINE__, i, matches[i].rm_so, matches[i].rm_eo, a);
        fwrite(line->dat + a, matches[i].rm_so - a, 1, stdout);
    }
    if (color)
        fprintf(stdout, "\33[%sm\33[K", color);

    fwrite(line->dat + matches[i].rm_so,
        matches[i].rm_eo - matches[i].rm_so, 1,
        stdout);

    if (color)
        fprintf(stdout, "\33[m\33[K");

    a = matches[i].rm_eo;

    fprintf(stderr, "%s:%d match %d a = %zu\n", __FILE__, __LINE__, i, a);

    if (oflag)
        putchar('\n');
}

After adding the debug output, I ran a few experiments with different patterns:

[vagrant]$ echo i860 | ./bsdgrep --color -e i -e 6
util.c:472 match 0 fwrite rm_so 0 rm_eo 1 a 0
util.c:487 match 0 a = 1
util.c:472 match 1 fwrite rm_so 2 rm_eo 3 a 1
util.c:487 match 1 a = 3
i860

[vagrant]$ echo i860 | ./bsdgrep --color -e i860 -e i86
util.c:472 match 0 fwrite rm_so 0 rm_eo 4 a 0
util.c:487 match 0 a = 4
util.c:472 match 1 fwrite rm_so 0 rm_eo 3 a 4
i860
i860
Segmentation fault (core dumped)

This shows us that the loop is choking when handed overlapping pattern matches. There must be an assumption here that matching parts of the line do not overlap. In this case, they do - and when the code tries to print out the non-matching part of the line, it ends up with a negative value for the size of that part.

At this point I verified that this bug was still happening in the head branch, and then sent a bug report off to FreeBSD - 197555.

The probable fix for this bug is to check whether the non-matching part of the line has already been printed out by a previous match. A fairly big assumption was violated however, so some more careful testing and debugging is warranted. I am going to leave that to the FreeBSD maintainers.

Apple can either poke around and fix their own version, or wait for a fix from upstream. I do wonder why they switched from GNU grep in the first place - possibly a licensing issue?

If you've made it this far, thanks for coming along on the ride. And remember, there are always more bugs lurking just under the surface.