I got nerd sniped by Daniel Lemire's parsing problem a little bit.

I think the fast swar approach in his article is getting close to optimal, it's hard to beat a constant time operation that has a small constant.

Nevertheless, my stab at trying to go faster involves creating a perfect hash table, with a very fast (mul, shift, mask) hash function. The hash table key is a four byte string cast to a u32, and the value is the resultant parsed integer. It trades a bit of space (512 bytes) to get the speed.

An earlier version of this code burned a bit more space because I was too lazy to tweak the multiplier.

It has now been made compact by performing a search for a better multiplier constant. It could also be optimized a bit further in asm. This isn't a complete solution as it does not do all the error checking and what not, but I believe this could be made to go a bit faster than the parse_uint8_fastswar implementation, as it should have a smaller constant (a multiply, a shift, a mask and a lookup).

To parse the ascii int, it must be padded with nuls up to the four bytes, and should be aligned properly so the casting works. Then the parse should just be something like so:

lut[simple_hash(*(unsigned*)(str))];

The lut building code:

#define SIZE 512
uint8_t lut[SIZE] = {};

// multiply, shift, mask
uint32_t simple_hash(uint32_t u) {
    uint64_t h = (uint64_t) u * 0x43ff9fb13510940a;
    h = (h >> 32) % SIZE;
    return (uint32_t) h;
}

// generate, cast and hash
void build_lut() {
    char strings[256*4];
    memset(strings, 0, sizeof(strings));
    char *iter = strings;
    for (int i = 0; i < 256; ++i) {
        sprintf(iter, "%d", i);
        iter += 4;
    }

    iter = strings;
    for (int i = 0; i < 256; ++i) {
        unsigned c = *(unsigned*) iter;
        iter += 4;
        unsigned idx = simple_hash(c);
        lut[idx] = i;
    }
}

Overall, a fun exercise in limits and mechanical sympathy.