direct write cache invalidation failure illustrated
An associate recently berated me for not posting about work things lately. Fair enough. Here’s the start of an attempt to do more blogging about my daily work.
For me, last week ended with a thread on lkml wherein a poor user reported having actually hit the nasty case where an O_DIRECT write doesn’t invalidate the page cache after a buffered reader races to bring in stale cached data during the write.
In his case he had a writer advancing through a file, writing new contents. As it wrote, it’d wake a buffered reader, which would read up to the end of the new content that the writer had just written.
The problem was that the reader could trigger the kernel to read ahead into the region where the writer was currently writing with O_DIRECT. The kernel was failing to invalidate the existing page cache after the O_DIRECT write completed. The buffered reader would then wake and read stale data that had been brought in by read-ahead from its previous reads. Wackiness ensues!
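To make the two roles concrete, here is a minimal sketch of the writer and reader sides. It is not the actual test program. The names write_aligned_block and find_old_byte are made up for illustration, and the 4096-byte block size is an assumption taken from the step between offsets in the output below. The writer's descriptor is assumed to have been opened with O_WRONLY | O_DIRECT (which needs _GNU_SOURCE with glibc) and the reader's with a plain O_RDONLY.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK 4096 /* assumed block size, see the lead-in */

/*
 * Writer side: overwrite one block at pos with value. O_DIRECT wants
 * an aligned buffer, offset and length, so the buffer comes from
 * posix_memalign() and pos must be a multiple of BLOCK.
 * Returns 0 on success, -1 on failure.
 */
int write_aligned_block(int fd, off_t pos, unsigned char value)
{
    void *buf;
    ssize_t ret;

    assert(pos % BLOCK == 0);
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0)
        return -1;
    memset(buf, value, BLOCK);
    ret = pwrite(fd, buf, BLOCK, pos);
    free(buf);
    return ret == BLOCK ? 0 : -1;
}

/*
 * Reader side: buffered read of [from, to). Returns the file offset
 * of the first byte equal to old, -1 if there is none, or -2 on a
 * read error or unexpected end of file.
 */
off_t find_old_byte(int fd, off_t from, off_t to, unsigned char old)
{
    unsigned char buf[BLOCK];

    while (from < to) {
        size_t want = (to - from) < BLOCK ? (size_t)(to - from) : (size_t)BLOCK;
        ssize_t got = pread(fd, buf, want, from);
        ssize_t i;

        if (got <= 0)
            return -2;
        for (i = 0; i < got; i++)
            if (buf[i] == old)
                return from + i;
        from += got;
    }
    return -1;
}
```

On a correctly behaving kernel, find_old_byte should never return an offset below the position the writer has already published to the reader. The bug shows up when it does.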
So, I threw together a test case. I know you, dear readers, have been just dying to see what this terrifying corner case actually looks like. Wonder no more!
[zab@hammer c]$ ./aio-dio-invalidate-check /tmp/something
( lots of time passes )
writing 2 to 3248128
setting write_pos to to 3248128
writing 2 to 3252224
reading from 2850816 to to 3248128 looking for 1
read 3248128 write 3248128
writing 2 to 3256320
writing 2 to 3260416
writing 2 to 3264512
writing 2 to 3268608
writing 2 to 3272704
setting write_pos to to 3272704
writing 2 to 3276800
writing 2 to 3280896
setting write_pos to to 3280896
writing 2 to 3284992
writing 2 to 3289088
writing 2 to 3293184
writing 2 to 3297280
reading from 3248128 to to 3280896 looking for 1
reader found old byte at pos 3252224
[zab@hammer c]$ od -A d -x /tmp/something
0000000 0202 0202 0202 0202 0202 0202 0202 0202
*
3252224 0101 0101 0101 0101 0101 0101 0101 0101
*
3256320 0202 0202 0202 0202 0202 0202 0202 0202
*
3301376 0101 0101 0101 0101 0101 0101 0101 0101
*
8388608
[root@hammer ~]# echo 1 > /proc/sys/vm/drop_caches
[zab@hammer c]$ od -A d -x /tmp/something
0000000 0202 0202 0202 0202 0202 0202 0202 0202
*
3301376 0101 0101 0101 0101 0101 0101 0101 0101
*
8388608
That last bit shows the stale data present in the cache, the cache being purged, and then the stale data vanishing as the file is read back from disk.
Terrifying stuff, I know, but it is almost Halloween.
Malcolm Parsons wrote:
“setting write_pos to to 3272704”
is something missing between the 2 ‘to’s ?
Posted on 31-Oct-07 at 5:44 am
Zach wrote:
Haha, almost certainly. That was just rough debugging output. The real program can be found in a git repo:
http://git.kernel.org/?p=linux/kernel/git/zab/aio-dio-regress.git;a=blob;f=c/aio-dio-invalidate-readahead.c
Posted on 31-Oct-07 at 9:12 am