rhel6再次发现jbd2的bug
RobinDong 2013-07-17 17:38
从上次修复jbd2的bug后到现在,已经四个月了,突然昨天DB的机器又出现了panic,用netoops抓到的信息如下:
居然又栽在jbd2上了,panic的位置和上次一样是
J_ASSERT(journal->j_running_transaction != NULL);
考虑到是运行很长时间的DB机器才出现这个问题,疑点还是落到tid上,由文卿同学找到了现象最符合的upstream的补丁:
commit deeeaf13b291420fe4a4a52606b9fc9128387340 Author: Theodore Ts'o Date: Sun May 1 18:16:26 2011 -0400 jbd2: fix fsync() tid wraparound bug If an application program does not make any changes to the indirect blocks or extent tree, i_datasync_tid will not get updated. If there are enough commits (i.e., 2**31) such that tid_geq()'s calculations wrap, and there isn't a currently active transaction at the time of the fdatasync() call, this can end up triggering a BUG_ON in fs/jbd2/commit.c: J_ASSERT(journal->j_running_transaction != NULL); It's pretty rare that this can happen, since it requires the use of fdatasync() plus *very* frequent and excessive use of fsync(). But with the right workload, it can. We fix this by replacing the use of tid_geq() with an equality test, since there's only one valid transaction id that we is valid for us to wait until it is commited: namely, the currently running transaction (if it exists). Signed-off-by: "Theodore Ts'o" diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index e0ec3db..8c10004 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -479,9 +479,12 @@ int __jbd2_log_space_left(journal_t *journal) int __jbd2_log_start_commit(journal_t *journal, tid_t target) { /* - * Are we already doing a recent enough commit? + * The only transaction we can possibly wait upon is the + * currently running transaction (if it exists). Otherwise, + * the target tid must be an old one. */ - if (!tid_geq(journal->j_commit_request, target)) { + if (journal->j_running_transaction && + journal->j_running_transaction->t_tid == target) { /* * We want a new commit: OK, mark the request and wakeup the * commit thread. We do _not_ do the commit ourselves. @@ -493,7 +496,14 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target) journal->j_commit_sequence); wake_up(&journal->j_wait_commit); return 1; - } + } else if (!tid_geq(journal->j_commit_request, target)) + /* This should never happen, but if it does, preserve + the evidence before kjournald goes into a loop and + increments j_commit_sequence beyond all recognition. */ + WARN(1, "jbd: bad log_start_commit: %u %u %u %u\n", + journal->j_commit_request, journal->j_commit_sequence, + target, journal->j_running_transaction ? + journal->j_running_transaction->t_tid : 0); return 0; }
原来,做fdatasync()时,会调用 ext4_sync_file():
fs/ext4/fsync.c ext4_sync_file() commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid; printk("datasync:%lu, sync:%lu\n", ei->i_datasync_tid, ei->i_sync_tid); if (jbd2_log_start_commit(journal, commit_tid)) { ......
直接把i_datasync_tid赋给target,然后调用__jbd2_log_start_commit(),由于i_datasync_tid只在新创建inode或者ext4_iget()时更新,而数据库在打开某些文件后几乎再不关闭,而且一直是覆盖写,extent树不变,故而i_datasync_tid始终不更新,保持一个很小的值,而j_commit_request却随着jbd2日志的不断提交而不停增长,最后,它俩的差值越来越大,直至差值超过int型上限,tid_geq判断出错。
真是此恨绵绵,上次通过给i_datasync_tid赋正确的初值避免了“0比2177452108更晚”这一逻辑错误,修复了,但现在又是由于j_commit_request和target之间差值太大(大于了2的31次方)再次造成判断错误,所以又触发了BUG_ON。这次明显Ted哥也抓狂了,折腾tid_geq这破逻辑干嘛,直接判断j_running_transaction是不是NULL得了….
重现方法由伯瑜发现:打开一个文件,不停的覆盖写和fdatasync();另外启动一个进程打开另一个文件不停覆盖写和fsync()。重现时间需要三个多月,我们在线下修改了部分内核代码才加速复现的。
上次线上临时修复这个问题是umount ext4文件系统后用专用工具直接将tid置为0,这次其实临时方案更简单:umount后再mount就行了,因为这样会让i_datasync_tid更新为最新的tid,避免了它们之间差值不断变大。当然,文卿同学正在用kprobe的机制写了个hotfix的内核模块,这个才是正解。
DB的运维同学又渡过了一个难眠的夜晚,给大家添麻烦了,实在抱歉!