The kernel is an old one (3.1.x), and the board is an S3C2416 with external NAND flash.
I am struggling with a bug in which the NAND flash ECC has a fairly high probability of containing a one-bit flip, at a more or less fixed location within the 3x8 bytes of ECC that are calculated from the 256x8 bytes of a single NAND page. I have already confirmed that this is not a NAND media issue, since the bit flip was observed before the ECC was sent to the NAND controller. The ECC can be either a hardware ECC (HWECC) generated by the NAND controller or a software ECC generated by kernel code; in either case, there is always a chance of getting a bad ECC with a one-bit flip. Therefore it is not the NAND controller's fault either.
The calculated ECC is kept in a kernel buffer, chip->ecc->calc_buf, which is allocated with kmalloc(mtd->oobsize, GFP_KERNEL), where oobsize = 64. My final finding was this: if I skip kmalloc() and instead assign a pre-allocated buffer (a static variable in the nand_base.c module) to chip->ecc->calc_buf, the same test can never reproduce the error. The code change looks like this:
    static unsigned char my_buff[oobsize];
    ...
    chip->ecc->calc_buf = my_buff;
An earlier finding was that, while still using kmalloc(), simply inserting some delay after the NAND read/write operations also made the same test unable to reproduce the error.
Given these facts, the only reasonable explanation I can come up with is this: the kmalloc-ed buffer is faster than the in-module static buffer, so accessing the in-module buffer effectively amounts to inserting some delay implicitly. Do you think my guess could be right? I think the kmalloc-ed buffer is at least word aligned, but I would expect my_buff to be aligned as well.
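To check the alignment assumption rather than guess, something like the following could be used. It is only a user-space sketch: buffer_alignment() is a hypothetical helper name, not a kernel function, and in the driver the same arithmetic would have to be applied to the two buffer addresses (for example after printing them).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the largest power-of-two alignment,
 * capped at max_align, that the address p satisfies.
 */
static size_t buffer_alignment(const void *p, size_t max_align)
{
    uintptr_t addr = (uintptr_t)p;
    size_t align = 1;

    assert(max_align >= 1);

    /* Double the alignment while the address is still a multiple of it. */
    while (align < max_align && (addr % (align * 2)) == 0)
        align *= 2;

    return align;
}
```

Calling it with a cap of 64 on both the kmalloc-ed address and the address of my_buff would show whether the two buffers really differ in alignment.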
BTW: the pattern of the ECC bit flip is very interesting. In almost 90% of the cases where the error happened while reading/writing the same page, it was the same error:
correct ECC:   5a a9 a6 a5 69 a6 00 **30** 33 69 59 56 3f c3 cf 56 99 66 ff 03 33 6a a9 66
incorrect ECC: 5a a9 a6 a5 69 a6 00 **10** 33 69 59 56 3f c3 cf 56 99 66 ff 03 33 6a a9 66
For the remaining 10%, it was still a '1' that became a '0', just at a different location. (Actually, I would rather think of it as a '0' that failed to be updated to '1'.)
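For reference, the comparison of a correct and an incorrect ECC can be sketched as below. This is a minimal stand-alone sketch, not code from the kernel: struct ecc_flip and ecc_find_flips() are hypothetical names, and the two buffers are assumed to be plain byte arrays of equal length.

```c
#include <assert.h>
#include <stddef.h>

/* One differing bit between an expected and an observed ECC buffer. */
struct ecc_flip {
    size_t byte;       /* zero-based byte offset within the ECC buffer */
    unsigned bit;      /* bit index, 0 = least significant bit */
    int one_to_zero;   /* 1 if the expected bit was 1 and the observed bit is 0 */
};

/*
 * Hypothetical helper: XOR the two buffers byte by byte and record up to
 * max_out differing bits in out[]. Returns the total number of differing
 * bits, which may be larger than max_out.
 */
static size_t ecc_find_flips(const unsigned char *good, const unsigned char *bad,
                             size_t len, struct ecc_flip *out, size_t max_out)
{
    size_t count = 0;

    assert(good != NULL && bad != NULL);

    for (size_t i = 0; i < len; i++) {
        unsigned char diff = good[i] ^ bad[i];

        for (unsigned b = 0; b < 8; b++) {
            if (!(diff & (1u << b)))
                continue;
            if (count < max_out) {
                out[count].byte = i;
                out[count].bit = b;
                out[count].one_to_zero = (good[i] >> b) & 1;
            }
            count++;
        }
    }
    return count;
}
```

For the two dumps above, this reports a single difference at byte 7 (zero-based), bit 5, where the expected '1' shows up as '0' (0x30 XOR 0x10 = 0x20).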
Thanks.
woody