I'm currently working on integrating the ARM64 ftrace patches that enable full support for "dynamic ftrace with registers". Specifically, I'm working on the 4.9.200 kernel for the Pixel 3a (sargo), and the patches I'm referring to are the following:
- https://patchwork.kernel.org/patch/9275617/ <- based on the 4.8.x kernel version
- https://patchwork.kernel.org/patch/10657431/ <- based on the 4.20.x kernel version
The aforementioned patches require support for the '-fpatchable-function-entry=2' compilation option, available since GCC 8.x, which is why I added support for building the kernel with GCC 9.1. Compiling the kernel with this option properly inserts the 2 ARM64 NOP instructions at the prologue of each traceable function.
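To make the "2 NOPs at the entry" check concrete, here is a minimal sketch of how the first instruction words of a function could be verified. The helper names 'count_entry_nops' and 'has_patchable_entry' are hypothetical, and the input is assumed to be the function's first 32-bit instruction words copied from a disassembly of the kernel image; only the NOP encoding itself (0xd503201f) is an architectural fact.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AArch64 encoding of the NOP instruction. */
#define AARCH64_NOP 0xd503201fu

/*
 * Count the consecutive NOPs at the start of a function.
 * `insns` holds the function's first `n` instruction words
 * (hypothetical input, e.g. taken from a disassembly).
 */
size_t count_entry_nops(const uint32_t *insns, size_t n)
{
    size_t count = 0;

    assert(insns != NULL || n == 0);
    while (count < n && insns[count] == AARCH64_NOP)
        count++;
    return count;
}

/*
 * Return 1 when the entry starts with at least the two NOPs expected
 * from -fpatchable-function-entry=2, 0 otherwise.
 */
int has_patchable_entry(const uint32_t *insns, size_t n)
{
    return count_entry_nops(insns, n) >= 2;
}
```

A function whose first two words are both 0xd503201f passes the check; a function starting with any other instruction word does not.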
The issue is that the kernel compiled with the ported "dynamic ftrace with registers" patch (the 4.8.x and 4.20.x versions are really similar) crashes during the transition from kernel-land to user-land, specifically in the call to 'do_execve()' that spawns '/init'. The ftrace initialization and the whole initial boot sequence of the crashing kernel are identical to those of a properly booting kernel (e.g. a kernel with "dynamic ftrace" enabled without registers support).
Verbose logging ('debug', 'ignore_loglevel', 'initcall_debug' and an increased log buffer shift) is enabled, but the crash log does not actually show the reason for the failure (e.g. invalid instruction execution, invalid memory access).
I attempted to enable full KASAN+KCOV support, but this turned out to be impossible because the generated LZ4 image is too big to be loaded by the Pixel 3a bootloader, resulting in a "FAILED (remote: 'Error verifying the received boot.img: Buffer Too Small')" fastboot error. Flashing the boot image is possible, but after the crash the device enters a bootloop in which it's impossible to obtain the logs from '/sys/fs/pstore', because re-flashing the working boot image flushes the crash logs.
As an additional attempt, I ported the 4.8.x patch to the 4.9.x kernel and the 4.20.x patch to the 4.19.x kernel for the HiKey620 board (ARM64-based). Both kernels booted successfully (using the latest AOSP compiled from the 'master' branch), and I was able to use "dynamic ftrace with registers" through the API from a kernel module. At this point I'm left wondering what the difference may be between the 4.9.x kernels for the HiKey620 board and the Pixel 3a.
I've also been playing with the kernel option 'CONFIG_DEBUG_RODATA' to disable read-only memory enforcement (e.g. this old issue https://github.com/raspberrypi/linux/issues/2166 hints at the ARM kernel crashing when ftrace is enabled, which turned out to be a read-only memory issue); in my case the full boot sequence is working fine, so I excluded that as a possible cause.
To make sure that the '/init' (actually '/system/bin/init') binary is not executed at all, I put some logs and an infinite loop as the very first instructions of the entry point ('int main(..)' in the 'init/main.cpp' file). The boot process is clearly not reaching that point, which led me to exclude a problem with the setup functions of the 'first' and 'second' init user-land stages.
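The marker-and-hang check described above could look roughly like the sketch below. It is written in C for brevity (the real 'init/main.cpp' is C++) and is not the code actually used: the helper names 'write_marker' and 'halt_here' are hypothetical, and using '/dev/kmsg' as the log target is an assumption, since that node may not exist yet this early in init (hence the return value).

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write `msg` to an already existing file or device node at `path`.
 * Returns 0 when the whole message was written, -1 otherwise
 * (for example when the node does not exist yet).
 */
int write_marker(const char *path, const char *msg)
{
    size_t len;
    ssize_t written;
    int fd;

    assert(path != NULL && msg != NULL);
    len = strlen(msg);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    written = write(fd, msg, len);
    close(fd);
    return (written == (ssize_t)len) ? 0 : -1;
}

/* Park the process forever so that nothing after the marker runs. */
void halt_here(void)
{
    for (;;)
        pause();
}
```

Calling 'write_marker("/dev/kmsg", ...)' followed by 'halt_here()' as the first statements of 'main' would distinguish "init never started" from "init started and failed later": if the marker never shows up in the kernel log, the failure is on the kernel side of the exec.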
The following links point to the verbose logs of the crashing kernel (4.9.200 with "dynamic ftrace with registers") and the booting kernel (4.9.200 with "dynamic ftrace without registers"):
- https://pastebin.com/Y4zR4dyu (Pixel 3a - Crashing Log)
- https://pastebin.com/HubM6Cw9 (Pixel 3a - Booting Log)
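When comparing two such logs line by line, the timestamps differ on every line even when the messages are the same. A minimal sketch of a timestamp-insensitive comparison follows, assuming each line starts with the standard printk "[seconds.fraction] " prefix; the helper names 'skip_printk_timestamp' and 'same_log_message' are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer past a leading "[  seconds.fraction] " timestamp,
 * or `line` itself when no such prefix is present.
 */
const char *skip_printk_timestamp(const char *line)
{
    const char *p = line;

    assert(line != NULL);
    if (*p != '[')
        return line;
    p++;
    while (*p == ' ')                       /* padding before the seconds */
        p++;
    if (!isdigit((unsigned char)*p))
        return line;
    while (isdigit((unsigned char)*p))      /* seconds */
        p++;
    if (*p != '.')
        return line;
    p++;
    if (!isdigit((unsigned char)*p))
        return line;
    while (isdigit((unsigned char)*p))      /* fractional part */
        p++;
    if (*p != ']')
        return line;
    p++;
    if (*p == ' ')                          /* single separator space */
        p++;
    return p;
}

/* Return 1 when two log lines carry the same message, ignoring timestamps. */
int same_log_message(const char *a, const char *b)
{
    return strcmp(skip_printk_timestamp(a), skip_printk_timestamp(b)) == 0;
}
```

Walking both logs in parallel with 'same_log_message' gives the first line at which the crashing kernel diverges from the booting one, although lines that are merely reordered between boots would still need to be checked by hand.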
What would be the best way to debug the issue? Is there anything obvious causing it that I'm missing?
EDIT 1
I managed to get the KASAN+KCOV build working after compiling the kernel with Clang 10 and enabling the 'CONFIG_CC_OPTIMIZE_FOR_SIZE' option, which uses the -Os compilation flag to shrink the kernel image. Enabling CONFIG_GZIP reduces the size even more, allowing fastboot to boot the kernel without flashing. Clang 10 was compiled from source, which now contains full support for the '-fpatchable-function-entry' option. Even in this case the obtained crash logs do not hint at anything in particular (e.g. no KASAN crashes or warnings).
While looking for similar problems I ran into what looks like a really similar issue, which has no solution: https://unix.stackexchange.com/questions/243515/why-cant-the-kernel-run-init.