Channel: Active questions tagged linux-kernel - Stack Overflow

Why is the system CPU time (% sy) high?


I am running a script that loads big files. I ran the same script on a single-core OpenSuSE server and on a quad-core PC. As expected, it is much faster on my PC than on the server. But the script slows the server down and makes it impossible to do anything else on it.

My script is

for 100 iterations
Load saved data (about 10 mb)

time myscript (on the PC)

real    0m52.564s
user    0m51.768s
sys    0m0.524s

time myscript (on the server)

real    32m32.810s
user    4m37.677s
sys    12m51.524s

I wonder why "sys" is so high when I run the code on the server. I used the top command to check the memory and CPU usage. There still seems to be free memory, so swapping is not the reason. %sy is so high that it is probably the cause of the server's slowness, but I don't know what is driving %sy up. The process using the highest percentage of CPU (99%) is "myscript". %wa was zero at the time, but sometimes it gets very high (50%).

When the script is running, the load average is greater than 1, but I have never seen it go as high as 2.

I also checked my disk:

strt:~ # hdparm -tT /dev/sda

/dev/sda:
 Timing cached reads:   16480 MB in  2.00 seconds = 8247.94 MB/sec
 Timing buffered disk reads:   20 MB in  3.44 seconds =   5.81 MB/sec

john@strt:~> df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       245G  102G  131G  44% /
udev            4.0G  152K  4.0G   1% /dev
tmpfs           4.0G   76K  4.0G   1% /dev/shm

I have checked these things, but I am still not sure what the real problem on my server is or how to fix it. Can anyone identify a probable cause of the slowness? What could be the solution? Is there anything else I should check?

Thanks!


PCI driver 'Oops: Kernel access of bad area' error


I wanted to write a simple PCI Express driver for a Xilinx FPGA, but I am not able to request the memory region for the PCI device.

The question is: how do I claim that I/O memory area for a custom driver? I want to write the 3rd byte of that area from the driver.

Below are the details. What am I missing? Thanks.

1-) I am getting this error:

[    4.345350] Unable to handle kernel paging request for data at address 0x00000005
[    4.353978] Faulting instruction address: 0x80000000002c9370
[    4.358337] Oops: Kernel access of bad area, sig: 11 [#1]
[    4.362426] BE SMP NR_CPUS=24 CoreNet Generic
[    4.365477] Modules linked in: fpgapcie(O+) ucc_uart
[    4.369139] CPU: 0 PID: 1999 Comm: udevd Tainted: G           O      4.19.26+gc0c2141 #1
[    4.375924] NIP:  80000000002c9370 LR: 80000000002c9350 CTR: c00000000053acfc
[    4.381753] REGS: c0000001ee2bb1c0 TRAP: 0300   Tainted: G           O       (4.19.26+gc0c2141)
[    4.389146] MSR:  000000008002b000 <CE,EE,FP,ME>  CR: 22228242  XER: 20000000
[    4.394982] DEAR: 0000000000000005 ESR: 0000000000800000 IRQMASK: 0 
               GPR00: 80000000002c9350 c0000001ee2bb440 80000000002d1f00 000000000000001a 
               GPR04: 0000000000000001 000000000000022d c000000000f30548 c000000001013000 
               GPR08: 00000001fec37000 0000000000000003 0000000000000000 0000000000000020 
               GPR12: 0000000028228444 c000000001013000 0000000000020000 000000013c323ac8 
               GPR16: 000000013c323ae0 80000000002cc000 c000000000a194b0 c0000001f0eaa1c0 
               GPR20: 00000000006000c0 c000000000ed9da0 0000000000000000 0000000000000100 
               GPR24: 000000000000001c 000000000f700000 c0000001f3034880 0000000000000000 
               GPR28: c0000001f337b800 00000000000000f7 c0000001f337b8a0 0000000000000000 

2-) Code piece in PCI probe function:

    static int pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
    {
    int ret, minor;
    struct cdev *cdev;
    dev_t devno;
    unsigned long pci_io_addr = 0;
    /* add this pci device in pci_cdev */
    if ((minor = pci_cdev_add(pci_cdev, MAX_DEVICE, dev)) < 0)
        goto error;

    /* compute major/minor number */
    devno = MKDEV(major, minor);

    /* allocate struct cdev */
    cdev = cdev_alloc();

    /* initialise struct cdev */
    cdev_init(cdev, &pci_ops);
    cdev->owner = THIS_MODULE;

    /* register cdev */
    ret = cdev_add(cdev, devno, 1);
    if (ret < 0) {
        dev_err(&(dev->dev), "Can't register character device\n");
        goto error;
    }
    pci_cdev[minor].cdev = cdev;

    dev_info(&(dev->dev), "%s The major device number is %d (%d).\n",
           "Registeration is a success", MAJOR(devno), MINOR(devno));
    dev_info(&(dev->dev), "If you want to talk to the device driver,\n");
    dev_info(&(dev->dev), "you'll have to create a device file. \n");
    dev_info(&(dev->dev), "We suggest you use:\n");
    dev_info(&(dev->dev), "mknod %s c %d %d\n", DEVICE_NAME, MAJOR(devno), MINOR(devno));
    dev_info(&(dev->dev), "The device file name is important, because\n");
    dev_info(&(dev->dev), "the ioctl program assumes that's the\n");
    dev_info(&(dev->dev), "file you'll use.\n");

    /* enable the device */
    pci_enable_device(dev);

    /* 'alloc' IO to talk with the card */
    if (pci_request_region(dev, BAR_IO, "IO-pci") == 0) {
        printk(KERN_ALERT "The memory you requested from fpgapcie is already reserved by CORE pci driver.");
    }

    /* check that BAR_IO is *really* an IO region */
    if ((pci_resource_flags(dev, BAR_IO) & IORESOURCE_IO) != IORESOURCE_IO) {
        dev_err(&(dev->dev), "BAR2 isn't an IO region\n");
        cdev_del(cdev);
        goto error;
    }


    pci_io_addr = pci_resource_start(dev,BAR_IO);
        printk(KERN_INFO "PCI start adress: %02X", &pci_io_addr);
    outb(pci_io_addr+3, 5);
        printk(KERN_INFO "Message from PCI device to user: 5");
    return 1;

error:
        printk(KERN_INFO "An error occurred while probing pci");
    return 0;
}

3-) lspci -v output:

0001:01:00.0 Memory controller: Xilinx Corporation Device 7021
        Subsystem: Xilinx Corporation Device 0007
        Flags: bus master, fast devsel, latency 0, IRQ 41
        Memory at c10000000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [40] Power Management version 3
        Capabilities: [48] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [60] Express Endpoint, MSI 00
        Capabilities: [100] Device Serial Number 00-00-00-01-01-00-0a-35
        Kernel driver in use: yusufpci
        Kernel modules: fpgapcie
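Note that lspci above reports a memory BAR ("Memory at c10000000 (32-bit, non-prefetchable)"), not an I/O-port BAR, so the outb() family is the wrong accessor for it, and the raw bus address returned by pci_resource_start() must never be dereferenced directly. A hedged sketch of what the tail of the probe function might look like with MMIO helpers instead (BAR_IO and the offset 3 are taken from the question; this is a guess at the intent, not a verified fix):

```c
/* sketch only: map the memory BAR and write its 3rd byte via MMIO */
void __iomem *regs;

if (pci_enable_device(dev))                  /* check the result */
        goto error;
if (pci_request_region(dev, BAR_IO, "IO-pci"))
        goto error;

regs = pci_iomap(dev, BAR_IO, 0);            /* map the whole BAR */
if (!regs)
        goto error;

iowrite8(5, regs + 3);                       /* value 5 at offset 3 */
pr_info("fpgapcie: wrote 5 at BAR offset 3\n");
pci_iounmap(dev, regs);
```

The question's `outb(pci_io_addr+3, 5)` instead writes to port 5, and the `printk("%02X", &pci_io_addr)` prints the address of the local variable, which is why the logged "PCI start adress" looks like a stack address.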

4-) full dmesg:

[    4.285663] Module pci init
[    4.294787] yusufpci 0001:01:00.0: Registeration is a success The major device number is 247 (0).
[    4.302367] yusufpci 0001:01:00.0: If you want to talk to the device driver,
[    4.308116] yusufpci 0001:01:00.0: you'll have to create a device file. 
[    4.313516] yusufpci 0001:01:00.0: We suggest you use:
[    4.317354] yusufpci 0001:01:00.0: mknod virtual_pci c 247 0
[    4.321713] yusufpci 0001:01:00.0: The device file name is important, because
[    4.327553] yusufpci 0001:01:00.0: the ioctl program assumes that's the
[    4.332866] yusufpci 0001:01:00.0: file you'll use.
[    4.336459] The memory you requested from fpgapcie is already reserved by CORE pci driver. This is not an error.
[    4.336463] PCI start adress: EE2BB4B0
[    4.345350] Unable to handle kernel paging request for data at address 0x00000005
[    4.353978] Faulting instruction address: 0x80000000002c9370
[    4.358337] Oops: Kernel access of bad area, sig: 11 [#1]
[    4.362426] BE SMP NR_CPUS=24 CoreNet Generic
[    4.365477] Modules linked in: fpgapcie(O+) ucc_uart
[    4.369139] CPU: 0 PID: 1999 Comm: udevd Tainted: G           O      4.19.26+gc0c2141 #1
[    4.375924] NIP:  80000000002c9370 LR: 80000000002c9350 CTR: c00000000053acfc
[    4.381753] REGS: c0000001ee2bb1c0 TRAP: 0300   Tainted: G           O       (4.19.26+gc0c2141)
[    4.389146] MSR:  000000008002b000 <CE,EE,FP,ME>  CR: 22228242  XER: 20000000
[    4.394982] DEAR: 0000000000000005 ESR: 0000000000800000 IRQMASK: 0 
               GPR00: 80000000002c9350 c0000001ee2bb440 80000000002d1f00 000000000000001a 
               GPR04: 0000000000000001 000000000000022d c000000000f30548 c000000001013000 
               GPR08: 00000001fec37000 0000000000000003 0000000000000000 0000000000000020 
               GPR12: 0000000028228444 c000000001013000 0000000000020000 000000013c323ac8 
               GPR16: 000000013c323ae0 80000000002cc000 c000000000a194b0 c0000001f0eaa1c0 
               GPR20: 00000000006000c0 c000000000ed9da0 0000000000000000 0000000000000100 
               GPR24: 000000000000001c 000000000f700000 c0000001f3034880 0000000000000000 
               GPR28: c0000001f337b800 00000000000000f7 c0000001f337b8a0 0000000000000000 
[    4.453632] NIP [80000000002c9370] .pci_probe+0x220/0x2b4 [fpgapcie]
[    4.458680] LR [80000000002c9350] .pci_probe+0x200/0x2b4 [fpgapcie]
[    4.463639] Call Trace:
[    4.464775] [c0000001ee2bb440] [80000000002c9350] .pci_probe+0x200/0x2b4 [fpgapcie] (unreliable)
[    4.472262] [c0000001ee2bb500] [c0000000004b77c8] .pci_device_probe+0x11c/0x1f4
[    4.478270] [c0000001ee2bb5a0] [c000000000561ebc] .really_probe+0x26c/0x38c
[    4.483927] [c0000001ee2bb640] [c0000000005621ac] .driver_probe_device+0x78/0x154
[    4.490106] [c0000001ee2bb6d0] [c0000000005623d8] .__driver_attach+0x150/0x154
[    4.496025] [c0000001ee2bb760] [c00000000055f424] .bus_for_each_dev+0x94/0xdc
[    4.501856] [c0000001ee2bb800] [c0000000005615fc] .driver_attach+0x24/0x38
[    4.507426] [c0000001ee2bb870] [c000000000560ec8] .bus_add_driver+0x264/0x2a4
[    4.513258] [c0000001ee2bb910] [c000000000563384] .driver_register+0x88/0x178
[    4.519089] [c0000001ee2bb990] [c0000000004b5a68] .__pci_register_driver+0x50/0x64
[    4.525355] [c0000001ee2bba00] [80000000002c9564] .pci_init_module+0xc0/0x444 [fpgapcie]
[    4.532144] [c0000001ee2bba80] [c0000000000020b4] .do_one_initcall+0x64/0x224
[    4.537978] [c0000001ee2bbb50] [c0000000000f443c] .do_init_module+0x70/0x260
[    4.543722] [c0000001ee2bbbf0] [c0000000000f6564] .load_module+0x1e6c/0x2400
[    4.549467] [c0000001ee2bbd10] [c0000000000f6d28] .__se_sys_finit_module+0xcc/0x100
[    4.555819] [c0000001ee2bbe30] [c0000000000006b0] system_call+0x60/0x6c
[    4.561127] Instruction dump:
[    4.562785] e86a8080 38810070 f9210070 4800041d e8410028 e9210070 3d420000 e94a8088 
[    4.569231] 39290003 5529063e e94a0000 7c0004ac <992a0005> 39200001 3d420000 992d0684 
[    4.575854] ---[ end trace 2d15cff7ba1b3255 ]---

Understanding jiffy update in work queues


I'm trying to understand workqueue better using the following kernel module:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/time.h>
#include <linux/delay.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static int condition = 0;

/* declare a work queue*/
static struct work_struct wrk;

static void work_handler(struct work_struct *work)
{ 
    pr_info("Waitqueue module handler %s, ms cnt is %u\n", __FUNCTION__, jiffies_to_msecs(jiffies));
    msleep(3000);
    pr_info("Wake up the sleeping module\n");
    condition = 1;
    wake_up_interruptible(&my_wq);
}

static int __init my_init(void)
{
    pr_info("Wait queue example, ms count is %u\n", jiffies_to_msecs(jiffies));

    INIT_WORK(&wrk, work_handler);
    schedule_work(&wrk);

    pr_info("Going to sleep %s\n", __FUNCTION__);
    wait_event_interruptible(my_wq, condition != 0);

    pr_info("woken up by the work job at %ums\n", jiffies_to_msecs(jiffies));
    return 0;
}

void my_exit(void)
{
    pr_info("waitqueue example cleanup\n");
}

module_init(my_init);
module_exit(my_exit);
MODULE_AUTHOR("Aijaz Baig <aijazbaig1@gmail.com>");
MODULE_LICENSE("GPL");

Output:

Jan  3 06:53:27 buildroot user.info kernel: Wait queue example, ms count is 6144890
Jan  3 06:53:27 buildroot user.info kernel: Going to sleep my_init
Jan  3 06:53:27 buildroot user.info kernel: Waitqueue module handler work_handler, ms cnt is 6144890
Jan  3 06:53:30 buildroot user.info kernel: Wake up the sleeping module
Jan  3 06:53:30 buildroot user.info kernel: woken up by the work job at 6147900ms

Two noticeable things here:

  1. The ms difference after the sleep is actually 3010 ms.
  2. The jiffies-derived ms count in the __init function and in the work_handler function is EXACTLY the same.

Running it (removing and re-inserting the module) for a second time gives:

Jan  3 06:59:28 buildroot user.info kernel: Wait queue example, ms count is 6506580
Jan  3 06:59:28 buildroot user.info kernel: Going to sleep my_init
Jan  3 06:59:29 buildroot user.info kernel: Waitqueue module handler work_handler, ms cnt is 6506640
Jan  3 06:59:31 buildroot user.info kernel: Wake up the sleeping module
Jan  3 06:59:31 buildroot user.info kernel: woken up by the work job at 6509660ms

This time around, the jiffies counts in __init and work_handler() are different. However, the sleep duration is 3020 ms.

Which brings me to my questions:

  1. Is it normal for the actual sleep duration to differ this much from the intended one (a difference of ~10-20 ms)?
  2. How come, in the first run, there was NO update in the jiffies count whatsoever?

Please shed some light on this.

Why is an infinite loop sleeping mostly on an isolated core?


I tried the following code, expecting it to get all the execution time of an isolated CPU on CentOS 8:

#include <inttypes.h>
#include <pthread.h>
#include <sched.h>

int main()
{
    volatile uint32_t i = 0;

    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(15, &cpuset);

    pthread_t thread = pthread_self();

    int status = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    // ... error checking ..

    while (1) {
        i++;
    }

    return 0;
}

However, the output of the top command shows that CPU 15 is mostly idle:

Tasks: 287 total,   2 running, 285 sleeping,   0 stopped,   0 zombie
...
%Cpu15 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
...

Also, the time output shows that the application uses the CPU for only half of its wall-clock time. Who uses the other half?

time ./cpu_test.out 
^C
real    0m10.984s
user    0m5.494s
sys     0m0.000s

I'm using the following settings:

kernel-4.18.0-80.11.2.el8_0.x86_64
CentOS Linux release 8.0.1905

# cat /sys/devices/system/cpu/isolated
2-17

# cat /sys/devices/system/cpu/present
0-17

# cat /proc/cmdline 
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-80.11.2.el8_0.x86_64 root=/dev/mapper/cl-root ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet nosoftlockup mce=ignore_ce intel_idle.max_cstate=0 processor.max_cstate=0 nohz_full=2-17 iommu=off isolcpus=2-17 audit=0 idle=poll skew_tick=1

# gcc --version
gcc (GCC) 9.2.0

# lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              18
On-line CPU(s) list: 0-17
Thread(s) per core:  1
Core(s) per socket:  18
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
Stepping:            7
CPU MHz:             3899.999
CPU max MHz:         4000.0000
CPU min MHz:         1200.0000
BogoMIPS:            6200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-17
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

What should I do about this? I really appreciate the assistance. If you guys need any missing information, I will edit my question.

EDIT 1

After Maxim Egorushkin's comment, things keep getting weirder. I used the perf tool and its output is below; the 16th CPU's usage is stably above 99% if I run the process under perf. Without perf, I still fail to get all the CPU cycles.

# perf stat -ddd ./cpu_test.out
^C./cpu_test.out: Interrupt

 Performance counter stats for './cpu_test.out':

  14535.018449      task-clock:u (msec)       #    1.000 CPUs utilized          
             0      context-switches:u        #    0.000 K/sec                  
             0      cpu-migrations:u          #    0.000 K/sec                  
            45      page-faults:u             #    0.003 K/sec                  
56,195,943,206      cycles:u                  #    3.866 GHz                      (69.22%)
40,703,654,800      instructions:u            #    0.72  insn per cycle           (76.92%)
10,177,448,394      branches:u                #  700.202 M/sec                    (76.93%)
         3,171      branch-misses:u           #    0.00% of all branches          (76.93%)
10,179,958,967      L1-dcache-loads:u         #  700.375 M/sec                    (76.93%)
         4,651      L1-dcache-load-misses:u   #    0.00% of all L1-dcache hits    (76.93%)
           760      LLC-loads:u               #    0.052 K/sec                    (76.93%)
            12      LLC-load-misses:u         #    1.58% of all LL-cache hits     (76.93%)
<not supported>     L1-icache-loads:u                                           
          5,274     L1-icache-load-misses:u                                       (76.93%)
10,178,790,717      dTLB-loads:u              #  700.294 M/sec                    (76.92%)
             0      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   (61.53%)
             0      iTLB-loads:u              #    0.000 K/sec                    (61.53%)
             0      iTLB-load-misses:u        #    0.00% of all iTLB cache hits   (61.53%)
<not supported>     L1-dcache-prefetches:u                                      
<not supported>     L1-dcache-prefetch-misses:u                                   

  14.535710128 seconds time elapsed

  14.449669000 seconds user
   0.002993000 seconds sys

Is it safe to use Direct-IO write and Page Cache read at the same time?


For instance, open a file twice, direct-io writes with one fd, and page cache reads with the other?

How to define safe: write some data via the direct-I/O fd and then expect to read it back immediately via the page-cache fd.

Does the kernel NFS module have a concurrency limit?


Background: I was testing an NFS server with fio, and I found that no matter how high "iodepth" is set in fio, the NFS server only ever has 64 I/Os in flight. So I suspect that something around the NFS protocol limits the maximum concurrency (max I/O in flight).

The fio command is:

fio -numjobs=1 -iodepth=128 -direct=1 -ioengine=libaio -sync=1 -rw=write -bs=4k -size=500M -time_based -runtime=90 -name=Fiow -directory=/75

My NFS server is based on ganesha, and I reached the "64 in flight" conclusion using ganesha_stats.py.

So I have two options for now:

  1. Study the call graph and read the code to find the problem.

    1. I downloaded the Linux kernel source but struggle to find a starting point. Which function/source file should I begin with, maybe vfs.c:nfsd_write?
    2. I tried to use 'perf' to trace the call graph to speed up my kernel code reading, but failed, because 'perf report' shows shared-library symbols without function names.
  2. Learn the NFS protocol/mount options to find the limit.

Can somebody help me with this? :)

Raspberry Pi 4 & Linux IRQ translations: DTS, /proc/interrupts and the datasheet


I am looking at the datasheet of the BCM2835. On pages 112 and 113 there are several tables explaining where the various interrupts are. For example, the SPI interrupt is entry 54 in the peripherals interrupt table.

Now I am using Linux, writing a kernel driver, and I want to call request_irq, so I need the correct IRQ number. From the DTS file (BCM2838.dtsi) I know the IRQ is 0x76, from /proc/interrupts I know it is 35, and I have no idea how to translate between these numbers or how they relate to what the datasheet says.

I get that the actual logic is probably device-specific, but how would I handle this? How can I derive the correct IRQ number based only on the Linux source code and the datasheet? And why is there a difference between what the DTS says and what /proc/interrupts says?

How do I use the getname (fs/namei.c) function in my LKM (Linux 4.19)?


I am writing a Linux kernel module in which I want to hook the system call table, but when I print the filename argument, the content is always garbled:

[screenshot: garbled printk output]

asmlinkage long monitor_execve_hook(
    const char *filename,
    const char *argv,
    const char *envp) {
    printk(KERN_ALERT "%s", filename);
    return orig_stub_execve(filename, argv, envp);
}

Later I learned that I need to convert the path from user space to kernel space via the getname function.

At this point I found that I couldn't find this function.

I tried to include namei.h, but that didn't solve the problem. (No return value is captured here because I just want to determine whether the function exists.)

asmlinkage long monitor_execve_hook(
    const char *filename,
    const char *argv,
    const char *envp) {
    getname(filename);
    printk(KERN_ALERT "%s", filename);
    return orig_stub_execve(filename, argv, envp);
}

[screenshot: build error]

I want to know if there is any function that can replace getname, or how I can find the getname function.
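getname() is not exported to modules, so an LKM normally cannot link against it. A common substitute is to copy the user string into a kernel buffer with strncpy_from_user(), which is exported. A hedged sketch reusing the question's symbols (monitor_execve_hook and orig_stub_execve come from the poster's code):

```c
asmlinkage long monitor_execve_hook(const char __user *filename,
                                    const char *argv, const char *envp)
{
    char *kname = kmalloc(PATH_MAX, GFP_KERNEL);

    if (kname) {
        /* copy at most PATH_MAX-1 bytes from user space */
        long n = strncpy_from_user(kname, filename, PATH_MAX - 1);
        if (n > 0) {
            kname[n] = '\0';             /* ensure termination */
            printk(KERN_ALERT "execve: %s\n", kname);
        }
        kfree(kname);
    }
    return orig_stub_execve(filename, argv, envp);
}
```

The original hook printed the raw user-space pointer with %s, which is why the output was garbled.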


Why don't the likely and unlikely macros have any effect on ARM assembly code?


I took the example below from https://kernelnewbies.org/FAQ/LikelyUnlikely

#include <stdio.h>
#define likely(x)    __builtin_expect(!!(x), 1)
#define unlikely(x)  __builtin_expect(!!(x), 0)

int main(char *argv[], int argc)
{
   int a;

   /* Get the value from somewhere GCC can't optimize */
   a = atoi (argv[1]);

   if (likely (a == 2))
      a++;
   else
      a--;

   printf ("%d\n", a);

   return 0;
}

and compiled it https://godbolt.org/z/IC0aif with arm gcc 8.2 compiler.

In the original link, they tested it on x86, and the assembly output differs when likely (in the if condition above) is replaced with unlikely, which shows the branch-prediction optimisation performed by the compiler.

But when I compile the above code for ARM (arm-gcc -O2), I don't see any difference in the assembly. Below is the ARM assembly output in both cases, likely and unlikely:

main:
        push    {r4, lr}
        ldr     r0, [r0, #4]
        bl      atoi
        cmp     r0, #2
        subne   r1, r0, #1
        moveq   r1, #3
        ldr     r0, .L6
        bl      printf
        mov     r0, #0
        pop     {r4, pc}
.L6:
        .word   .LC0
.LC0:
        .ascii  "%d\012\000"

Why doesn't the compiler optimise for branch prediction on ARM?

Writable seq_file in Linux Kernel


I'm running some experiments with seq_files and have some confusion about them.

I analyzed the implementation of common functions in seq_file.c, and judging by the seq_printf implementation, the internal char *buf of struct seq_file is used entirely to store a formatted string that is copied to the user in seq_read. But there is also a seq_write function defined in seq_file.c which can write to that buffer.

QUESTION: Is it possible to reuse struct seq_file's internal buffer for data written by the user, or is it meant for output formatting only?

I currently use a separate buffer for data written by the user, and struct seq_file for output formatting only:

static char buf[4096];    
static char *limit = buf;

void *pfsw_seq_start(struct seq_file *m, loff_t *pos){
    if(*pos >= limit - buf) {
        return NULL;
    }
    char *data = buf + *pos;
    *pos = limit - buf;
    return data;
}

void pfsw_seq_stop(struct seq_file *m, void *v){ }

void *pfsw_seq_next(struct seq_file *m, void *v, loff_t *pos){ return NULL; }

int pfsw_seq_show(struct seq_file *m, void *v){
    seq_printf(m, "Data: %s\n", (char *) v);
    return 0;
}

ssize_t pfsw_seq_write(struct file *filp, const char __user * user_data, size_t sz, loff_t *off){
    if(*off < 0 || *off > sizeof buf){
        return -EINVAL;
    }
    size_t space_left_from_off = sizeof buf - (size_t) *off;
    size_t bytes_to_write = space_left_from_off <= sz ? space_left_from_off : sz;
    if(copy_from_user(buf + *off, user_data, bytes_to_write)){
        return -EFAULT;
    }
    *off += bytes_to_write;
    if(*off > limit - buf){
        limit = buf + *off;
    }
    return bytes_to_write;
}

So I defined the operations structures as

static const struct seq_operations seq_ops = {
    .start = pfsw_seq_start,
    .stop  = pfsw_seq_stop,
    .next  = pfsw_seq_next,
    .show  = pfsw_seq_show
};

int pfsw_seq_open(struct inode *ino, struct file *filp){ 
    return seq_open(filp, &seq_ops);
}

static const struct file_operations fops = {
    .open = pfsw_seq_open,
    .read = seq_read,
    .write = pfsw_seq_write,
    .release = seq_release,
};

How does fstrim not suffer from race conditions?

Where is the current_thread_info implementation for x86_64?


Inside a Linux kernel git working directory I ran:

git grep -n '*current_thread_info('

and nothing related to x86_64 appeared. The output was:

arch/arc/include/asm/thread_info.h:62:static inline __attribute_const__ struct thread_info *current_thread_info(void)
arch/arm/include/asm/thread_info.h:86:static inline struct thread_info *current_thread_info(void) __attribute_const__;
arch/arm/include/asm/thread_info.h:88:static inline struct thread_info *current_thread_info(void)
arch/c6x/include/asm/thread_info.h:62:struct thread_info *current_thread_info(void)
arch/csky/include/asm/thread_info.h:43:static inline struct thread_info *current_thread_info(void)
arch/h8300/include/asm/thread_info.h:50:static inline struct thread_info *current_thread_info(void)
arch/m68k/include/asm/thread_info.h:46:static inline struct thread_info *current_thread_info(void)
arch/microblaze/include/asm/thread_info.h:90:static inline struct thread_info *current_thread_info(void)
arch/mips/include/asm/thread_info.h:55:static inline struct thread_info *current_thread_info(void)
arch/nios2/include/asm/thread_info.h:67:static inline struct thread_info *current_thread_info(void)
arch/sh/include/asm/thread_info.h:70:static inline struct thread_info *current_thread_info(void)
arch/sparc/include/asm/thread_info_64.h:128:extern struct thread_info *current_thread_info(void);
arch/um/include/asm/thread_info.h:44:static inline struct thread_info *current_thread_info(void)
arch/unicore32/include/asm/thread_info.h:90:static inline struct thread_info *current_thread_info(void) __attribute_const__;
arch/unicore32/include/asm/thread_info.h:92:static inline struct thread_info *current_thread_info(void)
arch/xtensa/include/asm/thread_info.h:84:static inline struct thread_info *current_thread_info(void)

Any idea where to find the current_thread_info implementation for x86_64?

How do sem_post/sem_wait differentiate between memory-based and kernel-based semaphores?


I am not able to figure out how the sem_post/sem_wait functions differentiate between a memory-based and a kernel-based semaphore passed to them. A kernel-based (named) semaphore would require a system call for every operation; a memory-based one requires no system call.

Suppose we create a named semaphore sem1 using sem_open. Then under /dev/shm a file appears for this semaphore:

ls -l /dev/shm/sem.*
-rw-r----- 1 root root 16 Jan 12 14:38 sem.sem1

As this is a file, accessing it via sem_post/sem_wait would seem to require a system call. Or, if it is mapped into the process address space at sem_open time, then sem_post/sem_wait would not require a system call. So which is the correct picture for a named semaphore?

Why doesn't the dirtyCOW PoC work on my VM?


I wanted to write an article about the DirtyCOW exploit, but I can't figure out why it's not working on my system.

I have a VirtualBox virtual machine running Ubuntu 9.10 (downloaded from here), with kernel version 2.6.31-14-generic (which, according to Wikipedia, is still vulnerable).

I compiled the PoC that can be found here, and I ran the following:

cow@dirtycow:~/Desktop$ uname -r
2.6.31-14-generic
cow@dirtycow:~/Desktop$ sudo su
[sudo] password for cow:
root@dirtycow:/home/cow/Desktop# echo "hey"> weird_file
root@dirtycow:/home/cow/Desktop# exit
exit
cow@dirtycow:~/Desktop$ ./dirtyc0w weird_file AA
mmap b77f2000

procselfmem -100000000

madvise 0

cow@dirtycow:~/Desktop$ cat weird_file
hey

I am using Oracle's VirtualBox to run the Ubuntu 9.10 VM, and the stats for the virtual machine are:

  • 64 bit
  • 3MB RAM Memory with 55GB HDD
  • 3 CPUs (100% execution cap)

What is mark_page_accessed for in Linux?


I'm profiling my I/O-intensive application and noticed something strange when performing reads. Here is the part of the perf report that confused me:

[screenshot: perf report]

As can be seen, mark_page_accessed takes 20% of the file-read routine. It does not seem to be related to block-device I/O, which is performed by page_cache_async_readahead and is not even called in my case. So all the file data is in the page cache.

The void mark_page_accessed(struct page *page) implementation is commented with the following text:

/*
 * Mark a page as having seen activity.
 *
 * inactive,unreferenced    ->  inactive,referenced
 * inactive,referenced      ->  active,unreferenced
 * active,unreferenced      ->  active,referenced
 *
 * When a newly allocated page is not yet visible, so safe for non-atomic ops,
 * __SetPageReferenced(page) may be substituted for mark_page_accessed(page).
 */

So it seems it is called when a page is either inactive or unreferenced.

QUESTION: What is the reason for this mark_page_accessed overhead? Does it occur because the page is either inactive or unreferenced? What might be the cause? The page has already been read from disk, so no page fault (major or minor) occurs.


Is there any way to solve "end kernel panic - not syncing: no working init found"? [closed]


I'm trying to install Kali Linux in Oracle VM VirtualBox. I have downloaded the Kali Linux ISO file and attached it properly under "Storage devices, Controller: IDE", but it's not booting up. The error that comes up is "end kernel panic - not syncing: no working init found". Can someone help me out? I'm totally new to Linux.

How to build a working TPM2 image for Raspberry Pi with Yocto?


I want to build a Linux system with Yocto for the Raspberry Pi with IMA and TPM 2.0 support enabled. Therefore I want to compile the kernel with the IMA/EVM and TPM configs and recipes.

IMA support should be enabled through the meta-secure-core/meta-integrity layer by adding "ima" to DISTRO_FEATURES, as well as "packagegroup-ima" to IMAGE_INSTALL_append for the tools. TPM2 support should be enabled through the meta-security/meta-tpm layer by adding "tpm2" to MACHINE_FEATURES and installing "packagegroup-security-tpm2" via IMAGE_INSTALL_append.

Furthermore, if I understand it correctly, I need systemd as the init_manager.

I am on Yocto Thud (2.6.3); I tried Warrior but ran into build errors. This build produces a 4.14.x Linux kernel.

bblayers.conf:

BBLAYERS ?= " \
  /<working-dir>/poky/meta \
  /<working-dir>/poky/meta-poky \
  /<working-dir>/poky/meta-yocto-bsp \
  /<working-dir>/meta-openembedded/meta-oe \
  /<working-dir>/meta-openembedded/meta-python \
  /<working-dir>/meta-openembedded/meta-networking \
  /<working-dir>/meta-openembedded/meta-perl \
  /<working-dir>/meta-security \
  /<working-dir>/meta-security/meta-tpm \
  /<working-dir>/meta-secure-core/meta-integrity \
  /<working-dir>/meta-raspberrypi \
  "

local.conf:

MACHINE = "raspberrypi3"
...
DISTRO_FEATURES_append += "systemd ima"
VIRTUAL-RUNTIME_init_manager = "systemd"
MACHINE_FEATURES += "tpm2"
IMAGE_INSTALL_append += "packagegroup-security-tpm2 packagegroup-ima"
ENABLE_SPI_BUS = "1"
RPI_EXTRA_CONFIG = "\n \
dtoverlay=tpm-slb9670 \n"

Builds:

/<working-dir>/build/$ bitbake core-image-minimal

I expected the following entries in /proc/config.gz

For TPM:

    CONFIG_HW_RANDOM_TPM=y
    CONFIG_TCG_TPM=y
    CONFIG_TCG_TIS_CORE=y
    CONFIG_TCG_TIS=y
    CONFIG_TCG_CRB=y
    CONFIG_SECURITYFS=y

For IMA:

    CONFIG_IMA=y
    # CONFIG_IMA_KEXEC is not set
    # CONFIG_IMA_LSM_RULES is not set
    CONFIG_IMA_WRITE_POLICY=y
    CONFIG_IMA_READ_POLICY=y
    CONFIG_IMA_MEASURE_PCR_IDX=10
    # CONFIG_IMA_TEMPLATE is not set
    # CONFIG_IMA_NG_TEMPLATE is not set
    CONFIG_IMA_SIG_TEMPLATE=y
    CONFIG_IMA_DEFAULT_TEMPLATE="ima-sig"
    # CONFIG_IMA_DEFAULT_HASH_SHA1 is not set
    CONFIG_IMA_DEFAULT_HASH_SHA256=y
    # CONFIG_IMA_DEFAULT_HASH_SHA512 is not set
    # CONFIG_IMA_DEFAULT_HASH_WP512 is not set
    CONFIG_IMA_DEFAULT_HASH="sha256"
    CONFIG_IMA_APPRAISE=y
    CONFIG_IMA_LOAD_X509=y
    CONFIG_IMA_APPRAISE_BOOTPARAM=y
    CONFIG_IMA_TRUSTED_KEYRING=y
    CONFIG_IMA_KEYRINGS_PERMIT_SIGNED_BY_BUILTIN_OR_SECONDARY=y
    CONFIG_IMA_BLACKLIST_KEYRING=y
    CONFIG_IMA_X509_PATH="/etc/keys/x509_ima.der"
    # CONFIG_IMA_APPRAISE_SIGNED_INIT is not set

However, when I searched the kernel built for the Raspberry Pi for those settings, none of them were enabled.

# modprobe configs
# cat /proc/config.gz | gunzip > running.conf
# cat running.conf | grep IMA

When I previously built for qemu, I didn't have these issues and was able to confirm that my settings were enabled in the kernel; only the tools such as evmctl were installed.

Also, my settings for the Raspberry Pi's /boot/config.txt didn't seem to have any effect. In fact, there was no /boot/config.txt for me to open at all.

Ultimately, the TPM2 abrmd didn't start during boot (it reported an error), and I obviously couldn't access the TPM at /dev/tpm* via SPI. What did I do wrong? I'm new to Yocto and to system building/the Linux kernel in general.

In case it's related to the kernel version, I tried to build for 4.19 but got build errors. I also experimented with the meta-rpi layer from jumpnowtek, but it didn't fix my problem. There is also a meta-intel-iot-security/meta-integrity layer, but it's not maintained.
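One thing I'm considering (not yet tested) is forcing the options in through a kernel config fragment applied via a bbappend in my own layer. The path and file names below are hypothetical, and I'm unsure whether linux-raspberrypi merges .cfg fragments at all — that mechanism comes from the kernel-yocto config handling, so if the recipe doesn't support it, a full defconfig may be needed instead:

```
# recipes-kernel/linux/linux-raspberrypi_%.bbappend  (hypothetical, in my own layer)
FILESEXTRAPATHS_prepend := "${THISDIR}/files:"
SRC_URI += "file://tpm.cfg file://ima.cfg"

# files/tpm.cfg
CONFIG_TCG_TPM=y
CONFIG_TCG_TIS=y
CONFIG_TCG_TIS_SPI=y
CONFIG_SECURITYFS=y

# files/ima.cfg
CONFIG_IMA=y
CONFIG_IMA_APPRAISE=y
```

CONFIG_TCG_TIS_SPI in particular would seem relevant here, since the SLB9670 in the dtoverlay is attached over SPI rather than LPC.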

Manually update RHEL 7.4 while keeping the current running OS version


I have to manually download updates and install them on a server. The current running OS version is RHEL 7.4, and I must keep this OS version. I need to update packages such as vim, systemd, elfutils, curl, python-perf, kernel, perl, sudo, libssh, NetworkManager, ntp, xorg, and polkit. For the kernel, I know the version supported by RHEL 7.4 is 3.10.0-693.xx.
But for the other packages, I don't know whether I should update to the latest version. Which packages depend on the kernel and therefore shouldn't be updated to the latest version?

How to check the state of a Linux socket file descriptor?


In Windows you have SOCKET_ERROR and INVALID_SOCKET, but reading the socket documentation for Linux, all I could find is STDERR_FILENO. That seems to apply to any file descriptor, not specifically to sockets, so I don't think it would be useful for determining the state of a Linux socket file descriptor.

Here's some Windows socket code as an example:

// create a socket for connecting to a server
ConnectSocket = socket(ptr->ai_family, ptr->ai_socktype, ptr->ai_protocol);
if (ConnectSocket == INVALID_SOCKET)
{
    WSACleanup();
    fclose(fName);
    return 4;
}

Linux Kernel hangs waiting for interrupt


I am bringing up Linux kernel 5.5.0-rc1 on an arm64-based embedded system. Kernel initialization completes, but after that the kernel waits continuously for an interrupt. Below is the backtrace:

(gdb) bt
#0  arch_local_irq_enable () at /home/sami/linux/arch/arm64/include/asm/irqflags.h:37
#1  arch_cpu_idle () at /home/sami/linux/arch/arm64/kernel/process.c:126
#2  0xffff8000106eb8d4 in default_idle_call () at /home/sami/linux/kernel/sched/idle.c:94
#3  0xffff8000100d9e3c in cpuidle_idle_call () at /home/sami/linux/kernel/sched/idle.c:154
#4  do_idle () at /home/sami/linux/kernel/sched/idle.c:269
#5  0xffff8000100da07c in cpu_startup_entry (state=CPUHP_ONLINE) at /home/sami/linux/kernel/sched/idle.c:361
#6  0xffff8000106e5888 in rest_init () at /home/sami/linux/init/main.c:451
#7  0xffff8000109b09e4 in arch_call_rest_init () at /home/sami/linux/init/main.c:572
#8  0xffff8000109b0e14 in start_kernel () at /home/sami/linux/init/main.c:784
#9  0x0000000000000000 in ?? ()

Any ideas what the issue could be? Why is the kernel not receiving the interrupt?
