Nginx's approach to CPU scalability is to create a number of almost independent worker processes, each owning its own event queue, and then use SO_REUSEPORT to spread incoming connections (and, with proper IRQ affinity, NIC interrupt handling) relatively evenly across all cores.
Does this lead to better scalability (less shared kernel data, hence fewer locks) than creating a single Linux process with an array of threads equal to the number of CPUs, where each thread runs its own event queue?
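The per-process pattern in question can be sketched in a few lines. This is a minimal illustration (not Nginx's actual code): each worker opens its *own* listening socket on the same address and port via SO_REUSEPORT, and the kernel then load-balances new connections across the sockets, giving every worker a private accept queue with no shared accept lock. The `make_listener` helper name is mine; it requires Linux 3.9+.

```python
import socket

def make_listener(port: int) -> socket.socket:
    """Create a listening socket the way each SO_REUSEPORT worker would."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEPORT lets several sockets bind the identical (addr, port);
    # the kernel hashes incoming connections across them, so each worker
    # process accepts from its own queue independently.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(128)
    return s

if __name__ == "__main__":
    # Two "workers" shown in one process for brevity; Nginx would fork()
    # and create one such socket per worker process instead.
    a = make_listener(0)              # let the kernel pick a free port
    port = a.getsockname()[1]
    b = make_listener(port)           # second bind to the SAME port succeeds
    print("both bound to port", port)
    a.close()
    b.close()
```

Without SO_REUSEPORT the second bind would fail with EADDRINUSE; with it, each worker's socket is a separate kernel object, which is exactly the "fewer shared kernel structures" property the question is about.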
Here is an example of Nginx scaling up to around 32 CPUs. The test machine had hyper-threading disabled and 36 physical cores in total, which could be the main reason the curve flattens there; NIC saturation, or a drop in core clock frequency due to overheating, are other possible factors:
https://www.nginx.com/blog/testing-the-performance-of-nginx-and-nginx-plus-web-servers/
Also: https://dzone.com/articles/inside-nginx-how-we-designed