Tasks / Roadmap
Kernel
This is an initial batch of ideas and a work in progress. As we get further down the road we will probably add more items to this list (including more APIs which need NUMA awareness).
Description | Status | Owner | Commit / Branch / Patch
Parsing SRAT on x86 and adding domains to vm_phys | done | attilio / jeff / jhb | stable/10
bus_get_domain() and dev.foo.N.%domain | committed | adrian / jhb |
CPU_WHICH_DOMAIN and cpuset -gd | needs testing | jeff / jhb |
Teach the topology code about NUMA domains (for x86 the hierarchy is package -> domain -> core -> thread) | not started | |
bus_get_cpus() to query arbitrary CPU sets, including "local" CPUs and "best intr" CPUs (see the sketch below this table) | needs testing | jhb | https://github.com/bsdjhb/freebsd/compare/bsdjhb:master...numa_bus_get_cpus
Assign interrupts to a local CPU in intr_cpus by default on x86 | not started | |
Design a NUMA allocation policy data type | in progress | jeff | projects/numa
Remove the "cache" page queue (makes subsequent vm_phys changes simpler) | done | alc / kib / markj |
Update the vm_phys layer to accept a NUMA allocation policy | in progress | jeff | projects/numa
Update KVA allocation to be domain aware (superpages get in the way of a straight plumb of the domain from contigmalloc, kmem_*, etc. through to vm_phys) | not started | |
Update contigmalloc and kmem_* to accept a NUMA allocation policy | in progress | jeff | projects/numa
Update busdma tags to have a domain identifier and optionally a policy, inheriting from the bus default (e.g. acpi-pci) | not started | |
Update static bus_dma allocations to allocate busdma memory locally, using the busdma tag domain identifier/policy | not started | |
Add NUMA awareness to UMA | in progress | jeff | projects/numa
Per-domain page daemon improvements | not started | |
Per-domain free list locking | not started | |
Migrate PCPU allocations to be domain-local | not started | |
Migrate vm_page_t and other kernel structures to be domain-local | not started | |
(Optionally) migrate vm_page_t and other memory/VM management structures into a single 1G superpage where possible, rather than leaving them at the top of physmem, which is typically not backed by a single 1G superpage | not started | |
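The bus_get_domain() and bus_get_cpus() items above are what a driver would use to discover where its device lives. Below is a minimal sketch of how a driver might consume those calls; the driver name (foo), the softc layout, and the fallback behaviour are made up for illustration, and the bus_get_cpus() interface is assumed to match jhb's numa_bus_get_cpus branch. The same domain value is what shows up in userland as dev.foo.N.%domain.

{{{
/*
 * Hypothetical attach fragment: "foo_attach" and "struct foo_softc" are
 * invented names; the point is only to show the two queries.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bus.h>
#include <sys/cpuset.h>
#include <sys/smp.h>

struct foo_softc {
	int		sc_domain;	/* NUMA domain of the device, or -1 */
	cpuset_t	sc_local_cpus;	/* CPUs close to the device */
};

static int
foo_attach(device_t dev)
{
	struct foo_softc *sc = device_get_softc(dev);

	/* A non-zero return means the bus has no domain information. */
	if (bus_get_domain(dev, &sc->sc_domain) != 0)
		sc->sc_domain = -1;

	/* Ask for the CPUs local to the device; fall back to all CPUs. */
	if (bus_get_cpus(dev, LOCAL_CPUS, sizeof(sc->sc_local_cpus),
	    &sc->sc_local_cpus) != 0)
		CPU_COPY(&all_cpus, &sc->sc_local_cpus);

	device_printf(dev, "domain %d, %d local CPUs\n", sc->sc_domain,
	    CPU_COUNT(&sc->sc_local_cpus));
	return (0);
}
}}}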
KVA allocation (to enable malloc/contigmalloc)
One of the big steps required to get NUMA-aware malloc/contigmalloc/UMA is a domain-aware KVA allocator. Unfortunately, domain-aware physical page allocation isn't enough - the superpage reservation framework gets in the way, because the upper layers that allocate KVA (and then back it with physical pages) don't know that the underlying page allocation may come from a 2MB superpage. Some experiments were done to plumb a domain id (or -1 for the default) from contigmalloc/kmem_malloc through the vm_reserv layer to vm_phys page allocation - and it didn't quite work.
For example:
* allocate a 4k page for domain 0 - this allocates KVA block A, backs it with 4k page PG(A) from physical superpage S(A), and fills it in with physical page PHYSPG(A);
* allocate a 4k page for domain 1 - this allocates KVA block A+1 and backs it with PG(A)+1, which is in the same superpage S(A), so it comes from PHYSPG(A) and thus ends up on domain 0.
So, there are some solutions:
- Don't use superpages. In this case, we can just expect page-sized KVA allocations to map to physical pages, plumb in the domain id through the layers, and we're done. Superpages could be done as an explicit request, rather than trying to make them magically happen.
- Use superpages, but put a UMA-style layer in front of domain-specific memory allocations so that superpage-aligned KVA is always requested from contigmalloc/kmem_alloc.
- Use a domain-specific KVA vmem allocator in front of the global KVA vmem, and allocate domain-specific kmem from that.
The last option is what a number of people on IRC suggested:
- create domain-specific KVA vmem arenas;
- each of these would be chained from the global KVA vmem and would allocate KVA in superpage-sized/aligned chunks;
- push the logic for domain allocation from vm_phys up to the malloc layers - i.e., those routines would implement the NUMA allocation policy for physical pages and explicitly ask for domains;
- add a domain id field to vm_page_t so allocations can be freed back to the relevant KVA vmem pool;
- (one suggestion was to skip that and instead look at the underlying physical page to see where the KVA came from, and free it back to that pool; but that seems fragile).
The challenge here is how to defragment the per-domain KVA vmem allocations back into the main KVA pool; anything that doesn't go through the domain-specific KVA pools can otherwise get starved of KVA.
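A rough sketch of the per-domain arena idea follows (this is not code from projects/numa). It uses the existing vmem(9) API: one arena per domain, each importing superpage-sized, superpage-aligned chunks of KVA from the global kernel_arena. The arena names, the 2MB import size, and the init function are assumptions for illustration; how the chunks are then backed with domain-local physical pages, and how frees find their way back to the right arena (e.g. via a domain id in vm_page_t), is exactly the open design work described above.

{{{
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/vmem.h>

#include <vm/vm.h>
#include <vm/vm_kern.h>

#define	DOMAIN_KVA_IMPORT	(2 * 1024 * 1024)	/* one superpage-sized chunk */

static vmem_t *domain_kva_arena[MAXMEMDOM];

/* Import a superpage-aligned chunk of KVA from the global kernel arena. */
static int
domain_kva_import(void *arg, vmem_size_t size, int flags, vmem_addr_t *addrp)
{

	return (vmem_xalloc(kernel_arena, size, DOMAIN_KVA_IMPORT, 0, 0,
	    VMEM_ADDR_MIN, VMEM_ADDR_MAX, flags, addrp));
}

/* Give a chunk back to the global kernel arena when a domain arena drains. */
static void
domain_kva_release(void *arg, vmem_addr_t addr, vmem_size_t size)
{

	vmem_free(kernel_arena, addr, size);
}

/* Create one empty arena per domain; all KVA arrives via the import hook. */
static void
domain_kva_init(int ndomains)
{
	char name[32];
	int i;

	for (i = 0; i < ndomains; i++) {
		snprintf(name, sizeof(name), "kva domain %d", i);
		domain_kva_arena[i] = vmem_create(name, 0, 0, PAGE_SIZE,
		    0, M_WAITOK);
		vmem_set_import(domain_kva_arena[i], domain_kva_import,
		    domain_kva_release, NULL, DOMAIN_KVA_IMPORT);
	}
}
}}}

A domain-aware kmem_malloc()/contigmalloc() would then vmem_alloc() its KVA from domain_kva_arena[domain] instead of from kernel_arena directly, and back it with pages from that domain's vm_phys free lists.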
Userland
One open question here is what range of policies we want to support. Linux supports a process-wide allocation policy that can be overridden for specific mappings. Do we want to support process-wide policies? Do we want per-thread policies as well? Per-object policies? And for whatever range of policies is supported, what is the precedence ordering?
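To make the precedence question concrete, here is a purely hypothetical sketch - none of these types or names exist in the tree - of a policy type plus a lookup in which the most specific scope wins (object over thread over process over the system default). It only illustrates one possible precedence ordering, not the design in projects/numa.

{{{
#include <stddef.h>

/* Hypothetical policy kinds; NONE means "nothing set at this scope". */
enum numa_policy_type {
	NUMA_POLICY_NONE = 0,
	NUMA_POLICY_ROUNDROBIN,	/* stripe pages across a set of domains */
	NUMA_POLICY_FIRSTTOUCH,	/* allocate on the faulting thread's domain */
	NUMA_POLICY_FIXED	/* allocate only from np_domain */
};

struct numa_policy {
	enum numa_policy_type	np_type;
	int			np_domain;	/* used by NUMA_POLICY_FIXED */
};

/*
 * Resolve the effective policy for an allocation: the most specific
 * scope that actually has a policy set wins.
 */
static const struct numa_policy *
numa_policy_resolve(const struct numa_policy *obj,
    const struct numa_policy *thread, const struct numa_policy *proc,
    const struct numa_policy *sysdefault)
{

	if (obj != NULL && obj->np_type != NUMA_POLICY_NONE)
		return (obj);
	if (thread != NULL && thread->np_type != NUMA_POLICY_NONE)
		return (thread);
	if (proc != NULL && proc->np_type != NUMA_POLICY_NONE)
		return (proc);
	return (sysdefault);
}
}}}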
Description | Status | Owner | Commit / Branch / Patch
Prototype process-wide policies | in progress | jeff | projects/numa
Implement mapping policies (vm_map) | not started | |
libnuma-like API? | not started | adrian |
numactl-like functionality to adjust the policy for new and/or existing processes | completed | adrian |
A monitoring tool akin to numa-top | not started | |
Reviews
Link | Description
 | migrate taskqueue_start_threads_pinned() -> taskqueue_start_threads_cpuset()
 | skip gratuitous inactive queueing
 | per-cpu page cache