Tasks / Roadmap
Kernel
This is an initial batch of ideas and a work in progress. As we get further down the road we will probably add more items to this list (including more APIs which need NUMA awareness).
Description | Status | Owner | Commit / Branch / Patch
Parsing SRAT on x86 and adding domains to vm_phys | done | attilio / jeff / jhb | stable/10
bus_get_domain() and dev.foo.N.%domain | committed | adrian / jhb |
CPU_WHICH_DOMAIN and cpuset -gd | needs testing | jeff / jhb |
Teach the topology code about NUMA domains (for x86 the hierarchy is package -> domain -> core -> thread) | not started | |
bus_get_cpus() to query arbitrary CPU sets, including "local" CPUs and "best intr" CPUs (see the sketch below this table) | needs testing | jhb | https://github.com/bsdjhb/freebsd/compare/bsdjhb:master...numa_bus_get_cpus
Assign interrupts to a local CPU in intr_cpus by default on x86 | not started | |
Design a NUMA allocation policy data type | in progress | jeff | projects/numa
Remove the "cache" page queue (makes subsequent vm_phys changes simpler) | done | alc / kib / markj |
Update the vm_phys layer to accept a NUMA allocation policy | in progress | jeff | projects/numa
Update KVA allocation to be domain aware (superpages get in the way of a straight plumb of the domain from contigmalloc, kmem_*, etc. through to vm_phys) | not started | |
Update contigmalloc and kmem_* to accept a NUMA allocation policy | in progress | jeff | projects/numa
Update busdma tags to have a domain identifier and optionally a policy, inheriting from the bus default (e.g. acpi-pci) | not started | |
Update static bus_dma allocations to allocate busdma memory locally, using the busdma tag domain identifier/policy | not started | |
Add NUMA awareness to UMA | in progress | jeff | projects/numa
Per-domain page daemon improvements | not started | |
Per-domain free list locking | not started | |
Migrate PCPU allocations to be domain-local | not started | |
Migrate vm_page_t and other kernel structures to be domain-local | not started | |
(Optionally) migrate vm_page_t and other memory/VM management structures into a single 1G superpage where possible, rather than leaving them at the top of physmem, which is typically not backed by a single 1G superpage | not started | |
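The bus_get_domain() and bus_get_cpus() items above are what a driver would use to discover where its device lives. Below is a minimal sketch of how a driver might consume those calls; the driver name (foo), the softc layout, and the fallback behaviour are made up for illustration, and the bus_get_cpus() interface is assumed to match jhb's numa_bus_get_cpus branch. The same domain value is what shows up in userland as dev.foo.N.%domain.

{{{
/*
 * Hypothetical attach fragment: "foo_attach" and "struct foo_softc" are
 * invented names; the point is only to show the two queries.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bus.h>
#include <sys/cpuset.h>
#include <sys/smp.h>

struct foo_softc {
	int		sc_domain;	/* NUMA domain of the device, or -1 */
	cpuset_t	sc_local_cpus;	/* CPUs close to the device */
};

static int
foo_attach(device_t dev)
{
	struct foo_softc *sc = device_get_softc(dev);

	/* A non-zero return means the bus has no domain information. */
	if (bus_get_domain(dev, &sc->sc_domain) != 0)
		sc->sc_domain = -1;

	/* Ask for the CPUs local to the device; fall back to all CPUs. */
	if (bus_get_cpus(dev, LOCAL_CPUS, sizeof(sc->sc_local_cpus),
	    &sc->sc_local_cpus) != 0)
		CPU_COPY(&all_cpus, &sc->sc_local_cpus);

	device_printf(dev, "domain %d, %d local CPUs\n", sc->sc_domain,
	    CPU_COUNT(&sc->sc_local_cpus));
	return (0);
}
}}}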
KVA allocation (to enable malloc/contigmalloc)
One of the big steps required to get NUMA-aware malloc/contigmalloc/UMA is a domain-aware KVA allocator. Unfortunately, domain-aware physical page allocation isn't enough - the superpage reservation framework gets in the way, because the upper layers that allocate KVA (and then back it with physical pages) don't know that the underlying page allocation may come from a 2MB superpage. Some experiments were done to plumb a domain id (or -1 for the default) from contigmalloc/kmem_malloc through the vm_reserv layer to vm_phys page allocation - and it didn't quite work.
For example:
* allocate a 4k page for domain 0 - this allocates KVA block A, backs it with 4k page PG(A) from physical superpage S(A), and fills it in with physical page PHYSPG(A);
* allocate a 4k page for domain 1 - this allocates KVA block A+1 and backs it with PG(A)+1, which is in the same superpage S(A), so it comes from PHYSPG(A) and thus ends up on domain 0.
So, there are some solutions:
- Don't use superpages. In this case, we can just expect page-sized KVA allocations to map to physical pages, plumb in the domain id through the layers, and we're done. Superpages could be done as an explicit request, rather than trying to make them magically happen.
- Use superpages, but put a UMA-style layer in front of domain-specific memory allocations so that superpage-aligned KVA is always requested from contigmalloc/kmem_alloc.
- Use a domain-specific KVA vmem allocator in front of the global KVA vmem, and allocate domain-specific kmem from that.
The last option is what a number of people on IRC suggested:
- create domain-specific KVA vmem arenas;
- each of these would be chained from the global KVA vmem and would allocate KVA in superpage-sized/aligned chunks;
- push the logic for domain allocation from vm_phys up to the malloc layers - i.e., those routines would implement the NUMA allocation policy for physical pages and explicitly ask for domains;
- add a domain id field to vm_page_t so allocations can be freed back to the relevant KVA vmem pool;
- (one suggestion was to skip that and instead look at the underlying physical page to see where the KVA came from, and free it back to that pool; but that seems fragile).
The challenge here is how to defragment the per-domain KVA vmem allocations back into the main KVA pool; anything that doesn't go through the domain-specific KVA pools can otherwise get starved of KVA.
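A rough sketch of the per-domain arena idea follows (this is not code from projects/numa). It uses the existing vmem(9) API: one arena per domain, each importing superpage-sized, superpage-aligned chunks of KVA from the global kernel_arena. The arena names, the 2MB import size, and the init function are assumptions for illustration; how the chunks are then backed with domain-local physical pages, and how frees find their way back to the right arena (e.g. via a domain id in vm_page_t), is exactly the open design work described above.

{{{
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/vmem.h>

#include <vm/vm.h>
#include <vm/vm_kern.h>

#define	DOMAIN_KVA_IMPORT	(2 * 1024 * 1024)	/* one superpage-sized chunk */

static vmem_t *domain_kva_arena[MAXMEMDOM];

/* Import a superpage-aligned chunk of KVA from the global kernel arena. */
static int
domain_kva_import(void *arg, vmem_size_t size, int flags, vmem_addr_t *addrp)
{

	return (vmem_xalloc(kernel_arena, size, DOMAIN_KVA_IMPORT, 0, 0,
	    VMEM_ADDR_MIN, VMEM_ADDR_MAX, flags, addrp));
}

/* Give a chunk back to the global kernel arena when a domain arena drains. */
static void
domain_kva_release(void *arg, vmem_addr_t addr, vmem_size_t size)
{

	vmem_free(kernel_arena, addr, size);
}

/* Create one empty arena per domain; all KVA arrives via the import hook. */
static void
domain_kva_init(int ndomains)
{
	char name[32];
	int i;

	for (i = 0; i < ndomains; i++) {
		snprintf(name, sizeof(name), "kva domain %d", i);
		domain_kva_arena[i] = vmem_create(name, 0, 0, PAGE_SIZE,
		    0, M_WAITOK);
		vmem_set_import(domain_kva_arena[i], domain_kva_import,
		    domain_kva_release, NULL, DOMAIN_KVA_IMPORT);
	}
}
}}}

A domain-aware kmem_malloc()/contigmalloc() would then vmem_alloc() its KVA from domain_kva_arena[domain] instead of from kernel_arena directly, and back it with pages from that domain's vm_phys free lists.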
Userland
One open question here is what range of policies we want to support. Linux supports a process-wide allocation policy that can be overridden for specific mappings. Do we want to support process-wide policies? Do we want per-thread policies as well? Per-object policies? And for whatever range of policies is supported, what is the precedence ordering?
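To make the precedence question concrete, here is a purely hypothetical sketch - none of these types or names exist in the tree - of a policy type plus a lookup in which the most specific scope wins (object over thread over process over the system default). It only illustrates one possible precedence ordering, not the design in projects/numa.

{{{
#include <stddef.h>

/* Hypothetical policy kinds; NONE means "nothing set at this scope". */
enum numa_policy_type {
	NUMA_POLICY_NONE = 0,
	NUMA_POLICY_ROUNDROBIN,	/* stripe pages across a set of domains */
	NUMA_POLICY_FIRSTTOUCH,	/* allocate on the faulting thread's domain */
	NUMA_POLICY_FIXED	/* allocate only from np_domain */
};

struct numa_policy {
	enum numa_policy_type	np_type;
	int			np_domain;	/* used by NUMA_POLICY_FIXED */
};

/*
 * Resolve the effective policy for an allocation: the most specific
 * scope that actually has a policy set wins.
 */
static const struct numa_policy *
numa_policy_resolve(const struct numa_policy *obj,
    const struct numa_policy *thread, const struct numa_policy *proc,
    const struct numa_policy *sysdefault)
{

	if (obj != NULL && obj->np_type != NUMA_POLICY_NONE)
		return (obj);
	if (thread != NULL && thread->np_type != NUMA_POLICY_NONE)
		return (thread);
	if (proc != NULL && proc->np_type != NUMA_POLICY_NONE)
		return (proc);
	return (sysdefault);
}
}}}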
Description | Status | Owner | Commit / Branch / Patch
Prototype process-wide policies | in progress | jeff | projects/numa
Implement mapping policies (vm_map) | not started | |
libnuma-like API? | not started | adrian |
numactl-like functionality to adjust the policy for new and/or existing processes | completed | adrian |
A monitoring tool akin to numa-top | not started | |
Reviews
Link | Description
 | migrate taskqueue_start_threads_pinned() -> taskqueue_start_threads_cpuset()
 | skip gratuitous inactive queueing
 | per-cpu page cache