Make alloc_pages_node() ignore cpusets

Currently alloc_pages_node() obeys cpusets, i.e. an active cpuset
restriction will redirect allocations away from the desired node. This
impacts several kernel mechanisms:

1. The slab allocator allocates structures per node. These will all be
   relocated to the cpuset nodes. This could happen, e.g., when a device
   driver or an additional module is loaded after the system is already
   running.

2. A device driver, the block layer or the network layer attempts to
   allocate device-local memory but is loaded in the context of a
   restrictive cpuset.

3. Page migration attempts to follow instructions from the cpuset layer
   to migrate pages to the target nodes, but the cpuset only allows the
   use of the new nodes after migration is complete. Cpuset page
   migration therefore only succeeds if done from a cpuset that also
   allows allocation on the target nodes.

This patch adds a __GFP_NO_CPUSET flag that is set by alloc_pages_node()
and that disables the cpuset check in __alloc_pages() (more precisely,
the flag is checked in the cpuset code itself). __GFP_NO_CPUSET may also
be useful for other functions that operate at a very low level in order
to ensure node-local allocation for performance-critical code paths.

All uses of alloc_pages_node() that I am aware of are passed node numbers
that have either been checked for validity or are required by mechanisms
that have nothing to do with user space, so this should not open up
additional problems.
Signed-off-by: Christoph Lameter

Index: linux-2.6.16-rc6/include/linux/gfp.h
===================================================================
--- linux-2.6.16-rc6.orig/include/linux/gfp.h	2006-03-11 14:12:55.000000000 -0800
+++ linux-2.6.16-rc6/include/linux/gfp.h	2006-03-17 18:58:14.000000000 -0800
@@ -47,6 +47,7 @@ struct vm_area_struct;
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_NO_CPUSET  ((__force gfp_t)0x40000u) /* Ignore cpuset constraints */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -113,6 +114,14 @@ static inline struct page *alloc_pages_n
 	/* Unknown node is current node */
 	if (nid < 0)
 		nid = numa_node_id();
+	else
+		/*
+		 * Caller is asking for a specific node. It is important to
+		 * satisfy that request, otherwise slab allocation, page
+		 * migration and device-local memory allocation get memory
+		 * from the wrong nodes. Therefore we bypass cpuset constraints.
+		 */
+		gfp_mask |= __GFP_NO_CPUSET;
 
 	return __alloc_pages(gfp_mask, order,
 		NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
Index: linux-2.6.16-rc6/kernel/cpuset.c
===================================================================
--- linux-2.6.16-rc6.orig/kernel/cpuset.c	2006-03-11 14:12:55.000000000 -0800
+++ linux-2.6.16-rc6/kernel/cpuset.c	2006-03-17 18:56:17.000000000 -0800
@@ -2159,7 +2159,7 @@ int __cpuset_zone_allowed(struct zone *z
 	const struct cpuset *cs;	/* current cpuset ancestors */
 	int allowed = 1;		/* is allocation in zone z allowed? */
 
-	if (in_interrupt())
+	if (in_interrupt() || (gfp_mask & __GFP_NO_CPUSET))
 		return 1;
 	node = z->zone_pgdat->node_id;
 	if (node_isset(node, current->mems_allowed))