zone_reclaim additional comments and cleanup

This patch adds some comments to explain how zone reclaim works.  And it
fixes the following issues:

- PF_SWAPWRITE needs to be set for RECLAIM_SWAP to be able to write
  out pages to swap.  Currently RECLAIM_SWAP may not do that.

- remove setting sc.nr_reclaimed pages after slab reclaim since the
  slab shrinking code does not use that and the nr_reclaimed pages
  is just right for the intended follow up action.

Signed-off-by: Christoph Lameter

Index: linux-2.6.16-rc3/mm/vmscan.c
===================================================================
--- linux-2.6.16-rc3.orig/mm/vmscan.c	2006-02-12 16:27:25.000000000 -0800
+++ linux-2.6.16-rc3/mm/vmscan.c	2006-02-13 09:45:05.000000000 -0800
@@ -1870,22 +1870,37 @@ int zone_reclaim_interval __read_mostly
  */
 int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 {
-	int nr_pages;
+	int nr_pages;	/* Minimum pages needed in order to stay on node */
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
 	struct scan_control sc;
 	cpumask_t mask;
 	int node_id;
 
+	/*
+	 * Do not reclaim if there was a recent unsuccessful attempt at
+	 * zone reclaim. In that case we let allocations go off node for
+	 * the zone_reclaim_interval. Otherwise we would scan for each off
+	 * node page allocation.
+	 */
 	if (time_before(jiffies,
 		zone->last_unsuccessful_zone_reclaim + zone_reclaim_interval))
 			return 0;
 
+	/*
+	 * Avoid concurrent zone reclaims, do not reclaim in a zone that
+	 * does not have reclaimable pages and if we should not delay
+	 * the allocation then do not scan.
+	 */
 	if (!(gfp_mask & __GFP_WAIT) ||
 		zone->all_unreclaimable ||
 		atomic_read(&zone->reclaim_in_progress) > 0)
 			return 0;
 
+	/*
+	 * Only reclaim in the zones that are local or in zones
+	 * that are on nodes without processors.
+	 */
 	node_id = zone->zone_pgdat->node_id;
 	mask = node_to_cpumask(node_id);
 	if (!cpus_empty(mask) && node_id != numa_node_id())
@@ -1908,7 +1923,12 @@ int zone_reclaim(struct zone *zone, gfp_
 	sc.swap_cluster_max = SWAP_CLUSTER_MAX;
 
 	cond_resched();
-	p->flags |= PF_MEMALLOC;
+	/*
+	 * We need to be able to allocate from the reserves for RECLAIM_SWAP
+	 * and we also need to be able to write out pages for RECLAIM_WRITE
+	 * and RECLAIM_SWAP.
+	 */
+	p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
@@ -1922,23 +1942,29 @@ int zone_reclaim(struct zone *zone, gfp_
 	} while (sc.nr_reclaimed < nr_pages && sc.priority > 0);
 
-	if (sc.nr_reclaimed < nr_pages && (zone_reclaim_mode & RECLAIM_SLAB)) {
+	if (sc.nr_reclaimed < nr_pages && (zone_reclaim_mode & RECLAIM_SLAB))
 		/*
 		 * shrink_slab does not currently allow us to determine
-		 * how many pages were freed in the zone. So we just
-		 * shake the slab and then go offnode for a single allocation.
+		 * how many pages were freed in this zone. So we just
+		 * shake the slab a bit and then go off node for this
+		 * particular allocation despite possibly having freed enough
+		 * memory to allocate in this zone. If we freed local memory
+		 * then the next allocations will be local again.
		 *
 		 * shrink_slab will free memory on all zones and may take
 		 * a long time.
 		 */
 		shrink_slab(sc.nr_scanned, gfp_mask, order);
-		sc.nr_reclaimed = 1;	/* Avoid getting the off node timeout */
-	}
 
 	p->reclaim_state = NULL;
-	current->flags &= ~PF_MEMALLOC;
+	current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
 
 	if (sc.nr_reclaimed == 0)
+		/*
+		 * We were unable to reclaim enough pages to stay on node.
+		 * We now allow off node accesses for a certain time period
+		 * before trying again to reclaim pages from the local zone.
+		 */
 		zone->last_unsuccessful_zone_reclaim = jiffies;
 
 	return sc.nr_reclaimed >= nr_pages;
 }