
Appendix G  Non-Contiguous Memory Allocation

G.1  Allocating A Non-Contiguous Area

G.1.1  Function: vmalloc

Source: include/linux/vmalloc.h

The call graph for this function is shown in Figure 7.2. The following functions differ only in their GFP_ flags (See Section 6.4). The size parameter is page aligned by __vmalloc() (See Section G.1.2).

 37 static inline void * vmalloc (unsigned long size)
 38 {
 39     return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
 40 }
 45 
 46 static inline void * vmalloc_dma (unsigned long size)
 47 {
 48     return __vmalloc(size, GFP_KERNEL|GFP_DMA, PAGE_KERNEL);
 49 }
 54  
 55 static inline void * vmalloc_32(unsigned long size)
 56 {
 57     return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
 58 }
37The flags indicate that either ZONE_NORMAL or ZONE_HIGHMEM can be used as necessary
46The flag indicates to only allocate from ZONE_DMA
55Only physical pages from ZONE_NORMAL will be allocated
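
As a usage illustration only (the function and variable names below are hypothetical, not from the kernel), a driver that needs a large, virtually contiguous buffer would pair vmalloc() with vfree():

    #include <linux/vmalloc.h>   /* vmalloc(), vfree() */
    #include <linux/string.h>    /* memset() */
    #include <linux/kernel.h>    /* printk() */

    /* Allocate a large, virtually contiguous buffer. The backing physical
     * pages may be scattered, so the buffer is unsuitable for DMA unless
     * vmalloc_dma() or vmalloc_32() is used instead. */
    static void *alloc_big_buffer(unsigned long size)
    {
            void *buf = vmalloc(size);

            if (!buf) {
                    printk(KERN_WARNING "alloc_big_buffer: out of memory\n");
                    return NULL;
            }
            memset(buf, 0, size);
            return buf;
    }

    static void free_big_buffer(void *buf)
    {
            /* Must be the exact pointer returned by vmalloc() */
            vfree(buf);
    }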

G.1.2  Function: __vmalloc

Source: mm/vmalloc.c

This function has three tasks. It page aligns the size request, asks get_vm_area() to find an area for the request and uses vmalloc_area_pages() to allocate the PTEs for the pages.

261 void * __vmalloc (unsigned long size, int gfp_mask, pgprot_t prot)
262 {
263     void * addr;
264     struct vm_struct *area;
265 
266     size = PAGE_ALIGN(size);
267     if (!size || (size >> PAGE_SHIFT) > num_physpages)
268         return NULL;
269     area = get_vm_area(size, VM_ALLOC);
270     if (!area)
271         return NULL;
272     addr = area->addr;
273     if (__vmalloc_area_pages(VMALLOC_VMADDR(addr), size, gfp_mask,
274                              prot, NULL)) {
275         vfree(addr);
276         return NULL;
277     }
278     return addr;
279 }
261The parameters are the size to allocate, the GFP_ flags to use for allocation and what protection to give the PTE
266Align the size to a page size
267Sanity check. Make sure the size is not 0 and that the requested size is not larger than the number of physical pages in the system
269Find an area of virtual address space to store the allocation with get_vm_area() (See Section G.1.3)
272The addr field has been filled by get_vm_area()
273Allocate the PTE entries needed for the allocation with __vmalloc_area_pages() (See Section G.1.5). If it fails, a non-zero value (-ENOMEM) is returned
275-276If the allocation fails, free any PTEs, pages and descriptions of the area
278Return the address of the allocated area
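
For reference, the alignment at line 266 rounds the request up to the next page boundary. A minimal sketch of how PAGE_ALIGN() is typically defined by the architecture headers (the values shown are the common x86 ones and are assumptions, not taken from this appendix):

    #define PAGE_SHIFT 12                             /* 4KiB pages on x86 */
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)
    #define PAGE_MASK  (~(PAGE_SIZE - 1))
    #define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)

    /* PAGE_ALIGN(1)    == 4096
     * PAGE_ALIGN(4096) == 4096
     * PAGE_ALIGN(4097) == 8192 */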

G.1.3  Function: get_vm_area

Source: mm/vmalloc.c

To allocate an area for the vm_struct, the slab allocator is asked to provide the necessary memory via kmalloc(). It then searches the vm_struct list linearly, looking for a region large enough to satisfy the request, including a page pad at the end of the area. A simplified sketch of this search is given after the code commentary below.

195 struct vm_struct * get_vm_area(unsigned long size, 
                                   unsigned long flags)
196 {
197     unsigned long addr, next;
198     struct vm_struct **p, *tmp, *area;
199 
200     area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_KERNEL);
201     if (!area)
202         return NULL;
203
204     size += PAGE_SIZE;
205     if(!size) {
206         kfree (area);
207         return NULL;
208     }
209
210     addr = VMALLOC_START;
211     write_lock(&vmlist_lock);
212     for (p = &vmlist; (tmp = *p) ; p = &tmp->next) {
213         if ((size + addr) < addr)
214             goto out;
215         if (size + addr <= (unsigned long) tmp->addr)
216             break;
217         next = tmp->size + (unsigned long) tmp->addr;
218         if (next > addr)
219             addr = next;
220         if (addr > VMALLOC_END-size)
221             goto out;
222     }
223     area->flags = flags;
224     area->addr = (void *)addr;
225     area->size = size;
226     area->next = *p;
227     *p = area;
228     write_unlock(&vmlist_lock);
229     return area;
230 
231 out:
232     write_unlock(&vmlist_lock);
233     kfree(area);
234     return NULL;
235 }
195The parameters are the size of the requested region, which should be a multiple of the page size, and the area flags, either VM_ALLOC or VM_IOREMAP
200-202Allocate space for the vm_struct description struct
204Pad the request so there is a page gap between areas. This is to guard against overwrites
205-208This is to ensure the size is not 0 after the padding due to an overflow. If something does go wrong, free the area descriptor just allocated and return NULL
210Start the search at the beginning of the vmalloc address space
211Lock the list
212-222Walk through the list searching for an area large enough for the request
213-214Check to make sure the end of the addressable range has not been reached
215-216If the requested area would fit between the current address and the next area, the search is complete
217-219Advance addr past the end of the current area so that the next iteration searches the gap after it
220-221Make sure the address would not go over the end of the vmalloc address space
223-225Copy in the area information
226-227Link the new area into the list
228-229Unlock the list and return
231This label is reached if the request could not be satisfied
232Unlock the list
233-234Free the memory used for the area descriptor and return
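
To see the first-fit search at lines 210-222 in isolation, the following is a simplified userspace sketch (not kernel code; the address constants are illustrative and the overflow check at lines 213-214 is omitted for brevity). It walks a sorted list of areas looking for a hole large enough for the request plus the one-page guard gap:

    #include <stdio.h>

    #define PAGE_SIZE     4096UL
    #define VMALLOC_START 0xd0000000UL   /* illustrative values only */
    #define VMALLOC_END   0xe0000000UL

    struct area {                        /* stand-in for struct vm_struct */
            unsigned long addr;
            unsigned long size;          /* includes the guard page */
            struct area *next;
    };

    /* First-fit search mirroring get_vm_area(): the list is kept sorted by
     * address and each area already includes a one-page pad, so a new area
     * fits in the hole before tmp when addr + size <= tmp->addr. */
    static unsigned long find_hole(struct area *list, unsigned long size)
    {
            unsigned long addr = VMALLOC_START;
            struct area *tmp;

            size += PAGE_SIZE;                     /* guard page */
            for (tmp = list; tmp != NULL; tmp = tmp->next) {
                    if (addr + size <= tmp->addr)
                            break;                 /* hole before tmp is large enough */
                    addr = tmp->addr + tmp->size;  /* skip past tmp */
                    if (addr > VMALLOC_END - size)
                            return 0;              /* address space exhausted */
            }
            return addr;
    }

    int main(void)
    {
            struct area a2 = { VMALLOC_START + 12 * PAGE_SIZE, 4 * PAGE_SIZE, NULL };
            struct area a1 = { VMALLOC_START, 8 * PAGE_SIZE, &a2 };

            /* A 3-page request (+1 guard page) fits in the 4-page hole after a1 */
            printf("hole at %#lx\n", find_hole(&a1, 3 * PAGE_SIZE));   /* 0xd0008000 */
            return 0;
    }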

G.1.4  Function: vmalloc_area_pages

Source: mm/vmalloc.c

This is just a wrapper around __vmalloc_area_pages(). This function exists for compatibility with older kernels. The name change was made to reflect that the new function __vmalloc_area_pages() is able to take an array of pages to use for insertion into the pagetables.

189 int vmalloc_area_pages(unsigned long address, unsigned long size,
190                        int gfp_mask, pgprot_t prot)
191 {
192         return __vmalloc_area_pages(address, size, gfp_mask, prot, NULL);
193 }
192Call __vmalloc_area_pages() with the same parameters. The pages array is passed as NULL as the pages will be allocated as necessary

G.1.5  Function: __vmalloc_area_pages

Source: mm/vmalloc.c

This is the beginning of a standard page table walk function. This top level function will step through all PGDs within an address range. For each PGD, it will call pmd_alloc() to allocate a PMD directory and call alloc_area_pmd() for the directory.

155 static inline int __vmalloc_area_pages (unsigned long address,
156                                         unsigned long size,
157                                         int gfp_mask,
158                                         pgprot_t prot,
159                                         struct page ***pages)
160 {
161     pgd_t * dir;
162     unsigned long end = address + size;
163     int ret;
164 
165     dir = pgd_offset_k(address);
166     spin_lock(&init_mm.page_table_lock);
167     do {
168         pmd_t *pmd;
169         
170         pmd = pmd_alloc(&init_mm, dir, address);
171         ret = -ENOMEM;
172         if (!pmd)
173             break;
174 
175         ret = -ENOMEM;
176         if (alloc_area_pmd(pmd, address, end - address, 
                       gfp_mask, prot, pages))
177             break;
178 
179         address = (address + PGDIR_SIZE) & PGDIR_MASK;
180         dir++;
181 
182         ret = 0;
183     } while (address && (address < end));
184     spin_unlock(&init_mm.page_table_lock);
185     flush_cache_all();
186     return ret;
187 }
155The parameters are:
address is the starting address to allocate PMDs for
size is the size of the region
gfp_mask is the GFP_ flags for alloc_pages() (See Section F.1.1)
prot is the protection to give the PTE entry
pages is an array of pages to use for insertion instead of having alloc_area_pte() allocate them one at a time. Only the vmap() interface passes in an array
162The end address is the starting address plus the size
165Get the PGD entry for the starting address
166Lock the kernel reference page table
167-183For every PGD within this address range, allocate a PMD directory and call alloc_area_pmd() (See Section G.1.6)
170Allocate a PMD directory
176Call alloc_area_pmd() (See Section G.1.6) which will allocate a PTE for each PTE slot in the PMD
179address becomes the base address of the next PGD entry
180Move dir to the next PGD entry
184Release the lock to the kernel page table
185flush_cache_all() will flush all CPU caches. This is necessary because the kernel page tables have changed
186Return success
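
The advance at line 179 rounds address up to the base of the next PGD entry. A small userspace sketch of the same arithmetic, assuming the common x86 (non-PAE) layout where each PGD entry maps 4MiB:

    #include <stdio.h>

    #define PGDIR_SHIFT 22                      /* x86 without PAE: 4MiB per PGD entry */
    #define PGDIR_SIZE  (1UL << PGDIR_SHIFT)
    #define PGDIR_MASK  (~(PGDIR_SIZE - 1))

    int main(void)
    {
            unsigned long address = 0xd0123000UL;   /* somewhere inside a PGD entry */

            /* Same expression as line 179: move to the base of the next PGD entry */
            address = (address + PGDIR_SIZE) & PGDIR_MASK;
            printf("next PGD entry starts at %#lx\n", address);   /* 0xd0400000 */
            return 0;
    }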

G.1.6  Function: alloc_area_pmd

Source: mm/vmalloc.c

This is the second stage of the standard page table walk to allocate PTE entries for an address range. For every PMD within a given address range on a PGD, pte_alloc() will create a PTE directory and then alloc_area_pte() will be called to allocate the physical pages.

132 static inline int alloc_area_pmd(pmd_t * pmd, unsigned long address,
133                         unsigned long size, int gfp_mask,
134                         pgprot_t prot, struct page ***pages)
135 {
136     unsigned long end;
137 
138     address &= ~PGDIR_MASK;
139     end = address + size;
140     if (end > PGDIR_SIZE)
141         end = PGDIR_SIZE;
142     do {
143         pte_t * pte = pte_alloc(&init_mm, pmd, address);
144         if (!pte)
145             return -ENOMEM;
146         if (alloc_area_pte(pte, address, end - address, 
147                    gfp_mask, prot, pages))
148             return -ENOMEM;
149         address = (address + PMD_SIZE) & PMD_MASK;
150         pmd++;
151     } while (address < end);
152     return 0;
153 }
132The parameters are:
pmd is the PMD that needs the allocations
address is the starting address to start from
size is the size of the region within the PMD to allocate for
gfp_mask is the GFP_ flags for alloc_pages() (See Section F.1.1)
prot is the protection to give the PTE entry
pages is an optional array of pages to use instead of allocating each page individually
138Align the starting address to the PGD
139-141Calculate end to be the end of the allocation or the end of the PGD, whichever occurs first
142-151For every PMD within the given address range, allocate a PTE directory and call alloc_area_pte()(See Section G.1.7)
143Allocate the PTE directory
146-147Call alloc_area_pte() which will allocate the physical pages if an array of pages is not already supplied with pages
149address becomes the base address of the next PMD entry
150Move pmd to the next PMD entry
152Return success

G.1.7  Function: alloc_area_pte

Source: mm/vmalloc.c

This is the last stage of the page table walk. For every PTE in the given PTE directory and address range, a page will be allocated and associated with the PTE.

 95 static inline int alloc_area_pte (pte_t * pte, unsigned long address,
 96                         unsigned long size, int gfp_mask,
 97                         pgprot_t prot, struct page ***pages)
 98 {
 99     unsigned long end;
100 
101     address &= ~PMD_MASK;
102     end = address + size;
103     if (end > PMD_SIZE)
104         end = PMD_SIZE;
105     do {
106         struct page * page;
107 
108         if (!pages) {
109             spin_unlock(&init_mm.page_table_lock);
110             page = alloc_page(gfp_mask);
111             spin_lock(&init_mm.page_table_lock);
112         } else {
113             page = (**pages);
114             (*pages)++;
115
116             /* Add a reference to the page so we can free later */
117             if (page)
118                 atomic_inc(&page->count);
119
120         }
121         if (!pte_none(*pte))
122             printk(KERN_ERR "alloc_area_pte: page already exists\n");
123         if (!page)
124             return -ENOMEM;
125         set_pte(pte, mk_pte(page, prot));
126         address += PAGE_SIZE;
127         pte++;
128     } while (address < end);
129     return 0;
130 }
101Align the address to a PMD directory
103-104The end address is the end of the request or the end of the directory, whichever occurs first
105-128Loop through every PTE in this page table. If a pages array is supplied, use pages from it to populate the table, otherwise allocate each one individually
108-111If an array of pages is not supplied, unlock the kernel reference pagetable, allocate a page with alloc_page() and reacquire the spinlock
112-120Else, take one page from the array and increment its usage count as it is about to be inserted into the reference page table
121-122If the PTE is already in use, it means that the areas in the vmalloc region are overlapping somehow
123-124Return failure if physical pages are not available
125Set the page with the desired protection bits (prot) into the PTE
126address becomes the address of the next PTE
127Move to the next PTE
129Return success
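
The struct page ***pages parameter behaves as a cursor into the caller's array: (**pages) reads the current page and (*pages)++ advances the caller's pointer so later PTEs see later pages. The following userspace sketch (hypothetical names) shows the same double-indirection idiom:

    #include <stdio.h>

    /* Consume one element via the caller's cursor and advance it, mirroring
     * how alloc_area_pte() uses (**pages) and (*pages)++. */
    static int take_one(int **cursor)
    {
            int value = **cursor;   /* current element */
            (*cursor)++;            /* advance the caller's pointer */
            return value;
    }

    int main(void)
    {
            int pages[3] = { 10, 20, 30 };
            int *cursor = pages;
            int i;

            for (i = 0; i < 3; i++)
                    printf("%d\n", take_one(&cursor));   /* prints 10, 20, 30 */
            return 0;
    }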

G.1.8  Function: vmap

Source: mm/vmalloc.c

This function allows a caller-supplied array of pages to be inserted into the vmalloc address space. This is unused in 2.4.22 and I suspect it is an accidental backport from 2.6.x where it is used by the sound subsystem core.

281 void * vmap(struct page **pages, int count,
282             unsigned long flags, pgprot_t prot)
283 {
284     void * addr;
285     struct vm_struct *area;
286     unsigned long size = count << PAGE_SHIFT;
287 
288     if (!size || size > (max_mapnr << PAGE_SHIFT))
289         return NULL;
290     area = get_vm_area(size, flags);
291     if (!area) {
292         return NULL;
293     }
294     addr = area->addr;
295     if (__vmalloc_area_pages(VMALLOC_VMADDR(addr), size, 0,
296                              prot, &pages)) {
297         vfree(addr);
298         return NULL;
299     }
300     return addr;
301 }
281The parameters are:
pages is the caller-supplied array of pages to insert
count is the number of pages in the array
flags is the flags to use for the vm_struct
prot is the protection bits to set the PTE with
286Calculate the size in bytes of the region to create based on the size of the array
288-289Make sure the size of the region does not exceed limits
290-293Use get_vm_area() to find a region large enough for the mapping. If one is not found, return NULL
294Get the virtual address of the area
295Insert the array into the pagetable with __vmalloc_area_pages() (See Section G.1.5)
297-298If the insertion fails, free the region and return NULL
300Return the virtual address of the newly mapped region
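
A hypothetical usage sketch (the names and error handling are my own, and since vmap() is unused in 2.4.22 this only illustrates the interface): the caller allocates individual pages and maps them into one virtually contiguous region. Because alloc_area_pte() takes its own reference on each page, the caller's references from alloc_page() remain and still need to be dropped with __free_page() once the pages are no longer required.

    #include <linux/mm.h>        /* alloc_page(), __free_page() */
    #include <linux/vmalloc.h>   /* vmap(), vfree(), VM_ALLOC */

    #define NPAGES 4

    /* Map NPAGES individually allocated pages into one contiguous
     * kernel virtual region. Returns the mapped address or NULL. */
    static void *map_scattered_pages(struct page **pages)
    {
            void *addr;
            int i;

            for (i = 0; i < NPAGES; i++) {
                    pages[i] = alloc_page(GFP_KERNEL);
                    if (!pages[i])
                            goto fail;
            }

            addr = vmap(pages, NPAGES, VM_ALLOC, PAGE_KERNEL);
            if (!addr)
                    goto fail_all;
            return addr;

    fail_all:
            i = NPAGES;
    fail:
            while (--i >= 0)
                    __free_page(pages[i]);
            return NULL;
    }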

G.2  Freeing A Non-Contiguous Area

G.2.1  Function: vfree

Source: mm/vmalloc.c

The call graph for this function is shown in Figure 7.4. This is the top level function responsible for freeing a non-contiguous area of memory. It performs basic sanity checks before finding the vm_struct for the requested addr. Once found, it calls vmfree_area_pages().

237 void vfree(void * addr)
238 {
239     struct vm_struct **p, *tmp;
240 
241     if (!addr)
242         return;
243     if ((PAGE_SIZE-1) & (unsigned long) addr) {
244         printk(KERN_ERR 
               "Trying to vfree() bad address (%p)\n", addr);
245         return;
246     }
247     write_lock(&vmlist_lock);
248     for (p = &vmlist ; (tmp = *p) ; p = &tmp->next) {
249         if (tmp->addr == addr) {
250             *p = tmp->next;
251             vmfree_area_pages(VMALLOC_VMADDR(tmp->addr), 
                          tmp->size);
252             write_unlock(&vmlist_lock);
253             kfree(tmp);
254             return;
255         }
256     }
257     write_unlock(&vmlist_lock);
258     printk(KERN_ERR 
           "Trying to vfree() nonexistent vm area (%p)\n", addr);
259 }
237The parameter is the address returned by get_vm_area() (See Section G.1.3) to either vmalloc() or ioremap()
241-242Ignore NULL addresses
243-246This checks the address is page aligned and is a reasonable quick guess to see if the area is valid or not
247Acquire a write lock to the vmlist
248Cycle through the vmlist looking for the correct vm_struct for addr
249If this is the correct address, then ...
250Remove this area from the vmlist linked list
251Free all pages associated with the address range
252Release the vmlist lock
253Free the memory used for the vm_struct and return
257-258The vm_struct was not found. Release the lock and print a message about the failed free
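
The removal at lines 248-250 uses the pointer-to-pointer idiom: p always points at the link that refers to the current node, so unlinking takes a single assignment and the head of the list needs no special case. A userspace sketch of the same idiom (hypothetical names):

    #include <stdio.h>

    struct node {                 /* stand-in for struct vm_struct */
            unsigned long addr;
            struct node *next;
    };

    /* Unlink the node whose addr matches, using a pointer to the link that
     * points at the current node, as vfree() does with &vmlist. */
    static struct node *unlink(struct node **list, unsigned long addr)
    {
            struct node **p, *tmp;

            for (p = list; (tmp = *p) != NULL; p = &tmp->next) {
                    if (tmp->addr == addr) {
                            *p = tmp->next;   /* bypass tmp in one assignment */
                            return tmp;
                    }
            }
            return NULL;
    }

    int main(void)
    {
            struct node c = { 0x3000, NULL };
            struct node b = { 0x2000, &c };
            struct node a = { 0x1000, &b };
            struct node *list = &a;
            struct node *removed = unlink(&list, 0x2000);

            printf("removed %#lx, list now %#lx -> %#lx\n",
                   removed->addr, list->addr, list->next->addr);
            return 0;
    }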

G.2.2  Function: vmfree_area_pages

Source: mm/vmalloc.c

This is the first stage of the page table walk to free all pages and PTEs associated with an address range. It is responsible for stepping through the relevant PGDs and for flushing the TLB.

 80 void vmfree_area_pages(unsigned long address, unsigned long size)
 81 {
 82     pgd_t * dir;
 83     unsigned long end = address + size;
 84 
 85     dir = pgd_offset_k(address);
 86     flush_cache_all();
 87     do {
 88         free_area_pmd(dir, address, end - address);
 89         address = (address + PGDIR_SIZE) & PGDIR_MASK;
 90         dir++;
 91     } while (address && (address < end));
 92     flush_tlb_all();
 93 }
80The parameters are the starting address and the size of the region
83The end of the address range is the starting address plus the size
85Get the first PGD for the address range
86Flush the CPU cache so cache hits will not occur on pages that are to be deleted. This is a null operation on many architectures, including the x86
88Call free_area_pmd() (See Section G.2.3) to perform the second stage of the page table walk
89address becomes the starting address of the next PGD
90Move to the next PGD
92Flush the TLB as the page tables have now changed

G.2.3  Function: free_area_pmd

Source: mm/vmalloc.c

This is the second stage of the page table walk. For every PMD in this directory, call free_area_pte() to free up the pages and PTEs.

 56 static inline void free_area_pmd(pgd_t * dir, 
                     unsigned long address,
                     unsigned long size)
 57 {
 58     pmd_t * pmd;
 59     unsigned long end;
 60 
 61     if (pgd_none(*dir))
 62         return;
 63     if (pgd_bad(*dir)) {
 64         pgd_ERROR(*dir);
 65         pgd_clear(dir);
 66         return;
 67     }
 68     pmd = pmd_offset(dir, address);
 69     address &= ~PGDIR_MASK;
 70     end = address + size;
 71     if (end > PGDIR_SIZE)
 72         end = PGDIR_SIZE;
 73     do {
 74         free_area_pte(pmd, address, end - address);
 75         address = (address + PMD_SIZE) & PMD_MASK;
 76         pmd++;
 77     } while (address < end);
 78 }
56The parameters are the PGD being stepped through, the starting address and the length of the region
61-62If there is no PGD, return. This can occur after vfree() (See Section G.2.1) is called during a failed allocation
63-67A PGD can be bad if the entry is not present, it is marked read-only or it is marked accessed or dirty
68Get the first PMD for the address range
69Make the address PGD aligned
70-72end is either the end of the space to free or the end of this PGD, whichever is first
73-77For every PMD, call free_area_pte() (See Section G.2.4) to free the PTE entries
75address is the base address of the next PMD
76Move to the next PMD

G.2.4  Function: free_area_pte

Source: mm/vmalloc.c

This is the final stage of the page table walk. For every PTE in the given PMD within the address range, it will free the PTE and the associated page

22 static inline void free_area_pte(pmd_t * pmd, unsigned long address,
                    unsigned long size)
 23 {
 24     pte_t * pte;
 25     unsigned long end;
 26 
 27     if (pmd_none(*pmd))
 28         return;
 29     if (pmd_bad(*pmd)) {
 30         pmd_ERROR(*pmd);
 31         pmd_clear(pmd);
 32         return;
 33     }
 34     pte = pte_offset(pmd, address);
 35     address &= ~PMD_MASK;
 36     end = address + size;
 37     if (end > PMD_SIZE)
 38         end = PMD_SIZE;
 39     do {
 40         pte_t page;
 41         page = ptep_get_and_clear(pte);
 42         address += PAGE_SIZE;
 43         pte++;
 44         if (pte_none(page))
 45             continue;
 46         if (pte_present(page)) {
 47             struct page *ptpage = pte_page(page);
 48             if (VALID_PAGE(ptpage) && 
                   (!PageReserved(ptpage)))
 49                 __free_page(ptpage);
 50             continue;
 51         }
 52         printk(KERN_CRIT 
               "Whee.. Swapped out page in kernel page table\n");
 53     } while (address < end);
 54 }
22The parameters are the PMD that PTEs are being freed from, the starting address and the size of the region to free
27-28The PMD could be absent if this region is from a failed vmalloc()
29-33A PMD can be bad if it's not in main memory, it's read only or it's marked dirty or accessed
34pte is the first PTE in the address range
35Align the address to the PMD
36-38The end is either the end of the requested region or the end of the PMD, whichever occurs first
39-53Step through all PTEs, perform checks and free each PTE with its associated page
41ptep_get_and_clear() will remove a PTE from a page table and return it to the caller
42address will be the base address of the next PTE
43Move to the next PTE
44If there was no PTE, simply continue
46-51If the page is present, perform basic checks and then free it
47pte_page() uses the global mem_map to find the struct page for the PTE
48-49Make sure the page is a valid page and it is not reserved before calling __free_page() to free the physical page
50Continue to the next PTE
52If this line is reached, a PTE within the kernel address space was somehow swapped out. Kernel memory is not swappable, so this is a critical error

