
Appendix D  Process Address Space

D.1  Process Memory Descriptors

This section covers the functions used to allocate, initialise, copy and destroy memory descriptors.

D.1.1  Initialising a Descriptor

The initial mm_struct in the system is called init_mm and is statically initialised at compile time using the macro INIT_MM().

238 #define INIT_MM(name) \
239 {                                                         \
240       mm_rb:           RB_ROOT,                           \
241       pgd:             swapper_pg_dir,                    \
242       mm_users:        ATOMIC_INIT(2),                    \
243       mm_count:        ATOMIC_INIT(1),                    \
244       mmap_sem:        __RWSEM_INITIALIZER(name.mmap_sem),\
245       page_table_lock: SPIN_LOCK_UNLOCKED,                \
246       mmlist:          LIST_HEAD_INIT(name.mmlist),       \
247 }

Once it is established, new mm_structs are created as copies of their parent mm_struct by copy_mm(), with the process specific fields initialised by mm_init().

D.1.2  Copying a Descriptor

D.1.2.1  Function: copy_mm

Source: kernel/fork.c

This function makes a copy of the mm_struct for the given task. This is only called from do_fork() after a new process has been created and needs its own mm_struct.

315 static int copy_mm(unsigned long clone_flags, 
                       struct task_struct * tsk)
316 {
317       struct mm_struct * mm, *oldmm;
318       int retval;
319 
320       tsk->min_flt = tsk->maj_flt = 0;
321       tsk->cmin_flt = tsk->cmaj_flt = 0;
322       tsk->nswap = tsk->cnswap = 0;
323 
324       tsk->mm = NULL;
325       tsk->active_mm = NULL;
326 
327       /*
328        * Are we cloning a kernel thread?
330        * We need to steal a active VM for that..
331        */
332       oldmm = current->mm;
333       if (!oldmm)
334             return 0;
335
336       if (clone_flags & CLONE_VM) {
337             atomic_inc(&oldmm->mm_users);
338             mm = oldmm;
339             goto good_mm;
340       }

Reset fields that are not inherited by a child mm_struct and find a mm to copy from.

315The parameters are the flags passed for clone and the task that is creating a copy of the mm_struct
320-325Initialise the task_struct fields related to memory management
332Borrow the mm of the current running process to copy from
333A kernel thread has no mm so it can return immediately
336-341If the CLONE_VM flag is set, the child process is to share the mm with the parent process. This is required by users like pthreads. The mm_users field is incremented so the mm is not destroyed prematurely later. The good_mm label sets tsk->mm and tsk->active_mm and returns success
342       retval = -ENOMEM;
343       mm = allocate_mm();
344       if (!mm)
345             goto fail_nomem;
346 
347       /* Copy the current MM stuff.. */
348       memcpy(mm, oldmm, sizeof(*mm));
349       if (!mm_init(mm))
350             goto fail_nomem;
351 
352       if (init_new_context(tsk,mm))
353             goto free_pt;
354 
355       down_write(&oldmm->mmap_sem);
356       retval = dup_mmap(mm);
357       up_write(&oldmm->mmap_sem);
358 
343Allocate a new mm
348-350Copy the parent mm and initialise the process specific mm fields with mm_init()
352-353Initialise the MMU context for architectures that do not automatically manage their MMU
355-357Call dup_mmap() which is responsible for copying all the VMA regions in use by the parent process
359       if (retval)
360             goto free_pt;
361 
362       /*
363        * child gets a private LDT (if there was an LDT in the parent)
364        */
365       copy_segments(tsk, mm);
366 
367 good_mm:
368       tsk->mm = mm;
369       tsk->active_mm = mm;
370       return 0;
371 
372 free_pt:
373       mmput(mm);
374 fail_nomem:
375       return retval;
376 }
359dup_mmap() returns 0 on success. If it failed, the label free_pt will call mmput() which decrements the use count of the mm
365This copies the LDT for the new process based on the parent process
368-370Set the new mm, active_mm and return success
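The two paths through copy_mm() are visible from userspace: fork() gives the child its own copy of the parent's mm while clone() with CLONE_VM shares it, which is how thread libraries such as pthreads operate. The following standalone program is an illustration only, not taken from the kernel source.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int counter = 0;

    static int child_fn(void *arg)
    {
            counter = 42;           /* shares the parent's mm, so this is visible */
            return 0;
    }

    int main(void)
    {
            char *stack = malloc(64 * 1024);
            pid_t pid = clone(child_fn, stack + 64 * 1024,
                              CLONE_VM | SIGCHLD, NULL);

            waitpid(pid, NULL, 0);
            printf("after clone(CLONE_VM): counter = %d\n", counter);  /* 42 */

            if (fork() == 0) {      /* child works on a copy of the mm */
                    counter = 0;
                    _exit(0);
            }
            wait(NULL);
            printf("after fork():          counter = %d\n", counter);  /* still 42 */
            free(stack);
            return 0;
    }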

D.1.2.2  Function: mm_init

Source: kernel/fork.c

This function initialises process specific mm fields.

230 static struct mm_struct * mm_init(struct mm_struct * mm)
231 {
232       atomic_set(&mm->mm_users, 1);
233       atomic_set(&mm->mm_count, 1);
234       init_rwsem(&mm->mmap_sem);
235       mm->page_table_lock = SPIN_LOCK_UNLOCKED;
236       mm->pgd = pgd_alloc(mm);
237       mm->def_flags = 0;
238       if (mm->pgd)
239             return mm;
240       free_mm(mm);
241       return NULL;
242 }
232Set the number of users to 1
233Set the reference count of the mm to 1
234Initialise the semaphore protecting the VMA list
235Initialise the spinlock protecting write access to it
236Allocate a new PGD for the struct
237By default, pages used by the process are not locked in memory
238If a PGD exists, return the initialised struct
240Initialisation failed, delete the mm_struct and return

D.1.3  Allocating a Descriptor

Two functions are provided for allocating a mm_struct. To be slightly confusing, they are essentially the same. allocate_mm() will allocate a mm_struct from the slab allocator. mm_alloc() will allocate the struct and then call the function mm_init() to initialise it.
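As a rough sketch of the difference, condensed from copy_mm() above and mm_alloc() below rather than a new kernel interface, the two paths look like this:

        struct mm_struct *mm;

        /* allocate_mm(): a raw slab object, the caller initialises it.
         * This is the copy_mm() pattern. */
        mm = allocate_mm();
        if (mm) {
                memcpy(mm, oldmm, sizeof(*mm));   /* caller supplies the contents */
                mm = mm_init(mm);                 /* then the process specific fields */
        }

        /* mm_alloc(): the struct is zeroed and passed through mm_init()
         * before being returned, so the caller gets a ready to use mm. */
        mm = mm_alloc();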

D.1.3.1  Function: allocate_mm

Source: kernel/fork.c

227 #define allocate_mm()   (kmem_cache_alloc(mm_cachep, SLAB_KERNEL))
227Allocate a mm_struct from the slab allocator

D.1.3.2  Function: mm_alloc

Source: kernel/fork.c

248 struct mm_struct * mm_alloc(void)
249 {
250       struct mm_struct * mm;
251 
252       mm = allocate_mm();
253       if (mm) {
254             memset(mm, 0, sizeof(*mm));
255             return mm_init(mm);
256       }
257       return NULL;
258 }
252Allocate a mm_struct from the slab allocator
254Zero out all contents of the struct
255Perform basic initialisation

D.1.4  Destroying a Descriptor

A new user to an mm increments the usage count with a simple call,

atomic_inc(&mm->mm_users);

It is decremented with a call to mmput(). If the mm_users count reaches zero, all the mapped regions are deleted with exit_mmap() and the page tables destroyed as there are no longer any users of the userspace portions. The mm_count count is decremented with mmdrop() as all the users of the page tables and VMAs are counted as one mm_struct user. When mm_count reaches zero, the mm_struct will be destroyed.
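To make the split between the two counters concrete, a minimal sketch of the 2.4 idiom follows. The field and function names are the kernel's; the wrapper functions themselves are hypothetical and exist only to show which increment is paired with which release call.

        /* A user of the userspace portion (VMAs and page tables). All such
         * users collectively hold a single mm_count reference. */
        void example_use_mm(struct mm_struct *mm)
        {
                atomic_inc(&mm->mm_users);
                /* ... operate on the address space ... */
                mmput(mm);      /* last user calls exit_mmap() and mmdrop() */
        }

        /* A lazy TLB user such as a kernel thread borrowing active_mm. It
         * pins only the mm_struct itself, not the userspace mappings. */
        void example_lazy_tlb(struct mm_struct *mm)
        {
                atomic_inc(&mm->mm_count);
                /* ... use mm as the active_mm ... */
                mmdrop(mm);     /* frees the mm_struct when the count hits 0 */
        }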

D.1.4.1  Function: mmput

Source: kernel/fork.c


Figure D.1: Call Graph: mmput()

276 void mmput(struct mm_struct *mm)
277 {
278       if (atomic_dec_and_lock(&mm->mm_users, &mmlist_lock)) {
279             extern struct mm_struct *swap_mm;
280             if (swap_mm == mm)
281                   swap_mm = list_entry(mm->mmlist.next, 
                              struct mm_struct, mmlist);
282             list_del(&mm->mmlist);
283             mmlist_nr--;
284             spin_unlock(&mmlist_lock);
285             exit_mmap(mm);
286             mmdrop(mm);
287       }
288 }
278Atomically decrement the mm_users field while holding the mmlist_lock lock. Return with the lock held if the count reaches zero
279-286If the usage count reaches zero, the mm and associated structures need to be removed
279-281The swap_mm is the last mm that was swapped out by the vmscan code. If the current process was the last mm swapped, move to the next entry in the list
282Remove this mm from the list
283-284Reduce the count of mms in the list and release the mmlist lock
285Remove all associated mappings
286Delete the mm

D.1.4.2  Function: mmdrop

Source: include/linux/sched.h

765 static inline void mmdrop(struct mm_struct * mm)
766 {
767       if (atomic_dec_and_test(&mm->mm_count))
768             __mmdrop(mm);
769 }
767Atomically decrement the reference count. The reference count could be higher if the mm was being used by lazy TLB switching tasks
768If the reference count reaches zero, call __mmdrop()

D.1.4.3  Function: __mmdrop

Source: kernel/fork.c

265 inline void __mmdrop(struct mm_struct *mm)
266 {
267       BUG_ON(mm == &init_mm);
268       pgd_free(mm->pgd);
269       destroy_context(mm);
270       free_mm(mm);
271 }
267Make sure the init_mm is not destroyed
268Delete the PGD entry
269Delete the LDT
270Call kmem_cache_free() for the mm freeing it with the slab allocator

D.2  Creating Memory Regions

This large section deals with the creation, deletion and manipulation of memory regions.

D.2.1  Creating A Memory Region

The main call graph for creating a memory region is shown in Figure 4.4.

D.2.1.1  Function: do_mmap

Source: include/linux/mm.h

This is a very simple wrapper function around do_mmap_pgoff() which performs most of the work.

557 static inline unsigned long do_mmap(struct file *file, 
            unsigned long addr,
558         unsigned long len, unsigned long prot,
559         unsigned long flag, unsigned long offset)
560 {
561     unsigned long ret = -EINVAL;
562     if ((offset + PAGE_ALIGN(len)) < offset)
563         goto out;
564     if (!(offset & ~PAGE_MASK))
565         ret = do_mmap_pgoff(file, addr, len, prot, flag, 
                                offset >> PAGE_SHIFT);
566 out:
567         return ret;
568 }
561By default, return -EINVAL
562-563Make sure that the size of the region will not overflow the total size of the address space
564-565Page align the offset and call do_mmap_pgoff() to map the region
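The overflow test at line 562 relies on unsigned arithmetic wrapping around: if the sum of offset and the aligned length is smaller than offset, the addition overflowed. A small standalone program, an illustration only and not kernel code, shows the idea:

        #include <stdio.h>

        int main(void)
        {
                unsigned long offset = ~0UL - 0x1000;   /* near the top of the range */
                unsigned long len    = 0x4000;          /* already page aligned */

                if (offset + len < offset)
                        printf("0x%lx + 0x%lx wraps around: reject with -EINVAL\n",
                               offset, len);
                return 0;
        }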

D.2.1.2  Function: do_mmap_pgoff

Source: mm/mmap.c

This function is very large and so is broken up into a number of sections. Broadly speaking, the sections are: sanity check the parameters; find a free linear address range for the mapping; calculate the VM flags and check them against the file access permissions; remove any old mapping at the address; allocate and fill in a new vm_area_struct; call the filesystem or device specific mmap() function; and finally link in the new VMA and update the mm statistics before returning.

393 unsigned long do_mmap_pgoff(struct file * file, 
                unsigned long addr,
                unsigned long len, unsigned long prot,
394             unsigned long flags, unsigned long pgoff)
395 {
396     struct mm_struct * mm = current->mm;
397     struct vm_area_struct * vma, * prev;
398     unsigned int vm_flags;
399     int correct_wcount = 0;
400     int error;
401     rb_node_t ** rb_link, * rb_parent;
402 
403     if (file && (!file->f_op || !file->f_op->mmap))
404         return -ENODEV;
405 
406     if (!len)
407         return addr;
408 
409     len = PAGE_ALIGN(len);
410     
411     if (len > TASK_SIZE || len == 0)
412         return -EINVAL;
413 
414     /* offset overflow? */
415     if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
416         return -EINVAL;
417 
418     /* Too many mappings? */
419     if (mm->map_count > max_map_count)
420         return -ENOMEM;
421 
393The parameters which correspond directly to the parameters to the mmap system call are
file the struct file to mmap if this is a file backed mapping
addr the requested address to map
len the length in bytes to mmap
prot is the permissions on the area
flags are the flags for the mapping
pgoff is the offset within the file to begin the mmap at
403-404If a file or device is being mapped, make sure a filesystem or device specific mmap function is provided. For most filesystems, this will call generic_file_mmap()(See Section D.6.2.1)
406-407Make sure a zero length mmap() is not requested
409-412Page align the length and ensure that the mapping is confined to the userspace portion of the address space. On the x86, kernel space begins at PAGE_OFFSET(3GiB)
415-416Ensure the mapping will not overflow the end of the largest possible file size
419-420Only max_map_count number of mappings are allowed. By default this value is DEFAULT_MAX_MAP_COUNT or 65536 mappings
422     /* Obtain the address to map to. we verify (or select) it and
423      * ensure that it represents a valid section of the address space.
424      */
425     addr = get_unmapped_area(file, addr, len, pgoff, flags);
426     if (addr & ~PAGE_MASK)
427         return addr;
428 
425After basic sanity checks, this function will call the device or file specific get_unmapped_area() function. If a device specific one is unavailable, arch_get_unmapped_area() is called. This function is discussed in Section D.3.2.2
429     /* Do simple checking here so the lower-level routines won't have
430      * to. we assume access permissions have been handled by the open
431      * of the memory object, so we don't do any here.
432      */
433     vm_flags = calc_vm_flags(prot,flags) | mm->def_flags 
                 | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
434 
435     /* mlock MCL_FUTURE? */
436     if (vm_flags & VM_LOCKED) {
437         unsigned long locked = mm->locked_vm << PAGE_SHIFT;
438         locked += len;
439         if (locked > current->rlim[RLIMIT_MEMLOCK].rlim_cur)
440             return -EAGAIN;
441     }
442 
433calc_vm_flags() translates the prot and flags from userspace and translates them to their VM_ equivalents
436-440Check if it has been requested that all future mappings be locked in memory. If yes, make sure the process isn't locking more memory than it is allowed to. If it is, return -EAGAIN
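From userspace, this is the path exercised when mlockall(MCL_FUTURE) is in effect (which on 2.4 requires CAP_IPC_LOCK): VM_LOCKED ends up in mm->def_flags and a later mmap() that would push locked_vm past RLIMIT_MEMLOCK fails with EAGAIN. A rough illustration, not kernel code, with the exact failure point depending on the limits in force:

        #define _GNU_SOURCE
        #include <errno.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/resource.h>

        int main(void)
        {
                struct rlimit rl = { 4096, 4096 };      /* permit one locked page */
                void *first, *second;

                setrlimit(RLIMIT_MEMLOCK, &rl);
                if (mlockall(MCL_FUTURE) != 0) {        /* needs privilege on 2.4 */
                        perror("mlockall");
                        return 1;
                }

                first  = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                second = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                printf("first=%p second=%p (%s)\n", first, second,
                       second == MAP_FAILED ? strerror(errno) : "within the limit");
                return 0;
        }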
443     if (file) {
444         switch (flags & MAP_TYPE) {
445         case MAP_SHARED:
446             if ((prot & PROT_WRITE) && 
                !(file->f_mode & FMODE_WRITE))
447                 return -EACCES;
448 
449             /* Make sure we don't allow writing to 
                 an append-only file.. */
450             if (IS_APPEND(file->f_dentry->d_inode) &&
                    (file->f_mode & FMODE_WRITE))
451                 return -EACCES;
452 
453             /* make sure there are no mandatory 
                 locks on the file. */
454             if (locks_verify_locked(file->f_dentry->d_inode))
455                 return -EAGAIN;
456 
457             vm_flags |= VM_SHARED | VM_MAYSHARE;
458             if (!(file->f_mode & FMODE_WRITE))
459                 vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
460 
461             /* fall through */
462         case MAP_PRIVATE:
463             if (!(file->f_mode & FMODE_READ))
464                 return -EACCES;
465             break;
466 
467         default:
468             return -EINVAL;
469         }
443-470If a file is being memory mapped, check the file's access permissions
446-447If write access is requested, make sure the file is opened for write
450-451Similarly, if the file is opened for append, make sure it cannot be written to. The prot field is not checked because the prot field applies only to the mapping whereas we need to check the opened file
453If the file is mandatory locked, return -EAGAIN so the caller will try a second time
457-459Fix up the flags to be consistent with the file flags
463-464Make sure the file can be read before mmapping it
470     } else {
471         vm_flags |= VM_SHARED | VM_MAYSHARE;
472         switch (flags & MAP_TYPE) {
473         default:
474             return -EINVAL;
475         case MAP_PRIVATE:
476             vm_flags &= ~(VM_SHARED | VM_MAYSHARE);
477             /* fall through */
478         case MAP_SHARED:
479             break;
480         }
481     }
471-481If the file is being mapped for anonymous use, fix up the flags if the requested mapping is MAP_PRIVATE to make sure the flags are consistent
483     /* Clear old maps */
484 munmap_back:
485     vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
486     if (vma && vma->vm_start < addr + len) {
487         if (do_munmap(mm, addr, len))
488             return -ENOMEM;
489         goto munmap_back;
490     }
485find_vma_prepare()(See Section D.2.2.2) steps through the RB tree for the VMA corresponding to a given address
486-488If a VMA was found and it is part of the new mapping, remove the old mapping as the new one will cover both
491 
492     /* Check against address space limit. */
493     if ((mm->total_vm << PAGE_SHIFT) + len
494         > current->rlim[RLIMIT_AS].rlim_cur)
495         return -ENOMEM;
496 
497     /* Private writable mapping? Check memory availability.. */
498     if ((vm_flags & (VM_SHARED | VM_WRITE)) == VM_WRITE &&
499         !(flags & MAP_NORESERVE)                 &&
500         !vm_enough_memory(len >> PAGE_SHIFT))
501         return -ENOMEM;
502 
503     /* Can we just expand an old anonymous mapping? */
504     if (!file && !(vm_flags & VM_SHARED) && rb_parent)
505         if (vma_merge(mm, prev, rb_parent, 
                    addr, addr + len, vm_flags))
506             goto out;
507 
493-495Make sure the new mapping will not exceed the total VM a process is allowed to have. It is unclear why this check is not made earlier
498-501If the caller does not specifically request that free space is not checked with MAP_NORESERVE and it is a private mapping, make sure enough memory is available to satisfy the mapping under current conditions
504-506If two adjacent memory mappings are anonymous and can be treated as one, expand the old mapping rather than creating a new one
508     /* Determine the object being mapped and call the appropriate
509      * specific mapper. the address has already been validated, but
510      * not unmapped, but the maps are removed from the list.
511      */
512     vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
513     if (!vma)
514         return -ENOMEM;
515 
516     vma->vm_mm = mm;
517     vma->vm_start = addr;
518     vma->vm_end = addr + len;
519     vma->vm_flags = vm_flags;
520     vma->vm_page_prot = protection_map[vm_flags & 0x0f];
521     vma->vm_ops = NULL;
522     vma->vm_pgoff = pgoff;
523     vma->vm_file = NULL;
524     vma->vm_private_data = NULL;
525     vma->vm_raend = 0;
512Allocate a vm_area_struct from the slab allocator
516-525Fill in the basic vm_area_struct fields
527     if (file) {
528         error = -EINVAL;
529         if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
530             goto free_vma;
531         if (vm_flags & VM_DENYWRITE) {
532             error = deny_write_access(file);
533             if (error)
534                 goto free_vma;
535             correct_wcount = 1;
536         }
537         vma->vm_file = file;
538         get_file(file);
539         error = file->f_op->mmap(file, vma);
540         if (error)
541             goto unmap_and_free_vma;
527-542Fill in the file related fields if a file is being mapped
529-530These are both invalid flags for a file mapping so free the vm_area_struct and return
531-536This flag is cleared by the system call mmap() but is still honoured for kernel modules that call this function directly. Historically, -ETXTBSY was returned to the calling process if the underlying file was being written to
537Fill in the vm_file field
538This increments the file usage count
539Call the filesystem or device specific mmap() function. In many filesystem cases, this will call generic_file_mmap()(See Section D.6.2.1)
540-541If an error is returned, goto unmap_and_free_vma to clean up and return the error
542     } else if (flags & MAP_SHARED) {
543         error = shmem_zero_setup(vma);
544         if (error)
545             goto free_vma;
546     }
547
543If this is an anonymous shared mapping, the region is created and setup by shmem_zero_setup()(See Section L.7.1). Anonymous shared pages are backed by a virtual tmpfs filesystem so that they can be synchronised properly with swap. The writeback function is shmem_writepage()(See Section L.6.1)
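The case handled by shmem_zero_setup() corresponds to a userspace request for an anonymous MAP_SHARED mapping. The following standalone program, an illustration rather than kernel code, creates such a region and shows that a fork()ed child and its parent see the same pages:

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void)
        {
                int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                if (shared == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                *shared = 0;
                if (fork() == 0) {              /* child writes the shared page */
                        *shared = 42;
                        _exit(0);
                }
                wait(NULL);
                printf("parent sees %d\n", *shared);    /* prints 42 */
                return 0;
        }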
 
548     /* Can addr have changed??
549      *
550      * Answer: Yes, several device drivers can do it in their
551      *     f_op->mmap method. -DaveM
552      */
553     if (addr != vma->vm_start) {
554         /*
555          * It is a bit too late to pretend changing the virtual
556          * area of the mapping, we just corrupted userspace
557          * in the do_munmap, so FIXME (not in 2.4 to avoid
558          * breaking the driver API).
559          */
560         struct vm_area_struct * stale_vma;
561         /* Since addr changed, we rely on the mmap op to prevent 
562          * collisions with existing vmas and just use
563          * find_vma_prepare to update the tree pointers.
564          */
565         addr = vma->vm_start;
566         stale_vma = find_vma_prepare(mm, addr, &prev,
567                         &rb_link, &rb_parent);
568         /*
569          * Make sure the lowlevel driver did its job right.
570          */
571         if (unlikely(stale_vma && stale_vma->vm_start <
                 vma->vm_end)) {
572             printk(KERN_ERR "buggy mmap operation: [<%p>]\n",
573                 file ? file->f_op->mmap : NULL);
574             BUG();
575         }
576     }
577 
578     vma_link(mm, vma, prev, rb_link, rb_parent);
579     if (correct_wcount)
580         atomic_inc(&file->f_dentry->d_inode->i_writecount);
581 
553-576If the address has changed, it means the device specific mmap operation moved the VMA address to somewhere else. The function find_vma_prepare() (See Section D.2.2.2) is used to find where the VMA was moved to
578Link in the new vm_area_struct
579-580Update the file write count
582 out:    
583     mm->total_vm += len >> PAGE_SHIFT;
584     if (vm_flags & VM_LOCKED) {
585         mm->locked_vm += len >> PAGE_SHIFT;
586         make_pages_present(addr, addr + len);
587     }
588     return addr;
589 
590 unmap_and_free_vma:
591     if (correct_wcount)
592         atomic_inc(&file->f_dentry->d_inode->i_writecount);
593     vma->vm_file = NULL;
594     fput(file);
595 
596     /* Undo any partial mapping done by a device driver. */
597     zap_page_range(mm, vma->vm_start, vma->vm_end - vma->vm_start);
598 free_vma:
599     kmem_cache_free(vm_area_cachep, vma);
600     return error;
601 }
583-588Update statistics for the process mm_struct and return the new address
590-597This is reached if the file has been partially mapped before failing. The write statistics are updated and then all user pages are removed with zap_page_range()
598-600This goto is used if the mapping failed immediately after the vm_area_struct is created. It is freed back to the slab allocator before the error is returned

D.2.2  Inserting a Memory Region

The call graph for insert_vm_struct() is shown in Figure 4.6.

D.2.2.1  Function: __insert_vm_struct

Source: mm/mmap.c

This is the top level function for inserting a new vma into an address space. There is a second function like it called simply insert_vm_struct() that is not described in detail here as the only difference is the one line of code increasing the map_count.

1174 void __insert_vm_struct(struct mm_struct * mm, 
                     struct vm_area_struct * vma)
1175 {
1176     struct vm_area_struct * __vma, * prev;
1177     rb_node_t ** rb_link, * rb_parent;
1178 
1179     __vma = find_vma_prepare(mm, vma->vm_start, &prev, 
                      &rb_link, &rb_parent);
1180     if (__vma && __vma->vm_start < vma->vm_end)
1181         BUG();
1182     __vma_link(mm, vma, prev, rb_link, rb_parent);
1183     mm->map_count++;
1184     validate_mm(mm);
1185 }
1174The arguments are the mm_struct that represents the linear address space and the vm_area_struct that is to be inserted
1179find_vma_prepare()(See Section D.2.2.2) locates where the new VMA can be inserted. It will be inserted between prev and __vma and the required nodes for the red-black tree are also returned
1180-1181This is a check to make sure the new VMA does not overlap an existing one. It is virtually impossible for this condition to occur without manually inserting bogus VMAs into the address space
1182This function does the actual work of linking the vma struct into the linear linked list and the red-black tree
1183Increase the map_count to show a new mapping has been added. This line is not present in insert_vm_struct()
1184validate_mm() is a debugging macro for red-black trees. If DEBUG_MM_RB is set, the linear list of VMAs and the tree will be traversed to make sure it is valid. The tree traversal is a recursive function so it is very important that it is used only if really necessary as a large number of mappings could cause a stack overflow. If it is not set, validate_mm() does nothing at all

D.2.2.2  Function: find_vma_prepare

Source: mm/mmap.c

This is responsible for finding the correct place to insert a VMA at the supplied address. It returns a number of pieces of information via the actual return value and the function arguments. The forward VMA to link to is the return value itself. pprev is the previous node which is required because the list is a singly linked list. rb_link and rb_parent are the parent and leaf node the new VMA will be inserted between.

246 static struct vm_area_struct * find_vma_prepare(
                       struct mm_struct * mm,
                       unsigned long addr,
247                    struct vm_area_struct ** pprev,
248                    rb_node_t *** rb_link,
                     rb_node_t ** rb_parent)
249 {
250     struct vm_area_struct * vma;
251     rb_node_t ** __rb_link, * __rb_parent, * rb_prev;
252 
253     __rb_link = &mm->mm_rb.rb_node;
254     rb_prev = __rb_parent = NULL;
255     vma = NULL;
256 
257     while (*__rb_link) {
258         struct vm_area_struct *vma_tmp;
259 
260         __rb_parent = *__rb_link;
261         vma_tmp = rb_entry(__rb_parent, 
                     struct vm_area_struct, vm_rb);
262 
263         if (vma_tmp->vm_end > addr) {
264             vma = vma_tmp;
265             if (vma_tmp->vm_start <= addr)
266                 return vma;
267             __rb_link = &__rb_parent->rb_left;
268         } else {
269             rb_prev = __rb_parent;
270             __rb_link = &__rb_parent->rb_right;
271         }
272     }
273 
274     *pprev = NULL;
275     if (rb_prev)
276         *pprev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
277     *rb_link = __rb_link;
278     *rb_parent = __rb_parent;
279     return vma;
280 }
246The function arguments are described above
253-255Initialise the search
263-272This is a similar tree walk to what was described for find_vma(). The only real difference is the nodes last traversed are remembered with the __rb_link and __rb_parent variables
275-276Get the back linking VMA via the red-black tree
279Return the forward linking VMA

D.2.2.3  Function: vma_link

Source: mm/mmap.c

This is the top-level function for linking a VMA into the proper lists. It is responsible for acquiring the necessary locks to make a safe insertion

337 static inline void vma_link(struct mm_struct * mm, 
                struct vm_area_struct * vma, 
                struct vm_area_struct * prev,
338                 rb_node_t ** rb_link, rb_node_t * rb_parent)
339 {
340     lock_vma_mappings(vma);
341     spin_lock(&mm->page_table_lock);
342     __vma_link(mm, vma, prev, rb_link, rb_parent);
343     spin_unlock(&mm->page_table_lock);
344     unlock_vma_mappings(vma);
345 
346     mm->map_count++;
347     validate_mm(mm);
348 }
337mm is the address space the VMA is to be inserted into. prev is the backwards linked VMA for the linear linked list of VMAs. rb_link and rb_parent are the nodes required to make the rb insertion
340This function acquires the spinlock protecting the address_space representing the file that is being memory mapped.
341Acquire the page table lock which protects the whole mm_struct
342Insert the VMA
343Free the lock protecting the mm_struct
345Unlock the address_space for the file
346Increase the number of mappings in this mm
347If DEBUG_MM_RB is set, the RB trees and linked lists will be checked to make sure they are still valid

D.2.2.4  Function: __vma_link

Source: mm/mmap.c

This simply calls three helper functions which are responsible for linking the VMA into the three linked lists that link VMAs together.

329 static void __vma_link(struct mm_struct * mm, 
               struct vm_area_struct * vma,
               struct vm_area_struct * prev,
330            rb_node_t ** rb_link, rb_node_t * rb_parent)
331 {
332     __vma_link_list(mm, vma, prev, rb_parent);
333     __vma_link_rb(mm, vma, rb_link, rb_parent);
334     __vma_link_file(vma);
335 }
332This links the VMA into the linear linked lists of VMAs in this mm via the vm_next field
333This links the VMA into the red-black tree of VMAs in this mm whose root is stored in the vm_rb field
334This links the VMA into the shared mapping VMA links. Memory mapped files are linked together over potentially many mms by this function via the vm_next_share and vm_pprev_share fields

D.2.2.5  Function: __vma_link_list

Source: mm/mmap.c

282 static inline void __vma_link_list(struct mm_struct * mm, 
                     struct vm_area_struct * vma, 
                     struct vm_area_struct * prev,
283                    rb_node_t * rb_parent)
284 {
285     if (prev) {
286         vma->vm_next = prev->vm_next;
287         prev->vm_next = vma;
288     } else {
289         mm->mmap = vma;
290         if (rb_parent)
291             vma->vm_next = rb_entry(rb_parent, 
                                struct vm_area_struct, 
                                vm_rb);
292         else
293             vma->vm_next = NULL;
294     }
295 }
285If prev is not null, the vma is simply inserted into the list
289Else this is the first mapping and the first element of the list has to be stored in the mm_struct
290The VMA is stored as the parent node

D.2.2.6  Function: __vma_link_rb

Source: mm/mmap.c

The principal workings of this function are implemented within <linux/rbtree.h> and will not be discussed in detail in this book.

297 static inline void __vma_link_rb(struct mm_struct * mm, 
                     struct vm_area_struct * vma,
298                  rb_node_t ** rb_link, 
                     rb_node_t * rb_parent)
299 {
300     rb_link_node(&vma->vm_rb, rb_parent, rb_link);
301     rb_insert_color(&vma->vm_rb, &mm->mm_rb);
302 }
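For reference, the generic insertion idiom from <linux/rbtree.h> that __vma_link_rb() depends on is sketched below. It is a hedged sketch: the struct and its key are made up for illustration, but the walk-then-link pattern is the same one find_vma_prepare() uses to produce rb_link and rb_parent.

        struct my_node {
                unsigned long   key;
                rb_node_t       rb;
        };

        static void my_insert(rb_root_t *root, struct my_node *new)
        {
                rb_node_t **link = &root->rb_node, *parent = NULL;

                while (*link) {
                        struct my_node *entry;

                        parent = *link;
                        entry = rb_entry(parent, struct my_node, rb);
                        if (new->key < entry->key)
                                link = &parent->rb_left;
                        else
                                link = &parent->rb_right;
                }
                rb_link_node(&new->rb, parent, link);   /* splice into the tree */
                rb_insert_color(&new->rb, root);        /* rebalance and recolour */
        }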

D.2.2.7  Function: __vma_link_file

Source: mm/mmap.c

This function links the VMA into a linked list of shared file mappings.

304 static inline void __vma_link_file(struct vm_area_struct * vma)
305 {
306     struct file * file;
307 
308     file = vma->vm_file;
309     if (file) {
310         struct inode * inode = file->f_dentry->d_inode;
311         struct address_space *mapping = inode->i_mapping;
312         struct vm_area_struct **head;
313 
314         if (vma->vm_flags & VM_DENYWRITE)
315             atomic_dec(&inode->i_writecount);
316 
317         head = &mapping->i_mmap;
318         if (vma->vm_flags & VM_SHARED)
319             head = &mapping->i_mmap_shared;
320     
321         /* insert vma into inode's share list */
322         if((vma->vm_next_share = *head) != NULL)
323             (*head)->vm_pprev_share = &vma->vm_next_share;
324         *head = vma;
325         vma->vm_pprev_share = head;
326     }
327 }
309Check to see if this VMA has a shared file mapping. If it does not, this function has nothing more to do
310-312Extract the relevant information about the mapping from the VMA
314-315If this mapping is not allowed to write even if the permissions are ok for writing, decrement the i_writecount field. A negative value to this field indicates that the file is memory mapped and may not be written to. Efforts to open the file for writing will now fail
317-319Check to make sure this is a shared mapping
322-325Insert the VMA into the shared mapping linked list
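These share lists are what allow code such as the truncation and writeback paths to find every VMA mapping a given file. A hedged sketch of the traversal idiom follows; the function is hypothetical and the list would normally be walked with the mapping locked, as lock_vma_mappings() does in vma_link() above.

        static void for_each_shared_vma(struct address_space *mapping)
        {
                struct vm_area_struct *vma;

                for (vma = mapping->i_mmap_shared; vma; vma = vma->vm_next_share) {
                        /* operate on each VMA that maps this file MAP_SHARED */
                }
        }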

D.2.3  Merging Contiguous Regions

D.2.3.1  Function: vma_merge

Source: mm/mmap.c

This function checks to see if a region pointed to by prev may be expanded forwards to cover the area from addr to end instead of allocating a new VMA. If it cannot, the VMA ahead is checked to see if it can be expanded backwards instead.

350 static int vma_merge(struct mm_struct * mm, 
                 struct vm_area_struct * prev,
351                rb_node_t * rb_parent, 
                 unsigned long addr, unsigned long end, 
                 unsigned long vm_flags)
352 {
353     spinlock_t * lock = &mm->page_table_lock;
354     if (!prev) {
355         prev = rb_entry(rb_parent, struct vm_area_struct, vm_rb);
356         goto merge_next;
357     }
350The parameters are as follows;
mm The mm the VMAs belong to
prev The VMA before the address we are interested in
rb_parent The parent RB node as returned by find_vma_prepare()
addr The starting address of the region to be merged
end The end of the region to be merged
vm_flags The permission flags of the region to be merged
353This is the lock to the mm
354-357If prev is not passed in, it is taken to mean that the VMA being tested for merging is in front of the region from addr to end. The entry for that VMA is extracted from the rb_parent
358     if (prev->vm_end == addr && can_vma_merge(prev, vm_flags)) {
359         struct vm_area_struct * next;
360 
361         spin_lock(lock);
362         prev->vm_end = end;
363         next = prev->vm_next;
364         if (next && prev->vm_end == next->vm_start &&
                   can_vma_merge(next, vm_flags)) {
365             prev->vm_end = next->vm_end;
366             __vma_unlink(mm, next, prev);
367             spin_unlock(lock);
368 
369             mm->map_count--;
370             kmem_cache_free(vm_area_cachep, next);
371             return 1;
372         }
373         spin_unlock(lock);
374         return 1;
375     }
376 
377     prev = prev->vm_next;
378     if (prev) {
379  merge_next:
380         if (!can_vma_merge(prev, vm_flags))
381             return 0;
382         if (end == prev->vm_start) {
383             spin_lock(lock);
384             prev->vm_start = addr;
385             spin_unlock(lock);
386             return 1;
387         }
388     }
389 
390     return 0;
391 }
358-375Check to see if the region pointed to by prev may be expanded to cover the current region
358The function can_vma_merge() checks the permissions of prev with those in vm_flags and that the VMA has no file mappings (i.e. it is anonymous). If it is true, the area at prev may be expanded
361Lock the mm
362Expand the end of the VMA region (vm_end) to the end of the new mapping (end)
363next is now the VMA in front of the newly expanded VMA
364Check if the expanded region can be merged with the VMA in front of it
365If it can, continue to expand the region to cover the next VMA
366As a VMA has been merged, one region is now defunct and may be unlinked
367No further adjustments are made to the mm struct so the lock is released
369There is one less mapped region to reduce the map_count
370Delete the struct describing the merged VMA
371Return success
377If this line is reached it means the region pointed to by prev could not be expanded forward so a check is made to see if the region ahead can be merged backwards instead
382-388Same idea as the above block except instead of adjusting vm_end to cover end, vm_start is expanded to cover addr
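The effect of vma_merge() is visible from userspace: mapping anonymous memory adjacent to an existing anonymous region with identical flags normally results in a single VMA rather than two. The following standalone program, an illustration only, punches a hole in a mapping and fills it again so the merge can be observed in /proc/<pid>/maps:

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                size_t len = 4096;

                /* Create a two page anonymous mapping, punch out the second
                 * page, then map it again with identical flags. */
                char *a = mmap(NULL, 2 * len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (a == MAP_FAILED)
                        return 1;
                munmap(a + len, len);
                mmap(a + len, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

                printf("check /proc/%d/maps: %p-%p should appear as one region\n",
                       getpid(), (void *)a, (void *)(a + 2 * len));
                getchar();      /* pause so the maps file can be inspected */
                return 0;
        }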

D.2.3.2  Function: can_vma_merge

Source: include/linux/mm.h

This trivial function checks to see if the permissions of the supplied VMA match the permissions in vm_flags

582 static inline int can_vma_merge(struct vm_area_struct * vma, 
                        unsigned long vm_flags)
583 {
584     if (!vma->vm_file && vma->vm_flags == vm_flags)
585         return 1;
586     else
587         return 0;
588 }
584Self explanatory. Return true if there is no file/device mapping (i.e. it is anonymous) and the VMA flags for both regions match

D.2.4  Remapping and Moving a Memory Region

D.2.4.1  Function: sys_mremap

Source: mm/mremap.c

The call graph for this function is shown in Figure 4.7. This is the system service call to remap a memory region

347 asmlinkage unsigned long sys_mremap(unsigned long addr,
348     unsigned long old_len, unsigned long new_len,
349     unsigned long flags, unsigned long new_addr)
350 {
351     unsigned long ret;
352 
353     down_write(&current->mm->mmap_sem);
354     ret = do_mremap(addr, old_len, new_len, flags, new_addr);
355     up_write(&current->mm->mmap_sem);
356     return ret;
357 }
347-349The parameters are the same as those described in the mremap() man page
353Acquire the mm semaphore
354do_mremap()(See Section D.2.4.2) is the top level function for remapping a region
355Release the mm semaphore
356Return the status of the remapping
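From userspace, this service is reached through the mremap() library function. A short standalone example, not from the book, that grows an anonymous mapping and allows the kernel to move it if it cannot be expanded in place:

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
                unsigned long old_len = 4096, new_len = 4 * 4096;
                void *addr, *new_addr;

                addr = mmap(NULL, old_len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (addr == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                new_addr = mremap(addr, old_len, new_len, MREMAP_MAYMOVE);
                if (new_addr == MAP_FAILED) {
                        perror("mremap");
                        return 1;
                }
                printf("%p (%lu bytes) -> %p (%lu bytes)\n",
                       addr, old_len, new_addr, new_len);
                return 0;
        }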

D.2.4.2  Function: do_mremap

Source: mm/mremap.c

This function does most of the actual “work” required to remap, resize and move a memory region. It is quite long but can be broken up into distinct parts which will be dealt with separately here. Broadly speaking, the tasks are: check the usage flags and page align the lengths; handle the MREMAP_FIXED case where the region must move to a fixed location; shrink the region if the new length is smaller; perform sanity checks before any growing is attempted; grow the region in place if possible; and otherwise find a new location and move the region there with move_vma().

219 unsigned long do_mremap(unsigned long addr,
220     unsigned long old_len, unsigned long new_len,
221     unsigned long flags, unsigned long new_addr)
222 {
223     struct vm_area_struct *vma;
224     unsigned long ret = -EINVAL;
225 
226     if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
227         goto out;
228 
229     if (addr & ~PAGE_MASK)
230         goto out;
231 
232     old_len = PAGE_ALIGN(old_len);
233     new_len = PAGE_ALIGN(new_len);
234 
219The parameters of the function are
addr is the old starting address
old_len is the old region length
new_len is the new region length
flags is the option flags passed. If MREMAP_MAYMOVE is specified, it means that the region is allowed to move if there is not enough linear address space at the current address. If MREMAP_FIXED is specified, it means that the whole region is to move to the specified new_addr with the new length. The area from new_addr to new_addr+new_len will be unmapped with do_munmap().
new_addr is the address of the new region if it is moved
224At this point, the default return is -EINVAL for invalid arguments
226-227Make sure flags other than the two allowed flags are not used
229-230The address passed in must be page aligned
232-233Page align the passed region lengths
236     if (flags & MREMAP_FIXED) {
237         if (new_addr & ~PAGE_MASK)
238             goto out;
239         if (!(flags & MREMAP_MAYMOVE))
240             goto out;
241 
242         if (new_len > TASK_SIZE || new_addr > TASK_SIZE - new_len)
243             goto out;
244 
245         /* Check if the location we're moving into overlaps the
246          * old location at all, and fail if it does.
247          */
248         if ((new_addr <= addr) && (new_addr+new_len) > addr)
249             goto out;
250 
251         if ((addr <= new_addr) && (addr+old_len) > new_addr)
252             goto out;
253 
254         do_munmap(current->mm, new_addr, new_len);
255     }

This block handles the condition where the region location is fixed and must be fully moved. It ensures the area being moved to is safe and definitely unmapped.

236MREMAP_FIXED is the flag which indicates the location is fixed
237-238The specified new_addr must be page aligned
239-240If MREMAP_FIXED is specified, then the MAYMOVE flag must be used as well
242-243Make sure the resized region does not exceed TASK_SIZE
248-252Just as the comments indicate, the two regions being used for the move may not overlap
254Unmap the region that is about to be used. It is presumed the caller ensures that the region is not in use for anything important
261     ret = addr;
262     if (old_len >= new_len) {
263         do_munmap(current->mm, addr+new_len, old_len - new_len);
264         if (!(flags & MREMAP_FIXED) || (new_addr == addr))
265             goto out;
266     }
261At this point, the address of the resized region is the return value
262If the old length is larger than the new length, then the region is shrinking
263Unmap the unused region
264-265If the region is not to be moved, either because MREMAP_FIXED is not used or the new address matches the old address, goto out which will return the address
271     ret = -EFAULT;
272     vma = find_vma(current->mm, addr);
273     if (!vma || vma->vm_start > addr)
274         goto out;
275     /* We can't remap across vm area boundaries */
276     if (old_len > vma->vm_end - addr)
277         goto out;
278     if (vma->vm_flags & VM_DONTEXPAND) {
279         if (new_len > old_len)
280             goto out;
281     }
282     if (vma->vm_flags & VM_LOCKED) {
283         unsigned long locked = current->mm->locked_vm << PAGE_SHIFT;
284         locked += new_len - old_len;
285         ret = -EAGAIN;
286         if (locked > current->rlim[RLIMIT_MEMLOCK].rlim_cur)
287             goto out;
288     }
289     ret = -ENOMEM;
290     if ((current->mm->total_vm << PAGE_SHIFT) + (new_len - old_len)
291         > current->rlim[RLIMIT_AS].rlim_cur)
292         goto out;
293     /* Private writable mapping? Check memory availability.. */
294     if ((vma->vm_flags & (VM_SHARED | VM_WRITE)) == VM_WRITE &&
295         !(flags & MAP_NORESERVE) &&
296         !vm_enough_memory((new_len - old_len) >> PAGE_SHIFT))
297         goto out;

Do a number of checks to make sure it is safe to grow or move the region

271At this point, the default action is to return -EFAULT causing a segmentation fault as the ranges of memory being used are invalid
272Find the VMA responsible for the requested address
273If the returned VMA is not responsible for this address, then an invalid address was used so return a fault
276-277If the old_len passed in exceeds the length of the VMA, it means the user is trying to remap multiple regions which is not allowed
278-281If the VMA has been explicitly marked as non-resizable, raise a fault
282-287If the pages for this VMA must be locked in memory, recalculate the number of locked pages that will be kept in memory. If the number of pages exceeds the ulimit set for this resource, return -EAGAIN indicating to the caller that the region is locked and cannot be resized
289The default return at this point is to indicate there is not enough memory
290-292Ensure that the user will not exceed their allowed allocation of memory
294-297Ensure that there is enough memory to satisfy the request after the resizing with vm_enough_memory()(See Section M.1.1)
302     if (old_len == vma->vm_end - addr &&
303         !((flags & MREMAP_FIXED) && (addr != new_addr)) &&
304         (old_len != new_len || !(flags & MREMAP_MAYMOVE))) {
305         unsigned long max_addr = TASK_SIZE;
306         if (vma->vm_next)
307             max_addr = vma->vm_next->vm_start;
308         /* can we just expand the current mapping? */
309         if (max_addr - addr >= new_len) {
310             int pages = (new_len - old_len) >> PAGE_SHIFT;
311             spin_lock(&vma->vm_mm->page_table_lock);
312             vma->vm_end = addr + new_len;
313             spin_unlock(&vma->vm_mm->page_table_lock);
314             current->mm->total_vm += pages;
315             if (vma->vm_flags & VM_LOCKED) {
316                 current->mm->locked_vm += pages;
317                 make_pages_present(addr + old_len,
318                            addr + new_len);
319             }
320             ret = addr;
321             goto out;
322         }
323     }

Handle the case where the region is being expanded and cannot be moved

302If it is the full region that is being remapped and ...
303The region is definitely not being moved and ...
304The region is being expanded and cannot be moved then ...
305Set the maximum address that can be used to TASK_SIZE, 3GiB on an x86
306-307If there is another region, set the max address to be the start of the next region
309-322Only allow the expansion if the newly sized region does not overlap with the next VMA
310Calculate the number of extra pages that will be required
311Lock the mm spinlock
312Expand the VMA
313Free the mm spinlock
314Update the statistics for the mm
315-319If the pages for this region are locked in memory, make them present now
320-321Return the address of the resized region
329     ret = -ENOMEM;
330     if (flags & MREMAP_MAYMOVE) {
331         if (!(flags & MREMAP_FIXED)) {
332             unsigned long map_flags = 0;
333             if (vma->vm_flags & VM_SHARED)
334                 map_flags |= MAP_SHARED;
335 
336             new_addr = get_unmapped_area(vma->vm_file, 0,
                     new_len, vma->vm_pgoff, map_flags);
337             ret = new_addr;
338             if (new_addr & ~PAGE_MASK)
339                 goto out;
340         }
341         ret = move_vma(vma, addr, old_len, new_len, new_addr);
342     }
343 out:
344     return ret;
345 }

To expand the region, a new one has to be allocated and the old one moved to it

329The default action is to return saying no memory is available
330Check to make sure the region is allowed to move
331If MREMAP_FIXED is not specified, it means the new location was not supplied so one must be found
333-334Preserve the MAP_SHARED option
336Find an unmapped region of memory large enough for the expansion
337The return value is the address of the new region
338-339For the returned address to be not page aligned, get_unmapped_area() would need to be broken. This could possibly be the case with a buggy device driver implementing get_unmapped_area() incorrectly
341Call move_vma to move the region
343-344Return the address if successful and the error code otherwise

D.2.4.3  Function: move_vma

Source: mm/mremap.c

The call graph for this function is shown in Figure 4.8. This function is responsible for moving all the page table entries from one VMA to another region. If necessary a new VMA will be allocated for the region being moved to. Just like the function above, it is very long but may be broken up into distinct parts: checking if the new location can be merged with a neighbouring VMA, allocating a new VMA if it cannot, and finally moving the page table entries and updating statistics.

125 static inline unsigned long move_vma(struct vm_area_struct * vma,
126     unsigned long addr, unsigned long old_len, unsigned long new_len,
127     unsigned long new_addr)
128 {
129     struct mm_struct * mm = vma->vm_mm;
130     struct vm_area_struct * new_vma, * next, * prev;
131     int allocated_vma;
132 
133     new_vma = NULL;
134     next = find_vma_prev(mm, new_addr, &prev);
125-127The parameters are
vma The VMA that the address being moved belongs to
addr The starting address of the moving region
old_len The old length of the region to move
new_len The new length of the region moved
new_addr The new address to relocate to
134Find the VMA preceding the address being moved to, indicated by prev, and return the region after the new mapping as next
135     if (next) {
136         if (prev && prev->vm_end == new_addr &&
137             can_vma_merge(prev, vma->vm_flags) && 
              !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
138             spin_lock(&mm->page_table_lock);
139             prev->vm_end = new_addr + new_len;
140             spin_unlock(&mm->page_table_lock);
141             new_vma = prev;
142             if (next != prev->vm_next)
143                 BUG();
144             if (prev->vm_end == next->vm_start &&
                can_vma_merge(next, prev->vm_flags)) {
145                 spin_lock(&mm->page_table_lock);
146                 prev->vm_end = next->vm_end;
147                 __vma_unlink(mm, next, prev);
148                 spin_unlock(&mm->page_table_lock);
149 
150                 mm->map_count--;
151                 kmem_cache_free(vm_area_cachep, next);
152             }
153         } else if (next->vm_start == new_addr + new_len &&
154                can_vma_merge(next, vma->vm_flags) &&
                 !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
155             spin_lock(&mm->page_table_lock);
156             next->vm_start = new_addr;
157             spin_unlock(&mm->page_table_lock);
158             new_vma = next;
159         }
160     } else {

In this block, the new location is between two existing VMAs. Checks are made to see if the preceding region can be expanded to cover the new mapping and then if it can be expanded to cover the next VMA as well. If it cannot be expanded, the next region is checked to see if it can be expanded backwards.

136-137If the preceding region touches the address to be mapped to and may be merged then enter this block which will attempt to expand regions
138Lock the mm
139Expand the preceding region to cover the new location
140Unlock the mm
141The new VMA is now the preceding VMA which was just expanded
142-143Make sure the VMA linked list is intact. It would require a device driver with severe brain damage to cause this situation to occur
144Check if the region can be expanded forward to encompass the next region
145If it can, then lock the mm
146Expand the VMA further to cover the next VMA
147There is now an extra VMA so unlink it
148Unlock the mm
150There is one less mapping now so update the map_count
151Free the memory used by the memory mapping
153Else the prev region could not be expanded forward so check if the region pointed to by next may be expanded backwards to cover the new mapping instead
155If it can, lock the mm
156Expand the mapping backwards
157Unlock the mm
158The VMA representing the new mapping is now next
161         prev = find_vma(mm, new_addr-1);
162         if (prev && prev->vm_end == new_addr &&
163             can_vma_merge(prev, vma->vm_flags) && !vma->vm_file &&
                    !(vma->vm_flags & VM_SHARED)) {
164             spin_lock(&mm->page_table_lock);
165             prev->vm_end = new_addr + new_len;
166             spin_unlock(&mm->page_table_lock);
167             new_vma = prev;
168         }
169     }

This block is for the case where the newly mapped region is the last VMA (next is NULL) so a check is made to see if the preceding region can be expanded.

161Get the previously mapped region
162-163Check if the regions may be mapped
164Lock the mm
165Expand the preceding region to cover the new mapping
166Unlock the mm
167The VMA representing the new mapping is now prev
170 
171     allocated_vma = 0;
172     if (!new_vma) {
173         new_vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
174         if (!new_vma)
175             goto out;
176         allocated_vma = 1;
177     }
178 
171Set a flag recording that a new VMA has not been allocated yet
172If a VMA has not been expanded to cover the new mapping then...
173Allocate a new VMA from the slab allocator
174-175If it could not be allocated, goto out to return failure
176Set the flag indicated a new VMA was allocated
179     if (!move_page_tables(current->mm, new_addr, addr, old_len)) {
180         unsigned long vm_locked = vma->vm_flags & VM_LOCKED;
181
182         if (allocated_vma) {
183             *new_vma = *vma;
184             new_vma->vm_start = new_addr;
185             new_vma->vm_end = new_addr+new_len;
186             new_vma->vm_pgoff += 
                     (addr-vma->vm_start) >> PAGE_SHIFT;
187             new_vma->vm_raend = 0;
188             if (new_vma->vm_file)
189                 get_file(new_vma->vm_file);
190             if (new_vma->vm_ops && new_vma->vm_ops->open)
191                 new_vma->vm_ops->open(new_vma);
192             insert_vm_struct(current->mm, new_vma);
193         }

194         do_munmap(current->mm, addr, old_len);

197         current->mm->total_vm += new_len >> PAGE_SHIFT;
198         if (new_vma->vm_flags & VM_LOCKED) {
199             current->mm->locked_vm += new_len >> PAGE_SHIFT;
200             make_pages_present(new_vma->vm_start,
201                        new_vma->vm_end);
202         }
203         return new_addr;
204     }
205     if (allocated_vma)
206         kmem_cache_free(vm_area_cachep, new_vma);
207  out:
208     return -ENOMEM;
209 }
179move_page_tables()(See Section D.2.4.6) is responsible for copying all the page table entries. It returns 0 on success
182-193If a new VMA was allocated, fill in all the relevant details, including the file/device entries and insert it into the various VMA linked lists with insert_vm_struct()(See Section D.2.2.1)
194Unmap the old region as it is no longer required
197Update the total_vm size for this process. The size of the old region is not important as it is handled within do_munmap()
198-202If the VMA has the VM_LOCKED flag, all the pages within the region are made present with make_pages_present()
203Return the address of the new region
205-206This is the error path. If a VMA was allocated, delete it
208Return an out of memory error

D.2.4.4  Function: make_pages_present

Source: mm/memory.c

This function makes all pages between addr and end present. It assumes that the two addresses are within the one VMA.

1460 int make_pages_present(unsigned long addr, unsigned long end)
1461 {
1462     int ret, len, write;
1463     struct vm_area_struct * vma;
1464 
1465     vma = find_vma(current->mm, addr);
1466     write = (vma->vm_flags & VM_WRITE) != 0;
1467     if (addr >= end)
1468         BUG();
1469     if (end > vma->vm_end)
1470         BUG();
1471     len = (end+PAGE_SIZE-1)/PAGE_SIZE-addr/PAGE_SIZE;
1472     ret = get_user_pages(current, current->mm, addr,
1473                 len, write, 0, NULL, NULL);
1474     return ret == len ? 0 : -1;
1475 }
1465Find the VMA with find_vma()(See Section D.3.1.1) that contains the starting address
1466Record if write-access is allowed in write
1467-1468If the starting address is after the end address, then BUG()
1469-1470If the range spans more than one VMA it is a bug
1471Calculate the length of the region to fault in
1472Call get_user_pages() to fault in all the pages in the requested region. It returns the number of pages that were faulted in
1474Return true if all the requested pages were successfully faulted in

D.2.4.5  Function: get_user_pages

Source: mm/memory.c

This function is used to fault in user pages and may be used to fault in pages belonging to another process, which is required by ptrace() for example.

454 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, 
                       unsigned long start,
455                    int len, int write, int force, struct page **pages, 
                       struct vm_area_struct **vmas)
456 {
457     int i;
458     unsigned int flags;
459 
460     /*
461      * Require read or write permissions.
462      * If 'force' is set, we only require the "MAY" flags.
463      */
464     flags =  write ? (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
465     flags &= force ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
466     i = 0;
467 
454The parameters are:
tsk is the process that pages are being faulted for
mm is the mm_struct managing the address space being faulted
start is where to start faulting
len is the length of the region, in pages, to fault
write indicates if the pages are being faulted for writing
force indicates that the pages should be faulted even if the region only has the VM_MAYREAD or VM_MAYWRITE flags
pages is an array of struct pages which may be NULL. If supplied, the array will be filled with struct pages that were faulted in
vmas is similar to the pages array. If supplied, it will be filled with VMAs that were affected by the faults
464Set the required flags to VM_WRITE and VM_MAYWRITE flags if the parameter write is set to 1. Otherwise use the read equivalents
465If force is specified, only require the MAY flags
468     do {
469         struct vm_area_struct * vma;
470 
471         vma = find_extend_vma(mm, start);
472 
473         if ( !vma || 
                 (pages && vma->vm_flags & VM_IO) || 
                 !(flags & vma->vm_flags) )
474             return i ? : -EFAULT;
475 
476         spin_lock(&mm->page_table_lock);
477         do {
478             struct page *map;
479             while (!(map = follow_page(mm, start, write))) {
480                 spin_unlock(&mm->page_table_lock);
481                 switch (handle_mm_fault(mm, vma, start, write)) {
482                 case 1:
483                     tsk->min_flt++;
484                     break;
485                 case 2:
486                     tsk->maj_flt++;
487                     break;
488                 case 0:
489                     if (i) return i;
490                     return -EFAULT;
491                 default:
492                     if (i) return i;
493                     return -ENOMEM;
494                 }
495                 spin_lock(&mm->page_table_lock);
496             }
497             if (pages) {
498                 pages[i] = get_page_map(map);
499                 /* FIXME: call the correct function,
500                  * depending on the type of the found page
501                  */
502                 if (!pages[i])
503                     goto bad_page;
504                 page_cache_get(pages[i]);
505             }
506             if (vmas)
507                 vmas[i] = vma;
508             i++;
509             start += PAGE_SIZE;
510             len--;
511         } while(len && start < vma->vm_end);
512         spin_unlock(&mm->page_table_lock);
513     } while(len);
514 out:
515     return i;
468-513This outer loop will move through every VMA affected by the faults
471Find the VMA affected by the current value of start. This variable is incremented in PAGE_SIZEd strides
473If a VMA does not exist for the address, or the caller has requested struct pages for a region that is IO mapped (and therefore not backed by physical memory), or the VMA does not have the required flags, then return -EFAULT
476Lock the page table spinlock
479-496follow_page()(See Section C.2.1) walks the page tables and returns the struct page representing the frame mapped at start. This loop will only be entered if the PTE is not present and will keep looping until the PTE is known to be present with the page table spinlock held
480Unlock the page table spinlock as handle_mm_fault() is likely to sleep
481If the page is not present, fault it in with handle_mm_fault()(See Section D.5.3.1)
482-487Update the task_struct statistics, indicating whether a major or minor fault occurred
488-490If the faulting address is invalid, return
491-493If the system is out of memory, return -ENOMEM
495Relock the page tables. The loop will check to make sure the page is actually present
497-505If the caller requested it, populate the pages array with struct pages affected by this function. Each struct will have a reference to it taken with page_cache_get()
506-507Similarly, record VMAs affected
508Increment i which is a counter for the number of pages present in the requested region
509Increment start in a page-sized stride
510Decrement the number of pages that must be faulted in
511Keep moving through the VMAs until the requested pages have been faulted in
512Release the page table spinlock
515Return the number of pages known to be present in the region
516 
517     /*
518      * We found an invalid page in the VMA.  Release all we have
519      * so far and fail.
520      */
521 bad_page:
522     spin_unlock(&mm->page_table_lock);
523     while (i--)
524         page_cache_release(pages[i]);
525     i = -EFAULT;
526     goto out;
527 }
521This will only be reached if a struct page is found which represents a non-existent page frame
523-524If one is found, release references to all pages stored in the pages array
525-526Return -EFAULT
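
Before moving on, the sketch below shows how a caller might use get_user_pages() to pin a single page of a user buffer and release it again. It is only an illustrative sketch based on the prototype described above; the helper name pin_one_user_page() and the error handling are not part of the kernel.

/* Illustrative sketch only: pin one user page for writing, then release
 * the reference that get_user_pages() took on our behalf. */
static int pin_one_user_page(unsigned long uaddr)
{
        struct page *page;
        int ret;

        down_read(&current->mm->mmap_sem);
        ret = get_user_pages(current, current->mm,
                             uaddr & PAGE_MASK,   /* page aligned start  */
                             1,                   /* length in pages     */
                             1,                   /* fault for writing   */
                             0,                   /* do not force        */
                             &page, NULL);
        up_read(&current->mm->mmap_sem);

        if (ret != 1)
                return -EFAULT;

        /* ... the frame is now pinned and safe to access ... */

        page_cache_release(page);  /* drop the reference taken above */
        return 0;
}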

D.2.4.6  Function: move_page_tables

Source: mm/mremap.c

The call graph for this function is shown in Figure 4.9. This function is responsible for copying all the page table entries from the region pointed to by old_addr to new_addr. It works by literally copying page table entries one at a time. When it is finished, it deletes all the entries from the old area. This is not the most efficient way to perform the operation, but error recovery is very easy.

 90 static int move_page_tables(struct mm_struct * mm,
 91     unsigned long new_addr, unsigned long old_addr, 
        unsigned long len)
 92 {
 93     unsigned long offset = len;
 94 
 95     flush_cache_range(mm, old_addr, old_addr + len);
 96 
102     while (offset) {
103         offset -= PAGE_SIZE;
104         if (move_one_page(mm, old_addr + offset, new_addr +
                    offset))
105             goto oops_we_failed;
106     }
107     flush_tlb_range(mm, old_addr, old_addr + len);
108     return 0;
109 
117 oops_we_failed:
118     flush_cache_range(mm, new_addr, new_addr + len);
119     while ((offset += PAGE_SIZE) < len)
120         move_one_page(mm, new_addr + offset, old_addr + offset);
121     zap_page_range(mm, new_addr, len);
122     return -1;
123 }
90The parameters are the mm for the process, the new location, the old location and the length of the region to move entries for
95flush_cache_range() will flush all CPU caches for this range. It must be called first as some architectures, notably the SPARC, require that a virtual to physical mapping exist before flushing the TLB
102-106This loops through each page in the region and moves the PTE with move_one_page()(See Section D.2.4.7). This translates to a lot of page table walking and could be performed much better but it is a rare operation
107Flush the TLB for the old region
108Return success
118-120This block moves all the PTEs back. A flush_tlb_range() is not necessary as there is no way the region could have been used yet so no TLB entries should exist
121Zap any pages that were allocated for the move
122Return failure

D.2.4.7  Function: move_one_page

Source: mm/mremap.c

This function is responsible for acquiring the spinlock before finding the correct PTE with get_one_pte() and copying it with copy_one_pte().

 77 static int move_one_page(struct mm_struct *mm, 
                 unsigned long old_addr, unsigned long new_addr)
 78 {
 79     int error = 0;
 80     pte_t * src;
 81 
 82     spin_lock(&mm->page_table_lock);
 83     src = get_one_pte(mm, old_addr);
 84     if (src)
 85         error = copy_one_pte(mm, src, alloc_one_pte(mm, new_addr));
 86     spin_unlock(&mm->page_table_lock);
 87     return error;
 88 }
82Acquire the mm lock
83Call get_one_pte()(See Section D.2.4.8) which walks the page tables to get the correct PTE
84-85If the PTE exists, allocate a PTE for the destination and copy the PTEs with copy_one_pte()(See Section D.2.4.10)
86Release the lock
87Return whatever copy_one_pte() returned. It will only return an error if alloc_one_pte()(See Section D.2.4.9) failed on line 85

D.2.4.8  Function: get_one_pte

Source: mm/mremap.c

This is a very simple page table walk.

 18 static inline pte_t *get_one_pte(struct mm_struct *mm, 
                                     unsigned long addr)
 19 {
 20     pgd_t * pgd;
 21     pmd_t * pmd;
 22     pte_t * pte = NULL;
 23 
 24     pgd = pgd_offset(mm, addr);
 25     if (pgd_none(*pgd))
 26         goto end;
 27     if (pgd_bad(*pgd)) {
 28         pgd_ERROR(*pgd);
 29         pgd_clear(pgd);
 30         goto end;
 31     }
 32 
 33     pmd = pmd_offset(pgd, addr);
 34     if (pmd_none(*pmd))
 35         goto end;
 36     if (pmd_bad(*pmd)) {
 37         pmd_ERROR(*pmd);
 38         pmd_clear(pmd);
 39         goto end;
 40     }
 41 
 42     pte = pte_offset(pmd, addr);
 43     if (pte_none(*pte))
 44         pte = NULL;
 45 end:
 46     return pte;
 47 }
24Get the PGD for this address
25-26If no PGD exists, return NULL as no PTE will exist either
27-31If the PGD is bad, mark that an error occurred in the region, clear its contents and return NULL
33-40Acquire the correct PMD in the same fashion as for the PGD
42Acquire the PTE so it may be returned if it exists

D.2.4.9  Function: alloc_one_pte

Source: mm/mremap.c

Trivial function to allocate what is necessary for one PTE in a region.

 49 static inline pte_t *alloc_one_pte(struct mm_struct *mm, 
                     unsigned long addr)
 50 {
 51     pmd_t * pmd;
 52     pte_t * pte = NULL;
 53 
 54     pmd = pmd_alloc(mm, pgd_offset(mm, addr), addr);
 55     if (pmd)
 56         pte = pte_alloc(mm, pmd, addr);
 57     return pte;
 58 }
54If a PMD entry does not exist, allocate it
55-56If the PMD exists, allocate a PTE entry. The check to make sure it succeeded is performed later in the function copy_one_pte()

D.2.4.10  Function: copy_one_pte

Source: mm/mremap.c

Copies the contents of one PTE to another.

 60 static inline int copy_one_pte(struct mm_struct *mm, 
                   pte_t * src, pte_t * dst)
 61 {
 62     int error = 0;
 63     pte_t pte;
 64 
 65     if (!pte_none(*src)) {
 66         pte = ptep_get_and_clear(src);
 67         if (!dst) {
 68             /* No dest?  We must put it back. */
 69             dst = src;
 70             error++;
 71         }
 72         set_pte(dst, pte);
 73     }
 74     return error;
 75 }
65If the source PTE does not exist, just return 0 to say the copy was successful
66Get the PTE and remove it from its old location
67-71If the dst does not exist, it means the call to alloc_one_pte() failed and the copy operation has failed and must be aborted
72Move the PTE to its new location
74Return an error if one occurred

D.2.5  Deleting a memory region

D.2.5.1  Function: do_munmap

Source: mm/mmap.c

The call graph for this function is shown in Figure 4.11. This function is responsible for unmapping a region. If necessary, the unmapping can span multiple VMAs and it can partially unmap one if necessary. Hence the full unmapping operation is divided into two major operations. This function is responsible for finding what VMAs are affected and unmap_fixup() is responsible for fixing up the remaining VMAs.

This function is divided up into a number of small sections that will be dealt with in turn.

924 int do_munmap(struct mm_struct *mm, unsigned long addr, 
                  size_t len)
925 {
926     struct vm_area_struct *mpnt, *prev, **npp, *free, *extra;
927 
928     if ((addr & ~PAGE_MASK) || addr > TASK_SIZE || 
                     len  > TASK_SIZE-addr)
929         return -EINVAL;
930 
931     if ((len = PAGE_ALIGN(len)) == 0)
932         return -EINVAL;
933 
939     mpnt = find_vma_prev(mm, addr, &prev);
940     if (!mpnt)
941         return 0;
942     /* we have  addr < mpnt->vm_end  */
943 
944     if (mpnt->vm_start >= addr+len)
945         return 0;
946 
948     if ((mpnt->vm_start < addr && mpnt->vm_end > addr+len)
949         && mm->map_count >= max_map_count)
950         return -ENOMEM;
951 
956     extra = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
957     if (!extra)
958         return -ENOMEM;
924The parameters are as follows;
mmThe mm for the process performing the unmap operation
addrThe starting address of the region to unmap
lenThe length of the region
928-929Ensure the address is page aligned and that the area to be unmapped is not in the kernel virtual address space
931-932Make sure the region size to unmap is page aligned
939Find the VMA that contains the starting address and the preceding VMA so it can be easily unlinked later
940-941If no mpnt was returned, it means the address must be past the last used VMA so the address space is unused, just return
944-945If the returned VMA starts past the region we are trying to unmap, then the region is unused, just return
948-950The first part of the check sees if the VMA is just being partially unmapped. If it is, another VMA will be created later to deal with the region being broken in two, so the map_count has to be checked to make sure it is not too large
956-958In case a new mapping is required, it is allocated now as later it will be much more difficult to back out in event of an error
960     npp = (prev ? &prev->vm_next : &mm->mmap);
961     free = NULL;
962     spin_lock(&mm->page_table_lock);
963     for ( ; mpnt && mpnt->vm_start < addr+len; mpnt = *npp) {
964         *npp = mpnt->vm_next;
965         mpnt->vm_next = free;
966         free = mpnt;
967         rb_erase(&mpnt->vm_rb, &mm->mm_rb);
968     }
969     mm->mmap_cache = NULL;  /* Kill the cache. */
970     spin_unlock(&mm->page_table_lock);

This section takes all the VMAs affected by the unmapping and places them on a separate linked list headed by a variable called free. This makes the fixup of the regions much easier.

960npp points to the location from which the next VMA is linked, either the vm_next field of the preceding VMA or, if there is no preceding VMA, the head of the VMA list in the mm. It is updated as the for loop below walks the list
961free is the head of a linked list of VMAs that are affected by the unmapping
962Lock the mm
963Cycle through the list until the start of the current VMA is past the end of the region to be unmapped
964npp becomes the next VMA in the list
965-966Remove the current VMA from the linear linked list within the mm and place it on a linked list headed by free. The current mpnt becomes the head of the free linked list
967Delete mpnt from the red-black tree
969Remove the cached result in case the last looked up result is one of the regions to be unmapped
970Unlock the mm
971 
972     /* Ok - we have the memory areas we should free on the 
973      * 'free' list, so release them, and unmap the page range..
974      * If the one of the segments is only being partially unmapped,
975      * it will put new vm_area_struct(s) into the address space.
976      * In that case we have to be careful with VM_DENYWRITE.
977      */
978     while ((mpnt = free) != NULL) {
979         unsigned long st, end, size;
980         struct file *file = NULL;
981 
982         free = free->vm_next;
983 
984         st = addr < mpnt->vm_start ? mpnt->vm_start : addr;
985         end = addr+len;
986         end = end > mpnt->vm_end ? mpnt->vm_end : end;
987         size = end - st;
988 
989         if (mpnt->vm_flags & VM_DENYWRITE &&
990             (st != mpnt->vm_start || end != mpnt->vm_end) &&
991             (file = mpnt->vm_file) != NULL) {
992            atomic_dec(&file->f_dentry->d_inode->i_writecount);
993         }
994         remove_shared_vm_struct(mpnt);
995         mm->map_count--;
996 
997         zap_page_range(mm, st, size);
998 
999         /*
1000         * Fix the mapping, and free the old area 
             * if it wasn't reused.
1001         */
1002        extra = unmap_fixup(mm, mpnt, st, size, extra);
1003        if (file)
1004           atomic_inc(&file->f_dentry->d_inode->i_writecount);
1005     }
978Keep stepping through the list until no VMAs are left
982Move free to the next element in the list leaving mpnt as the head about to be removed
984st is the start of the region to be unmapped. If the addr is before the start of the VMA, the starting point is mpnt->vm_start, otherwise it is the supplied address
985-986Calculate the end of the region to unmap in a similar fashion
987Calculate the size of the region to be unmapped in this pass
989-993If the VM_DENYWRITE flag is specified, a hole will be created by this unmapping and a file is mapped, then the i_writecount is decremented. When this field is negative, it counts how many users there are protecting this file from being opened for writing
994Remove the file mapping. If the file is still partially mapped, it will be acquired again during unmap_fixup()(See Section D.2.5.2)
995Reduce the map count
997Remove all pages within this region
1002Call unmap_fixup()(See Section D.2.5.2) to fix up the regions after this one is deleted
1003-1004Increment the writecount to the file as the region has been unmapped. If it was just partially unmapped, this call will simply balance out the decrement at line 992
1006     validate_mm(mm);
1007 
1008     /* Release the extra vma struct if it wasn't used */
1009     if (extra)
1010         kmem_cache_free(vm_area_cachep, extra);
1011 
1012     free_pgtables(mm, prev, addr, addr+len);
1013 
1014     return 0;
1015 }
1006validate_mm() is a debugging function. If enabled, it will ensure the VMA tree for this mm is still valid
1009-1010If the extra VMA was not required, delete it
1012Free all the page tables that were used for the unmapped region
1014Return success
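
For context, the munmap() system call entry is little more than a wrapper around do_munmap() that takes the mmap_sem semaphore for writing. The sketch below approximates that caller; refer to mm/mmap.c for the exact code.

/* Approximate sketch of the sys_munmap() entry point: the caller of
 * do_munmap() is responsible for holding mmap_sem for writing. */
asmlinkage long sys_munmap(unsigned long addr, size_t len)
{
        int ret;
        struct mm_struct *mm = current->mm;

        down_write(&mm->mmap_sem);
        ret = do_munmap(mm, addr, len);
        up_write(&mm->mmap_sem);
        return ret;
}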

D.2.5.2  Function: unmap_fixup

Source: mm/mmap.c

This function fixes up the regions after a block has been unmapped. It is passed a list of VMAs that are affected by the unmapping, the region and length to be unmapped and a spare VMA that may be required to fix up the region if a hole is created. There are four principal cases it handles: the unmapping of the whole region, partial unmapping from the start to somewhere in the middle, partial unmapping from somewhere in the middle to the end and the creation of a hole in the middle of the region. Each case will be taken in turn.

787 static struct vm_area_struct * unmap_fixup(struct mm_struct *mm, 
788     struct vm_area_struct *area, unsigned long addr, size_t len, 
789     struct vm_area_struct *extra)
790 {
791     struct vm_area_struct *mpnt;
792     unsigned long end = addr + len;
793 
794     area->vm_mm->total_vm -= len >> PAGE_SHIFT;
795     if (area->vm_flags & VM_LOCKED)
796         area->vm_mm->locked_vm -= len >> PAGE_SHIFT;
797 

Function preamble.

787The parameters to the function are;
mm is the mm the unmapped region belongs to
area is the head of the linked list of VMAs affected by the unmapping
addr is the starting address of the unmapping
len is the length of the region to be unmapped
extra is a spare VMA passed in for when a hole in the middle is created
792Calculate the end address of the region being unmapped
794Reduce the count of the number of pages used by the process
795-796If the pages were locked in memory, reduce the locked page count
798     /* Unmapping the whole area. */
799     if (addr == area->vm_start && end == area->vm_end) {
800         if (area->vm_ops && area->vm_ops->close)
801             area->vm_ops->close(area);
802         if (area->vm_file)
803             fput(area->vm_file);
804         kmem_cache_free(vm_area_cachep, area);
805         return extra;
806     }

The first, and easiest, case is where the full region is being unmapped

799The full region is unmapped if the addr is the start of the VMA and the end is the end of the VMA. This is interesting because if the unmapping is spanning regions, it is possible the end is beyond the end of the VMA but the whole of this VMA is still being unmapped
800-801If a close operation is supplied by the VMA, call it
802-803If a file or device is mapped, call fput() which decrements the usage count and releases it if the count falls to 0
804Free the memory for the VMA back to the slab allocator
805Return the extra VMA as it was unused
809     if (end == area->vm_end) {
810         /*
811          * here area isn't visible to the semaphore-less readers
812          * so we don't need to update it under the spinlock.
813          */
814         area->vm_end = addr;
815         lock_vma_mappings(area);
816         spin_lock(&mm->page_table_lock);
817     }

Handle the case where the region from the middle of the VMA to the end is being unmapped

814Truncate the VMA back to addr. At this point, the pages for the region have already been freed and the page table entries will be freed later so no further work is required
815If a file/device is being mapped, the lock protecting shared access to it is taken in the function lock_vma_mappings()
816Lock the mm. Later in the function, the remaining VMA will be reinserted into the mm
817     } else if (addr == area->vm_start) {
818         area->vm_pgoff += (end - area->vm_start) >> PAGE_SHIFT;
819         /* same locking considerations of the above case */
820         area->vm_start = end;
821         lock_vma_mappings(area);
822         spin_lock(&mm->page_table_lock);
823     } else {

Handle the case where the VMA is being unmapped from the start to some part in the middle

818Increase the offset within the file/device mapped by the number of pages this unmapping represents
820Move the start of the VMA to the end of the region being unmapped
821-822Lock the file/device and mm as above
823     } else {
825         /* Add end mapping -- leave beginning for below */
826         mpnt = extra;
827         extra = NULL;
828 
829         mpnt->vm_mm = area->vm_mm;
830         mpnt->vm_start = end;
831         mpnt->vm_end = area->vm_end;
832         mpnt->vm_page_prot = area->vm_page_prot;
833         mpnt->vm_flags = area->vm_flags;
834         mpnt->vm_raend = 0;
835         mpnt->vm_ops = area->vm_ops;
836         mpnt->vm_pgoff = area->vm_pgoff + 
                     ((end - area->vm_start) >> PAGE_SHIFT);
837         mpnt->vm_file = area->vm_file;
838         mpnt->vm_private_data = area->vm_private_data;
839         if (mpnt->vm_file)
840             get_file(mpnt->vm_file);
841         if (mpnt->vm_ops && mpnt->vm_ops->open)
842             mpnt->vm_ops->open(mpnt);
843         area->vm_end = addr;    /* Truncate area */
844 
845         /* Because mpnt->vm_file == area->vm_file this locks
846          * things correctly.
847          */
848         lock_vma_mappings(area);
849         spin_lock(&mm->page_table_lock);
850         __insert_vm_struct(mm, mpnt);
851     }

Handle the case where a hole is being created by a partial unmapping. In this case, the extra VMA is required to create a new mapping from the end of the unmapped region to the end of the old VMA

826-827Take the extra VMA and set extra to NULL so that the calling function will know it is in use and will not free it
828-838Copy in all the VMA information
839If a file/device is mapped, get a reference to it with get_file()
841-842If an open function is provided, call it
843Truncate the VMA so that it ends at the start of the region to be unmapped
848-849Lock the files and mm as with the two previous cases
850Insert the extra VMA into the mm
852 
853     __insert_vm_struct(mm, area);
854     spin_unlock(&mm->page_table_lock);
855     unlock_vma_mappings(area);
856     return extra;
857 }
853Reinsert the VMA into the mm
854Unlock the page tables
855Unlock the spinlock to the shared mapping
856Return the extra VMA if it was not used and NULL if it was

D.2.6  Deleting all memory regions

D.2.6.1  Function: exit_mmap

Source: mm/mmap.c

This function simply steps through all VMAs associated with the supplied mm and unmaps them.

1127 void exit_mmap(struct mm_struct * mm)
1128 {
1129     struct vm_area_struct * mpnt;
1130 
1131     release_segments(mm);
1132     spin_lock(&mm->page_table_lock);
1133     mpnt = mm->mmap;
1134     mm->mmap = mm->mmap_cache = NULL;
1135     mm->mm_rb = RB_ROOT;
1136     mm->rss = 0;
1137     spin_unlock(&mm->page_table_lock);
1138     mm->total_vm = 0;
1139     mm->locked_vm = 0;
1140 
1141     flush_cache_mm(mm);
1142     while (mpnt) {
1143         struct vm_area_struct * next = mpnt->vm_next;
1144         unsigned long start = mpnt->vm_start;
1145         unsigned long end = mpnt->vm_end;
1146         unsigned long size = end - start;
1147 
1148         if (mpnt->vm_ops) {
1149             if (mpnt->vm_ops->close)
1150                 mpnt->vm_ops->close(mpnt);
1151         }
1152         mm->map_count--;
1153         remove_shared_vm_struct(mpnt);
1154         zap_page_range(mm, start, size);
1155         if (mpnt->vm_file)
1156             fput(mpnt->vm_file);
1157         kmem_cache_free(vm_area_cachep, mpnt);
1158         mpnt = next;
1159     }
1160     flush_tlb_mm(mm);
1161 
1162     /* This is just debugging */
1163     if (mm->map_count)
1164         BUG();
1165 
1166     clear_page_tables(mm, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
1167 }
1131release_segments() will release memory segments associated with the process on its Local Descriptor Table (LDT) if the architecture supports segments and the process was using them. Some applications, notably WINE, use this feature
1132Lock the mm
1133mpnt becomes the first VMA on the list
1134Clear VMA related information from the mm so it may be unlocked
1137Unlock the mm
1138-1139Clear the mm statistics
1141Flush the CPU cache for this address range
1142-1159Step through every VMA that was associated with the mm
1143Record what the next VMA to clear will be so this one may be deleted
1144-1146Record the start, end and size of the region to be deleted
1148-1151If there is a close operation associated with this VMA, call it
1152Reduce the map count
1153Remove the file/device mapping from the shared mappings list
1154Free all pages associated with this region
1155-1156If a file/device was mapped in this region, free it
1157Free the VMA struct
1158Move to the next VMA
1160Flush the TLB for this whole mm as it is about to be unmapped
1163-1164If the map_count is positive, it means the map count was not accounted for properly so call BUG() to mark it
1166Clear the page tables associated with this region with clear_page_tables() (See Section D.2.6.2)

D.2.6.2  Function: clear_page_tables

Source: mm/memory.c

This is the top-level function used to unmap all PTEs and free pages within a region. It is used when page tables need to be torn down, such as when the process exits or a region is unmapped.

146 void clear_page_tables(struct mm_struct *mm, 
                           unsigned long first, int nr)
147 {
148     pgd_t * page_dir = mm->pgd;
149 
150     spin_lock(&mm->page_table_lock);
151     page_dir += first;
152     do {
153         free_one_pgd(page_dir);
154         page_dir++;
155     } while (--nr);
156     spin_unlock(&mm->page_table_lock);
157 
158     /* keep the page table cache within bounds */
159     check_pgt_cache();
160 }
148Get the PGD for the mm being unmapped
150Lock the pagetables
151-155Step through all PGDs in the requested range. For each PGD found, call free_one_pgd() (See Section D.2.6.3)
156Unlock the pagetables
159Check the cache of available PGD structures. If there are too many PGDs in the PGD quicklist, some of them will be reclaimed

D.2.6.3  Function: free_one_pgd

Source: mm/memory.c

This function tears down one PGD. For each PMD in this PGD, free_one_pmd() will be called.

109 static inline void free_one_pgd(pgd_t * dir)
110 {
111     int j;
112     pmd_t * pmd;
113 
114     if (pgd_none(*dir))
115         return;
116     if (pgd_bad(*dir)) {
117         pgd_ERROR(*dir);
118         pgd_clear(dir);
119         return;
120     }
121     pmd = pmd_offset(dir, 0);
122     pgd_clear(dir);
123     for (j = 0; j < PTRS_PER_PMD ; j++) {
124         prefetchw(pmd+j+(PREFETCH_STRIDE/16));
125         free_one_pmd(pmd+j);
126     }
127     pmd_free(pmd);
128 }
114-115If no PGD exists here, return
116-120If the PGD is bad, flag the error and return
121Get the first PMD in the PGD
122Clear the PGD entry
123-126For each PMD in this PGD, call free_one_pmd() (See Section D.2.6.4)
127Free the PMD page to the PMD quicklist. Later, check_pgt_cache() will be called and if the cache has too many PMD pages in it, they will be reclaimed

D.2.6.4  Function: free_one_pmd

Source: mm/memory.c

 93 static inline void free_one_pmd(pmd_t * dir)
 94 {
 95     pte_t * pte;
 96 
 97     if (pmd_none(*dir))
 98         return;
 99     if (pmd_bad(*dir)) {
100         pmd_ERROR(*dir);
101         pmd_clear(dir);
102         return;
103     }
104     pte = pte_offset(dir, 0);
105     pmd_clear(dir);
106     pte_free(pte);
107 }
97-98If no PMD exists here, return
99-103If the PMD is bad, flag the error and return
104Get the first PTE in the PMD
105Clear the PMD from the pagetable
106Free the PTE page to the PTE quicklist cache with pte_free(). Later, check_pgt_cache() will be called and if the cache has too many PTE pages in it, they will be reclaimed

D.3  Searching Memory Regions

The functions in this section deal with searching the virtual address space for mapped and free regions.

D.3.1  Finding a Mapped Memory Region

D.3.1.1  Function: find_vma

Source: mm/mmap.c

661 struct vm_area_struct * find_vma(struct mm_struct * mm, 
                                     unsigned long addr)
662 {
663     struct vm_area_struct *vma = NULL;
664 
665     if (mm) {
666         /* Check the cache first. */
667         /* (Cache hit rate is typically around 35%.) */
668         vma = mm->mmap_cache;
669         if (!(vma && vma->vm_end > addr && 
              vma->vm_start <= addr)) {
670             rb_node_t * rb_node;
671 
672             rb_node = mm->mm_rb.rb_node;
673             vma = NULL;
674 
675             while (rb_node) {
676                 struct vm_area_struct * vma_tmp;
677 
678                 vma_tmp = rb_entry(rb_node, 
                        struct vm_area_struct, vm_rb);
679 
680                 if (vma_tmp->vm_end > addr) {
681                     vma = vma_tmp;
682                     if (vma_tmp->vm_start <= addr)
683                         break;
684                     rb_node = rb_node->rb_left;
685                 } else
686                     rb_node = rb_node->rb_right;
687             }
688             if (vma)
689                 mm->mmap_cache = vma;
690         }
691     }
692     return vma;
693 }
661The two parameters are the top level mm_struct that is to be searched and the address the caller is interested in
663Default to returning NULL for address not found
665Make sure the caller does not try and search a bogus mm
668mmap_cache has the result of the last call to find_vma(), so there is a chance of not having to search through the red-black tree at all
669If it is a valid VMA that is being examined, check to see if the address being searched is contained within it. If it is, the VMA was the mmap_cache one so it can be returned, otherwise the tree is searched
670-674Start at the root of the tree
675-687This block is the tree walk
678The macro, as the name suggests, returns the VMA this tree node points to
680Check whether the left or the right child should be traversed next
682If the current VMA is what is required, exit the while loop
689If the VMA is valid, set the mmap_cache for the next call to find_vma()
692Return the VMA that contains the address or, as a side effect of the tree walk, return the VMA that is closest to the requested address. A typical caller pattern based on this behaviour is sketched below
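
Because find_vma() returns the first VMA whose vm_end lies above the address rather than NULL when the address falls in a hole, callers that need an exact hit must check vm_start themselves. A minimal sketch of the usual pattern (the same check appears in do_mlock() in Section D.4.1.4):

/* Sketch of the typical caller pattern: confirm the address really
 * lies inside the returned VMA before using it. */
struct vm_area_struct *vma;

vma = find_vma(mm, addr);
if (!vma || vma->vm_start > addr)
        return -ENOMEM;   /* addr lies in a hole, not in any VMA */

/* ... addr is mapped by vma ... */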

D.3.1.2  Function: find_vma_prev

Source: mm/mmap.c

696 struct vm_area_struct * find_vma_prev(struct mm_struct * mm, 
                        unsigned long addr,
697                     struct vm_area_struct **pprev)
698 {
699     if (mm) {
700         /* Go through the RB tree quickly. */
701         struct vm_area_struct * vma;
702         rb_node_t * rb_node, * rb_last_right, * rb_prev;
703         
704         rb_node = mm->mm_rb.rb_node;
705         rb_last_right = rb_prev = NULL;
706         vma = NULL;
707 
708         while (rb_node) {
709             struct vm_area_struct * vma_tmp;
710 
711             vma_tmp = rb_entry(rb_node, 
                             struct vm_area_struct, vm_rb);
712 
713             if (vma_tmp->vm_end > addr) {
714                 vma = vma_tmp;
715                 rb_prev = rb_last_right;
716                 if (vma_tmp->vm_start <= addr)
717                     break;
718                 rb_node = rb_node->rb_left;
719             } else {
720                 rb_last_right = rb_node;
721                 rb_node = rb_node->rb_right;
722             }
723         }
724         if (vma) {
725             if (vma->vm_rb.rb_left) {
726                 rb_prev = vma->vm_rb.rb_left;
727                 while (rb_prev->rb_right)
728                     rb_prev = rb_prev->rb_right;
729             }
730             *pprev = NULL;
731             if (rb_prev)
732                 *pprev = rb_entry(rb_prev, struct
                         vm_area_struct, vm_rb);
733             if ((rb_prev ? (*pprev)->vm_next : mm->mmap) !=
vma)
734                 BUG();
735             return vma;
736         }
737     }
738     *pprev = NULL;
739     return NULL;
740 }
696-723This is essentially the same as the find_vma() function already described. The only difference is that the last right node accessed is remembered, as this will represent the VMA previous to the requested VMA.
725-729If the returned VMA has a left node, it means that it has to be traversed. It first takes the left leaf and then follows each right leaf until the bottom of the tree is found.
731-732Extract the VMA from the red-black tree node
733-734A debugging check, if this is the previous node, then its next field should point to the VMA being returned. If it is not, it is a bug

D.3.1.3  Function: find_vma_intersection

Source: include/linux/mm.h

673 static inline struct vm_area_struct * find_vma_intersection(
                       struct mm_struct * mm, 
                       unsigned long start_addr, unsigned long end_addr)
674 {
675     struct vm_area_struct * vma = find_vma(mm,start_addr);
676 
677     if (vma && end_addr <= vma->vm_start)
678         vma = NULL;
679     return vma;
680 }
675Return the VMA closest to the starting address
677If a VMA is returned and the end address is less than or equal to the beginning of the returned VMA, the VMA does not intersect
679Return the VMA if it does intersect

D.3.2  Finding a Free Memory Region

D.3.2.1  Function: get_unmapped_area

Source: mm/mmap.c

The call graph for this function is shown at Figure 4.5.

644 unsigned long get_unmapped_area(struct file *file, 
                        unsigned long addr,
                        unsigned long len, 
                        unsigned long pgoff, 
                        unsigned long flags)
645 {
646     if (flags & MAP_FIXED) {
647         if (addr > TASK_SIZE - len)
648             return -ENOMEM;
649         if (addr & ~PAGE_MASK)
650             return -EINVAL;
651         return addr;
652     }
653 
654     if (file && file->f_op && file->f_op->get_unmapped_area)
655         return file->f_op->get_unmapped_area(file, addr, 
                                len, pgoff, flags);
656 
657     return arch_get_unmapped_area(file, addr, len, pgoff, flags);
658 }
644The parameters passed are
file The file or device being mapped
addr The requested address to map to
len The length of the mapping
pgoff The offset within the file being mapped
flags Protection flags
646-652Sanity checks. If it is required that the mapping be placed at the specified address, make sure it will not overflow the address space and that it is page aligned
654If the struct file provides a get_unmapped_area() function, use it
657Else use arch_get_unmapped_area()(See Section D.3.2.2) as an anonymous version of the get_unmapped_area() function

D.3.2.2  Function: arch_get_unmapped_area

Source: mm/mmap.c

Architectures have the option of specifying this function for themselves by defining HAVE_ARCH_UNMAPPED_AREA. If the architecture does not supply one, this version is used.

614 #ifndef HAVE_ARCH_UNMAPPED_AREA
615 static inline unsigned long arch_get_unmapped_area(
            struct file *filp,
            unsigned long addr, unsigned long len, 
            unsigned long pgoff, unsigned long flags)
616 {
617     struct vm_area_struct *vma;
618 
619     if (len > TASK_SIZE)
620         return -ENOMEM;
621 
622     if (addr) {
623         addr = PAGE_ALIGN(addr);
624         vma = find_vma(current->mm, addr);
625         if (TASK_SIZE - len >= addr &&
626             (!vma || addr + len <= vma->vm_start))
627             return addr;
628     }
629     addr = PAGE_ALIGN(TASK_UNMAPPED_BASE);
630 
631     for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
632         /* At this point:  (!vma || addr < vma->vm_end). */
633         if (TASK_SIZE - len < addr)
634             return -ENOMEM;
635         if (!vma || addr + len <= vma->vm_start)
636             return addr;
637         addr = vma->vm_end;
638     }
639 }
640 #else
641 extern unsigned long arch_get_unmapped_area(struct file *, 
                     unsigned long, unsigned long, 
                     unsigned long, unsigned long);
642 #endif
614If this is not defined, it means that the architecture does not provide its own arch_get_unmapped_area() so this one is used instead
615The parameters are the same as those for get_unmapped_area()(See Section D.3.2.1)
619-620Sanity check, make sure the required map length is not too long
622-628If an address is provided, use it for the mapping
623Make sure the address is page aligned
624find_vma()(See Section D.3.1.1) will return the region closest to the requested address
625-627Make sure the mapping will not overlap with another region. If it does not, return it as it is safe to use. Otherwise it gets ignored
629TASK_UNMAPPED_BASE is the starting point for searching for a free region to use
631-638Starting from TASK_UNMAPPED_BASE, linearly search the VMAs until a large enough region between them is found to store the new mapping. This is essentially a first fit search (a userspace illustration follows this list)
641If an external function is provided, it still needs to be declared here
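
From userspace, the two paths through get_unmapped_area() correspond to the address hint and the MAP_FIXED flag of mmap(). The program below is only an illustration of that relationship and is not part of the kernel; the addresses chosen are arbitrary.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        /* A plain hint may be moved by the first fit search above. */
        void *hinted = mmap((void *)0x40000000, 4096,
                            PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* MAP_FIXED takes the early return path: the address must be
         * page aligned and within the address space or the call fails. */
        void *fixed = mmap((void *)0x50000000, 4096,
                           PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

        printf("hinted at %p, fixed at %p\n", hinted, fixed);
        return 0;
}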

D.4  Locking and Unlocking Memory Regions

This section contains the functions related to locking and unlocking a region. The main complexity in them is how the regions need to be fixed up after the operation takes place.

D.4.1  Locking a Memory Region

D.4.1.1  Function: sys_mlock

Source: mm/mlock.c

The call graph for this function is shown in Figure 4.10. This is the system call mlock() for locking a region of memory into physical memory. This function simply checks to make sure that process and user limits are not exceeded and that the region to lock is page aligned. A small userspace usage example follows the commentary below.

195 asmlinkage long sys_mlock(unsigned long start, size_t len)
196 {
197     unsigned long locked;
198     unsigned long lock_limit;
199     int error = -ENOMEM;
200 
201     down_write(&current->mm->mmap_sem);
202     len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
203     start &= PAGE_MASK;
204 
205     locked = len >> PAGE_SHIFT;
206     locked += current->mm->locked_vm;
207 
208     lock_limit = current->rlim[RLIMIT_MEMLOCK].rlim_cur;
209     lock_limit >>= PAGE_SHIFT;
210 
211     /* check against resource limits */
212     if (locked > lock_limit)
213         goto out;
214 
215     /* we may lock at most half of physical memory... */
216     /* (this check is pretty bogus, but doesn't hurt) */
217     if (locked > num_physpages/2)
218         goto out;
219 
220     error = do_mlock(start, len, 1);
221 out:
222     up_write(&current->mm->mmap_sem);
223     return error;
224 }
201Take the semaphore; we are likely to sleep during this so a spinlock cannot be used
202Round the length up to the page boundary
203Round the start address down to the page boundary
205Calculate how many pages will be locked
206Calculate how many pages will be locked in total by this process
208-209Calculate what the limit is to the number of locked pages
212-213Do not allow the process to lock more than it should
217-218Do not allow the process to lock more than half of physical memory
220Call do_mlock()(See Section D.4.1.4) which starts the “real” work by finding the VMA closest to the area to lock before calling mlock_fixup()(See Section D.4.3.1)
222Free the semaphore
223Return the error or success code from do_mlock()
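
From userspace, sys_mlock() is reached through the mlock() library wrapper. The short program below is a hedged usage example, not kernel code; error handling is kept minimal.

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        char *buf;

        /* mlock() operates on whole pages, so use a page aligned buffer */
        if (posix_memalign((void **)&buf, page, page) != 0)
                return 1;

        if (mlock(buf, page) != 0)      /* enters sys_mlock() */
                return 1;

        /* ... buf is now guaranteed to stay resident in memory ... */

        munlock(buf, page);             /* enters sys_munlock() */
        free(buf);
        return 0;
}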

D.4.1.2  Function: sys_mlockall

Source: mm/mlock.c

This is the system call mlockall() which attempts to lock all pages in the calling process in memory. If MCL_CURRENT is specified, all current pages will be locked. If MCL_FUTURE is specified, all future mappings will be locked. The flags may be or-ed together. This function makes sure that the flags and process limits are ok before calling do_mlockall().

266 asmlinkage long sys_mlockall(int flags)
267 {
268     unsigned long lock_limit;
269     int ret = -EINVAL;
270 
271     down_write(&current->mm->mmap_sem);
272     if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
273         goto out;
274 
275     lock_limit = current->rlim[RLIMIT_MEMLOCK].rlim_cur;
276     lock_limit >>= PAGE_SHIFT;
277 
278     ret = -ENOMEM;
279     if (current->mm->total_vm > lock_limit)
280         goto out;
281 
282     /* we may lock at most half of physical memory... */
283     /* (this check is pretty bogus, but doesn't hurt) */
284     if (current->mm->total_vm > num_physpages/2)
285         goto out;
286 
287     ret = do_mlockall(flags);
288 out:
289     up_write(&current->mm->mmap_sem);
290     return ret;
291 }
269By default, return -EINVAL to indicate invalid parameters
271Acquire the current mm_struct semaphore
272-273Make sure that some valid flag has been specified. If not, goto out to unlock the semaphore and return -EINVAL
275-276Check the process limits to see how many pages may be locked
278From here on, the default error is -ENOMEM
279-280If the size of the locking would exceed set limits, then goto out
284-285Do not allow this process to lock more than half of physical memory. This is a bogus check because four processes locking a quarter of physical memory each will bypass this. It is acceptable though as only root processes are allowed to lock memory and are unlikely to make this type of mistake
287Call the core function do_mlockall()(See Section D.4.1.3)
289-290Unlock the semaphore and return

D.4.1.3  Function: do_mlockall

Source: mm/mlock.c

238 static int do_mlockall(int flags)
239 {
240     int error;
241     unsigned int def_flags;
242     struct vm_area_struct * vma;
243 
244     if (!capable(CAP_IPC_LOCK))
245         return -EPERM;
246 
247     def_flags = 0;
248     if (flags & MCL_FUTURE)
249         def_flags = VM_LOCKED;
250     current->mm->def_flags = def_flags;
251 
252     error = 0;
253     for (vma = current->mm->mmap; vma ; vma = vma->vm_next) {
254         unsigned int newflags;
255 
256         newflags = vma->vm_flags | VM_LOCKED;
257         if (!(flags & MCL_CURRENT))
258             newflags &= ~VM_LOCKED;
259         error = mlock_fixup(vma, vma->vm_start, vma->vm_end, 
                                newflags);
260         if (error)
261             break;
262     }
263     return error;
264 }
244-245The calling process must be either root or have CAP_IPC_LOCK capabilities
248-250The MCL_FUTURE flag says that all future pages should be locked so if set, the def_flags for VMAs should be VM_LOCKED
253-262Cycle through all VMAs
256Set the VM_LOCKED flag in the current VMA flags
257-258If the MCL_CURRENT flag has not been set requesting that all current pages be locked, then clear the VM_LOCKED flag. The logic is arranged like this so that the unlock code can use this same function just with no flags
259Call mlock_fixup()(See Section D.4.3.1) which will adjust the regions to match the locking as necessary
260-261If a non-zero value is returned at any point, stop locking. It is interesting to note that VMAs already locked will not be unlocked
263Return the success or error value

D.4.1.4  Function: do_mlock

Source: mm/mlock.c

This function is responsible for starting the work needed to either lock or unlock a region depending on the value of the on parameter. It is broken up into two sections. The first makes sure the region is page aligned (despite the fact that the only two callers of this function do the same thing) before finding the VMA that is to be adjusted. The second part then sets the appropriate flags before calling mlock_fixup() for each VMA that is affected by this locking.

148 static int do_mlock(unsigned long start, size_t len, int on)
149 {
150     unsigned long nstart, end, tmp;
151     struct vm_area_struct * vma, * next;
152     int error;
153 
154     if (on && !capable(CAP_IPC_LOCK))
155         return -EPERM;
156     len = PAGE_ALIGN(len);
157     end = start + len;
158     if (end < start)
159         return -EINVAL;
160     if (end == start)
161         return 0;
162     vma = find_vma(current->mm, start);
163     if (!vma || vma->vm_start > start)
164         return -ENOMEM;

Page align the request and find the VMA

154Only root processes can lock pages
156Page align the length. This is redundant as the length is page aligned in the parent functions
157-159Calculate the end of the locking and make sure it is a valid region. Return -EINVAL if it is not
160-161If locking a region of size 0, just return
162Find the VMA that will be affected by this locking
163-164If the VMA for this address range does not exist, return -ENOMEM
166     for (nstart = start ; ; ) {
167         unsigned int newflags;
168 
170 
171         newflags = vma->vm_flags | VM_LOCKED;
172         if (!on)
173             newflags &= ~VM_LOCKED;
174 
175         if (vma->vm_end >= end) {
176             error = mlock_fixup(vma, nstart, end, newflags);
177             break;
178         }
179 
180         tmp = vma->vm_end;
181         next = vma->vm_next;
182         error = mlock_fixup(vma, nstart, tmp, newflags);
183         if (error)
184             break;
185         nstart = tmp;
186         vma = next;
187         if (!vma || vma->vm_start != nstart) {
188             error = -ENOMEM;
189             break;
190         }
191     }
192     return error;
193 }

Walk through the VMAs affected by this locking and call mlock_fixup() for each of them.

166-192Cycle through as many VMAs as necessary to lock the pages
171Set the VM_LOCKED flag on the VMA
172-173Unless this is an unlock in which case, remove the flag
175-177If this VMA is the last VMA to be affected by the unlocking, call mlock_fixup() with the end address for the locking and exit
180-190Else the whole of this VMA needs to be locked. To lock it, the end of this VMA is passed as a parameter to mlock_fixup()(See Section D.4.3.1) instead of the end of the actual locking
180tmp is the end of the mapping on this VMA
181next is the next VMA that will be affected by the locking
182Call mlock_fixup()(See Section D.4.3.1) for this VMA
183-184If an error occurs, back out. Note that VMAs which have already been locked are not fixed up
185The next start address is the start of the next VMA
186Move to the next VMA
187-190If there is no VMA, return -ENOMEM. The next condition though would require the regions to be extremely broken as a result of a broken implementation of mlock_fixup() or VMAs that overlap
192Return the error or success value

D.4.2  Unlocking the region

D.4.2.1  Function: sys_munlock

Source: mm/mlock.c

Page align the request before calling do_mlock() which begins the real work of fixing up the regions.

226 asmlinkage long sys_munlock(unsigned long start, size_t len)
227 {
228     int ret;
229 
230     down_write(&current->mm->mmap_sem);
231     len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
232     start &= PAGE_MASK;
233     ret = do_mlock(start, len, 0);
234     up_write(&current->mm->mmap_sem);
235     return ret;
236 }
230Acquire the semaphore protecting the mm_struct
231Round the length of the region up to the nearest page boundary
232Round the start of the region down to the nearest page boundary
233Call do_mlock()(See Section D.4.1.4) with 0 as the third parameter to unlock the region
234Release the semaphore
235Return the success or failure code

D.4.2.2  Function: sys_munlockall

Source: mm/mlock.c

Trivial function. If the flags to mlockall() are 0, it is taken to mean that no current pages need to be locked and no future mappings should be locked either, which means the VM_LOCKED flag will be removed from all VMAs.

293 asmlinkage long sys_munlockall(void)
294 {
295     int ret;
296 
297     down_write(&current->mm->mmap_sem);
298     ret = do_mlockall(0);
299     up_write(&current->mm->mmap_sem);
300     return ret;
301 }
297Acquire the semaphore protecting the mm_struct
298Call do_mlockall()(See Section D.4.1.3) with 0 as flags which will remove the VM_LOCKED from all VMAs
299Release the semaphore
300Return the error or success code

D.4.3  Fixing up regions after locking/unlocking

D.4.3.1  Function: mlock_fixup

Source: mm/mlock.c

This function identifies four separate types of locking that must be addressed. The first is where the full VMA is to be locked, in which case it calls mlock_fixup_all(). The second is where only the beginning portion of the VMA is affected, handled by mlock_fixup_start(). The third is the locking of a region at the end, handled by mlock_fixup_end(), and the last is locking a region in the middle of the VMA with mlock_fixup_middle().

117 static int mlock_fixup(struct vm_area_struct * vma, 
118    unsigned long start, unsigned long end, unsigned int newflags)
119 {
120     int pages, retval;
121 
122     if (newflags == vma->vm_flags)
123         return 0;
124 
125     if (start == vma->vm_start) {
126         if (end == vma->vm_end)
127             retval = mlock_fixup_all(vma, newflags);
128         else
129             retval = mlock_fixup_start(vma, end, newflags);
130     } else {
131         if (end == vma->vm_end)
132             retval = mlock_fixup_end(vma, start, newflags);
133         else
134             retval = mlock_fixup_middle(vma, start, 
                            end, newflags);
135     }
136     if (!retval) {
137         /* keep track of amount of locked VM */
138         pages = (end - start) >> PAGE_SHIFT;
139         if (newflags & VM_LOCKED) {
140             pages = -pages;
141             make_pages_present(start, end);
142         }
143         vma->vm_mm->locked_vm -= pages;
144     }
145     return retval;
146 }
122-123If no change is to be made, just return
125If the start of the locking is at the start of the VMA, it means that either the full region is to be locked or only a portion at the beginning
126-127The full VMA is being locked, call mlock_fixup_all() (See Section D.4.3.2)
128-129Part of the VMA is being locked with the start of the VMA matching the start of the locking, call mlock_fixup_start() (See Section D.4.3.3)
130Else either a region at the end is to be locked or a region in the middle
131-132The end of the locking matches the end of the VMA, call mlock_fixup_end() (See Section D.4.3.4)
133-134A region in the middle of the VMA is to be locked, call mlock_fixup_middle() (See Section D.4.3.5)
136-144The fixup functions return 0 on success. If the fixup of the regions succeeded and the regions are now marked as locked, call make_pages_present() which makes some basic checks before calling get_user_pages() which faults in all the pages in the same way the page fault handler does. A sketch of make_pages_present() follows
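
For reference, make_pages_present() is little more than a bridge to get_user_pages(), which was commented in Section D.2.4. The sketch below approximates its behaviour from mm/memory.c rather than reproducing it verbatim; the sanity checks in the real function are omitted.

/* Approximate sketch of make_pages_present(): fault in every page in
 * [addr, end) by handing the range to get_user_pages() with NULL pages
 * and vmas arrays so the pages are faulted in but no references are kept. */
int make_pages_present(unsigned long addr, unsigned long end)
{
        struct vm_area_struct *vma;
        int write, len, ret;

        vma = find_vma(current->mm, addr);
        write = (vma->vm_flags & VM_WRITE) != 0;
        len = (end + PAGE_SIZE - 1) / PAGE_SIZE - addr / PAGE_SIZE;
        ret = get_user_pages(current, current->mm, addr, len,
                             write, 0, NULL, NULL);
        return ret == len ? 0 : -1;
}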

D.4.3.2  Function: mlock_fixup_all

Source: mm/mlock.c

 15 static inline int mlock_fixup_all(struct vm_area_struct * vma, 
                    int newflags)
 16 {
 17     spin_lock(&vma->vm_mm->page_table_lock);
 18     vma->vm_flags = newflags;
 19     spin_unlock(&vma->vm_mm->page_table_lock);
 20     return 0;
 21 }
17-19Trivial, lock the VMA with the spinlock, set the new flags, release the lock and return success

D.4.3.3  Function: mlock_fixup_start

Source: mm/mlock.c

Slightly more complicated. A new VMA is required to represent the affected region. The start of the old VMA is moved forward.

 23 static inline int mlock_fixup_start(struct vm_area_struct * vma,
 24     unsigned long end, int newflags)
 25 {
 26     struct vm_area_struct * n;
 27 
 28     n = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 29     if (!n)
 30         return -EAGAIN;
 31     *n = *vma;
 32     n->vm_end = end;
 33     n->vm_flags = newflags;
 34     n->vm_raend = 0;
 35     if (n->vm_file)
 36         get_file(n->vm_file);
 37     if (n->vm_ops && n->vm_ops->open)
 38         n->vm_ops->open(n);
 39     vma->vm_pgoff += (end - vma->vm_start) >> PAGE_SHIFT;
 40     lock_vma_mappings(vma);
 41     spin_lock(&vma->vm_mm->page_table_lock);
 42     vma->vm_start = end;
 43     __insert_vm_struct(current->mm, n);
 44     spin_unlock(&vma->vm_mm->page_table_lock);
 45     unlock_vma_mappings(vma);
 46     return 0;
 47 }
28Allocate a VMA from the slab allocator for the affected region
31-34Copy in the necessary information
35-36If the VMA has a file or device mapping, get_file() will increment the reference count
37-38If an open() function is provided, call it
39Update the offset within the file or device mapping for the old VMA to be the end of the locked region
40lock_vma_mappings() will lock any files if this VMA is a shared region
41-44Lock the parent mm_struct, update the old VMA's start to be the end of the affected region, insert the new VMA into the process's linked lists (See Section D.2.2.1) and release the lock
45Unlock the file mappings with unlock_vma_mappings()
46Return success

D.4.3.4  Function: mlock_fixup_end

Source: mm/mlock.c

Essentially the same as mlock_fixup_start() except the affected region is at the end of the VMA.

 49 static inline int mlock_fixup_end(struct vm_area_struct * vma,
 50     unsigned long start, int newflags)
 51 {
 52     struct vm_area_struct * n;
 53 
 54     n = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 55     if (!n)
 56         return -EAGAIN;
 57     *n = *vma;
 58     n->vm_start = start;
 59     n->vm_pgoff += (n->vm_start - vma->vm_start) >> PAGE_SHIFT;
 60     n->vm_flags = newflags;
 61     n->vm_raend = 0;
 62     if (n->vm_file)
 63         get_file(n->vm_file);
 64     if (n->vm_ops && n->vm_ops->open)
 65         n->vm_ops->open(n);
 66     lock_vma_mappings(vma);
 67     spin_lock(&vma->vm_mm->page_table_lock);
 68     vma->vm_end = start;
 69     __insert_vm_struct(current->mm, n);
 70     spin_unlock(&vma->vm_mm->page_table_lock);
 71     unlock_vma_mappings(vma);
 72     return 0;
 73 }
54Alloc a VMA from the slab allocator for the affected region
57-61Copy in the necessary information and update the offset within the file or device mapping
62-63If the VMA has a file or device mapping, get_file() will increment the reference count
64-65If an open() function is provided, call it
66lock_vma_mappings() will lock any files if this VMA is a shared region
67-70Lock the parent mm_struct, update the old VMA's end to be the start of the affected region, insert the new VMA into the process's linked lists (See Section D.2.2.1) and release the lock
71Unlock the file mappings with unlock_vma_mappings()
72Return success

D.4.3.5  Function: mlock_fixup_middle

Source: mm/mlock.c

Similar to the previous two fixup functions except that 2 new regions are required to fix up the mapping.

 75 static inline int mlock_fixup_middle(struct vm_area_struct * vma,
 76     unsigned long start, unsigned long end, int newflags)
 77 {
 78     struct vm_area_struct * left, * right;
 79 
 80     left = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 81     if (!left)
 82         return -EAGAIN;
 83     right = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 84     if (!right) {
 85         kmem_cache_free(vm_area_cachep, left);
 86         return -EAGAIN;
 87     }
 88     *left = *vma;
 89     *right = *vma;
 90     left->vm_end = start;
 91     right->vm_start = end;
 92     right->vm_pgoff += (right->vm_start - left->vm_start) >>
                PAGE_SHIFT;
 93     vma->vm_flags = newflags;
 94     left->vm_raend = 0;
 95     right->vm_raend = 0;
 96     if (vma->vm_file)
 97         atomic_add(2, &vma->vm_file->f_count);
 98 
 99     if (vma->vm_ops && vma->vm_ops->open) {
100         vma->vm_ops->open(left);
101         vma->vm_ops->open(right);
102     }
103     vma->vm_raend = 0;
104     vma->vm_pgoff += (start - vma->vm_start) >> PAGE_SHIFT;
105     lock_vma_mappings(vma);
106     spin_lock(&vma->vm_mm->page_table_lock);
107     vma->vm_start = start;
108     vma->vm_end = end;
109     vma->vm_flags = newflags;
110     __insert_vm_struct(current->mm, left);
111     __insert_vm_struct(current->mm, right);
112     spin_unlock(&vma->vm_mm->page_table_lock);
113     unlock_vma_mappings(vma);
114     return 0;
115 }
80-87Allocate the two new VMAs from the slab allocator
88-89Copy in the information from the old VMA into them
90The end of the left region is the start of the region to be affected
91The start of the right region is the end of the affected region
92Update the file offset
93The old VMA is now the affected region so update its flags
94-95Make the readahead window 0 to ensure pages not belonging to their regions are not accidentally read ahead
96-97Increment the reference count to the file/device mapping if there is one
99-102Call the open() function for the two new mappings
103-104Cancel the readahead window and update the offset within the file to be the beginning of the locked region
105Lock the shared file/device mappings
106-112Lock the parent mm_struct, update the VMA and insert the two new regions into the process before releasing the lock again
113Unlock the shared mappings
114Return success

D.5  Page Faulting

This section deals with the page fault handler. It begins with the architecture specific function for the x86 and then moves to the architecture independent layer. The architecture specific functions all have the same responsibilities.

D.5.1  x86 Page Fault Handler

D.5.1.1  Function: do_page_fault

Source: arch/i386/mm/fault.c

The call graph for this function is shown in Figure 4.12. This is the x86 architecture-dependent handler for page fault exceptions. Each architecture registers its own, but all of them have similar responsibilities.

140 asmlinkage void do_page_fault(struct pt_regs *regs, 
                  unsigned long error_code)
141 {
142     struct task_struct *tsk;
143     struct mm_struct *mm;
144     struct vm_area_struct * vma;
145     unsigned long address;
146     unsigned long page;
147     unsigned long fixup;
148     int write;
149     siginfo_t info;
150 
151     /* get the address */
152     __asm__("movl %%cr2,%0":"=r" (address));
153 
154     /* It's safe to allow irq's after cr2 has been saved */
155     if (regs->eflags & X86_EFLAGS_IF)
156         local_irq_enable();
157 
158     tsk = current;
159 

Function preamble. Get the fault address and enable interrupts

140The parameters are
regs is a struct containing the values of all the registers at the time of the fault
error_code indicates what sort of fault occurred
152As the comment indicates, the cr2 register holds the fault address
155-156If interrupts were enabled when the fault occurred, it is safe to re-enable them now that the cr2 register has been saved
158Set the current task
173     if (address >= TASK_SIZE && !(error_code & 5))
174         goto vmalloc_fault;
175 
176     mm = tsk->mm;
177     info.si_code = SEGV_MAPERR;
178 
183     if (in_interrupt() || !mm)
184         goto no_context;
185 

Check for exceptional faults, kernel faults, fault in interrupt and fault with no memory context

173If the fault address is over TASK_SIZE, it is within the kernel address space. If neither bit 0 nor bit 2 of the error code is set, the fault occurred in kernel mode and is not a protection error, so it is handled as a vmalloc fault
176Record the working mm
183If this is an interrupt, or there is no memory context (such as with a kernel thread), there is no way to safely handle the fault so goto no_context
186     down_read(&mm->mmap_sem);
187 
188     vma = find_vma(mm, address);
189     if (!vma)
190         goto bad_area;
191     if (vma->vm_start <= address)
192         goto good_area;
193     if (!(vma->vm_flags & VM_GROWSDOWN))
194         goto bad_area;
195     if (error_code & 4) {
196         /*
197          * accessing the stack below %esp is always a bug.
198          * The "+ 32" is there due to some instructions (like
199          * pusha) doing post-decrement on the stack and that
200          * doesn't show up until later..
201          */
202         if (address + 32 < regs->esp)
203             goto bad_area;
204     }
205     if (expand_stack(vma, address))
206         goto bad_area;

For a fault in userspace, find the VMA for the faulting address and determine if it is a good area, a bad area or if the fault occurred near a region that can be expanded, such as the stack

186Take the long lived mm semaphore
188Find the VMA that is responsible or is closest to the faulting address
189-190If a VMA does not exist at all, goto bad_area
191-192If the start of the region is before the address, it means this VMA is the correct VMA for the fault so goto good_area which will check the permissions
193-194For the region that is closest, check if it can grow down (VM_GROWSDOWN). If it can, it means the stack can probably be expanded. If not, goto bad_area
195-204Check to make sure it is not an access below the stack. If bit 2 of error_code is set, the fault occurred while running in userspace
205-206The stack is the only region with VM_GROWSDOWN set so if we reach here, the stack is expanded with expand_stack()(See Section D.5.2.1). If it fails, goto bad_area
211 good_area:
212     info.si_code = SEGV_ACCERR;
213     write = 0;
214     switch (error_code & 3) {
215         default:    /* 3: write, present */
216 #ifdef TEST_VERIFY_AREA
217             if (regs->cs == KERNEL_CS)
218                 printk("WP fault at %08lx\n", regs->eip);
219 #endif
220             /* fall through */
221         case 2:     /* write, not present */
222             if (!(vma->vm_flags & VM_WRITE))
223                 goto bad_area;
224             write++;
225             break;
226         case 1:     /* read, present */
227             goto bad_area;
228         case 0:     /* read, not present */
229             if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
230                 goto bad_area;
231     }

Here the first part of a good area is handled. The permissions need to be checked in case this is a protection fault.

212By default return an error
214Check bits 0 and 1 of the error code. Bit 0 clear means the page was not present; set, it means a protection fault such as a write to a read-only area. Bit 1 is 0 if it was a read fault and 1 if a write
215If it is 3, both bits are 1 so it is a write protection fault
221Bit 1 is a 1 so it is a write fault
222-223If the region cannot be written to, it is a bad write so goto bad_area. If the region can be written to, this is a page that is marked Copy On Write (COW)
224Flag that a write has occurred
226-227This is a read and the page is present. There is no reason for the fault so it must be some other type of exception like a divide by zero; goto bad_area where it is handled
228-230A read occurred on a missing page. Make sure it is ok to read or exec this page. If not, goto bad_area. The check for exec is made because the x86 cannot exec-protect a page and instead uses the read protection flag. This is why both have to be checked
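
The three error code bits tested here and earlier in do_page_fault() can be summarised with a trivial decoder. This is an illustrative sketch rather than kernel code; it simply restates the bit meanings (bit 0 protection versus not-present, bit 1 write versus read, bit 2 user versus kernel mode).

    #include <stdio.h>

    /* Illustrative decoder for the x86 page fault error code bits
     * tested by do_page_fault(). Not kernel code. */
    static void decode_fault_error(unsigned long error_code)
    {
            printf("%s, %s, %s mode\n",
                   (error_code & 1) ? "protection fault" : "page not present",
                   (error_code & 2) ? "write access"     : "read access",
                   (error_code & 4) ? "user"             : "kernel");
    }

    int main(void)
    {
            decode_fault_error(2);  /* write to a not-present page in kernel mode */
            decode_fault_error(7);  /* write protection fault in user mode */
            return 0;
    }
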
233  survive:
239     switch (handle_mm_fault(mm, vma, address, write)) {
240     case 1:
241         tsk->min_flt++;
242         break;
243     case 2:
244         tsk->maj_flt++;
245         break;
246     case 0:
247         goto do_sigbus;
248     default:
249         goto out_of_memory;
250     }
251 
252     /*
253      * Did it hit the DOS screen memory VA from vm86 mode?
254      */
255     if (regs->eflags & VM_MASK) {
256         unsigned long bit = (address - 0xA0000) >> PAGE_SHIFT;
257         if (bit < 32)
258             tsk->thread.screen_bitmap |= 1 << bit;
259     }
260     up_read(&mm->mmap_sem);
261     return;

At this point, an attempt is going to be made to handle the fault gracefully with handle_mm_fault().

239Call handle_mm_fault() with the relevant information about the fault. This is the architecture independent part of the handler
240-242A return of 1 means it was a minor fault. Update statistics
243-245A return of 2 means it was a major fault. Update statistics
246-247A return of 0 means some IO error happened during the fault so go to the do_sigbus handler
248-249Any other return means memory could not be allocated for the fault so we are out of memory. In reality this does not happen because out_of_memory() in mm/oom_kill.c is invoked before this point and it is a lot more graceful about which process it kills
260Release the lock to the mm
261Return as the fault has been successfully handled
267 bad_area:
268     up_read(&mm->mmap_sem);
269 
270     /* User mode accesses just cause a SIGSEGV */
271     if (error_code & 4) {
272         tsk->thread.cr2 = address;
273         tsk->thread.error_code = error_code;
274         tsk->thread.trap_no = 14;
275         info.si_signo = SIGSEGV;
276         info.si_errno = 0;
277         /* info.si_code has been set above */
278         info.si_addr = (void *)address;
279         force_sig_info(SIGSEGV, &info, tsk);
280         return;
281     }
282 
283     /*
284      * Pentium F0 0F C7 C8 bug workaround.
285      */
286     if (boot_cpu_data.f00f_bug) {
287         unsigned long nr;
288         
289         nr = (address - idt) >> 3;
290 
291         if (nr == 6) {
292             do_invalid_op(regs, 0);
293             return;
294         }
295     }

This is the bad area handler, reached, for example, when memory with no vm_area_struct managing it is used. If the fault was not caused by a user process and is not due to the f00f bug, execution falls through to the no_context label.

271If bit 2 of the error code is set, the fault came from userspace so it is a simple case of sending a SIGSEGV to kill the process
272-274Set thread information about what happened which can be read by a debugger later
275Record that a SIGSEGV signal was sent
276Clear si_errno
278Record the address
279Send the SIGSEGV signal. The process will exit and dump all the relevant information
280Return as the fault has been successfully handled
286-295A bug in early Pentiums, known as the f00f bug, allowed a particular invalid instruction sequence to lock up the processor from userspace. It was used as a local DoS attack on a running Linux system. The bug was trapped within a few hours and a patch released. The workaround causes the offending instruction to raise a page fault on the IDT, which is caught here, so it now results in a harmless termination of the process rather than a hung system
296 
297 no_context:
298     /* Are we prepared to handle this kernel fault?  */
299     if ((fixup = search_exception_table(regs->eip)) != 0) {
300         regs->eip = fixup;
301         return;
302     }
299-302Search the exception table with search_exception_table() to see if this exception can be handled and, if so, jump to the fixup code on return. This is really important for copy_from_user() and copy_to_user(), where an exception handler is specially installed to trap reads and writes to invalid regions in userspace without having to make expensive checks. It means that a small fixup block of code can be executed rather than falling through to the next block, which causes an oops
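
To illustrate the idea, the following is a minimal sketch of an exception table lookup, assuming a hypothetical linear search over entries that simply pair a faulting instruction address with its fixup address. The real search_exception_table() searches the kernel's own sorted table, so treat this purely as a model of the insn/fixup pairing rather than the actual implementation.

    /* Hypothetical exception table entry: the address of an instruction
     * that may fault (insn) and the address of the recovery code to
     * resume at instead of generating an oops (fixup). */
    struct ex_entry {
            unsigned long insn;
            unsigned long fixup;
    };

    /* Linear-search model of the lookup do_page_fault() performs on
     * regs->eip. Returns 0 if no fixup exists. */
    unsigned long find_fixup(const struct ex_entry *table,
                             unsigned long nr_entries,
                             unsigned long faulting_eip)
    {
            unsigned long i;

            for (i = 0; i < nr_entries; i++)
                    if (table[i].insn == faulting_eip)
                            return table[i].fixup;  /* resume here */
            return 0;                               /* no fixup: oops */
    }
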
304 /*
305  * Oops. The kernel tried to access some bad page. We'll have to
306  * terminate things with extreme prejudice.
307  */
308 
309     bust_spinlocks(1);
310 
311     if (address < PAGE_SIZE)
312         printk(KERN_ALERT "Unable to handle kernel NULL pointer
                     dereference");
313     else
314         printk(KERN_ALERT "Unable to handle kernel paging
                     request");
315     printk(" at virtual address %08lx\n",address);
316     printk(" printing eip:\n");
317     printk("%08lx\n", regs->eip);
318     asm("movl %%cr3,%0":"=r" (page));
319     page = ((unsigned long *) __va(page))[address >> 22];
320     printk(KERN_ALERT "*pde = %08lx\n", page);
321     if (page & 1) {
322         page &= PAGE_MASK;
323         address &= 0x003ff000;
324         page = ((unsigned long *) 
                __va(page))[address >> PAGE_SHIFT];
325         printk(KERN_ALERT "*pte = %08lx\n", page);
326     }
327     die("Oops", regs, error_code);
328     bust_spinlocks(0);
329     do_exit(SIGKILL);

This is the no_context handler. Some bad exception occurred which, in all likelihood, will end in the process being terminated. If no fixup exists, the kernel faulted when it definitely should not have and an oops report is generated.

309-329Otherwise the kernel faulted when it really shouldn't have and it is a kernel bug. This block generates an oops report
309Forcibly free spinlocks which might prevent a message getting to console
311-312If the address is < PAGE_SIZE, it means that a null pointer was used. Linux deliberately has page 0 unassigned to trap this type of fault which is a common programming error
313-314Otherwise it is just some bad kernel error such as a driver trying to access userspace incorrectly
315-320Print out information about the fault
321-326Print out information about the page being faulted
327Die and generate an oops report which can be used later to get a stack trace so a developer can see more accurately where and how the fault occurred
329Forcibly kill the faulting process
335 out_of_memory:
336     if (tsk->pid == 1) {
337         yield();
338         goto survive;
339     }
340     up_read(&mm->mmap_sem);
341     printk("VM: killing process %s\n", tsk->comm);
342     if (error_code & 4)
343         do_exit(SIGKILL);
344     goto no_context;

The out of memory handler. Usually ends with the faulting process getting killed unless it is init

336-339If the process is init, just yield and goto survive which will try to handle the fault gracefully. init should never be killed
340Free the mm semaphore
341Print out a helpful “You are Dead” message
342If from userspace, just kill the process
344If in kernel space, go to the no_context handler which in this case will probably result in a kernel oops
345 
346 do_sigbus:
347     up_read(&mm->mmap_sem);
348 
353     tsk->thread.cr2 = address;
354     tsk->thread.error_code = error_code;
355     tsk->thread.trap_no = 14;
356     info.si_signo = SIGBUS;
357     info.si_errno = 0;
358     info.si_code = BUS_ADRERR;
359     info.si_addr = (void *)address;
360     force_sig_info(SIGBUS, &info, tsk);
361 
362     /* Kernel mode? Handle exceptions or die */
363     if (!(error_code & 4))
364         goto no_context;
365     return;
347Free the mm lock
353-359Fill in information to show a SIGBUS occurred at the faulting address so that a debugger can trap it later
360Send the signal
363-364If in kernel mode, try and handle the exception during no_context
365If in userspace, just return and the process will die in due course
367 vmalloc_fault:
368     {
376         int offset = __pgd_offset(address);
377         pgd_t *pgd, *pgd_k;
378         pmd_t *pmd, *pmd_k;
379         pte_t *pte_k;
380 
381         asm("movl %%cr3,%0":"=r" (pgd));
382         pgd = offset + (pgd_t *)__va(pgd);
383         pgd_k = init_mm.pgd + offset;
384 
385         if (!pgd_present(*pgd_k))
386             goto no_context;
387         set_pgd(pgd, *pgd_k);
388         
389         pmd = pmd_offset(pgd, address);
390         pmd_k = pmd_offset(pgd_k, address);
391         if (!pmd_present(*pmd_k))
392             goto no_context;
393         set_pmd(pmd, *pmd_k);
394 
395         pte_k = pte_offset(pmd_k, address);
396         if (!pte_present(*pte_k))
397             goto no_context;
398         return;
399     }
400 }

This is the vmalloc fault handler. When pages are mapped in the vmalloc space, only the reference page table is updated. As each process references this area, a fault will be trapped and the process page tables will be synchronised with the reference page table here.

376Get the offset within a PGD
381Copy the address of the PGD for the process from the cr3 register to pgd
382Calculate the pgd pointer from the process PGD
383Calculate for the kernel reference PGD
385-386If the pgd entry is invalid for the kernel page table, goto no_context
387Copy the PGD entry from the kernel reference page table into the process page table
389-393Same idea for the PMD. Copy the page table entry from the kernel reference page table to the process page tables
395Check the PTE
396-397If it is not present, it means the page was not valid even in the kernel reference page table so goto no_context to handle what is probably a kernel bug, such as a reference to a random part of unused kernel space
398Otherwise return knowing the process page tables have been updated and are in sync with the kernel page tables

D.5.2  Expanding the Stack

D.5.2.1  Function: expand_stack

Source: include/linux/mm.h

This function is called by the architecture-dependent page fault handler. The VMA supplied is guaranteed to be one that can grow to cover the address.

640 static inline int expand_stack(struct vm_area_struct * vma, 
                                   unsigned long address)
641 {
642     unsigned long grow;
643 
644     /*
645      * vma->vm_start/vm_end cannot change under us because 
         * the caller is required
646      * to hold the mmap_sem in write mode. We need to get the
647      * spinlock only before relocating the vma range ourself.
648      */
649     address &= PAGE_MASK;
650     spin_lock(&vma->vm_mm->page_table_lock);
651     grow = (vma->vm_start - address) >> PAGE_SHIFT;
652     if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
653     ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > 
                                       current->rlim[RLIMIT_AS].rlim_cur) {
654         spin_unlock(&vma->vm_mm->page_table_lock);
655         return -ENOMEM;
656     }
657     vma->vm_start = address;
658     vma->vm_pgoff -= grow;
659     vma->vm_mm->total_vm += grow;
660     if (vma->vm_flags & VM_LOCKED)
661         vma->vm_mm->locked_vm += grow;
662     spin_unlock(&vma->vm_mm->page_table_lock);
663     return 0;
664 }
649Round the address down to the nearest page boundary
650Lock the page tables spinlock
651Calculate how many pages the stack needs to grow by
652Check to make sure that the size of the stack does not exceed the process limits
653Check to make sure that the size of the address space will not exceed process limits after the stack is grown
654-655If either of the limits are reached, return -ENOMEM which will cause the faulting process to segfault
657-658Grow the VMA down
659Update the amount of address space used by the process
660-661If the region is locked, update the number of locked pages used by the process
662-663Unlock the process page tables and return success
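
As a rough worked example of the arithmetic, with made-up addresses and a 4KiB page size (both assumptions of this sketch, not values taken from the kernel):

    #include <stdio.h>

    #define PAGE_SHIFT 12   /* assume 4KiB pages for this example */

    int main(void)
    {
            /* Hypothetical values: the stack VMA currently starts at
             * 0xbfffe000 and the fault address rounds down to 0xbfffa000. */
            unsigned long vm_start = 0xbfffe000UL;
            unsigned long vm_end   = 0xc0000000UL;
            unsigned long address  = 0xbfffa000UL;
            unsigned long stack_limit = 8UL << 20;  /* pretend RLIMIT_STACK is 8MiB */

            unsigned long grow = (vm_start - address) >> PAGE_SHIFT;

            printf("stack grows by %lu pages\n", grow);      /* 4 pages */
            printf("new stack size is %lu bytes, %s the limit\n",
                   vm_end - address,
                   (vm_end - address > stack_limit) ? "over" : "within");
            return 0;
    }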

D.5.3  Architecture Independent Page Fault Handler

This is the top level pair of functions for the architecture independent page fault handler.

D.5.3.1  Function: handle_mm_fault

Source: mm/memory.c

The call graph for this function is shown in Figure 4.14. This function allocates the PMD and PTE necessary for the new PTE that is about to be mapped. It takes the necessary locks to protect the page tables before calling handle_pte_fault() to fault in the page itself.

1364 int handle_mm_fault(struct mm_struct *mm, 
         struct vm_area_struct * vma,
1365     unsigned long address, int write_access)
1366 {
1367     pgd_t *pgd;
1368     pmd_t *pmd;
1369 
1370     current->state = TASK_RUNNING;
1371     pgd = pgd_offset(mm, address);
1372 
1373     /*
1374      * We need the page table lock to synchronize with kswapd
1375      * and the SMP-safe atomic PTE updates.
1376      */
1377     spin_lock(&mm->page_table_lock);
1378     pmd = pmd_alloc(mm, pgd, address);
1379 
1380     if (pmd) {
1381         pte_t * pte = pte_alloc(mm, pmd, address);
1382         if (pte)
1383             return handle_pte_fault(mm, vma, address,
                            write_access, pte);
1384     }
1385     spin_unlock(&mm->page_table_lock);
1386     return -1;
1387 }
1364The parameters of the function are;
mm is the mm_struct for the faulting process
vma is the vm_area_struct managing the region the fault occurred in
address is the faulting address
write_access is 1 if the fault is a write fault
1370Set the current state of the process
1371Get the pgd entry from the top level page table
1377Lock the mm_struct as the page tables will change
1378pmd_alloc() will allocate a pmd_t if one does not already exist
1380If the pmd has been successfully allocated then...
1381Allocate a PTE for this address if one does not already exist
1382-1383Handle the page fault with handle_pte_fault() (See Section D.5.3.2) and return the status code
1385Failure path, unlock the mm_struct
1386Return -1 which will be interpreted as an out of memory condition. This is correct as this line is only reached if a PMD or PTE could not be allocated

D.5.3.2  Function: handle_pte_fault

Source: mm/memory.c

This function decides what type of fault this is and which function should handle it. do_no_page() is called if this is the first time a page is to be allocated. do_swap_page() handles the case where the page was swapped out to disk with the exception of pages swapped out from tmpfs. do_wp_page() breaks COW pages. If none of them are appropriate, the PTE entry is simply updated. If it was written to, it is marked dirty and it is marked accessed to show it is a young page.

1331 static inline int handle_pte_fault(struct mm_struct *mm,
1332     struct vm_area_struct * vma, unsigned long address,
1333     int write_access, pte_t * pte)
1334 {
1335     pte_t entry;
1336 
1337     entry = *pte;
1338     if (!pte_present(entry)) {
1339         /*
1340          * If it truly wasn't present, we know that kswapd
1341          * and the PTE updates will not touch it later. So
1342          * drop the lock.
1343          */
1344         if (pte_none(entry))
1345             return do_no_page(mm, vma, address, 
                         write_access, pte);
1346         return do_swap_page(mm, vma, address, pte, entry,
                     write_access);
1347     }
1348 
1349     if (write_access) {
1350         if (!pte_write(entry))
1351             return do_wp_page(mm, vma, address, pte, entry);
1352 
1353         entry = pte_mkdirty(entry);
1354     }
1355     entry = pte_mkyoung(entry);
1356     establish_pte(vma, address, pte, entry);
1357     spin_unlock(&mm->page_table_lock);
1358     return 1;
1359 }
1331The parameters of the function are the same as those for handle_mm_fault() except the PTE for the fault is included
1337Record the PTE
1338Handle the case where the PTE is not present
1344If the PTE has never been filled, handle the allocation of the PTE with do_no_page()(See Section D.5.4.1)
1346If the page has been swapped out to backing storage, handle it with do_swap_page()(See Section D.5.5.1)
1349-1354Handle the case where the page is being written to
1350-1351If the PTE is not marked writable, it is a COW page so handle it with do_wp_page()(See Section D.5.6.1)
1353Otherwise just mark the PTE dirty
1355Mark the page as accessed
1356establish_pte() copies the PTE and then updates the TLB and MMU cache. This does not copy in a new PTE but some architectures require the TLB and MMU update
1357Unlock the mm_struct and return that a minor fault occurred

D.5.4  Demand Allocation

D.5.4.1  Function: do_no_page

Source: mm/memory.c

The call graph for this function is shown in Figure 4.15. This function is called the first time a page is referenced so that it may be allocated and filled with data if necessary. If it is an anonymous page, determined by the lack of a vm_ops available to the VMA or the lack of a nopage() function, then do_anonymous_page() is called. Otherwise the supplied nopage() function is called to allocate a page, COW is broken early if required, and the page is inserted into the process page tables here.

1245 static int do_no_page(struct mm_struct * mm, 
         struct vm_area_struct * vma,
1246     unsigned long address, int write_access, pte_t *page_table)
1247 {
1248     struct page * new_page;
1249     pte_t entry;
1250 
1251     if (!vma->vm_ops || !vma->vm_ops->nopage)
1252         return do_anonymous_page(mm, vma, page_table,
                        write_access, address);
1253     spin_unlock(&mm->page_table_lock);
1254 
1255     new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);
1256 
1257     if (new_page == NULL)   /* no page was available -- SIGBUS */
1258         return 0;
1259     if (new_page == NOPAGE_OOM)
1260         return -1;
1245The parameters supplied are the same as those for handle_pte_fault()
1251-1252If no vm_ops is supplied or no nopage() function is supplied, then call do_anonymous_page()(See Section D.5.4.2) to allocate a page and return it
1253Otherwise free the page table lock as the nopage() function can not be called with spinlocks held
1255Call the supplied nopage function, in the case of filesystems, this is frequently filemap_nopage()(See Section D.6.4.1) but will be different for each device driver
1257-1258If NULL is returned, it means some error occurred in the nopage function, such as an IO error while reading from disk. In this case, 0 is returned which results in a SIGBUS being sent to the faulting process
1259-1260If NOPAGE_OOM is returned, the physical page allocator failed to allocate a page and -1 is returned which will forcibly kill the process
1265     if (write_access && !(vma->vm_flags & VM_SHARED)) {
1266         struct page * page = alloc_page(GFP_HIGHUSER);
1267         if (!page) {
1268             page_cache_release(new_page);
1269             return -1;
1270         }
1271         copy_user_highpage(page, new_page, address);
1272         page_cache_release(new_page);
1273         lru_cache_add(page);
1274         new_page = page;
1275     }

Break COW early in this block if appropriate. COW is broken if the fault is a write fault and the region is not shared with VM_SHARED. If COW was not broken in this case, a second fault would occur immediately upon return.

1265Check if COW should be broken early
1266If so, allocate a new page for the process
1267-1270If the page could not be allocated, reduce the reference count to the page returned by the nopage() function and return -1 for out of memory
1271Otherwise copy the contents
1272Reduce the reference count to the returned page which may still be in use by another process
1273Add the new page to the LRU lists so it may be reclaimed by kswapd later
1277     spin_lock(&mm->page_table_lock);
1288     /* Only go through if we didn't race with anybody else... */
1289     if (pte_none(*page_table)) {
1290         ++mm->rss;
1291         flush_page_to_ram(new_page);
1292         flush_icache_page(vma, new_page);
1293         entry = mk_pte(new_page, vma->vm_page_prot);
1294         if (write_access)
1295             entry = pte_mkwrite(pte_mkdirty(entry));
1296         set_pte(page_table, entry);
1297     } else {
1298         /* One of our sibling threads was faster, back out. */
1299         page_cache_release(new_page);
1300         spin_unlock(&mm->page_table_lock);
1301         return 1;
1302     }
1303 
1304     /* no need to invalidate: a not-present page shouldn't 
        * be cached
        */
1305     update_mmu_cache(vma, address, entry);
1306     spin_unlock(&mm->page_table_lock);
1307     return 2;     /* Major fault */
1308 }
1277Lock the page tables again as the allocations have finished and the page tables are about to be updated
1289Check if there is still no PTE in the entry we are about to use. If two faults hit here at the same time, it is possible another processor has already completed the page fault and this one should be backed out
1290-1297If there is no PTE entered, complete the fault
1290Increase the RSS count as the process is now using another page. A check really should be made here to make sure it isn't the global zero page as the RSS count could be misleading
1291As the page is about to be mapped to the process space, it is possible for some architectures that writes to the page in kernel space will not be visible to the process. flush_page_to_ram() ensures the CPU cache will be coherent
1292flush_icache_page() is similar in principle except it ensures the icache and dcache's are coherent
1293Create a pte_t with the appropriate permissions
1294-1295If this is a write, then make sure the PTE has write permissions
1296Place the new PTE in the process page tables
1297-1302If the PTE is already filled, the page acquired from the nopage() function must be released
1299Decrement the reference count to the page. If it drops to 0, it will be freed
1300-1301Release the mm_struct lock and return 1 to signal this is a minor page fault; no major work had to be done for this fault as it was all done by the winner of the race
1305Update the MMU cache for architectures that require it
1306-1307Release the mm_struct lock and return 2 to signal this is a major page fault

D.5.4.2  Function: do_anonymous_page

Source: mm/memory.c

This function allocates a new page for a process accessing a page for the first time. If it is a read access, a system wide page containing only zeros is mapped into the process. If it is a write, a zero-filled page is allocated and placed within the page tables

1190 static int do_anonymous_page(struct mm_struct * mm, 
                  struct vm_area_struct * vma, 
                  pte_t *page_table, int write_access, 
                  unsigned long addr)
1191 {
1192     pte_t entry;
1193 
1194     /* Read-only mapping of ZERO_PAGE. */
1195     entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), 
                       vma->vm_page_prot));
1196 
1197     /* ..except if it's a write access */
1198     if (write_access) {
1199         struct page *page;
1200 
1201         /* Allocate our own private page. */
1202         spin_unlock(&mm->page_table_lock);
1203 
1204         page = alloc_page(GFP_HIGHUSER);
1205         if (!page)
1206             goto no_mem;
1207         clear_user_highpage(page, addr);
1208 
1209         spin_lock(&mm->page_table_lock);
1210         if (!pte_none(*page_table)) {
1211             page_cache_release(page);
1212             spin_unlock(&mm->page_table_lock);
1213             return 1;
1214         }
1215         mm->rss++;
1216         flush_page_to_ram(page);
1217         entry = pte_mkwrite(
                 pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
1218         lru_cache_add(page);
1219         mark_page_accessed(page);
1220     }
1221 
1222     set_pte(page_table, entry);
1223 
1224     /* No need to invalidate - it was non-present before */
1225     update_mmu_cache(vma, addr, entry);
1226     spin_unlock(&mm->page_table_lock);
1227     return 1;     /* Minor fault */
1228 
1229 no_mem:
1230     return -1;
1231 }
1190The parameters are the same as those passed to handle_pte_fault() (See Section D.5.3.2)
1195For read accesses, simply map the system wide empty_zero_page which the ZERO_PAGE() macro returns with the given permissions. The page is write protected so that a write to the page will result in a page fault
1198-1220If this is a write fault, then allocate a new page and zero fill it
1202Unlock the mm_struct as the allocation of a new page could sleep
1204Allocate a new page
1205If a page could not be allocated, return -1 to handle the OOM situation
1207Zero fill the page
1209Reacquire the lock as the page tables are to be updated
1215Update the RSS for the process. Note that the RSS is not updated if it is the global zero page being mapped as is the case with the read-only fault at line 1195
1216Ensure the cache is coherent
1217Mark the PTE writable and dirty as it has been written to
1218Add the page to the LRU list so it may be reclaimed by the swapper later
1219Mark the page accessed which ensures the page is marked hot and on the top of the active list
1222Fix the PTE in the page tables for this process
1225Update the MMU cache if the architecture needs it
1226Free the page table lock
1227Return a minor fault; even though it is possible the page allocator spent time writing out pages, data did not have to be read from disk to fill this page
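
The two cases can be observed from userspace with an ordinary anonymous mapping. The following sketch relies only on standard mmap() behaviour and is not taken from the kernel source; the first read is expected to be satisfied by the zero page while the first write forces the allocation of a private zero-filled page.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 4096;
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return EXIT_FAILURE;
            }

            /* Read fault: handled by mapping the global zero page read-only. */
            printf("first read sees %d\n", p[0]);

            /* Write fault: a private zero-filled page is allocated. */
            p[0] = 42;
            printf("after write: %d\n", p[0]);

            munmap(p, len);
            return 0;
    }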

D.5.5  Demand Paging

D.5.5.1  Function: do_swap_page

Source: mm/memory.c

The call graph for this function is shown in Figure 4.16. This function handles the case where a page has been swapped out. A swapped out page may exist in the swap cache if it is shared between a number of processes or recently swapped in during readahead. This function is broken up into three parts

1117 static int do_swap_page(struct mm_struct * mm,
1118     struct vm_area_struct * vma, unsigned long address,
1119     pte_t * page_table, pte_t orig_pte, int write_access)
1120 {
1121     struct page *page;
1122     swp_entry_t entry = pte_to_swp_entry(orig_pte);
1123     pte_t pte;
1124     int ret = 1;
1125 
1126     spin_unlock(&mm->page_table_lock);
1127     page = lookup_swap_cache(entry);

Function preamble, check for the page in the swap cache

1117-1119The parameters are the same as those supplied to handle_pte_fault() (See Section D.5.3.2)
1122Get the swap entry information from the PTE
1126Free the mm_struct spinlock
1127Lookup the page in the swap cache
1128     if (!page) {
1129         swapin_readahead(entry);
1130         page = read_swap_cache_async(entry);
1131         if (!page) {
1136             int retval;
1137             spin_lock(&mm->page_table_lock);
1138             retval = pte_same(*page_table, orig_pte) ? -1 : 1;
1139             spin_unlock(&mm->page_table_lock);
1140             return retval;
1141         }
1142 
1143         /* Had to read the page from swap area: Major fault */
1144         ret = 2;
1145     }

If the page did not exist in the swap cache, then read it from backing storage with swapin_readahead() which reads in the requested page and a number of pages after it. Once it completes, read_swap_cache_async() should be able to return the page.

1128-1145This block is executed if the page was not in the swap cache
1129swapin_readahead()(See Section D.6.6.1) reads in the requested page and a number of pages after it. The number of pages read in is determined by the page_cluster variable in mm/swap.c which is initialised to 2 on machines with less than 16MiB of memory and 3 otherwise. 2^page_cluster pages are read in after the requested page unless a bad or empty page entry is encountered
1130read_swap_cache_async() (See Section K.3.1.1) will look up the requested page and read it from disk if necessary
1131-1141If the page does not exist, there was another fault which swapped in this page and removed it from the cache while spinlocks were dropped
1137Lock the mm_struct
1138Compare the two PTEs. If they are still the same, the page could not be read from swap so -1 is returned to signal an error. If they differ, another process has already swapped in the page so 1 is returned to mark a minor page fault as a disk access was not required for this particular page.
1139-1140Free the mm_struct and return the status
1144The disk had to be accessed so mark that this is a major page fault
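
In other words, the readahead covers 2^page_cluster pages. A trivial sketch of the arithmetic, using the page_cluster values quoted above:

    #include <stdio.h>

    int main(void)
    {
            /* page_cluster is 2 on machines with less than 16MiB of memory
             * and 3 otherwise, as described above. */
            int page_cluster = 3;
            unsigned long readahead_pages = 1UL << page_cluster;  /* 2^page_cluster */

            printf("swapin readahead covers %lu pages\n", readahead_pages);  /* 8 */
            return 0;
    }
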
1147     mark_page_accessed(page);
1148 
1149     lock_page(page);
1150 
1151     /*
1152      * Back out if somebody else faulted in this pte while we
1153      * released the page table lock.
1154      */
1155     spin_lock(&mm->page_table_lock);
1156     if (!pte_same(*page_table, orig_pte)) {
1157         spin_unlock(&mm->page_table_lock);
1158         unlock_page(page);
1159         page_cache_release(page);
1160         return 1;
1161     }
1162 
1163     /* The page isn't present yet, go ahead with the fault. */
1164         
1165     swap_free(entry);
1166     if (vm_swap_full())
1167         remove_exclusive_swap_page(page);
1168 
1169     mm->rss++;
1170     pte = mk_pte(page, vma->vm_page_prot);
1171     if (write_access && can_share_swap_page(page))
1172         pte = pte_mkdirty(pte_mkwrite(pte));
1173     unlock_page(page);
1174 
1175     flush_page_to_ram(page);
1176     flush_icache_page(vma, page);
1177     set_pte(page_table, pte);
1178 
1179     /* No need to invalidate - it was non-present before */
1180     update_mmu_cache(vma, address, pte);
1181     spin_unlock(&mm->page_table_lock);
1182     return ret;
1183 }

Place the page in the process page tables

1147mark_page_accessed()(See Section J.2.3.1) will mark the page as active so it will be moved to the top of the active LRU list
1149Lock the page which has the side effect of waiting for the IO swapping in the page to complete
1155-1161If someone else faulted in the page before we could, the reference to the page is dropped, the lock is freed and 1 is returned to signal that this was a minor fault
1165The function swap_free()(See Section K.2.2.1) reduces the reference to a swap entry. If it drops to 0, it is actually freed
1166-1167Page slots in swap space are reserved for the same page once they have been swapped out to avoid having to search for a free slot each time. If the swap space is full though, the reservation is broken and the slot freed up for another page
1169The page is now going to be used so increment the mm_structs RSS count
1170Make a PTE for this page
1171If the page is being written to and is not shared between more than one process, mark it writable and dirty so that it will be kept in sync with the backing storage and swap cache for other processes
1173Unlock the page
1175As the page is about to be mapped to the process space, it is possible for some architectures that writes to the page in kernel space will not be visible to the process. flush_page_to_ram() ensures the cache will be coherent
1176flush_icache_page() is similar in principle except it ensures the icache and dcache's are coherent
1177Set the PTE in the process page tables
1180Update the MMU cache if the architecture requires it
1181-1182Unlock the mm_struct and return whether it was a minor or major page fault

D.5.5.2  Function: can_share_swap_page

Source: mm/swapfile.c

This function determines if the swap cache entry for this page may be used or not. It may be used if there are no other references to it. Most of the work is performed by exclusive_swap_page() but this function first makes a few basic checks to avoid having to acquire too many locks.

259 int can_share_swap_page(struct page *page)
260 {
261     int retval = 0;
262 
263     if (!PageLocked(page))
264         BUG();
265     switch (page_count(page)) {
266     case 3:
267         if (!page->buffers)
268                 break;
269         /* Fallthrough */
270     case 2:
271         if (!PageSwapCache(page))
272                 break;
273         retval = exclusive_swap_page(page);
274         break;
275     case 1:
276         if (PageReserved(page))
277                 break;
278             retval = 1;
279     }
280         return retval;
281 }
263-264This function is called from the fault path and the page must be locked
265Switch based on the number of references
266-268If the count is 3 but there are no buffers associated with it, there is more than one process using the page. Buffers may be associated for just one process if the page is backed by a swap file instead of a partition
270-273If the count is only two, but it is not a member of the swap cache, then it has no slot which may be shared so return false. Otherwise perform a full check with exclusive_swap_page() (See Section D.5.5.3)
276-277If the page is reserved, it is the global ZERO_PAGE so it cannot be shared. Otherwise, this process is definitely the only user of the page

D.5.5.3  Function: exclusive_swap_page

Source: mm/swapfile.c

This function checks if the process is the only user of a locked swap page.

229 static int exclusive_swap_page(struct page *page)
230 {
231     int retval = 0;
232     struct swap_info_struct * p;
233     swp_entry_t entry;
234 
235     entry.val = page->index;
236     p = swap_info_get(entry);
237     if (p) {
238         /* Is the only swap cache user the cache itself? */
239         if (p->swap_map[SWP_OFFSET(entry)] == 1) {
240             /* Recheck the page count with the pagecache 
                 * lock held.. */
241             spin_lock(&pagecache_lock);
242             if (page_count(page) - !!page->buffers == 2)
243                 retval = 1;
244             spin_unlock(&pagecache_lock);
245         }
246         swap_info_put(p);
247     }
248     return retval;
249 }
231By default, return false
235The swp_entry_t for the page is stored in page->index as explained in Section 2.4
236Get the swap_info_struct with swap_info_get()(See Section K.2.3.1)
237-247If a slot exists, check if we are the exclusive user and return true if we are
239Check if the slot is only being used by the cache itself. If it is, the page count needs to be checked again with the pagecache_lock held
242-243!!page->buffers will evaluate to 1 if buffers are present so this block effectively checks if the process is the only user of the page. If it is, retval is set to 1 so that true will be returned
246Drop the reference to the slot that was taken with swap_info_get() (See Section K.2.3.1)
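
The !! construct used on line 242 is ordinary C double negation which collapses any non-NULL pointer to 1 so that it can take part in the count arithmetic. A standalone illustration, unrelated to the kernel source:

    #include <stdio.h>

    int main(void)
    {
            void *buffers = NULL;
            int page_count = 2;

            /* !!ptr is 0 for NULL and 1 for any non-NULL pointer, so the
             * expression discounts the reference held by the buffers, if any. */
            printf("%d\n", page_count - !!buffers == 2);    /* 1: exclusive user */

            buffers = &page_count;
            page_count = 3;
            printf("%d\n", page_count - !!buffers == 2);    /* 1: still exclusive */
            return 0;
    }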

D.5.6  Copy On Write (COW) Pages

D.5.6.1  Function: do_wp_page

Source: mm/memory.c

The call graph for this function is shown in Figure 4.17. This function handles the case where a user tries to write to a private page shared among processes, such as what happens after fork(). Basically, a new page is allocated, the contents are copied to it and the reference count on the old page is decremented.

948 static int do_wp_page(struct mm_struct *mm, 
            struct vm_area_struct * vma,
949         unsigned long address, pte_t *page_table, pte_t pte)
950 {
951     struct page *old_page, *new_page;
952 
953     old_page = pte_page(pte);
954     if (!VALID_PAGE(old_page))
955         goto bad_wp_page;
956 
948-950The parameters are the same as those supplied to handle_pte_fault()
953-955Get a reference to the current page in the PTE and make sure it is valid
957     if (!TryLockPage(old_page)) {
958         int reuse = can_share_swap_page(old_page);
959         unlock_page(old_page);
960         if (reuse) {
961             flush_cache_page(vma, address);
962             establish_pte(vma, address, page_table,
                          pte_mkyoung(pte_mkdirty(pte_mkwrite(pte))));
963             spin_unlock(&mm->page_table_lock);
964             return 1;       /* Minor fault */
965         }
966     }
957First try to lock the page. TryLockPage() returns 0 if it acquired the lock, meaning the page was previously unlocked
958If we managed to lock it, call can_share_swap_page() (See Section D.5.5.2) to see if we are the exclusive user of the swap slot for this page. If we are, it means that we are the last process to break COW and we can simply use this page rather than allocating a new one
960-965If we are the only users of the swap slot, then it means we are the only user of this page and the last process to break COW so the PTE is simply re-established and we return a minor fault
968     /*
969      * Ok, we need to copy. Oh, well..
970      */
971     page_cache_get(old_page);
972     spin_unlock(&mm->page_table_lock);
973 
974     new_page = alloc_page(GFP_HIGHUSER);
975     if (!new_page)
976         goto no_mem;
977     copy_cow_page(old_page,new_page,address);
978 
971We need to copy this page so first get a reference to the old page so it doesn't disappear before we are finished with it
972Unlock the spinlock as we are about to call alloc_page() (See Section F.2.1) which may sleep
974-976Allocate a page and make sure one was returned
977No prizes for guessing what this function does. If the page being broken is the global zero page, clear_user_highpage() will be used to zero out the contents of the page, otherwise copy_user_highpage() copies the actual contents
982     spin_lock(&mm->page_table_lock);
983     if (pte_same(*page_table, pte)) {
984         if (PageReserved(old_page))
985             ++mm->rss;
986         break_cow(vma, new_page, address, page_table);
987         lru_cache_add(new_page);
988 
989         /* Free the old page.. */
990         new_page = old_page;
991     }
992     spin_unlock(&mm->page_table_lock);
993     page_cache_release(new_page);
994     page_cache_release(old_page);
995     return 1;       /* Minor fault */
982The page table lock was released for alloc_page()(See Section F.2.1) so reacquire it
983Make sure the PTE has not changed in the meantime, which could have happened if another fault occurred while the spinlock was released
984-985The RSS is only updated if PageReserved() is true which will only happen if the page being faulted is the global ZERO_PAGE which is not accounted for in the RSS. If this was a normal page, the process would be using the same number of physical frames after the fault as it was before but against the zero page, it'll be using a new frame so rss++
986break_cow() is responsible for calling the architecture hooks to ensure the CPU cache and TLBs are up to date and then establish the new page into the PTE. It first calls flush_page_to_ram() which must be called when a struct page is about to be placed in userspace. Next is flush_cache_page() which flushes the page from the CPU cache. Lastly is establish_pte() which establishes the new page into the PTE
987Add the page to the LRU lists
992Release the spinlock
993-994Drop the references to the pages
995Return a minor fault
996 
997 bad_wp_page:
998     spin_unlock(&mm->page_table_lock);
999     printk("do_wp_page: bogus page at address %08lx (page 0x%lx)\n",
                    address,(unsigned long)old_page);
1000     return -1;
1001 no_mem:
1002     page_cache_release(old_page);
1003     return -1;
1004 }
997-1000This is a false COW break which will only happen with a buggy kernel. Print out an informational message and return
1001-1003The page allocation failed so release the reference to the old page and return -1
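
The behaviour implemented here can be demonstrated from userspace. The following sketch relies only on standard fork() semantics and is not taken from the kernel source; the write by the child faults, the shared page is copied, and the parent's data is left untouched.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            char *buf = malloc(4096);
            pid_t pid;

            if (!buf)
                    return EXIT_FAILURE;
            strcpy(buf, "parent data");

            pid = fork();
            if (pid == 0) {
                    /* This write faults; the shared page is copied so the
                     * child gets its own private copy. */
                    strcpy(buf, "child data");
                    printf("child sees:  %s\n", buf);
                    exit(EXIT_SUCCESS);
            }
            waitpid(pid, NULL, 0);
            printf("parent sees: %s\n", buf);   /* still "parent data" */
            free(buf);
            return 0;
    }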

D.6  Page-Related Disk IO

D.6.1  Generic File Reading

This is more the domain of the IO manager than the VM but because it performs the operations via the page cache, we will cover it briefly. The operation of generic_file_write() is essentially the same although it is not covered by this book. However, if you understand how the read takes place, the write function will pose no problem to you.

D.6.1.1  Function: generic_file_read

Source: mm/filemap.c

This is the generic file read function used by any filesystem that reads pages through the page cache. For normal IO, it is responsible for building a read_descriptor_t for use with do_generic_file_read() and file_read_actor(). For direct IO, this function is basically a wrapper around generic_file_direct_IO().

1695 ssize_t generic_file_read(struct file * filp, 
                               char * buf, size_t count, 
                               loff_t *ppos)
1696 {
1697     ssize_t retval;
1698 
1699     if ((ssize_t) count < 0)
1700         return -EINVAL;
1701 
1702     if (filp->f_flags & O_DIRECT)
1703         goto o_direct;
1704 
1705     retval = -EFAULT;
1706     if (access_ok(VERIFY_WRITE, buf, count)) {
1707         retval = 0;
1708 
1709         if (count) {
1710             read_descriptor_t desc;
1711 
1712             desc.written = 0;
1713             desc.count = count;
1714             desc.buf = buf;
1715             desc.error = 0;
1716             do_generic_file_read(filp, ppos, &desc, 
                                      file_read_actor);
1717 
1718             retval = desc.written;
1719             if (!retval)
1720                 retval = desc.error;
1721         }
1722     }
1723  out:
1724     return retval;

This block is concerned with normal file IO.

1702-1703If this is direct IO, jump to the o_direct label
1706If the access permissions to write to a userspace page are ok, then proceed
1709If count is 0, there is no IO to perform
1712-1715Populate a read_descriptor_t structure which will be used by file_read_actor()(See Section L.3.2.3)
1716Perform the file read
1718Extract the number of bytes written from the read descriptor struct
1719-1720If an error occurred, extract what the error was
1724Return either the number of bytes read or the error that occurred
1725 
1726  o_direct:
1727     {
1728         loff_t pos = *ppos, size;
1729         struct address_space *mapping = 
                                      filp->f_dentry->d_inode->i_mapping;
1730         struct inode *inode = mapping->host;
1731 
1732         retval = 0;
1733         if (!count)
1734             goto out; /* skip atime */
1735         down_read(&inode->i_alloc_sem);
1736         down(&inode->i_sem);
1737         size = inode->i_size;
1738         if (pos < size) {
1739             retval = generic_file_direct_IO(READ, filp, buf, 
                                                 count, pos);
1740             if (retval > 0)
1741                 *ppos = pos + retval;
1742         }
1743         UPDATE_ATIME(filp->f_dentry->d_inode);
1744         goto out;
1745     }
1746 }

This block is concerned with direct IO. It is largely responsible for extracting the parameters required for generic_file_direct_IO().

1729Get the address_space used by this struct file
1733-1734If no IO has been requested, jump to out to avoid updating the inode's access time
1737Get the size of the file
1738-1739If the current position is before the end of the file, the read is safe so call generic_file_direct_IO()
1740-1741If the read was successful, update the current position in the file for the reader
1743Update the access time
1744Goto out which just returns retval
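
For comparison, the following userspace sketch issues the kind of request the o_direct block services. Only standard open() and read() calls are used; the 4096-byte buffer alignment and transfer size are assumptions for illustration as the real alignment requirements depend on the kernel version and filesystem.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            void *buf;
            int fd;
            ssize_t n;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <file>\n", argv[0]);
                    return EXIT_FAILURE;
            }
            /* O_DIRECT transfers need an aligned buffer and transfer size;
             * 4096 bytes is assumed to satisfy both here. */
            if (posix_memalign(&buf, 4096, 4096)) {
                    fprintf(stderr, "posix_memalign failed\n");
                    return EXIT_FAILURE;
            }
            fd = open(argv[1], O_RDONLY | O_DIRECT);
            if (fd < 0) {
                    perror("open");
                    return EXIT_FAILURE;
            }
            n = read(fd, buf, 4096);   /* serviced without going through the page cache */
            printf("read %zd bytes\n", n);
            close(fd);
            free(buf);
            return 0;
    }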

D.6.1.2  Function: do_generic_file_read

Source: mm/filemap.c

This is the core part of the generic file read operation. It is responsible for allocating a page if it doesn't already exist in the page cache. If it does, it must make sure the page is up-to-date and finally, it is responsible for making sure that the appropriate readahead window is set.

1349 void do_generic_file_read(struct file * filp, 
                               loff_t *ppos, 
                               read_descriptor_t * desc, 
                               read_actor_t actor)
1350 {
1351     struct address_space *mapping = 
                                     filp->f_dentry->d_inode->i_mapping;
1352     struct inode *inode = mapping->host;
1353     unsigned long index, offset;
1354     struct page *cached_page;
1355     int reada_ok;
1356     int error;
1357     int max_readahead = get_max_readahead(inode);
1358 
1359     cached_page = NULL;
1360     index = *ppos >> PAGE_CACHE_SHIFT;
1361     offset = *ppos & ~PAGE_CACHE_MASK;
1362 
1357Get the maximum readahead window size for this block device
1360Calculate the page index which holds the current file position pointer
1361Calculate the offset within the page that holds the current file position pointer
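
As a worked example of this arithmetic, assuming 4KiB page cache pages, a file position of 10000 falls in page index 2 at offset 1808 within that page:

    #include <stdio.h>

    #define PAGE_CACHE_SHIFT 12                         /* assume 4KiB pages */
    #define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)
    #define PAGE_CACHE_MASK  (~(PAGE_CACHE_SIZE - 1))

    int main(void)
    {
            unsigned long long ppos = 10000;    /* current file position */

            unsigned long index  = ppos >> PAGE_CACHE_SHIFT;    /* 2 */
            unsigned long offset = ppos & ~PAGE_CACHE_MASK;     /* 1808 */

            printf("page index %lu, offset %lu within the page\n", index, offset);
            return 0;
    }
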
1363 /*
1364  * If the current position is outside the previous read-ahead
1365  * window, we reset the current read-ahead context and set read
1366  * ahead max to zero (will be set to just needed value later),
1367  * otherwise, we assume that the file accesses are sequential
1368  * enough to continue read-ahead.
1369  */
1370     if (index > filp->f_raend || 
             index + filp->f_rawin < filp->f_raend) {
1371         reada_ok = 0;
1372         filp->f_raend = 0;
1373         filp->f_ralen = 0;
1374         filp->f_ramax = 0;
1375         filp->f_rawin = 0;
1376     } else {
1377         reada_ok = 1;
1378     }
1379 /*
1380  * Adjust the current value of read-ahead max.
1381  * If the read operation stay in the first half page, force no
1382  * readahead. Otherwise try to increase read ahead max just
      * enough to do the read request.
1383  * Then, at least MIN_READAHEAD if read ahead is ok,
1384  * and at most MAX_READAHEAD in all cases.
1385  */
1386     if (!index && offset + desc->count <= (PAGE_CACHE_SIZE >> 1)) {
1387         filp->f_ramax = 0;
1388     } else {
1389         unsigned long needed;
1390 
1391         needed = ((offset + desc->count) >> PAGE_CACHE_SHIFT) + 1;
1392 
1393         if (filp->f_ramax < needed)
1394             filp->f_ramax = needed;
1395 
1396         if (reada_ok && filp->f_ramax < vm_min_readahead)
1397                 filp->f_ramax = vm_min_readahead;
1398         if (filp->f_ramax > max_readahead)
1399             filp->f_ramax = max_readahead;
1400     }
1370-1378As the comment suggests, the readahead window gets reset if the current file position is outside the current readahead window. It gets reset to 0 here and adjusted by generic_file_readahead()(See Section D.6.1.3) as necessary
1386-1400As the comment states, readahead is disabled if the read request stays within the first half of the first page. Otherwise f_ramax is increased to just cover the request, raised to vm_min_readahead if readahead is allowed and capped at the maximum readahead for the device
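
A worked sketch of the f_ramax adjustment with made-up values; the vm_min_readahead and device maximum used here are assumptions for illustration only:

    #include <stdio.h>

    #define PAGE_CACHE_SHIFT 12     /* assume 4KiB pages */

    int main(void)
    {
            unsigned long offset = 1808, count = 20000;     /* current request */
            unsigned long f_ramax = 0;                      /* current readahead max */
            unsigned long vm_min_readahead = 3;             /* assumed minimum */
            unsigned long max_readahead = 31;               /* assumed device maximum */
            int reada_ok = 1;

            /* Pages needed to satisfy this request, as on line 1391 */
            unsigned long needed = ((offset + count) >> PAGE_CACHE_SHIFT) + 1;

            if (f_ramax < needed)
                    f_ramax = needed;
            if (reada_ok && f_ramax < vm_min_readahead)
                    f_ramax = vm_min_readahead;
            if (f_ramax > max_readahead)
                    f_ramax = max_readahead;

            printf("readahead max set to %lu pages\n", f_ramax);    /* 6 pages */
            return 0;
    }
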
1402     for (;;) {
1403         struct page *page, **hash;
1404         unsigned long end_index, nr, ret;
1405 
1406         end_index = inode->i_size >> PAGE_CACHE_SHIFT;
1407             
1408         if (index > end_index)
1409             break;
1410         nr = PAGE_CACHE_SIZE;
1411         if (index == end_index) {
1412             nr = inode->i_size & ~PAGE_CACHE_MASK;
1413             if (nr <= offset)
1414                 break;
1415         }
1416 
1417         nr = nr - offset;
1418 
1419         /*
1420          * Try to find the data in the page cache..
1421          */
1422         hash = page_hash(mapping, index);
1423 
1424         spin_lock(&pagecache_lock);
1425         page = __find_page_nolock(mapping, index, *hash);
1426         if (!page)
1427             goto no_cached_page;
1402This loop goes through each of the pages necessary to satisfy the read request
1406Calculate where the end of the file is in pages
1408-1409If the current index is beyond the end, then break out as we are trying to read beyond the end of the file
1410-1417Calculate nr to be the number of bytes remaining to be read in the current page. The block takes into account that this might be the last page used by the file and where the current file position is within the page
1422-1425Search for the page in the page cache
1426-1427If the page is not in the page cache, goto no_cached_page where it will be allocated
1428 found_page:
1429         page_cache_get(page);
1430         spin_unlock(&pagecache_lock);
1431 
1432         if (!Page_Uptodate(page))
1433             goto page_not_up_to_date;
1434         generic_file_readahead(reada_ok, filp, inode, page);

In this block, the page was found in the page cache.

1429Take a reference to the page in the page cache so it does not get freed prematurely
1432-1433If the page is not up-to-date, goto page_not_up_to_date to update the page with information on the disk
1434Perform file readahead with generic_file_readahead()(See Section D.6.1.3)
1435 page_ok:
1436         /* If users can be writing to this page using arbitrary
1437          * virtual addresses, take care about potential aliasing
1438          * before reading the page on the kernel side.
1439          */
1440         if (mapping->i_mmap_shared != NULL)
1441             flush_dcache_page(page);
1442 
1443         /*
1444          * Mark the page accessed if we read the
1445          * beginning or we just did an lseek.
1446          */
1447         if (!offset || !filp->f_reada)
1448             mark_page_accessed(page);
1449 
1450         /*
1451          * Ok, we have the page, and it's up-to-date, so
1452          * now we can copy it to user space...
1453          *
1454          * The actor routine returns how many bytes were actually used..
1455          * NOTE! This may not be the same as how much of a user buffer
1456          * we filled up (we may be padding etc), so we can only update
1457          * "pos" here (the actor routine has to update the user buffer
1458          * pointers and the remaining count).
1459          */
1460         ret = actor(desc, page, offset, nr);
1461         offset += ret;
1462         index += offset >> PAGE_CACHE_SHIFT;
1463         offset &= ~PAGE_CACHE_MASK;
1464 
1465         page_cache_release(page);
1466         if (ret == nr && desc->count)
1467             continue;
1468         break;

In this block, the page is present in the page cache and ready to be read by the file read actor function.

1440-1441As other users could be writing this page, call flush_dcache_page() to make sure the changes are visible
1447-1448As the page has just been accessed, call mark_page_accessed() (See Section J.2.3.1) to move it to the active_list
1460Call the actor function. In this case, the actor function is file_read_actor() (See Section L.3.2.3) which is responsible for copying the bytes from the page to userspace
1461Update the current offset within the file
1462Move to the next page if necessary
1463Update the offset within the page we are currently reading. Remember that we could have just crossed into the next page in the file
1465Release our reference to this page
1466-1468If there is still data to be read, loop again to read the next page. Otherwise break as the read operation is complete
1470 /*
1471  * Ok, the page was not immediately readable, so let's try to read 
      * ahead while we're at it..
1472  */
1473 page_not_up_to_date:
1474         generic_file_readahead(reada_ok, filp, inode, page);
1475 
1476         if (Page_Uptodate(page))
1477             goto page_ok;
1478 
1479         /* Get exclusive access to the page ... */
1480         lock_page(page);
1481 
1482         /* Did it get unhashed before we got the lock? */
1483         if (!page->mapping) {
1484             UnlockPage(page);
1485             page_cache_release(page);
1486             continue;
1487         }
1488 
1489         /* Did somebody else fill it already? */
1490         if (Page_Uptodate(page)) {
1491             UnlockPage(page);
1492             goto page_ok;
1493         }

In this block, the page being read was not up-to-date with information on the disk. generic_file_readahead() is called to update the current page and readahead as IO is required anyway.

1474Call generic_file_readahead()(See Section D.6.1.3) to sync the current page and readahead if necessary
1476-1477If the page is now up-to-date, goto page_ok to start copying the bytes to userspace
1480Otherwise something happened with readahead so lock the page for exclusive access
1483-1487If the page was somehow removed from the page cache while spinlocks were not held, then release the reference to the page and start all over again. The second time around, the page will get allocated and inserted into the page cache all over again
1490-1493If someone updated the page while we did not have a lock on the page then unlock it again and goto page_ok to copy the bytes to userspace
1495 readpage:
1496         /* ... and start the actual read. The read will 
              * unlock the page. */
1497         error = mapping->a_ops->readpage(filp, page);
1498 
1499         if (!error) {
1500             if (Page_Uptodate(page))
1501                 goto page_ok;
1502 
1503             /* Again, try some read-ahead while waiting for
                  * the page to finish.. */
1504             generic_file_readahead(reada_ok, filp, inode, page);
1505             wait_on_page(page);
1506             if (Page_Uptodate(page))
1507                 goto page_ok;
1508             error = -EIO;
1509         }
1510 
1511         /* UHHUH! A synchronous read error occurred. Report it */
1512         desc->error = error;
1513         page_cache_release(page);
1514         break;

At this point, readahead failed so we synchronously read the page with the address_space supplied readpage() function.

1497Call the address_space filesystem-specific readpage() function. In many cases this will ultimately call the function block_read_full_page() declared in fs/buffer.c
1499-1501If no error occurred and the page is now up-to-date, goto page_ok to begin copying the bytes to userspace
1504Otherwise, schedule some readahead to occur as we are forced to wait on IO anyway
1505-1507Wait for IO on the requested page to complete. If it finished successfully, then goto page_ok
1508Otherwise an error occurred so set -EIO to be returned to userspace
1512-1514An IO error occurred so record it and release the reference to the current page. This error will be picked up from the read_descriptor_t struct by generic_file_read() (See Section D.6.1.1)
1516 no_cached_page:
1517         /*
1518          * Ok, it wasn't cached, so we need to create a new
1519          * page..
1520          *
1521          * We get here with the page cache lock held.
1522          */
1523         if (!cached_page) {
1524             spin_unlock(&pagecache_lock);
1525             cached_page = page_cache_alloc(mapping);
1526             if (!cached_page) {
1527                 desc->error = -ENOMEM;
1528                 break;
1529             }
1530 
1531             /*
1532              * Somebody may have added the page while we
1533              * dropped the page cache lock. Check for that.
1534              */
1535             spin_lock(&pagecache_lock);
1536             page = __find_page_nolock(mapping, index, *hash);
1537             if (page)
1538                 goto found_page;
1539         }
1540 
1541         /*
1542          * Ok, add the new page to the hash-queues...
1543          */
1544         page = cached_page;
1545         __add_to_page_cache(page, mapping, index, hash);
1546         spin_unlock(&pagecache_lock);
1547         lru_cache_add(page);        
1548         cached_page = NULL;
1549 
1550         goto readpage;
1551     }

In this block, the page does not exist in the page cache so allocate one and add it.

1523-1539If a cache page has not already been allocated then allocate one and make sure that someone else did not insert one into the page cache while we were sleeping
1524Release pagecache_lock as page_cache_alloc() may sleep
1525-1529Allocate a page and set -ENOMEM to be returned if the allocation failed
1535-1536Acquire pagecache_lock again and search the page cache to make sure another process has not inserted it while the lock was dropped
1537If another process added a suitable page to the cache already, jump to found_page as the one we just allocated is no longer necessary
1544-1545Otherwise, add the page we just allocated to the page cache
1547Add the page to the LRU lists
1548Set cached_page to NULL as it is now in use
1550Goto readpage to schedule the page to be read from disk
1552 
1553     *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
1554     filp->f_reada = 1;
1555     if (cached_page)
1556         page_cache_release(cached_page);
1557     UPDATE_ATIME(inode);
1558 }
1553Update our position within the file
1555-1556If a page was allocated for addition to the page cache and then found to be unneeded, release it here
1557Update the access time to the file

D.6.1.3  Function: generic_file_readahead

Source: mm/filemap.c

This function performs generic file read-ahead. Readahead is one of the few areas that is very heavily commented upon in the code. It is highly recommended that you read the comments in mm/filemap.c marked with “Read-ahead context”.

1222 static void generic_file_readahead(int reada_ok,
1223     struct file * filp, struct inode * inode,
1224     struct page * page)
1225 {
1226     unsigned long end_index;
1227     unsigned long index = page->index;
1228     unsigned long max_ahead, ahead;
1229     unsigned long raend;
1230     int max_readahead = get_max_readahead(inode);
1231 
1232     end_index = inode->i_size >> PAGE_CACHE_SHIFT;
1233 
1234     raend = filp->f_raend;
1235     max_ahead = 0;
1227Get the index to start from based on the supplied page
1230Get the maximum sized readahead for this block device
1232Get the index, in pages, of the end of the file
1234Get the end of the readahead window from the struct file
1236 
1237 /*
1238  * The current page is locked.
1239  * If the current position is inside the previous read IO request, 
1240  * do not try to reread previously read ahead pages.
1241  * Otherwise decide or not to read ahead some pages synchronously.
1242  * If we are not going to read ahead, set the read ahead context
1243  * for this page only.
1244  */
1245     if (PageLocked(page)) {
1246         if (!filp->f_ralen || 
                 index >= raend || 
                 index + filp->f_rawin < raend) {
1247             raend = index;
1248             if (raend < end_index)
1249                 max_ahead = filp->f_ramax;
1250             filp->f_rawin = 0;
1251             filp->f_ralen = 1;
1252             if (!max_ahead) {
1253                 filp->f_raend  = index + filp->f_ralen;
1254                 filp->f_rawin += filp->f_ralen;
1255             }
1256         }
1257     }

This block has encountered a page that is locked so it must decide whether to temporarily disable readahead.

1245If the current page is locked for IO, then check if the current page is within the last readahead window. If it is, there is no point trying to readahead again. If it is not, or readahead has not been performed previously, update the readahead context
1246The first check is if readahead has been performed previously. The second is to see if the current locked page is after where the previous readahead finished. The third check is if the current locked page is before the start of the current readahead window. The sketch after this list restates the test as a stand-alone predicate
1247Update the end of the readahead window
1248-1249If the end of the readahead window is not after the end of the file, set max_ahead to be the maximum amount of readahead that should be used with this struct file (filp→f_ramax)
1250-1255Set readahead to only occur with the current page, effectively disabling readahead
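
To make the three-way test at line 1246 easier to follow, the sketch below restates it as a stand-alone predicate with plain unsigned longs standing in for the struct file fields. It is for illustration only; the 16-page window used in main() is an arbitrary assumption.

#include <stdio.h>

/* Returns non-zero if the readahead context should be reset so that
 * it covers only the current (locked) page: either no readahead has
 * happened yet, or the faulting index lies outside the window that
 * the previous readahead covered. */
static int should_reset_readahead(unsigned long index,
                                  unsigned long f_ralen,
                                  unsigned long f_raend,
                                  unsigned long f_rawin)
{
        return !f_ralen ||                 /* no previous readahead */
               index >= f_raend ||         /* past the end of the last window */
               index + f_rawin < f_raend;  /* before the start of the window */
}

int main(void)
{
        /* Assume a window of 16 pages ending at page 64, so pages
         * 48-63 were read ahead previously. */
        printf("%d\n", should_reset_readahead(50, 16, 64, 16)); /* 0: inside */
        printf("%d\n", should_reset_readahead(70, 16, 64, 16)); /* 1: past end */
        printf("%d\n", should_reset_readahead(40, 16, 64, 16)); /* 1: before start */
        return 0;
}
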
1258 /*
1259  * The current page is not locked.
1260  * If we were reading ahead and,
1261  * if the current max read ahead size is not zero and,
1262  * if the current position is inside the last read-ahead IO
1263  * request, it is the moment to try to read ahead asynchronously.
1264  * We will later force unplug device in order to force
      * asynchronous read IO.
1265  */
1266     else if (reada_ok && filp->f_ramax && raend >= 1 &&
1267          index <= raend && index + filp->f_ralen >= raend) {
1268 /*
1269  * Add ONE page to max_ahead in order to try to have about the
1270  * same IO maxsize as synchronous read-ahead 
      * (MAX_READAHEAD + 1)*PAGE_CACHE_SIZE.
1271  * Compute the position of the last page we have tried to read
1272  * in order to begin to read ahead just at the next page.
1273  */
1274         raend -= 1;
1275         if (raend < end_index)
1276             max_ahead = filp->f_ramax + 1;
1277 
1278         if (max_ahead) {
1279             filp->f_rawin = filp->f_ralen;
1280             filp->f_ralen = 0;
1281             reada_ok      = 2;
1282         }
1283     }

This is one of the rare cases where the in-code commentary makes the code as clear as it possibly could be. Basically, it is saying that if the current page is not locked for IO, then extend the readahead window slightly and remember that readahead is currently going well.

1284 /*
1285  * Try to read ahead pages.
1286  * We hope that ll_rw_blk() plug/unplug, coalescence, requests
1287  * sort and the scheduler, will work enough for us to avoid too 
      * bad actuals IO requests.
1288  */
1289     ahead = 0;
1290     while (ahead < max_ahead) {
1291         ahead ++;
1292         if ((raend + ahead) >= end_index)
1293             break;
1294         if (page_cache_read(filp, raend + ahead) < 0)
1295             break;
1296     }

This block performs the actual readahead by calling page_cache_read() for each of the pages in the readahead window. Note here how ahead is incremented for each page that is read ahead.

1297 /*
1298  * If we tried to read ahead some pages,
1299  * If we tried to read ahead asynchronously,
1300  *   Try to force unplug of the device in order to start an
1301  *   asynchronous read IO request.
1302  * Update the read-ahead context.
1303  * Store the length of the current read-ahead window.
1304  * Double the current max read ahead size.
1305  *   That heuristic avoid to do some large IO for files that are
1306  *   not really accessed sequentially.
1307  */
1308     if (ahead) {
1309         filp->f_ralen += ahead;
1310         filp->f_rawin += filp->f_ralen;
1311         filp->f_raend = raend + ahead + 1;
1312 
1313         filp->f_ramax += filp->f_ramax;
1314 
1315         if (filp->f_ramax > max_readahead)
1316             filp->f_ramax = max_readahead;
1317 
1318 #ifdef PROFILE_READAHEAD
1319         profile_readahead((reada_ok == 2), filp);
1320 #endif
1321     }
1322 
1323     return;
1324 }

If readahead was successful, then update the readahead fields in the struct file to mark the progress. This basically grows the readahead context but it can be reset by do_generic_file_read() if it is found that the readahead is ineffective. The sketch after this list shows how the window grows.

1309Update the f_ralen with the number of pages that were readahead in this pass
1310Update the size of the readahead window
1311Mark the end of the readahead
1313Double the current maximum-sized readahead
1315-1316Do not let the maximum sized readahead get larger than the maximum readahead defined for this block device
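
The growth of the window at lines 1313-1316 can be seen with a short simulation. This is only a sketch; the starting window of 3 pages and the maximum of 31 pages are assumptions (31 matches the usual MAX_READAHEAD in 2.4, but the real limit comes from get_max_readahead() for the block device).

#include <stdio.h>

int main(void)
{
        unsigned long max_readahead = 31;   /* assumed device maximum */
        unsigned long f_ramax = 3;          /* assumed starting window */
        int pass;

        for (pass = 1; pass <= 5; pass++) {
                f_ramax += f_ramax;             /* line 1313 */
                if (f_ramax > max_readahead)    /* lines 1315-1316 */
                        f_ramax = max_readahead;
                printf("pass %d: f_ramax = %lu pages\n", pass, f_ramax);
        }
        return 0;
}

Doubling means a genuinely sequential reader quickly reaches the device maximum, while do_generic_file_read() can shrink the window again if the accesses turn out not to be sequential.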

D.6.2  Generic File mmap()

D.6.2.1  Function: generic_file_mmap

Source: mm/filemap.c

This is the generic mmap() function used by many struct files as their struct file_operations. It is mainly responsible for ensuring the appropriate address_space functions exist and setting what VMA operations to use.

2249 int generic_file_mmap(struct file * file, 
                           struct vm_area_struct * vma)
2250 {
2251     struct address_space *mapping = 
                              file->f_dentry->d_inode->i_mapping;
2252     struct inode *inode = mapping->host;
2253 
2254     if ((vma->vm_flags & VM_SHARED) && 
             (vma->vm_flags & VM_MAYWRITE)) {
2255         if (!mapping->a_ops->writepage)
2256             return -EINVAL;
2257     }
2258     if (!mapping->a_ops->readpage)
2259         return -ENOEXEC;
2260     UPDATE_ATIME(inode);
2261     vma->vm_ops = &generic_file_vm_ops;
2262     return 0;
2263 }
2251Get the address_space that is managing the file being mapped
2252Get the struct inode for this address_space
2254-2257If the VMA is to be shared and writable, make sure an a_ops→writepage() function exists. Return -EINVAL if it does not
2258-2259Make sure an a_ops→readpage() function exists
2260Update the access time for the inode
2261Use generic_file_vm_ops for the file operations. The generic VM operations structure, defined in mm/filemap.c, only supplies filemap_nopage() (See Section D.6.4.1) as its nopage() function. No other callback is defined
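
Filesystems normally use this function directly as the mmap() entry in their struct file_operations. The snippet below is only a sketch of that pattern in the 2.4 initialiser style; myfs_file_operations is a hypothetical name, but wiring read(), write() and mmap() to the generic routines is what filesystems such as ext2 do.

#include <linux/fs.h>
#include <linux/mm.h>

/* Sketch only: myfs_file_operations is hypothetical. Handing mmap()
 * to generic_file_mmap() means faults on the mapping will be handled
 * by filemap_nopage() through generic_file_vm_ops. */
static struct file_operations myfs_file_operations = {
        read:   generic_file_read,
        write:  generic_file_write,
        mmap:   generic_file_mmap,
};
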

D.6.3  Generic File Truncation

This section covers the path where a file is being truncated. The actual system call truncate() is implemented by sys_truncate() in fs/open.c. By the time the top-level function in the VM is called (vmtruncate()), the dentry information for the file has been updated and the inode's semaphore has been acquired.

D.6.3.1  Function: vmtruncate

Source: mm/memory.c

This is the top-level VM function responsible for truncating a file. When it completes, all page table entries mapping pages that have been truncated have been unmapped and reclaimed if possible.

1042 int vmtruncate(struct inode * inode, loff_t offset)
1043 {
1044     unsigned long pgoff;
1045     struct address_space *mapping = inode->i_mapping;
1046     unsigned long limit;
1047 
1048     if (inode->i_size < offset)
1049         goto do_expand;
1050     inode->i_size = offset;
1051     spin_lock(&mapping->i_shared_lock);
1052     if (!mapping->i_mmap && !mapping->i_mmap_shared)
1053         goto out_unlock;
1054 
1055     pgoff = (offset + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
1056     if (mapping->i_mmap != NULL)
1057         vmtruncate_list(mapping->i_mmap, pgoff);
1058     if (mapping->i_mmap_shared != NULL)
1059         vmtruncate_list(mapping->i_mmap_shared, pgoff);
1060 
1061 out_unlock:
1062     spin_unlock(&mapping->i_shared_lock);
1063     truncate_inode_pages(mapping, offset);
1064     goto out_truncate;
1065 
1066 do_expand:
1067     limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
1068     if (limit != RLIM_INFINITY && offset > limit)
1069         goto out_sig;
1070     if (offset > inode->i_sb->s_maxbytes)
1071         goto out;
1072     inode->i_size = offset;
1073 
1074 out_truncate:
1075     if (inode->i_op && inode->i_op->truncate) {
1076         lock_kernel();
1077         inode->i_op->truncate(inode);
1078         unlock_kernel();
1079     }
1080     return 0;
1081 out_sig:
1082     send_sig(SIGXFSZ, current, 0);
1083 out:
1084     return -EFBIG;
1085 }
1042The parameters passed are the inode being truncated and the new offset marking the new end of the file. The old length of the file is stored in inode→i_size
1045Get the address_space responsible for the inode
1048-1049If the new file size is larger than the old size, then goto do_expand where the ulimits for the process will be checked before the file is grown
1050Here, the file is being shrunk so update inode→i_size to match
1051Lock the spinlock protecting the two lists of VMAs using this inode
1052-1053If no VMAs are mapping the inode, goto out_unlock where the pages used by the file will be reclaimed by truncate_inode_pages() (See Section D.6.3.6)
1055Calculate pgoff as the offset within the file in pages where the truncation will begin
1056-1057Truncate pages from all private mappings with vmtruncate_list() (See Section D.6.3.2)
1058-1059Truncate pages from all shared mappings
1062Unlock the spinlock protecting the VMA lists
1063Call truncate_inode_pages() (See Section D.6.3.6) to reclaim the pages if they exist in the page cache for the file
1064Goto out_truncate to call the filesystem specific truncate() function so the blocks used on disk will be freed
1066-1071If the file is being expanded, make sure that the process limits for maximum file size are not being exceeded and the hosting filesystem is able to support the new filesize
1072If the limits are fine, then update the inode's size and fall through to call the filesystem-specific truncate function which will fill the expanded filesize with zeros
1075-1079If the filesystem provides a truncate() function, then lock the kernel, call it and unlock the kernel again. Filesystems do not acquire the proper locks to prevent races between file truncation and file expansion due to writing or faulting so the big kernel lock is needed
1080Return success
1082-1084If the file size would grow to being too big, send the SIGXFSZ signal to the calling process and return -EFBIG

D.6.3.2  Function: vmtruncate_list

Source: mm/memory.c

This function cycles through all VMAs in an address_spaces list and calls zap_page_range() for the range of addresses which map a file that is being truncated.

1006 static void vmtruncate_list(struct vm_area_struct *mpnt, 
                                 unsigned long pgoff)
1007 {
1008     do {
1009         struct mm_struct *mm = mpnt->vm_mm;
1010         unsigned long start = mpnt->vm_start;
1011         unsigned long end = mpnt->vm_end;
1012         unsigned long len = end - start;
1013         unsigned long diff;
1014 
1015         /* mapping wholly truncated? */
1016         if (mpnt->vm_pgoff >= pgoff) {
1017             zap_page_range(mm, start, len);
1018             continue;
1019         }
1020 
1021         /* mapping wholly unaffected? */
1022         len = len >> PAGE_SHIFT;
1023         diff = pgoff - mpnt->vm_pgoff;
1024         if (diff >= len)
1025             continue;
1026 
1027         /* Ok, partially affected.. */
1028         start += diff << PAGE_SHIFT;
1029         len = (len - diff) << PAGE_SHIFT;
1030         zap_page_range(mm, start, len);
1031     } while ((mpnt = mpnt->vm_next_share) != NULL);
1032 }
1008-1031Loop through all VMAs in the list
1009Get the mm_struct that hosts this VMA
1010-1012Calculate the start, end and length of the VMA
1016-1019If the whole VMA is being truncated, call the function zap_page_range() (See Section D.6.3.3) with the start and length of the full VMA
1022Calculate the length of the VMA in pages
1023-1025Check if the VMA maps any of the region being truncated. If the VMA is unaffected, continue to the next VMA
1028-1029Else the VMA is being partially truncated so calculate the start and length of the region to truncate, in pages (see the sketch after this list)
1030Call zap_page_range() (See Section D.6.3.3) to unmap the affected region
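
The arithmetic that decides how much of a partially-affected VMA to zap can be illustrated with a small userspace sketch. It reproduces lines 1016-1030 with plain integers and replaces the zap_page_range() call with a printf(); the 4KiB page size and the addresses in main() are assumptions chosen only to make the numbers concrete.

#include <stdio.h>

#define PAGE_SHIFT 12   /* assume 4KiB pages */

/* Given a VMA covering [start, end) that maps the file from page
 * vm_pgoff onwards, print the address range that vmtruncate_list()
 * would pass to zap_page_range() when the file is truncated at
 * file page 'pgoff'. */
static void zap_range_for(unsigned long start, unsigned long end,
                          unsigned long vm_pgoff, unsigned long pgoff)
{
        unsigned long len = end - start;
        unsigned long diff;

        if (vm_pgoff >= pgoff) {                /* wholly truncated */
                printf("zap %#lx, length %lu\n", start, len);
                return;
        }

        len >>= PAGE_SHIFT;
        diff = pgoff - vm_pgoff;
        if (diff >= len) {                      /* wholly unaffected */
                printf("nothing to zap\n");
                return;
        }

        start += diff << PAGE_SHIFT;            /* partially affected */
        len = (len - diff) << PAGE_SHIFT;
        printf("zap %#lx, length %lu\n", start, len);
}

int main(void)
{
        /* A VMA mapping 16 pages of the file starting at file page 10,
         * with the file truncated at page 18. */
        zap_range_for(0x40000000, 0x40010000, 10, 18);
        return 0;
}

In this example the last 8 pages of the VMA, starting at 0x40008000, are reported as the range to unmap.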

D.6.3.3  Function: zap_page_range

Source: mm/memory.c

This function is the top-level pagetable-walk function which unmaps userpages in the specified range from a mm_struct.

360 void zap_page_range(struct mm_struct *mm, 
                        unsigned long address, unsigned long size)
361 {
362     mmu_gather_t *tlb;
363     pgd_t * dir;
364     unsigned long start = address, end = address + size;
365     int freed = 0;
366 
367     dir = pgd_offset(mm, address);
368 
369     /*
370      * This is a long-lived spinlock. That's fine.
371      * There's no contention, because the page table
372      * lock only protects against kswapd anyway, and
373      * even if kswapd happened to be looking at this
374      * process we _want_ it to get stuck.
375      */
376     if (address >= end)
377         BUG();
378     spin_lock(&mm->page_table_lock);
379     flush_cache_range(mm, address, end);
380     tlb = tlb_gather_mmu(mm);
381 
382     do {
383         freed += zap_pmd_range(tlb, dir, address, end - address);
384         address = (address + PGDIR_SIZE) & PGDIR_MASK;
385         dir++;
386     } while (address && (address < end));
387 
388     /* this will flush any remaining tlb entries */
389     tlb_finish_mmu(tlb, start, end);
390 
391     /*
392      * Update rss for the mm_struct (not necessarily current->mm)
393      * Notice that rss is an unsigned long.
394      */
395     if (mm->rss > freed)
396         mm->rss -= freed;
397     else
398         mm->rss = 0;
399     spin_unlock(&mm->page_table_lock);
400 }
364Calculate the start and end address for zapping
367Calculate the PGD (dir) that contains the starting address
376-377Make sure the start address is not after the end address
378Acquire the spinlock protecting the page tables. This is a very long-held lock and would normally be considered a bad idea but the comment above the block explains why it is ok in this case
379Flush the CPU cache for this range
380tlb_gather_mmu() records the MM that is being altered. Later, tlb_remove_page() will be called to unmap the PTE which stores the PTEs in a struct free_pte_ctx until the zapping is finished. This is to avoid having to constantly flush the TLB as PTEs are freed
382-386For each PMD affected by the zapping, call zap_pmd_range() until the end address has been reached. Note that tlb is passed as well for tlb_remove_page() to use later. The boundary arithmetic used to advance address is illustrated in the sketch after this list
389tlb_finish_mmu() frees all the PTEs that were unmapped by tlb_remove_page() and then flushes the TLBs. Doing the flushing this way avoids a storm of TLB flushing that would be otherwise required for each PTE unmapped
395-398Update RSS count
399Release the pagetable lock
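
The loop at lines 382-386 advances address to the start of the next PGD on each iteration, and zap_pmd_range() and zap_pte_range() clamp their ranges to the same boundaries. The sketch below shows only that arithmetic; the 4MiB PGDIR_SIZE is an assumption matching x86 without PAE, and the start address and 10MiB length are arbitrary.

#include <stdio.h>

#define PGDIR_SHIFT 22                          /* assume x86 without PAE */
#define PGDIR_SIZE  (1UL << PGDIR_SHIFT)        /* 4MiB per PGD entry */
#define PGDIR_MASK  (~(PGDIR_SIZE - 1))

int main(void)
{
        unsigned long address = 0x08049000;         /* arbitrary start */
        unsigned long end     = address + (10UL << 20); /* zap 10MiB */

        do {
                /* Advance to the next PGD boundary, as at line 384 */
                unsigned long next = (address + PGDIR_SIZE) & PGDIR_MASK;
                unsigned long stop = next < end ? next : end;

                printf("zap %#lx - %#lx under one PGD entry\n", address, stop);
                address = next;
        } while (address && address < end);
        return 0;
}
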

D.6.3.4  Function: zap_pmd_range

Source: mm/memory.c

This function is unremarkable. It steps through the PMDs that are affected by the requested range and calls zap_pte_range() for each one.

331 static inline int zap_pmd_range(mmu_gather_t *tlb, pgd_t * dir, 
                                    unsigned long address, 
        unsigned long size)
332 {
333     pmd_t * pmd;
334     unsigned long end;
335     int freed;
336 
337     if (pgd_none(*dir))
338         return 0;
339     if (pgd_bad(*dir)) {
340         pgd_ERROR(*dir);
341         pgd_clear(dir);
342         return 0;
343     }
344     pmd = pmd_offset(dir, address);
345     end = address + size;
346     if (end > ((address + PGDIR_SIZE) & PGDIR_MASK))
347         end = ((address + PGDIR_SIZE) & PGDIR_MASK);
348     freed = 0;
349     do {
350         freed += zap_pte_range(tlb, pmd, address, end - address);
351         address = (address + PMD_SIZE) & PMD_MASK; 
352         pmd++;
353     } while (address < end);
354     return freed;
355 }
337-338If no PGD exists, return
339-343If the PGD is bad, flag the error and return
344Get the starting pmd
345-347Calculate the end address of the zapping. If it is beyond the end of this PGD, then set end to the end of the PGD
349-353Step through all PMDs in this PGD. For each PMD, call zap_pte_range() (See Section D.6.3.5) to unmap the PTEs
354Return how many pages were freed

D.6.3.5  Function: zap_pte_range

Source: mm/memory.c

This function calls tlb_remove_page() for each PTE in the requested pmd within the requested address range.

294 static inline int zap_pte_range(mmu_gather_t *tlb, pmd_t * pmd, 
                                    unsigned long address, 
        unsigned long size)
295 {
296     unsigned long offset;
297     pte_t * ptep;
298     int freed = 0;
299 
300     if (pmd_none(*pmd))
301         return 0;
302     if (pmd_bad(*pmd)) {
303         pmd_ERROR(*pmd);
304         pmd_clear(pmd);
305         return 0;
306     }
307     ptep = pte_offset(pmd, address);
308     offset = address & ~PMD_MASK;
309     if (offset + size > PMD_SIZE)
310         size = PMD_SIZE - offset;
311     size &= PAGE_MASK;
312     for (offset=0; offset < size; ptep++, offset += PAGE_SIZE) {
313         pte_t pte = *ptep;
314         if (pte_none(pte))
315             continue;
316         if (pte_present(pte)) {
317             struct page *page = pte_page(pte);
318             if (VALID_PAGE(page) && !PageReserved(page))
319                 freed ++;
320             /* This will eventually call __free_pte on the pte. */
321             tlb_remove_page(tlb, ptep, address + offset);
322         } else {
323             free_swap_and_cache(pte_to_swp_entry(pte));
324             pte_clear(ptep);
325         }
326     }
327 
328     return freed;
329 }
300-301If the PMD does not exist, return
302-306If the PMD is bad, flag the error and return
307Get the starting PTE offset
308Calculate offset as the offset of address within its PMD
309If the size of the region to unmap is past the PMD boundary, fix the size so that only this PMD will be affected
311Align size to a page boundary
312-326Step through all PTEs in the region
314-315If no PTE exists, continue to the next one
316-322If the PTE is present, then call tlb_remove_page() to unmap the page. If the page is reclaimable, increment the freed count
322-325If the PTE is in use but the page is paged out or in the swap cache, then free the swap slot and the page with free_swap_and_cache() (See Section K.3.2.3). It is possible that a page reclaimed from the swap cache goes unaccounted for here but it is not of paramount importance
328Return the number of pages that were freed

D.6.3.6  Function: truncate_inode_pages

Source: mm/filemap.c

This is the top-level function responsible for truncating all pages from the page cache that occur after lstart in a mapping.

327 void truncate_inode_pages(struct address_space * mapping, 
                              loff_t lstart) 
328 {
329     unsigned long start = (lstart + PAGE_CACHE_SIZE - 1) >> 
                                                    PAGE_CACHE_SHIFT;
330     unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
331     int unlocked;
332 
333     spin_lock(&pagecache_lock);
334     do {
335         unlocked = truncate_list_pages(&mapping->clean_pages, 
                                           start, &partial);
336         unlocked |= truncate_list_pages(&mapping->dirty_pages, 
                                            start, &partial);
337         unlocked |= truncate_list_pages(&mapping->locked_pages, 
                                            start, &partial);
338     } while (unlocked);
339     /* Traversed all three lists without dropping the lock */
340     spin_unlock(&pagecache_lock);
341 }
329Calculate where to start the truncation as an index in pages
330Calculate partial as the offset within the last page if it is being partially truncated (a worked example is given after this list)
333Lock the page cache
334This will loop until none of the calls to truncate_list_pages() return that a page was found that should have been reclaimed
335Use truncate_list_pages() (See Section D.6.3.7) to truncate all pages in the clean_pages list
336Similarly, truncate pages in the dirty_pages list
337Similarly, truncate pages in the locked_pages list
340Unlock the page cache
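
A concrete example of the start and partial calculations may help. The sketch below assumes a 4KiB PAGE_CACHE_SIZE and a truncation at byte 10000: every page from index 3 onwards is dropped completely and the tail of page 2 is zeroed.

#include <stdio.h>

#define PAGE_CACHE_SHIFT 12                     /* assume 4KiB pages */
#define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)

int main(void)
{
        unsigned long long lstart = 10000;      /* new end of the file */

        /* Same calculations as lines 329-330 */
        unsigned long start = (unsigned long)
                ((lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT);
        unsigned long partial = (unsigned long)
                (lstart & (PAGE_CACHE_SIZE - 1));

        printf("pages with index >= %lu are truncated completely\n", start);
        if (partial)
                printf("page %lu keeps its first %lu bytes; "
                       "the remaining %lu are zeroed\n",
                       start - 1, partial, PAGE_CACHE_SIZE - partial);
        return 0;
}
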

D.6.3.7  Function: truncate_list_pages

Source: mm/filemap.c

This function searches the requested list (head) which is part of an address_space. If pages are found after start, they will be truncated.

259 static int truncate_list_pages(struct list_head *head, 
                                   unsigned long start, 
                                   unsigned *partial)
260 {
261     struct list_head *curr;
262     struct page * page;
263     int unlocked = 0;
264 
265  restart:
266     curr = head->prev;
267     while (curr != head) {
268         unsigned long offset;
269 
270         page = list_entry(curr, struct page, list);
271         offset = page->index;
272 
273         /* Is one of the pages to truncate? */
274         if ((offset >= start) || 
                (*partial && (offset + 1) == start)) {
275             int failed;
276 
277             page_cache_get(page);
278             failed = TryLockPage(page);
279 
280             list_del(head);
281             if (!failed)
282                 /* Restart after this page */
283                 list_add_tail(head, curr);
284             else
285                 /* Restart on this page */
286                 list_add(head, curr);
287 
288             spin_unlock(&pagecache_lock);
289             unlocked = 1;
290 
291             if (!failed) {
292                 if (*partial && (offset + 1) == start) {
293                     truncate_partial_page(page, *partial);
294                     *partial = 0;
295                 } else 
296                     truncate_complete_page(page);
297 
298                 UnlockPage(page);
299             } else
300                 wait_on_page(page);
301 
302             page_cache_release(page);
303 
304             if (current->need_resched) {
305                 __set_current_state(TASK_RUNNING);
306                 schedule();
307             }
308 
309             spin_lock(&pagecache_lock);
310             goto restart;
311         }
312         curr = curr->prev;
313     }
314     return unlocked;
315 }
266-267Record the start of the list and loop until the full list has been scanned
270-271Get the page for this entry and what offset within the file it represents
274If the current page is after start or is a page that is to be partially truncated, then truncate this page, else move to the next one
277-278Take a reference to the page and try to lock it
280Remove the page from the list
281-283If we locked the page, add it back to the list where it will be skipped over on the next iteration of the loop
284-286Else add it back where it will be found again immediately. Later in the function, wait_on_page() is called until the page is unlocked
288Release the pagecache lock
289Set unlocked to 1 to indicate a page was found that had to be truncated. This will force truncate_inode_pages() to call this function again to make sure there are no pages left behind. This looks like an oversight and was intended to have the function recalled only if a locked page was found but, the way it is implemented, it will be called whether the page was locked or not
291-299If we locked the page, then truncate it
292-294If the page is to be partially truncated, call truncate_partial_page() (See Section D.6.3.10) with the offset within the page where the truncation begins (partial)
296Else call truncate_complete_page() (See Section D.6.3.8) to truncate the whole page
298Unlock the page
300If the page locking failed, call wait_on_page() to wait until the page can be locked
302Release the reference to the page. If there are no more mappings for the page, it will be reclaimed
304-307Check if the process should call schedule() before continuing. This is to prevent a truncating process from hogging the CPU
309Reacquire the spinlock and restart the scanning for pages to reclaim
312The current page should not be reclaimed so move to the next page
314Return 1 if a page was found in the list that had to be truncated

D.6.3.8  Function: truncate_complete_page

Source: mm/filemap.c

239 static void truncate_complete_page(struct page *page)
240 {
241     /* Leave it on the LRU if it gets converted into 
         * anonymous buffers */
242     if (!page->buffers || do_flushpage(page, 0))
243         lru_cache_del(page);
244 
245     /*
246      * We remove the page from the page cache _after_ we have
247      * destroyed all buffer-cache references to it. Otherwise some
248      * other process might think this inode page is not in the
249      * page cache and creates a buffer-cache alias to it causing
250      * all sorts of fun problems ...  
251      */
252     ClearPageDirty(page);
253     ClearPageUptodate(page);
254     remove_inode_page(page);
255     page_cache_release(page);
256 }
242If the page has buffers, call do_flushpage() (See Section D.6.3.9) to flush all buffers associated with the page. The comments in the following lines describe the problem concisely
243Delete the page from the LRU
252-253Clear the dirty and uptodate flags for the page
254Call remove_inode_page() (See Section J.1.2.1) to delete the page from the page cache
255Drop the reference to the page. The page will be reclaimed later when truncate_list_pages() drops its own private reference to it

D.6.3.9  Function: do_flushpage

Source: mm/filemap.c

This function is responsible for flushing all buffers associated with a page.

223 static int do_flushpage(struct page *page, unsigned long offset)
224 {
225     int (*flushpage) (struct page *, unsigned long);
226     flushpage = page->mapping->a_ops->flushpage;
227     if (flushpage)
228         return (*flushpage)(page, offset);
229     return block_flushpage(page, offset);
230 }
226-228If the page→mapping provides a flushpage() function, call it
229Else call block_flushpage() which is the generic function for flushing buffers associated with a page

D.6.3.10  Function: truncate_partial_page

Source: mm/filemap.c

This function partially truncates a page by zeroing out the higher bytes no longer in use and flushing any associated buffers.

232 static inline void truncate_partial_page(struct page *page, 
                                             unsigned partial)
233 {
234     memclear_highpage_flush(page, partial, PAGE_CACHE_SIZE-partial);
235     if (page->buffers)
236         do_flushpage(page, partial);
237 }
234memclear_highpage_flush() fills an address range with zeros. In this case, it will zero from partial to the end of the page
235-236If the page has any associated buffers, flush any buffers containing data in the truncated region

D.6.4  Reading Pages for the Page Cache

D.6.4.1  Function: filemap_nopage

Source: mm/filemap.c

This is the generic nopage() function used by many VMAs. This loops around itself with a large number of goto's which can be difficult to trace but there is nothing novel here. It is principally responsible for fetching the faulting page from the page cache or reading it from disk. If appropriate it will also perform file read-ahead.

1994 struct page * filemap_nopage(struct vm_area_struct * area, 
                                  unsigned long address, 
                                  int unused)
1995 {
1996     int error;
1997     struct file *file = area->vm_file;
1998     struct address_space *mapping = 
                              file->f_dentry->d_inode->i_mapping;
1999     struct inode *inode = mapping->host;
2000     struct page *page, **hash;
2001     unsigned long size, pgoff, endoff;
2002 
2003     pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) + 
                 area->vm_pgoff;
2004     endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT) + 
                 area->vm_pgoff;
2005 

This block acquires the struct file, address_space and inode important for this page fault. It then calculates the starting offset within the file needed for this fault and the offset that corresponds to the end of this VMA. The offset is the end of the VMA instead of the end of the page in case file read-ahead is performed.

1997-1999Acquire the struct file, address_space and inode required for this fault
2003Calculate pgoff which is the offset within the file corresponding to the beginning of the fault
2004Calculate the offset within the file corresponding to the end of the VMA
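
The two offset calculations at lines 2003-2004 can be checked with plain numbers. In the sketch below, a 4KiB page size is assumed along with a 16-page VMA at 0x40000000 that maps the file from page 5; a fault at 0x40003123 then corresponds to file page 8 and the end of the VMA corresponds to file page 21.

#include <stdio.h>

#define PAGE_CACHE_SHIFT 12             /* assume 4KiB pages */

int main(void)
{
        unsigned long vm_start = 0x40000000;    /* start of the VMA */
        unsigned long vm_end   = 0x40010000;    /* 16 pages long */
        unsigned long vm_pgoff = 5;             /* maps file page 5 onwards */
        unsigned long address  = 0x40003123;    /* faulting address */

        /* Same calculations as lines 2003-2004 */
        unsigned long pgoff  = ((address - vm_start) >> PAGE_CACHE_SHIFT)
                               + vm_pgoff;
        unsigned long endoff = ((vm_end - vm_start) >> PAGE_CACHE_SHIFT)
                               + vm_pgoff;

        printf("fault is in file page %lu; the end of the VMA "
               "corresponds to file page %lu\n", pgoff, endoff);
        return 0;
}
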
2006 retry_all:
2007     /*
2008      * An external ptracer can access pages that normally aren't
2009      * accessible..
2010      */
2011     size = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
2012     if ((pgoff >= size) && (area->vm_mm == current->mm))
2013         return NULL;
2014 
2015     /* The "size" of the file, as far as mmap is concerned, isn't 
            bigger than the mapping */
2016     if (size > endoff)
2017         size = endoff;
2018 
2019     /*
2020      * Do we have something in the page cache already?
2021      */
2022     hash = page_hash(mapping, pgoff);
2023 retry_find:
2024     page = __find_get_page(mapping, pgoff, hash);
2025     if (!page)
2026         goto no_cached_page;
2027 
2028     /*
2029      * Ok, found a page in the page cache, now we need to check
2030      * that it's up-to-date.
2031      */
2032     if (!Page_Uptodate(page))
2033         goto page_not_uptodate;
2011Calculate the size of the file in pages
2012If the faulting pgoff is beyond the end of the file and this is not a tracing process, return NULL
2016-2017If the VMA maps beyond the end of the file, then set the size of the file to be the end of the mapping
2022-2024Search for the page in the page cache
2025-2026If it does not exist, goto no_cached_page where page_cache_read() will be called to read the page from backing storage
2032-2033If the page is not up-to-date, goto page_not_uptodate where the page will either be declared invalid or else the data in the page updated
2035 success:
2036     /*
2037      * Try read-ahead for sequential areas.
2038      */
2039     if (VM_SequentialReadHint(area))
2040         nopage_sequential_readahead(area, pgoff, size);
2041 
2042     /*
2043      * Found the page and have a reference on it, need to check sharing
2044      * and possibly copy it over to another page..
2045      */
2046     mark_page_accessed(page);
2047     flush_page_to_ram(page);
2048     return page;
2049 
2039-2040If this mapping specified the VM_SEQ_READ hint, then the pages around the current fault will be pre-faulted with nopage_sequential_readahead()
2046Mark the faulted-in page as accessed so it will be moved to the active_list
2047As the page is about to be installed into a process page table, call flush_page_to_ram() so that recent stores by the kernel to the page will definitely be visible to userspace
2048Return the faulted-in page
2050 no_cached_page:
2051     /*
2052      * If the requested offset is within our file, try to read
2053      * a whole cluster of pages at once.
2054      *
2055      * Otherwise, we're off the end of a privately mapped file,
2056      * so we need to map a zero page.
2057      */
2058     if ((pgoff < size) && !VM_RandomReadHint(area))
2059         error = read_cluster_nonblocking(file, pgoff, size);
2060     else
2061         error = page_cache_read(file, pgoff);
2062 
2063     /*
2064      * The page we want has now been added to the page cache.
2065      * In the unlikely event that someone removed it in the
2066      * meantime, we'll just come back here and read it again.
2067      */
2068     if (error >= 0)
2069         goto retry_find;
2070 
2071     /*
2072      * An error return from page_cache_read can result if the
2073      * system is low on memory, or a problem occurs while trying
2074      * to schedule I/O.
2075      */
2076     if (error == -ENOMEM)
2077         return NOPAGE_OOM;
2078     return NULL;
2058-2059If the end of the file has not been reached and the random-read hint has not been specified, call read_cluster_nonblocking() to pre-fault in just a few pages near the faulting page
2061Else, the file is being accessed randomly, so just call page_cache_read() (See Section D.6.4.2) to read in just the faulting page
2068-2069If no error occurred, goto retry_find at line 2023 which will check to make sure the page is in the page cache before returning
2076-2077If the error was due to being out of memory, return that so the fault handler can act accordingly
2078Else return NULL to indicate that a non-existent page was faulted, resulting in a SIGBUS signal being sent to the faulting process
2080 page_not_uptodate:
2081     lock_page(page);
2082 
2083     /* Did it get unhashed while we waited for it? */
2084     if (!page->mapping) {
2085         UnlockPage(page);
2086         page_cache_release(page);
2087         goto retry_all;
2088     }
2089 
2090     /* Did somebody else get it up-to-date? */
2091     if (Page_Uptodate(page)) {
2092         UnlockPage(page);
2093         goto success;
2094     }
2095 
2096     if (!mapping->a_ops->readpage(file, page)) {
2097         wait_on_page(page);
2098         if (Page_Uptodate(page))
2099             goto success;
2100     }

In this block, the page was found but it was not up-to-date so the reasons for the page not being up to date are checked. If it looks ok, the appropriate readpage() function is called to resync the page.

2081Lock the page for IO
2084-2088If the page was removed from the mapping (possible because of a file truncation) and is now anonymous, then goto retry_all which will try and fault in the page again
2090-2094Check the Page_Uptodate flag again in case the page was updated just before we locked it for IO
2096Call the address_space→readpage() function to schedule the data to be read from disk
2097Wait for the IO to complete and if it is now up-to-date, goto success to return the page. If the readpage() function failed, fall through to the error recovery path
2101 
2102     /*
2103      * Umm, take care of errors if the page isn't up-to-date.
2104      * Try to re-read it _once_. We do this synchronously,
2105      * because there really aren't any performance issues here
2106      * and we need to check for errors.
2107      */
2108     lock_page(page);
2109 
2110     /* Somebody truncated the page on us? */
2111     if (!page->mapping) {
2112         UnlockPage(page);
2113         page_cache_release(page);
2114         goto retry_all;
2115     }
2116 
2117     /* Somebody else successfully read it in? */
2118     if (Page_Uptodate(page)) {
2119         UnlockPage(page);
2120         goto success;
2121     }
2122     ClearPageError(page);
2123     if (!mapping->a_ops->readpage(file, page)) {
2124         wait_on_page(page);
2125         if (Page_Uptodate(page))
2126             goto success;
2127     }
2128 
2129     /*
2130      * Things didn't work out. Return zero to tell the
2131      * mm layer so, possibly freeing the page cache page first.
2132      */
2133     page_cache_release(page);
2134     return NULL;
2135 }

In this path, the page is not up-to-date due to some IO error. A second attempt is made to read the page data and if it fails, return.

2110-2127This is almost identical to the previous block. The only difference is that ClearPageError() is called to clear the error caused by the previous IO
2133If it still failed, release the reference to the page because it is useless
2134Return NULL because the fault failed

D.6.4.2  Function: page_cache_read

Source: mm/filemap.c

This function adds the page corresponding to the offset within the file to the page cache if it does not exist there already.

702 static int page_cache_read(struct file * file, 
                               unsigned long offset)
703 {
704     struct address_space *mapping = 
                              file->f_dentry->d_inode->i_mapping;
705     struct page **hash = page_hash(mapping, offset);
706     struct page *page; 
707 
708     spin_lock(&pagecache_lock);
709     page = __find_page_nolock(mapping, offset, *hash);
710     spin_unlock(&pagecache_lock);
711     if (page)
712         return 0;
713 
714     page = page_cache_alloc(mapping);
715     if (!page)
716         return -ENOMEM;
717 
718     if (!add_to_page_cache_unique(page, mapping, offset, hash)) {
719         int error = mapping->a_ops->readpage(file, page);
720         page_cache_release(page);
721         return error;
722     }
723     /*
724      * We arrive here in the unlikely event that someone 
725      * raced with us and added our page to the cache first.
726      */
727     page_cache_release(page);
728     return 0;
729 }
704Acquire the address_space mapping managing the file
705The page cache is a hash table and page_hash() returns the first page in the bucket for this mapping and offset
708-709Search the page cache with __find_page_nolock() (See Section J.1.4.3). This basically will traverse the list starting at hash to see if the requested page can be found
711-712If the page is already in the page cache, return
714Allocate a new page for insertion into the page cache. page_cache_alloc() will allocate a page from the buddy allocator using GFP mask information contained in mapping
718Insert the page into the page cache with add_to_page_cache_unique() (See Section J.1.1.2). This function is used because a second check needs to be made to make sure the page was not inserted into the page cache while the pagecache_lock spinlock was not acquired
719If the allocated page was inserted into the page cache, it needs to be populated with data so the readpage() function for the mapping is called. This schedules the IO to take place and the page will be unlocked when the IO completes
720The path in add_to_page_cache_unique() (See Section J.1.1.2) takes an extra reference to the page being added to the page cache which is dropped here. The page will not be freed
727If another process added the page to the page cache, it is released here by page_cache_release() as there will be no users of the page

D.6.5  File Readahead for nopage()

D.6.5.1  Function: nopage_sequential_readahead

Source: mm/filemap.c

This function is only called by filemap_nopage() when the VM_SEQ_READ flag has been specified in the VMA. When half of the current readahead-window has been faulted in, the next readahead window is scheduled for IO and pages from the previous window are freed.

1936 static void nopage_sequential_readahead(
         struct vm_area_struct * vma,
1937     unsigned long pgoff, unsigned long filesize)
1938 {
1939     unsigned long ra_window;
1940 
1941     ra_window = get_max_readahead(vma->vm_file->f_dentry->d_inode);
1942     ra_window = CLUSTER_OFFSET(ra_window + CLUSTER_PAGES - 1);
1943 
1944     /* vm_raend is zero if we haven't read ahead 
          * in this area yet.  */
1945     if (vma->vm_raend == 0)
1946         vma->vm_raend = vma->vm_pgoff + ra_window;
1947 
1941get_max_readahead() returns the maximum sized readahead window for the block device the specified inode resides on
1942CLUSTER_PAGES is the number of pages that are paged-in or paged-out in bulk. The macro CLUSTER_OFFSET() will align the readahead window to a cluster boundary
1945-1946If read-ahead has not occurred yet, set the end of the read-ahead window (vm_raend)
1948     /*
1949      * If we've just faulted the page half-way through our window,
1950      * then schedule reads for the next window, and release the
1951      * pages in the previous window.
1952      */
1953     if ((pgoff + (ra_window >> 1)) == vma->vm_raend) {
1954         unsigned long start = vma->vm_pgoff + vma->vm_raend;
1955         unsigned long end = start + ra_window;
1956 
1957         if (end > ((vma->vm_end >> PAGE_SHIFT) + vma->vm_pgoff))
1958             end = (vma->vm_end >> PAGE_SHIFT) + vma->vm_pgoff;
1959         if (start > end)
1960             return;
1961 
1962         while ((start < end) && (start < filesize)) {
1963             if (read_cluster_nonblocking(vma->vm_file,
1964                             start, filesize) < 0)
1965                 break;
1966             start += CLUSTER_PAGES;
1967         }
1968         run_task_queue(&tq_disk);
1969 
1970         /* if we're far enough past the beginning of this area,
1971            recycle pages that are in the previous window. */
1972         if (vma->vm_raend > 
                              (vma->vm_pgoff + ra_window + ra_window)) {
1973             unsigned long window = ra_window << PAGE_SHIFT;
1974 
1975             end = vma->vm_start + (vma->vm_raend << PAGE_SHIFT);
1976             end -= window + window;
1977             filemap_sync(vma, end - window, window, MS_INVALIDATE);
1978         }
1979 
1980         vma->vm_raend += ra_window;
1981     }
1982 
1983     return;
1984 }
1953If the fault has occurred half-way through the read-ahead window, then schedule the next readahead window to be read in from disk and free the pages for the first half of the current window as they are presumably not required any more. The sketch after this list shows when this condition triggers
1954-1955Calculate the start and end of the next readahead window as we are about to schedule it for IO
1957If the end of the readahead window is after the end of the VMA, then set end to the end of the VMA
1959-1960If we are at the end of the mapping, just return as there is no more readahead to perform
1962-1967Schedule the next readahead window to be paged in by calling read_cluster_nonblocking()(See Section D.6.5.2)
1968Call run_task_queue() to start the IO
1972-1978Recycle the pages in the previous read-ahead window with filemap_sync() as they are no longer required
1980Update where the end of the readahead window is
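
The half-window trigger at line 1953 fires when the faulting page is exactly ra_window/2 pages short of vm_raend. The sketch below simulates a purely sequential fault pattern; the 8-page window and a VMA that maps the file from page 0 are assumptions used only to show where new windows get scheduled.

#include <stdio.h>

int main(void)
{
        unsigned long ra_window = 8;    /* assumed readahead window */
        unsigned long vm_pgoff  = 0;    /* VMA maps the file from page 0 */
        unsigned long vm_raend  = 0;
        unsigned long pgoff;

        for (pgoff = 0; pgoff < 24; pgoff++) {
                if (vm_raend == 0)                      /* lines 1945-1946 */
                        vm_raend = vm_pgoff + ra_window;

                if (pgoff + (ra_window >> 1) == vm_raend) { /* line 1953 */
                        unsigned long start = vm_pgoff + vm_raend;

                        printf("fault on page %lu: schedule readahead of "
                               "pages %lu-%lu\n",
                               pgoff, start, start + ra_window - 1);
                        vm_raend += ra_window;          /* line 1980 */
                }
        }
        return 0;
}
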

D.6.5.2  Function: read_cluster_nonblocking

Source: mm/filemap.c

737 static int read_cluster_nonblocking(struct file * file, 
                                        unsigned long offset,
738     unsigned long filesize)
739 {
740     unsigned long pages = CLUSTER_PAGES;
741 
742     offset = CLUSTER_OFFSET(offset);
743     while ((pages-- > 0) && (offset < filesize)) {
744         int error = page_cache_read(file, offset);
745         if (error < 0)
746             return error;
747         offset ++;
748     }
749 
750     return 0;
751 }
740CLUSTER_PAGES will be 4 pages in low memory systems and 8 pages in larger ones. This means that on an x86 with ample memory, 32KiB will be read in one cluster
742CLUSTER_OFFSET() aligns the offset down to a cluster boundary (see the sketch after this list)
743-748Read the full cluster into the page cache by calling page_cache_read() (See Section D.6.4.2) for each page in the cluster
745-746If an error occurs during read-ahead, return the error
750Return success
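
The cluster arithmetic is easy to verify. In 2.4, CLUSTER_PAGES is defined as 1 << page_cluster and CLUSTER_OFFSET() rounds an offset down to a multiple of it; the sketch below assumes page_cluster is 3, the larger-system case described above, and shows which pages would be read for an arbitrary fault.

#include <stdio.h>

/* Assumption: page_cluster is 3, giving 8-page clusters. */
#define page_cluster      3
#define CLUSTER_PAGES     (1UL << page_cluster)
#define CLUSTER_OFFSET(x) (((x) >> page_cluster) << page_cluster)

int main(void)
{
        unsigned long offset = 21;      /* faulting page within the file */
        unsigned long start  = CLUSTER_OFFSET(offset);  /* line 742 */

        printf("a fault on file page %lu reads pages %lu-%lu\n",
               offset, start, start + CLUSTER_PAGES - 1);
        return 0;
}
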

D.6.6  Swap Related Read-Ahead

D.6.6.1  Function: swapin_readahead

Source: mm/memory.c

This function will fault in a number of pages after the current entry. It will stop when either CLUSTER_PAGES have been swapped in or an unused swap entry is found.

1093 void swapin_readahead(swp_entry_t entry)
1094 {
1095     int i, num;
1096     struct page *new_page;
1097     unsigned long offset;
1098 
1099     /*
1100      * Get the number of handles we should do readahead io to.
1101      */
1102     num = valid_swaphandles(entry, &offset);
1103     for (i = 0; i < num; offset++, i++) {
1104         /* Ok, do the async read-ahead now */
1105         new_page = read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry), 
                                                        offset));
1106         if (!new_page)
1107             break;
1108         page_cache_release(new_page);
1109     }
1110     return;
1111 }
1102valid_swaphandles() is what determines how many pages should be swapped in. It will stop at the first empty entry or when CLUSTER_PAGES is reached
1103-1109Swap in the pages
1105Attempt to swap the page into the swap cache with read_swap_cache_async() (See Section K.3.1.1)
1106-1107If the page could not be paged in, break and return
1108Drop the reference to the page that read_swap_cache_async() takes
1110Return

D.6.6.2  Function: valid_swaphandles

Source: mm/swapfile.c

This function determines how many pages should be readahead from swap starting from offset. It will read ahead up to the next unused swap slot but, at most, it will return CLUSTER_PAGES.

1238 int valid_swaphandles(swp_entry_t entry, unsigned long *offset)
1239 {
1240     int ret = 0, i = 1 << page_cluster;
1241     unsigned long toff;
1242     struct swap_info_struct *swapdev = SWP_TYPE(entry) + swap_info;
1243 
1244     if (!page_cluster)      /* no readahead */
1245         return 0;
1246     toff = (SWP_OFFSET(entry) >> page_cluster) << page_cluster;
1247     if (!toff)          /* first page is swap header */
1248         toff++, i--;
1249     *offset = toff;
1250 
1251     swap_device_lock(swapdev);
1252     do {
1253         /* Don't read-ahead past the end of the swap area */
1254         if (toff >= swapdev->max)
1255             break;
1256         /* Don't read in free or bad pages */
1257         if (!swapdev->swap_map[toff])
1258             break;
1259         if (swapdev->swap_map[toff] == SWAP_MAP_BAD)
1260             break;
1261         toff++;
1262         ret++;
1263     } while (--i);
1264     swap_device_unlock(swapdev);
1265     return ret;
1266 }
1240i is set to CLUSTER_PAGES which is the equivalent of the bitshift shown here
1242Get the swap_info_struct that contains this entry
1244-1245If readahead has been disabled, return
1246Calculate toff to be entry rounded down to the nearest CLUSTER_PAGES-sized boundary
1247-1248If toff is 0, move it to 1 as the first page contains information about the swap area
1251Lock the swap device as we are about to scan it
1252-1263Loop at most i, which is initialised to CLUSTER_PAGES, times
1254-1255If the end of the swap area is reached, then that is as far as can be readahead
1257-1258If an unused entry is reached, just return as it is as far as we want to readahead
1259-1260Likewise, return if a bad entry is discovered
1261Move to the next slot
1262Increment the number of pages to be readahead
1264Unlock the swap device
1265Return the number of pages which should be readahead

