Xiangyan Sun edited this page Sep 11, 2015 · 1 revision

Memory management is one of the more complicated components of flinux. Because many memory operations in Windows have semantics that are incompatible with their Linux equivalents, memory management system calls cannot be mapped directly to their Windows counterparts, and we have to implement our own emulation layer on top of Windows.

Differences between Linux and Windows semantics

  • mmap() allows an allocation granularity of 4KB, but Windows only allows a 64KB allocation granularity; that is, the starting address of any allocation must be aligned to 64KB.
  • Mapped memory regions can be partially unmapped in Linux, but in Windows they can only be unmapped as a whole.
  • mmap() on an existing memory region automatically replaces it, but in Windows the equivalent mapping call fails.
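
The granularity mismatch in the list above comes down to simple address arithmetic. The following is a minimal sketch of that arithmetic, assuming 32-bit addresses; the macro and function names are hypothetical and are not taken from the flinux source.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE        0x1000u   /* 4KB: granularity expected by Linux mmap() */
#define BLOCK_SIZE       0x10000u  /* 64KB: Windows allocation granularity */
#define PAGES_PER_BLOCK  (BLOCK_SIZE / PAGE_SIZE)

/* Index of the 4KB page that contains addr. */
static uint32_t page_of(uint32_t addr)  { return addr / PAGE_SIZE; }

/* Index of the 64KB block that contains addr. */
static uint32_t block_of(uint32_t addr) { return addr / BLOCK_SIZE; }

/* Start address of the 64KB block that contains addr. */
static uint32_t block_base(uint32_t addr) { return addr & ~(BLOCK_SIZE - 1); }

/* Number of 64KB blocks touched by the byte range [addr, addr + len). */
static uint32_t blocks_spanned(uint32_t addr, uint32_t len)
{
    assert(len > 0);
    return block_of(addr + len - 1) - block_of(addr) + 1;
}
```

For example, an 8KB mapping that starts at 0x1F000 is page-aligned for Linux but straddles two Windows blocks, so an emulation layer has to deal with both of them.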

Map entry

A map entry (struct map_entry) is an item describing the properties of an allocated memory range. It consists of the following fields:

  • start_page: The first page of the memory region.
  • end_page: The last page of the memory region.
  • prot: The Linux memory protection flags of the memory region. Any combination of PROT_READ, PROT_WRITE, and PROT_EXEC.
  • f: The file mapped in this memory range. NULL if it is an anonymous mapping.
  • offset_pages: Which page in the file the first page of this memory range maps to.
  • flags: Additional flags for special memory ranges, like shared memory mappings.

A map entry represents a contiguous memory region in which all of these properties are the same. If, for example, mprotect() is called on half of a map entry so that the two parts would have different protection flags, split_map_entry() is called to split the entry into two entries.
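
As an illustration of the fields and the split described above, here is a minimal sketch. The field names follow the list above, but the field types, the function name split_entry_sketch(), and its signature are assumptions; the real split_map_entry() also has to update the VAD tree, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct file;  /* opaque here; stands in for the mapped file object */

struct map_entry {
    uint32_t start_page;    /* first page of the region */
    uint32_t end_page;      /* last page of the region (inclusive) */
    int prot;               /* PROT_* flags */
    struct file *f;         /* NULL for an anonymous mapping */
    uint32_t offset_pages;  /* file page that start_page maps to */
    int flags;              /* additional flags */
};

/*
 * Hypothetical split: afterwards e covers [start_page, split_page - 1]
 * and the returned entry covers [split_page, end_page].
 * Returns NULL if allocation fails, leaving e unchanged.
 */
static struct map_entry *split_entry_sketch(struct map_entry *e, uint32_t split_page)
{
    assert(split_page > e->start_page && split_page <= e->end_page);
    struct map_entry *tail = malloc(sizeof *tail);
    if (tail == NULL)
        return NULL;
    *tail = *e;  /* both halves start with identical properties */
    tail->start_page = split_page;
    if (e->f != NULL)  /* keep the file offset in step with the new start */
        tail->offset_pages = e->offset_pages + (split_page - e->start_page);
    e->end_page = split_page - 1;
    return tail;
}
```

After the split, a caller such as mprotect() can change prot on one of the two entries without affecting the other.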

All map entries are stored in a so-called virtual address descriptor (VAD) tree, which flinux implements as a balanced binary search tree (a red-black tree) to enable O(log N) lookups and modifications.

Copy on Write paging and fork()

The way flinux implements fork() is unusual. It creates an anonymous section object (a file mapping object in Win32 terms) for each 64KB-aligned memory region. All memory management functions operate on these mapped sections to emulate a 4KB allocation granularity.

When forking, all section objects are mapped into the child instead of being copied, and the protection flags of all parent and child memory regions are changed to read-only. When a write occurs, it raises an access violation exception; the exception handler then calls mm_handle_page_fault() to duplicate the section and remap it read/write.
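
The share-then-duplicate logic can be illustrated with a small portable simulation. This is only a sketch of the idea: a reference-counted heap buffer stands in for a Windows section object, an explicit writable flag stands in for page protection, and every name below is hypothetical rather than taken from flinux.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 0x10000u  /* one 64KB block per simulated section */

struct section { unsigned refs; unsigned char data[BLOCK_SIZE]; };
struct mapping { struct section *sec; int writable; };

/* Create a fresh, privately owned, writable mapping. */
static struct mapping map_new(void)
{
    struct mapping m = { calloc(1, sizeof(struct section)), 1 };
    assert(m.sec != NULL);
    m.sec->refs = 1;
    return m;
}

/* "fork": the child shares the parent's section; both become read-only. */
static struct mapping map_fork(struct mapping *parent)
{
    struct mapping child = { parent->sec, 0 };
    parent->sec->refs++;
    parent->writable = 0;
    return child;
}

/* Stand-in for the write-fault path: duplicate if shared, then allow writes. */
static void handle_write_fault(struct mapping *m)
{
    if (m->sec->refs > 1) {
        struct section *copy = malloc(sizeof *copy);
        assert(copy != NULL);
        memcpy(copy->data, m->sec->data, BLOCK_SIZE);
        copy->refs = 1;
        m->sec->refs--;
        m->sec = copy;
    }
    m->writable = 1;
}

/* A write that "faults" first when the mapping is read-only. */
static void map_write(struct mapping *m, size_t off, unsigned char v)
{
    assert(off < BLOCK_SIZE);
    if (!m->writable)
        handle_write_fault(m);
    m->sec->data[off] = v;
}
```

In this model the first writer after a fork pays for the copy, and the last remaining owner of a section simply regains write access without copying.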

On demand paging

glibc typically pre-allocates a large memory region (like 128MB) before use. In Linux this is not a problem, since the OS supports on-demand paging and these pages only consume memory when they are actually used. In flinux this is mostly okay, except that i) we may create a large number of totally unused section objects, which wastes system resources, and ii) mapping a large number of sections in fork() slows it down.

To solve this problem, user-space on-demand paging is implemented. It is somewhat similar to CoW handling. In mmap(), we only create the map entries and don't actually map them. When a memory region is accessed for the first time, an exception occurs and mm_handle_page_fault() loads the pages and continues program execution.
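
The reserve-now, commit-on-first-touch behaviour can be modelled in portable C. The sketch below is an assumption-laden simulation: an explicit accessor function plays the role of the faulting access plus mm_handle_page_fault(), and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 0x1000u
#define MAX_PAGES 64u

struct lazy_region {
    size_t npages;                    /* pages reserved by the "mmap()" */
    size_t committed;                 /* pages actually backed by memory */
    unsigned char *pages[MAX_PAGES];  /* NULL until first touched */
};

/* "mmap()": record the reservation only; no memory is committed yet. */
static void lazy_reserve(struct lazy_region *r, size_t npages)
{
    assert(npages <= MAX_PAGES);
    r->npages = npages;
    r->committed = 0;
    for (size_t i = 0; i < MAX_PAGES; i++)
        r->pages[i] = NULL;
}

/* Access a byte; the first touch of a page commits it (the "page fault"). */
static unsigned char *lazy_access(struct lazy_region *r, size_t offset)
{
    size_t page = offset / PAGE_SIZE;
    assert(page < r->npages);
    if (r->pages[page] == NULL) {
        r->pages[page] = calloc(1, PAGE_SIZE);  /* zero-filled like anonymous memory */
        assert(r->pages[page] != NULL);
        r->committed++;
    }
    return &r->pages[page][offset % PAGE_SIZE];
}
```

A region reserved this way costs almost nothing until it is touched, which is the property that makes large glibc pre-allocations cheap.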

mm_check_*() helpers

Many system calls accept pointer arguments that point to memory buffers for input/output. Program bugs could cause a NULL pointer or a dangling pointer to be passed as an argument. In this case, the kernel should gracefully return the EFAULT error code instead of crashing.

Since this rarely happens, we don't want to check the validity of the memory regions in the VAD tree beforehand, since that requires locking mm and is an expensive no-op most of the time. Instead, we implemented a series of functions - mm_check_read(), mm_check_write(), and mm_check_read_string() - to do a fast check.

They are assembly functions that try to read/write one byte in each page of the specified memory region; one byte per page is enough because memory allocation and memory protection work at page granularity. When a problem occurs, an "access violation" exception is raised, and special code in the exception handler detects whether we are inside one of these functions. If so, it redirects execution to a point where the function returns false - much like SEH or C++ exception handling, but with a smaller footprint.
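
The one-probe-per-page walk can be sketched in portable C. This shows only the address arithmetic: the real functions are written in assembly and rely on the exception handler, whereas here a hypothetical callback (probe_fn) stands in for the faulting read or write, and all names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000u

/* Hypothetical probe: returns nonzero if the byte at addr is accessible. */
typedef int (*probe_fn)(uintptr_t addr, void *ctx);

/* Probe one byte in every page overlapped by [addr, addr + len). */
static int check_range_sketch(uintptr_t addr, size_t len, probe_fn probe, void *ctx)
{
    if (len == 0)
        return 1;
    uintptr_t last = addr + len - 1;
    assert(last >= addr);  /* the range must not wrap around */
    uintptr_t p = addr;
    for (;;) {
        if (!probe(p, ctx))
            return 0;  /* the caller would report EFAULT */
        uintptr_t next = (p & ~(uintptr_t)(PAGE_SIZE - 1)) + PAGE_SIZE;
        if (next > last || next < p)  /* past the end, or address wrapped */
            break;
        p = next;
    }
    return 1;
}

/* Example probe for experiments: counts calls and rejects one page. */
struct probe_log { unsigned count; uintptr_t bad_page; };

static int logging_probe(uintptr_t addr, void *ctx)
{
    struct probe_log *log = ctx;
    log->count++;
    return (addr / PAGE_SIZE) != log->bad_page;
}
```

A 32-byte buffer that crosses a page boundary needs two probes, while a buffer of several pages stops at the first inaccessible page.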

These memory check functions are also required to implement on-demand paging, because the memory regions may be passed directly to Windows native functions like NtReadFile() and NtWriteFile(), which just return an error when the specified memory region is not valid. The check functions ensure that all pages are properly loaded and present before these functions are called.

brk() system call

The brk() system call is a legacy way of allocating memory in Linux. When an executable is loaded, a pointer called the program break points to the virtual address of the end of the mapped executable. brk() can be called to move the program break further, and the extra space this opens up can be used by the application.

The implementation is simple, but requires a special internal flag: INTERNAL_MAP_NOOVERWRITE. As the name suggests, this flag tells mmap() not to overwrite existing pages and to return an error instead. This lets brk() detect that it has reached an mmap()-ed memory region and report that the program break cannot be extended any further.
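
How a no-overwrite mapping lets brk() stop at an existing region can be shown with a toy model. In this sketch a small page table of flags stands in for the address space, only growth of the break is modelled, and all names (map_pages(), brk_sketch(), brk_init()) are hypothetical rather than the flinux functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 0x1000u
#define NUM_PAGES 256u

static unsigned char page_used[NUM_PAGES];  /* 1 if the page is mapped */
static uint32_t current_brk;

/* Map pages [first, last]. With no_overwrite set, fail if any is in use. */
static int map_pages(uint32_t first, uint32_t last, int no_overwrite)
{
    assert(first <= last && last < NUM_PAGES);
    if (no_overwrite)
        for (uint32_t p = first; p <= last; p++)
            if (page_used[p])
                return -1;  /* refuse to replace an existing mapping */
    for (uint32_t p = first; p <= last; p++)
        page_used[p] = 1;
    return 0;
}

/* Start with the "executable" occupying everything below start. */
static void brk_init(uint32_t start)
{
    assert(start > 0 && start % PAGE_SIZE == 0 && start / PAGE_SIZE <= NUM_PAGES);
    memset(page_used, 0, sizeof page_used);
    map_pages(0, start / PAGE_SIZE - 1, 0);
    current_brk = start;
}

/* Try to grow the break; returns the resulting break (unchanged on failure). */
static uint32_t brk_sketch(uint32_t new_brk)
{
    if (new_brk <= current_brk || new_brk > NUM_PAGES * PAGE_SIZE)
        return current_brk;  /* shrinking is left out of this sketch */
    uint32_t first = (current_brk + PAGE_SIZE - 1) / PAGE_SIZE;
    uint32_t last = (new_brk - 1) / PAGE_SIZE;
    if (first <= last && map_pages(first, last, 1) != 0)
        return current_brk;  /* ran into an existing mapping */
    current_brk = new_brk;
    return current_brk;
}
```

Because the availability check runs before any page is marked, a failed extension leaves the simulated address space untouched.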

Static allocation and global shared allocation

Because mm-managed memory regions are copied automatically on fork, many subsystems use them instead of VirtualAlloc(), which needs manual remapping on fork. But since mmap() only works at 4KB granularity and many subsystems don't need that much memory, a lot of memory is potentially wasted. Another problem is that the placement of these memory regions must not collide with the user application's mappings.

To make this cleaner, a special kind of memory allocation, static allocation, is implemented. Instead of having the subsystems allocate pages themselves, we preallocate a sufficiently large memory block and let the subsystems allocate their static forkable memory from it at initialization and on fork(). We keep the initialization order consistent, so they always get the same static addresses.
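
The reason a consistent initialization order yields the same addresses can be seen with a simple bump allocator over a preallocated block. This is a minimal sketch under assumed parameters (pool size, 16-byte alignment); the names are hypothetical and the real flinux allocator may differ.

```c
#include <assert.h>
#include <stddef.h>

#define STATIC_POOL_SIZE 4096u
#define STATIC_ALIGN     16u

static unsigned char static_pool[STATIC_POOL_SIZE];  /* the preallocated block */
static size_t static_used;

/* Called at initialization and again after fork, before subsystems allocate. */
static void static_alloc_begin(void)
{
    static_used = 0;
}

/* Hand out the next chunk of the pool; returns NULL if the pool is exhausted. */
static void *static_alloc_sketch(size_t size)
{
    size_t rounded = (size + STATIC_ALIGN - 1) & ~(size_t)(STATIC_ALIGN - 1);
    assert(rounded >= size);  /* guard against overflow while rounding */
    if (rounded > STATIC_POOL_SIZE - static_used)
        return NULL;
    void *p = static_pool + static_used;
    static_used += rounded;
    return p;
}
```

Since each allocation's address depends only on the sizes requested before it, replaying the same requests in the same order after a fork reproduces the same pointers.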

Global shared allocation is similar, but it is shared by all flinux processes. It is important for implementing process management functions, which often need information about all flinux processes.