
demand paging for low-memory MMU-based systems #26088

Closed
andrewboie opened this issue Jun 9, 2020 · 5 comments
Assignees
Labels
area: Memory Protection Enhancement Changes/Updates/Additions to existing features priority: high High impact/importance bug

Comments


andrewboie commented Jun 9, 2020

Is your enhancement proposal related to a problem? Please describe.
Some systems have more firmware than available RAM. Given a situation where 20% of the code runs 80% of the time (or thereabouts), it's not unheard of to implement demand paging in a microcontroller operating system on targets that have an MMU. NuttX, for example, supports this.

Describe the solution you'd like
Any demand paging implementation ought to meet the following broad requirements:

  • This needs to be orthogonal to user mode or any other infrastructure which uses the MMU, including stack guards and boot-time memory permissions.

  • The kernel will need to be linked at some designated virtual base address for the kernel's memory space (for example, 0xC0000000), with the physical-to-virtual relocation of the instruction pointer taking place very early in the boot process. (Google "higher half kernel" for examples)

  • Page tables themselves use a nontrivial amount of memory, and we should allow such usage to be minimized. The bounds of the kernel's address space should be configurable. For example, on IA32, if the kernel's address space starts at 0xC0000000 and is capped at 4MB in size, then a single page directory and a single page table is sufficient for all memory mapping needs.

  • We will need support for mapping device driver MMIO ranges into the kernel's memory space. Drivers can't assume that physical MMIO addresses are directly writable. Any solution for this must not impose unnecessary footprint or performance overhead on systems that do not have an MMU. If done at runtime, we'll need a generic API for mapping physical memory into the kernel's address space.

  • We will need the capability to pin certain pages in memory, such that they are never evicted. Core kernel functionality, interrupt handling, MMU tables, CPU tables, etc. go here. The minimum set of pinned memory pages should be well understood. We'll need macros so that application code can pin its own code/data as well.

  • There needs to be a layer of abstraction for the page replacement algorithm used (the "eviction algorithm"), so that the user can choose among any provided by Zephyr or roll their own. https://en.wikipedia.org/wiki/Page_replacement_algorithm

  • There needs to be a layer of abstraction (the "transport mechanism") for page-ins/page-outs to/from the backing store. It could be flash memory, a different type of RAM, a DMA transfer operation, the core kernel must not care.

  • We need a rich set of instrumentation points to capture performance metrics of the demand paging implementation and how much paging took place in any given time period.
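A minimal sketch of what the two pluggable layers above might look like in C. All identifiers here (`eviction_ops`, `backing_store_ops`, `random_select`, `PF_PINNED`) are invented for illustration and are not the actual Zephyr API; the victim selection mirrors the simple random eviction algorithm proposed for the initial implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-frame bookkeeping; the real kernel would track more. */
struct page_frame {
    uintptr_t flags;    /* pinned, dirty, ... */
    void *mapped_at;    /* virtual address, if resident */
};

#define PF_PINNED (1u << 0)

/* "Eviction algorithm" layer: chooses which resident frame to evict. */
struct eviction_ops {
    struct page_frame *(*select)(struct page_frame *frames, size_t nframes);
};

/* "Transport mechanism" layer: moves page contents to/from the backing
 * store (flash, secondary RAM, a DMA engine -- the kernel doesn't care). */
struct backing_store_ops {
    void (*page_out)(struct page_frame *pf, uintptr_t location);
    void (*page_in)(struct page_frame *pf, uintptr_t location);
};

/* Trivial "random" eviction: pick any frame that is not pinned. */
static struct page_frame *random_select(struct page_frame *frames,
                                        size_t nframes)
{
    size_t start = (size_t)rand() % nframes;

    for (size_t i = 0; i < nframes; i++) {
        struct page_frame *pf = &frames[(start + i) % nframes];

        if (!(pf->flags & PF_PINNED)) {
            return pf;
        }
    }
    return NULL; /* everything pinned: cannot evict */
}
```

An eviction-policy module would export a filled-in `struct eviction_ops { .select = random_select }`, keeping the core kernel unaware of the policy in use.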

Additional context

  • It may be possible to do some driver virtual mappings at build time, with page tables produced by the build system already mapping driver MMIO ranges exported by DTS appropriately and the drivers compiled to use the virtual addresses. This is nicer than setting this all up at runtime, at the expense of complexity/build-time voodoo. This will likely be a future enhancement not in the initial implementation.

  • The kernel will need accounting data structures for all physical memory pages, much like struct page in Linux.

  • The initial implementation will be on 32-bit x86. We will provide a simple random eviction algorithm, and an emulated backing store that works in QEMU (probably just some reserved memory region)

  • We eventually want to support ARM Cortex-A, Xtensa, and any other device with an MMU. The next implementation after 32-bit x86 will likely be 64-bit x86 so that we can sort out SMP issues.

  • Initial implementation on 32-bit x86 may not support user mode. After this is complete, user mode on MMU-based systems will be re-architected to use virtual memory spaces.

  • Much later on when we implement virtual memory spaces, any virtual memory below the kernel's base virtual address will be "userspace".

@andrewboie andrewboie added Enhancement Changes/Updates/Additions to existing features priority: medium Medium impact/importance bug area: Memory Protection labels Jun 9, 2020
@andrewboie andrewboie added this to the v2.4.0 milestone Jun 9, 2020
@andrewboie andrewboie self-assigned this Jun 9, 2020
@andrewboie andrewboie modified the milestones: v2.4.0, v2.5.0 Aug 12, 2020

andrewboie commented Aug 12, 2020

This won't land for 2.4, although I have started on it now that I have proper memory mapping implemented on x86.

The first thing is to establish the kernel's runtime ontology of virtual data pages and physical memory frames, and define what additional arch_ APIs and definitions we need. For documentation purposes, let's clearly differentiate between physical page frames and virtual data pages: a page fits into a frame, but we store extra pages elsewhere when all the frames are full.

For physical frames, take a cue from Linux and establish a page frame struct, one for every page frame.

  • Have a field for flags. Pinned, dirty; what other flags do we need?
  • If multiple page tables are in use, we will need reverse map links to iterate over all page tables that have the page frame mapped, requiring some kernel-level abstraction for the address space. I've talked about k_process before, but maybe it's better to use memory domains on MMU systems for this purpose. k_mem_domain would need an arch pointer for the page tables, and all threads would necessarily belong to a memory domain; it would never be NULL. We could have a default, at-boot domain for main(). This stays very close to memory domains on MPU systems, where the domain really does reflect the MPU configuration for its members aside from the thread stack.
  • reference count, if 0 then instead we build a linked list of free physical pages
  • embedded struct defined by the eviction algorithm, if required (LRU counts, etc)
  • embedded struct defined by the page transport mechanism, if required. (less likely?)
  • use every dirty bit-twiddling hack in the book to keep the size of the structure small, since there's one for every RAM page frame
  • arch-specific boot hook to mark physical pages as unavailable (x86 reserved legacy I/O regions; the first 1MB is a minefield); we have some E820 code already
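To make the size concern concrete, here is one hypothetical packing of such a per-frame struct. Field names and layout are invented for illustration, not the actual Zephyr design; the point is that flags and a small reference count can share a byte, and the free-list link can share storage with the mapping pointer:

```c
#include <assert.h>
#include <stdint.h>

/* One of these exists per RAM page frame, so size matters: with 4 KB
 * pages, a 1 MB system has 256 instances of this struct. */
struct page_frame {
    /* Flags squeezed into one byte alongside a small refcount. */
    uint8_t pinned   : 1;
    uint8_t dirty    : 1;
    uint8_t reserved : 1;  /* unusable physical memory (e.g. x86 legacy) */
    uint8_t free     : 1;  /* refcount == 0; frame is on the free list */
    uint8_t refcount : 4;  /* shared mappings are rare, so capped at 15 */

    /* When free, links the frame into the free list; when in use, points
     * at the (single) virtual address mapping it. Reusing one word for
     * both is one of the "bit twiddling hacks" in question. */
    union {
        struct page_frame *next_free;
        void *mapped_at;
    } u;
};

/* 8 bytes per frame on a 32-bit target, 16 bytes on 64-bit. */
_Static_assert(sizeof(struct page_frame) <= 2 * sizeof(void *),
               "keep per-frame overhead small");
```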

For virtual data pages, we also need to set up an ontology:

  • need to implement anonymous mappings; right now k_mem_map() always takes a physical page frame to map, but in a demand paging scenario you have more data pages than page frames.
  • swapped out pages could have their backing store location information embedded within the PTE since the CPU ignores everything else if the Present bit isn't set (do all/most MMUs support this? check ARM Cortex-A). evicting a page walks the reverse map and saves the backing store location in every relevant PTE as they are also marked non-present.
  • need to decide just how deep an ontology we need for address space mappings; I'm thinking a small subset of Linux, but not sure yet what. For reference: https://www.kernel.org/doc/gorman/html/understand/understand007.html
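The PTE trick described above can be shown in a few lines for 32-bit x86, where bit 0 of a PTE is the Present bit and the CPU ignores the remaining bits when it is clear. These helpers are an illustrative sketch, not Zephyr code:

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u

/* Evict: replace a live PTE with a non-present one that records where
 * the page contents went. 'slot' is a backing store slot number. */
static uint32_t pte_swap_out(uint32_t slot)
{
    /* Shift the slot past bit 0 so the Present bit stays clear. */
    return slot << 1;
}

/* On a page fault, recover the slot so the data can be paged back in. */
static uint32_t pte_swap_slot(uint32_t pte)
{
    assert((pte & PTE_PRESENT) == 0); /* only valid for evicted pages */
    return pte >> 1;
}
```

Whether other MMUs (e.g. ARM Cortex-A) leave all non-present PTE bits software-defined is exactly the open question raised above.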

Finally, two interfaces: the eviction algorithm and the transport mechanism. Implement trivial random eviction, and implement a transport mechanism simulator in QEMU that just uses additional RAM past where the kernel thinks it stops.
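The simulated backing store amounts to treating spare RAM as swap and copying pages in and out. A self-contained sketch (a static array stands in for the "RAM past where the kernel thinks it stops"; names are invented):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096u
#define STORE_PAGES 16u

/* Stand-in for the extra RAM region used as the backing store. */
static uint8_t backing_store[STORE_PAGES][PAGE_SIZE];

/* Copy a page frame's contents out to a backing store slot. */
static void page_out(const void *frame, unsigned int slot)
{
    memcpy(backing_store[slot], frame, PAGE_SIZE);
}

/* Copy a backing store slot's contents back into a page frame. */
static void page_in(void *frame, unsigned int slot)
{
    memcpy(frame, backing_store[slot], PAGE_SIZE);
}
```

Because the core kernel only sees the transport interface, swapping this simulator for a flash- or DMA-based implementation requires no core changes.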

Complexities/annoyances

  • SMP requires TLB shootdown if the same mappings exist on multiple CPUs and one of them is modified or made non-present. We don't currently do this, because mappings with respect to supervisor mode don't change and every user thread has its own page tables, but this is probably all changing soon.
  • Linker stuff: ties into what parts of the kernel are pinned. What's the loading like? Is the kernel a separate blob, or still rolled together with the application?

Clearly define how we are going to differ from Linux. For example mappings will be permanent once established. Memory domains are still around (for now anyway).

andrewboie commented:

We need to clearly separate the work needed to support basic demand paging without user mode or SMP from later work to support those configurations.

So for the first version:

  • No SMP support, CONFIG_DEMAND_PAGING will require !SMP. Do not need to implement SMP shootdowns when page tables are modified.
  • No user mode support, CONFIG_DEMAND_PAGING will require !USERSPACE. With CONFIG_USERSPACE=n, there is just one set of page tables: the kernel's. We can impose a policy that all swappable pages are mapped exactly once in the kernel's address space; any frame with multiple mappings, like VDSOs, will have to be pinned. If we evict a page, there will be exactly one PTE to update, and we can just store the virtual address in the struct z_page_frame. We solve the hairy problems related to reverse mapping to multiple page tables later (google "rmap linux" for some fun reading).
  • punt on E820 for x86; just skip the first megabyte in QEMU testing
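The one-mapping-per-frame simplification makes eviction a single PTE update, with no reverse-map walk. A sketch under those assumptions (struct z_page_frame is named above; everything else here, including the flat page table stand-in, is invented for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NUM_PTES   1024u

/* Stand-in for the kernel's single set of page tables. */
static uint32_t kernel_ptes[NUM_PTES];

struct z_page_frame {
    uintptr_t virt;  /* the single virtual address mapping this frame */
};

static uint32_t *pte_for(uintptr_t virt)
{
    return &kernel_ptes[(virt >> PAGE_SHIFT) % NUM_PTES];
}

/* Evict: no reverse-map walk needed. Clear the one PTE, recording the
 * backing store slot in the non-present entry (Present bit 0 clear). */
static void evict(struct z_page_frame *pf, uint32_t swap_slot)
{
    *pte_for(pf->virt) = swap_slot << 1;
}
```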


andrewboie commented Nov 11, 2020

Updated to-do list:

  • Update x86 to the new page table management scheme (per-domain or single page table)
  • TLB shootdown for 64-bit. That said, we will not take this further on x86_64 if it requires extra work on top of the 32-bit support.
  • Reverse map policy. I think it's sufficient that page frames have at most one mapping; if we need to dual map page frames for any reason, they will be pinned. If we are re-mapping memory, such as for memory-mapped stacks, we can un-map from the previous mapping instead of having two of them.
  • Emulator target for exercising demand paging. Using qemu_x86_tiny for this.
  • Implement kernel page frame ontology with something like an array of struct k_page_frame for every RAM page
  • Define all APIs which have implications for arch_ code, transport drivers, and eviction algorithms.
  • x86 page fault handler implementation
  • QEMU-based backing store implementation, just enable some extra RAM that Zephyr doesn't know about and use it as the backing store
  • Textbook NRU eviction algorithm, will need arch_ API to query accessed/dirty bits in the page tables
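Textbook NRU sorts pages into four classes from the accessed and dirty bits (which the arch_ API mentioned above would read out of the page tables) and evicts from the lowest non-empty class. The classification itself is tiny; this helper is an illustrative sketch, not Zephyr code:

```c
#include <assert.h>

/* NRU victim classes, lowest is the best candidate for eviction:
 *   class 0: not accessed, clean
 *   class 1: not accessed, dirty
 *   class 2: accessed, clean
 *   class 3: accessed, dirty */
static int nru_class(int accessed, int dirty)
{
    return (accessed ? 2 : 0) + (dirty ? 1 : 0);
}
```

Periodically clearing the accessed bits (so "accessed" means "accessed recently") is what distinguishes NRU from a simple clean-before-dirty policy.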

2.6 stuff:

  • Linker considerations. Determine set of pages known to be pinned at boot. How do we load an image if the total size of text/rodata/data is larger than RAM? We may need to consolidate pages that are known to be pinned at boot (critical kernel text, page tables, etc) to a single contiguous region which is loaded first. What all gets loaded at boot time? Just the pinned pages?
  • Performance instrumentation: page-in and page-out time for clean and dirty pages, histogram of eviction mechanism selection execution time, pinned page footprint (including page tables)
  • How to do anonymous mappings if no page frames are free. General demand paging interactions with generic page pool for MMU-based systems #29526
  • Test case design considerations and implementation
  • Documentation (doc: add some bits about demand paging #35240)
  • Debugger support. When debugging with GDB, if we need to inspect paged-out memory it can be paged in

Future:

  • Integrity protection for pages; verify checksum when retrieving pages from backing store

Things we will punt on:

  • Non-contiguous memory maps; multiple RAM regions. Demand paging will be limited to whatever region represents main system RAM.
  • x86_64 implementation, we just don't need it

andrewboie commented:

PR: #30907

@nashif nashif assigned dcpleung and unassigned andrewboie Feb 4, 2021
@nashif nashif modified the milestones: v2.5.0, v2.6.0 Feb 4, 2021
@galak galak removed this from the v2.6.0 milestone Jun 2, 2021

nashif commented Feb 13, 2024

We've had this for a while, closing.

@nashif nashif closed this as completed Feb 13, 2024