Linux Deep Dive #4: Memory Management — Virtual Memory and the Page Cache
Target: Fedora 43, kernel 6.19.11. Every command in this post is something you can run yourself.
Run free -h on any Linux system that's been running for a while:
$ free -h
total used free shared buff/cache available
Mem: 30Gi 22Gi 200Mi 500Mi 7.8Gi 7.3Gi
Swap: 8.0Gi 0B 8.0Gi
Most people see the free column — 200 MB — and worry. Only 200 MB of RAM left? But the system is perfectly healthy. available is 7.3 GiB. Something is using 7.8 GiB under buff/cache, and it can be reclaimed on demand.
This is the kernel doing its job. Idle RAM is wasted RAM. The kernel fills unused memory with file data it might need again, making disk reads fast. When a process needs memory, the kernel reclaims the cache. The free column tells you almost nothing useful. available is the number that matters.
This chapter explains why — by tracing the full memory architecture from virtual addresses through page tables to physical frames, through the page cache and back. We'll cover how the kernel allocates physical memory, what really happens on a page fault, how the page cache works, and how the kernel reclaims memory under pressure.
Memory at a Glance
User process (virtual address space)
│
│ Virtual address → MMU → page table walk
▼
┌──────────────────────────────────────────────────────┐
│ 4-Level Page Tables │
│ PGD → PUD → PMD → PTE → physical frame │
│ TLB caches recent translations │
└──────────────────────────────────────────────────────┘
│
│ physical frame number
▼
┌──────────────────────────────────────────────────────┐
│ Physical Memory │
│ ┌────────────────────┐ ┌──────────────────────┐ │
│ │ Anonymous pages │ │ File-backed pages │ │
│ │ heap, stack, CoW │ │ (the page cache) │ │
│ │ must swap to evict│ │ can drop for free │ │
│ └────────────────────┘ └──────────────────────┘ │
│ │
│ Buddy allocator manages frames (4KB to 4MB blocks) │
│ SLUB allocator carves pages into kernel objects │
└──────────────────────────────────────────────────────┘
│
│ when free memory drops below watermarks
▼
┌──────────────────────────────────────────────────────┐
│ Reclaim │
│ kswapd scans LRU lists │
│ ├── file pages: drop (clean) or writeback + drop │
│ └── anon pages: write to swap, then free frame │
│ │
│ Last resort: OOM killer │
└──────────────────────────────────────────────────────┘
Virtual Addresses and Page Tables
Chapter 3 showed that every process has an mm_struct containing a set of VMAs — contiguous virtual address regions with permissions like r-xp or rw-p. But how does a virtual address in a VMA actually become a physical address the CPU can use?
The answer is the page table.
The 4-Level Page Table on x86-64
On a 64-bit x86 system, virtual addresses are 48 bits wide. The CPU uses a 4-level page table structure to translate them:
Virtual address (48-bit):
┌───────┬───────┬───────┬───────┬──────────────┐
│ PGD │ PUD │ PMD │ PTE │ Page offset │
│ 9 bit │ 9 bit │ 9 bit │ 9 bit │ 12 bit │
└───────┴───────┴───────┴───────┴──────────────┘
PGD = Page Global Directory (top-level)
PUD = Page Upper Directory
PMD = Page Middle Directory
PTE = Page Table Entry (leaf: physical frame number + flags)
The translation works like a 4-level lookup:
CR3 register → PGD table
│ index with bits 47–39
▼
PGD entry → PUD table
│ index with bits 38–30
▼
PUD entry → PMD table
│ index with bits 29–21
▼
PMD entry → PTE table
│ index with bits 20–12
▼
PTE entry: physical frame number + R/W/X/U flags
│ + page offset (bits 11–0)
▼
Physical address
Each level is a 4KB page holding 512 8-byte entries (9 bits of index → 2^9 = 512). The final 12 bits of the virtual address index into the 4KB page. This gives 48-bit coverage: 9+9+9+9+12 = 48.
The flags in each PTE are what drive many kernel mechanisms:
| PTE flag | Effect |
|---|---|
| Present | Page is in RAM; if clear, triggers a page fault |
| Writable | Write permission; if clear on a mapped page, triggers CoW fault |
| User | Accessible from user space; kernel-only pages have this clear |
| Accessed | CPU sets this when the page is read (used by reclaim) |
| Dirty | CPU sets this on write (used to detect pages needing writeback) |
| NX (No-Execute) | Prevents execution of data pages |
You can see the page table walk in action through /proc/[pid]/pagemap, though that requires root. The more accessible view is /proc/[pid]/maps, which shows the VMAs:
$ cat /proc/$$/maps | head -8
5592b7c38000-5592b7c3e000 r--p 00000000 fd:01 530753 /usr/bin/bash
5592b7c3e000-5592b7c7e000 r-xp 00006000 fd:01 530753 /usr/bin/bash
5592b7c7e000-5592b7cad000 r--p 00046000 fd:01 530753 /usr/bin/bash
5592b7cae000-5592b7cb2000 rw-p 00075000 fd:01 530753 /usr/bin/bash
5592b9040000-5592b90ae000 rw-p 00000000 00:00 0 [heap]
7f2345a00000-7f2345c00000 r--p 00000000 fd:01 398912 /usr/lib/locale/locale-archive
...
7ffe8e3b0000-7ffe8e3d2000 rw-p 00000000 00:00 0 [stack]
7ffe8e3f8000-7ffe8e3fc000 r--p 00000000 00:00 0 [vvar]
Each row is a VMA. The permissions r-xp, rw-p, r--p map directly to PTE flags. The page table for this process contains entries for each page within each VMA — but only for pages that have actually been touched (page tables are themselves demand-paged).
The TLB: Caching Page Table Lookups
Four memory accesses to translate one address would be crippling. The CPU keeps a Translation Lookaside Buffer (TLB) — a small hardware cache of recent virtual→physical translations. A TLB hit returns the physical address in a single cycle. A miss triggers the full 4-level walk.
TLB misses are part of why context switches have measurable cost: switching processes means switching page tables (loading a new value into CR3), which flushes the TLB. The next few hundred memory accesses in the new process are all TLB misses until the cache warms up.
The kernel minimizes TLB pressure through huge pages: instead of 4KB pages (needing many TLB entries), the kernel can use 2MB pages (PMD-level mappings), covering 512× more memory with a single TLB entry. On this system, transparent huge pages (THP) are in madvise mode — the kernel uses 2MB pages for anonymous memory when the application explicitly requests them:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
Physical Memory — Zones, the Buddy Allocator, and Slab
Memory Zones
Not all physical RAM is equivalent. Legacy ISA devices could only DMA to the first 16 MB. Older PCI devices can only address below 4 GB. The kernel divides physical memory into zones to track these constraints:
$ cat /proc/buddyinfo
Node 0, zone DMA 0 0 0 0 0 0 0 0 1 1 2
Node 0, zone DMA32 17 13 13 12 9 9 14 12 9 9 768
Node 0, zone Normal 65656 39923 16923 8715 2995 987 367 127 53 27 1876
| Zone | Physical range | Purpose |
|---|---|---|
| DMA | 0–16 MB | ISA/legacy DMA devices |
| DMA32 | 0–4 GB | 32-bit PCI devices |
| Normal | > 4 GB | All other allocations |
On this 30 GB machine, the Normal zone holds almost all RAM.
The Buddy Allocator
The kernel's fundamental physical memory allocator is the buddy allocator. It maintains 11 free lists, indexed by order 0 through 10, where order N holds blocks of 2^N contiguous pages:
Order 0: 1 page = 4KB
Order 1: 2 pages = 8KB
Order 2: 4 pages = 16KB
...
Order 10: 1024 pages = 4MB
The numbers in buddyinfo are the count of free blocks at each order. In the Normal zone above, there are 65,656 order-0 blocks (4KB each), 39,923 order-1 blocks (8KB each), and so on.
When the kernel needs N pages:
- Find the smallest order ≥ N with a free block
- If that order is larger than needed, split the block: one half goes to the next lower order's free list, the other half is used
- When pages are freed, the kernel checks if the adjacent "buddy" block is also free; if so, they merge back to the higher order
Splitting and merging keep the allocator efficient and minimize fragmentation. You can watch allocation pressure by checking how many high-order blocks remain — if order-10 is empty and order-0 is full, the system is fragmented and can't satisfy large contiguous allocations.
# Watch buddy allocator state live
watch -d cat /proc/buddyinfo
The SLUB Allocator
The buddy allocator deals in full pages. But the kernel constantly allocates objects much smaller than 4KB: task_struct (~7KB but padded), file (~300 bytes), dentry (~200 bytes), inode (~600 bytes), socket buffers (~200 bytes each). Handing out a full 4KB page for each would waste enormous amounts of RAM.
The SLUB allocator (Simplified Unqueued Layer) sits on top of the buddy allocator. It maintains per-CPU slabs — pages filled with pre-allocated objects of a single size. Allocating a small kernel object takes a free slot from a slab; freeing it returns the slot. No buddyallocator call needed.
# See the largest slab caches by memory usage
$ sudo slabtop -o | head -15
Active / Total Objects (% used) : 2344037 / 2429115 (96.5%)
Active / Total Slabs (% used) : 82684 / 82684 (100.0%)
Active / Total Caches (% used) : 193 / 244 (79.1%)
Active / Total Size (% used) : 622408.38K / 655006.31K (95.0%)
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
253056 253056 100% 0.19K 6016 42 24064K dentry
198400 186988 94% 0.06K 3100 64 6200K kmalloc-64
110916 95897 86% 1.06K 9243 12 147888K inode_cache
37248 37248 100% 0.50K 1164 32 18624K kmalloc-512
dentry (directory entry cache) and inode_cache are typically the largest slab consumers — the kernel caches filesystem metadata aggressively because lookups are frequent and disk access is slow.
Page Faults
When the CPU walks the page table and something is wrong, it raises a page fault — a hardware exception that transfers control to the kernel's fault handler (handle_mm_fault() in mm/memory.c). There are several distinct kinds:
| Fault type | Cause | Kernel action | Disk I/O? |
|---|---|---|---|
| Minor | Page in VMA but PTE not present (first access) | Allocate frame, fill PTE | No |
| Major | Page was evicted to disk; PTE marked not-present | Read page from disk, fill PTE | Yes |
| CoW | Write to a read-only shared page (after fork) | Copy page, update PTE to writable | No |
| Invalid | Access outside any VMA | Deliver SIGSEGV | No |
Minor faults are normal and constant — they happen every time a program accesses a new page of its heap or stack. The kernel allocates a physical frame, zeros it (to prevent information leaks), updates the PTE, and the program continues. No disk I/O, just a brief detour through the kernel.
Major faults are a symptom of memory pressure. They mean the kernel previously evicted this page to disk to free RAM, and now the program needs it back. Major faults cause visible latency — the application stalls waiting for the disk read to complete.
# Page fault totals since boot
$ grep -E '^pgfault|^pgmajfault' /proc/vmstat
pgfault 39670010 # minor faults — normal, healthy
pgmajfault 3007 # major faults — disk reads to recover evicted pages
39 million minor faults since boot is expected. 3007 major faults is low — this system hasn't been under enough memory pressure to evict much. On a memory-constrained system, pgmajfault climbs steadily and latency spikes whenever a major fault stalls a thread.
The CoW fault is the mechanism from Chapter 3: after fork(), parent and child share physical pages marked read-only. The first write from either side triggers a fault, the kernel copies the page, and both now have their own writable copy.
You can measure per-process fault rates in real time:
# Fault rates for a running process
$ cat /proc/1/status | grep -i fault
voluntary_ctxt_switches: 31248
nonvoluntary_ctxt_switches: 1156
# For cumulative fault count, smaps_rollup is faster than summing smaps:
$ sudo cat /proc/1/smaps_rollup | grep -E 'Rss|Pss'
Rss: 21432 kB
Pss: 13891 kB
The Page Cache
This is the mechanism behind the buff/cache column in free. It's also the single most important thing to understand about Linux memory management.
The Basic Idea
When you read() from a file, the kernel doesn't transfer bytes directly from disk to your process. It maintains a page cache — a pool of physical pages indexed by (filesystem inode, page offset). The read path is:
read() syscall
│
├── Is (inode, offset) already in the page cache?
│ ├── Yes: copy from cache to user buffer → done (no disk I/O)
│ └── No: read from disk into a new cache page,
│ then copy to user buffer
│
└── Cache page stays in RAM after the read
The page cache entry persists after your read() returns. The next read() of the same region — whether by you, another process, or another program — finds it in cache and pays no disk cost.
This is why a fresh make that compiles 10,000 source files goes faster on the second run: all those file reads hit the cache. The first run was I/O-bound; the second is CPU-bound.
File-Backed vs Anonymous Memory
Physical memory pages fall into two categories:
Physical pages
│
├── File-backed (page cache)
│ ├── Clean: identical to disk → kernel can drop for free
│ └── Dirty: modified, not yet written back
│ → kernel must flush to disk before dropping
│
└── Anonymous (no file)
├── Heap, stack, CoW copies, anonymous mmap
└── To free: must write to swap device first
(or kill the process)
File-backed pages are cheap to evict — the kernel just drops the page and re-reads it from disk if needed again. Anonymous pages are expensive: they have no file backing, so eviction requires writing to swap and reading back later.
This distinction drives the entire reclaim strategy.
mmap() and the Page Cache
mmap() is the page cache seen from the process side. When your program opens a file and calls:
void *ptr = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
The kernel creates a VMA that represents a window into the file's pages in the page cache. No data is loaded yet. When you dereference ptr, a minor page fault fires; the kernel checks the page cache (reading from disk if necessary), and installs a PTE pointing to that cache page. The process's PTE and the page cache entry point to the same physical frame.
This has an important consequence: every process running the same program shares its text segment.
# How many processes share bash's text pages?
$ cat /proc/$(pgrep -n bash)/smaps | grep -A 10 'bash$' | head -12
5592b7c3e000-5592b7c7e000 r-xp 00006000 fd:01 530753 /usr/bin/bash
Size: 256 kB
KernelPageSize: 4 kB
Rss: 252 kB
Pss: 14 kB ← proportional share — 252KB / ~18 processes
Shared_Clean: 252 kB ← all of it is shared
Private_Clean: 0 kB
Rss (Resident Set Size) is the total physical memory mapped. Pss (Proportional Set Size) divides shared pages by the number of processes sharing them. The r-xp text segment shows Shared_Clean: 252 kB and Pss: 14 kB — 18 bash processes share the same physical pages. One copy in RAM, many processes using it.
Reading the Page Cache Statistics
$ grep -E '^(Cached|Buffers|Mapped|Dirty|Writeback)' /proc/meminfo
Buffers: 4208 kB
Cached: 6564188 kB
Mapped: 1189268 kB
Dirty: 1336 kB
Writeback: 0 kB
| Field | Meaning |
|---|---|
Cached | Total page cache size |
Buffers | Block device metadata cache (separate from file data cache) |
Mapped | Page cache pages currently mapped into at least one process's address space |
Dirty | Modified pages not yet written to disk |
Writeback | Pages currently being written to disk |
On this machine, Cached is 6.4 GB — 6.4 GB of file data sitting in RAM, ready for instant re-use. Dirty is only 1.3 MB, meaning the kernel's writeback threads are keeping up with writes. Writeback is 0, so no I/O is in flight right now.
Reclaim — How the Kernel Manages Memory Pressure
LRU Lists
The kernel tracks which pages are worth keeping with four LRU (Least Recently Used) lists:
LRU lists
├── active_file — recently used file-backed pages
├── inactive_file — older file-backed pages (reclaim candidates)
├── active_anon — recently used anonymous pages
└── inactive_anon — older anonymous pages (swap candidates)
Pages move between these lists based on access patterns. The hardware Accessed bit in PTEs tells the kernel when a page was last used. Pages migrate from active to inactive when they haven't been accessed for a while; from inactive, they become reclaim candidates.
$ grep -E '^(Active|Inactive)' /proc/meminfo
Active: 7844660 kB
Inactive: 4958656 kB
Active(anon): 4685204 kB
Inactive(anon): 0 kB
Active(file): 3159456 kB
Inactive(file): 4958656 kB
On this system: 4.7 GB of anonymous memory is active (process heaps/stacks in use), 3.2 GB of file cache is active (recently read files). The 4.9 GB of inactive file cache is the first thing kswapd will drop if memory gets tight.
kswapd and Watermarks
kswapd is a kernel thread that maintains free memory within a healthy range. The kernel defines three watermarks for each zone:
Zone free pages
│
├── High watermark — kswapd stops here (plenty of free memory)
│
├── Low watermark — kswapd wakes up and starts reclaiming
│
└── Min watermark — direct reclaim kicks in; allocations stall
until kswapd frees enough pages
When free memory drops below the low watermark, kswapd wakes and scans the inactive lists:
- Inactive file pages (clean): drop immediately — they're exact copies of disk data
- Inactive file pages (dirty): schedule writeback, then drop after write completes
- Inactive anon pages: write to swap device, then free the frame
# Is kswapd doing any work?
$ grep -E 'pgsteal_kswapd|pgscan_kswapd|kswapd_low_wmark' /proc/vmstat
pgsteal_kswapd 0 # pages reclaimed by kswapd (0 = no pressure)
pgscan_kswapd 0 # pages scanned during reclaim
kswapd_low_wmark_hit_quickly 0
All zeros: this system is not under memory pressure. kswapd is idle. On a memory-constrained system, pgsteal_kswapd climbs continuously.
Swappiness
The vm.swappiness sysctl (0–200, default 60) controls the balance between evicting file cache and swapping anonymous pages. It's a weight in the reclaim cost calculation — higher values make the kernel more willing to swap anonymous pages alongside evicting file cache.
$ cat /proc/sys/vm/swappiness
60
Contrary to common belief, swappiness is not a memory-use threshold. swappiness=60 does not mean "start swapping at 60% memory use." It means: when choosing between evicting a file page and swapping an anonymous page, weight the decision so that both are considered in roughly a 60:100 ratio.
Setting swappiness=1 makes the kernel almost never swap — it will exhaust file cache before touching anonymous memory. Setting it to 200 makes the kernel aggressively swap, preferring to keep file cache hot. Neither extreme is universally right.
# Check swap usage
$ grep -E '^(SwapTotal|SwapFree|SwapCached|Pswpin|Pswpout)' /proc/meminfo
SwapTotal: 8388604 kB
SwapFree: 8388604 kB
SwapCached: 0 kB
# Swap activity since boot
$ grep -E '^pswp' /proc/vmstat
pswpin 0 # pages swapped in (major faults from swap)
pswpout 0 # pages swapped out
Zero swap activity on this machine — there's enough RAM that kswapd hasn't needed to swap anything out.
The OOM Killer
When free memory hits zero, all reclaim options are exhausted, and a memory allocation is still failing, the kernel invokes the OOM killer (Out Of Memory killer) as a last resort.
The OOM killer selects a process to terminate based on oom_score — a value from 0 to 1000 calculated from:
- RSS (how much RAM the process is actually using)
- Runtime (longer-running processes are slightly protected)
- Nice value (low-priority processes score higher)
- Whether the process is privileged (root processes score slightly lower)
The process with the highest score gets killed with SIGKILL.
# OOM score for each running process
$ cat /proc/1/oom_score # systemd — should be 0 (protected)
0
$ cat /proc/$$/oom_score # your shell
5
# Processes can adjust their score (-1000 to +1000)
$ cat /proc/$$/oom_score_adj
0
Setting oom_score_adj to -1000 makes a process immune to OOM killing. Systemd sets itself to -1000 for exactly this reason. Critical system daemons often do the same. Setting it to +1000 means "kill me first in an OOM."
# Which process would the OOM killer target right now?
$ ps -eo pid,oom_score,rss,comm --sort=-oom_score | head -10
PID OOM_SCORE RSS COMMAND
2847 182 2345200 firefox
3012 89 854320 slack
3201 45 412800 gnome-shell
The OOM killer fires rarely on a well-sized system. If it fires often, the right fix is more RAM, fixing a memory leak, or adding limits with memory cgroups — not tuning OOM parameters.
Why free Shows Almost No "Free" Memory
Now we have all the pieces. Return to the output from a memory-loaded system:
$ free -h
total used free shared buff/cache available
Mem: 30Gi 22Gi 200Mi 500Mi 7.8Gi 7.3Gi
Swap: 8.0Gi 0B 8.0Gi
What does each column actually mean?
| Column | What it counts |
|---|---|
total | Total physical RAM |
used | Process memory + kernel memory not reclaimable |
free | Completely idle frames — holding nothing |
shared | tmpfs and shared memory |
buff/cache | Page cache + buffer cache — reclaimable on demand |
available | Estimated memory a new allocation could get ≈ free + reclaimable buff/cache |
The free column being 200 MB does not mean the system is nearly out of memory. It means the kernel has filled almost all idle frames with page cache — which it should. Those 7.8 GB of buff/cache are serving file reads instantly.
When a process needs more memory, the kernel reclaims from buff/cache first (by dropping clean file pages). The new allocation succeeds, and buff/cache shrinks. available — not free — is what tells you whether a large new allocation will succeed.
The actual machine for this series has 30 GB of RAM and was not heavily loaded at the time this chapter was written:
$ free -h
total used free shared buff/cache available
Mem: 30Gi 11Gi 12Gi 153Mi 6.6Gi 19Gi
Swap: 8.0Gi 0B 8.0Gi
Here free is 12 GB — plenty. But the pattern is the same: buff/cache is 6.6 GB of page cache that the kernel built up from disk reads. It's not waste. It's usable memory being put to work.
The rule: the system is healthy as long as available is above zero and Swap:used is not climbing rapidly.
# One-liner to check memory health
$ free -h && grep -E '^(SwapFree|Dirty|Writeback)' /proc/meminfo
Try It Yourself
1. The big picture
free -h
cat /proc/meminfo
2. Where is memory going, by process?
# Sort by RSS (resident, actual physical pages)
ps -eo pid,rss,vsz,comm --sort=-rss | head -15
# VSZ is virtual size (mapped but not necessarily in RAM)
# RSS is what's actually resident in physical memory
3. Per-VMA detail for a process
# smaps shows Rss, Pss (proportional), and sharing info per mapping
cat /proc/$$/smaps | grep -E '^(Size|Rss|Pss|Shared_Clean|Private)' | head -30
4. Proportional memory use across the system
# PSS (Proportional Set Size) divides shared pages fairly between sharers
sudo cat /proc/*/smaps_rollup 2>/dev/null | grep ^Pss | awk '{s+=$2} END {print s/1024 "MB total PSS"}'
5. The buddy allocator's free list
# Higher-order blocks = less fragmentation = better large allocations
cat /proc/buddyinfo
6. Slab allocator caches
sudo slabtop -o | head -20
# OBJ SIZE × SLABS × OBJ/SLAB = total memory used by each cache
7. Page fault counts since boot
grep -E '^pgfault|^pgmajfault' /proc/vmstat
# pgmajfault counts disk reads due to evicted pages — low is good
8. Is kswapd doing any work?
grep -E 'pgsteal_kswapd|pgscan_kswapd|pswp' /proc/vmstat
# Non-zero pgsteal_kswapd means active reclaim — system is under memory pressure
9. Drop the page cache and watch free change
# Sync first to avoid dropping dirty data
sync
# Drop page cache, dentries, and inodes (3 = all)
# This is SAFE — it's read-only metadata and clean file data
echo 3 | sudo tee /proc/sys/vm/drop_caches
# free will jump; buff/cache will drop
free -h
# The cache refills automatically as you read files
10. OOM scores across the system
# Which process would the OOM killer target?
ps -eo pid,oom_score,rss,comm --sort=-oom_score | head -10
# Check if any process is protecting itself from OOM
grep -r -l '\-1000' /proc/*/oom_score_adj 2>/dev/null | head -5
Putting It All Together
Process writes to a new heap page
│
│ CPU walks page table → PTE not present
▼
Minor page fault
│ kernel allocates a physical frame (buddy allocator: order-0)
│ SLUB if kernel object; direct page if user allocation
│ zeroes the frame (security: no previous content leaks)
│ installs PTE: virtual page → physical frame, writable
│
▼ frame is now in memory, process continues
Process reads a file it hasn't opened before
│
│ read() → VFS → page cache lookup → miss
▼
Major page fault (or synchronous read)
│ block I/O reads the file's page from disk
│ page lands in page cache (indexed by inode + offset)
│ PTE installed (for mmap) or data copied to user buffer (for read())
│
▼ page cache entry persists
File page stays in page cache
│
├── accessed again → serves from RAM, no disk I/O
│
└── memory pressure → kswapd
│
├── page is clean → drop for free
│ (re-read from disk if accessed again → major fault)
│
└── page is dirty → writeback, then drop
Anonymous page under pressure
│
└── kswapd writes to swap device
│ frame is freed
│ PTE updated: page-not-present (swap offset stored elsewhere)
│
└── process accesses the page again
│ major fault: read from swap → restore frame → PTE updated
▼ slow, but correct
Memory completely exhausted
│
└── OOM killer: select highest oom_score process → SIGKILL
frame freed, pressure relieved
The page cache is the unifying thread: read(), write(), mmap(), executable loading, and library sharing all flow through it. The kernel's job is to keep the working set of your running processes in RAM while using the rest for cache — and to reclaim that cache quietly, before you notice, whenever someone needs more memory.
What's Next
In Chapter 5, we'll look at the scheduler — how the kernel decides which of many runnable processes gets CPU time. We've already seen task_struct and its se.vruntime field (CFS virtual runtime). Chapter 5 fills that in: what CFS virtual runtime means, how nice values translate to actual CPU share, what load average really measures, and how to diagnose scheduler-related latency.
Part of the Linux Deep Dive series.