Linux Under the Hood

Most developers use Linux every day — as their workstation, their servers, their containers. But the system beneath the shell prompt remains a black box.

This book changes that.

Linux Under the Hood is a deep-dive series for developers who want to understand what Linux is actually doing: how it boots, how it runs your programs, how it manages memory, how the filesystem works, how the network stack processes your packets, and how security primitives like containers actually work under the hood.

Each chapter combines:

  • Narrative explanation — the why before the how
  • Hands-on experiments — real commands you can run on your own Fedora system
  • Source-level details — pointers into the kernel source when it matters

No prior kernel development experience required. If you can write a program and deploy it on Linux, you're ready.

How to read this book

The chapters are designed to be read in order — each one builds on the last. But if you already understand processes, you can skip ahead to memory management without being lost.

All examples are tested on Fedora 43, kernel 6.19. The concepts apply to any Linux distribution; the specific commands and paths may vary slightly.

The chapters

#  Chapter                 What you'll learn
1  The Boot Sequence       UEFI → GRUB2 → kernel → initramfs → systemd
2  Processes & Scheduling  task_struct, fork/exec, the scheduler (CFS and its successor, EEVDF)
3  Memory Management       Virtual memory, paging, mmap, OOM killer
4  The Filesystem          VFS, ext4/Btrfs internals, /proc and /sys
5  System Calls & I/O      The syscall mechanism, io_uring
6  The Network Stack       TCP/IP in the kernel, eBPF
7  Security & Containers   Namespaces, cgroups, how Docker works

Let's start at the very beginning — the moment you press the power button.

Linux Deep Dive #1: The Boot Sequence — From Power Button to Shell Prompt

Target: Fedora 43, kernel 6.19.11. Every command in this post is something you can run yourself.


You press the power button. Some tens of seconds later (37.5 on the machine used here), a shell prompt appears. What happened in between?

Most developers treat this as a black box. That's a shame — the Linux boot sequence is one of the most elegant pieces of engineering in the entire system. It's a relay race where each stage does just enough work to hand off to the next. Miss any baton and the system halts.

This post tears the black box open. We'll trace every handoff from firmware to your first interactive shell, using a real Fedora 43 machine as the specimen.


The Relay Race at a Glance

Power on
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│  PHASE 1 — UEFI Firmware                       ~6s          │
│  POST → find ESP → load shim.efi → load grubx64.efi        │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│  PHASE 2 — GRUB2 Bootloader                    ~3s          │
│  Read grub.cfg → show menu → load vmlinuz + initramfs       │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│  PHASE 3 — Kernel Initialization               ~2s          │
│  Decompress → setup memory → init subsystems → mount initrd │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│  PHASE 4 — initramfs                           ~15s         │
│  Unlock LUKS → find root filesystem → switch_root           │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│  PHASE 5 — systemd (PID 1)                     ~12s         │
│  Parse units → mount filesystems → start services           │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│  PHASE 6 — Login / Shell                                    │
│  getty → login → PAM → your shell                           │
└─────────────────────────────────────────────────────────────┘

Those phase timings aren't invented — they come directly from systemd-analyze on this machine:

$ systemd-analyze
Startup finished in 5.884s (firmware) + 2.696s (loader) + 1.614s (kernel)
                 + 15.109s (initrd) + 12.194s (userspace) = 37.501s

How does systemd-analyze know these numbers? Each phase boundary is a measured handoff, not an estimate:

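  • Firmware and loader phases: These happen before the kernel exists, so systemd reads them from EFI variables that the bootloader sets (LoaderTimeInitUSec and LoaderTimeExecUSec in systemd's Boot Loader Interface).
  • Kernel phase: The kernel's monotonic clock starts at zero when the kernel begins executing. When systemd first starts inside the initramfs, it reads that clock; the reading marks the end of the "kernel" phase and the start of the "initrd" phase.
  • initrd and userspace phases: systemd records its own timestamps each time it starts (first inside the initramfs, then as the real PID 1 after switch_root) and again when startup finishes. The differences between these timestamps give the initrd and userspace durations.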

We'll use this data throughout.
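To see at a glance where the time went, the phase figures can be turned into percentages. The following is a minimal sketch, not part of systemd: phase_breakdown is a hypothetical helper name, and it assumes every phase is printed as plain seconds (a boot slow enough to show minutes would need extra parsing).

```shell
# phase_breakdown is a hypothetical helper, not a systemd tool.
# It reads `systemd-analyze` output on stdin and prints the share of each phase.
phase_breakdown() {
    # grep -o emits one "<seconds>s (<phase>)" pair per line; the final
    # "= 37.501s" total has no "(phase)" after it, so it is skipped.
    grep -oE '[0-9]+\.[0-9]+s \([a-z]+\)' | awk '
        {
            secs = $1; sub(/s$/, "", secs)        # "5.884s" -> "5.884"
            name = $2; gsub(/[()]/, "", name)     # "(firmware)" -> "firmware"
            t[NR] = secs; n[NR] = name; total += secs
        }
        END {
            for (i = 1; i <= NR; i++)
                printf "%-10s %7.3fs %5.1f%%\n", n[i], t[i], 100 * t[i] / total
            printf "%-10s %7.3fs\n", "sum", total
        }'
}

# Demo with the figures measured on this machine.
phase_breakdown <<'EOF'
Startup finished in 5.884s (firmware) + 2.696s (loader) + 1.614s (kernel)
                 + 15.109s (initrd) + 12.194s (userspace) = 37.501s
EOF
```

On a live system, pipe the real output in: systemd-analyze | phase_breakdown. With the figures above, the initrd phase accounts for about 40% of the boot. The displayed phases add up to 37.497s rather than the reported 37.501s because each figure is printed at millisecond precision.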


Phase 1 — UEFI Firmware

What BIOS used to do

The old BIOS (Basic Input/Output System) was a 16-bit program burned into a chip. At power-on it ran POST (Power-On Self Test) — checking RAM, CPU, devices — then looked for a bootable disk by reading the first 512 bytes (the Master Boot Record). The first 446 bytes were executable bootloader code, 64 bytes were a partition table, and the last 2 bytes were a magic signature 0x55 0xAA. This is why old bootloaders had to fit in 446 bytes. Absurd.

Modern systems use UEFI (Unified Extensible Firmware Interface), which is essentially a small operating system. It understands GPT partition tables, reads FAT32 filesystems, and can load proper .efi executables. No 446-byte constraint.

The EFI System Partition (ESP)

UEFI stores bootloaders in a dedicated FAT32 partition called the EFI System Partition (ESP). On this Fedora machine:

$ efibootmgr -v
BootCurrent: 0005
BootOrder: 0005,0004,0002,0001,0000,0003,0006,0007

Boot0005* Fedora  HD(1,GPT,e7f9e70d-...)/\EFI\FEDORA\SHIM.EFI
Boot0001  Fedora  HD(1,GPT,e7f9e70d-...)/\EFI\FEDORA\SHIMX64.EFI
Boot0000  Windows Boot Manager  ...

BootCurrent: 0005 — we booted entry 0005, which points to \EFI\FEDORA\SHIM.EFI on partition 1 (the ESP).

The UEFI firmware reads this table, opens the ESP, loads SHIM.EFI, and jumps to it.
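The lookup from BootCurrent to the matching entry is easy to do by eye here, but on machines with many entries a small filter helps. This is a sketch under stated assumptions: current_boot_entry is a hypothetical helper name, and it assumes the output layout shown above (a BootCurrent: line plus BootNNNN entries with four hex digits).

```shell
# current_boot_entry is a hypothetical helper, not part of efibootmgr.
# It reads `efibootmgr` output on stdin and prints the entry we booted from.
current_boot_entry() {
    awk '
        /^BootCurrent:/ { current = $2; next }
        /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ {
            entry[substr($0, 5, 4)] = $0          # key on the four hex digits
        }
        END { if (current in entry) print entry[current] }'
}

# Demo with the output captured on this machine.
current_boot_entry <<'EOF'
BootCurrent: 0005
BootOrder: 0005,0004,0002,0001,0000,0003,0006,0007

Boot0005* Fedora  HD(1,GPT,e7f9e70d-...)/\EFI\FEDORA\SHIM.EFI
Boot0001  Fedora  HD(1,GPT,e7f9e70d-...)/\EFI\FEDORA\SHIMX64.EFI
Boot0000  Windows Boot Manager  ...
EOF
```

On a live system: efibootmgr -v | current_boot_entry.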

Secure Boot and the Shim

Why SHIM.EFI and not grubx64.efi directly?

Secure Boot is a UEFI feature that refuses to execute unsigned binaries. Only code signed by a key in the firmware's database is allowed to run. Microsoft controls the keys in most consumer firmware — which creates a problem for Linux distributions: they can't ship a GRUB2 signed by Microsoft for every distro release.

The solution is a shim: a tiny, Microsoft-signed binary whose only job is to load another bootloader after verifying it against a second key database — one that Red Hat (or Canonical, or SUSE) controls.

UEFI firmware
    │
    │  verifies against Microsoft key database
    ▼
shim.efi          ← signed by Microsoft
    │
    │  verifies against Red Hat key database
    ▼
grubx64.efi       ← signed by Red Hat
    │
    │  verifies against Red Hat key database
    ▼
vmlinuz            ← signed by Red Hat

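With Secure Boot disabled, the shim is not strictly required: a boot entry could point straight at grubx64.efi, making the chain one step shorter. Fedora's boot entry still points at the shim either way (as the efibootmgr output above shows); with Secure Boot off, the signature checks are simply not enforced.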

You can inspect the ESP yourself (you'll need root):

# The ESP is mounted at /boot/efi
ls /boot/efi/EFI/fedora/
# shim.efi  shimx64.efi  grubx64.efi  grub.cfg  ...

/boot vs /boot/efi: a common point of confusion. These are two different things. /boot is either a regular directory on your root filesystem or its own partition (on this machine it is partition 2 of the disk, as the BOOT_IMAGE=(hd0,gpt2) value on the kernel command line shows). It holds kernels, initramfs images, and GRUB config files, typically on ext4 or Btrfs. /boot/efi is where the ESP is mounted: a separate FAT32 partition (FAT32 is required by the UEFI spec) that contains the .efi bootloader binaries. So when you see /boot/efi/EFI/fedora/shim.efi, you're looking at a file on the FAT32 ESP partition, not on your root filesystem.

Most distros (Fedora, Debian, Ubuntu) mount the ESP at /boot/efi. Arch Linux often mounts the ESP directly at /boot, which works fine for single-boot setups. If you're dual-booting or sharing a machine with a Debian-based distro, use /boot/efi as the ESP mount point to avoid conflicts — different distros have different expectations about what lives in /boot.

What UEFI hands to GRUB

UEFI doesn't just load GRUB and forget it. It locates grubx64.efi (or the shim, under Secure Boot) on the FAT32-formatted ESP, runs it as a UEFI application, and passes it a pointer to the EFI System Table. Through that table GRUB, and later the kernel, can reach:

  • The memory map (what physical RAM exists and which regions are usable)
  • A pointer to the UEFI runtime services (which the kernel uses later for things like efivarfs)
  • ACPI tables

Once GRUB is running, UEFI is mostly done.


Phase 2 — GRUB2 Bootloader

GRUB2 (GRand Unified Bootloader version 2) runs as a UEFI application. Its job: find the kernel and initramfs on disk, load them into RAM, set up the kernel command line, and jump to the kernel entry point.

Reading the config

GRUB2 reads /boot/grub2/grub.cfg. On Fedora this file is auto-generated by grub2-mkconfig. You typically never edit it by hand — instead edit /etc/default/grub and regenerate. The config describes menu entries, timeouts, and kernel arguments.

# See the GRUB environment (saved default, etc.)
sudo grub2-editenv list
# saved_entry=...
# kernelopts=root=UUID=... ro rootflags=subvol=root ...
# boot_success=1

The kernel command line

After loading the kernel, GRUB passes a command line string. You can always inspect what was actually used:

$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.19.11-200.fc43.x86_64
  root=UUID=37e2f5e1-66e7-4a8a-a69e-57bcc9a44af2
  ro
  rootflags=subvol=root
  rd.luks.uuid=luks-5ea459ba-b7ec-439a-a3cf-7d25ff3b2889
  rhgb quiet

There's a lot of information here. Let's decode it:

Parameter                          Meaning
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-...  The kernel file GRUB loaded, from partition 2 of disk 0
root=UUID=37e2f5e1-...             The root filesystem to mount after boot
ro                                 Mount root read-only initially (fsck can run; remounted rw later)
rootflags=subvol=root              Btrfs-specific: mount the root subvolume
rd.luks.uuid=luks-5ea459ba-...     Tell initramfs to unlock a LUKS-encrypted device
rhgb                               Red Hat Graphical Boot (Plymouth splash screen)
quiet                              Suppress most kernel messages on console

This one command line tells the whole story of this machine's storage setup: there's a Btrfs filesystem inside a LUKS-encrypted container, and the bootable root is a subvolume within it.

Who actually reads the command line?

The kernel command line is a single string, but it's consumed by multiple components. You might wonder: how does rd.luks.uuid end up in the initramfs while quiet goes to the kernel? The answer is a filtering hierarchy:

  1. Known kernel parameters: If the kernel recognizes the string (root=, ro, quiet, etc.), it uses it to configure itself during start_kernel().

  2. Module parameters: If the parameter name contains a dot (e.g., nvidia-drm.modeset=1), the kernel treats the part before the dot as a module name and passes the value to that module when it loads.

  3. initramfs directives: Parameters prefixed with rd.* are conventions established by dracut. The kernel doesn't interpret them — dracut's scripts inside the initramfs read /proc/cmdline and act on the ones they recognize (like rd.luks.uuid).

  4. systemd parameters: systemd also reads /proc/cmdline when it starts. Parameters like systemd.unit=multi-user.target or systemd.log_level=debug let you configure PID 1 from the bootloader.

  5. The unknowns: Any parameter the kernel doesn't recognize and that contains no dot is handed to PID 1: as an environment variable if it has the form NAME=value, as a command-line argument otherwise. systemd deliberately does not pass these variables on to the services it starts. If you need to set an environment variable for services via the command line, use the explicit systemd.setenv=VAR=VALUE syntax.

The whole thing works because /proc/cmdline is readable by anyone — every component just picks out the parameters it cares about and ignores the rest.
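The filtering hierarchy above can be approximated with a few pattern matches on each parameter name. The sketch below is illustrative only: classify_cmdline is a hypothetical helper name, it guesses the consumer purely from the prefix, and it assumes no parameter value contains spaces. Real components decide for themselves what they recognize, so treat the labels as a first guess.

```shell
# classify_cmdline is a hypothetical helper: it labels each kernel
# command-line parameter with its likely consumer, based on the name alone.
classify_cmdline() {
    # One parameter per line (assumes values contain no spaces).
    tr -s '[:space:]' '\n' | while IFS= read -r param; do
        [ -n "$param" ] || continue
        key=${param%%=*}                     # drop "=value"; values may contain dots
        case "$key" in
            rd.*)       who="initramfs (dracut)" ;;
            systemd.*)  who="systemd" ;;
            *.*)        who="module parameter" ;;
            *)          who="kernel or init" ;;
        esac
        printf '%-18s %s\n' "$who" "$param"
    done
}

# Demo with (shortened) parameters from this machine.
echo 'root=UUID=37e2f5e1-... ro rootflags=subvol=root rd.luks.uuid=luks-5ea459ba-... rhgb quiet' |
    classify_cmdline
```

On a live system: classify_cmdline < /proc/cmdline.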

What GRUB loads

GRUB reads two files from /boot and loads them into RAM:

  1. vmlinuz-6.19.11-200.fc43.x86_64: the compressed kernel image
  2. initramfs-6.19.11-200.fc43.x86_64.img: a compressed archive (70MB on this system)

$ ls -lh /boot/vmlinuz-6.19.11-200.fc43.x86_64
-rwxr-xr-x. 1 root root 18M Apr  2 16:55 /boot/vmlinuz-6.19.11-200.fc43.x86_64

$ ls -lh /boot/initramfs-6.19.11-200.fc43.x86_64.img
-rw-------. 1 root root 70M Apr  2 21:22 /boot/initramfs-6.19.11-200.fc43.x86_64.img

Once both are in RAM, GRUB jumps to the kernel's entry point. GRUB is done.

What about systemd-boot?

GRUB2 isn't the only bootloader in the Linux world. systemd-boot (formerly known as gummiboot) is a simpler alternative that's gaining adoption — Arch Linux, some Ubuntu configurations, and Fedora all support it.

The key differences from GRUB2:

  • No scripting language or config generator. GRUB2 has its own shell, scripting, and grub2-mkconfig. systemd-boot has none of that — it reads simple drop-in files directly.
  • Uses the Boot Loader Specification (BLS). Each kernel gets a small .conf file in the ESP (typically under loader/entries/) that lists the kernel path, initramfs path, and command line options. Adding a kernel means dropping a file; removing one means deleting it. Fedora's GRUB2 reads BLS entries too (from /boot/loader/entries/), so on Fedora the difference is mostly where the files live.
  • Lives entirely on the ESP. GRUB2 reads from both the ESP and /boot (a separate partition). systemd-boot reads everything from the ESP's FAT32 filesystem.
  • No theming or interactive shell. It shows a plain menu and boots. That's it.

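This machine uses GRUB2, so that's what we trace in this post. To check what your own system booted with, run bootctl status and look at the "Current Boot Loader" section; the command prints firmware details on any UEFI system, so output alone doesn't mean systemd-boot is in use. With systemd-boot, the handoff to the kernel works the same way, just with less machinery in between.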


Phase 3 — Kernel Initialization

vmlinuz: what's actually in that file

The kernel image isn't a plain ELF binary. It's a bzImage — a self-extracting compressed archive:

$ file /boot/vmlinuz-6.19.11-200.fc43.x86_64
Linux kernel x86 boot executable, bzImage,
  version 6.19.11-200.fc43.x86_64,
  ZST compressed,
  64-bit EFI handoff entry point

"bzImage" stands for "big zImage" (the b doesn't mean bzip2 — it means "big", as in it can be loaded above 1MB). Modern kernels are compressed with Zstandard (ZST) for faster decompression. When the kernel starts executing, the first thing it does is decompress itself into RAM.

[ GRUB jumps here ]
vmlinuz (compressed)
    │  arch/x86/boot/header.S: boot header and 16-bit setup code
    │    (the 16-bit code runs only on legacy BIOS boots; a UEFI boot enters via the EFI stub)
    │  arch/x86/boot/compressed/head_64.S: decompress kernel
    ▼
vmlinux (uncompressed) loaded into RAM
    │  arch/x86/kernel/head_64.S: 64-bit startup
    │  start_kernel() in init/main.c
    ▼
kernel is running

start_kernel(): the origin of everything

Once decompressed, the kernel calls start_kernel() in init/main.c — one of the most consequential function calls in all of software. It initializes, in roughly this order:

start_kernel()
├── setup_arch()          CPU detection, parse command line, set up memory map
├── mm_core_init()        memory management subsystem (called mm_init() before kernel 6.4)
├── sched_init()          the scheduler (EEVDF, which replaced CFS in kernel 6.6)
├── rcu_init()            RCU (Read-Copy-Update), the kernel's lock-free data structure mechanism
├── init_IRQ()            interrupt controllers
├── time_init()           timekeeping
├── softirq_init()        deferred interrupt processing
├── console_init()        printk to the screen
└── rest_init()           spawn PID 1 and PID 2

You can see this happening in real time:

$ dmesg | head -30
[    0.000000] Linux version 6.19.11-200.fc43.x86_64 (mockbuild@...) gcc 15.2.1
[    0.000000] Command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.19.11-...
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
...

The [ 0.000000] timestamps are in seconds since kernel start. You can watch the kernel build up: memory map, then CPU features, then interrupts, then the scheduler. Every subsystem announces itself.

The memory map (e820)

The first thing the kernel does is ask UEFI (via the data passed by GRUB): "what physical memory is available?" The answer comes as an e820 map (named after BIOS interrupt 0xe820, the ancient API that established the convention):

$ dmesg | grep -i 'e820\|usable\|reserved' | head -15
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000009bfefff] usable
...
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000080e2fffff] usable

The first 640KB (0x00000–0x9ffff) is "usable" — traditional DOS territory. Then 0xa0000–0xfffff is reserved — the old video RAM and ROM region. Nothing runs there. Above 1MB, RAM is usable again. Above 4GB (0x100000000), the rest of RAM is available.
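The sizes quoted above follow directly from the range bounds: both ends are inclusive, so a range covers end - start + 1 bytes. The sketch below does that arithmetic for every e820 line it is given. e820_sizes is a hypothetical helper name, and the parsing assumes the BIOS-e820: [mem start-end] type layout shown above.

```shell
# e820_sizes is a hypothetical helper: it prints the size of every e820
# range found on stdin (for example, from `dmesg | grep BIOS-e820`).
e820_sizes() {
    # Reduce each line to "<start> <end> <type>".
    sed -nE 's/.*\[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\] (.*)$/\1 \2 \3/p' |
    while read -r start end kind; do
        bytes=$(( $end - $start + 1 ))       # both bounds are inclusive
        printf '%d KiB %s %s-%s\n' "$(( bytes / 1024 ))" "$kind" "$start" "$end"
    done
}

# Demo with ranges from this machine.
e820_sizes <<'EOF'
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000080e2fffff] usable
EOF
```

The first range comes out at 640 KiB and the second at 384 KiB, which together make up the first megabyte.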

Spawning PID 1 and PID 2

At the end of start_kernel(), rest_init() creates two kernel threads:

  • PID 1, kernel_init(): will become the init process (eventually executes /usr/lib/systemd/systemd)
  • PID 2, kthreadd: the kernel thread daemon, parent of all other kernel threads

The original start_kernel() thread becomes the idle thread (PID 0) — it runs whenever the CPU has nothing else to do, executing the hlt instruction.

Before PID 1 can exec systemd though, there's an important intermediate step.


Phase 4 — initramfs: The Bootstrap Filesystem

The chicken-and-egg problem

Here's the problem: the kernel needs to mount the root filesystem to run init. But to mount the root filesystem, you might need:

  • Kernel modules for your storage controller (NVMe, SATA, etc.)
  • Tools to unlock a LUKS-encrypted device
  • Tools to assemble a RAID array or LVM volume
  • Scripts to figure out which UUID maps to which /dev/nvme0n1p3

Those modules and tools live on the root filesystem — but the root filesystem isn't mounted yet. Classic chicken-and-egg.

The solution: initramfs (initial RAM filesystem). GRUB loads a minimal temporary root filesystem into RAM alongside the kernel. The kernel unpacks it and mounts it first, the preparatory work happens there, and then the system switches to the real root.

What's in initramfs

The initramfs is a cpio archive (a Unix archiving format, like tar) compressed with zstd. It contains a miniature Linux environment:

$ lsinitrd /boot/initramfs-6.19.11-200.fc43.x86_64.img | head -40
Image: /boot/initramfs-6.19.11-200.fc43.x86_64.img: 70M
========================================================================

# Explore the full contents:
$ lsinitrd /boot/initramfs-6.19.11-200.fc43.x86_64.img | grep -E '\.(ko|service|sh)$' | head -20

On this system the initramfs is 70MB — because it contains:

  • A copy of systemd (yes, systemd runs inside initramfs first)
  • LUKS tools (cryptsetup) — required because the root fs is encrypted
  • Btrfs tools — required to mount the subvolume
  • NVMe and storage drivers
  • A minimal /etc, /dev, /sys, /proc

dracut is Fedora's tool for building initramfs images. It takes a modular approach: you declare which "dracut modules" you need, and it assembles the minimal environment for your specific machine.

dracut runs at three points in a system's life:

  1. During OS installation — the installer calls dracut at the end to build an initramfs tailored to the hardware it just installed onto.
  2. On every kernel update — when dnf installs a new kernel, a post-install hook automatically triggers dracut to build a matching initramfs. Kernel modules are version-specific, so the old initramfs won't work with the new kernel.
  3. Manually, when you change low-level storage configuration: moving to a different machine, enabling LUKS, or installing early-boot drivers (like Nvidia) all require a fresh initramfs.

# Rebuild initramfs for the current kernel (run as root):
dracut --force

# See what dracut modules were included (run as root):
lsinitrd -m /boot/initramfs-$(uname -r).img

How the kernel enters initramfs

After mounting the initramfs as its root filesystem, the kernel executes a single file: /init. Whatever that file is becomes PID 1.

  • In older systems, /init was a shell script that sequentially unlocked storage, mounted the real root, and called pivot_root.
  • In modern dracut-built initramfs images, /init is a symlink to a stripped-down systemd. This early systemd has one job: get to /sysroot. It doesn't start your network, desktop, or user services — just the storage and crypto units needed to mount the real root.

Using systemd here rather than a script enables parallelism: it can simultaneously wait for a USB device to initialize, prompt for a LUKS passphrase, and assemble an LVM/RAID array.

LUKS unlocking in the initramfs

This machine uses full-disk encryption. Look at the kernel command line again:

rd.luks.uuid=luks-5ea459ba-b7ec-439a-a3cf-7d25ff3b2889

rd.* parameters are initramfs directives (the rd prefix comes from dracut: "ram disk"). During the initramfs phase, before the real root is mounted, the system must:

  1. Find the LUKS container by UUID
  2. Prompt you for a passphrase (or use a stored key)
  3. Use cryptsetup luksOpen to create a decrypted device mapper device
  4. Then mount that device as root

This is why initramfs took 15 seconds on this boot — a significant chunk of that is the time for the user to type a passphrase (or for a TPM to unseal a key). On an unencrypted system, initramfs skips the LUKS step entirely, and this phase drops from ~15s to ~2s. The initramfs image is also much smaller since cryptsetup and its dependencies aren't needed.

switch_root: handing off to the real filesystem

Once the real root filesystem is mounted at /sysroot, the initramfs systemd performs switch_root, a three-step operation:

  1. Move: /sysroot becomes the new /. The old initramfs root is displaced.
  2. Free: The initramfs contents are deleted, reclaiming the RAM they occupied.
  3. Exec: The real /usr/lib/systemd/systemd on disk is exec'd, taking over as PID 1.

The key distinction from the lower-level pivot_root syscall: switch_root actively frees the initramfs memory and is performed by systemd inside the initramfs; the kernel doesn't do this on its own. Notably, leaving an initramfs doesn't rely on the pivot_root syscall at all. The initramfs lives in the kernel's rootfs, which pivot_root cannot be applied to, so the new root is moved over / with mount --move (MS_MOVE) instead.


Phase 5 — systemd: The Init System

PID 1's job

systemd is now PID 1. This is significant: PID 1 is special in Linux. It is the parent of all orphaned processes (any process whose parent dies gets reparented to PID 1). It cannot be killed with SIGKILL. If it crashes, the kernel panics.

systemd's core job is to start services in the right order as fast as possible. It reads unit files — declarative descriptions of what to start and what it depends on.

Units and targets

Everything in systemd is a unit. The most common types:

Unit type  Suffix    Purpose
Service    .service  A daemon or one-shot process
Mount      .mount    A filesystem to mount
Device     .device   A kernel device (auto-created from udev)
Socket     .socket   A socket to listen on (for socket activation)
Target     .target   A synchronization point (like a milestone)

Targets are especially important — they replace the old concept of runlevels:

$ systemctl list-units --type=target --state=active
  UNIT                   LOAD   ACTIVE SUB    DESCRIPTION
  basic.target           loaded active active Basic System
  cryptsetup.target      loaded active active Local Encrypted Volumes
  getty.target           loaded active active Login Prompts
  graphical.target       loaded active active Graphical Interface
  local-fs.target        loaded active active Local File Systems
  multi-user.target      loaded active active Multi-User System
  network.target         loaded active active Network
  sysinit.target         loaded active active System Initialization

graphical.target is the final destination on a desktop system — it's reached when everything is ready.

The dependency graph

systemd builds a dependency graph and walks it in parallel. The critical path to graphical.target on this machine:

$ systemd-analyze critical-chain
graphical.target @12.194s
└─multi-user.target @12.194s
  └─plymouth-quit-wait.service @9.496s +2.697s
    └─systemd-user-sessions.service @9.478s +11ms
      └─remote-fs.target @9.460s
        └─network.target @3.455s
          └─wpa_supplicant.service @4.376s +31ms
            └─basic.target @2.218s
              └─dbus-broker.service @2.154s +46ms
                └─sysinit.target @2.133s
                  └─systemd-resolved.service @2.075s +57ms
                    └─local-fs.target @1.927s
                      └─boot-efi.mount @1.907s +19ms
                        └─boot.mount @1.883s +23ms

Read this bottom-up: boot.mount must succeed before boot-efi.mount, which must be mounted before local-fs.target is reached, which in turn precedes systemd-resolved.service, and so on up to graphical.target. The number after @ is when the unit became active; the + number is how long it took to start.
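The + figures mix seconds and milliseconds, which makes the slowest link easy to miss. As a rough sketch, the filter below normalizes them and reports the unit with the largest startup cost. slowest_in_chain is a hypothetical helper name, and it assumes costs are printed either as +Ns or +Nms (longer formats with minutes are ignored).

```shell
# slowest_in_chain is a hypothetical helper: it reads
# `systemd-analyze critical-chain` output on stdin and prints the unit
# with the largest "+" startup cost, in seconds.
slowest_in_chain() {
    awk '
        {
            # Fields look like: <tree prefix + unit> @<activation time> +<cost>
            for (i = 3; i <= NF; i++) {
                if ($i ~ /^[+][0-9.]+ms$/) {
                    cost = substr($i, 2) / 1000      # milliseconds to seconds
                } else if ($i ~ /^[+][0-9.]+s$/) {
                    cost = substr($i, 2) + 0         # already in seconds
                } else {
                    continue
                }
                unit = $(i - 2)
                sub(/^[^A-Za-z0-9]+/, "", unit)      # drop the tree-drawing prefix
                if (cost > max) { max = cost; name = unit }
            }
        }
        END { if (name != "") printf "%s %.3fs\n", name, max }'
}

# Demo with part of the chain from this machine.
slowest_in_chain <<'EOF'
graphical.target @12.194s
└─multi-user.target @12.194s
  └─plymouth-quit-wait.service @9.496s +2.697s
    └─systemd-user-sessions.service @9.478s +11ms
      └─remote-fs.target @9.460s
EOF
```

On a live system: systemd-analyze critical-chain | slowest_in_chain. For the chain above it points at plymouth-quit-wait.service, which mostly waits for the splash screen to finish rather than doing work of its own.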

What sysinit.target does

sysinit.target is the first major milestone — basic system setup. By the time it's reached:

  • /proc, /sys, /dev are all mounted
  • udev has run and populated /dev with device nodes
  • The system clock is set
  • Cryptographic volumes are open
  • systemd-journald is running (the journal)

Socket activation: starting services lazily

One of systemd's clever tricks: socket activation. Instead of starting dbus-broker.service immediately and making everything else wait until it is ready, systemd can:

  1. Create and bind the D-Bus socket immediately (instant)
  2. Queue any messages sent to that socket
  3. Start dbus-broker.service only when something actually connects
  4. Deliver the queued messages once the service is ready

Callers never see a connection failure: from their perspective the socket was always there, and the first request simply takes a little longer to be answered. This is how systemd achieves fast parallel startup: almost everything can start "at the same time" without actually racing.


Phase 6 — Login and Your Shell

getty and the TTY

As part of reaching multi-user.target, systemd starts getty processes (via getty.target). A getty (get tty) opens a terminal device, prints the login prompt, and waits for input.

On a system without a graphical desktop you'd log in on one of the virtual consoles, tty1–tty6. On a desktop, a display manager (like GDM for GNOME) handles the graphical login instead.

# Check the getty on the first virtual console
systemctl status getty@tty1.service

The login chain

When you type your password at a text login:

getty (opens /dev/tty1, prints "login: ")
    │
    ▼
login binary
    │  calls PAM (Pluggable Authentication Modules)
    │  PAM checks /etc/passwd, /etc/shadow, optionally LDAP, fingerprint, etc.
    ▼
user session created
    │  PAM runs pam_systemd.so → registers session with systemd-logind
    │  PAM runs pam_env.so → loads environment variables
    │  login sets HOME, SHELL, PATH, then starts your shell as a login shell
    ▼
exec $SHELL   ← your shell, finally

Your shell (bash, zsh, fish) sources its startup files (/etc/profile and ~/.bash_profile for a bash login shell, ~/.zshrc for zsh, and so on) and you get a prompt. The boot is complete.


Try It Yourself

Here are the commands that let you observe the boot sequence from inside a running system:

1. How long did each phase take?

systemd-analyze
# Startup finished in 5.884s (firmware) + 2.696s (loader) + 1.614s (kernel)
#              + 15.109s (initrd) + 12.194s (userspace) = 37.501s

2. What took the longest to start?

systemd-analyze blame | head -20

3. Visualize the dependency graph

# Generate an SVG of the full boot dependency graph
systemd-analyze plot > boot.svg
# Open with a browser or image viewer

4. What's in your initramfs?

lsinitrd /boot/initramfs-$(uname -r).img | less

5. What did the kernel print during boot?

# Current boot
dmesg | less

# With timestamps as wall clock time
dmesg -T | less

# Stored in the journal (survives reboots); add -k to show kernel messages only
journalctl -b 0       # this boot
journalctl -b -1      # previous boot
journalctl -b -1 -p err   # only errors from last boot

6. What EFI boot entries exist?

efibootmgr -v

7. Examine the kernel image itself

file /boot/vmlinuz-$(uname -r)
# Linux kernel x86 boot executable, bzImage, ZST compressed ...

8. What was the actual kernel command line?

cat /proc/cmdline

9. Trace a single boot unit

# See exactly when and how a unit started
systemd-analyze critical-chain NetworkManager.service

# See the log output from a unit during boot
journalctl -b -u NetworkManager.service

10. Profile a specific unit

systemd-analyze blame | grep -i luks
# If LUKS is slow, this shows up in the blame list

Putting It All Together

Let's revisit the original diagram, but now with the full picture filled in:

Power on
    │
    ▼  UEFI firmware runs POST
       Reads EFI variable: boot entry 0005 → \EFI\FEDORA\SHIM.EFI
       Loads shim.efi (Microsoft-signed), verifies signature
       shim loads grubx64.efi (Red Hat-signed), verifies signature
    │
    ▼  GRUB2 reads /boot/grub2/grub.cfg
       Presents boot menu (1 second timeout on this machine)
       Loads vmlinuz-6.19.11 and initramfs-6.19.11.img into RAM
       Sets kernel command line
       Calls EFI handoff entry point in vmlinuz
    │
    ▼  Kernel decompresses itself (ZST → vmlinux in RAM)
       Processes e820 memory map from UEFI
       start_kernel(): initializes mm, scheduler, IRQs, console
       Mounts initramfs as root filesystem
       Spawns PID 1 (kernel_init) and PID 2 (kthreadd)
    │
    ▼  initramfs systemd runs
       Finds LUKS container (UUID: 5ea459ba-...)
       Prompts for passphrase / unseals TPM key
       cryptsetup luksOpen → creates /dev/mapper/luks-...
       Mounts Btrfs filesystem, subvolume "root", at /sysroot
       switch_root: /sysroot becomes new /, initramfs freed from RAM
       Exec real /usr/lib/systemd/systemd
    │
    ▼  systemd (PID 1) reads unit files
       Builds dependency graph
       Starts sysinit.target in parallel
       Starts local-fs.target, network.target, ...
       Reaches multi-user.target, then graphical.target
    │
    ▼  GDM (display manager) starts
       OR getty opens /dev/tty1
       login → PAM authentication
       Session registered with systemd-logind
       Exec bash/zsh/fish
    │
    ▼
$ _

Every character in that $ _ represents a chain of decisions made by UEFI, GRUB, the kernel, dracut, and systemd — all working together, most of it in well under a minute.


What's Next

In the next post, we'll go deeper into something start_kernel() sets up: processes. What is a process really? What does the kernel actually store for each one? How does fork() work at the kernel level? And how does the scheduler decide which process runs next?


Part of the Linux Deep Dive series.

Linux Deep Dive #2: systemd — The Init System and Beyond

Target: Fedora 43, kernel 6.19.11. Every command in this post is something you can run yourself.


The boot sequence chapter ended with systemd becoming PID 1. But that was just the handoff. What does systemd actually do once it's in charge?

The short answer: everything. systemd is not just an init system — it's a suite of components that manages services, mounts filesystems, handles logging, schedules jobs, tracks login sessions, and enforces resource limits. Understanding it is understanding how a modern Linux system is organized.


Why PID 1 Is Special

Before diving into systemd, it's worth understanding why PID 1 matters at all.

In Linux, every process has a parent. When a parent process dies, its children are reparented — adopted by PID 1. This makes PID 1 the universal parent of all orphaned processes. It's also the only process that cannot be killed with SIGKILL. If PID 1 dies, the kernel panics. The whole system depends on it staying alive and responsive.

The traditional Unix init (SysV init) was simple: a shell script runner that started services sequentially from /etc/rc.d/. Each service started only after the previous one finished. On modern hardware with many services, this was slow. It also had no standard way to track whether a service was actually running, restart it if it crashed, or collect its logs.

systemd was designed to fix all of this.


The Unit: systemd's Fundamental Building Block

Everything in systemd is described by a unit file — a declarative INI-style text file that tells systemd what something is, what it needs, and how to manage it.

Anatomy of a unit file

Here's the real unit file for sshd on this Fedora machine:

[Unit]
Description=OpenSSH server daemon
Documentation=man:sshd(8) man:sshd_config(5)
After=network.target sshd-keygen.target
Wants=sshd-keygen.target

[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/sshd
ExecStart=/usr/sbin/sshd -D $OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartSec=42s

[Install]
WantedBy=multi-user.target

A service unit file like this one has a three-section structure:

[Unit] — metadata and dependencies. The most important directives:

Directive      Meaning
Description=   Human-readable name shown in systemctl status
After=         Ordering only: if both are being started, start this unit after the listed units
Before=        Ordering only: start this unit before the listed units
Wants=         Weak dependency: try to start these, but don't fail if they don't start
Requires=      Hard dependency: if these fail to start, this unit fails too
BindsTo=       Like Requires=, but also stop this unit if the dependency stops

[Service] (or [Timer], [Socket], etc.) — how to run it:

Directive            Meaning
Type=notify          The process signals systemd when it's fully ready (via sd_notify())
ExecStart=           The command to run
ExecReload=          How to reload config without restarting
Restart=on-failure   Automatically restart if the process exits with a non-zero status, is killed by a signal, or times out
RestartSec=42s       Wait 42 seconds before restarting
KillMode=process     Only kill the main process on stop, not its children

[Install] — when to activate this unit:

Directive                    Meaning
WantedBy=multi-user.target   systemctl enable links this unit into multi-user.target.wants/, so it starts whenever multi-user.target is activated

The [Install] section only matters for systemctl enable — it's how systemd knows which target to hook a unit into.

Where unit files live

systemd loads unit files from several locations, in priority order:

/etc/systemd/system/        ← administrator overrides (highest priority)
/run/systemd/system/        ← runtime-generated units
/usr/lib/systemd/system/    ← vendor/package-provided units (lowest priority)

If you want to override a vendor unit, don't edit /usr/lib/ directly — create a drop-in file:

# Create a drop-in directory for the sshd service
mkdir -p /etc/systemd/system/sshd.service.d/

# Write an override — only the directives you want to change
cat > /etc/systemd/system/sshd.service.d/override.conf << 'EOF'
[Service]
RestartSec=10s
EOF

# Tell systemd to reload its config
systemctl daemon-reload

Drop-ins are merged with the original unit file. Your changes survive package updates.


Unit Types

systemd has several unit types, each with its own suffix and section:

Type      Suffix     Purpose
Service   .service   A daemon or one-shot process
Socket    .socket    A socket for socket activation
Timer     .timer     A scheduled job (cron replacement)
Mount     .mount     A filesystem mount point
Device    .device    A kernel device (auto-generated from udev)
Target    .target    A synchronization milestone
Path      .path      Trigger a service when a path changes
Slice     .slice     A cgroup resource boundary
Scope     .scope     A group of externally-started processes

Dependencies and Targets

The dependency graph

When systemd starts, it doesn't just run services one by one. It builds a dependency graph from all unit files, then walks the graph in parallel — starting everything it can simultaneously, waiting only where ordering (After=/Before=) requires it.

The key insight: Wants=/Requires= express what gets pulled in alongside a unit, while After=/Before= express the order in which things start. These are orthogonal. A unit can require another without ordering itself after it (they'd start in parallel), or be ordered after something it doesn't strictly require.

Targets as milestones

Targets are units with no [Service] section — they're pure synchronization points. Other units declare WantedBy=some.target, and systemd activates those units when pulling in that target.

This replaces the old SysV concept of runlevels (single-user, multi-user, graphical, etc.). The mapping:

SysV runlevel   systemd target
1               rescue.target
3               multi-user.target
5               graphical.target
6               reboot.target

On this machine, graphical.target is the default — it pulls in multi-user.target, which pulls in hundreds of other units:

$ systemctl get-default
graphical.target

$ systemctl list-dependencies graphical.target | head -20
graphical.target
● ├─accounts-daemon.service
● ├─gdm.service
● ├─rtkit-daemon.service
● ├─systemd-update-utmp-runlevel.service
● └─multi-user.target
●   ├─atd.service
●   ├─auditd.service
●   ├─avahi-daemon.service
●   ├─chronyd.service
●   ├─crond.service
●   ├─dbus-broker.service
...

To switch to a different target temporarily (without rebooting):

# Drop to multi-user (no graphical session)
systemctl isolate multi-user.target

# Enter emergency shell
systemctl isolate emergency.target

Service Lifecycle

A service unit can be in several states. The full lifecycle:

inactive
    │
    │  systemctl start
    ▼
activating   ← ExecStartPre= commands running
    │
    │  ExecStart= process running, waiting for ready signal
    ▼
active (running)
    │
    │  systemctl stop  OR  process exits unexpectedly
    ▼
deactivating  ← ExecStop= and ExecStopPost= commands running
    │
    ▼
inactive  (or → failed if it crashed and Restart= doesn't apply)

You can inspect this at any time:

$ systemctl status NetworkManager.service
● NetworkManager.service - Network Manager
     Loaded: loaded (/usr/lib/systemd/system/NetworkManager.service; enabled; preset: enabled)
     Active: active (running) since Thu 2026-04-16 13:44:57 CST; 1h 20min ago
   Main PID: 1618 (NetworkManager)
      Tasks: 4 (limit: 37682)
     Memory: 12.2M
        CPU: 1.107s
     CGroup: /system.slice/NetworkManager.service
             └─1618 /usr/bin/NetworkManager --no-daemon

The key fields: Loaded (was the unit file found and parsed?), Active (is it running?), Main PID (what process is it?), CGroup (where in the cgroup hierarchy does it live?).

Service types

The Type= directive controls how systemd knows a service is "ready":

Type      How systemd knows it's ready
simple    As soon as the main process has been forked (the default when ExecStart= is set)
notify    The process calls sd_notify() with READY=1
dbus      The process acquires its D-Bus name
forking   The process forks and the parent exits (old-style daemons)
oneshot   The process runs to completion and exits
idle      Like simple, but delays start until all jobs are dispatched

notify is preferred for modern daemons — it means systemd actually knows the service is ready to accept connections, not just that it started.


Socket Activation

One of systemd's most useful features: socket activation. Instead of starting a service immediately and waiting for it to be ready, systemd can:

  1. Create and bind the socket before the service starts (instant)
  2. Queue any incoming connections
  3. Start the service only when something actually connects
  4. Hand the socket's file descriptor to the service process

From a client's perspective, the socket is always there. The service only actually starts when needed.

The D-Bus socket is the canonical example. Here's its socket unit:

# /usr/lib/systemd/system/dbus.socket
[Unit]
Description=D-Bus System Message Bus Socket

[Socket]
ListenStream=/run/dbus/system_bus_socket

[Install]
WantedBy=sockets.target

systemd creates /run/dbus/system_bus_socket at boot (part of sockets.target, which is reached very early). The actual dbus-broker.service only starts when something tries to connect. This is why D-Bus is available the moment any service needs it, without serializing on dbus-broker.service startup.

Socket activation also enables parallel startup without races: two services that both need D-Bus can start simultaneously, because they'll both see the socket immediately and block until dbus-broker is ready.


Timers: The cron Replacement

systemd timers replace cron for scheduling recurring jobs. They have two parts: a .timer unit (the schedule) and a matching .service unit (what to run).

Here are the timers active on this system:

$ systemctl list-timers
UNIT                         NEXT                            LEFT
dnf-makecache.timer          Thu 2026-04-16 22:12:09 CST     7h left
fstrim.timer                 Mon 2026-04-20 00:00:00 CST     3 days left
logrotate.timer              Fri 2026-04-17 00:00:00 CST     9h left
plocate-updatedb.timer       Fri 2026-04-17 00:00:00 CST     9h left
systemd-tmpfiles-clean.timer Fri 2026-04-17 00:23:36 CST     9h left

The systemd-tmpfiles-clean.timer unit:

[Unit]
Description=Daily Cleanup of Temporary Directories

[Timer]
OnBootSec=15min
OnUnitActiveSec=1d

OnBootSec=15min — run 15 minutes after boot. OnUnitActiveSec=1d — run again every 24 hours after that. No crontab syntax, no minute/hour/day columns to memorize.

Timer directives:

Directive          Meaning
OnBootSec=         Time after system boot
OnUnitActiveSec=   Time after the unit this timer activates was last started
OnCalendar=        A calendar expression (like cron, but readable: daily, weekly, Mon *-*-* 02:00:00)
Persistent=true    For OnCalendar= timers: if a run was missed (system was off), run immediately on next boot

The advantage over cron: the corresponding .service unit gets full systemd treatment — logging to the journal, resource limits, failure tracking.


journald: Structured Logging

Traditional syslog wrote plain text to files. systemd-journald writes structured, binary logs with metadata attached to every message:

  • Timestamp (microsecond precision)
  • Unit name
  • PID, UID, GID
  • SELinux context
  • Kernel boot ID (so you can distinguish logs from different boots)

# All logs from this boot
journalctl -b 0

# Logs from last boot
journalctl -b -1

# Follow logs in real time (like tail -f)
journalctl -f

# Only from one service
journalctl -u NetworkManager.service

# Only errors and worse
journalctl -p err

# Since a specific time
journalctl --since "2026-04-16 13:00:00"

# Kernel messages only (like dmesg)
journalctl -k

The journal is queryable in ways plain text files aren't. You can filter by any metadata field:

# All messages from PID 1618
journalctl _PID=1618

# All messages with a specific syslog identifier
journalctl SYSLOG_IDENTIFIER=NetworkManager

Journal files live in /var/log/journal/ (persistent across reboots) or /run/log/journal/ (volatile, lost on reboot). On this machine:

$ journalctl --disk-usage
Archived and active journals take up 264.0M in the file system.

cgroups: Resource Isolation

Every service systemd manages runs inside a cgroup (control group) — a kernel mechanism for grouping processes and enforcing resource limits.

You can see the cgroup hierarchy right now:

$ systemd-cgls
CGroup /:
-.slice
├─system.slice
│ ├─NetworkManager.service
│ │ └─1618 /usr/bin/NetworkManager --no-daemon
│ ├─dbus-broker.service
│ │ └─1432 /usr/bin/dbus-broker ...
│ └─...
├─user.slice
│ └─user-1000.slice
│   └─user@1000.service
│     └─...
└─init.scope
  └─1 /usr/lib/systemd/systemd

The tree reflects the systemd unit hierarchy:

  • system.slice — all system services
  • user.slice — all user sessions
  • init.scope — PID 1 itself

Because each service has its own cgroup, systemd can:

  • Track all processes in a service, even ones the main process forked — no more daemons escaping their service
  • Kill a service cleanly: SIGTERM, then SIGKILL, to every process in the cgroup, not just the main PID
  • Enforce resource limits set in the unit file

You can add resource limits directly in a unit's [Service] section:

[Service]
MemoryMax=512M
CPUQuota=25%
TasksMax=100

This tells the kernel: this service gets at most 512MB RAM, 25% of one CPU core, and 100 processes/threads. No external tooling needed.


Try It Yourself

1. Inspect any service

systemctl status <service-name>

# Examples:
systemctl status NetworkManager.service
systemctl status gdm.service

2. Follow what systemd is doing right now

journalctl -f

3. See all running services

systemctl list-units --type=service --state=running

# Live per-cgroup memory and CPU usage
systemd-cgtop

4. Find the slowest services to start

systemd-analyze blame

5. Trace what a service waited on during boot

systemd-analyze critical-chain <service>
# Example:
systemd-analyze critical-chain NetworkManager.service

6. List all active timers and when they next fire

systemctl list-timers

7. Read a unit file

systemctl cat sshd.service
# Shows the file as loaded (including drop-ins)

8. Create a drop-in override

systemctl edit sshd.service
# Opens $EDITOR with the right drop-in path pre-created

9. See the full cgroup tree

systemd-cgls

10. Query structured journal fields

# List all field names you can filter on
journalctl -N

# List every value one field takes, e.g. all unit names seen in the journal
journalctl -F _SYSTEMD_UNIT

# All log entries from NetworkManager, only errors
journalctl -u NetworkManager.service -p err

Putting It All Together

systemd's design reflects a single philosophy: everything is a unit, everything is tracked, everything is declared. Rather than shell scripts with implicit ordering, you write unit files that explicitly state dependencies. Rather than ad-hoc logging, every process writes to the journal with metadata. Rather than hoping services stay running, you declare Restart=on-failure and systemd handles it.

systemd (PID 1)
    │
    ├── reads all unit files from /usr/lib/systemd/system/ and /etc/systemd/system/
    │
    ├── builds dependency graph (Wants/Requires + After/Before)
    │
    ├── activates sockets early (socket activation)
    │
    ├── walks graph in parallel toward default.target
    │   ├── sysinit.target   → mounts, udev, journal, crypto
    │   ├── basic.target     → sockets, timers, paths
    │   ├── multi-user.target → all system services
    │   └── graphical.target → display manager
    │
    ├── places each service in its own cgroup
    │
    ├── collects all stdout/stderr into journald
    │
    └── monitors services, restarts on failure

The next chapter goes one level deeper: what is a process, what does the kernel actually store for each one, and how does fork() work under the hood?



Linux Deep Dive #3: Processes — fork, exec, and the Process Table

Target: Fedora 43, kernel 6.19.11. Every command in this post is something you can run yourself.


You've been using processes since you first typed a command. But what is a process, really?

Not "a running program" — that's the textbook answer, and it's too vague to be useful. A process is a specific data structure in kernel memory, a virtual address space, a set of open file descriptors, and a position in a tree that traces back to PID 1. Understanding what the kernel actually stores — and what happens when you call fork() or exec() — gives you a mental model that makes everything else click: why forking is fast, how exec replaces a program without changing its PID, why zombie processes exist, and what threads actually are under the hood.

This post traces the lifecycle of a process from fork() to exit(), using the kernel source and /proc as our specimen jars.


The Process at a Glance

Before drilling in, here's the lifecycle we'll trace:

Parent process calls fork()
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│  fork()                                                      │
│  Kernel creates a new task_struct for the child             │
│  CoW: parent and child share physical pages (read-only)     │
│  Child gets a new PID; parent gets child's PID back         │
└─────────────────────────────────────────────────────────────┘
    │
    │  (child continues here — fork() returned 0)
    ▼
┌─────────────────────────────────────────────────────────────┐
│  exec()                                                      │
│  Child replaces its virtual address space with new program  │
│  Old code/data/stack gone; new ELF loaded                   │
│  PID unchanged, some file descriptors inherited             │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│  Process runs                                                │
│  Scheduled on CPU, makes syscalls, handles signals          │
│  task_struct tracks state: R, S, D, T, Z                   │
└─────────────────────────────────────────────────────────────┘
    │
    │  process calls exit() or is killed
    ▼
┌─────────────────────────────────────────────────────────────┐
│  Zombie state                                                │
│  Memory freed, but task_struct remains                      │
│  Parent calls wait() → kernel reaps the zombie              │
└─────────────────────────────────────────────────────────────┘

The task_struct: What the Kernel Stores

Every process and thread on a Linux system has exactly one task_struct — the kernel's complete record of that execution context. It lives in kernel memory, allocated when the process is created and freed when the parent reaps it with wait().

The task_struct is defined in include/linux/sched.h. It's large — hundreds of fields. The ones that matter most:

struct task_struct {
    /* State */
    unsigned int        __state;     /* TASK_RUNNING, TASK_INTERRUPTIBLE, ... */

    /* Identity */
    pid_t               pid;         /* process ID */
    pid_t               tgid;        /* thread group ID (= pid for single-threaded) */

    /* Family */
    struct task_struct *parent;      /* parent process */
    struct list_head    children;    /* list of child processes */
    struct list_head    sibling;     /* position in parent's children list */

    /* Memory */
    struct mm_struct   *mm;          /* virtual address space descriptor */

    /* Files */
    struct files_struct *files;      /* open file descriptor table */
    struct fs_struct    *fs;         /* filesystem context: cwd, root, umask */

    /* Credentials */
    const struct cred  *cred;        /* UID, GID, capabilities */

    /* Signals */
    struct signal_struct *signal;    /* signal state shared by the thread group */
    struct sighand_struct *sighand;  /* signal handlers */

    /* Scheduling */
    int                 prio;        /* scheduling priority */
    struct sched_entity se;          /* se.vruntime: fair-scheduler virtual runtime (more in Chapter 5) */
};

This is the nucleus of the entire process model. When you run ps, it reads /proc/[pid]/stat, which is synthesized from the task_struct. When the scheduler picks the next process to run, it's choosing between task_structs. When you send a signal with kill, the kernel sets a bit in the target's task_struct.

The mm pointer (memory descriptor) points to mm_struct, which describes the entire virtual address space. The files pointer points to the open file descriptor table. Both of these can be shared between task_structs — that's what threads are, as we'll see.


Exploring Processes via /proc

The kernel exposes every task_struct through the /proc filesystem. For every running process, there's a directory /proc/[pid]/:

$ ls /proc/1/
attr/     cmdline  environ  fd/   maps  mountstats  ns/      root
cgroup    comm     exe      io    mem   net/         oom_adj  smaps
coredump_filter    gid_map  ...   mounts             pagemap  stat  status

The most useful files:

# What command is PID 1?
$ cat /proc/1/cmdline | tr '\0' ' '
/usr/lib/systemd/systemd --switched-root --system --deserialize 31

# Human-readable status
$ cat /proc/1/status
Name:   systemd
Umask:  0000
State:  S (sleeping)
Tgid:   1
Pid:    1
PPid:   0
...
VmRSS:  15348 kB
Threads:        1

# All open file descriptors of NetworkManager
$ ls -la /proc/1618/fd/ | head -8
lrwx------. 1 root root 64 Apr 23 09:12 0 -> /dev/null
lrwx------. 1 root root 64 Apr 23 09:12 1 -> /dev/null
lrwx------. 1 root root 64 Apr 23 09:12 2 -> /dev/null
lrwx------. 1 root root 64 Apr 23 09:12 4 -> 'socket:[28431]'
lrwx------. 1 root root 64 Apr 23 09:12 5 -> 'socket:[28432]'

(PID 1618 is NetworkManager on this system, as established in Chapter 2.)

The /proc/[pid]/maps file shows the full virtual address space:

$ cat /proc/1618/maps | head -12
55f8e4d20000-55f8e4d7a000 r--p 00000000 fd:01 524302  /usr/bin/NetworkManager
55f8e4d7a000-55f8e5012000 r-xp 0005a000 fd:01 524302  /usr/bin/NetworkManager
55f8e5012000-55f8e5134000 r--p 002f2000 fd:01 524302  /usr/bin/NetworkManager
55f8e5134000-55f8e5141000 rw-p 00413000 fd:01 524302  /usr/bin/NetworkManager
55f8e67a3000-55f8e6a1b000 rw-p 00000000 00:00 0       [heap]
7f9b3c000000-7f9b3c021000 rw-p 00000000 00:00 0
7ffce2a3c000-7ffce2a5e000 rw-p 00000000 00:00 0       [stack]
7ffce2b76000-7ffce2b78000 r-xp 00000000 00:00 0       [vdso]

Each line is a VMA (Virtual Memory Area) — a contiguous range of virtual addresses with a single set of permissions. r-xp means read + execute + private (not shared). The executable has separate VMAs for the read-only text segment, the writable data segment, the heap, and the stack. Chapter 4 goes deep on this.


The Process Tree

Every process except PID 1 has a parent. The parent-child relationship forms a tree rooted at systemd:

$ pstree -p | head -20
systemd(1)─┬─ModemManager(1203)─┬─{ModemManager}(1221)
           │                     └─{ModemManager}(1222)
           ├─NetworkManager(1618)─┬─{NetworkManager}(1636)
           │                      └─{NetworkManager}(1637)
           ├─dbus-broker(1432)
           ├─gdm(1801)─┬─gdm-session-wor(2015)
           │            └─{gdm}(1803)
           └─...

The {...} entries, such as {NetworkManager}(1636), are threads. Multiple task_structs sharing the same memory space appear under their process as thread entries.

One family is absent from this tree: kernel threads. kworker, ksoftirqd, kswapd0, and the rest all descend from PID 2 (kthreadd), whose own parent is PID 0 rather than systemd, so a tree rooted at PID 1 never shows them. They never enter user space. Their mm pointer is NULL; they only ever run kernel code. To list them, run ps --ppid 2.

You can walk the parent chain manually through /proc:

# Who is NetworkManager's parent?
$ cat /proc/1618/status | grep PPid
PPid:   1

# Shell in a shell: check your current PID and its parent
$ echo $$
47234

$ cat /proc/47234/status | grep -E '^(Pid|PPid)'
Pid:    47234
PPid:   47198

$ cat /proc/47198/status | grep -E '^(Name|Pid)'
Name:   bash
Pid:    47198

fork(): Creating a New Process

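fork() is the classic way to create a new process on Linux. The alternatives (vfork(), posix_spawn(), and a direct clone() call) all funnel into the same kernel code path. Apart from PID 1 and kthreadd, which the kernel itself starts during boot, every process you've ever seen (every shell command, every daemon, every application) was created this way.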

The syscall

At the C library level, fork() is a function that returns twice: once in the parent, once in the child:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        // We're in the child: fork() returned 0
        printf("child PID: %d\n", getpid());
        exit(0);
    } else if (pid > 0) {
        // We're in the parent: pid is the child's PID
        printf("parent sees child PID: %d\n", pid);
        wait(NULL);  // wait for child to exit
    } else {
        // pid == -1: fork failed (ENOMEM, EAGAIN, etc.)
        perror("fork");
        return 1;
    }
    return 0;
}

Under the hood, glibc's fork() calls the clone() syscall — the real workhorse. fork() is clone() with a specific set of flags:

// What fork() does internally (simplified):
clone(SIGCHLD, 0);

clone() gives fine-grained control over what the child shares with the parent. fork() shares nothing (new memory space, new FD table copy). pthread_create() shares almost everything. Same syscall, different flags.

What the kernel does on fork()

The kernel executes copy_process() in kernel/fork.c. The steps:

clone() syscall
    │
    ├── Allocate a new task_struct for the child
    │
    ├── Copy the parent's task_struct fields
    │   ├── pid  ← new PID, assigned from the PID namespace
    │   ├── parent ← the calling task (what getppid() later reports)
    │   └── state ← TASK_RUNNING
    │
    ├── copy_mm() ── duplicate the virtual address space (CoW — see below)
    │
    ├── copy_files() ── copy the file descriptor table
    │   (child gets the same FD numbers, pointing to same open file entries)
    │
    ├── copy_fs() ── copy the filesystem context
    │   (child inherits parent's cwd, root, umask)
    │
    ├── copy_sighand() ── copy signal handlers
    │
    └── Place child on the scheduler's run queue

After copy_process() returns, the kernel returns the child's PID to the parent, and the child is placed on the run queue. Both parent and child are now runnable; the scheduler decides which goes first.

The child starts where fork() was called

The child doesn't start at main(). It starts at the exact instruction after fork() returned, with a copy of the parent's entire register state: stack pointer, instruction pointer, everything. This is how fork can "return twice": from the child's perspective it woke up having just returned from a syscall with 0 in the return-value register (rax on x86-64), so fork() returns 0.


Copy-on-Write: Why fork() Is Fast

Duplicating a process's virtual address space sounds expensive — a process might have gigabytes of memory mapped. Copying it all on every fork() would make shells unusable.

The solution is copy-on-write (CoW).

When fork() calls copy_mm(), it doesn't copy physical memory pages. Instead:

  1. The parent's page table entries are duplicated — the child gets its own page table pointing to the same physical pages.
  2. Both the parent's and child's PTEs for those pages are marked read-only.
  3. If either process writes to a shared page, a page fault fires.
  4. The page fault handler sees this is a CoW fault: it allocates a new physical page, copies the content, updates the faulting process's PTE to point to the new page (with write permission restored), and resumes execution.

The result: fork() costs roughly the time to copy the page table structures — not the pages themselves. You pay only for pages that actually get written after the fork.

Before fork():
Parent VMAs                  Physical RAM
  PTE: 0x1000 → frame A ←── [frame A: code/data]
  PTE: 0x2000 → frame B ←── [frame B: data]

After fork() — CoW setup:
Parent VMAs                  Physical RAM
  PTE: 0x1000 → frame A (ro) ←── [frame A: code/data]   ← same frame
  PTE: 0x2000 → frame B (ro) ←── [frame B: data]         ← same frame
Child VMAs                           ↑
  PTE: 0x1000 → frame A (ro) ────────┘
  PTE: 0x2000 → frame B (ro) ─────────────────────────────┘

After child writes to 0x2000 (page fault → CoW copy):
Parent VMAs                  Physical RAM
  PTE: 0x1000 → frame A (ro) ←── [frame A: code/data]   (unchanged)
  PTE: 0x2000 → frame B (ro) ←── [frame B: original]    (unchanged)
Child VMAs
  PTE: 0x1000 → frame A (ro) ←── [frame A: code/data]   (still shared)
  PTE: 0x2000 → frame C (rw) ←── [frame C: modified]    (new copy)

If you immediately call exec() after fork() — the common case for shells — the child's address space is thrown away before it writes anything. CoW means this common path allocates almost no physical memory at all.


exec(): Replacing the Process Image

fork() creates a copy. exec() replaces a process's program with a different one.

The syscall

execve("/usr/bin/ls", (char *[]){"/usr/bin/ls", "-la", NULL},
       environ);
// If execve returns, something went wrong
perror("execve");
exit(127);

execve() takes three arguments: the path to the new program, the argument array (argv), and the environment array (envp). If it succeeds, it never returns — the calling process's virtual address space is completely replaced.

What the kernel does on exec()

The kernel's do_execve() in fs/exec.c:

execve() syscall
    │
    ├── Open and read the target file
    │   Starts with #! (shebang)? → redirect to the interpreter
    │   Looks like an ELF binary? → call the ELF loader
    │
    ├── Flush the old virtual address space
    │   (all VMAs unmapped, page tables cleared, mm_struct reset)
    │
    ├── Load the new ELF binary:
    │   ├── Map text segment (r-xp) into virtual memory
    │   ├── Map data segment (rw-p) into virtual memory
    │   ├── Set up a new stack
    │   └── Dynamically linked? → map ld.so first; it loads libraries
    │
    ├── Push argc, argv, envp onto the new stack
    │
    ├── Set instruction pointer to the ELF entry point
    │
    └── Return to user space at _start()

The key: the PID does not change. The task_struct is the same one. What changed is the mm field — the old memory descriptor is gone, replaced with one describing the new program's address space.

# exec() replaces the program but keeps the PID
$ echo $$
47234

$ exec bash       # replace this bash with a new bash
$ echo $$
47234             # same PID — exec() doesn't create a new process

What exec() preserves

Despite replacing everything, exec carries forward some state:

Preserved                                  Discarded
PID, PPID                                  Virtual address space
Open file descriptors (unless O_CLOEXEC)   Memory mappings
Real UID/GID                               Signal handlers (reset to default)
Ignored signals
cwd, root directory
Resource limits (ulimit)

The O_CLOEXEC flag tells the kernel to close an FD when exec happens. Modern code sets it on almost every FD that shouldn't leak into child programs — sockets, pipe ends, anything sensitive.


The fork + exec Pattern

The standard Unix pattern for running a program is the fork-exec idiom:

pid_t pid = fork();
if (pid == 0) {
    // Child: set up redirections/pipes, then exec
    dup2(pipe_fd[1], STDOUT_FILENO);  // redirect stdout to pipe
    close(pipe_fd[0]);
    close(pipe_fd[1]);
    execvp("ls", argv);
    exit(127);  // only reached if exec fails
} else {
    // Parent: manages the child
    close(pipe_fd[1]);
    read(pipe_fd[0], buf, sizeof(buf));
    wait(NULL);
}

This is exactly how your shell works. When you type ls -la:

bash (PID 47234)
    │  calls fork()
    ├──────────────────────────────────┐
    │                                  ▼
    │                         bash (PID 47235)  ← child copy
    │                                  │  sets up redirections
    │                                  │  calls execve("/usr/bin/ls", ...)
    │                                  ▼
    │                         ls (PID 47235)    ← same PID, new program
    │                                  │  runs, writes output
    │                                  │  exit(0)
    │                                  │
    │  wait() returns ─────────────────┘
    ▼
bash (PID 47234)  ← resumes with exit status

The genius of this design: all the setup work (redirections, pipe wiring, environment changes) happens in the child after fork but before exec. The parent doesn't have to know about any of it.

You can watch every execve() call a command makes:

$ strace -f -e trace=execve bash -c 'ls /tmp' 2>&1
execve("/bin/bash", ["bash", "-c", "ls /tmp"], 0x... /* environ */)
...
[pid 47235] execve("/usr/bin/ls", ["ls", "/tmp"], 0x... /* environ */)

Two execve calls: one for bash itself at startup, one for ls inside bash.


Process States

Every process is always in one of a handful of states. The __state field in task_struct tracks this:

TASK_RUNNING (R)
    The process is either running on a CPU right now,
    or sitting in a run queue waiting for a CPU.
    Both cases show as 'R' in ps.

TASK_INTERRUPTIBLE (S)
    Sleeping, waiting for an event: I/O complete, timer,
    mutex, network data, etc. CAN be woken by a signal.

TASK_UNINTERRUPTIBLE (D)
    Sleeping, waiting for an event. CANNOT be interrupted
    by signals — not even SIGKILL. Typically waiting on
    disk I/O, kernel locks, or NFS. A process stuck in D
    state usually means a device or mount isn't responding.

__TASK_STOPPED (T)
    Stopped by SIGSTOP or job control (Ctrl+Z).
    Resumes on SIGCONT.

EXIT_ZOMBIE (Z)
    Has exited, but parent hasn't called wait() yet.
    No memory, no CPU time — just the task_struct remains
    to hold the exit status.

You see these in ps output:

$ ps aux | head -12
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0 173936 15348 ?        Ss   Apr22   0:01 /usr/lib/systemd/systemd
root           2  0.0  0.0      0     0 ?        S    Apr22   0:00 [kthreadd]
root           9  0.0  0.0      0     0 ?        S    Apr22   0:00 [ksoftirqd/0]
root        1432  0.0  0.0  10604  5124 ?        Ss   Apr22   0:00 /usr/bin/dbus-broker
root        1618  0.0  0.1  54680 18204 ?        Ss   Apr22   0:01 /usr/bin/NetworkManager
xiaofeng   47234  0.0  0.0  14228  8316 pts/0    Ss   09:15   0:00 -bash

The STAT column encodes state plus modifiers:

Letter  Meaning
R       Running or runnable
S       Interruptible sleep
D       Uninterruptible sleep (disk/kernel wait)
Z       Zombie
T       Stopped
I       Idle kernel thread
s       Session leader
l       Multi-threaded
+       Foreground process group
N       Low priority (nice > 0)
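
As a small illustration, the STAT letters listed above can be decoded mechanically. This is a sketch, not how ps itself works: the function name decode_stat is hypothetical, and the "<" (high priority) modifier is an addition taken from ps(1) rather than from the list above.

```python
# First character is the state; every following character is a modifier.
STATES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "Z": "zombie",
    "T": "stopped",
    "I": "idle kernel thread",
}
MODIFIERS = {
    "s": "session leader",
    "l": "multi-threaded",
    "+": "foreground process group",
    "N": "low priority",
    "<": "high priority",  # assumption: taken from ps(1), not from the list above
}


def decode_stat(stat):
    """Split a ps STAT string such as 'Ssl' into (state, [modifiers])."""
    state = STATES.get(stat[0], "unknown state " + stat[0])
    mods = [MODIFIERS.get(ch, "unknown modifier " + ch) for ch in stat[1:]]
    return state, mods


for sample in ("Ss", "S", "Ssl", "R+"):
    print(sample, "->", decode_stat(sample))
```
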

D state is worth understanding on its own. A process in D state cannot be killed — not even with SIGKILL. The signal is queued, but the process can't check it until the kernel operation it's waiting for completes. If a disk or NFS mount stops responding, processes waiting on I/O pile up in D state and stay there until the device responds or the mount is force-unmounted.

# Find D-state processes (usually none; if you see them, investigate the mount or device):
$ ps aux | awk '$8 ~ /^D/'

Threads: Processes That Share Memory

So far, "process" and "task_struct" have been interchangeable. But what about threads?

In Linux, there's no fundamental distinction between a process and a thread at the kernel level. A thread is just a task_struct that shares its mm (virtual address space) with another task_struct.

This sharing is controlled by the flags passed to clone():

// fork(): new address space, new FD table, new everything
clone(SIGCHLD, 0);

// pthread_create(): shared address space, shared FDs, shared signal handlers
clone(CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND |
      CLONE_THREAD | CLONE_SETTLS, stack_ptr);

When you call pthread_create(), glibc issues a clone() call (clone3() in recent glibc versions) with flags along these lines. The result is a new task_struct (its own kernel entity, its own PID, schedulable independently) whose mm pointer refers to the same mm_struct as the creating thread. They share one virtual address space.

From the kernel's perspective, threads and processes are the same thing. Both are task_structs. Both get scheduled by the same fair scheduler (CFS, reworked around the EEVDF algorithm since kernel 6.6). Both appear in /proc. The tgid field (thread group ID) ties them together: all threads in a process share the same tgid, which equals the PID of the first thread.

# NetworkManager is multi-threaded
$ cat /proc/1618/status | grep -E '(Tgid|Pid|Threads)'
Tgid:   1618
Pid:    1618
Threads:        3

# The threads are visible under /proc/1618/task/
$ ls /proc/1618/task/
1618  1636  1637

# Thread 1636 has its own PID but the same TGID
$ cat /proc/1636/status | grep -E '(Tgid|Pid)'
Tgid:   1618
Pid:    1636

/proc/1618/task/ lists all threads in the thread group. Thread 1618 is the main thread (its PID equals the TGID); 1636 and 1637 are worker threads with their own PIDs but sharing the same memory space.

This architecture (threads as clone()d task_structs) is why Linux doesn't need separate thread scheduling code. The same scheduler handles both.
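
You can see the PID/TGID split from user space too. The sketch below assumes Linux and Python 3.8+, where os.getpid() returns the TGID and threading.get_native_id() returns the kernel's per-thread ID; the helper name tids_of_worker_threads is hypothetical.

```python
import os
import threading


def tids_of_worker_threads(count=2):
    """Start `count` threads; return (getpid(), kernel thread ID) seen by each."""
    seen = []
    lock = threading.Lock()
    # The barrier keeps all workers alive at the same moment,
    # so the kernel cannot hand the same thread ID to two of them.
    barrier = threading.Barrier(count)

    def worker():
        barrier.wait()
        with lock:
            seen.append((os.getpid(), threading.get_native_id()))

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


print("main:   getpid() =", os.getpid(), " kernel TID =", threading.get_native_id())
for pid, tid in tids_of_worker_threads():
    print("worker: getpid() =", pid, " kernel TID =", tid)
```

Every thread reports the same getpid() value (the TGID), while each kernel thread ID is distinct. Those thread IDs are the entries you would find under /proc/<pid>/task/.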


Zombies and Reaping

When a process calls exit() (or is killed by a signal), the kernel:

  1. Frees all its memory (pages, page tables, VMAs)
  2. Closes all open file descriptors
  3. Detaches from its thread group
  4. Sets exit_state to EXIT_ZOMBIE
  5. Sends SIGCHLD to the parent

The task_struct is kept alive in zombie state so the parent can retrieve the exit status. The parent does this by calling wait() or waitpid():

int status;
pid_t child = wait(&status);

if (WIFEXITED(status)) {
    printf("exited normally, status %d\n", WEXITSTATUS(status));
} else if (WIFSIGNALED(status)) {
    printf("killed by signal %d\n", WTERMSIG(status));
}

Once wait() returns, the kernel removes the zombie's task_struct. This is called reaping the zombie.

What if the parent never calls wait()?

If a parent exits without reaping its children, those children are reparented to PID 1 (systemd), or to a closer ancestor that has registered itself as a subreaper. systemd calls waitpid() on any process reparented to it, so they get reaped promptly. This is one of the reasons PID 1 must never exit: the kernel panics if it does, and while it runs it is the reaper of last resort.

If the parent stays alive but never calls wait(), zombies accumulate. Each zombie is a task_struct still consuming kernel memory:

# Find zombie processes (on a healthy system, there should be none):
$ ps aux | awk '$8 ~ /^Z/'

# The PPID points to the process that should be reaping:
$ ps -eo pid,ppid,stat,comm | awk '$3 ~ /Z/'
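
The whole zombie life cycle fits in a few lines. This sketch assumes Linux (it reads /proc/<pid>/stat); the helper names proc_state and zombie_demo are hypothetical.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # comm is wrapped in parentheses and may contain spaces;
    # the state letter is the first field after the last ")".
    return data.rsplit(")", 1)[1].split()[0]


def zombie_demo(exit_code=7, timeout=5.0):
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child exits immediately
    # Parent: poll until the child shows up as a zombie (or give up).
    deadline = time.monotonic() + timeout
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)
    _, status = os.waitpid(pid, 0)  # reap: the task_struct is released here
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
    return state, code, proc_state(pid)


before, code, after = zombie_demo()
print("state before wait():", before)
print("exit status:", code)
print("state after wait():", after)
```

Before waitpid() the child is visible in /proc with state Z; after waitpid() its /proc entry disappears.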

Signals: Asynchronous Notifications

Signals are the kernel's mechanism for asynchronous notification. A signal can be sent by the kernel (e.g., SIGSEGV on a segfault, SIGCHLD when a child exits), by another process (kill -9 1234), or by the user (Ctrl+C sends SIGINT).

How signals are delivered

When a signal is sent to a process:

  1. The kernel sets a bit in the process's pending signal bitmask (task_struct → pending)
  2. The next time the process is scheduled or returns from a syscall, the kernel checks for pending signals
  3. If a signal is pending and not blocked, the kernel delivers it

Delivery means one of:

  • Default action: terminate, terminate+core dump, ignore, stop, or continue — depending on the signal
  • Custom handler: if the process called sigaction() to install a handler, the kernel redirects execution to it

The common signals:

Signal    Default     Triggered by
SIGHUP    Terminate   Controlling terminal closed
SIGINT    Terminate   Ctrl+C
SIGQUIT   Core dump   Ctrl+\
SIGKILL   Terminate   Cannot be caught or ignored
SIGSEGV   Core dump   Invalid memory access
SIGTERM   Terminate   Polite termination request
SIGCHLD   Ignore      Child exited or stopped
SIGSTOP   Stop        Cannot be caught or ignored
SIGCONT   Continue    Resume a stopped process
SIGPIPE   Terminate   Write to a closed pipe

SIGKILL and SIGSTOP cannot be caught

These two signals are special: no process can install a handler for them, block them, or ignore them. This gives the system a guaranteed way to kill or stop any process regardless of its state or what code it's running.

This is also why killing a D-state process doesn't work: SIGKILL is queued (the bit is set in the pending bitmask), but the process is in uninterruptible sleep and won't check pending signals until the kernel operation it's waiting for completes.

Sending signals

# By PID:
kill -SIGTERM 1618      # ask NetworkManager to stop gracefully
kill -SIGKILL 1618      # force it

# By name (resolves PID automatically):
pkill NetworkManager
killall -SIGTERM bash

# Check what a signal number means:
kill -l 9               # prints KILL

Inspecting signal state

# What signals does NetworkManager have blocked/ignored?
$ grep -E '^Sig' /proc/1618/status
SigQ:   0/59282
SigPnd: 0000000000000000
SigBlk: 0000000000001000
SigIgn: 0000000000003003
SigCgt: 0000000000010002

These are 64-bit bitmasks, one bit per signal. SigBlk (blocked) and SigIgn (ignored) are set by the process through sigprocmask() and sigaction(). You can decode them:

$ python3 -c "
blocked = 0x0000000000001000
for i in range(64):
    if blocked & (1 << i):
        print(f'  signal {i+1} is blocked')
"
  signal 13 is blocked    # SIGPIPE: blocked means held pending, which is not the same as ignored

Try It Yourself

1. Explore the full process tree

pstree -p | less

2. Inspect any process's complete state

cat /proc/1618/status
cat /proc/1618/cmdline | tr '\0' ' '
ls -la /proc/1618/fd/
cat /proc/1618/maps | head -20

3. Watch fork() and exec() at the syscall level

strace -f -e clone,execve,wait4 bash -c 'ls /tmp; true' 2>&1 | head -20

4. Observe copy-on-write in action

# Fork a process that holds 50MB but doesn't write anything
python3 -c "
import os, time
data = bytearray(50 * 1024 * 1024)
pid = os.fork()
if pid == 0:
    print('child PID:', os.getpid())
    time.sleep(30)
else:
    print('parent PID:', os.getpid())
    time.sleep(30)
" &

# While it sleeps, check VmRSS and RssAnon for both PIDs
# ($! is the parent; the child PID is printed by the script).
# Both will show ~50MB VmRSS, but they share the physical frames
grep -E '(VmRSS|RssAnon)' /proc/$!/status

5. Create a zombie intentionally

python3 -c "
import os, time
pid = os.fork()
if pid == 0:
    os._exit(0)      # child exits immediately
else:
    time.sleep(30)   # parent doesn't call wait()
" &

ps aux | awk '$8 ~ /^Z/'   # zombie appears here

6. Explore threads of a multi-threaded process

ls /proc/1618/task/            # one directory per thread
ps -eLf | grep 1618            # threads in ps format (LWP column)

7. Observe process state transitions

sleep 100 &
BGPID=$!

ps -p $BGPID -o pid,stat       # S (interruptible sleep)

kill -SIGSTOP $BGPID
ps -p $BGPID -o pid,stat       # T (stopped)

kill -SIGCONT $BGPID
ps -p $BGPID -o pid,stat       # S (sleeping again)

kill $BGPID

8. Decode your shell's signal masks

grep Sig /proc/$$/status

9. Trace the fork-exec chain for any command

strace -f -e clone,execve bash -c 'cat /proc/version; true' 2>&1 | grep -E '(clone|exec)'

10. Find D-state processes

ps aux | awk '$8 ~ /^D/'
# Usually empty; if not, check which device/mount is involved

Putting It All Together

The Linux process model is built on one abstraction: everything is a task_struct.

Everything is a task_struct
    │
    ├── Created via clone() (fork = clone with no sharing)
    │   ├── New task_struct allocated
    │   ├── Virtual address space duplicated — CoW, no actual copy
    │   ├── File descriptor table copied
    │   └── Child placed on run queue; both parent and child are runnable
    │
    ├── Program replaced via execve()
    │   ├── Old address space flushed
    │   ├── New ELF mapped into memory
    │   └── PID unchanged — same task_struct, new mm_struct
    │
    ├── State tracked in task_struct.__state
    │   R (runnable) → S (waiting) ↔ D (uninterruptible) → Z (zombie) → reaped
    │
    ├── Threads = clone() with CLONE_VM
    │   ├── Same mm_struct, separate task_struct
    │   └── Same scheduler, same /proc, different PID — same TGID
    │
    └── Exits → zombie → parent calls wait() → reaped
        If parent dies first → reparented to PID 1 → systemd reaps it

Every shell command follows this path: bash calls fork(), the child sets up redirections, calls execve(), the program runs, exits, and bash reaps it with wait(). Pipes, redirections, job control, and process groups all fall out of this single model — composable primitives with no hidden machinery.


What's Next

In the next chapter, we go into memory in depth. We've touched on virtual memory — VMAs, page tables, copy-on-write — but only as much as the process model required. Chapter 4 traces the full story: how the kernel manages physical memory, what happens on a page fault, how the page cache works, and why free shows almost no "free" memory on a healthy, idle system.


Part of the Linux Deep Dive series.

Linux Deep Dive #4: Memory Management — Virtual Memory and the Page Cache

Target: Fedora 43, kernel 6.19.11. Every command in this post is something you can run yourself.


Run free -h on any Linux system that's been running for a while:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:            30Gi        22Gi       200Mi       500Mi       7.8Gi        7.3Gi
Swap:          8.0Gi          0B       8.0Gi

Most people see the free column — 200 MB — and worry. Only 200 MB of RAM left? But the system is perfectly healthy. available is 7.3 GiB. Something is using 7.8 GiB under buff/cache, and it can be reclaimed on demand.

This is the kernel doing its job. Idle RAM is wasted RAM. The kernel fills unused memory with file data it might need again, making disk reads fast. When a process needs memory, the kernel reclaims the cache. The free column tells you almost nothing useful. available is the number that matters.

This chapter explains why — by tracing the full memory architecture from virtual addresses through page tables to physical frames, through the page cache and back. We'll cover how the kernel allocates physical memory, what really happens on a page fault, how the page cache works, and how the kernel reclaims memory under pressure.


Memory at a Glance

User process (virtual address space)
    │
    │  Virtual address → MMU → page table walk
    ▼
┌──────────────────────────────────────────────────────┐
│  4-Level Page Tables                                  │
│  PGD → PUD → PMD → PTE → physical frame              │
│  TLB caches recent translations                      │
└──────────────────────────────────────────────────────┘
    │
    │  physical frame number
    ▼
┌──────────────────────────────────────────────────────┐
│  Physical Memory                                      │
│  ┌────────────────────┐  ┌──────────────────────┐   │
│  │  Anonymous pages   │  │  File-backed pages   │   │
│  │  heap, stack, CoW  │  │  (the page cache)    │   │
│  │  must swap to evict│  │  can drop for free   │   │
│  └────────────────────┘  └──────────────────────┘   │
│                                                       │
│  Buddy allocator manages frames (4KB to 4MB blocks)  │
│  SLUB allocator carves pages into kernel objects     │
└──────────────────────────────────────────────────────┘
    │
    │  when free memory drops below watermarks
    ▼
┌──────────────────────────────────────────────────────┐
│  Reclaim                                              │
│  kswapd scans LRU lists                              │
│  ├── file pages: drop (clean) or writeback + drop    │
│  └── anon pages: write to swap, then free frame      │
│                                                       │
│  Last resort: OOM killer                             │
└──────────────────────────────────────────────────────┘

Virtual Addresses and Page Tables

Chapter 3 showed that every process has an mm_struct containing a set of VMAs — contiguous virtual address regions with permissions like r-xp or rw-p. But how does a virtual address in a VMA actually become a physical address the CPU can use?

The answer is the page table.

The 4-Level Page Table on x86-64

On x86-64 with 4-level paging, virtual addresses are 48 bits wide (CPUs with 5-level paging extend this to 57 bits by adding a P4D level). The CPU uses a 4-level page table structure to translate them:

Virtual address (48-bit):
  ┌───────┬───────┬───────┬───────┬──────────────┐
  │  PGD  │  PUD  │  PMD  │  PTE  │ Page offset  │
  │ 9 bit │ 9 bit │ 9 bit │ 9 bit │   12 bit     │
  └───────┴───────┴───────┴───────┴──────────────┘

PGD = Page Global Directory   (top-level)
PUD = Page Upper Directory
PMD = Page Middle Directory
PTE = Page Table Entry        (leaf: physical frame number + flags)

The translation works like a 4-level lookup:

CR3 register → PGD table
    │  index with bits 47–39
    ▼
PGD entry → PUD table
    │  index with bits 38–30
    ▼
PUD entry → PMD table
    │  index with bits 29–21
    ▼
PMD entry → PTE table
    │  index with bits 20–12
    ▼
PTE entry: physical frame number + R/W/X/U flags
    │  + page offset (bits 11–0)
    ▼
Physical address

Each level is a 4KB page holding 512 8-byte entries (9 bits of index → 2^9 = 512). The final 12 bits of the virtual address index into the 4KB page. This gives 48-bit coverage: 9+9+9+9+12 = 48.
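
The bit slicing is easy to reproduce by hand. A minimal sketch, assuming the 48-bit, 4-level layout described above; split_va and join_va are hypothetical helper names, and the sample address is the start of bash's text mapping from the /proc/$$/maps listing in this chapter.

```python
def split_va(va):
    """Split a 48-bit virtual address into four 9-bit table indices and a 12-bit offset."""
    return {
        "pgd": (va >> 39) & 0x1FF,   # bits 47-39
        "pud": (va >> 30) & 0x1FF,   # bits 38-30
        "pmd": (va >> 21) & 0x1FF,   # bits 29-21
        "pte": (va >> 12) & 0x1FF,   # bits 20-12
        "offset": va & 0xFFF,        # bits 11-0
    }


def join_va(parts):
    """Inverse of split_va: rebuild the address from its pieces."""
    return (
        (parts["pgd"] << 39)
        | (parts["pud"] << 30)
        | (parts["pmd"] << 21)
        | (parts["pte"] << 12)
        | parts["offset"]
    )


va = 0x5592B7C3E000
parts = split_va(va)
for name, value in parts.items():
    print(f"{name:>6}: {value}")
print("round trip ok:", join_va(parts) == va)
```

Each index is always in the range 0 to 511, which is exactly the number of entries in one 4KB table page.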

The flags in each PTE are what drive many kernel mechanisms:

PTE flag         Effect
Present          Page is in RAM; if clear, triggers a page fault
Writable         Write permission; a write while clear triggers a fault (CoW if the VMA allows writes)
User             Accessible from user space; kernel-only pages have this clear
Accessed         CPU sets this when the page is read or written (used by reclaim)
Dirty            CPU sets this on write (used to detect pages needing writeback)
NX (No-Execute)  Prevents execution of data pages

You can see the page table walk in action through /proc/[pid]/pagemap, though that requires root. The more accessible view is /proc/[pid]/maps, which shows the VMAs:

$ cat /proc/$$/maps | head -8
5592b7c38000-5592b7c3e000 r--p 00000000 fd:01 530753  /usr/bin/bash
5592b7c3e000-5592b7c7e000 r-xp 00006000 fd:01 530753  /usr/bin/bash
5592b7c7e000-5592b7cad000 r--p 00046000 fd:01 530753  /usr/bin/bash
5592b7cae000-5592b7cb2000 rw-p 00075000 fd:01 530753  /usr/bin/bash
5592b9040000-5592b90ae000 rw-p 00000000 00:00 0       [heap]
7f2345a00000-7f2345c00000 r--p 00000000 fd:01 398912  /usr/lib/locale/locale-archive
...
7ffe8e3b0000-7ffe8e3d2000 rw-p 00000000 00:00 0       [stack]
7ffe8e3f8000-7ffe8e3fc000 r--p 00000000 00:00 0       [vvar]

Each row is a VMA. The permissions r-xp, rw-p, r--p map directly to PTE flags. The page table for this process contains entries for each page within each VMA — but only for pages that have actually been touched (page tables are themselves demand-paged).

The TLB: Caching Page Table Lookups

Four memory accesses to translate one address would be crippling. The CPU keeps a Translation Lookaside Buffer (TLB) — a small hardware cache of recent virtual→physical translations. A TLB hit returns the physical address in a single cycle. A miss triggers the full 4-level walk.

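TLB misses are part of why context switches have measurable cost: switching processes means switching page tables (loading a new value into CR3), which invalidates the old process's TLB entries. CPUs with PCID tagging can keep some entries across the switch, but the cost doesn't vanish. After a full flush, the next few hundred memory accesses in the new process are all TLB misses until the cache warms up.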

The kernel minimizes TLB pressure through huge pages: instead of 4KB pages (needing many TLB entries), the kernel can use 2MB pages (PMD-level mappings), covering 512× more memory with a single TLB entry. On this system, transparent huge pages (THP) are in madvise mode — the kernel uses 2MB pages for anonymous memory when the application explicitly requests them:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

Physical Memory — Zones, the Buddy Allocator, and Slab

Memory Zones

Not all physical RAM is equivalent. Legacy ISA devices could only DMA to the first 16 MB. Older PCI devices can only address below 4 GB. The kernel divides physical memory into zones to track these constraints:

$ cat /proc/buddyinfo
Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      1      2
Node 0, zone    DMA32     17     13     13     12      9      9     14     12      9      9    768
Node 0, zone   Normal  65656  39923  16923   8715   2995    987    367    127     53     27   1876

Zone    Physical range  Purpose
DMA     0–16 MB         ISA/legacy DMA devices
DMA32   16 MB–4 GB      32-bit PCI devices
Normal  > 4 GB          All other allocations

On this 30 GB machine, the Normal zone holds almost all RAM.

The Buddy Allocator

The kernel's fundamental physical memory allocator is the buddy allocator. It maintains 11 free lists, indexed by order 0 through 10, where order N holds blocks of 2^N contiguous pages:

Order 0:  1 page   = 4KB
Order 1:  2 pages  = 8KB
Order 2:  4 pages  = 16KB
...
Order 10: 1024 pages = 4MB

The numbers in buddyinfo are the count of free blocks at each order. In the Normal zone above, there are 65,656 order-0 blocks (4KB each), 39,923 order-1 blocks (8KB each), and so on.
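
Since each column is a count of blocks of 2^order pages, the total free memory in a zone is a weighted sum. A small sketch that does the arithmetic for one line of /proc/buddyinfo, assuming 4KB pages; the function name free_pages is hypothetical, and the sample line is the Normal zone shown above.

```python
PAGE_KB = 4  # assumption: 4KB base pages, as on x86-64


def free_pages(buddyinfo_line):
    """Return (zone name, free pages) for one line of /proc/buddyinfo."""
    fields = buddyinfo_line.split()
    # Layout: "Node", "0,", "zone", "<name>", then one count per order (0, 1, 2, ...)
    zone = fields[3]
    counts = [int(x) for x in fields[4:]]
    pages = sum(count << order for order, count in enumerate(counts))
    return zone, pages


line = ("Node 0, zone   Normal  65656  39923  16923   8715   2995"
        "    987    367    127     53     27   1876")
zone, pages = free_pages(line)
print(f"{zone}: {pages} free pages = {pages * PAGE_KB / 1024 / 1024:.2f} GiB")
```
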

When the kernel needs a block of 2^k pages (an order-k request):

  1. Find the smallest order ≥ k with a free block
  2. If that order is larger than needed, split the block: one half goes to the next lower order's free list, the other half is split again or used, until a block of order k remains
  3. When pages are freed, the kernel checks if the adjacent "buddy" block is also free; if so, they merge back to the higher order

Splitting and merging keep the allocator efficient and minimize fragmentation. You can watch allocation pressure by checking how many high-order blocks remain — if order-10 is empty and order-0 is full, the system is fragmented and can't satisfy large contiguous allocations.
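
The split-and-merge logic can be sketched as a toy model. This is an illustration of the technique only, not the kernel's implementation: the class ToyBuddy is hypothetical, blocks are identified by their starting page number, and the model starts with a single free block of the maximum order.

```python
class ToyBuddy:
    """Toy buddy allocator over 2**max_order pages, tracked as free lists per order."""

    def __init__(self, max_order=10):
        self.max_order = max_order
        self.free = {order: set() for order in range(max_order + 1)}
        self.free[max_order].add(0)  # one big free block starting at page 0

    def alloc(self, order):
        """Allocate a block of 2**order pages; return its first page number."""
        for found in range(order, self.max_order + 1):
            if self.free[found]:
                break
        else:
            raise MemoryError("no free block large enough")
        start = min(self.free[found])
        self.free[found].remove(start)
        # Split down: keep the lower half, put the upper half on the free list.
        while found > order:
            found -= 1
            self.free[found].add(start + (1 << found))
        return start

    def release(self, start, order):
        """Free a block and merge it with its buddy as long as the buddy is free."""
        while order < self.max_order:
            buddy = start ^ (1 << order)  # buddy address differs in exactly one bit
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free[order].add(start)


allocator = ToyBuddy(max_order=3)  # 8 pages in total
page = allocator.alloc(0)
print("allocated page", page, "free lists:", allocator.free)
allocator.release(page, 0)
print("after release, free lists:", allocator.free)
```

Allocating one page from an 8-page block leaves one free block at each of orders 0, 1, and 2; releasing it merges everything back into a single order-3 block.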

# Watch buddy allocator state live
watch -d cat /proc/buddyinfo

The SLUB Allocator

The buddy allocator deals in full pages. But most kernel objects are not page-sized: file (~300 bytes), dentry (~200 bytes), inode (~600 bytes), socket buffers (~200 bytes each), and even task_struct (~7KB), which is larger than a page but not a neat multiple of one. Handing out whole pages for each would waste enormous amounts of RAM.

The SLUB allocator (the "unqueued slab allocator") sits on top of the buddy allocator. It maintains per-CPU slabs: pages filled with pre-allocated objects of a single size. Allocating a small kernel object takes a free slot from a slab; freeing it returns the slot. No buddy allocator call needed.

# See the largest slab caches by memory usage
$ sudo slabtop -o | head -15
 Active / Total Objects (% used)    : 2344037 / 2429115 (96.5%)
 Active / Total Slabs (% used)      : 82684 / 82684 (100.0%)
 Active / Total Caches (% used)     : 193 / 244 (79.1%)
 Active / Total Size (% used)       : 622408.38K / 655006.31K (95.0%)

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
253056 253056 100%    0.19K   6016       42     24064K dentry
198400 186988  94%    0.06K   3100       64      6200K kmalloc-64
110916  95897  86%    1.06K   9243       12    147888K inode_cache
 37248  37248 100%    0.50K   1164       32     18624K kmalloc-512

dentry (directory entry cache) and inode_cache are typically the largest slab consumers — the kernel caches filesystem metadata aggressively because lookups are frequent and disk access is slow.


Page Faults

When the CPU walks the page table and something is wrong, it raises a page fault — a hardware exception that transfers control to the kernel's fault handler (handle_mm_fault() in mm/memory.c). There are several distinct kinds:

Fault type  Cause                                             Kernel action                      Disk I/O?
Minor       Page in VMA but PTE not present (first access)    Allocate frame, fill PTE           No
Major       Page was evicted to disk; PTE marked not-present  Read page from disk, fill PTE      Yes
CoW         Write to a read-only shared page (after fork)     Copy page, update PTE to writable  No
Invalid     Access outside any VMA                            Deliver SIGSEGV                    No

Minor faults are normal and constant — they happen every time a program accesses a new page of its heap or stack. The kernel allocates a physical frame, zeros it (to prevent information leaks), updates the PTE, and the program continues. No disk I/O, just a brief detour through the kernel.
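
You can provoke minor faults on purpose and count them. The sketch below assumes Linux: it maps fresh anonymous memory, touches one byte per page, and compares the process's minor-fault counter (ru_minflt from getrusage()) before and after. The function name count_minor_faults is hypothetical. With transparent huge pages active for the region, the count can be far lower than one fault per 4KB page.

```python
import mmap
import resource


def count_minor_faults(size=16 * 1024 * 1024):
    """Touch every page of a fresh anonymous mapping; return (pages, minor faults)."""
    page = mmap.PAGESIZE
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    # Private anonymous mapping: address space only, no frames allocated yet.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    for offset in range(0, size, page):
        region[offset] = 1  # first write to each page faults a frame in
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    region.close()
    return size // page, after - before


pages, faults = count_minor_faults()
print(f"touched {pages} pages, observed {faults} minor faults")
```
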

Major faults mean the data had to come from disk: either a file-backed page that isn't in the page cache, or a page the kernel previously evicted to free RAM. A steady stream of them is a symptom of memory pressure. Major faults cause visible latency, because the application stalls waiting for the disk read to complete.

# Page fault totals since boot
$ grep -E '^pgfault|^pgmajfault' /proc/vmstat
pgfault       39670010   # all page faults; the vast majority are minor
pgmajfault        3007   # major faults: disk reads to bring pages back in

39 million faults since boot is expected, and nearly all of them are minor. 3007 major faults is low: this system hasn't been under enough memory pressure to evict much. On a memory-constrained system, pgmajfault climbs steadily and latency spikes whenever a major fault stalls a thread.

The CoW fault is the mechanism from Chapter 3: after fork(), parent and child share physical pages marked read-only. The first write from either side triggers a fault, the kernel copies the page, and both now have their own writable copy.

You can read per-process fault counters as well:

# Cumulative minor and major fault counts for a running process
$ ps -o pid,min_flt,maj_flt,comm -p 1

# For resident memory totals, smaps_rollup is faster than summing smaps:
$ sudo cat /proc/1/smaps_rollup | grep -E '^(Rss|Pss):'
Rss:               21432 kB
Pss:               13891 kB

The Page Cache

This is the mechanism behind the buff/cache column in free. It's also the single most important thing to understand about Linux memory management.

The Basic Idea

When you read() from a file, the kernel doesn't transfer bytes directly from disk to your process. It maintains a page cache — a pool of physical pages indexed by (filesystem inode, page offset). The read path is:

read() syscall
    │
    ├── Is (inode, offset) already in the page cache?
    │   ├── Yes: copy from cache to user buffer → done (no disk I/O)
    │   └── No:  read from disk into a new cache page,
    │             then copy to user buffer
    │
    └── Cache page stays in RAM after the read

The page cache entry persists after your read() returns. The next read() of the same region, whether by your process or any other, finds it in cache and pays no disk cost.

This is why a clean rebuild that compiles 10,000 source files goes faster the second time: all those file reads hit the cache. The first run was I/O-bound; the second is CPU-bound.
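
The read path above can be modeled in a few lines. This is a toy sketch, not kernel code: the class ToyPageCache and the fake_disk function are hypothetical, and the "disk" is just a callable that returns one page of bytes for an (inode, page index) pair.

```python
class ToyPageCache:
    """Toy page cache keyed by (inode, page index), counting hits and misses."""

    PAGE = 4096

    def __init__(self, read_from_disk):
        self.read_from_disk = read_from_disk  # callable(inode, index) -> bytes
        self.pages = {}
        self.hits = 0
        self.misses = 0

    def read(self, inode, offset, length):
        out = bytearray()
        end = offset + length
        while offset < end:
            index, start = divmod(offset, self.PAGE)
            key = (inode, index)
            if key in self.pages:
                self.hits += 1            # served from RAM
            else:
                self.misses += 1          # "disk" read fills the cache
                self.pages[key] = self.read_from_disk(inode, index)
            take = min(self.PAGE - start, end - offset)
            out += self.pages[key][start:start + take]
            offset += take
        return bytes(out)


def fake_disk(inode, index):
    # Hypothetical backing store: every byte of page N has the value N.
    return bytes([index % 256]) * ToyPageCache.PAGE


cache = ToyPageCache(fake_disk)
cache.read(inode=42, offset=0, length=8192)
print("first read:  hits", cache.hits, "misses", cache.misses)
cache.read(inode=42, offset=0, length=8192)
print("second read: hits", cache.hits, "misses", cache.misses)
```

The first read misses on both pages; the second is served entirely from the dictionary, which is the same effect you see when a repeated file read skips the disk.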

File-Backed vs Anonymous Memory

Physical memory pages fall into two categories:

Physical pages
    │
    ├── File-backed (page cache)
    │   ├── Clean: identical to disk → kernel can drop for free
    │   └── Dirty: modified, not yet written back
    │           → kernel must flush to disk before dropping
    │
    └── Anonymous (no file)
        ├── Heap, stack, CoW copies, anonymous mmap
        └── To free: must write to swap device first
                      (or kill the process)

File-backed pages are cheap to evict — the kernel just drops the page and re-reads it from disk if needed again. Anonymous pages are expensive: they have no file backing, so eviction requires writing to swap and reading back later.

This distinction drives the entire reclaim strategy.

mmap() and the Page Cache

mmap() is the page cache seen from the process side. When your program opens a file and calls:

void *ptr = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);

The kernel creates a VMA that represents a window into the file's pages in the page cache. No data is loaded yet. When you dereference ptr, a minor page fault fires; the kernel checks the page cache (reading from disk if necessary), and installs a PTE pointing to that cache page. The process's PTE and the page cache entry point to the same physical frame.
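
The same call is available from Python's standard mmap module. A minimal sketch, assuming Linux; the helper name mmap_roundtrip and the payload are hypothetical. It writes a temporary file, maps it read-only and shared, and reads the bytes through the mapping instead of through read().

```python
import mmap
import tempfile


def mmap_roundtrip(payload=b"hello from the page cache\n"):
    """Write payload to a temp file, then read it back through a shared mapping."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(payload)
        tmp.flush()
        # Length 0 maps the whole file; no data is copied at this point.
        with mmap.mmap(tmp.fileno(), 0, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ) as view:
            # Slicing dereferences the mapping; the first touch of each page
            # is resolved by a page fault against the page cache.
            return bytes(view[:len(payload)])


print(mmap_roundtrip())
```
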

This has an important consequence: every process running the same program shares its text segment.

# How many processes share bash's text pages?
$ cat /proc/$(pgrep -n bash)/smaps | grep -A 10 'bash$' | head -12
5592b7c3e000-5592b7c7e000 r-xp 00006000 fd:01 530753  /usr/bin/bash
Size:                256 kB
KernelPageSize:        4 kB
Rss:                 252 kB
Pss:                  14 kB    ← proportional share — 252KB / ~18 processes
Shared_Clean:        252 kB    ← all of it is shared
Private_Clean:         0 kB

Rss (Resident Set Size) is the total physical memory mapped. Pss (Proportional Set Size) divides shared pages by the number of processes sharing them. The r-xp text segment shows Shared_Clean: 252 kB and Pss: 14 kB — 18 bash processes share the same physical pages. One copy in RAM, many processes using it.

Reading the Page Cache Statistics

$ grep -E '^(Cached|Buffers|Mapped|Dirty|Writeback)' /proc/meminfo
Buffers:            4208 kB
Cached:          6564188 kB
Mapped:          1189268 kB
Dirty:              1336 kB
Writeback:             0 kB

Field      Meaning
Cached     Total page cache size
Buffers    Block device metadata cache (separate from file data cache)
Mapped     Page cache pages currently mapped into at least one process's address space
Dirty      Modified pages not yet written to disk
Writeback  Pages currently being written to disk

On this machine, Cached is about 6.3 GiB: that much file data is sitting in RAM, ready for instant re-use. Dirty is only 1.3 MiB, meaning the kernel's writeback threads are keeping up with writes. Writeback is 0, so no I/O is in flight right now.
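
If you want these numbers in a script, /proc/meminfo is simple to parse. A small sketch; the function name parse_meminfo is hypothetical, and the sample text is the output shown above so the example runs anywhere. On a live system you would pass it the contents of /proc/meminfo instead.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])
    return info


SAMPLE = """\
Buffers:            4208 kB
Cached:          6564188 kB
Mapped:          1189268 kB
Dirty:              1336 kB
Writeback:             0 kB
"""

info = parse_meminfo(SAMPLE)
print(f"page cache: {info['Cached'] / 1024 / 1024:.2f} GiB")
print(f"dirty:      {info['Dirty'] / 1024:.2f} MiB")
```
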


Reclaim — How the Kernel Manages Memory Pressure

LRU Lists

The kernel tracks which pages are worth keeping with four LRU (Least Recently Used) lists:

LRU lists
    ├── active_file    — recently used file-backed pages
    ├── inactive_file  — older file-backed pages (reclaim candidates)
    ├── active_anon    — recently used anonymous pages
    └── inactive_anon  — older anonymous pages (swap candidates)

Pages move between these lists based on access patterns. The hardware Accessed bit in PTEs tells the kernel when a page was last used. Pages migrate from active to inactive when they haven't been accessed for a while; from inactive, they become reclaim candidates.

$ grep -E '^(Active|Inactive)' /proc/meminfo
Active:          7844660 kB
Inactive:        4958656 kB
Active(anon):    4685204 kB
Inactive(anon):        0 kB
Active(file):    3159456 kB
Inactive(file):  4958656 kB

On this system: about 4.5 GiB of anonymous memory is active (process heaps/stacks in use), and about 3.0 GiB of file cache is active (recently read files). The 4.7 GiB of inactive file cache is the first thing kswapd will drop if memory gets tight.
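
The active/inactive split can be sketched as a toy two-list model. This is a deliberate simplification and not the kernel's algorithm: the class ToyFileLRU is hypothetical, a page is promoted on its second access, and reclaim only demotes active pages once the inactive list is empty.

```python
from collections import OrderedDict


class ToyFileLRU:
    """Toy two-list LRU: new pages start inactive, a second access promotes them."""

    def __init__(self):
        self.inactive = OrderedDict()  # oldest entry first
        self.active = OrderedDict()

    def touch(self, page):
        if page in self.active:
            self.active.move_to_end(page)   # still hot: refresh its position
        elif page in self.inactive:
            del self.inactive[page]
            self.active[page] = True        # second access: promote
        else:
            self.inactive[page] = True      # first access: start on inactive

    def reclaim(self, count):
        """Evict up to `count` pages, oldest inactive pages first."""
        victims = []
        while len(victims) < count:
            if not self.inactive:
                if not self.active:
                    break
                # Inactive list is empty: demote the oldest active page.
                page, _ = self.active.popitem(last=False)
                self.inactive[page] = True
                continue
            page, _ = self.inactive.popitem(last=False)
            victims.append(page)
        return victims


lru = ToyFileLRU()
for page in ("a", "b", "c", "a"):   # "a" is touched twice
    lru.touch(page)
print("active:", list(lru.active), "inactive:", list(lru.inactive))
print("reclaimed:", lru.reclaim(2))
```

Pages read once are evicted first; the page that was touched twice survives until nothing colder is left.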

kswapd and Watermarks

kswapd is a kernel thread that maintains free memory within a healthy range. The kernel defines three watermarks for each zone:

Zone free pages
    │
    ├── High watermark — kswapd stops here (plenty of free memory)
    │
    ├── Low watermark  — kswapd wakes up and starts reclaiming
    │
    └── Min watermark  — direct reclaim kicks in; the allocating task
                         stalls and reclaims pages itself

When free memory drops below the low watermark, kswapd wakes and scans the inactive lists:

  1. Inactive file pages (clean): drop immediately — they're exact copies of disk data
  2. Inactive file pages (dirty): schedule writeback, then drop after write completes
  3. Inactive anon pages: write to swap device, then free the frame

# Is kswapd doing any work?
$ grep -E 'pgsteal_kswapd|pgscan_kswapd|kswapd_low_wmark' /proc/vmstat
pgsteal_kswapd             0    # pages reclaimed by kswapd (0 = no pressure)
pgscan_kswapd              0    # pages scanned during reclaim
kswapd_low_wmark_hit_quickly   0

All zeros: this system is not under memory pressure. kswapd is idle. On a memory-constrained system, pgsteal_kswapd climbs continuously.
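
The watermark logic amounts to a small state machine with hysteresis: kswapd starts below the low watermark and keeps going until free memory is back above the high one. A sketch under that reading; the function name kswapd_step and the page counts are hypothetical.

```python
def kswapd_step(free, wmark_min, wmark_low, wmark_high, kswapd_active):
    """Return (what happens, whether kswapd is active afterwards) for a zone."""
    if free < wmark_min:
        # Allocating tasks must reclaim for themselves; kswapd runs too.
        return "direct reclaim", True
    if free < wmark_low:
        return "kswapd reclaiming", True
    if free >= wmark_high:
        return "idle", False
    # Between low and high: keep doing whatever kswapd was already doing.
    if kswapd_active:
        return "kswapd reclaiming", True
    return "idle", False


# Hypothetical watermarks, in pages.
MIN, LOW, HIGH = 1000, 1250, 1500
active = False
for free in (2000, 1400, 1200, 1400, 900, 1600):
    action, active = kswapd_step(free, MIN, LOW, HIGH, active)
    print(f"free={free:5d} -> {action}")
```

Note that 1400 free pages appears twice with different outcomes: idle on the way down, still reclaiming on the way back up.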

Swappiness

The vm.swappiness sysctl (0–200, default 60) controls the balance between evicting file cache and swapping anonymous pages. It's a weight in the reclaim cost calculation — higher values make the kernel more willing to swap anonymous pages alongside evicting file cache.

$ cat /proc/sys/vm/swappiness
60

Contrary to common belief, swappiness is not a memory-use threshold. swappiness=60 does not mean "start swapping at 60% memory use." It means: when reclaim weighs swapping an anonymous page against evicting a file page, anonymous pages get a weight of 60 and file pages get a weight of 200 − 60 = 140. At 100 the two are weighted equally.

Setting swappiness=1 makes the kernel almost never swap — it will exhaust file cache before touching anonymous memory. Setting it to 200 makes the kernel aggressively swap, preferring to keep file cache hot. Neither extreme is universally right.

# Check swap usage
$ grep -E '^(SwapTotal|SwapFree|SwapCached)' /proc/meminfo
SwapTotal:       8388604 kB
SwapFree:        8388604 kB
SwapCached:            0 kB

# Swap activity since boot
$ grep -E '^pswp' /proc/vmstat
pswpin    0     # pages swapped in (major faults from swap)
pswpout   0     # pages swapped out

Zero swap activity on this machine — there's enough RAM that kswapd hasn't needed to swap anything out.


The OOM Killer

When free memory hits zero, all reclaim options are exhausted, and a memory allocation is still failing, the kernel invokes the OOM killer (Out Of Memory killer) as a last resort.

The OOM killer selects a process to terminate based on oom_score, a per-process badness value derived mainly from memory footprint:

  • RSS (how much RAM the process is actually using)
  • Swap usage and page-table size (memory the process holds indirectly)
  • oom_score_adj (a per-process bias, covered below)

Older kernels also weighed runtime, nice value, and root privileges; current kernels no longer do.

The process with the highest score gets killed with SIGKILL.

# OOM score for each running process
$ cat /proc/1/oom_score          # systemd — should be 0 (protected)
0

$ cat /proc/$$/oom_score         # your shell
5

# Processes can adjust their score (-1000 to +1000)
$ cat /proc/$$/oom_score_adj
0

Setting oom_score_adj to -1000 makes a process immune to OOM killing. Critical system daemons often do this, and the kernel never selects PID 1 regardless of its score. Setting it to +1000 means "kill me first in an OOM."
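
The effect of oom_score_adj is easier to see as arithmetic. The sketch below is a simplified model in the spirit of the kernel's oom_badness() function, not a copy of it: the function name oom_points is hypothetical, all quantities are in pages, and the example numbers are made up.

```python
def oom_points(rss_pages, swap_pages, pgtable_pages, oom_score_adj, total_pages):
    """Simplified badness: memory footprint plus an adjustment scaled to total RAM.

    Returns None for a process that must never be selected.
    """
    if oom_score_adj == -1000:
        return None  # immune: skipped entirely
    points = rss_pages + swap_pages + pgtable_pages
    # Each unit of oom_score_adj is worth 1/1000 of total memory.
    points += oom_score_adj * (total_pages // 1000)
    return points


TOTAL = 8_000_000  # hypothetical machine: about 30 GiB in 4KB pages
print("big process, adj 0:     ", oom_points(600_000, 0, 2_000, 0, TOTAL))
print("small process, adj +500:", oom_points(50_000, 0, 500, 500, TOTAL))
print("daemon, adj -1000:      ", oom_points(100_000, 0, 800, -1000, TOTAL))
```

In this model a small process with a large positive adjustment can outrank a much bigger one, which is the point of the knob.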

# Which process would the OOM killer target right now?
$ ps -eo pid,oom_score,rss,comm --sort=-oom_score | head -10
    PID OOM_SCORE    RSS COMMAND
   2847       182 2345200 firefox
   3012        89  854320 slack
   3201        45  412800 gnome-shell

The OOM killer fires rarely on a well-sized system. If it fires often, the right fix is more RAM, fixing a memory leak, or adding limits with memory cgroups — not tuning OOM parameters.


Why free Shows Almost No "Free" Memory

Now we have all the pieces. Return to the output from a memory-loaded system:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:            30Gi        22Gi       200Mi       500Mi       7.8Gi        7.3Gi
Swap:          8.0Gi          0B       8.0Gi

What does each column actually mean?

Column       What it counts
total        Total physical RAM
used         Process memory + kernel memory not reclaimable
free         Completely idle frames, holding nothing
shared       tmpfs and shared memory
buff/cache   Page cache + buffer cache, reclaimable on demand
available    Estimated memory a new allocation could get ≈ free + reclaimable buff/cache

The free column being 200 MB does not mean the system is nearly out of memory. It means the kernel has filled almost all idle frames with page cache — which it should. Those 7.8 GB of buff/cache are serving file reads instantly.

When a process needs more memory, the kernel reclaims from buff/cache first (by dropping clean file pages). The new allocation succeeds, and buff/cache shrinks. available — not free — is what tells you whether a large new allocation will succeed.

The actual machine for this series has 30 GB of RAM and was not heavily loaded at the time this chapter was written:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:            30Gi        11Gi        12Gi       153Mi       6.6Gi        19Gi
Swap:          8.0Gi          0B       8.0Gi

Here free is 12 GB — plenty. But the pattern is the same: buff/cache is 6.6 GB of page cache that the kernel built up from disk reads. It's not waste. It's usable memory being put to work.

The rule: the system is healthy as long as available stays comfortably above zero and swap usage (the used column on the Swap: line) is not climbing rapidly.
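
If you want to apply that rule in a script, a small parser is enough. The sketch below works on /proc/meminfo-style text; the helper names and the SAMPLE values are made up for illustration (they are not readings from the machine used in this series).

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value}, values in kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def available_fraction(info: dict) -> float:
    """Share of RAM a new allocation could plausibly get."""
    return info["MemAvailable"] / info["MemTotal"]


def swap_used_kb(info: dict) -> int:
    return info["SwapTotal"] - info["SwapFree"]


# Hypothetical sample, shaped like a loaded machine with almost no 'free' memory.
SAMPLE = """\
MemTotal:       32000000 kB
MemFree:          200000 kB
MemAvailable:    7600000 kB
SwapTotal:       8388604 kB
SwapFree:        8388604 kB
"""
```

On a real system you would pass the contents of /proc/meminfo instead of SAMPLE, and compare two readings a few seconds apart to see whether swap usage is growing.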

# One-liner to check memory health
$ free -h && grep -E '^(SwapFree|Dirty|Writeback)' /proc/meminfo

Try It Yourself

1. The big picture

free -h
cat /proc/meminfo

2. Where is memory going, by process?

# Sort by RSS (resident, actual physical pages)
ps -eo pid,rss,vsz,comm --sort=-rss | head -15

# VSZ is virtual size (mapped but not necessarily in RAM)
# RSS is what's actually resident in physical memory

3. Per-VMA detail for a process

# smaps shows Rss, Pss (proportional), and sharing info per mapping
cat /proc/$$/smaps | grep -E '^(Size|Rss|Pss|Shared_Clean|Private)' | head -30

4. Proportional memory use across the system

# PSS (Proportional Set Size) divides shared pages fairly between sharers
sudo cat /proc/*/smaps_rollup 2>/dev/null | grep ^Pss | awk '{s+=$2} END {print s/1024 "MB total PSS"}'

5. The buddy allocator's free list

# Higher-order blocks = less fragmentation = better large allocations
cat /proc/buddyinfo

6. Slab allocator caches

sudo slabtop -o | head -20
# OBJ SIZE × SLABS × OBJ/SLAB = total memory used by each cache

7. Page fault counts since boot

grep -E '^pgfault|^pgmajfault' /proc/vmstat
# pgmajfault counts faults that needed disk I/O (file reads or swap-ins); low is good

8. Is kswapd doing any work?

grep -E 'pgsteal_kswapd|pgscan_kswapd|pswp' /proc/vmstat
# pgsteal_kswapd is cumulative since boot; if it keeps rising between two readings,
# kswapd is actively reclaiming and the system is under memory pressure

9. Drop the page cache and watch free change

# Sync first to avoid dropping dirty data
sync

# Drop page cache, dentries, and inodes (3 = all)
# This is SAFE — it's read-only metadata and clean file data
echo 3 | sudo tee /proc/sys/vm/drop_caches

# free will jump; buff/cache will drop
free -h

# The cache refills automatically as you read files

10. OOM scores across the system

# Which process would the OOM killer target?
ps -eo pid,oom_score,rss,comm --sort=-oom_score | head -10

# Check if any process is protecting itself from OOM
grep -r -l '\-1000' /proc/*/oom_score_adj 2>/dev/null | head -5

Putting It All Together

Process writes to a new heap page
    │
    │  CPU walks page table → PTE not present
    ▼
Minor page fault
    │  kernel allocates a physical frame (buddy allocator: order-0)
    │  SLUB if kernel object; direct page if user allocation
    │  zeroes the frame (security: no previous content leaks)
    │  installs PTE: virtual page → physical frame, writable
    │
    ▼ frame is now in memory, process continues

Process reads a file it hasn't opened before
    │
    │  read() → VFS → page cache lookup → miss
    ▼
Major page fault (or synchronous read)
    │  block I/O reads the file's page from disk
    │  page lands in page cache (indexed by inode + offset)
    │  PTE installed (for mmap) or data copied to user buffer (for read())
    │
    ▼ page cache entry persists

File page stays in page cache
    │
    ├── accessed again → serves from RAM, no disk I/O
    │
    └── memory pressure → kswapd
        │
        ├── page is clean → drop for free
        │   (re-read from disk if accessed again → major fault)
        │
        └── page is dirty → writeback, then drop

Anonymous page under pressure
    │
    └── kswapd writes to swap device
        │  frame is freed
        │  PTE updated: page-not-present (swap entry kept in the PTE itself)
        │
        └── process accesses the page again
            │  major fault: read from swap → restore frame → PTE updated
            ▼  slow, but correct

Memory completely exhausted
    │
    └── OOM killer: select highest oom_score process → SIGKILL
        frame freed, pressure relieved

The page cache is the unifying thread: read(), write(), mmap(), executable loading, and library sharing all flow through it. The kernel's job is to keep the working set of your running processes in RAM while using the rest for cache — and to reclaim that cache quietly, before you notice, whenever someone needs more memory.


What's Next

In Chapter 5, we'll look at the scheduler: how the kernel decides which of many runnable processes gets CPU time. We've already seen task_struct and its se.vruntime field (virtual runtime). Chapter 5 fills that in: what virtual runtime means, how nice values translate to actual CPU share, what load average really measures, and how to diagnose scheduler-related latency.


Part of the Linux Deep Dive series.

Linux Deep Dive #5: The Scheduler — How the Kernel Decides What Runs

Target: Fedora 43, kernel 6.19.11. Every command in this post is something you can run yourself.


Right now, /proc/loadavg on this machine shows:

0.24 0.38 0.37 2/1951 75458

The 2/1951 means 2 tasks are runnable at this exact moment out of 1,951 that exist (the kernel counts every thread as a task). The system has 12 CPUs, so it could run 12 simultaneously, but only 2 have something to do right now. The rest are waiting.
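
The fields of that line are easy to pull apart in a script. A minimal sketch (the LoadAvg type and parse_loadavg name are just for illustration):

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    one: float       # 1-minute load average
    five: float      # 5-minute load average
    fifteen: float   # 15-minute load average
    runnable: int    # tasks runnable right now
    total: int       # tasks that exist
    last_pid: int    # most recently assigned PID


def parse_loadavg(line: str) -> LoadAvg:
    """Split one /proc/loadavg line into typed fields."""
    one, five, fifteen, tasks, last_pid = line.split()
    runnable, total = tasks.split("/")
    return LoadAvg(float(one), float(five), float(fifteen),
                   int(runnable), int(total), int(last_pid))
```

Calling parse_loadavg("0.24 0.38 0.37 2/1951 75458") gives runnable=2 and total=1951, the two numbers discussed above.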

That "waiting" is almost never passive. Processes sleep voluntarily (waiting for I/O, a timer, a lock), then wake up demanding CPU time. Thousands of times per second, the kernel must answer the same question: out of everyone who wants to run right now, who goes next?

That decision belongs to the scheduler.


The Big Picture

The Linux scheduler is not a single algorithm. It's a layered system with multiple scheduling classes, each handling different process types. At the top level:

┌─────────────────────────────────────────────────────────┐
│                    Scheduling Classes                    │
│  (highest priority checked first, runs if runnable)      │
│                                                          │
│  SCHED_DEADLINE  ← real-time: explicit deadline/period  │
│  SCHED_FIFO      ← real-time: first in, no time slice   │
│  SCHED_RR        ← real-time: round-robin with slice    │
│  SCHED_NORMAL    ← normal processes (you, bash, chrome) │
│  SCHED_BATCH     ← CPU-bound background work            │
│  SCHED_IDLE      ← lower than nice +19                  │
└─────────────────────────────────────────────────────────┘

Real-time classes (DEADLINE, FIFO, RR) always preempt normal processes. Within the normal class, the scheduler runs EEVDF — Earliest Eligible Virtual Deadline First — the algorithm that replaced CFS in Linux 6.6.


From CFS to EEVDF

To understand EEVDF, you need to understand what it replaced and why.

The CFS mental model

CFS (Completely Fair Scheduler), introduced in Linux 2.6.23, was built around one idea: model an ideal CPU on which every runnable process gets its fair share of CPU time, and track precisely how far each process has drifted from that ideal.

The mechanism: each process has a vruntime (virtual runtime) counter, measured in nanoseconds. Every nanosecond a process runs, its vruntime increases. But the rate of increase depends on the process's priority — a high-priority process's vruntime grows more slowly, so it gets selected to run more often.

CFS stored all runnable processes in a red-black tree keyed by vruntime. The scheduler always picked the process with the smallest vruntime — the one that has received the least CPU time. After running for one time slice, its vruntime increased, and it was reinserted at its new position. Whoever now had the smallest vruntime ran next.

This made the scheduler O(log n) for most operations (red-black tree insertion/lookup), with the property that no process ever accumulated more than one time slice of "debt" relative to any other.
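
The pick-the-smallest-vruntime loop is short enough to simulate. This is a toy model, not kernel code: a binary heap stands in for the red-black tree, every task is always runnable, and the function name and parameters are invented for the sketch.

```python
import heapq

NICE_0_WEIGHT = 1024


def simulate_cfs(weights: dict, slice_ns: int, steps: int) -> dict:
    """Run `steps` slices, always picking the task with the smallest vruntime.

    weights maps a task name to its load weight. Returns real CPU
    nanoseconds received per task.
    """
    # Heap entries are (vruntime, name); ties are broken by name.
    heap = [(0, name) for name in sorted(weights)]
    heapq.heapify(heap)
    cpu_ns = {name: 0 for name in weights}
    for _ in range(steps):
        vruntime, name = heapq.heappop(heap)
        cpu_ns[name] += slice_ns
        # Heavier tasks are charged less virtual time for the same real time.
        vruntime += slice_ns * NICE_0_WEIGHT // weights[name]
        heapq.heappush(heap, (vruntime, name))
    return cpu_ns
```

Two tasks of equal weight split the CPU evenly. Give one of them three times the weight and it ends up with three times the CPU time, purely because its vruntime advances a third as fast.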

The EEVDF refinement

CFS had a problem: it picked the "most deserving" process but had no concept of when that process needed to run. A low-latency audio thread and a CPU-hungry compiler could have the same vruntime, but they have wildly different timing requirements.

EEVDF (Earliest Eligible Virtual Deadline First, landed in Linux 6.6) adds an explicit virtual deadline alongside vruntime. Each process gets a time slice — a requested service quantum. Its virtual deadline is computed as:

virtual_deadline = vruntime + time_slice / weight

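The scheduler first narrows the field to eligible processes: those that have not yet received more than their fair share (their vruntime is at or below the run queue's weighted average). Among those, it picks the one whose virtual deadline is earliest, not just the one with the smallest vruntime. A shorter slice means an earlier deadline, so short-running, latency-sensitive tasks are preferred over long-running ones even when they have similar accumulated CPU time.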
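
The selection rule can be sketched in a few lines. This is an illustrative model under stated assumptions, not the kernel's implementation: the Task class and pick_eevdf are hypothetical names, eligibility is taken to mean "vruntime at or below the weighted average vruntime", and the deadline uses the formula above scaled by the nice-0 weight (1024) so that a nice-0 task's deadline is simply vruntime + slice.

```python
from dataclasses import dataclass

NICE_0_WEIGHT = 1024


@dataclass
class Task:
    name: str
    vruntime: int                 # virtual nanoseconds already received
    slice_ns: int                 # requested slice, in real nanoseconds
    weight: int = NICE_0_WEIGHT

    @property
    def deadline(self) -> int:
        return self.vruntime + self.slice_ns * NICE_0_WEIGHT // self.weight


def pick_eevdf(tasks: list) -> Task:
    """Pick the eligible task with the earliest virtual deadline.

    `tasks` must be non-empty. The task with the smallest vruntime is
    always eligible, so the eligible list is never empty.
    """
    total_weight = sum(t.weight for t in tasks)
    weighted_sum = sum(t.weight * t.vruntime for t in tasks)
    # vruntime <= weighted average, compared without dividing.
    eligible = [t for t in tasks if t.vruntime * total_weight <= weighted_sum]
    return min(eligible, key=lambda t: t.deadline)
```

Try it with an audio task (short slice), a compiler (long slice) at the same vruntime, and a third task that has run ahead of the average: the audio task wins on deadline, and the task that ran ahead is skipped even if its deadline is the earliest of all.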

What you can see

On this machine (kernel 6.19.11), every SCHED_NORMAL process has an explicit time slice of 2.8 ms:

$ chrt -p $$
pid 75352's current scheduling policy: SCHED_OTHER
pid 75352's current scheduling priority: 0
pid 75352's current runtime parameter: 2800000

That 2800000 is nanoseconds — 2.8 ms. That's the EEVDF slice. CFS had no such fixed slice; it calculated a "target latency" divided by the number of runnable processes. EEVDF makes it explicit.
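
Where does 2.8 ms come from? As far as I can tell from kernel/sched/fair.c, the default is a base slice scaled logarithmically by the number of online CPUs, with the CPU count capped. The sketch below reproduces that arithmetic. The 700 µs base and the cap of 8 CPUs are assumptions you should check against your kernel's source, and default_slice_ns is a made-up name.

```python
def default_slice_ns(online_cpus: int, base_ns: int = 700_000, cpu_cap: int = 8) -> int:
    """Assumed default: base slice * (1 + log2(min(online_cpus, cpu_cap)))."""
    capped = max(1, min(online_cpus, cpu_cap))
    factor = 1 + (capped.bit_length() - 1)   # bit_length() - 1 is an integer log2
    return base_ns * factor
```

For the 12-CPU machine in this chapter, the count is capped at 8, the factor is 1 + 3 = 4, and the result matches the 2800000 ns that chrt reports.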

You can see each process's vruntime directly:

$ cat /proc/$$/sched | head -10
bash (75352, #threads: 1)
-------------------------------------------------------------------
se.exec_start                                :      47483595.680258
se.vruntime                                  :     102684.252917
se.sum_exec_runtime                          :          5.248837
se.nr_migrations                             :                 0
nr_switches                                  :                62
nr_voluntary_switches                        :                26
nr_involuntary_switches                      :                 0
se.load.weight                               :           1048576

Key fields:

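  • se.vruntime — 102,684 ms of virtual time. This is the scheduler's currency. It is not this process's personal total: a task is placed near its run queue's current virtual time when it is enqueued, which is why the value dwarfs sum_exec_runtime.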
  • se.sum_exec_runtime — only 5.2 ms of actual CPU time. This process barely runs.
  • nr_voluntary_switches — 26 times this process gave up the CPU willingly (waiting for I/O or input).
  • nr_involuntary_switches — 0. The scheduler never had to yank the CPU away from it. It always yielded on its own.

Compare that to a busier process, systemd (PID 1):

$ cat /proc/1/sched | head -10
systemd (1, #threads: 1)
-------------------------------------------------------------------
se.exec_start                                :      47483479.684417
se.vruntime                                  :        106.342126
se.sum_exec_runtime                          :       2724.968813
nr_switches                                  :              11033
nr_voluntary_switches                        :              10422
nr_involuntary_switches                      :                611
se.load.weight                               :           1048576

systemd has nr_involuntary_switches: 611. That is 611 times the scheduler took the CPU away, either because the slice ran out or because a more deserving task became runnable. That's normal for a long-running process handling many short requests.


Nice Values and Weights

Not all SCHED_NORMAL processes are equal. The nice value controls how fast a process's vruntime accumulates — and therefore how much CPU time it gets.

Nice values range from -20 (highest priority) to +19 (lowest). Default is 0. The name comes from "being nice" to other processes: a process with a high nice value voluntarily gives up CPU time.

The kernel converts nice values to weights using a fixed table where each step is approximately 1.25×:

Nice   Weight   Relative to nice-0
-20    88761    ~86× more
-10    9548     ~9.3×
-5     3121     ~3.1×
0      1024     baseline
+5     335      ~0.33×
+10    110      ~0.11×
+19    15       ~68× less

The vruntime delta per nanosecond of real time is:

delta_vruntime = delta_exec × (1024 / weight)

A nice -20 process (weight 88761) accumulates vruntime 86× slower than a nice 0 process. Since the scheduler picks based on earliest deadline (derived from vruntime), the nice -20 process receives about 86× as much CPU time when the two compete for one CPU: roughly 98.9% versus 1.1%. A nice +19 process (weight 15) accumulates vruntime 68× faster, so it falls behind quickly and runs rarely.
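
The weight arithmetic is worth trying with your own numbers. A minimal sketch, using only the weights listed in the table above (the helper names are made up, and the share calculation assumes all tasks are CPU-bound and competing for a single CPU):

```python
# Subset of the nice-to-weight table, copied from the rows shown above.
NICE_TO_WEIGHT = {-20: 88761, -10: 9548, -5: 3121, 0: 1024, 5: 335, 10: 110, 19: 15}


def delta_vruntime(delta_exec_ns: int, weight: int) -> float:
    """Virtual time charged for delta_exec_ns of real CPU time."""
    return delta_exec_ns * 1024 / weight


def cpu_share(nice: int, competing_nices: list) -> float:
    """Fraction of one CPU a task gets against the given competitors."""
    weight = NICE_TO_WEIGHT[nice]
    total = weight + sum(NICE_TO_WEIGHT[n] for n in competing_nices)
    return weight / total
```

cpu_share(0, [0]) is 0.5, as you would expect. cpu_share(0, [5]) comes out near 0.75: five nice steps turn an even split into roughly three to one.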

You can see nice values live on this machine:

$ ps -eo pid,ni,comm --sort=ni | head -5
    PID  NI COMMAND
      9 -20 rcu_tasks_kthread
     11 -20 rcu_tasks_rude_
     12 -20 rcu_tasks_trace
   5297 -11 pipewire

PipeWire (the audio server) runs at nice -11. That's why audio doesn't stutter when the CPU is loaded. The kernel scheduler ensures it gets CPU time well ahead of normal-priority processes.

$ ps -eo pid,ni,comm --sort=-ni | head -3
    PID  NI COMMAND
 136282 +19 khugepaged
  75352   0 bash

khugepaged — the background thread that collapses small pages into huge pages — runs at nice +19. It gets CPU time only when nothing else wants it. Perfect for a background maintenance task.

You can change nice values:

# Start a process at nice +10
nice -n 10 make -j12

# Change a running process's nice value (requires root to go more negative)
renice -n 5 -p <pid>

Scheduling Classes in Detail

When the scheduler runs, it checks each class from highest to lowest priority. If a real-time process is runnable, it runs — the normal class doesn't get a look in.

SCHED_DEADLINE

The highest priority class. A process declares its needs explicitly:

period=100ms, runtime=10ms, deadline=50ms

"Every 100ms, give me 10ms of CPU time, and deliver it within 50ms of each period's start."

The kernel enforces this using CBS (Constant Bandwidth Server) — it tracks how much runtime each deadline task has consumed and throttles it if it overruns. Admission control rejects new DEADLINE tasks if they'd make the system unschedulable.
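
The admission test boils down to a utilization sum. The sketch below is a simplified model of it, assuming the limit is the number of CPUs times sched_rt_runtime_us / sched_rt_period_us (with the common 950000 / 1000000 defaults), and ignoring cpusets and per-root-domain accounting. The function names are hypothetical.

```python
from fractions import Fraction


def total_utilization(tasks: list) -> Fraction:
    """tasks is a list of (runtime_us, period_us) pairs."""
    return sum((Fraction(runtime_us, period_us) for runtime_us, period_us in tasks),
               Fraction(0))


def admits(tasks: list, cpus: int,
           rt_runtime_us: int = 950_000, rt_period_us: int = 1_000_000) -> bool:
    """True if the combined bandwidth of all DEADLINE tasks fits under the limit."""
    limit = cpus * Fraction(rt_runtime_us, rt_period_us)
    return total_utilization(tasks) <= limit
```

The example task above (10 ms of runtime every 100 ms) uses 10% of a CPU. On a single CPU, nine of them fit under the 95% limit and a tenth would be rejected.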

Real-time audio, video capture, and industrial control systems use this. On this machine, the RT bandwidth pool is configured conservatively:

$ cat /proc/sys/kernel/sched_rt_period_us
1000000
$ cat /proc/sys/kernel/sched_rt_runtime_us
950000

RT tasks (FIFO, RR, DEADLINE combined) can use at most 95% of each 1-second period. The remaining 5% is reserved for normal processes — a safety valve to prevent a runaway RT process from completely starving the system.

SCHED_FIFO and SCHED_RR

Classic POSIX real-time policies. Both run at a fixed priority (1–99, higher is better):

  • FIFO: Runs until it blocks, explicitly yields, or is preempted by a higher-priority real-time task. No time slicing. Preempts any lower-priority process immediately.
  • RR (Round Robin): Like FIFO, but with a time slice. After the slice expires, it moves to the back of its priority queue. Same priority means round-robin; higher priority always wins.

The kernel threads that migrate processes between CPUs use FIFO:

$ ps -eo pid,cls,rtprio,comm | grep migration | head -4
   30 FF      99 migration/0
   37 FF      99 migration/1
   44 FF      99 migration/2
   51 FF      99 migration/3

FF = FIFO, rtprio 99 — the highest possible RT priority. These threads must run instantly when a CPU balance decision is made; nothing should ever preempt them.

SCHED_NORMAL (EEVDF)

The class that runs the rest: your shell, your browser, your compiler. This is where EEVDF operates. The se.load.weight: 1048576 in the /proc/$$/sched output is the internal representation of nice 0 (1024 × 1024 for precision scaling).

SCHED_BATCH and SCHED_IDLE

  • BATCH: Like NORMAL, but the scheduler assumes it's a CPU-bound job and avoids waking it up aggressively. Useful for batch processing that shouldn't interrupt interactive tasks.
  • IDLE: Lower than nice +19. These tasks are still handled by the normal class, but with a tiny weight, so in practice they get CPU time only when almost nothing else is runnable.

Preemption: Who Can Interrupt Whom

A process runs until something better comes along. But "something better" can mean different things depending on the kernel's preemption configuration.

This machine's kernel is compiled with PREEMPT_DYNAMIC:

$ uname -a
Linux hostname 6.19.11-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC ...

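PREEMPT_DYNAMIC means the preemption model is not fixed at compile time: it can be selected with the preempt= boot parameter, and switched on a running system through /sys/kernel/debug/sched/preempt. The possible modes: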

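Mode        When preemption happens
none        Kernel code is never preempted; tasks switch only when they block, yield, or return to user space
voluntary   As none, plus explicit preemption checks placed in long-running kernel paths
full        At almost any point in kernel code, except inside critical sections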

full preemption gives the best latency — a high-priority process can interrupt even kernel code (with appropriate locking). none gives the best throughput on server workloads.

Regardless of preemption mode, certain events always trigger a reschedule check:

  • A timer interrupt fires (every 1ms by default with CONFIG_HZ=1000)
  • A process unblocks (wakes from sleep)
  • A process's time slice expires
  • A higher-priority process becomes runnable

When a reschedule is needed, the kernel sets the TIF_NEED_RESCHED flag on the current process. The scheduler runs at the next safe preemption point.


SMP: Per-CPU Run Queues and Load Balancing

With 12 CPUs, the scheduler doesn't have one global queue — it has one run queue per CPU. This eliminates a massive bottleneck: CPUs don't fight over a single lock to pick their next task.

$ nproc
12

Each CPU has its own:

  • EEVDF run queue: the red-black tree of normal-priority runnable tasks
  • RT run queue: lists of RT tasks per priority level
  • DL run queue: deadline tasks with their CBS accounting

When you fork a new process, the kernel picks which CPU to assign it to. When a process wakes up, the scheduler prefers the CPU it last ran on, because its data is likely still warm in that CPU's L1/L2 cache, though it may choose an idle neighbor instead. The scheduler tries to keep processes on the same CPU as long as the load stays balanced.

Load balancing

Load balancing runs periodically, triggered by the scheduler tick. The process:

  1. Find the busiest CPU in the current scheduling domain
  2. If it's significantly busier than the local CPU, pull some tasks over
  3. Prefer tasks that are "cache cold" (haven't run recently) to minimize cache pollution from migration

The migration/N threads (FIFO priority 99) handle the actual task movement:

$ ps -eo pid,cls,rtprio,comm | grep migration
   30 FF      99 migration/0
   37 FF      99 migration/1
   44 FF      99 migration/2
   51 FF      99 migration/3
   58 FF      99 migration/4
   65 FF      99 migration/5
   72 FF      99 migration/6
   79 FF      99 migration/7
   86 FF      99 migration/8
   93 FF      99 migration/9
  100 FF      99 migration/10
  107 FF      99 migration/11

One per CPU. They run at the highest RT priority so they can interrupt anything to perform the migration instantly.

Scheduling domains

The scheduler understands CPU topology — SMT siblings (hyperthreads), cores sharing an L2, NUMA nodes. Balancing is done hierarchically: first balance within a core (hyperthreads), then within a socket, then between sockets (NUMA). Crossing NUMA boundaries is expensive and done conservatively.

$ cat /sys/devices/system/cpu/cpu0/topology/core_siblings_list
0-11

On this machine, all 12 CPUs are in one scheduling domain (no NUMA).

Autogroups

With sched_autogroup_enabled (on by default):

$ cat /proc/sys/kernel/sched_autogroup_enabled
1

The scheduler groups processes by session. If you run make -j12 in a terminal, all 12 build jobs share one autogroup. A simultaneous make -j12 in another terminal gets its own autogroup. Each group gets 50% of CPU — regardless of how many tasks each contains. This prevents a parallel build from starving an interactive shell session even when it spawns many more threads.
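
The effect is easiest to see as arithmetic. This toy model assumes a single saturated CPU, equal-weight groups, and CPU-bound tasks throughout; the function names are invented for the sketch.

```python
def grouped_shares(groups: dict) -> dict:
    """groups maps a group name to its number of runnable tasks.

    Returns the CPU fraction each task in that group receives when the
    CPU is split between groups first, then between tasks in a group.
    """
    busy = {name: n for name, n in groups.items() if n > 0}
    if not busy:
        return {}
    group_share = 1 / len(busy)
    return {name: group_share / n for name, n in busy.items()}


def flat_share(groups: dict) -> float:
    """Per-task CPU fraction if all tasks competed directly, with no groups."""
    return 1 / sum(groups.values())
```

With a 12-job build in one terminal and a single shell task in another, the shell task gets half the CPU under grouping. Without groups it would get one thirteenth.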


Load Average: What It Actually Means

$ cat /proc/loadavg
0.24 0.38 0.37 2/1951 75458

The three numbers (0.24, 0.38, 0.37) are exponentially weighted moving averages (EWMA) of the number of processes in:

  • RUNNING state — actually on a CPU or ready to run
  • UNINTERRUPTIBLE state — D state, waiting for I/O that cannot be interrupted

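The time constants are 1 minute, 5 minutes, and 15 minutes. The "average" doesn't mean a plain average over that full interval: it's an EWMA that weights recent values more heavily. After one time constant, an old sample's influence has decayed to 1/e (about 37%), so the half-life is roughly 0.69 of the time constant, or about 42 seconds for the 1-minute figure.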
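
You can reproduce the shape of that curve with a few lines of floating-point math. This is a sketch of the idea only: the kernel does the same update in fixed-point arithmetic, and the 5-second sampling interval is an assumption based on my reading of the load-average code. The function names are made up.

```python
import math


def update_load(load: float, active: float,
                interval_s: float = 5.0, window_s: float = 60.0) -> float:
    """One EWMA step: decay the old value, blend in the current task count."""
    decay = math.exp(-interval_s / window_s)
    return load * decay + active * (1.0 - decay)


def simulate(active: float, seconds: int,
             window_s: float = 60.0, start: float = 0.0) -> float:
    """Hold `active` runnable tasks constant for `seconds`, sampling every 5 s."""
    load = start
    for _ in range(seconds // 5):
        load = update_load(load, active, 5.0, window_s)
    return load


def half_life_s(window_s: float) -> float:
    return window_s * math.log(2)
```

Starting from an idle system, one busy task for a full minute lifts the 1-minute figure to about 0.63 (that is 1 - 1/e), not to 1.0. It takes several minutes of steady load before the number settles.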

A load average of 1.0 on a single-CPU machine means the CPU is perfectly saturated. On this 12-CPU machine, 12.0 would mean saturation. At 0.24, we have plenty of headroom.

The D-state trap

The most important thing to understand about load average: it includes uninterruptible I/O wait, not just CPU usage.

A process blocked on a disk read is in D state. It contributes to load average even though it's not using CPU. This is why you can have a high load average with low CPU utilization:

# High load, low CPU: probably I/O-bound
$ uptime
 13:47:01 up  1:46,  2 users,  load average: 8.23, 7.91, 6.44
$ top
%Cpu(s):  3.2 us,  0.8 sy,  0.0 ni, 94.3 id,  1.5 wa,  0.0 hi,  0.2 si

CPUs are idle 94% of the time — but 8+ tasks are blocked on disk, driving up load average without touching CPU.

This distinction matters enormously for diagnosis. Load average alone doesn't tell you whether you're CPU-bound or I/O-bound.


Diagnosing Scheduling Latency

When a process is "slow" or "laggy," the question is: is it slow because it can't get CPU, or because it's doing something slow? The scheduler exposes the data to answer this.

Enable schedstats

Per-process scheduling statistics are disabled by default (they add overhead). Enable them:

echo 1 | sudo tee /proc/sys/kernel/sched_schedstats

Now /proc/PID/schedstat has three columns:

$ cat /proc/$$/schedstat
6814340 4857 35

Column   Meaning
1        Time spent running on CPU (nanoseconds)
2        Time spent waiting in the run queue (nanoseconds)
3        Number of timeslices run

Column 2 is the key diagnostic: time spent runnable but not running. If a process is slow and this number is high, it's not getting CPU — the scheduler is the bottleneck. If this number is low, the process itself is doing something slow (blocking on I/O, lock contention, computation).

For this bash shell: 4,857 ns of wait time across 35 timeslices. Negligible — the shell is interactive and almost never needs CPU.
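
To turn those three numbers into a quick diagnostic, compare wait time with run time. A minimal sketch (the helper names are made up; it parses the line as text, so you can feed it the contents of any /proc/PID/schedstat):

```python
def parse_schedstat(line: str) -> dict:
    """Split a /proc/PID/schedstat line into its three counters."""
    run_ns, wait_ns, timeslices = (int(field) for field in line.split())
    return {"run_ns": run_ns, "wait_ns": wait_ns, "timeslices": timeslices}


def wait_ratio(stat: dict) -> float:
    """Fraction of runnable time spent waiting for a CPU instead of running."""
    runnable_ns = stat["run_ns"] + stat["wait_ns"]
    return stat["wait_ns"] / runnable_ns if runnable_ns else 0.0
```

For the shell above, the ratio is well under 0.1%. A process that is being starved of CPU would show a ratio that keeps climbing between two readings.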

Voluntary vs involuntary switches

From /proc/PID/sched:

nr_voluntary_switches   ← process called schedule() itself (I/O, sleep, lock wait)
nr_involuntary_switches ← scheduler took the CPU away (slice expired or another task preempted it)

A process with many involuntary switches is CPU-hungry: it rarely gives up the CPU on its own, so the scheduler keeps taking it away. A process with many voluntary switches is typically interactive or I/O-bound: it keeps waking up, doing a small piece of work, then blocking again.

For the bash shell: 26 voluntary, 0 involuntary — it always yields before its slice expires. For systemd: 10,422 voluntary, 611 involuntary — it does a lot of short work (voluntary) but occasionally runs long enough to get preempted.

perf sched

For deeper analysis, perf sched records the kernel's scheduler tracepoints (context switches, wakeups, migrations):

# Record scheduler events for 5 seconds
sudo perf sched record -- sleep 5

# Summarize per-process scheduling latency
sudo perf sched latency | head -20

This shows each process's average and maximum scheduling delay: the time from becoming runnable to actually running. On a well-tuned interactive system, delays for normal processes should typically stay within a few milliseconds.

Run queue length

# vmstat: the 'r' column is tasks running + waiting for CPU
$ vmstat 1 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 19581972 308096 6964064    0    0    13     8  170  319  1  0 98  0  0
 1  0      0 19581972 308096 6964064    0    0     0     0  219  403  0  0 99  0  0
 1  0      0 19581972 308096 6964064    0    0     0     0  185  361  0  0 99  0  0

r=2 means 2 tasks are currently running or waiting for a CPU. On a 12-CPU machine, this is nothing. If r consistently exceeds nproc, tasks are queuing for CPU time.


Try It Yourself

1. See EEVDF slice for any process

chrt -p $$
chrt -p 1

2. Read the scheduler's view of any process

cat /proc/$$/sched
cat /proc/1/sched

Look at se.vruntime, nr_voluntary_switches, and nr_involuntary_switches.

3. See nice values for all processes

ps -eo pid,ni,comm --sort=ni | head -10
ps -eo pid,ni,comm --sort=-ni | head -10

4. Run something at a lower priority

nice -n 10 sha256sum /dev/urandom &
ps -o pid,ni,comm -p $!     # the NI column should read 10
kill %1

5. Check real-time scheduling on kernel threads

ps -eo pid,cls,rtprio,comm | awk 'NR == 1 || $3 != "-"' | head -20
# FF = FIFO, RR = round-robin, TS = time-sharing (normal)

6. Enable and read per-process wait time

echo 1 | sudo tee /proc/sys/kernel/sched_schedstats
cat /proc/$$/schedstat
# col 1: ns running, col 2: ns waiting, col 3: timeslices

7. Watch the run queue in real time

vmstat 1
# 'r' column = tasks running + waiting for CPU
# consistently > nproc means CPU saturation

8. Find CPU-hungry processes

top -b -n 1 | head -20

9. Check autogroup status

cat /proc/sys/kernel/sched_autogroup_enabled
# 1 = on: parallel builds in different terminals each get a fair CPU share

10. Profile scheduling latency with perf

sudo perf sched record -- sleep 5
sudo perf sched latency | head -20

Putting It All Together

The scheduler's job seems simple — pick the next process — but the constraints are brutal: maximize CPU utilization, minimize latency for interactive tasks, enforce real-time guarantees, scale across dozens of CPUs, and do all of this in microseconds.

The layered design handles it:

Timer interrupt (every 1ms)
    │
    ▼
Check scheduling classes top-to-bottom:
    │
    ├─ SCHED_DEADLINE task runnable? → run it (CBS enforcement)
    │
    ├─ SCHED_FIFO/RR task runnable? → run it (priority order)
    │
    └─ SCHED_NORMAL: EEVDF
           │
           ├── vruntime tracks accumulated CPU time
           │   (weighted by nice: higher priority → slower vruntime growth)
           │
           ├── virtual_deadline = vruntime + slice / weight
           │   (2.8ms slice, explicit since EEVDF in Linux 6.6)
           │
           └── pick runnable task with earliest virtual_deadline
    │
    ├── Per-CPU run queues: no global lock
    ├── Load balancer: pulls tasks from busy CPUs
    └── migration/N (FIFO 99): moves tasks between CPUs

Accounting you can read:
    ├── nr_involuntary_switches → process is CPU-hungry
    ├── schedstat col 2 → time waiting for CPU (diagnosis)
    └── load average → EWMA of (RUNNING + UNINTERRUPTIBLE)
                       high load + low CPU = I/O bottleneck

The 2/1951 in /proc/loadavg is the whole story compressed: out of 1,951 tasks, 2 had something to compute right now. The scheduler gave them a CPU in microseconds, and the other 1,949 never noticed.

The next chapter goes one level deeper into what all those idle processes are waiting for: the filesystem. Every read(), write(), open() passes through the VFS layer — the kernel abstraction that makes ext4, btrfs, tmpfs, and a network socket look identical to userspace.


Part of the Linux Deep Dive series.

Linux Deep Dive #6: The VFS Layer — How Linux Abstracts Filesystems

Coming soon.

Linux Deep Dive #7: Btrfs — Snapshots, Subvolumes, and CoW

Coming soon.

Linux Deep Dive #8: Block I/O — From read() to the Disk

Coming soon.

Linux Deep Dive #9: composefs — Composing Read-Only Filesystems

Coming soon.

Linux Deep Dive #10: ostree — Immutable Trees and Atomic Updates

Coming soon.

Linux Deep Dive #11: rpm-ostree — Layering Packages on an Immutable Base

Coming soon.

Linux Deep Dive #12: bootc — The OS as a Container Image

Coming soon.

Linux Deep Dive #13: NetworkManager — From Interface Detection to IP Address

Coming soon.