The `top` command, part 1: interpreting high-level CPU information
After running company software on some virtual machines and observing erratic resource usage behaviour, I thought it would be wise to understand what sort of data structures underpin a server and its computations. The `top` command is readily available in many environments to visualize process usage statistics, and has provided insights to countless people already.
Quick disclaimer: I'm aware that there are many tools that improve on the visualizations of `top` (`htop`, `atop`, and more). I found that `top` was more than adequate to prompt a deep dive into the fundamentals of process statistics and task scheduling. Let's jump right in!
Typical `top` output looks more or less like the following:
top - 13:21:37 up 103 days, 8:11, 1 user, load average: 0.28, 0.74, 0.86
Tasks: 127 total, 1 running, 126 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 0.2 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 3881240 total, 976772 free, 316568 used, 2587900 buff/cache
KiB Swap: 4194300 total, 4177276 free, 17024 used. 2430508 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28186 root 20 0 162000 2272 1556 R 0.3 0.1 0:00.09 top
1 root 20 0 193748 6352 2204 S 0.0 0.2 53:16.17 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:03.54 kthreadd
# ... (124 more tasks)
In this article, I want to focus on discovering and explaining the high-level summary statistics (lines 1-3).
I intend to skip over the following items to limit the scope of the article:
- Memory information
- Deeper understanding of input/output
- Deeper understanding of individual tasks
Last update time
top - 13:21:37
`top` will always print the system time at the moment of updating the on-screen statistics.
Strangely enough, without specifying a different delay (and without finding any configuration overrides), I observed a delay of 3 seconds per update, while the `man` pages advertise a default delay of 1.5 seconds. Feel free to adjust this delay by pressing `d` and entering a new delay value in seconds.
System uptime
up 103 days, 8:11
The system uptime is the amount of time the system has been running since its last restart.
It's worth clarifying that the `8:11` is part of the `103 days` (even if two spaces separate the information). We're meant to read this value as "103 days, 8 hours, and 11 minutes".
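If you want the raw number behind this field, it can be read from `/proc/uptime`, which (as far as I can tell) is where `top` gets it. A minimal Python sketch, assuming a Linux system; the formatting is mine:

```python
# /proc/uptime holds two values: seconds since boot and cumulative idle
# seconds (summed across CPUs). Format the first one like top's "up" field.
with open("/proc/uptime") as f:
    uptime_seconds, _idle_seconds = (float(x) for x in f.read().split())

days, rem = divmod(int(uptime_seconds), 86400)
hours, rem = divmod(rem, 3600)
print(f"up {days} days, {hours}:{rem // 60:02d}")
```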
Logged-in users
1 user
This number is described as the number of users currently logged on.
There was no mention of exactly what `1 user` means in `top`'s `man` pages. However, scanning the end of the `man` pages revealed related commands. In particular, `w` appeared to also be a system metrics summary tool like `top`. Here's the output of `w`:
14:13:29 up 103 days, 9:03, 1 user, load average: 0.00, 0.01, 0.05
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
root pts/0 10.3.60.116 13:32 1.00s 0.10s 0.00s w
After hopping over to another terminal window and logging in as root again (without logging out of the first terminal window), I was able to observe the mention of 2 users in `top`, `w`, and `who` as well, and watch the number drop back to 1 after logging out from the second terminal window.
Load average
load average: 0.00, 0.01, 0.05
These numbers represent the average number of running and waiting threads (tasks), measured across 1 minute, 5 minutes, and 15 minutes respectively. That is a very terse explanation, so let's unpack it.
The conclusion of Brendan Gregg - Linux Load Averages explains these numbers and their suggested interpretation:
These system load averages count the number of threads working and waiting to work, and are summarized as a triplet of exponentially-damped moving sum averages that use 1, 5, and 15 minutes as constants in an equation. This triplet of numbers lets you see if load is increasing or decreasing, and their greatest value may be for relative comparisons with themselves.
Explaining exponentially-damped moving sum averages is difficult. The key idea (shown by a graph in the article) is that recent measurements have a large weight while old measurements (even the measurements older than 1/5/15 minutes) have a small weight.
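To make the damping idea concrete, here is a small Python sketch of the recurrence behind the load averages: every 5 seconds the current count of active tasks is blended into each average, with older samples decaying exponentially. The decay constants mirror the 1/5/15-minute time constants; the scenario below is my own toy simulation (the kernel does this in fixed-point arithmetic).

```python
import math

SAMPLE_PERIOD = 5.0  # the kernel samples the run queue every 5 seconds
DECAY = {m: math.exp(-SAMPLE_PERIOD / (m * 60)) for m in (1, 5, 15)}

def update(averages, active_tasks):
    """One 5-second tick: blend the new task count into each average."""
    return {m: averages[m] * d + active_tasks * (1 - d)
            for m, d in DECAY.items()}

# Toy scenario: two busy tasks for five minutes, then an idle system.
averages = {1: 0.0, 5: 0.0, 15: 0.0}
for tick in range(120):                  # 120 ticks = 10 minutes
    active = 2 if tick < 60 else 0
    averages = update(averages, active)
print({m: round(v, 2) for m, v in averages.items()})
```

Running it shows the 1-minute average falling back toward zero much faster than the 15-minute one, which is exactly the "is load increasing or decreasing" signal described in the quote above.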
The interpretation of the system load average is different depending on the underlying operating system (quoted from the same document):
On Linux, load averages are (or try to be) "system load averages", for the system as a whole, measuring the number of threads that are working and waiting to work (CPU, disk, uninterruptible locks). Put differently, it measures the number of threads that aren't completely idle. Advantage: includes demand for different resources.
On other OSes, load averages are "CPU load averages", measuring the number of CPU running + CPU runnable threads. Advantage: can be easier to understand and reason about (for CPUs only).
I noticed that the load average numbers update every 5 seconds even if `top` is set to a different delay (statistics refresh rate). Ray Walker - Examining Load Average explains that the Linux kernel produces the values in `/proc/loadavg`, and the refresh rate of 5 seconds is defined within the function that produces the values.
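You can poll that file directly to see the same numbers that `top`, `w`, and `uptime` report. A quick sketch (Linux-only; field meanings per `proc(5)`):

```python
import time

# /proc/loadavg: three load averages, runnable/total scheduling entities,
# and the PID of the most recently created task.
for _ in range(3):
    with open("/proc/loadavg") as f:
        one, five, fifteen, entities, last_pid = f.read().split()
    print(f"1m={one} 5m={five} 15m={fifteen} tasks={entities} last_pid={last_pid}")
    time.sleep(5)  # matches the kernel's own refresh interval
```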
Task overview
Tasks: 127 total, 1 running, 126 sleeping, 0 stopped, 0 zombie
We're given the total number of tasks and their states at the moment of measurement by `top`.
I immediately got confused by the mention of "tasks" due to my preconceived notion of processes (lines of execution isolated from each other) and threads (lines of execution that share resources such as filesystem info, memory space, signal handlers, and the set of open files). It turns out that in Linux this type of thinking breaks down: every newly created "line of execution" can be described as a new task or process (the terms appear interchangeable), and each one gets a new PID (process ID) regardless of whether any resources (or which specific resources) are shared.
The `clone` system call documentation for the CLONE_THREAD flag mentions what can be considered a "thread" in Linux:
To make the remainder of the discussion of CLONE_THREAD more readable, the term "thread" is used to refer to the processes within a thread group.
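As a quick illustration of thread groups (my own toy example, not part of the original experiment): in the Python sketch below, `os.getpid()` returns the shared thread-group ID, which is what `top` shows as the PID by default, while `threading.get_native_id()` (Python 3.8+, Linux) returns each thread's own kernel task ID. Those per-task IDs are the ones that show up in the per-thread view of `top` introduced a bit further below.

```python
import os
import threading
import time

def worker():
    # getpid() reports the thread-group ID (the PID top shows by default);
    # get_native_id() reports this thread's own kernel task ID.
    print(f"TGID={os.getpid()}  TID={threading.get_native_id()}")
    time.sleep(60)  # linger so the tasks can be inspected in top

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
print(f"main thread: TGID={os.getpid()}  TID={threading.get_native_id()}")
for t in threads:
    t.join()
```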
From observing the fields, it appears that `top`'s traditional view groups tasks belonging to the same thread group.
- Refer to `man top` for a description of each field.
- Notice how the %CPU value is a summation over the threads in the thread group.
USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND PID PPID GID PGRP TGID
root 20 0 102784 4388 768 R 199.4 0.4 0:13.08 `- ackermann_multi 4016 3923 0 4016 4016
The `H` key allows us to toggle to a view of all tasks (now, threads are shown separately).
USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND PID PPID GID PGRP TGID
root 20 0 102784 4388 768 R 99.2 0.3 0:13.08 `- ackermann_multi 4016 3923 0 4016 4016
root 20 0 102784 4388 768 S 0.2 0.3 0:00.01 `- ackermann_multi 4017 3923 0 4016 4016
root 20 0 102784 4388 768 S 0.0 0.3 0:00.00 `- ackermann_multi 4018 3923 0 4016 4016
root 20 0 102784 4388 768 S 99.4 0.3 0:00.00 `- ackermann_multi 4019 3923 0 4016 4016
This matches the description in `top`'s `man` page: "It can display system summary information as well as a list of processes or threads currently being managed by the Linux kernel."
Process states
Running: The process is currently using the CPU.
Sleeping: The process is NOT using the CPU (not directly controlled by users).
- I believe (?) this encompasses the various waiting states such as interruptible sleep and uninterruptible sleep; runnable tasks waiting for a CPU appear to be counted under "running" (further research and experimentation needed).
- The scheduler will put the running process aside in favor of another process for various reasons. Exact reasons would be a topic of further research and experimentation.
- We typically observe many processes sleeping in the OS until events (input-output, or other) wake them.
Stopped: The process is NOT using the CPU (directly controlled by users).
- The process has been suspended by Ctrl-z or `kill -STOP <pid>`
- Consult Dave McKay - How to Run and Control Background Processes on Linux for job control strategies (stop/resume processes, foreground/background processes, and related signals)
Zombie: The process is no longer executing, but its process descriptor is still in memory (a minimal demonstration follows this list).
- Under normal circumstances, this period of time is incredibly small for a given child process and the cleanup initiated by the parent process succeeds.
- The process descriptor (zombie process) lingers if the parent process is unable to obtain the terminated process's status.
- A real problem occurs when too many process descriptors (zombie or not) prevent other processes from being created.
- More details in the `wait` system call documentation under "Notes".
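To see a zombie for yourself, it's enough to let a child exit without reaping it. A minimal sketch, assuming a Linux system (my own example, not taken from the linked documentation):

```python
import os
import sys
import time

pid = os.fork()
if pid == 0:
    sys.exit(0)      # child exits immediately

# Parent: until wait() is called, the child's process descriptor lingers
# and top/ps report it in state Z (zombie).
print(f"child {pid} has exited but not been reaped; look for state Z")
time.sleep(30)       # window in which the zombie is visible
os.waitpid(pid, 0)   # reap the child; the zombie entry disappears
```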
Aggregated CPU percentages
%Cpu(s): 0.2 us, 0.2 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
From the `man` pages for `top`...
us, user : time running un-niced user processes
An un-niced process is a process running at unaltered or raised priority (a nice value of zero or below).
This relates to the concept of niceness, which is explained in more detail in the experimentation post following this article.
sy, system : time running kernel processes
Kernel time can largely be thought of as execution time spent performing system calls on behalf of processes. From Wikipedia - System Calls, we get a glimpse at how system calls work:
[System calls are] the programmatic way in which a computer program requests a service from the kernel of the operating system on which it is executed.
The library's wrapper functions expose an ordinary function calling convention (a subroutine call on the assembly level) for using the system call, as well as making the system call more modular. Here, the primary function of the wrapper is to place all the arguments to be passed to the system call in the appropriate processor registers (and maybe on the call stack as well), and also setting a unique system call number for the kernel to call. In this way the library, which exists between the OS and the application, increases portability.
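To make the wrapper idea a bit more tangible, here is a small sketch that issues the same system call twice: once through the usual library wrapper (`os.getpid()`) and once through libc's generic `syscall()` wrapper with an explicit system call number. The number 39 is `getpid` on x86-64 Linux only; treat it as an illustrative assumption, not something portable.

```python
import ctypes
import os

libc = ctypes.CDLL(None)   # the C library that provides the wrappers
SYS_getpid = 39            # x86-64 specific; see /usr/include/asm/unistd_64.h

# Both paths end up executing the same getpid system call in the kernel.
print("via os.getpid()     :", os.getpid())
print("via libc.syscall(39):", libc.syscall(SYS_getpid))
```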
ni, nice : time running niced user processes
A niced process is a process run with lower priority (higher nice value).
This relates to the concept of niceness, which is explained in more detail in the experimentation post following this article.
id, idle : time spent in the kernel idle handler
Even if the CPU has no work to do, it is still doing something useful: the idle handler runs when no other work can be scheduled, listens for timer or peripheral interrupts, and applies strategies to reduce system power consumption.
More details in Stack Exchange - Idle CPU Process
wa, IO-wait : time waiting for I/O completion
To summarize it in one sentence, 'iowait' is the percentage of time the CPU is idle AND there is at least one I/O in progress.
If the CPU is idle, the kernel then determines if there is at least one I/O currently in progress to either a local disk or a remotely mounted disk (NFS) which had been initiated from that CPU. If there is, then the 'iowait' counter is incremented by one. If there is no I/O in progress that was initiated from that CPU, the 'idle' counter is incremented by one.
hi : time spent servicing hardware interrupts
A hardware interrupt is a condition related to the state of the hardware that may be signaled by an external hardware device [...] to communicate that the device needs attention from the operating system (OS)
si : time spent servicing software interrupts
A software interrupt is requested by the processor itself upon executing particular instructions or when certain conditions are met.
st : time stolen from this vm by the hypervisor
From Stack Exchange - CPU Usage
It represents time when the real CPU was not available to the current virtual machine — it was "stolen" from that VM by the hypervisor (either to run another VM, or for its own needs).
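As far as I can tell, all of these percentages are derived from the cumulative counters the kernel exposes on the first line of `/proc/stat`: `top` samples them and reports each state's share of the delta between two samples. A rough reconstruction of that calculation (Linux-only sketch; field order per `proc(5)`, which lists `nice` before `system`, unlike `top`'s display order):

```python
import time

def cpu_jiffies():
    # First line of /proc/stat:
    # "cpu  user nice system idle iowait irq softirq steal ..."
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:9]]

before = cpu_jiffies()
time.sleep(3)              # comparable to top's refresh delay
after = cpu_jiffies()

deltas = [b - a for a, b in zip(before, after)]
total = sum(deltas) or 1
labels = ["us", "ni", "sy", "id", "wa", "hi", "si", "st"]
print("  ".join(f"{100 * d / total:.1f} {label}" for label, d in zip(labels, deltas)))
```

On a mostly idle machine the `id` share dominates, much like the summary line quoted at the top of the article.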