Denis Cormier

The `top` command, part 1.5: `top` experimentation

A lot of the content written for "The top command, part 1" was validated through experimentation. This article walks through some of those experiments, showing how top's output differs when observing different computational loads and task scheduler priorities.

After preparing a CPU-bound program, I experiment with different ways of running it (single-core vs. multi-core, different priorities) to see how top visualizes its usage.

Preparing a CPU-bound program - ackermann_multi

I wanted to prepare a program that was CPU-intensive and had a predictable runtime in the tens of seconds. I decided to use a program that computes the Ackermann function for reasonably difficult parameters.

The program computes ackermann(3, 0), ackermann(3, 0), ackermann(3, 1), ackermann(3, 1), ..., ackermann(3, 13), ackermann(3, 13). I discuss the execution time and the choice of computing each result twice later in the article.
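
For what it's worth, ackermann(3, j) evaluates to 2^(j+3) - 3 (so ackermann(3, 13) = 65533), and the amount of recursion grows geometrically with j, which means the final ackermann(3, 13) computations dominate the total runtime.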

package main

import (
	"log"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	for j := 0; j <= 13; j++ {
		wg.Add(2)
		go func(j int) {
			defer wg.Done()
			runTest(3, j)
		}(j)
		go func(j int) {
			defer wg.Done()
			runTest(3, j)
		}(j)
	}
	wg.Wait()
}

func runTest(i, j int) {
	start := time.Now()
	ans := ackermann(i, j)
	elapsed := time.Since(start)
	log.Printf("Calculated `ackermann(%d, %d) = %d` in %s", i, j, ans, elapsed)
}

func ackermann(m, n int) (ans int) {
	if m == 0 {
		return n + 1
	} else if n == 0 {
		return ackermann(m-1, 1)
	} else {
		return ackermann(m-1, ackermann(m, n-1))
	}
}
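
In case it's useful for reproducing this, each variant in this article is compiled to its own binary so that its name shows up in top's COMMAND column, e.g. something like:

go build -o ackermann_multi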

Experiment 1 - single core vs multi-core

Single core

The first implementation of my program had a different main function that ran each calculation once (not twice) on the main goroutine. Ex: ackermann(3, 0), ackermann(3, 1), ..., ackermann(3, 13)

func main() {
	for j := 0; j <= 13; j++ {
		runTest(3, j)
	}
}

The work is confined to one of my two CPU cores, and the run ends up taking ~35 seconds on my CentOS 8 virtual machine.

Here is the output from top in a separate window during a run of ./ackermann_single

top, press H (threads view)

%Cpu(s): 50.5 us,  0.5 sy,  0.0 ni, 49.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 3797 root      20   0  102784   4392    768 R 99.9  0.1   0:14.34 ackermann_singl

Throughout the ~35 seconds of execution, overall CPU usage stays constant at 50%. What was most surprising was that the computation was not pinned to a single core...

top, press H (threads view), press 1 (single-CPU view)

top - 20:03:48 up 105 days, 14:53,  2 users,  load average: 0.65, 0.33, 0.20
Threads: 249 total,   2 running, 247 sleeping,   0 stopped,   0 zombie
%Cpu0  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  1.0 us,  1.0 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3881240 total,  1892088 free,   244080 used,  1745072 buff/cache
KiB Swap:  4194300 total,  3899260 free,   295040 used.  2659276 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 3790 root      20   0  104192   7556    776 R 99.9  0.2   0:24.25 ackermann_singl

That's one CPU core at 100%, and subsequent top refreshes (during the same run) would show the other CPU core spiking to 100% instead, while overall CPU usage stayed at 50%. The CPU usage would bounce back and forth between the two cores.

top - 19:59:31 up 105 days, 14:49,  2 users,  load average: 0.25, 0.17, 0.13
Threads: 248 total,   2 running, 246 sleeping,   0 stopped,   0 zombie
%Cpu0  :  1.0 us,  0.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3881240 total,  1895080 free,   241368 used,  1744792 buff/cache
KiB Swap:  4194300 total,  3899260 free,   295040 used.  2662020 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 3750 root      20   0  102784   4924    760 R 99.0  0.1   0:17.15 ackermann_singl
 3749 root      20   0  162132   2472   1616 R  1.0  0.1   0:00.34 top

I was not able to explain why the process would switch CPUs throughout the run. Many internet posts hint at the scheduling algorithm adapting to other significant work on the same machine, but as far as I can tell, there was no other significant work competing for the CPU.
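
One follow-up (not something I tried for this article) would be to pin the process to a single core with taskset and check whether the back-and-forth disappears:

taskset -c 0 ./ackermann_single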

For reference, the load average observed during the single-core run:

load average: 0.95, 0.61, 0.31

Multi-core

To put both CPU cores to work, I rewrote the main function so that each computation runs in its own goroutine. Note: the programmer only prepares the goroutines; the Go runtime itself takes care of spreading them across OS threads (see the short sanity check after the code below).

Also, to keep the runtime the same and make it likely for the scheduler to spread work evenly between my two CPU cores, I duplicated each calculation.

func main() {
	var wg sync.WaitGroup
	for j := 0; j <= 13; j++ {
		wg.Add(2)
		go func(j int) {
			defer wg.Done()
			runTest(3, j)
		}(j)
		go func(j int) {
			defer wg.Done()
			runTest(3, j)
		}(j)
	}
	wg.Wait()
}
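
As a quick sanity check on that note (a minimal sketch, not part of the experiment), printing the CPU count and GOMAXPROCS shows how many OS threads can execute Go code simultaneously; since Go 1.5, GOMAXPROCS defaults to the number of logical CPUs, so on this 2-core VM both values should be 2.

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) queries the current value without changing it.
	fmt.Println("logical CPUs:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:  ", runtime.GOMAXPROCS(0))
}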

top, press H (threads view)

%Cpu(s): 99.0 us,  0.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 3829 root      20   0  103040   8884    772 R 99.0  0.2   0:07.77 ackermann_multi
 3826 root      20   0  103040   8884    772 R 97.1  0.2   0:07.67 ackermann_multi

Here, CPU usage is maxed out at 100%, and we also see it in the single-CPU view.

top, press H (threads view), press 1 (single-CPU view)

%Cpu0  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 3819 root      20   0  103040  10996    776 R 99.9  0.3   0:19.44 ackermann_multi
 3822 root      20   0  103040  10996    776 R 99.9  0.3   0:19.23 ackermann_multi

For comparison, the load average during the multi-core run:

load average: 2.52, 1.57, 0.79

Experiment 2 - niceness and priority

Moving forward, we stick to running the concurrent version of the program.

nice -n -1 ./ackermann_multi

# top cpu usage output
%Cpu(s): 99.8 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2758 root      19  -1  103040   7028    800 R 198.0  0.2   0:05.98 ackermann_multi

./ackermann_multi

# top cpu usage output
%Cpu(s): 99.8 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2765 root      20   0  103040   8608    772 R 200.0  0.2   0:12.75 ackermann_multi

nice -n 1 ./ackermann_multi

# top cpu usage output
%Cpu(s):  0.0 us,  0.2 sy, 99.7 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2777 root      21   1  103040  11520    800 R 199.3  0.3   0:26.70 ackermann_multi

When commands run without a changed priority, they run at niceness 0 (priority 20). Possible niceness values range from -20 (least nice, highest priority, gets a bigger share of CPU cycles) to 19 (most nice, lowest priority, gets a smaller share of CPU cycles). The priority value shown by top is niceness + 20, which translates to a range of 0 to 39.
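
To tie that back to the PR and NI columns in the listings above, here is a tiny sketch (not from the original runs) that prints the priority top would display for a few niceness values; the output matches the PR 19 / NI -1, PR 20 / NI 0, and PR 21 / NI 1 rows shown earlier.

package main

import "fmt"

func main() {
	// For normal (non-realtime) tasks, top's PR column is NI + 20.
	for _, ni := range []int{-20, -1, 0, 1, 19} {
		fmt.Printf("niceness %3d -> priority %2d\n", ni, ni+20)
	}
}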

I haven't found a strong reason for keeping niceness and priority as separate scales, other than the convenience for users of reasoning about niceness around a default of 0 (range -20 to 19), and the convenience for the scheduler implementation of working with non-negative priority values (range 0 to 39).

Experiment 3 - niceness in action

The previously reported execution time of ~35 seconds (on my machine) applies at any niceness/priority when the program runs alone.

From 2 terminal windows, I ran the program twice almost simultaneously (start one run, switch to the other window, start the second run) so that one run would be at the highest possible niceness and the other at the lowest possible niceness.

time nice -n -20 ./ackermann_multi

...

real	0m35.950s # Close to the execution time of a normal run that has the CPU to itself
user	1m11.180s # User-space computation time is roughly twice the real time, since the work runs across 2 CPU cores
sys	0m0.172s
time nice -n 19 ./ackermann_multi

...

real	1m11.860s # The -20 run above finished in its usual ~35 seconds, so this run spent those first ~35 seconds with essentially no CPU time
user	1m11.642s # Roughly the same amount of computation time as the run above
sys	0m0.086s

I tried both variations (starting -20 before 19, and starting 19 before -20) and observed a similar pattern.
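
To take the manual start out of the picture, the two runs could also be launched back to back from a single Go program. This is a minimal sketch (not something from the original runs) that assumes ./ackermann_multi is in the current directory and that the negative niceness is requested as root.

package main

import (
	"log"
	"os/exec"
	"sync"
	"time"
)

// run starts ./ackermann_multi at the given niceness and waits for it.
func run(wg *sync.WaitGroup, niceness string) {
	defer wg.Done()
	start := time.Now()
	cmd := exec.Command("nice", "-n", niceness, "./ackermann_multi")
	if err := cmd.Run(); err != nil {
		log.Printf("niceness %s: %v", niceness, err)
		return
	}
	log.Printf("niceness %s finished in %s", niceness, time.Since(start))
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	// Start both runs back to back so they compete for the same two cores.
	go run(&wg, "-20")
	go run(&wg, "19")
	wg.Wait()
}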

I then tried the same thing with niceness -1 vs 0, and with 0 vs 1.

Niceness 0 and 1 (1 before 0 and 0 before 1)

With no meaningful difference between the two orderings, this was my observation:

time nice -n 1 ./ackermann_multi

...

real	1m8.074s
user	1m7.557s
sys	0m0.131s
time ./ackermann_multi

...

real	1m2.088s
user	1m7.567s
sys	0m0.093s

This was the output from top throughout the run:

%Cpu(s): 55.2 us,  0.5 sy, 44.3 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3066 root      20   0  103040   8616    776 R 109.9  0.2   0:12.24 ackermann_multi
 3062 root      21   1  103040   8344    788 R  89.1  0.2   0:10.98 ackermann_multi

The top output clearly shows how the CPU usage percentage is shared between the program at niceness 0 (counted as "unniced", under us) and the one at niceness 1 (counted as "niced", under ni), with a split of ~55% vs ~45%.

At the process level (summing the percentages of each thread), the run at niceness 0 was shown taking 100-120% CPU (110% most of the time) while, conversely, the run at niceness 1 was shown taking 80-100% CPU (90% most of the time).

Niceness -1 and 0 (0 before -1 and -1 before 0)

As expected, the only difference when testing with -1 and 0 (as opposed to 0 and 1) is that both runs are considered "unniced", so the CPU usage of both is registered under us.

%Cpu(s):100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3029 root      19  -1  103040   8088    776 R 109.9  0.2   0:09.65 ackermann_multi
 3025 root      20   0  103040   8352    800 R  90.1  0.2   0:08.76 ackermann_multi

The split of per-process CPU usage (summed across threads) was again ~110% vs ~90% out of a possible 200%, i.e. ~55% + ~45% = ~100% in the summary line, this time grouped entirely under us.