In addition to debugging facilities, the kernel contains functionality for profiling execution and memory usage in real time. This guide explains how to enable and use those facilities.

But first, a note about profiling. Profiling, especially stack sampling, is a fairly intrusive mechanism. It may severely change the timing or even visibly slow down the system. The act of measuring something changes the measurement.

The "profile" utility is included in all Minoca OS builds and can be used to enable or disable profiling on a running system. No reboots are necessary to start profiling Minoca OS. To enable stack sampling, simply run the following command.

/# profile -e stack

When the task you wish to profile is complete, you can disable stack profiling with the corresponding disable command.

/# profile -d stack

To enable and disable memory profiling, simply replace "stack" with "memory" in the above commands.

If you would like to profile system initialization, profiling can be enabled through the debugger before the shell is available to run the "profile" utility.

Reboot the system, and use the -i debugger command line option to request an initial break-in upon connection. If it worked, you should be able to type "k" and see a call stack like this one:

kd> k
Frame    RetAddr  Call Site
800c0f58 80020818 kernel+0xcff4
800c0f78 80020a66 kernel!KdConnect+0x30
800c0fa8 80006ab1 kernel!KdInitialize+0x124
800c0ff8 00000000 kernel!KepStartSystem+0x7d

If you can already see the banner printed on the target screen or your call stack looks different, you missed the initial breakpoint. Reboot the system and try again.

Now you have to decide whether to enable stack sampling, memory information, or both. Due to the limited bandwidth of the serial debug channel, we recommend enabling only one at a time. To enable stack based profiling, enter:

eb SpEarlyEnabledFlags 0x1

Replace the 0x1 above with 0x2 to enable memory profiling, or replace the 0x1 with 0x3 to enable both stack sampling and memory profiling. Now hit "g" to let the system boot. It should boot normally, and become idle.
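
These flag values form a simple bitmask, as the text above describes. A minimal sketch (the constant names here are invented for illustration; only the numeric values come from this guide):

```python
# Bit values for SpEarlyEnabledFlags as described above.
# The constant names are invented for illustration.
PROFILER_STACK = 0x1   # stack sampling
PROFILER_MEMORY = 0x2  # memory profiling

# Enabling both is the bitwise OR of the two flags.
PROFILER_BOTH = PROFILER_STACK | PROFILER_MEMORY
print(hex(PROFILER_BOTH))  # 0x3
```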

Break in by hitting Control+B.

Enter this at the debugger command prompt:

profile stack start

A pane should appear on the lower left side of the debugger window, containing something to the effect of "Root: 100%, NNNNN" (where NNNNN is replaced by a number). If you don't see anything in the pane, then profiling data is not being sent to the debugger. Retrace your steps above.

The Minoca Debugger can be used to collect stack samples in order to measure CPU usage.

Hit "g", and you'll be able to see the elements update in real time. This example is from a freshly booted system.

Each line in the tree contains three elements. The first element is the address of the hit. Since full stacks are gathered, this may be either a hit in the function directly or a hit in a child function called during this function. The second element is the percentage of total hits that were attributed to this function or its subroutines. The third element is the absolute number of hits for this address. The tree expands toward subroutine calls as inner nodes are opened. In this example, you can see 749 hits, or 1% of the total, are attributed to PspKernelThreadStart+0x1a. As you drill down, you see the majority of those hits are found inside ImpLoadImage, called by IoLoadDriver, called by IopStartDevice. From this we can infer that loading the driver for a device is one of the more CPU intensive tasks of starting a device. Each set of peers in the tree is sorted by the number of hits, so CPU-intensive functions will tend to float to the top.
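
The aggregation just described can be sketched in a few lines. This is not Minoca's implementation, only an illustration of the idea: every frame along a sampled stack receives a hit, and peers sort by hit count. The sample stacks below are made up, reusing function names from the example:

```python
# Illustrative sketch of rolling full stack samples into a hit tree.
def add_sample(root, stack):
    root["hits"] += 1
    node = root
    for frame in stack:  # walk from the root of the stack outward
        node = node["children"].setdefault(frame, {"hits": 0, "children": {}})
        node["hits"] += 1

root = {"hits": 0, "children": {}}
samples = [
    ["PspKernelThreadStart", "IopStartDevice", "IoLoadDriver", "ImpLoadImage"],
    ["PspKernelThreadStart", "IopStartDevice", "IoLoadDriver", "ImpLoadImage"],
    ["PspKernelThreadStart", "IopStartDevice"],
]
for stack in samples:
    add_sample(root, stack)

# Sort peers by hit count, heaviest first, as the profiler pane does.
for name, node in sorted(root["children"].items(),
                         key=lambda kv: -kv[1]["hits"]):
    print(name, node["hits"], f"{100 * node['hits'] // root['hits']}%")
```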

There are a couple of things that are important to understand about this data. First, it is sampled, meaning every so often the profiler will grab the full stack and send it over. Not every function call is logged here, and there is a tradeoff between accuracy and intrusiveness. Higher sampling rate means better data, but also slows things down more. If you're trying to measure CPU load of a particular action, performing the action repeatedly will tend to create a more accurate sampled picture than trying to snap the action once. Second, assembly functions may print incorrect symbols. Notice the overwhelming number of hits in "ArInvalidateTlbEntry+0x13". Together they account for 94% of all hits. This is actually an assembly routine, and it is not in fact ArInvalidateTlbEntry. It is actually ArWaitForInterrupt, which is a function that executes the "hlt" instruction. In other words, it is the processor being idle!
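
The sampling tradeoff above can be seen with a toy simulation (a statistical illustration only, not a model of the Minoca profiler): a function that truly consumes 30% of CPU time is hit by each independent sample with probability 0.3, so the observed share steadies as samples accumulate.

```python
import random

random.seed(0)

def observed_share(num_samples, true_share=0.30):
    # Each sample independently lands in the hot function with
    # probability true_share; return the measured fraction.
    hits = sum(random.random() < true_share for _ in range(num_samples))
    return hits / num_samples

print(observed_share(20))     # noisy with few samples
print(observed_share(20000))  # close to 0.30 with many samples
```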

You can do a couple of things with the stack profiling data. If you have source code, you'll see the source highlighted as you click each element in the tree. This is helpful for profiling your own drivers. You can run "profile stack help" to get an overview of the commands available. You can clear all the sample data. You can also apply a threshold value, in case you only want to see stacks that account for more than X percent of total hits. If you're using the command line version, you can use the "profile stack dump" command to view the data in textual form.

If you used "eb SpEarlyEnabledFlags 0x2" at the initial kernel breakpoint, your debugger will be collecting kernel memory information. Enter this at the debugger command prompt:

profile memory start

A pane will appear at the lower left corner of the debugger window describing the usage of Paged and Non-Paged Pool within the kernel.

The Minoca Debugger can be used to profile the memory usage of Minoca OS.

Hit "g" to let the system go, and you'll see these statistics update in real time. You can collapse the data for either pool if you're not interested in it. You can also sort by any of the columns. Clicking a column multiple times changes whether the column is sorted in ascending or descending order. Descriptions of the columns are below:

Tag: Whenever a pool allocation is made, the caller is required to submit a 32-bit tag value associated with the allocation. By convention, these tend to be 4 ASCII letters. This column identifies the tag of the group of allocations. Each driver should be using a different tag. Some drivers may use several different tags if they find it useful to conceptually group different types of allocations. The kernel uses many different tags, and accounts for most of the tags shown above. The tags have no impact on which memory is allocated; they're used purely as an identifier during debugging and profiling. If you're exploring corrupted memory during a crash dump, pool tags are easy to spot and often provide valuable clues as to who was using (or corrupting) memory.
Largest Alloc: Shows the largest single allocation made for that tag.
Active Bytes: Shows the current number of bytes allocated under that tag.
Max Active Bytes: Shows the highest number of active bytes ever seen for that tag. This is maintained by the kernel pools themselves, so it is not sampled or an estimate.
Active Count: Shows the number of outstanding allocations for that tag.
Max Count: Shows the maximum number of active allocations ever seen for that tag.
Lifetime Alloc: Shows the sum of all bytes ever allocated under this tag, including allocations that have subsequently been released. This field identifies heavy users of the pool, even if they don't ever have a large number of allocations outstanding at once.
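
As a concrete illustration of the 4-ASCII-letter convention, a tag such as KeQl is just a 32-bit integer viewed as four bytes. The little-endian byte order used here is an assumption for the sake of the sketch, not Minoca's definition:

```python
import struct

def tag_to_int(tag):
    # Pack four ASCII characters into one 32-bit value
    # (little-endian byte order assumed for illustration).
    return struct.unpack("<I", tag.encode("ascii"))[0]

def int_to_tag(value):
    # Recover the four characters from the 32-bit value.
    return struct.pack("<I", value).decode("ascii")

print(hex(tag_to_int("KeQl")))          # 0x6c51654b
print(int_to_tag(tag_to_int("KeQl")))   # KeQl
```

This is why tags are easy to spot when scanning raw pool memory in a crash dump: the four letters appear verbatim in the bytes.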

Additionally, the sub-header for each pool type contains some overall information for the pool:

Size: Stores the current size of the pool as a whole, in bytes. In this case, the non-paged pool is 512 kB total.
Allocs: Stores the total number of calls ever made to the allocate function.
Frees: Stores the total number of calls ever made to the free function. The difference between this number and the "Allocs" number is the current number of outstanding allocations.
Failed: Stores the number of times the allocation routine failed and returned NULL to a caller.
Percent Free: Stores the percentage of the pool that is available for allocation.
Free: Stores the number of bytes in the pool that are available for allocation.
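
The relationships among these header fields can be sanity-checked with simple arithmetic. The numbers below are made up for a 512 kB pool; they are not real Minoca statistics:

```python
size = 512 * 1024          # pool size in bytes
allocs, frees = 5000, 4810 # lifetime calls to allocate and free
free_bytes = 384 * 1024    # bytes currently available

# Outstanding allocations are the difference between Allocs and Frees.
outstanding = allocs - frees
# Percent Free is just Free divided by Size.
percent_free = 100 * free_bytes / size

print(outstanding, percent_free)  # 190 75.0
```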

In this case, you can see the largest consumer of non-paged pool is the tag KeQl, which is used by the KeCreateQueuedLock function. We can infer that there are 190 queued locks allocated in the system, which consumes about 16 kB of memory. This seems normal or even low, so we can conclude that this memory profiling information comes from a healthy system that's been freshly booted or lightly loaded.
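
A quick back-of-the-envelope check of that inference (illustrative arithmetic only; the exact lock structure size is not stated in this guide):

```python
active_bytes = 16 * 1024  # ~16 kB attributed to the KeQl tag
active_count = 190        # queued locks outstanding

# Average size per queued lock allocation.
print(active_bytes // active_count)  # 86
```

Roughly 86 bytes per lock is plausible for a small kernel synchronization object, which supports reading this as a healthy, lightly loaded system.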

You can temporarily disable sending profiling data across the debug channel by manipulating the "SpEnabledFlags" variable in the debugger. This flag contains the same bits as the "SpEarlyEnabledFlags" used to enable profiling at boot. Simply clear the flags back to zero to temporarily disable sending profiling data. This generally makes the system much more responsive. It may be useful if you cannot get to a command line to disable profiling via the "profile" utility.