Wednesday, April 04, 2018

Performance Analysis: The USE Method

For every resource, check utilization, saturation, and errors.  (Posted by Jerry Yoakum)


Blatant rip-off of http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/ with a small amount of simplification.


The USE method can be summarized as: for every resource, check utilization, saturation, and errors. While the USE method was first introduced to me as a way of examining hardware, some software resources can be examined with the same methodology.

Utilization is the percentage of time that the resource is busy servicing work during a specific time interval. While busy, the resource may still be able to accept more work; the degree to which it cannot is identified by saturation. That extra work usually ends up waiting in a queue.

Saturation occurs when a resource is fully utilized and additional work has to queue. If saturation persists, queues grow and errors such as timeouts or dropped requests are likely to follow.

Errors in terms of the USE method refer to the count of error events. Errors can degrade performance and might not be immediately noticed when the failure mode is recoverable. This includes operations that fail and are retried, as well as resources that fail in a pool of redundant resources.
The key metrics of the USE method are usually expressed as:
  • Utilization as a percentage over a time interval.
  • Saturation as a wait queue length.
  • Errors as the number of errors reported.
It is also important to express the time interval for the measurement. A short burst of high utilization can cause saturation and performance issues, even though the overall utilization is low over a longer interval.
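
As a concrete illustration, here is a minimal Python sketch that computes CPU utilization as a percentage over a chosen interval by sampling /proc/stat twice; it assumes a Linux system, and the choice to count iowait as idle time is a convention of the example rather than part of the USE method.

    # Sketch: CPU utilization over a chosen measurement interval (Linux).
    import time

    def read_cpu_times():
        with open("/proc/stat") as f:
            fields = f.readline().split()[1:]   # aggregate "cpu" line: user, nice, system, idle, iowait, ...
        values = list(map(int, fields))
        not_busy = values[3] + values[4]        # idle + iowait treated as not busy
        return sum(values), not_busy

    def cpu_utilization(interval_seconds=1.0):
        total1, idle1 = read_cpu_times()
        time.sleep(interval_seconds)
        total2, idle2 = read_cpu_times()
        busy = (total2 - total1) - (idle2 - idle1)
        return 100.0 * busy / (total2 - total1)

    if __name__ == "__main__":
        # A short and a longer interval, side by side.
        print("1s sample:  %.1f%%" % cpu_utilization(1.0))
        print("10s sample: %.1f%%" % cpu_utilization(10.0))

Comparing the 1-second and 10-second samples is a quick way to see how a longer averaging interval smooths over short bursts.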

The first step in the USE method is to create a list of resources. Try to be as complete as possible. Here is a generic list of hardware resources (a sketch for spot-checking a few of them follows the list):
  • CPUs - Sockets, cores, hardware threads (virtual CPUs).
  • Main memory - RAM.
  • Network interfaces - Ethernet ports.
  • Storage devices - Disks.
  • Controllers - Storage, network.
  • Interconnects - CPU, memory, I/O.
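
A few of the resources above can be spot-checked directly from Linux's /proc and /sys interfaces. The sketch below is illustrative and assumes a Linux host: it sums per-interface error and drop counters for the network interfaces, and reads swap-in/swap-out page counts, which are one common saturation signal for main memory.

    # Sketch: spot checks for two hardware resources on Linux.
    import glob
    import os

    def network_errors():
        # Error and drop counters per network interface.
        counts = {}
        for path in glob.glob("/sys/class/net/*/statistics"):
            iface = path.split("/")[4]
            total = 0
            for counter in ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped"):
                with open(os.path.join(path, counter)) as f:
                    total += int(f.read())
            counts[iface] = total
        return counts

    def swap_activity():
        # Pages swapped in/out since boot; steady growth suggests memory saturation.
        stats = {}
        with open("/proc/vmstat") as f:
            for line in f:
                key, value = line.split()
                stats[key] = int(value)
        return stats.get("pswpin", 0), stats.get("pswpout", 0)

    if __name__ == "__main__":
        print("interface errors/drops:", network_errors())
        print("pages swapped in/out since boot:", swap_activity())
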
If focusing on software, start by breaking your system down into services, then methods, then low-level resources. For example:
  • Mutex locks - Utilization may be defined as the time the lock was held, saturation by those threads queued waiting on the lock.
  • Thread pools - Utilization may be defined as the time threads were busy processing work, saturation by the number of requests waiting to be serviced by the thread pool (see the sketch after this list).
  • Process/thread capacity - The current thread/process usage vs the maximum thread/process limit of a system may be defined as utilization; waiting on allocation may indicate saturation; and errors occur when the allocation fails.
  • File descriptor capacity - Same as above but for file descriptors.
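
To make the thread pool item concrete, here is a toy instrumented pool in Python; the class and its metric names are invented for this sketch rather than taken from any particular library.

    # Sketch: a small thread pool that exposes USE metrics.
    #   utilization = percentage of workers currently busy
    #   saturation  = number of queued requests not yet picked up
    #   errors      = count of tasks that raised an exception
    import queue
    import threading

    class InstrumentedPool:
        def __init__(self, workers=4):
            self.tasks = queue.Queue()
            self.workers = workers
            self.busy = 0
            self.errors = 0
            self.lock = threading.Lock()
            for _ in range(workers):
                threading.Thread(target=self._worker, daemon=True).start()

        def _worker(self):
            while True:
                func, args = self.tasks.get()
                with self.lock:
                    self.busy += 1
                try:
                    func(*args)
                except Exception:
                    with self.lock:
                        self.errors += 1
                finally:
                    with self.lock:
                        self.busy -= 1
                    self.tasks.task_done()

        def submit(self, func, *args):
            self.tasks.put((func, args))

        def use_metrics(self):
            with self.lock:
                return (100.0 * self.busy / self.workers,  # utilization %
                        self.tasks.qsize(),                # saturation
                        self.errors)                       # errors

    # Usage:
    #   pool = InstrumentedPool(workers=4)
    #   pool.submit(some_function, arg)
    #   print(pool.use_metrics())
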
Drawing a functional block diagram of the system is very helpful when looking for bottlenecks in the flow of data. While determining utilization for the various components, annotate each one on the diagram with its maximum bandwidth. The resulting diagram may pinpoint systemic bottlenecks before any measurements have been taken. (This is a useful exercise during product design, while you still have time to change specifications.)

Here are some general suggestions for interpreting metric types; a small sketch that applies them follows the list:
  • Utilization - 100% utilization is usually a sign of a bottleneck (check saturation and its effect to confirm). High utilization (e.g. >60%) can begin to be a problem, and when utilization is measured over a relatively long period, an average of 60% can hide short bursts of 100% utilization.
  • Saturation - Any amount of saturation can be a problem. This may be measured as the length of a wait queue or time spent waiting on the queue.
  • Errors - Non-zero error counters are worth investigating, especially if they are still increasing while performance is poor.
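
Purely as an illustration, these heuristics can be folded into a small triage helper; the 60% figure is the rough threshold mentioned above, not a universal rule.

    # Sketch: triage a resource from its USE metrics using the heuristics above.
    def triage(utilization_pct, queue_length, error_count):
        findings = []
        if error_count > 0:
            findings.append("errors present - investigate, especially if still increasing")
        if queue_length > 0:
            findings.append("saturation - work is queuing; any amount can be a problem")
        if utilization_pct >= 100:
            findings.append("likely bottleneck - confirm via saturation and its effect")
        elif utilization_pct > 60:
            findings.append("high utilization - long averages may hide bursts of 100%")
        return findings or ["no obvious USE issue at this measurement interval"]

    # Example: a disk 72% busy with a small wait queue and no errors.
    print(triage(72, 2, 0))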