Blatant rip off of http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/ with a small amount of simplification.
The USE method can be summarized as: For every resource, check utilization, saturation, and errors. While the USE method was first introduced to me as a method for examining hardware, some software resources can also be examined with this methodology.
Utilization is the percentage of time that the resource is busy working during a specific time interval. While busy, the resource may still be able to accept more work; the degree to which it cannot do so is identified by saturation. That extra work is usually waiting in a queue.
Saturation happens when a resource is fully utilized and additional work is queued. When a resource is fully saturated and cannot queue any more work, errors will occur.
In the USE method, errors refer to the count of error events. Errors can degrade performance and might not be noticed immediately when the failure mode is recoverable. This includes operations that fail and are retried, as well as resources that fail in a pool of redundant resources.
The key metrics of the USE method are usually expressed as:
- Utilization as a percentage over a time interval.
- Saturation as a wait queue length.
- Errors as the number of errors reported.
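As a rough illustration of the three metrics listed above, here is a minimal Python sketch. The `UseMetrics` class, the `summarize` function, and all of the input numbers are hypothetical and only meant to show the arithmetic; a real tool would read these values from the operating system or from the resource itself.

```python
from dataclasses import dataclass


@dataclass
class UseMetrics:
    utilization_pct: float  # percent of the interval the resource was busy
    saturation: float       # average wait queue length over the interval
    errors: int             # error events reported during the interval


def summarize(busy_seconds, interval_seconds, queue_samples,
              errors_start, errors_end):
    """Turn raw measurements for one interval into the three USE metrics."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    utilization = 100.0 * busy_seconds / interval_seconds
    # Saturation here is the mean of periodic queue length samples.
    saturation = (sum(queue_samples) / len(queue_samples)
                  if queue_samples else 0.0)
    # Error counters are usually cumulative, so take the difference.
    errors = errors_end - errors_start
    return UseMetrics(utilization, saturation, errors)


# Hypothetical one-minute interval for a single disk.
m = summarize(busy_seconds=45.0, interval_seconds=60.0,
              queue_samples=[0, 2, 4, 2],
              errors_start=10, errors_end=13)
print(m)
```

With these made-up inputs the disk was busy for 75% of the minute, had an average of 2 requests waiting, and reported 3 new errors.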
The first step in the USE method is to create a list of resources. Try to be as complete as possible. Here is a generic list of hardware resources:
- CPUs - Sockets, cores, hardware threads (virtual CPUs).
- Main memory - RAM.
- Network interfaces - Ethernet ports.
- Storage devices - Disks.
- Controllers - Storage, network.
- Interconnects - CPU, memory, I/O.
Some software resources can be examined in the same way:
- Mutex locks - Utilization may be defined as the time the lock was held, saturation by those threads queued waiting on the lock.
- Thread pools - Utilization may be defined as the time threads were busy processing work, saturation by the number of requests waiting to be serviced by the thread pool.
- Process/thread capacity - The current thread/process usage vs the maximum thread/process limit of a system may be defined as utilization; waiting on allocation may indicate saturation; and errors occur when the allocation fails.
- File descriptor capacity - Same as above but for file descriptors.
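One way to make sure no resource is skipped is to expand the resource list into an explicit checklist, with one entry per resource and metric. The sketch below assumes nothing beyond the lists above; the names `HARDWARE`, `SOFTWARE`, and `build_checklist` are hypothetical.

```python
from itertools import product

HARDWARE = [
    "CPUs", "Main memory", "Network interfaces",
    "Storage devices", "Controllers", "Interconnects",
]
SOFTWARE = [
    "Mutex locks", "Thread pools",
    "Process/thread capacity", "File descriptor capacity",
]
METRICS = ["utilization", "saturation", "errors"]


def build_checklist(resources, metrics=METRICS):
    """Return every (resource, metric) pair that needs to be checked."""
    return list(product(resources, metrics))


checklist = build_checklist(HARDWARE + SOFTWARE)
for resource, metric in checklist:
    print(f"[ ] {resource}: {metric}")
```

Each printed line is one question to answer, such as "what is the saturation of the thread pools?". Entries that cannot be measured on a given system are still worth keeping on the list as known unknowns.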
Here are some general suggestions for interpreting metric types:
- Utilization - 100% utilization is usually a sign of a bottleneck (check saturation and its effect to confirm). High utilization (e.g. >60%) can begin to be a problem. When utilization is measured over a relatively long time period, an average utilization of 60% can hide short bursts of 100% utilization.
- Saturation - Any amount of saturation can be a problem. This may be measured as the length of a wait queue or time spent waiting on the queue.
- Errors - Non-zero error counters are worth investigating, especially if they are still increasing while performance is poor.
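The suggestions above can be sketched in a few lines of Python. The sample data is made up, and the `interpret` function and its 60% default threshold are hypothetical; treat the threshold as a starting point taken from the text, not as a fixed rule.

```python
def average(samples):
    return sum(samples) / len(samples)


# Ten hypothetical one-second utilization samples (percent):
# six seconds fully busy, four seconds idle.
per_second = [100, 100, 100, 100, 100, 100, 0, 0, 0, 0]
avg = average(per_second)  # the ten-second average looks moderate
peak = max(per_second)     # but the resource was pegged for six seconds
print(f"average={avg}% peak={peak}%")


def interpret(utilization_pct, saturation, errors,
              high_utilization_pct=60.0):
    """Return a list of findings for one resource, following the notes above."""
    findings = []
    if utilization_pct >= 100.0:
        findings.append("utilization at 100%: likely bottleneck, "
                        "check saturation to confirm")
    elif utilization_pct > high_utilization_pct:
        findings.append("high utilization: an average may hide "
                        "short bursts of 100%")
    if saturation > 0:
        findings.append("saturation present: work is queueing")
    if errors > 0:
        findings.append("non-zero errors: worth investigating")
    return findings


for finding in interpret(utilization_pct=peak, saturation=3, errors=0):
    print(finding)
```

The first part shows why the measurement interval matters: the same samples report 60% as an average but 100% as a peak, so shorter intervals or a maximum alongside the average give a more honest picture.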