- userland daemon
- debug -D
- simulator -S
-S accepts a Perl script that generates different interrupt distributions for testing and debugging.
main function:
    LOOP:
        getstat()           // get kernel statistics
        generate_delta()    // compare with previous snapshot
        append_delta()      // add delta to the list of deltas
        compute_goodness()  // is reconfiguration needed?
        IF needs reconfiguration:
            do_reconfig()   // change interrupt assignments
getstat function:
    For every online cpu in kstat cpu:<cpuid>:sys, get
    cpu_nsec_* and crtime.
    For every interrupt not disabled in pci_intrs, build
    cpu->ivecs->'buspath ino'->{crtime,pil,ino,num_ino,buspath,name,ihs}
    For every MSI device, add up cookie time and crtimes.
generate_delta function:
Adds interrupt counts and times.
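The delta step can be sketched as a diff between two snapshots. This is a minimal illustration in Python, not the intrd.pl code: the snapshot layout (cpu id -> interrupt key -> cumulative nanoseconds) and the sample 'buspath ino' keys are assumptions made for the example.

```python
def generate_delta(prev, curr):
    """Return per-CPU, per-interrupt time spent between two snapshots.

    Both snapshots map cpu id -> {interrupt key -> cumulative nanoseconds}
    (a hypothetical layout). CPUs or interrupts missing from the earlier
    snapshot are skipped, since no interval can be computed for them.
    """
    delta = {}
    for cpu, ivecs in curr.items():
        if cpu not in prev:
            continue
        per_cpu = {}
        for key, nsec in ivecs.items():
            if key in prev[cpu]:
                per_cpu[key] = nsec - prev[cpu][key]
        delta[cpu] = per_cpu
    return delta


# Made-up sample data; cpu 2 came online after the first snapshot.
prev = {0: {"/pci@0,0 4": 1_000}, 1: {"/pci@0,0 5": 500}}
curr = {0: {"/pci@0,0 4": 1_750}, 1: {"/pci@0,0 5": 500}, 2: {"/pci@1,0 7": 10}}
print(generate_delta(prev, curr))
```

A CPU that appears only in the newer snapshot contributes nothing until the next round, when both snapshots contain it.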
- PSARC/2004/199
- intrd(1M)
- intrstat(1M)
- interrupt loading, intrd, and CMT
- Approach for managing interrupt load distribution US 7610425 B2
http://src.illumos.org/source/xref/illumos-gate/usr/src/cmd/intrd/intrd.pl
- kernel thread
- controlled by uadmin
- userland daemon
- oneshot
- debug
- hintpolicy
- powerthresh
- banirq
- banscript
- policyscript
- pid
oneshot: assigns interrupts only once.
hintpolicy: sets the IRQ affinity hint policy to exact, subset or ignore.
powerthresh: sets the threshold for putting a CPU in powersave mode.
banirq: bans a single IRQ from balancing.
banscript: runs a script for each IRQ to decide if the IRQ should be banned from balancing.
policyscript: a superset of banscript. It can also assign the balance_level values none, package, cache or core.
General:
LIST numa_nodes
LIST packages
LIST cache_domains
LIST cpus
LIST interrupts_db
LIST new_irq_list
LIST banned_irqs
Topology object:
STRUCT topo_obj
    load
    last_load
    obj_type          // OBJ_TYPE_CPU
                      // OBJ_TYPE_CACHE
                      // OBJ_TYPE_PACKAGE
                      // OBJ_TYPE_NODE
    number
    powersave_mode
    mask              // CPU_MASK_ALL
    interrupts
    parent
    children
    obj_type_list
IRQ information:
STRUCT irq_info
    irq
    class             // other
                      // legacy
                      // storage
                      // timer
                      // ethernet
                      // gbit-ethernet
                      // 10gbit-ethernet
    type              // IRQ_TYPE_LEGACY
                      // IRQ_TYPE_MSI
                      // IRQ_TYPE_MSIX
    level             // BALANCE_NONE
                      // BALANCE_PACKAGE
                      // BALANCE_CACHE
                      // BALANCE_CORE
    numa_node
    cpu_mask
    affinity_hint     // can be set by a driver via irq_set_affinity_hint
    irq_count
    last_irq_count
    load
    moved
    assigned_obj
Object placement:
STRUCT obj_placement
    best
    least_irqs
    best_cost
    info
main function:
    build_object_tree()
    LOOP:
        parse_proc_interrupts()
        parse_proc_stat()
        IF need_rescan:  // cpu hotplug
        calculate_placement()
        activate_mappings()
build_object_tree function:
    // Build NUMA topology
    FOR node* in /sys/devices/system/node/
        append node*(name, cpu_mask) to numa_nodes
    FOR cpu* online and not banned in /sys/devices/system/cpu/
        core_count++
        read_package_mask   // topology/core_siblings
        read_package_id     // topology/physical_package_id
        read_cache_mask     // cache/index1/shared_cpu_map
                            // cache/index2/shared_cpu_map
        // Add CPU to cache domain
        IF cache_mask not in cache_domains
            add OBJ_TYPE_CACHE(cache_mask, cache_domain_count)
        IF cpu not in OBJ_TYPE_CACHE->children
            add cpu to OBJ_TYPE_CACHE->children
        // Add cache domain to package
        IF package_mask not in packages
            add package to packages
            package_count++
        // Add package to node
        IF exists node in numa_nodes with nodeid == number
            add package to node->children
        append cpu* to cpus
    // Build IRQ database
    FOR irq in /sys/bus/pci/devices/*/msi_irqs
        struct irq_info->type = IRQ_TYPE_MSIX
        add irq_info to interrupts_db
    FOR irq in /sys/bus/pci/devices/*/irq
        struct irq_info->type = IRQ_TYPE_LEGACY
        add irq_info to interrupts_db
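The masks read from sysfs (core_siblings, shared_cpu_map) are hexadecimal bitmaps, written as comma-separated 32-bit groups with the most significant group first. A minimal Python sketch of decoding such a mask and grouping CPUs into cache domains follows. The helper names parse_cpumask and group_by_cache are made up for this example and the sample masks are invented; irqbalance itself does this in C with kernel-style cpumask helpers.

```python
from collections import defaultdict


def parse_cpumask(text):
    """Convert a sysfs hex CPU mask such as '00000000,0000000f' to a set of CPU ids."""
    digits = text.strip().replace(",", "")
    if not digits:
        return set()
    value = int(digits, 16)
    cpus = set()
    bit = 0
    while value:
        if value & 1:
            cpus.add(bit)
        value >>= 1
        bit += 1
    return cpus


def group_by_cache(cache_masks):
    """Group CPU ids by shared cache mask: {cpu: mask text} -> {members: [cpus]}."""
    domains = defaultdict(list)
    for cpu, mask in sorted(cache_masks.items()):
        domains[frozenset(parse_cpumask(mask))].append(cpu)
    return dict(domains)


# Invented example: four CPUs, two per shared cache.
masks = {0: "00000003", 1: "00000003", 2: "0000000c", 3: "0000000c"}
for members, cpus in group_by_cache(masks).items():
    print(sorted(members), cpus)
```

Using the decoded mask as the grouping key mirrors the "IF cache_mask not in cache_domains" test above: two CPUs land in the same domain exactly when their masks are equal.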
parse_proc_interrupts function:
    FOR irq in /proc/interrupts
        get struct irq_info for irq number
        IF not irq_info
            add new irq_info
        // Check total irq count
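As an illustration of the parsing step, here is a small Python sketch that turns the text of /proc/interrupts into a total count per numbered IRQ. It is an assumption-laden simplification: it takes the file contents as a string, skips non-numeric rows such as NMI or ERR, and ignores the controller and device columns. The sample text is invented.

```python
def parse_proc_interrupts(text):
    """Return {irq number: total count over all CPUs} from /proc/interrupts text."""
    lines = text.splitlines()
    if not lines:
        return {}
    # The header row lists one column per online CPU (CPU0, CPU1, ...).
    ncpus = len(lines[0].split())
    counts = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # NMI, LOC, ERR and similar rows have no IRQ number
        fields = rest.split()
        per_cpu = [int(f) for f in fields[:ncpus] if f.isdigit()]
        counts[int(label)] = sum(per_cpu)
    return counts


SAMPLE = "\n".join([
    "           CPU0       CPU1",
    "  0:         45          0   IO-APIC   2-edge      timer",
    "  1:          3          0   IO-APIC   1-edge      i8042",
    "NMI:          0          0   Non-maskable interrupts",
    " 24:       1234       5678   PCI-MSI 512000-edge      eth0",
    "ERR:          0",
])
print(parse_proc_interrupts(SAMPLE))
```

On a live system the input would come from reading /proc/interrupts; comparing totals between two reads gives the per-interval count that irq_count and last_irq_count track.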
parse_proc_stat function:
    FOR cpu in /proc/stat
        cpu->load = irq_load + softirq_load - cpu->last_load
        cpu->load *= NSEC_PER_SEC/HZ
        cpu->last_load = irq_load + softirq_load
    // Compute branch load share for cpus, cache_domains,
    // packages and numa_nodes
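The load computation above can be tried out with a short Python sketch. It assumes the usual /proc/stat column order (user, nice, system, idle, iowait, irq, softirq, ...) and a tick rate of 100 per second; both the hz default and the sample numbers are assumptions for illustration, and the real value should be taken from the running system.

```python
NSEC_PER_SEC = 1_000_000_000


def parse_proc_stat(text, last_load, hz=100):
    """Return {cpu: nanoseconds of irq + softirq time since the previous call}.

    last_load maps cpu -> previous irq + softirq tick total and is updated in place.
    """
    loads = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue
        name = fields[0]
        # Skip the aggregate "cpu" row and non-CPU rows such as "intr".
        if not (name.startswith("cpu") and name[3:].isdigit()):
            continue
        cpu = int(name[3:])
        ticks = int(fields[6]) + int(fields[7])  # irq + softirq columns
        loads[cpu] = (ticks - last_load.get(cpu, 0)) * NSEC_PER_SEC // hz
        last_load[cpu] = ticks
    return loads


first = "cpu  10 0 10 100 0 5 5 0 0 0\ncpu0 5 0 5 50 0 2 3 0 0 0\ncpu1 5 0 5 50 0 3 2 0 0 0\nintr 12345\n"
second = "cpu  12 0 12 120 0 7 6 0 0 0\ncpu0 6 0 6 60 0 4 4 0 0 0\ncpu1 6 0 6 60 0 3 2 0 0 0\nintr 12400\n"
last = {}
parse_proc_stat(first, last)          # first call only establishes the baseline
print(parse_proc_stat(second, last))  # load accumulated between the two reads
```

The first call reports the totals since boot, so a caller would normally discard it and use only the differences from later calls.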
calculate_placement function:
    // Place IRQ in node
    // NUMA nodes
    // Packages
    // Caches
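One way to picture placement at a single level of the tree is a greedy pass: take the IRQs from heaviest to lightest and put each on the object with the lowest accumulated load, breaking ties by the fewest IRQs. The Python sketch below is only an illustration of that idea, not irqbalance's actual placement code; the function name place_irqs, the tie-break rule (a guess at what the best_cost and least_irqs fields are for) and the sample loads are all assumptions.

```python
def place_irqs(irq_loads, objects):
    """Greedily assign IRQs to objects; returns {irq: object name}.

    irq_loads maps irq number -> load, objects is a list of object names
    (for example NUMA nodes, packages or cache domains at one tree level).
    """
    obj_load = {name: 0 for name in objects}
    obj_irqs = {name: 0 for name in objects}
    placement = {}
    # Heaviest IRQs first, so the light ones can even out the remainder.
    for irq, load in sorted(irq_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        best = min(objects, key=lambda n: (obj_load[n], obj_irqs[n]))
        placement[irq] = best
        obj_load[best] += load
        obj_irqs[best] += 1
    return placement


print(place_irqs({24: 900, 25: 500, 26: 400, 27: 100}, ["node0", "node1"]))
```

Repeating the same pass for the children of each chosen object (node, then package, then cache domain) would follow the top-down order listed above.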
A device driver can use irq_set_affinity_hint to inform irqbalance
about a preferred placement policy. irqbalance reads this hint from
/proc/irq/<irq>/affinity_hint. Only a few device drivers (such as ixgbe,
nvme and virtio_pci) use this mechanism; it is deemed too simple and inelegant.
- /proc/interrupts
- Assign Interrupts to Processor Cores on Intel® Ethernet Controller
- Automatic IRQ siloing for network devices
- http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/irq
- https://code.google.com/p/irqbalance/
- "Interrupt and Exception Handling" in the Intel System Programming Guide
MSI - Message Signaled Interrupts