@0xffea
Last active December 16, 2015 03:39
Documentation for "Rearchitect and replace interrupt distribution"

Illumos (current Perl version)

  • userland daemon

Options

  • debug -D
  • simulator -S

-S accepts a Perl script that generates different interrupt distributions for testing and debugging.

Algorithm

main function:

LOOP:
  getstat()       // get kernel statistics

  generate_delta()    // with previous snapshot

  append_delta()      // add delta to the list of deltas

  compute_goodness()  // is reconfiguration needed?

  IF needs reconfiguration:
      do_reconfig()   // change interrupt assignments
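
The loop above can be sketched in Python as follows. This is only an illustration of the control flow, not a port of intrd.pl: the four callables are hypothetical stand-ins for the steps named in the pseudocode, the sleep between iterations is left out, and clearing the delta list after a reconfiguration is an assumption.

```python
from typing import Callable, List


def run_loop(getstat: Callable[[], dict],
             generate_delta: Callable[[dict, dict], dict],
             needs_reconfig: Callable[[List[dict]], bool],
             do_reconfig: Callable[[List[dict]], None],
             iterations: int) -> int:
    """Run the loop `iterations` times; return the number of reconfigurations."""
    deltas: List[dict] = []
    previous = getstat()              # first snapshot to diff against
    reconfigs = 0
    for _ in range(iterations):
        current = getstat()           # get kernel statistics
        deltas.append(generate_delta(previous, current))
        previous = current
        if needs_reconfig(deltas):    # plays the role of compute_goodness()
            do_reconfig(deltas)       # change interrupt assignments
            deltas.clear()            # assumption: start a fresh history
            reconfigs += 1
    return reconfigs
```

Passing the steps in as functions keeps the sketch testable without kstat access.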

getstat function:

For every online CPU, read cpu_nsec_* and crtime from
kstat cpu:<cpuid>:sys.

For every interrupt in pci_intrs that is not disabled, build
cpu->ivecs->'buspath ino'->{crtime,pil,ino,num_ino,buspath,name,ihs}

For every MSI device, add up the cookie times and crtimes.

generate_delta function:

Adds interrupt counts and times.
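
A minimal sketch of the delta step, assuming a much simplified snapshot layout: a dict that maps each CPU id to a dict of 'buspath ino' keys and accumulated interrupt times. The layout, the field names "ivecs" and "total", and the choice to return None for snapshots that are not comparable are assumptions made for illustration, not the structure intrd.pl actually uses.

```python
from typing import Dict, Optional

Snapshot = Dict[int, Dict[str, int]]   # cpu id -> {'buspath ino': nsec}


def generate_delta(prev: Snapshot, curr: Snapshot) -> Optional[dict]:
    """Return per-interrupt time deltas and a per-CPU total.

    Returns None when the CPUs or interrupts differ between the two
    snapshots, because the counters cannot be compared in that case.
    """
    if prev.keys() != curr.keys():
        return None
    delta = {}
    for cpu, ivecs in curr.items():
        if ivecs.keys() != prev[cpu].keys():
            return None
        per_ivec = {key: t - prev[cpu][key] for key, t in ivecs.items()}
        delta[cpu] = {"ivecs": per_ivec, "total": sum(per_ivec.values())}
    return delta
```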

Documentation and Tools

Source

http://src.illumos.org/source/xref/illumos-gate/usr/src/cmd/intrd/intrd.pl

Oracle Solaris 11.1

  • kernel thread
  • controlled by uadmin

Linux (irqbalance)

  • userland daemon

Options:

  • oneshot
  • debug
  • hintpolicy
  • powerthresh
  • banirq
  • banscript
  • policyscript
  • pid

oneshot assigns interrupts only once. hintpolicy sets how IRQ affinity hints are handled: exact, subset or ignore. powerthresh sets the threshold for putting a CPU in powersave mode. banirq bans a single IRQ from balancing. banscript runs a script for each IRQ to decide whether that IRQ should be banned from balancing. policyscript is a superset of banscript: it can also assign the balance levels none, package, cache or core.
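
As a rough illustration of policyscript, the sketch below bans a fixed set of IRQs and asks for core-level balancing for everything else. It assumes, following the irqbalance man page as far as I recall it, that the script is called with the sysfs device path and the IRQ number as arguments and prints key=value pairs such as ban= and balance_level= on stdout. BANNED_IRQS and policy_lines are hypothetical names, and the rule itself is made up.

```python
import sys

BANNED_IRQS = {0, 8}   # hypothetical: IRQs that should never be balanced


def policy_lines(device_path: str, irq: int) -> list:
    """Return the key=value lines to print for one IRQ."""
    # device_path is unused in this simple rule; a real script could
    # inspect the device directory before deciding.
    if irq in BANNED_IRQS:
        return ["ban=true"]
    return ["ban=false", "balance_level=core"]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed invocation: <script> <sysfs device path> <irq number>
    print("\n".join(policy_lines(sys.argv[1], int(sys.argv[2]))))
```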

Data structures

General:

LIST numa_nodes
LIST packages
LIST cache_domains
LIST cpus
LIST interrupts_db
LIST new_irq_list
LIST banned_irqs

Topology object:

STRUCT topo_obj
    load
    last_load
    obj_type        // OBJ_TYPE_CPU
                    // OBJ_TYPE_CACHE
                    // OBJ_TYPE_PACKAGE
                    // OBJ_TYPE_NODE
    number
    powersave_mode
    mask            // CPU_MASK_ALL
    interrupts
    parent
    children
    obj_type_list
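
The parent and children fields make topo_obj a tree with NUMA nodes at the top and CPUs at the leaves. A small Python sketch of that shape, with only a few of the fields above; TopoObj, add_child and path_to_root are hypothetical names, not irqbalance code:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class TopoObj:
    obj_type: str                      # "cpu", "cache", "package" or "node"
    number: int
    load: int = 0
    parent: Optional["TopoObj"] = field(default=None, repr=False)
    children: List["TopoObj"] = field(default_factory=list)

    def add_child(self, child: "TopoObj") -> None:
        """Link child below this object."""
        child.parent = self
        self.children.append(child)


def path_to_root(obj: TopoObj) -> List[TopoObj]:
    """Return obj followed by its ancestors, up to the NUMA node."""
    path = []
    while obj is not None:
        path.append(obj)
        obj = obj.parent
    return path


node, package = TopoObj("node", 0), TopoObj("package", 0)
cache, cpu = TopoObj("cache", 0), TopoObj("cpu", 3)
node.add_child(package)
package.add_child(cache)
cache.add_child(cpu)
```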

IRQ information:

STRUCT irq_info
    irq
    class           // other
                    // legacy
                    // storage
                    // timer
                    // ethernet
                    // gbit-ethernet
                    // 10gbit-ethernet
    type            // IRQ_TYPE_LEGACY
                    // IRQ_TYPE_MSI
                    // IRQ_TYPE_MSIX
    level           // BALANCE_NONE
                    // BALANCE_PACKAGE
                    // BALANCE_CACHE
                    // BALANCE_CORE
    numa_node
    cpu_mask
    affinity_hint   // could be set via irq_set_affinity_hint
    irq_count
    last_irq_count
    load
    moved
    assigned_obj

Object placement:

STRUCT obj_placement
    best
    least_irqs
    best_cost
    info

Algorithm

main function:

build_object_tree()

LOOP:
     parse_proc_interrupts()
     parse_proc_stat()

     IF need_rescan: // cpu hotplug
          build_object_tree()   // rebuild the topology

     calculate_placement()
     activate_mappings()

build_object_tree function:

// Build NUMA topology

FOR node* in /sys/devices/system/node/
    append node*(name, cpu_mask) to numa_nodes

FOR cpu* online and not banned in /sys/devices/system/cpu/
    core_count++

    read_package_mask   // topology/core_siblings

    read_package_id         // topology/physical_package_id

    read_cache_mask         // cache/index1/shared_cpu_map
                            // cache/index2/shared_cpu_map

    // Add CPU to cache domain
    IF cache_mask not in cache_domains
         add OBJ_TYPE_CACHE(cache_mask, cache_domain_count)

    IF cpu not in OBJ_TYPE_CACHE->children
        add cpu to OBJ_TYPE_CACHE->children

    // Add cache domain to package
    IF package_mask not in packages
        add package to packages
        package_count++

    // Add package to node
    IF exists node in numa_nodes with node->number == nodeid
        add package to node->children

    append cpu* to cpus

// Build IRQ database

FOR irq in /sys/bus/pci/devices/*/msi_irqs
    struct irq_info->type = IRQ_TYPE_MSIX
    add irq_info to interrupts_db

FOR irq in /sys/bus/pci/devices/*/irq
    struct irq_info->type = IRQ_TYPE_LEGACY
    add irq_info to interrupts_db
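
The mask files read while building the topology (core_siblings, shared_cpu_map) contain comma-separated groups of hexadecimal digits. The sketch below parses such a mask and groups CPUs that share a cache mask into one domain, mirroring the "Add CPU to cache domain" step. parse_cpumask and group_by_cache are hypothetical helpers, and the input dict stands in for the values read from sysfs.

```python
from typing import Dict, FrozenSet, List


def parse_cpumask(text: str) -> FrozenSet[int]:
    """Parse a sysfs mask such as '00000000,0000000f' into CPU ids."""
    value = int(text.strip().replace(",", ""), 16)
    return frozenset(bit for bit in range(value.bit_length())
                     if (value >> bit) & 1)


def group_by_cache(cache_masks: Dict[int, str]) -> Dict[FrozenSet[int], List[int]]:
    """Map each distinct cache mask to the CPUs that reported it."""
    domains: Dict[FrozenSet[int], List[int]] = {}
    for cpu, mask_text in sorted(cache_masks.items()):
        # A new mask opens a new cache domain; a known one gains a child.
        domains.setdefault(parse_cpumask(mask_text), []).append(cpu)
    return domains
```

For example, four CPUs that report the masks 3, 3, c and c end up in two cache domains of two CPUs each.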

parse_proc_interrupts:

FOR irq in /proc/interrupts
    get struct irq_info for irq number
    IF not irq_info
        add new irq_info

// Check total irq count
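
A minimal sketch of the parsing step, assuming the usual /proc/interrupts layout: a header line with one column per CPU, then one line per interrupt with per-CPU counts. Only numbered IRQs are kept, and the function returns the total count per IRQ. The function works on text passed in, so it can be tried on a saved copy of the file.

```python
from typing import Dict


def parse_proc_interrupts(text: str) -> Dict[int, int]:
    """Return {irq number: total count over all CPUs}."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())          # header: CPU0 CPU1 ...
    counts: Dict[int, int] = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():            # skip NMI, LOC, ERR and similar
            continue
        fields = rest.split()
        counts[int(label)] = sum(int(f) for f in fields[:ncpus])
    return counts
```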

parse_proc_stat:

FOR cpu in /proc/stat
    cpu->load = irq_load + softirq_load - cpu->last_load
    cpu->load *= NSEC_PER_SEC/HZ
    cpu->last_load = irq_load + softirq_load

// Compute branch load share for cpus, cache_domains,
// packages and numa_nodes
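
The per-CPU load computation can be sketched as follows. It assumes the standard /proc/stat layout, in which the irq and softirq counters are the sixth and seventh values after the cpuN label, and it assumes HZ = 100; both the value of HZ and the helper names are assumptions for illustration.

```python
from typing import Dict

NSEC_PER_SEC = 1_000_000_000
HZ = 100   # assumption: counters tick 100 times per second


def irq_jiffies(stat_text: str) -> Dict[int, int]:
    """Return {cpu id: irq + softirq jiffies} from /proc/stat text."""
    totals: Dict[int, int] = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Keep per-CPU lines (cpu0, cpu1, ...); skip the aggregate "cpu" line.
        if len(fields) < 8 or not fields[0].startswith("cpu"):
            continue
        cpu_id = fields[0][3:]
        if not cpu_id.isdigit():
            continue
        totals[int(cpu_id)] = int(fields[6]) + int(fields[7])
    return totals


def cpu_loads(last: Dict[int, int], now: Dict[int, int]) -> Dict[int, int]:
    """Convert the jiffies spent since the last sample to nanoseconds."""
    return {cpu: (total - last.get(cpu, 0)) * (NSEC_PER_SEC // HZ)
            for cpu, total in now.items()}
```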

calculate_placement function:

// Place IRQ in node

// NUMA nodes

// Packages

// Caches
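
The outline above does not spell out how a target is chosen at each level. One plausible reading of the obj_placement fields (best, least_irqs, best_cost) is: pick the candidate with the lowest load, and break ties by the smaller number of assigned interrupts. The sketch below implements that reading only; Candidate and find_best are hypothetical names, and skipping objects in powersave mode is an assumption.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Candidate:
    name: str
    load: int
    interrupts: List[int] = field(default_factory=list)
    powersave_mode: bool = False


def find_best(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Pick the least loaded candidate; fewer interrupts wins a tie."""
    best: Optional[Candidate] = None
    for obj in candidates:
        if obj.powersave_mode:             # assumption: never a target
            continue
        if best is None or obj.load < best.load:
            best = obj
        elif obj.load == best.load and len(obj.interrupts) < len(best.interrupts):
            best = obj
    return best
```

Applied once per level (node, package, cache, CPU), a choice like this walks an IRQ down the topology tree until it reaches its balance level.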

A device driver can use irq_set_affinity_hint to inform irqbalance about a preferred placement policy. irqbalance reads this hint from /proc/irq/<irq>/affinity_hint. Few device drivers use this mechanism (for example ixgbe, nvme and virtio_pci); it is deemed too simple and inelegant.
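
A sketch of how the hint could be read and combined with the hintpolicy option, with CPU masks held as Python integers. The semantics follow my understanding of the three policies (exact: the hint wins, subset: the chosen CPUs are narrowed to the hint, ignore: the hint is dropped). Falling back to the chosen mask when the two masks do not overlap is an assumption, and the function names are hypothetical.

```python
from pathlib import Path


def parse_hint(text: str) -> int:
    """Parse a hex mask such as '00000000,0000000c' into an integer."""
    return int(text.strip().replace(",", ""), 16)


def read_affinity_hint(irq: int, proc_root: str = "/proc") -> int:
    """Read /proc/irq/<irq>/affinity_hint as an integer mask."""
    path = Path(proc_root) / "irq" / str(irq) / "affinity_hint"
    return parse_hint(path.read_text())


def apply_hint(policy: str, chosen_mask: int, hint_mask: int) -> int:
    """Return the mask to apply for an IRQ under the given hint policy."""
    if policy == "ignore" or hint_mask == 0:
        return chosen_mask
    if policy == "exact":
        return hint_mask
    if policy == "subset":
        # Assumption: keep the balancer's choice if the masks are disjoint.
        return (chosen_mask & hint_mask) or chosen_mask
    raise ValueError(f"unknown hint policy: {policy}")
```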

Documentation and Tools

Source

Illumos intrd (new version)

Source

Related Topics

CPU Topology

Interrupts

Glossary

MSI - Message Signaled Interrupts
