Created
March 27, 2014 11:56
willscripted/9805963
Because sometimes in business, your goals are tied to fiscal quarters
Linux - elitist attitude - only considers patches for real-world problems. Marketing bullets or one-off requests, such as pageable kernel memory, have received no consideration
Kernel Development | |
Introduction | |
============== | |
History | |
----------- | |
Unix - Elegant OS | |
(historically) | |
- only a few hundred system calls | |
- almost everything is a file | |
- written in C (portable) | |
- fast process creation time | |
- unique sys call 'fork' | |
(today) | |
- preemptive multitasking | |
- multi-threading | |
- virtual memory | |
- demand paging | |
- shared libraries w/ demand loading | |
- TCP/IP networking | |
- Many variants scale to hundreds of processors | |
- Some run on small embedded devices | |
Linux | |
- Born of frustration with Unix not being open source
- Implements the Unix API as defined by POSIX and the Single Unix Specification
- Not a direct descendant of Unix
- Modular design | |
- Can preempt itself (kernel preemption) | |
- Kernel threads | |
- Dynamic binary loading in Kernel image (kernel modules) | |
- Does not differentiate between threads and normal processes
- all procs are the same, some just share resources | |
- Hot pluggable events ? | |
- TODO - look into why linux ignores some common Unix features that kernel developers consider poorly designed, such as STREAMS, or standards that are impossible to cleanly implement (p 8) | |
System Components Usually Include | |
- Kernel | |
- C lib | |
- toolchain | |
- basic sys utils (eg, login process and shell) | |
OS Includes (Basic use and administration) | |
- Kernel (Manages hardware, distributes system resources) | |
- Interrupt handlers
- Scheduler - to share processor time btwn procs | |
- Memory Management system | |
- System services (eg. networking / inter-proc comms) | |
- Device drivers | |
- Boot loader | |
- Command shell / UI | |
- Basic file system utilities | |
Kernel | |
- Above user programs | |
- Privileged access (kernel space vs user space)
- C library provides higher level functions (although not always that different)
- Clibs make system calls
- Not all C-lib functions use system calls (eg `strcpy`)
- applications get work done through system calls | |
- in one of three states (always, even corner cases) | |
- User-space, executing process code | |
- Kernel-space, process context, executing system calls | |
- Kernel-space, interrupt context, handling interrupts
- Usually a monolithic static binary | |
- Usually require a system with a paged memory-management unit (MMU) | |
- enables memory protection | |
- each proc gets uniq addr space | |
- Monolithic servers (Linux) | |
- Simpler design | |
- Microkernels (Parts of OS X and Windows NT) | |
- Several 'server' processes that communicate via IPC (inter-proc com) | |
- Can swap out server instances | |
Interrupts
- stop processor directly | |
- in turn, stops the kernel | |
- interrupts have numbers associated w/ them (hardware) | |
- handlers process and respond | |
- do not run in process context | |
Getting Started w/ the Kernel | |
============================= | |
[Directories] - see comprehensive lists | |
usr - early user-space code (initramfs) ? | |
Built with flags | |
Binary flags | |
Tristates - don't build, build as module, or build into the kernel image
Configure flags with | |
make config (takes long time) | |
make menuconfig (ncurses) | |
make gconfig (gtk+) | |
or just edit `.config` | |
Build with | |
`make` | |
`make -jn` multi-proc build, where n is the # of jobs
Installing | |
Architecture and boot-loader dependent - see their docs | |
Modules are architecture independent | |
`make modules_install` | |
Warnings | |
Kernel lacks memory protection afforded to user-space | |
Clibs not available - too heavy / slow
see instead `linux/things.h` | |
also - inotify? | |
printk rather than printf | |
inline functions for shit that needs to be fast | |
inline functions usually declared in header files | |
`static inline void wolf(unsigned long tail_size)` | |
# TODO - what is asm() assembly compiler directive. | |
Usually only used in parts of the kernel unique to a specific architecture, because assembly is arch-specific and Linux aims to be cross-platform
Small per-process fixed-size stack | |
Dereferencing a NULL pointer (illegal memory access) | |
- Major kernel error | |
- No `SIGSEGV` errors | |
Kernel memory is not pageable. Every byte you eat is one less byte of available physical memory
# TODO - what does it mean to 'catch a trap' | |
Dont do floating point calcs | |
Kernel has small stack | |
Symmetrical multiprocessing (SMP) | |
Kernel must synchronize between scheduled tasks | |
Spinlocks and Semaphores usually used to avoid race conditions | |
Portability - handful of rules | |
Segregate system-specific code to appropriate dirs | |
Remain endian neutral | |
Be 64-bit clean, do not assume word or page size
Async interrupts, pre-emption, and SMP mean synchronization and concurrency are major concerns
3. Process Management | |
==================== | |
Process | |
- executing program code (the 'text section') | |
- Open files | |
- pending signals | |
- internal kernel data | |
- processor state | |
- memory address space (with one or more memory mappings) | |
- one or more threads of execution | |
- a data section containing global vars
Tasks kept in doubly linked list | |
Internally to the kernel known as a task | |
`task_struct` | |
- all the data needed by the kernel about a process | |
- allocated by the slab allocator | |
- the kernel stack holds the `thread_info` | |
- `thread_info` has a pointer to the `task_struct` | |
- there is a maximum # of procs, 32k on old machines (cause small ints)
- will wrap around | |
- can be as high as 4m set in /proc/sys/kernel/pid_max | |
- `current` macro to get the `task_struct` of the currently running proc | |
- open files | |
- proc's addr space | |
- pending signals | |
- proc state | |
TODO - revisit 'slab allocator' & thread_info | |
Threads | |
- objects of activity within a proc | |
includes | |
- unique program counter | |
- process stack | |
- set of processor registers
- the kernel schedules threads, not processes | |
- Linux does not differentiate between threads and procs | |
threads are just a special kind of proc | |
Procs | |
- procs provide two virtualizations
- virtual processor - proc thinks it is alone on the cpu | |
- virtual memory - as if it were alone in its use of memory | |
- Threads share virtual memory but not virtual processor | |
- a program is not a process | |
must also include related resources | |
Important system calls | |
fork() | |
- creates a new proc | |
- returns 2x, once in the parent, once in the child | |
exec() - function family | |
- creates a new addrs space and loads a new program | |
exit() | |
- terminates the process and frees all resources | |
- puts proc into zombie state | |
wait4() | |
- parent inquires about the status of a terminated child | |
wait() | |
- parent can resolve zombie state | |
waitpid() | |
- parent can resolve zombie state | |
Process State | |
------------- | |
Condition of the Proc (5 flags) | |
kernel function `set_task_state(task, state);` # see `linux/sched.h`
TASK_RUNNING | |
- either currently running or on a run queue | |
- only possible state for proc running in user-space | |
TASK_INTERRUPTIBLE | |
- sleeping, eg waiting for some condition | |
TASK_UNINTERRUPTIBLE | |
- sleeping, does not wake up and become runnable on signal | |
__TASK_TRACED | |
- trace, eg. by ptrace | |
__TASK_STOPPED | |
- not running, cannot run | |
- eg SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU received
Process Context | |
------------------ | |
Program code runs in user-space | |
System calls or exception triggering | |
- will enter kernel-space | |
- are the only interface to the kernel | |
- kernel is 'executing on behalf of the process' | |
- kernel is in 'process context' | |
The Process Family Tree | |
-------------------- | |
`init` process is process #1 | |
- started in last step of boot | |
- reads initscripts and executes more programs | |
Process relationships are stored in the `task_struct` | |
- `parent`-> parent's `task_struct` | |
- `children` -> list of children | |
Can navigate through tasks quite easily - see section for C samples | |
Process creation | |
----------------- | |
fork() | |
- creates a child proc that is a copy of the current task | |
- almost an exact clone | |
- copy on write, each process gets its own copied page if written to | |
- only overhead is dup of parent's page table & creation of new process descriptor for the child | |
exec() (a family) | |
- loads a new executable into the address space | |
Forking | |
---------- | |
most work done by `do_fork()` (see kernel/fork.c) which calls `copy_process()` | |
- PF_SUPERPRIV flag removed from child - no super user privileges | |
- Child runs first to avoid copy-on-write overhead if child is just `exec`ing | |
- vfork is a little strange, parent waits for child to exec or exit | |
- use seems to be for legacy reasons, or if you don't want to spend time copying the page tables
The Linux Implementation of Threads | |
----------------------------- | |
Shared memory space | |
Multiple execution capabilities | |
Enable 'concurrent programming' and 'parallelism' | |
Linux don't care, it scheds them like all procs
Differs from Windows or others whose processes are much heavier | |
Rather than 1 proc and 4 threads on Windows, linux has 4 procs w/ shared memory | |
clone() sys command | |
- passed flags indicating which resources are shared | |
See table here for clone flags and meanings | |
Kernel Threads | |
----------- | |
Processes run by the kernel in kernel-space | |
Schedulable and preemptable, just like normal processes
Do not have an address space | |
Things like 'flush' tasks and 'ksoftirqd' are delegated to kernel threads | |
See kernel threads with `ps -ef` | |
Can only be created by other kernel threads | |
Usually forked off the `kthreadd` kernel process
Process Termination
------------------- | |
By signal or call to `exit()` # see `kernel/exit.c` | |
Releases resources (eg file objects) whose usage counts drop to 0
# TODO - what is a file descriptor | |
Calls `schedule` to switch to a new process | |
Removing the Process Descriptor
------------------------ | |
After the parent has received info on terminated child, child's `task_struct` is deallocated | |
wait() and co | |
- use wait4() | |
- std behaviour is to suspend execution until child exits | |
`do_exit() -> exit_notify() -> forget_original_parent() -> find_new_reaper()` (assigns the orphaned processes a new parent) | |
`init` routinely calls wait() on its children, cleaning up any zombies assigned to it | |
Process Scheduling | |
================== | |
Fundamental decision: decide which process will run next given a set of runnable processes | |
Multitasking | |
------------- | |
Interleaving execution of multiple processes | |
Runs processes simultaneously on multi-core machines
Illusion of concurrency on single processor machines | |
Enables processes to block or sleep when work is unavailable | |
- can wait until some event (keyboard input, network data, passage of time) | |
- many processes in memory, but only one in a runnable state | |
Two flavors | |
- cooperative multitasking | |
- preemptive multitasking (most modern OSs) | |
- Scheduler decides when a process is to cease running and a new process is to begin running | |
- 'Preemption' - involuntarily suspending a running process | |
- Usually time a process runs is predetermined - called a 'timeslice' | |
- Linux's unique 'fair' scheduler does not employ timeslices _per se_ | |
- coop multitasking - proc does not stop running until it voluntarily decides to do so (yielding)
- hung process can bring down the whole system
- scheduler cannot make global decisions about how long processes run
- processes can monopolize the processor for longer than the user desires
Linux's Process Scheduler | |
---------------------- | |
O(1) scheduler | |
Constant time decisions on timeslice calculation and per-processor runqueues | |
Scaled well for large 'iron' systems with tens if not hundreds of processors | |
# TODO - define pathological, context: pathological failure | |
Latency-sensitive applications were not serviced well
Interactive processes - any application that users interact with | |
Rotating Staircase Deadline scheduler (most notable of several new algorithms) | |
- introduced `fair scheduling` (borrowed from queuing theory) | |
Policy | |
-------- | |
Behavior of sched that determines what runs when | |
Can determine the 'feel' of a system | |
CFS - Completely fair scheduler | |
### I/O-bound vs processor-bound Procs | |
I/O - bound | |
Spends most of its time submitting and waiting on i/o requests | |
Usually runnable for only short durations as it eventually blocks waiting on more i/o
(io here means any type of blockable resource) | |
eg most GUI applications | |
Processor - bound | |
Spend much of their time executing code | |
Tend to run until preemption | |
Don't need to run often (fast system response not required)
Run less frequently but for longer durations | |
Some processes change behavior characteristics | |
Try to satisfy two conflicting goals | |
fast process response time (low latency) | |
maximal system utilization (high throughput) | |
Linux favors i/o-bound processes | |
### Process Priority | |
Rank processes based on their worth and need for processor time. Procs with higher priority run first and lower, later | |
Processes with the same priority run round-robin | |
Both user and system can set a process's priority to influence scheduling behavior | |
Two priority ranges | |
'nice' values - in the range -20 to +19
larger nice values are lower priority (they are nicer to other procs)
lower numbers are higher priority
linux controls proportional timeslice size; others (like OS X) control absolute timeslice
Can see nice values using `ps -el` | |
# Note - after looking, things like kernel workers get maximal priority, PulseAudio has high priority for a user-space proc (want continuous play)
Real-time priority | |
By default range from 0 to 99
Higher # - higher priority | |
All real-time processes are at a higher priority than normal procs | |
Disjoint value spaces from nice value | |
See with `ps -eo state,uid,pid,ppid,rtprio,time,comm` under RTPRIO | |
Value of '-' means the proc is not real-time | |
# Note - after looking, only RT user-space thing was a wifibus? | |
Timeslice | |
----------- | |
How long a task can run before it is preempted
Too long => application doesn't feel interactive
Too short => lots of time wasted on context switching
I/O-bound procs do not need long timeslices
Processor-bound processes crave long timeslices to keep their caches hot | |
### On Linux | |
Procs get a proportion of processor time (determined by weight) | |
# NOTE - a weight is given. If things are to be added, they must be normalized because the processor use will always attempt to be 100% (1) | |
Nice value weights this proportion
It is preemptive - so if a higher priority task becomes runnable,
it will preempt the current process if it has not yet reached its
share of the cpu (as determined by its proportion)
### Eg | |
Text editor and video encoder | |
Want text editor to have a large amount of time available to it,
not because it needs it, but because we want it to have time available
Want text editor to preempt the video encoder when it wakes up | |
Video encoding can take more than its allotted time, esp b/c the text editor is usually blocked on io | |
The key is that linux gives the text editor processor time immediately when it requires it (because it has used less than its allotted 50%)
The Linux Scheduling Algorithm | |
------------------------------- | |
Has several scheduler classes | |
Scheduler is modular | |
Allows different algorithms to schedule different types of processes | |
Algorithms can coexist | |
Base scheduler | |
code is in `kernel/sched.c` | |
it iterates over each scheduler class in order of priority | |
the highest priority scheduler class that has a runnable process wins | |
Completely Fair Scheduler | |
SCHED_NORMAL in Linux | |
code in `kernel/sched_fair.c` | |
### Process Scheduling in Unix Systems | |
Unix exports priority to the user-space in the form of nice values | |
- leads to some problems
1. Mapping nice values onto timeslices requires a decision about what absolute timeslice to allot | |
Leads to sub-optimal switching behavior | |
When only 2 low priority processes run, each gets a very small timeslice. Many more context switches are required than if time were determined proportionally, rather than absolutely
2. Difference of 0 to 1 is much smaller than 18 to 19 | |
Inverse processor time allotments | |
Background tasks (very nice tasks) | |
- usually need more computation time | |
- in unix, low priority gets smaller, not larger timeslices | |
Normal priority processes tend to be user tasks | |
- need shorter, more frequent time slices | |
Solution: make nice values geometric instead of additive | |
Relative nice values are off too | |
Proportion of nice values does not map well to the proportion of processor time. | |
Eg NV0 = 100ms, NV1 = 95ms # almost a 50/50 split
Eg NV18 = 10ms, NV19 = 5ms # first gets double the time | |
Incrementing or decrementing a nice value has wildly different effects depending on what value you start at | |
3. Absolute timeslices must be a thing the kernel can measure | |
In many OSs this means some multiple of the timer tick, which varies widely by machine
10ms on one might be 1ms on another | |
# TODO - reread this section after ch 11 | |
Solution: Mapping nice values to timeslices using a measurement decoupled from the timer tick | |
4. Interactive process wake-up | |
Often, you want to give a proc a temporary boost in priority so it can run in spite of an expired time slice | |
Doing this might open up the OS to a program that tries to game the algorithm for unfair processor time | |
Real underlying problem is that | |
While it is nice to have a constant switching rate
Fairness becomes variable | |
Solution: do away with timeslices completely and assign processes a proportion of the processor | |
Results in a constant fairness but a variable switch rate | |
### Fair Scheduling | |
CFS is based on the concept: 'Model process scheduling as if the system had an ideal, perfectly multitasking processor' | |
Perfect Fairness and Perfect Multitasking | |
- Timeslices are infinitesimally small
- For _any_ span of time, each process would have run the exact same amount as any other process
- With 2 processes, this is like running 2 processes at the same time at 50% power | |
Unfortunately we can't actually multitask procs like this
And infinitely small timeslices are unreasonable | |
- b/c switching cost | |
- b/c cache penalty | |
# TODO - more often, think about the idea / continuous situation | |
CFS actually | |
- Calculates how long a process should run as a fn of the total number of runnable processes | |
- Runs procs round robin - selecting the process that has run the least (is this conflicting? or do allotments of time keep round-robin actually in order)
- Nice values _weight_ the proportion of processor a process is to receive | |
- Each process runs for a 'timeslice' proportional to its weight / total weight of all runnable threads | |
- Actual timeslice calculation | |
- Set a target for its approximation of 'infinitely small' scheduling duration in perfect multitasking (known as _targeted latency_) | |
- Note, smaller targets yield better interactivity at the expense of higher switching costs (and worse throughput) | |
- TODO - need paper | |
- If targeted latency is 20 ms, with n tasks of the same priority, each task will run for 20 / n milliseconds
- since high n will result in high switching costs using this model, CFS imposes a floor on the timeslice assigned to a process | |
- floor is called _minimum granularity_ - default of 1ms | |
- so system is not fair when n is high, because all procs will run for the minimum timeslice | |
- TODO - figure out how the targeted latency changes with high n. I think large n just means the minimum viable targeted latency is larger than the one set
- Nice value imposes weight penalty on procs | |
eg 2 procs might split the 20ms targeted latency 15 and 5 | |
CFS isn't perfectly fair, it just approximates perfect multitasking
It can place a lower bound on the unfairness, which scales with n for n runnable procs
# TODO - how would you define latency to someone | |
The Linux Scheduling Implementation | |
------------------------------------ | |
Actual implementation lives in `kernel/sched_fair.c` | |
4 Components | |
- Time Accounting | |
- Process Selection | |
- The Scheduler Entry Point | |
- Sleeping and Waking Up | |
### Time Accounting | |
Most unix | |
Assign a timeslice | |
Decrement timeslice by the tick period each tick | |
On 0, preempt the proc | |
#### The Scheduler Entity Structure | |
Used by CFS to keep track of process accounting | |
Embedded in the _process descriptor_ `task_struct` | |
member var `se` | |
#### The Virtual Runtime | |
`vruntime` | |
- stores the _virtual runtime_ of a proc | |
- time spent running normalized by the number of runnable procs | |
- unit: nanoseconds (decoupled from the timer tick) | |
- on a perfect multitasking system, all tasks of same priority would always have the same virtual runtime | |
- used to determine how long a proc has run and how much longer it ought to run | |
`update_curr(struct cfs_rq *cfs_rq)` manages the accounting
### Process Selection | |
Run the process with the lowest vruntime | |
CFS uses a red-black tree to manage list of runnable procs | |
Note - tree is not actually traversed in finding the left-most node b/c it is cached | |
### The Scheduler Entry Point | |
`schedule()` | |
- entry point for rest of the system | |
- in turn calls `deactivate_task()`, which removes a sleeping task from the runqueue
- generic wrt the scheduler classes | |
- finds the highest priority scheduler class with a runnable proc, asks it what to run next
- invokes pick_next_task() | |
- CFS is the scheduler for normal processes | |
### Sleeping and Waking Up | |
Special state that signals to the scheduler that a task does not wish to be run | |
When sleeping, a task is _always_ waiting for an event | |
Events include | |
- specified amount of time | |
- more data from a file I/O (common) | |
- some other hardware event | |
A task may be forced into sleep if it tries to obtain a contended semaphore
Two states associated with sleeping | |
`TASK_INTERRUPTIBLE` | |
- respond to signals | |
- _spurious wake up_ - a wake-up not caused by the occurrence of the event, so check the condition and handle signals
`TASK_UNINTERRUPTIBLE` - ignores signals | |
both sit on a wait queue waiting for an event | |
#### Wait Queues | |
Can be declared statically or dynamically | |
Processes put themselves on wait queues and declare themselves not runnable | |
When the event associated with the wait queue occurs, the procs on the queue are awakened | |
Race conditions can arise when sleeping and waking are incorrectly implemented | |
eg. don't go to sleep after the condition becomes true
There is a prescribed approach to putting a process on the wait queue - use it should you ever need to (p 61) | |
#### Waking Up | |
`wake_up()` | |
- marks TASK_RUNNING, enqueues it | |
- sets need_resched if the awakened task's priority is higher than the priority of the current task | |
- code that causes the event to occur typically calls wake_up() on everything in its queue waiting for the data
When a task awakens, it is not safe to assume the event has occurred. It may have been a spurious wakeup - need to check the event
Preemption and Context Switching | |
------------------------------- | |
`context_switch()` | |
- switches from one runnable task to another | |
- called by schedule() | |
2 jobs | |
- switch the virtual memory mapping from that of the old to the new (via `switch_mm`) | |
- switch processor state from the previous proc's to the current's | |
- save and restore stack info / processor registers / arch specifics | |
- via `switch_to()` | |
Kernel must know when to call schedule()
can't trust user-space programs to do it
`need_resched` flag used to signify needed reschedule | |
- by `scheduler_tick()` | |
- by `try_to_wake_up()` when the awakened proc has higher priority
has some accessor methods | |
user preemption - preemption when kernel is about to return to user-space | |
- from system call | |
- from interrupt handler | |
kernel preemption - no need to wait for kernel code running in user context to return, can reschedule any time it is safe to do so | |
Not safe | |
- when a lock is held (lock count is non-0) | |
- can do this because the kernel is SMP-safe | |
can occur | |
- when an interrupt handler exits, before returning to kernel-space | |
- when kernel code becomes preemptible again
- if a task in the kernel explicitly calls `schedule()`
- if a task in the kernel blocks (which results in a call to `schedule()`)
Real-Time Scheduling Policies | |
------------------------- | |
Two real-time scheduling policies | |
`SCHED_FIFO` | |
first in, first out - without timeslices | |
always scheduled over `SCHED_NORMAL` | |
runs until it explicitly blocks or yields | |
no timeslice, can run indefinitely | |
only higher priority real-time proc can preempt | |
two or more at the same priority run round-robin | |
`SCHED_RR` | |
same as above, except with a predetermined timeslice | |
(`SCHED_NORMAL` is non-realtime policy ) | |
Managed by a special real-time scheduler in `kernel/sched_rt.c`
'soft real-time' - kernel tries to make timing deadlines but no promises | |
'hard real-time' - guarantees on the capability to schedule real-time tasks | |
# TODO - not sure what it means for nice values and rt values to share a priority space, and how we got a range of 100 to 139 | |
Scheduler-Related System Calls | |
------------------------------- | |
Bunch of C library functions which call system calls, almost directly | |
can | |
set a procs nice value | |
only root can set a negative nice value | |
set sched policy | |
get sched policy | |
set rt priority | |
get rt priority | |
get max rt pri | |
get min rt pri | |
get proc timeslice | |
set proc processor affinity | |
attempts soft affinity by default? | |
bit-map enforce hard processor affinity | |
get proc processor affinity | |
yield the processor
# Note, much of sys call code is checking arguments, setup, and cleanup | |
There is a load balancer that juggles procs around processors when necessary | |
`sched_yield()` - expires a proc so it won't run for a while, more so than just moving it to the end of the run array
Process and Thread Scheduling (Guest ch from diff book) | |
=============================== | |
Scheduler | |
--------- | |
composed of two separate modules | |
Process scheduling - decision-making policies to determine the order in which active processes should compete for the use of processors | |
Process dispatcher - binding of selected process to a processor (remove from run-queue, change status, load state) | |
implementation | |
might be called through syscalls, or | |
might be an anonymous process that polls | |
sometimes in a master/slave configuration | |
_quantum_ - predefined period of time the process is allowed to run
_priority_ - used to decide 'who next' | |
P = Priority(p), where P is a number and p is a process
P is sometimes static, given p
divides procs into _priority levels_ | |
_ready list_ (RL), struct to hold levels | |
framework for scheduling | |
2 fundamental questions | |
when to schedule - when should the scheduler be active | |
who to schedule - what proc | |
governed by | |
_priority function_ - seen above | |
_arbitration rule_ - breaks ties
decision mode | |
when are priorities evaluated | |
non-preemptive | |
run until block or termination | |
preemptive | |
stop the presses, there is more important shit to print | |
_quantum-oriented_ - performed each quantum | |
when other proc priority > current proc | |
usually more costly than non-preemptive
priority function
uses diff attributes from proc and system | |
common attrs | |
service time so far | |
is real time? | |
deadline | |
periodicity | |
external priority | |
memory requirements | |
_job scheduling_ | |
long term scheduling | |
choose among batch jobs | |
arrival might be arrival in the system | |
departure might be proc termination | |
_process scheduling_ | |
select among currently ready processes to run next | |
arrival might be change to ready state | |
departure might be change to running or blocked | |
_attained service time_ - amount of time since arrival the proc used the CPU. (Often the most important parameter in scheduling) | |
_real time in system_ - total time in system (attained + waiting time) | |
attained and real time of a proc make it easy to compute ratio of runtime | |
total service time | |
attained service time at time of departure | |
some systems know this time in advance | |
usually true of repetitive service tasks | |
sometimes predicted based on proc's most recent behavior | |
procs that block frequently will have smaller total service times | |
deadline | |
point in real time by which a task must be completed | |
time-sensitive computation, eg weather forecast, video transmission
shorter deadline ~= higher priority | |
periodicity | |
repetition of a computation | |
sometimes used to model event driven processes | |
especially when events are polled | |
fixed period automatically imposes an implicit deadline | |
external priorities | |
explicitly assigned priorities | |
memory requirements | |
major scheduling criterion in batch operating systems | |
interactive systems, this is a good measure of swapping overhead | |
Arbitration Rule
---------------- | |
Resolves conflicts btwn procs of equal priority | |
Could be random choice, could be round robin
Common Scheduling Algorithms
------------------------- | |
FIFO | |
arrival time is only criteria | |
Shortest-Job-First | |
non-preemptive | |
based on total service time | |
priority is the inverse of total service time
Shortest-Remaining-Time | |
preemptive and dynamic version of SJF | |
P = -(total time - attained service time) | |
Round-Robin | |
uses fixed quantum | |
quantum might be different at different cpu loads | |
linux sets a floor for minimum quantum time | |
avoids lots of context switching at high load | |
Multi-Level Priority | |
Fixed set of priorities, 1 to n | |
Each process is assigned a priority e | |
P = e | |
e can be changed by authorized system calls | |
usually preemptive | |
at each level - RR if preemptive, else FIFO | |
Multi-Level Feedback | |
similar to MLP, but | |
processes migrate to lower levels as attained service time increases | |
each level gets a tP max, if a process exceeds that value, its priority is decreased a level | |
There is a formula: P = n - floor(log2(a/T + 1))
n - the number of priority levels
a - attained service time
T - max time at the highest level
Rate Monotonic | |
Real time systems | |
procs and threads are often _periodic_ | |
computation must be done before the start of the next service time | |
Preemptive policy that dispatches according to period length | |
the shorter the period, the higher the priority | |
Earliest-Deadline-First | |
preemptive and dynamic | |
highest priority assigned to the process with the smallest remaining time until its deadline | |
assumes all processes are periodic | |
assumes deadline is equal to the end of the current period | |
probably not great for humans? or, maybe is good for some humans | |
P = -(d - r%d) | |
r - time proc entered the system | |
d - period | |
r / d - number of completed periods | |
r % d - time already expired in the current period
(Table is pretty cool) | |
Comparison of Methods | |
--------------------- | |
FIFO, SJF, SRT - developed primarily for batch processing systems
If total service times are known (or can be predicted)
SJF or SRT are usually preferable to FIFO
b/c both SJF and SRT favor shorter comp time over long
users want fast service for short tasks | |
total time required | |
independent of process order | |
avg turnaround time | |
can be changed with diff process orders | |
one of the more important metrics used to compare sched algos | |
# NOTE - it seems like it would be easy to show change in process attributes for different scheduling algorithms | |
# probably in a visually appealing way | |
_response time_ | |
important in time share systems | |
if important | |
implies a preemptive algorithm | |
good algos | |
round robin | |
responsiveness depends on q | |
q -> inf, rr becomes fifo | |
q -> 0, rr is dominated by context switching | |
multi-level feedback | |
MLF | |
IO bound procs tend to remain high priority | |
batch jobs tend towards the bottom of the list | |
Example Schedulers | |
---------------- | |
VAX/VMS | |
15 priorities for real-time, 15 for regulars | |
Realtime uses multi-level scheme (without feedback) | |
Regular processes
have standard priorities
adjusted with other events in the system | |
eg. read completion will ++ a proc priority | |
reaching the end of runtime will -- proc priority | |
Windows 2000 | |
similar to above except threads are scheduled rather than procs | |
# NOTE - scheduling threads could use the process (tgid struct) to keep track of the attained time - or some aggregate time | |
Minix | |
Multi-level queues | |
_task_ (kernel stuffs), | |
_server_, (memory, file, network management) | |
and _user_ levels | |
RR | |
Linux | |
no queues of threads/procs | |
sort by _goodness_ | |
base and current goodness is tied to the quantum | |
each thread has its own variable-length quantum | |
like a countdown timer
each thread | |
gets a base quantum (bq) | |
gets a remaining quantum (rq) | |
current goodness (g) | |
if (rq == 0) g = 0 # alloted quantum has been consumed | |
if (rq > 0) g = bq + rq | |
epochs | |
division of scheduler time | |
end when no threads can run | |
all threads have either 'exhausted quanta' or are 'blocked'
reset with formula `rq = bq + (rq / 2)` | |
each thread gets its base quantum plus half of its remaining | |
naturally favors I/O - bound threads | |
realtime threads | |
always executed before normal threads | |
can be scheduled RR or FIFO
helps to know if deadline will be met _before_ execution | |
_schedule_ - an assignment of procs to processors
_feasible schedule_ - iff all deadlines are satisfied | |
_optimal scheduling method_ - resulting schedule meets all deadlines whenever a feasible schedule exists | |
periodic processes | |
utilize ti/di of the cpu time | |
ti is the total service time for process i for one period | |
di is the period length of process i | |
overall utilization is | |
U = sum(i=1..n) ti/di
schedule is feasible if U <= 1 | |
Rate Monotonic | |
will find optimal for U <~ .7 | |
therefore, it is not optimal | |
Earliest-Deadline-First | |
will always find the optimal solution if there is one
Priority Inversion | |
----------------------- | |
Higher priority procs or threads can be delayed or blocked by lower priority ones | |
possible solutions | |
make critical sections non-preemptable | |
works well if critical sections are short and few | |
execute critical sections at the highest priority of any process that _might_ use them | |
suffers from overkill | |
leads to other forms of priority inversion | |
dynamic priority inheritance | |
when lower priority proc blocks a higher one, | |
the lower proc is assigned a priority = to the higher until | |
it exits the critical section. | |
allows for nested critical sections | |
blocking times bounded by critical section execution times
can be a compulsory requirement in real-time operating systems
POSIX focus | |
user can associate a _ceiling attribute_ with a mutex semaphore | |
integer specified | |
when mutex is activated, priority is automatically raised to the value of the ceiling | |
threads that use this mutex will always run at the same level and will not preempt each other | |
priority inheritance | |
Windows 2000 focus | |
every second, scan all ready threads | |
any that have been idle for more than 3 seconds | |
boost priority to max
extend quantum by 2x | |
no thread will be postponed indefinitely due to higher priority threads | |
Multiprocessor and Distributed Scheduling | |
==================== | |
Two principal approaches to scheduling procs and threads on multicomputers | |
Use a single scheduler | |
all CPUs are in same resource pool | |
any proc can be run on any processor | |
Use multiple schedulers | |
machines are scheduled separately | |
often a choice for non-uniform multiprocessors | |
different CPUs have different characteristics and functions | |
eg. special machines for I/O, filing, data acquisition, fast Fourier transforms, matrix computations
users and apps tend to have greater control over performance | |
Case: _the Mach Scheduler_ | |
each machine set has a global ready list of schedulable threads | |
dispatching | |
select highest priority thread from the local queue | |
when empty, select highest priority thread from global list | |
execute idle thread when both lists are empty | |
Case: _Windows 2000_ | |
affinity scheduling | |
goal: keep a thread running on the same processor (if possible) | |
make use of caching | |
each thread is associated with two integers | |
ideal processor | |
last processor | |
when thread is ready | |
prefer running on | |
1. the ideal processor | |
2. the last processor | |
3. the processor currently running the scheduler
4. any other processor | |
idle processor looks for proc to run | |
1. last thread to run on the processor
2. the thread for which this is the ideal processor
3. then whatever the top priority thread is
Distributed systems | |
normally have separate schedulers at each site (cluster) | |
processes usually preallocated to clusters | |
communication costs time | |
costs message traffic | |
applications have threads that utilize a shared address space | |
distributed shared memory and migrating threads | |
practical ideas | |
realization is still slow | |
often have functional distribution | |
leads to different schedulers | |
a node might specialize | |
focus on email - cause node is an exit node | |
printing (node has access to printer), etc | |
RTB requires being really fricken close to exchanges, need to fill request fast | |
Rate monotonic - preemptive policy that dispatches according to period length
shorter the period, the higher the priority | |
Earliest-Deadline-First (EDF) - preemptive and dynamic scheme
usually designed for real-time applications | |
Neither of the above two are optimal for multiprocessors | |
often satisfactory heuristics though | |
feasible schedules produced, when they exist and when machines are not under heavy load | |
preallocation of procs is most common because of the complexity of trying to meet real-time constraints
Linux Scheduler | |
- time complexity depends on number of priorities (queues)
- time complexity does not depend on the number of processes
Calculating Timeslices | |
Quantum = (140 - static priority) * 20 if SP < 120
Quantum = (140 - static priority) * 5 if SP >= 120
static priority - think nice & real-time stuffs | |
note higher priority process get _longer_ quanta | |
important process should run longer | |
dynamic priority | |
calculate from static priority and avg sleep time | |
record time sleeping (up to a max value) | |
when the process is running, decrease value each timer tick | |
[0, 10] - rough estimate of % time sleeping recently | |
0 hurts priority by 5 | |
5 is neutral | |
10 helps priority by 5 | |
DP = max(100, min(SP-bonus + 5, 139)) | |
_interactive_ - if bonus - 5 >= SP/4 - 28
low-priority processes have a hard time becoming interactive | |
default priority process becomes interactive when its sleep time is greater than 700ms | |
Using Quanta | |
each tick decreases the quantum of _current_ | |
process gets preempted when clock hits 0
if non-interactive -> send to expired list | |
if interactive -> send to end of current queue (scheduled again) | |
but running will decrease bonus (along with priority) | |
Avoiding Indefinite Overtaking | |
remember those 140 queues (_active_ and _expired_) | |
only run procs from active queues (put on expired when quanta == 0) | |
swap active and expired queues after running through all of them | |
pointers to current arrays | |
two arrays | |
- avoids the need for traditional aging | |
- why is aging bad? -> it's O(n) at each clock tick | |
- ergo, not required for peoples scheduling | |
Many OSs iterate over procs to update priorities | |
Linux is more efficient | |
- procs are touched on start and on stop only | |
- recalculate priorities, bonuses, quanta, and interactive status | |
- never loop over all procs! | |
Locking Runqueues | |
Sometimes for queue rebalancing, kernel locks queue | |
in order to move procs | |
Real-time scheduling | |
soft real-time scheduling | |
static priority [0, 99] are real-time | |
FCFS & RR | |
Sleeping and Waking | |
- proc needs to wait for events | |
- proc added to wait queue | |
- wakeups can happen too soon - need to check args and maybe re-sleep | |
- you dont wake a process | |
- you DO wake a wait queue | |
System calls | |
`nice()` - lower a process' static priority | |
`getpriority()/setpriority()` - change priorities of a process group | |
`sched_getscheduler()/sched_setscheduler()` | |
set scheduling policy and params | |
and others that start with `sched_` (see with man -k) | |
`scheduler_tick()` | |
called each timer tick to update quanta | |
`try_to_wake_up()`
attempt to wake a proc | |
put on run queue | |
rebalance loads | |
`recalc_task_prio()` | |
update average sleep time and dynamic priority | |
`schedule()` | |
pick the next process to run | |
`rebalance_tick()` | |
check if load-balancing is needed | |
Fair Share Scheduling | |
would need to | |
add new scheduler type | |
Two basic functions | |
time of day - a service provided to apps | |
interval timers - something should happen in n ms from now | |
_Real-Time clock_ | |
tracks time and date | |
can interrupt at a certain rate | |
can interrupt at a certain time | |
only used by Linux for time of day | |
_programmable interval timer_ | |
generates periodic interrupts | |
usually based on HZ | |
some other special, less common timers
Things executed on timer ticks | |
keep track of time | |
invoke dynamic timer routines | |
call `scheduler_tick()` | |
system uses best timer available | |
`jiffies` | |
incremented each tick | |
roughly the number of HZ since sys boot | |
32 bit counter - wraps around in ~50 days at HZ=1000
Time of Day | |
stored in xtime (`struct timespec`) | |
incremented once per tick | |
apparent tick rate can be adjusted slightly see `adjtimex()` | |
Kernel Timers | |
Dynamic timers | |
- call routine after a particular interval
- specify an interval, subroutine to call, and a param to pass subroutine
- params differentiate different instantiations of the same timer
- timers can be cancelled | |
Delay loops | |
- tight spin loops for very short delays (micro seconds or nano seconds) | |
- nothing else can use CPU during that time (except via interrupt) | |
- for when dynamic timers are too much overhead | |
System Calls | |
`time()` and `gettimeofday()` | |
`adjtimex()` | |
tweaks apparent clock rate (even the best crystals drift) | |
`setitimer()` and `alarm()` | |
interval timers for applications | |
user-mode interval timers | |
similar to kernel dynamic timers | |
System Calls | |
===================== | |
Allow user-space apps to | |
- access hardware | |
- create new procs | |
- communicate with existing ones | |
- request os resources | |
Layer between user programs and hardware | |
3 purposes | |
- abstracts hardware eg. read/write files dont care what media you are on | |
- security / stability | |
- virtualized system provided to the process | |
Without controlled access of sys resources | |
- couldnt multitask | |
- wouldnt be able to have virtual memory | |
Legal entry points to the kernel | |
- sys calls | |
- traps | |
- exceptions | |
Even device files, or files in /proc, are accessed via system calls | |
APIs, POSIX, and the C Library | |
------------------------------- | |
User space programs usually programmed against an API implemented in user-space, not directly to sys calls
Can create different APIs from building blocks provided by system calls
POSIX - series of standards from the IEEE (eye-triple-E) | |
Linux strives to be 'POSIX' and 'SUSv3'-compliant | |
A very common api is based on the POSIX standard | |
A meme related to interfaces in Unix is 'Provide mechanism, not policy' | |
Syscalls | |
--------- | |
Typically accessed by function calls in the C library | |
System calls also provide a return value of type `long` that signifies success or error | |
usually negative value denotes an error | |
usually 0 is a sign of success | |
`errno` - holds a special error code of the last error | |
`perror()` - used to print human-readable error code | |
tgid - is the thread-group id. This is == pid for normal processes; all threads of the same group share one tgid
`asmlinkage` | |
- only look on the stack for the fns arguments | |
- required for all syscalls
in kernel, all syscalls are prefixed with `sys_` | |
### System Call Numbers | |
Unique # used to reference a specific system call | |
User-space process identifies calls by #, not name | |
#, once assigned, cannot change - else compiled apps will break | |
### System Call Performance | |
Faster than many other OSs
Cause | |
- fast context switching | |
- streamlined enter/exit
- simple call handler | |
System Call Handler | |
------------------- | |
Kernel fns exist in a protected memory space. User-space programs cannot call them directly | |
Method - use an interrupt: incur an exception and the system will switch to kernel mode, executing the exception handler (in this case, the syscall handler) | |
# TODO - difference between throwing an exception and 'trapping into the kernel' | |
Some processors have a special sysenter feature | |
### Denoting the Correct System Call | |
Architecture dependent, on x86, user-space app sets the eax register before causing the trap into the kernel | |
### Parameter passing | |
Sent to the kernel on registers `ebx, ecx, edx, esi, and edi` | |
Return value is also sent via register | |
on x86, the `eax` register is used | |
System Call Implementation | |
---------------------------- | |
Must first define its purpose - exactly one | |
What are the new system call's arguments, return value, err codes | |
Should have a clean and simple interface | |
Small # args possible | |
Remember, semantics and behaviour are important, they must not change | |
- Consider how the function use might change over time | |
- Can your method be changed while maintaining compatibility?
- Will bugs be fixable? | |
- Are you needlessly limiting the fn in some way? | |
- Purpose remains the same, but uses may change | |
- Dont make architectural changes
### Verifying the Params | |
Thorough checking is in order - security is imperative
Check valid and legal, but also correct | |
Cant trust user-space procs | |
Always check validity of pointers!!! | |
- must point to user-space memory | |
- must point to memory in the process's address space | |
- if reading, memory must be marked readable | |
- if writing, memory must be marked writable | |
- if executing, memory must be marked executable | |
there are helper fns to do this | |
System Call Context | |
----------------- | |
In proc context | |
- Kernel can sleep | |
Makes available most kernel functionality | |
- is fully preemptible | |
Must make sure call is reentrant | |
otherwise trouble occurs when another higher priority proc makes same syscall in the middle of a lower prior proc | |
### Final Steps in Binding a System Call | |
Add entry to the end of the syscall table | |
- must be done for each arch | |
For each architecture, define the call number in `asm/unistd.h` | |
Compile the syscall into the kernel image (as opposed to compiling as a module) | |
# Note - interrupt handlers cannot sleep, and are therefore much more limited in what they can do than syscalls running in proc context | |
Since new sys calls will not be in the C libs, you can wrap access with macros. This will handle setting up register contents and issuing the trap instructions. `_syscalln()` where n is between 0 and 6. (# of params) | |
_see this section for use of your new syscall_ | |
### Option of a New Syscall | |
#### Pros | |
Simple to implement, easy to use | |
Fast performance on Linux | |
#### Cons | |
Requires a syscall number, which needs to be officially assigned to you | |
Written in stone, interface cannot change | |
Must be explicitly supported by each architecture | |
Not easily used from scripts | |
Cannot be accessed directly from filesystem | |
Hard to maintain outside of the master kernel tree
Overkill for simple info exchanges | |
#### Alternatives | |
Implement a device node and `read()` and `write()` to it | |
use `ioctl()` to manipulate specific settings or retrieve specific info | |
Certain interfaces, such as semaphores, can be represented as file descriptors and manipulated as such | |
Add the information as a file to appropriate location in sysfs | |
# TODO - what is sysfs? | |
# Note: you can get a proc's processor affinity in linux with `sched_getaffinity`. In a group context, a warm cache of the processor is relevant. If someone is familiar with offer caching code, you probably want them to continue rather than ramping-up someone else.
# Note: for humans: loading something onto disk takes time (learning something for the first time), disk to memory is faster (recall of past), recent tasks is faster still | |
Other | |
------ | |
# note - preempt notifiers - let something know that a process has been preempted | |
# TODO what is - group leader | |
# TODO what is - slacktime | |
Kernel Data Structures | |
======================== | |
Dont roll your own | |
Not inclusive, just the basics in this book | |
- linked lists | |
- queues | |
- maps | |
- binary trees | |
Linked Lists | |
----------------- | |
Allows storage of variable number of elements | |
Elements created at different times
- must be variable in length | |
- may occupy discontiguous regions in mem | |
Single linked, double linked, or circular linked | |
Linux Kernel LL is unique, but pretty much a circular, doubly linked list | |
### Moving Through a Linked List | |
Linear movement | |
Bad at | |
Random access | |
Good at | |
Dynamic addition / removal O(1) ftw | |
Iterating | |
### In Linux Kernel | |
Rather than including pointers in the data struct, include a pointer to a LL node instead | |
`linux/list.h` | |
Reference to list_head in the data struct allows several macros, such as `list_add()` or `container_of()`, which take a list_head as an argument
Since the offset of a node pointer in a data struct is set at compile time, it is easy to get the containing structure of the list
Presumably `const typeof( ... )` will get the type of the struct in a generic way, so `container_of()` doesnt need to know anything about it | |
Be sure it is embedded in data_struct, otherwise macros arent helpful | |
Need to init the list head on creation | |
Also, there are some safe(r) methods
Also, also, you may need to synchronize for exclusive list access | |
Queues | |
----------------- | |
Producers and consumers | |
see `kernel/kfifo.c` and `linux/kfifo.h` | |
Queues are one page-sized by default | |
Queues leave it up to the user to say | |
- How many bytes to copy in | |
- How many bytes to copy out | |
If a queue is nearly full, only a partial add will be done | |
If a queue is nearly empty, only a partial read will be done | |
Maps | |
------------ | |
Aka associative array | |
Keys to values | |
Hashtables are a type of map - but not all maps are hashtables | |
Trees can be used as well (self-balancing) | |
Hash | |
- better average-case complexity | |
Trees | |
- better worst-case complexity | |
- can also maintain a sorted order | |
- can use any type of key, so long as it can define the <= operator | |
### In Linux | |
Map is used to map a UID to a pointer | |
Will also generate a UID for you | |
`idr` maps user-space UIDs to their kernel data structures | |
Need to preallocate a UID before getting a new one, cause memory and locking are hard | |
Binary Trees | |
---------------- | |
Math says it is an 'acyclic, connected, directed graph' where each node has 0 or more outgoing edges | |
Binary tree is a tree with no more than 2 outgoing edges per node
_binary search tree (BST)_ - specific ordering on nodes | |
left of root is only nodes < root | |
right of root is only nodes > root | |
all subtrees are also binary search trees | |
Pros | |
fast search | |
fast ordered traversal | |
### Self-balancing BST | |
Depth of all leaves differs by at most 1 | |
In other words, height is managed | |
#### Red-Black Tree | |
type of self-balancing bst | |
binary tree of choice in linux | |
6 properties are enforced | |
Ensures that the deepest leaf has a depth of no more than double that of the shallowest leaf | |
Tree is always semi-balanced | |
        B
      /   \
     R     B
    / \   / \
   B  nil nil B
It does this by rotating shit thats unbalanced, but dont worry about it unless you are an undergrad compsci student, which im not (anymore) | |
`rbtree` - `lib/rbtree.c` and `linux/rbtree.h` | |
inserts are always logarithmic | |
What Data Structure to Use, When | |
-------------------------------- | |
What is primary access method? | |
LL | |
if iterating | |
when performance is not important | |
when interfacing with other LL consuming code | |
Queues | |
if producer / consumer pattern is used | |
if you know your buffer can hold your data | |
Maps | |
if UID to obj is needed | |
Rb tree | |
if storing lots of data is required | |
fast searching, with efficient in-order traversal | |
in-memory footprint is not bad | |
Radix tree | |
Algorithmic Complexity | |
---------------------- | |
What is the _asymptotic behavior_ of the algorithm | |
Big-oh - upper bound | |
Big-theta - both an upper and lower bound | |
usually when big-o is dropped, they are really talking about the least such big-o (the big-theta)
Time Complexity - See AoA - Colby CS | |
Interrupts and Interrupt Handlers | |
================================ | |
Communicating with hardware is a core responsibility | |
Aint got no time to wait for hardware | |
Rather than polling these devices, the responsibility is on the HW to interrupt the processor when it wants attention
Polling - check periodically | |
Interrupt - send a signal to the processor | |
Interrupt handlers - fn to handle interrupt | |
HW device interrupts | |
async w.r.t. processor clock | |
aggregated by multiplexers
directly connected to the CPU | |
have a unique value - for differentiation | |
aka interrupt request (IRQ) lines | |
IRQ 0 on a classic PC is the timer interrupt | |
IRQ 1 on a classic PC is the keyboard | |
Not always dynamically assigned | |
PCI bus dynamically generates values | |
Exceptions | |
- synchronous ~interrupts~ | |
- generated by the processor | |
- from software | |
Interrupts | |
- asynchronous interrupts | |
- generated by hardware | |
Interrupt Handlers | |
------------------ | |
aka interrupt service routine (ISR) | |
An associated handler for every device | |
Function | |
One per device | |
part of the _driver_ | |
normal C functions | |
run in _interrupt context_ | |
aka _atomic context_
code cannot block | |
must run quickly | |
acknowledge hardware, at a minimum | |
Top Halves vs Bottom Halves | |
--------------------------- | |
To balance large amount of work with response speed | |
### Top Half
- only the time critical stuff | |
### Bottom Half
- run at more convenient time | |
Network card example | |
- card has very little memory | |
- dont want buffer overruns | |
Registering an Interrupt Handler | |
-------------------------------- | |
`request_irq()` in `linux/interrupt.h` | |
One driver for every device | |
One interrupt handler per driver | |
`request_irq()` | |
params | |
int irq | |
the interrupt number | |
usually dynamically allocated | |
irq_handler_t handler | |
pointer to actual handler fn | |
unsigned long flags | |
bitmask of one or more of the flags | |
IRQF_DISABLED | |
usually poor form to use | |
handler fn blocks other interrupts from occurring
IRQF_SAMPLE_RANDOM | |
use as source of entropy | |
helps with random number generation | |
obv. dont set for predictable device interrupts | |
IRQF_TIMER | |
this handler processes interrupts for the system timer
IRQF_SHARED | |
interrupt line can be shared among multiple interrupt handlers | |
const char *name
ascii text name of device | |
used by `/proc/irq` and `/proc/interrupts` | |
void *dev | |
used when lines are shared | |
provide a unique cookie for later identification on removal | |
the way the kernel tells which handler to remove for shared lines | |
common to use the driver's device structure | |
can sleep | |
so no calling in interrupt context | |
cause somewhere `kmalloc()` is called | |
careful to initialize hardware fully before registering handler | |
`void free_irq(unsigned int irq, void *dev)` | |
disconnects line iff (not shared || last registered) | |
must be called from process context | |
Unless your handler is old and crusty | |
good chance you will need to share | |
Writing an Interrupt Handler | |
----------------------- | |
`static irqreturn_t intr_handler(int irq, void *dev)` | |
params | |
int irq | |
line # | |
void *dev | |
same as passed to `request_irq()` | |
typically the `device` structure | |
b/c unique to each device | |
potentially useful to have within the handler | |
return | |
IRQ_NONE | |
handler's device was not the originator
IRQ_HANDLED | |
handler correctly invoked | |
IRQ_RETVAL(val) | |
macro for returning one of the above two based on val | |
used to determine if device is issuing _spurious interrupts_ | |
spurious interrupts - unrequested | |
Interrupt handlers in Linux need not be reentrant | |
Processors do not receive interrupts on the same line while executing a handler for that line
Avoids dealing with nested interrupts | |
Code can be preempted, though, by interrupts on other lines | |
### Shared Handlers | |
- IRQF_SHARED flag must be set
- dev arg must be unique | |
- Interrupt handler must be able to distinguish what device actually generated an interrupt | |
hardware capability must exist | |
Invoked sequentially | |
Handler should quickly exit if its device was not the generator of the interrupt | |
Real-time clock (RTC) | |
`drivers/char/rtc.c` | |
separate from the system timer | |
sets the system clock | |
provides an alarm | |
supplies a periodic timer
driver's init code registers the handler | |
Interrupt Context | |
----------------- | |
Not associated with a running process | |
`current` macro is not relevant | |
cannot sleep | |
other functions are also not relevant | |
Time-critical | |
can wait-loop, but dont | |
push as much as possible into the bottom half | |
Stack location varies | |
Handler should not care what stack setup is in use | |
Always use the absolute minimum amount of stack space | |
Processor jumps to a unique location for each interrupt line | |
executes code at that location | |
### `/proc/interrupts` | |
stats related to interrupts
line # | |
# of times executed | |
handling controller (eg `XT-PIC`) | |
name | |
_procfs_ | |
a virtual file system | |
in kernel memory | |
mounted at `/proc` | |
reading / writing files | |
invokes kernel functions | |
seems like actually reading/writing files (but are just fns) | |
`XT-PIC` | |
std PC programmable interrupt controller | |
Interrupt Control | |
----------------- | |
Can enable/disable | |
a line for the whole machine | |
aka _masking out_ | |
not used as often now since, increasingly, lines are shared | |
the interrupt system on a processor | |
`local_irq_disable()` && `local_irq_enable()` | |
or safer: `local_irq_save(flags)` && `local_irq_restore(flags)`
Reasons | |
need synchronization (disable interrupt preemption) | |
kernel will not be preempted | |
Does not prevent concurrent access from another processor | |
need a lock for that | |
No `cli()` (global disable interrupts) | |
means all synchronization must be specific and narrow | |
results in faster performance | |
### Status of the Interrupt System | |
`in_interrupt()` | |
- nonzero if kernel is executing a handler | |
- nonzero if kernel is executing a bottom half | |
- if 0, kernel is in process context | |
`in_irq()` | |
- nonzero if kernel is executing a handler | |
Bottom Halves and Deferring Work | |
========================== | |
Top half is not enough | |
Why need for fast | |
- TH interrupts all the code | |
- TH disables at least a line, if not all HW comms | |
- Timing critical, eg. small HW buffers | |
- Cannot block | |
There is a need to do less critical work later. | |
Bottom Halves | |
---------------- | |
Do most work here. All but the critical. | |
You should probably process data copied in the TH, though. | |
Tips for dividing work: | |
perform in handler | |
- perform in handler | |
- HW related | |
- ensure another interrupt doesnt happen | |
perform in bottom half | |
- all other | |
### Several Mechanisms | |
First approach - a static set of handlers | |
Second approach - task queues that a driver could append to | |
Third approach | |
Softirqs | |
- statically defined - compile-time
- can run simultaneously on any processor | |
- two of the same can run concurrently | |
(require more care) | |
Tasklets | |
- dynamically defined | |
- can run simultaneously on any processor | |
- two of the same can _NOT_ run concurrently | |
- build on softirqs | |
Task Queue replaced with Work Queue | |
- queue work for later | |
- runs jobs in process context | |
(kernel timers) | |
- after specific time, rather than 'any time but now' | |
Softirqs | |
------------ | |
Rarely used directly. Tasklets are a much more common form.
`kernel/softirq.c` | |
Usually reserved for the most timing-critical / important bottom-half processing
Currently, only used by networking and block subsystems. But also, tasklets and kernel timers are built on top. | |
Needed for scalability. If you do not need to scale to infinitely many processors, consider a tasklet
Used for high frequency or highly threaded use. | |
Represented by the `softirq_action` structure (see `linux/interrupt.h`)
Each is registered in a 32-entry array | |
only nine exist now | |
A softirq never preempts another softirq | |
But it can run on a different processor
Only interrupt handlers can preempt | |
Must mark the softirq before it will run | |
aka _raising the softirq_ | |
usually done by the interrupt handler | |
check for pending occurs | |
in return from HW interrupt code path | |
in the `ksoftirqd` kernel thread | |
any code w/ explicit checks | |
eg. networking subsystem | |
### Use | |
index | |
must be assigned (`linux/interrupt.h`) | |
used for priority - lower executes before higher | |
handler must be registered at run-time | |
`open_softirq()` | |
eg `open_softirq(NET_TX_SOFTIRQ, net_tx_action)` | |
To avoid the need for locking, try to use only per-processor data
Once installed and initialized
can be raised using `raise_softirq(NET_TX_SOFTIRQ)` | |
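Taken together, the steps above look roughly like this sketch; `MY_SOFTIRQ` and `my_softirq_action` are hypothetical names, and the two-argument `open_softirq()` matches the 2.6-era kernels these notes describe:

```c
#include <linux/interrupt.h>

/* Hypothetical handler. Runs with interrupts enabled and cannot sleep;
 * prefer per-processor data here to avoid locking. */
static void my_softirq_action(struct softirq_action *a)
{
	/* ... do the deferred work ... */
}

static int __init my_init(void)
{
	/* MY_SOFTIRQ is a hypothetical index that would have to be added
	 * to the enum in <linux/interrupt.h>; lower indexes run first. */
	open_softirq(MY_SOFTIRQ, my_softirq_action);
	return 0;
}

/* Later, typically from the interrupt handler: mark it pending. */
static irqreturn_t my_irq_handler(int irq, void *dev)
{
	raise_softirq(MY_SOFTIRQ);
	return IRQ_HANDLED;
}
```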
Tasklets | |
----------- | |
Called by softirq logic | |
`HI_SOFTIRQ` - higher priority tasklets | |
`TASKLET_SOFTIRQ` - regular | |
Represented by the tasklet structure | |
`struct tasklet_struct` | |
`unsigned long state` | |
0, `TASKLET_STATE_SCHED`, or `TASKLET_STATE_RUN` | |
`count`
tasklet can only run if zero (nonzero means disabled)
`func` | |
called the same way `action` is in softirqs | |
### Scheduling tasklets | |
Multiplexed on two softirqs | |
_scheduled tasklets_
- equivalent of raised softirqs
- stored in two linked lists per processor
`tasklet_vec` - regular tasklets | |
`tasklet_hi_vec` - high priority | |
- cannot be scheduled if already scheduled | |
System disables interrupts while tasklets are being sched'd | |
Raises the `TASKLET_SOFTIRQ` or `HI_SOFTIRQ` which will call | |
`tasklet_action()` and `tasklet_hi_action()` respectively | |
Loop over each pending tasklet in the retrieved list | |
if already running on another processor | |
skip it | |
else | |
mark it as being run, and run it | |
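The loop just described, slightly paraphrased from `kernel/softirq.c` (details vary by kernel version, and the list-head name below is a hypothetical stand-in for the per-CPU `tasklet_vec`):

```c
/* Simplified sketch of tasklet_action(): steal this CPU's pending list
 * with local interrupts disabled, then walk it with interrupts on. */
struct tasklet_struct *list;

local_irq_disable();
list = this_cpu_tasklet_vec_head;	/* hypothetical name for the list head */
this_cpu_tasklet_vec_head = NULL;
local_irq_enable();

while (list) {
	struct tasklet_struct *t = list;
	list = list->next;

	if (tasklet_trylock(t)) {		/* running on another CPU? skip it */
		if (!atomic_read(&t->count)) {	/* zero count: not disabled */
			clear_bit(TASKLET_STATE_SCHED, &t->state);
			t->func(t->data);	/* run the handler */
		}
		tasklet_unlock(t);
	}
	/* the real code requeues a busy tasklet and re-raises the softirq */
}
```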
### Using Tasklets | |
#### Declare your tasklet | |
statically or dynamically | |
statically - use macros | |
use `DECLARE_TASKLET` | |
or `DECLARE_TASKLET_DISABLED` (count is initially 1, not 0) | |
both create `tasklet_struct` | |
dynamically create `tasklet_struct` and call `tasklet_init` | |
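Both routes might look like this; `my_tasklet_handler` and the zero data value are hypothetical, and the three-argument macros match the kernels these notes cover:

```c
#include <linux/interrupt.h>

static void my_tasklet_handler(unsigned long data);

/* Static: count starts at 0, so the tasklet is enabled... */
DECLARE_TASKLET(my_tasklet, my_tasklet_handler, 0UL);
/* ...or at 1, so it stays disabled until tasklet_enable() is called. */
DECLARE_TASKLET_DISABLED(my_quiet_tasklet, my_tasklet_handler, 0UL);

/* Dynamic: fill in a tasklet_struct you already have. */
static struct tasklet_struct my_dyn_tasklet;

static int __init my_init(void)
{
	tasklet_init(&my_dyn_tasklet, my_tasklet_handler, 0UL);
	return 0;
}
```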
#### Write your handler | |
`tasklet_handler` is passed an `unsigned long` data argument
Writing | |
Cannot sleep | |
no semaphores | |
no blocking functions | |
Interrupts are enabled | |
careful if you share data with one | |
#### Scheduling Your Tasklet | |
Call `tasklet_schedule` with a pointer to your `tasklet_struct` | |
If already scheduled, it only runs once | |
If already running (eg on another processor), tasklet is rescheduled | |
As an optimization, a tasklet always runs on the processor that scheduled it (hopefully making better use of cache) | |
Need to enable the tasklet if it is created disabled | |
Disabled tasklets are just skipped | |
You can kill a pending tasklet with `tasklet_kill` (sleeps) | |
Helpful when a tasklet frequently reschedules itself | |
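A minimal scheduling sketch, assuming a hypothetical tasklet declared elsewhere with `DECLARE_TASKLET_DISABLED(my_tasklet, my_tasklet_handler, 0UL)`:

```c
#include <linux/interrupt.h>

static void my_start(void)
{
	tasklet_enable(&my_tasklet);	/* count back to 0: allowed to run */
	tasklet_schedule(&my_tasklet);	/* mark pending; runs soon on this CPU */
}

static void my_stop(void)
{
	tasklet_disable(&my_tasklet);	/* waits for a running instance to finish */
	tasklet_kill(&my_tasklet);	/* pull it off the pending list; may sleep */
}
```

`tasklet_disable_nosync()` also exists for the case where you do not want to wait for a running instance.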
#### ksoftirqd | |
Per processor kernel threads | |
Help process softirqs and tasklets | |
Grew out of problem with heavy interrupt load | |
esp. with softirqs that rescheduled themselves | |
procs would starve for processor time | |
but softirqs need to be processed | |
Possible solution 1 | |
Perform rescheduled softirqs before returning to procs | |
Might starve out processes under high load | |
User space needs to be run periodically
Possible solution 2 | |
Don't handle reactivated softirqs
must wait for the next time softirqs are processed | |
doesn't take advantage of an idle system
may starve softirqs | |
Actual Solution | |
Don't immediately process reactivated softirqs
if softirqs are excessive, wake up some special kernel threads | |
special kernel threads | |
run at lowest priority | |
one per processor | |
named `ksoftirqd/n` | |
loop over pending softirqs | |
sleep to yield to other procs | |
cooperatively nice to other work, too
#### The Old BH Mechanism | |
Old BH | |
Max 32 | |
Must be statically defined | |
inaccessible to modules
Dynamically defined handlers had to piggyback off an already defined one
Only one handler can execute at a time | |
(strictly serialized) | |
Did not scale well to multiple processors
network layer, in particular, suffered | |
Much like tasklets in other regards
Work Queues | |
----------------- | |
Defer work into a kernel thread | |
called _worker threads_ | |
default ones named `events/n` | |
one per processor | |
must have a strong reason not to use default | |
eg. processor-intense | |
performance-critical | |
prevent starving out other default work
Always run in process context | |
Can sleep - allows
allocating lots of memory
using semaphores
performing block I/O
Decision - if you need to sleep, use a work queue | |
otherwise, tasklet | |
Alternatively - one could spin up a new kernel thread | |
but this is frowned upon | |
or punishable.. | |
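For the common case, the default `events/n` threads are driven through a small API; the names below are hypothetical, and the two-argument `DECLARE_WORK()` matches roughly 2.6.20-and-later kernels:

```c
#include <linux/workqueue.h>

/* Hypothetical handler: runs in process context, so it may sleep,
 * allocate with GFP_KERNEL, take semaphores, and do block I/O. */
static void my_work_handler(struct work_struct *work)
{
	/* ... deferred, sleep-friendly work ... */
}

static DECLARE_WORK(my_work, my_work_handler);
static DECLARE_DELAYED_WORK(my_delayed_work, my_work_handler);

static void my_defer(void)
{
	schedule_work(&my_work);	/* run on a default events/n thread */
	schedule_delayed_work(&my_delayed_work, 5 * HZ); /* >= 5s from now */
}
```

`flush_scheduled_work()` can then be used to wait for the default queue to drain, e.g. before module unload.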
### Data Structures | |
#### Representing the Threads | |
`workqueue_struct` | |
one per _type_ of worker thread | |
interface / abstraction exposed to the rest of the system | |
holds a `cpu_workqueue_struct` per processor | |
`cpu_workqueue_struct` | |
#### Representing the Work |