Because sometimes in business, your goals are tied to fiscal quarters
Linux - elitist attitude - only considers patches for real-world problems. Marketing bullets or one-off requests, such as pageable kernel memory, have received no consideration
Kernel Development
Introduction
==============
History
-----------
Unix - Elegant OS
(historically)
- only a few hundred system calls
- almost everything is a file
- written in C (portable)
- fast process creation time
- unique sys call 'fork'
(today)
- preemptive multitasking
- multi-threading
- virtual memory
- demand paging
- shared libraries w/ demand loading
- TCP/IP networking
- Many variants scale to hundreds of processors
- Some run on small embedded devices
Linux
- Born of frustration at Unix's non-OpenSource
- Implements the Unix API as defined by POSIX and the Single UNIX Specification
- Not a direct descendant of Unix
- Modular design
- Can preempt itself (kernel preemption)
- Kernel threads
- Dynamic binary loading in Kernel image (kernel modules)
- Does not differentiate between threads and normal processes
- all procs are the same, some just share resources
- Hot pluggable events ?
- TODO - look into why linux ignores some common Unix features that kernel developers consider poorly designed, such as STREAMS, or standards that are impossible to cleanly implement (p 8)
System Components Usually Include
- Kernel
- C lib
- toolchain
- basic sys utils (eg, login process and shell)
OS Includes (Basic use and administration)
- Kernel (Manages hardware, distributes system resources)
- Interrupt handlers
- Scheduler - to share processor time btwn procs
- Memory Management system
- System services (eg. networking / inter-proc comms)
- Device drivers
- Boot loader
- Command shell / UI
- Basic file system utilities
Kernel
- Above user programs
- Privileged access (kernel space vs user space)
- C library provides higher-level functions (although not always that different)
- C libs make system calls
- Not all C-lib functions use system calls (eg strcpy)
- applications get work done through system calls
- in one of three states (always, even corner cases)
- User-space, executing process code
- Kernel-space, process context, executing system calls
- Kernel-space, interrupt context, handling interrupts
- Usually a monolithic static binary
- Usually require a system with a paged memory-management unit (MMU)
- enables memory protection
- each proc gets uniq addr space
- Monolithic servers (Linux)
- Simpler design
- Microkernels (Parts of OS X and Windows NT)
- Several 'server' processes that communicate via IPC (inter-proc com)
- Can swap out server instances
Interrupts
- stop processor directly
- in turn, stops the kernel
- interrupts have numbers associated w/ them (hardware)
- handlers process and respond
- do not run in process context
Getting Started w/ the Kernel
=============================
[Directories] - see comprehensive lists
usr - early user-space code (initramfs) ?
Built with flags
Binary flags
Tristates - don't build, build as module, build into the kernel image
Configure flags with
make config (takes long time)
make menuconfig (ncurses)
make gconfig (gtk+)
or just edit `.config`
Build with
`make`
`make -jn` multi-process build, where n is the # of parallel jobs
Installing
Architecture and boot-loader dependent - see their docs
Modules are architecture independent
`make modules_install`
Warnings
Kernel lacks memory protection afforded to user-space
C libs not available - too heavy / slow
see instead `linux/things.h`
also - inotify?
printk rather than printf
inline functions for shit that needs to be fast
inline functions usually declared in header files
`static inline void wolf(unsigned long tail_size)`
# TODO - what is asm() assembly compiler directive.
Usually only used in parts of the kernel that are unique to a specific architecture - because assembly is arch-specific and Linux aims to support many platforms
Small per-process fixed-size stack
Dereferencing a NULL pointer (illegal memory access)
- Major kernel error
- No `SIGSEGV` errors
Kernel memory is not pageable. Every byte you use is one less byte of available physical memory
# TODO - what does it mean to 'catch a trap'
Dont do floating point calcs
Kernel has small stack
Symmetrical multiprocessing (SMP)
Kernel must synchronize between scheduled tasks
Spinlocks and Semaphores usually used to avoid race conditions
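A minimal kernel-style sketch of that synchronization (assuming `linux/spinlock.h`; the names here are illustrative, not from the book):

```c
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);

/* guard data shared between tasks scheduled on different processors */
static void update_shared(unsigned long *shared, unsigned long v)
{
        unsigned long flags;

        spin_lock_irqsave(&my_lock, flags);     /* lock + mask local irqs */
        *shared = v;                            /* the critical section */
        spin_unlock_irqrestore(&my_lock, flags);
}
```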
Portability - handful of rules
Segregate system-specific code to appropriate dirs
Remain endian neutral
Be 64-bit clean, do not assume word or page size
Async interrupts, preemption, and SMP mean synchronization and concurrency are major concerns
3. Process Management
====================
Process
- executing program code (the 'text section')
- Open files
- pending signals
- internal kernel data
- processor state
- memory address space (with one or more memory mappings)
- one or more threads of execution
- a data section containing global vars
Tasks kept in doubly linked list
Internally to the kernel known as a task
`task_struct`
- all the data needed by the kernel about a process
- allocated by the slab allocator
- the kernel stack holds the `thread_info`
- `thread_info` has a pointer to the `task_struct`
- there is a maximum # of procs, 32k on old machines (cause PIDs were small ints)
- will wrap around
- can be as high as 4M; set in /proc/sys/kernel/pid_max
- `current` macro to get the `task_struct` of the currently running proc
- open files
- proc's addr space
- pending signals
- proc state
TODO - revisit 'slab allocator' & thread_info
Threads
- objects of activity within a proc
includes
- unique program counter
- process stack
- set of processor registers
- the kernel schedules threads, not processes
- Linux does not differentiate between threads and procs
threads are just a special kind of proc
Procs
- procs provide two virtualizations
- virtual processor - proc thinks it is alone on the cpu
- virtual memory - as if it were alone in its use of memory
- Threads share virtual memory but not virtual processor
- a program is not a process
must also include related resources
Important system calls
fork()
- creates a new proc
- returns 2x, once in the parent, once in the child
exec() - function family
- creates a new addrs space and loads a new program
exit()
- terminates the process and frees all resources
- puts proc into zombie state
wait4()
- parent inquires about the status of a terminated child
wait()
- parent can resolve zombie state
waitpid()
- parent can resolve zombie state
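A user-space sketch of those calls working together (minimal, error handling trimmed):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        pid_t pid = fork();       /* returns twice: 0 in child, child's pid in parent */

        if (pid == 0) {
                /* child: replace our address space with a new program */
                execl("/bin/echo", "echo", "hello from child", (char *)NULL);
                _exit(EXIT_FAILURE);    /* only reached if exec failed */
        }

        int status;
        waitpid(pid, &status, 0); /* reap the child, resolving its zombie state */
        if (WIFEXITED(status))
                printf("child exited with %d\n", WEXITSTATUS(status));
        return 0;
}
```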
Process State
-------------
Condition of the Proc (5 flags)
kernel function `set_task_state(task, state);` # see `linux/sched.h`
TASK_RUNNING
- either currently running or on a run queue
- only possible state for proc running in user-space
TASK_INTERRUPTIBLE
- sleeping, eg waiting for some condition
TASK_UNINTERRUPTIBLE
- sleeping, does not wake up and become runnable on signal
__TASK_TRACED
- traced, eg. by ptrace
__TASK_STOPPED
- not running, cannot run
- eg SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU received
Process Context
------------------
Program code runs in user-space
System calls or exception triggering
- will enter kernel-space
- are the only interface to the kernel
- kernel is 'executing on behalf of the process'
- kernel is in 'process context'
The Process Family Tree
--------------------
`init` process is process #1
- started in last step of boot
- reads initscripts and executes more programs
Process relationships are stored in the `task_struct`
- `parent`-> parent's `task_struct`
- `children` -> list of children
Can navigate through tasks quite easily - see section for C samples
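A kernel-space sketch of that navigation, modeled on the book's samples (the fields are real `task_struct` members; not runnable outside the kernel):

```c
#include <linux/list.h>
#include <linux/sched.h>

static void walk_family(struct task_struct *task)
{
        struct task_struct *parent = task->parent;  /* up one level */
        struct task_struct *child;
        struct list_head *list;

        list_for_each(list, &task->children) {      /* over each child */
                child = list_entry(list, struct task_struct, sibling);
                /* child now points at one of task's children */
        }

        /* following parent pointers always terminates at init */
        for (; parent != &init_task; parent = parent->parent)
                ;
}
```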
Process creation
-----------------
fork()
- creates a child proc that is a copy of the current task
- almost an exact clone
- copy on write, each process gets its own copied page if written to
- only overhead is dup of parent's page table & creation of new process descriptor for the child
exec() (a family)
- loads a new executable into the address space
Forking
----------
most work done by `do_fork()` (see kernel/fork.c) which calls `copy_process()`
- PF_SUPERPRIV flag removed from child - no super user privileges
- Child runs first to avoid copy-on-write overhead if child is just `exec`ing
- vfork is a little strange, parent waits for child to exec or exit
- use seems to be for legacy reasons. Or if you dont want to spend time copying the page table
The Linux Implementation of Threads
-----------------------------
Shared memory space
Multiple execution capabilities
Enable 'concurrent programming' and 'parallelism'
Linux doesn't care; it schedules them like all procs
Differs from Windows or others whose processes are much heavier
Rather than 1 proc and 4 threads on Windows, linux has 4 procs w/ shared memory
clone() sys command
- passed flags indicating which resources are shared
See table here for clone flags and meanings
Kernel Threads
-----------
Processes run by the kernel in kernel-space
Schedulable and preemptible, just like normal processes
Do not have an address space
Things like 'flush' tasks and 'ksoftirqd' are delegated to kernel threads
See kernel threads with `ps -ef`
Can only be created by other kernel threads
Usually forked off the `kthreadd` kernel process
Process Termination
-------------------
By signal or call to `exit()` # see `kernel/exit.c`
Releases resources, eg. file objects whose usage count drops to 0
# TODO - what is a file descriptor
Calls `schedule` to switch to a new process
Removing the Process Descriptor
------------------------
After the parent has received info on terminated child, child's `task_struct` is deallocated
wait() and co
- use wait4()
- std behaviour is to suspend execution until child exits
`do_exit() -> exit_notify() -> forget_original_parent() -> find_new_reaper()` (assigns the orphaned processes a new parent)
`init` routinely calls wait() on its children, cleaning up any zombies assigned to it
Process Scheduling
==================
Fundamental decision: decide which process will run next given a set of runnable processes
Multitasking
-------------
Interleaving execution of multiple processes
Runs processes simultaneously on multi-core machines
Illusion of concurrency on single processor machines
Enables processes to block or sleep when work is unavailable
- can wait until some event (keyboard input, network data, passage of time)
- many processes in memory, but only one in a runnable state
Two flavors
- cooperative multitasking
- preemptive multitasking (most modern OSs)
- Scheduler decides when a process is to cease running and a new process is to begin running
- 'Preemption' - involuntarily suspending a running process
- Usually time a process runs is predetermined - called a 'timeslice'
- Linux's unique 'fair' scheduler does not employ timeslices _per se_
- coop multitasking - proc does not stop running until it voluntarily decides to do so (yielding)
- hung process can bring down the whole system
- scheduler cannot make global decisions about how long processes run
- processes can monopolize the processor for longer than the user desires
Linux's Process Scheduler
----------------------
O(1) scheduler
Constant time decisions on timeslice calculation and per-processor runqueues
Scaled well for large 'iron' systems with tens if not hundreds of processors
# TODO - define pathological, context: pathological failure
Latency-sensitive applications were not serviced well
Interactive processes - any application that users interact with
Rotating Staircase Deadline scheduler (most notable of several new algorithms)
- introduced `fair scheduling` (borrowed from queuing theory)
Policy
--------
Behavior of sched that determines what runs when
Can determine the 'feel' of a system
CFS - Completely fair scheduler
### I/O-bound vs processor-bound Procs
I/O - bound
Spends most of its time submitting and waiting on i/o requests
Usually runnable for only short durations as it eventually blocks waiting on more I/O
(io here means any type of blockable resource)
eg most GUI applications
Processor - bound
Spend much of their time executing code
Tend to run until preemption
Don't need to run often (fast system response not required)
Run less frequently but for longer durations
Some processes change behavior characteristics
Try to satisfy two conflicting goals
fast process response time (low latency)
maximal system utilization (high throughput)
Linux favors i/o-bound processes
### Process Priority
Rank processes based on their worth and need for processor time. Procs with higher priority run first and lower, later
Processes with the same priority run round-robin
Both user and system can set a process's priority to influence scheduling behavior
Two priority ranges
'nice' values - in the range (-20 to +19)
larger nice values are lower priority (they are nicer to other procs)
lower numbers are higher priority
linux controls proportional timeslice size, others (like OS X) have control over absolute timeslice
Can see nice values using `ps -el`
# Note - after looking, things like kernel workers get maximal priority, PulseAudio has high priority for a user-space proc (wants continuous play)
Real-time priority
By default range from 0 - 99
Higher # - higher priority
All real-time processes are at a higher priority than normal procs
Disjoint value spaces from nice value
See with `ps -eo state,uid,pid,ppid,rtprio,time,comm` under RTPRIO
Value of '-' means the proc is not real-time
# Note - after looking, only RT user-space thing was a wifibus?
Timeslice
-----------
How long a task resource can run before it is preempted
Too long => application doesn't feel interactive
Too short => lots of time wasted on context switching
I/O-bound procs do not need long timeslices
Processor-bound processes crave long timeslices to keep their caches hot
### On Linux
Procs get a proportion of processor time (determined by weight)
# NOTE - a weight is given. If things are to be added, they must be normalized because the processor use will always attempt to be 100% (1)
Nice value affects this performance
It is preemptive - so if a higher priority task becomes runnable
it will preempt the current process if it has not yet reached its
share of the cpu (as determined by its proportion)
### Eg
Text editor and video encoder
Want text editor to have large amount of time available to it,
not because it needs it, but because we want it to have time available
Want text editor to preempt the video encoder when it wakes up
Video encoding can take more than its allotted time, esp b/c the text editor is usually blocked on io
The key is that linux is giving processor time immediately when the text editor requires (because it has used less than its allotted 50%)
The Linux Scheduling Algorithm
-------------------------------
Has several scheduler classes
Scheduler is modular
Allows different algorithms to schedule different types of processes
Algorithms can coexist
Base scheduler
code is in `kernel/sched.c`
it iterates over each scheduler class in order of priority
the highest priority scheduler class that has a runnable process wins
Completely Fair Scheduler
SCHED_NORMAL in Linux
code in `kernel/sched_fair.c`
### Process Scheduling in Unix Systems
Unix exports priority to the user-space in the form of nice values
- leads to some problems
1. Mapping nice values onto timeslices requires a decision about what absolute timeslice to allot
Leads to sub-optimal switching behavior
When only 2 low priority processes run, each gets a very small timeslice. Many more context switches are required than if time were determined proportionally, rather than absolutely
2. Difference of 0 to 1 is much smaller than 18 to 19
Inverse processor time allotments
Background tasks (very nice tasks)
- usually need more computation time
- in unix, low priority gets smaller, not larger timeslices
Normal priority processes tend to be user tasks
- need shorter, more frequent time slices
Solution: make nice values geometric instead of additive
Relative nice values are off too
Proportion of nice values does not map well to the proportion of processor time.
Eg NV0 = 100ms, NV1 = 95ms # almost splitting 50/50
Eg NV18 = 10ms, NV19 = 5ms # first gets double the time
Incrementing or decrementing a nice value has wildly different effects depending on what value you start at
3. Absolute timeslices must be a thing the kernel can measure
In many OSs this means some multiple of the timer tick, which varies widely by machine
10ms on one might be 1ms on another
# TODO - reread this section after ch 11
Solution: Mapping nice values to timeslices using a measurement decoupled from the timer tick
4. Interactive process wake-up
Often, you want to give a proc a temporary boost in priority so it can run in spite of an expired time slice
Doing this might open up the OS to a program that tries to game the algorithm for unfair processor time
Real underlying problem is that
While it is nice to have a constant switching rate
Fairness becomes variable
Solution: do away with timeslices completely and assign processes a proportion of the processor
Results in a constant fairness but a variable switch rate
### Fair Scheduling
CFS is based on the concept: 'Model process scheduling as if the system had an ideal, perfectly multitasking processor'
Perfect Fairness and Perfect Multitasking
- Timeslices are infinitesimally small
- For _any_ span of time, each process would have run the exact same amount as any other process
- With 2 processes, this is like running 2 processes at the same time at 50% power
Unfortunately we can't actually multitask procs
And infinitely small timeslices are unreasonable
- b/c switching cost
- b/c cache penalty
# TODO - more often, think about the idea / continuous situation
CFS actually
- Calculates how long a process should run as a fn of the total number of runnable processes
- Runs procs round robin - selecting the proc that has run the least (is this conflicting? or do allotments of time keep round-robin actually in order)
- Nice values _weight_ the proportion of processor a process is to receive
- Each process runs for a 'timeslice' proportional to its weight / total weight of all runnable threads
- Actual timeslice calculation
- Set a target for its approximation of 'infinitely small' scheduling duration in perfect multitasking (known as _targeted latency_)
- Note, smaller targets yield better interactivity at the expense of higher switching costs (and worse throughput)
- TODO - need paper
- If targeted latency is 20 ms, with n tasks of the same priority, each task will run for 20 / n milliseconds
- since high n will result in high switching costs using this model, CFS imposes a floor on the timeslice assigned to a process
- floor is called _minimum granularity_ - default of 1ms
- so the system is not fair when n is high, because all procs will run for the minimum timeslice
- TODO - figure out how the targeted latency changes with high n. I think large n just means that the minimum viable targeted latency is larger than the one set
- Nice value imposes weight penalty on procs
eg 2 procs might split the 20ms targeted latency 15 and 5
CFS isn't perfectly fair, it just approximates perfect multitasking
It can place a bound on the unfairness, which is a function of n for n runnable procs
# TODO - how would you define latency to someone
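A toy user-space sketch of the proportional-timeslice arithmetic above (the weight numbers are invented for illustration; the real kernel derives weights from nice values via a lookup table):

```c
#include <stdio.h>

/* targeted latency split by weight, with a minimum-granularity floor */
static unsigned long slice_ms(unsigned long target_ms, unsigned long min_ms,
                              unsigned long weight, unsigned long total_weight)
{
        unsigned long slice = target_ms * weight / total_weight;
        return slice < min_ms ? min_ms : slice;
}

int main(void)
{
        /* two runnable tasks with a 3:1 weight ratio split a 20ms
           targeted latency 15/5, matching the example above */
        printf("%lu %lu\n", slice_ms(20, 1, 3, 4), slice_ms(20, 1, 1, 4));
        return 0;
}
```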
The Linux Scheduling Implementation
------------------------------------
Actual implementation lives in `kernel/sched_fair.c`
4 Components
- Time Accounting
- Process Selection
- The Scheduler Entry Point
- Sleeping and Waking Up
### Time Accounting
Most Unixes
Assign a timeslice
Decrement timeslice by the tick period each tick
On 0, preempt the proc
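Toy sketch of that tick-driven accounting (not CFS - CFS replaces this with `vruntime`, below):

```c
/* classic Unix-style accounting: one decrement per timer tick */
struct toy_task {
        int timeslice;     /* ticks remaining */
        int need_resched;  /* set when the proc should be preempted */
};

static void toy_scheduler_tick(struct toy_task *curr)
{
        if (--curr->timeslice <= 0)
                curr->need_resched = 1;
}
```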
#### The Scheduler Entity Structure
Used by CFS to keep track of process accounting
Embedded in the _process descriptor_ `task_struct`
member var `se`
#### The Virtual Runtime
`vruntime`
- stores the _virtual runtime_ of a proc
- time spent running normalized by the number of runnable procs
- unit: nanoseconds (decoupled from the timer tick)
- on a perfect multitasking system, all tasks of same priority would always have the same virtual runtime
- used to determine how long a proc has run and how much longer it ought to run
`update_curr(struct cfs_rq *cfs_rq)` manages the accounting
### Process Selection
Run the process with the lowest vruntime
CFS uses a red-black tree to manage list of runnable procs
Note - tree is not actually traversed in finding the left-most node b/c it is cached
### The Scheduler Entry Point
`schedule()`
- entry point for rest of the system
- in turn calls `deactivate_task()`, which removes the task from the runqueue
- generic wrt the scheduler classes
- finds the highest priority scheduler class with a runnable proc, asks it what to run next
- invokes pick_next_task()
- CFS is the scheduler for normal processes
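A paraphrased sketch of that loop (close to the book's excerpt of `pick_next_task()`, trimmed):

```c
/* iterate scheduler classes from highest priority down; the first
   class with a runnable process supplies the next task */
static struct task_struct *pick_next(struct rq *rq)
{
        const struct sched_class *class;
        struct task_struct *p;

        for_each_class(class) {
                p = class->pick_next_task(rq);
                if (p)
                        return p;
        }
        return NULL;   /* never reached: the idle class always has a task */
}
```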
### Sleeping and Waking Up
Special state that signals to the scheduler that a task does not wish to be run
When sleeping, a task is _always_ waiting for an event
Events include
- specified amount of time
- more data from a file I/O (common)
- some other hardware event
A task may be forced into sleep if it tries to obtain a contended semaphore
Two states associated with sleeping
`TASK_INTERRUPTIBLE`
- respond to signals
- _spurious wake up_ - a wake-up not caused by the occurrence of the event, so check and handle signals
`TASK_UNINTERRUPTIBLE` - ignores signals
both sit on a wait queue waiting for an event
#### Wait Queues
Can be declared statically or dynamically
Processes put themselves on wait queues and declare themselves not runnable
When the event associated with the wait queue occurs, the procs on the queue are awakened
Race conditions can arise when sleeping and waking are incorrectly implemented
eg. don't go to sleep after the condition becomes true
There is a prescribed approach to putting a process on the wait queue - use it should you ever need to (p 61)
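The prescribed idiom (p 61) looks roughly like this kernel-space sketch, where `q` is the wait queue head and `condition` is the event being waited on:

```c
DEFINE_WAIT(wait);

add_wait_queue(&q, &wait);              /* join the queue */
while (!condition) {                    /* re-check: avoids the lost-wakeup race */
        prepare_to_wait(&q, &wait, TASK_INTERRUPTIBLE);
        if (signal_pending(current))
                break;                  /* woken by a signal: handle it */
        schedule();                     /* sleep until woken */
}
finish_wait(&q, &wait);                 /* set TASK_RUNNING, leave the queue */
```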
#### Waking Up
`wake_up()`
- marks TASK_RUNNING, enqueues it
- sets need_resched if the awakened task's priority is higher than the priority of the current task
- code that causes the event to occur typically calls wake_up() on everything in its queue waiting for the data
When a task awakens, it is not safe to assume the event has occurred. It may have been a spurious wakeup - need to check the event
Preemption and Context Switching
-------------------------------
`context_switch()`
- switches from one runnable task to another
- called by schedule()
2 jobs
- switch the virtual memory mapping from that of the old to the new (via `switch_mm`)
- switch processor state from the previous proc's to the current's
- save and restore stack info / processor registers / arch specifics
- via `switch_to()`
Kernel must know when to call schedule
cant trust user-space programs to do it
`need_resched` flag used to signify needed reschedule
- by `scheduler_tick()`
- by `try_to_wake_up()` when the awakened proc has higher priority
has some accessor methods
user preemption - preemption when kernel is about to return to user-space
- from system call
- from interrupt handler
kernel preemption - no need to wait for kernel code running in user context to return, can reschedule any time it is safe to do so
Not safe
- when a lock is held (lock count is non-0)
- can do this because the kernel is SMP-safe
can occur
- when an interrupt handler exits, before returning to kernel-space
- when kernel code becomes preemptible again
- if a task in the kernel explicitly calls `schedule()`
- if a task in the kernel blocks (which results in a call to `schedule()`)
Real-Time Scheduling Policies
-------------------------
Two real-time scheduling policies
`SCHED_FIFO`
first in, first out - without timeslices
always scheduled over `SCHED_NORMAL`
runs until it explicitly blocks or yields
no timeslice, can run indefinitely
only higher priority real-time proc can preempt
two or more at the same priority run round-robin
`SCHED_RR`
same as above, except with a predetermined timeslice
(`SCHED_NORMAL` is non-realtime policy )
Managed by a special real-time scheduler in `kernel/sched_rt.c`
'soft real-time' - kernel tries to make timing deadlines but no promises
'hard real-time' - guarantees on the capability to schedule real-time tasks
# TODO - not sure what it means for nice values and rt values to share a priority space, and how we got a range of 100 to 139
Scheduler-Related System Calls
-------------------------------
Bunch of C library functions which call system calls, almost directly
can
set a proc's nice value
only root can set a negative nice value
set sched policy
get sched policy
set rt priority
get rt priority
get max rt pri
get min rt pri
get proc timeslice
set proc processor affinity
attempts soft affinity by default?
bit-map enforce hard processor affinity
get proc processor affinity
yield the processor
# Note, much of sys call code is checking arguments, setup, and cleanup
There is a load balancer that juggles procs around processors when necessary
`sched_yield()` - expires a proc so it won't run for a while, more so than just moving it to the end of the run array
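User-space sketch exercising a few of these (glibc wrappers; `_GNU_SOURCE` is needed for the affinity API):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        nice(5);                                  /* raise our nice value */

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        sched_setaffinity(0, sizeof(set), &set);  /* hard affinity: CPU 0 only */

        printf("policy = %d (0 == SCHED_OTHER)\n", sched_getscheduler(0));
        printf("rt prio max/min = %d/%d\n",
               sched_get_priority_max(SCHED_FIFO),
               sched_get_priority_min(SCHED_FIFO));

        sched_yield();                            /* give up the processor */
        return 0;
}
```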
Process and Thread Scheduling (Guest ch from diff book)
===============================
Scheduler
---------
composed of two separate modules
Process scheduling - decision-making policies to determine the order in which active processes should compete for the use of processors
Process dispatcher - binding of selected process to a processor (remove from run-queue, change status, load state)
implementation
might be called through syscalls, or
might be an anonymous process that polls
sometimes in a master/slave configuration
_quantum_ - predefined period of time the process is allowed to run
_priority_ - used to decide 'who next'
P = Priority(p) where P is a number and p is a process
P is a number
sometimes static, given p
p is a process
divides procs into _priority levels_
_ready list_ (RL), struct to hold levels
framework for scheduling
2 fundamental questions
when to schedule - when should the scheduler be active
who to schedule - what proc
governed by
_priority function_ - seen above
_arbitration rule_ - resolves ties
decision mode
when are priorities evaluated
non-preemptive
run until block or termination
preemptive
stop the presses, there is more important shit to print
_quantum-oriented_ - performed each quantum
when other proc priority > current proc
usually more costly than non-preemptive
priority function
uses diff attributes from proc and system
common attrs
service time so far
is real time?
deadline
periodicity
external priority
memory requirements
_job scheduling_
long term scheduling
choose among batch jobs
arrival might be arrival in the system
departure might be proc termination
_process scheduling_
select among currently ready processes to run next
arrival might be change to ready state
departure might be change to running or blocked
_attained service time_ - amount of time since arrival that the proc has used the CPU. (Often the most important parameter in scheduling)
_real time in system_ - total time in system (attained + waiting time)
attained and real time of a proc make it easy to compute ratio of runtime
total service time
attained service time at time of departure
some systems know this time in advance
usually true of repetitive service tasks
sometimes predicted based on proc's most recent behavior
procs that block frequently will have smaller total service times
deadline
point in real time by which a task must be completed
time-sensitive computation, eg weather forecast, video transmission
shorter deadline ~= higher priority
periodicity
repetition of a computation
sometimes used to model event driven processes
especially when events are polled
fixed period automatically imposes an implicit deadline
external priorities
explicitly assigned priorities
memory requirements
major scheduling criterion in batch operating systems
interactive systems, this is a good measure of swapping overhead
Arbitration Rule
----------------
Resolves conflicts btwn procs of equal priority
Could be random choice, could be round robin
Common Scheduling Algorithms
-------------------------
FIFO
arrival time is the only criterion
Shortest-Job-First
non-preemptive
based on total service time
priority is the inverse of total service time
Shortest-Remaining-Time
preemptive and dynamic version of SJF
P = -(total time - attained service time)
Round-Robin
uses fixed quantum
quantum might be different at different cpu loads
linux sets a floor for minimum quantum time
avoids lots of context switching at high load
Multi-Level Priority
Fixed set of priorities, 1 to n
Each process is assigned a priority e
P = e
e can be changed by authorized system calls
usually preemptive
at each level - RR if preemptive, else FIFO
Multi-Level Feedback
similar to MLP, but
processes migrate to lower levels as attained service time increases
each level gets a max attained time; if a process exceeds that value, its priority is decreased a level
There is a formula: P = n - floor(log2(a/T + 1)), where a is the attained service time
n - the number of priority levels
T - max time at highest level
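That formula transcribed into C, with a = attained service time (link with -lm; the sample values are invented):

```c
#include <math.h>
#include <stdio.h>

static int mlf_priority(int n, double a, double T)
{
        return n - (int)floor(log2(a / T + 1.0));
}

int main(void)
{
        /* 5 levels, 10ms max at the highest level: priority drops
           as attained service time a grows */
        for (double a = 0; a <= 80; a += 10)
                printf("a=%2.0fms -> P=%d\n", a, mlf_priority(5, a, 10));
        return 0;
}
```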
Rate Monotonic
Real time systems
procs and threads are often _periodic_
computation must be done before the start of the next service time
Preemptive policy that dispatches according to period length
the shorter the period, the higher the priority
Earliest-Deadline-First
preemptive and dynamic
highest priority assigned to the process with the smallest remaining time until its deadline
assumes all processes are periodic
assumes deadline is equal to the end of the current period
probably not great for humans? or, maybe is good for some humans
P = -(d - r % d)
r - real time the proc has spent in the system
d - period
r / d - number of completed periods
r % d - time already expired in the current period
(Table is pretty cool)
Comparison of Methods
---------------------
FIFO, SJF, SRT - developed primarily for batch processing systems
If total service times are known (or can be predicted)
SJF or SRT are usually preferable to FIFO
b/c both SJF and SRT favor shorter comp time over long
users want fast service for short tasks
total time required
independent of process order
avg turnaround time
can be changed with diff process orders
one of the more important metrics used to compare sched algos
# NOTE - it seems like it would be easy to show change in process attributes for different scheduling algorithms
# probably in a visually appealing way
_response time_
important in time share systems
if important
implies a preemptive algorithm
good algos
round robin
responsiveness depends on q
q -> inf, rr becomes fifo
q -> 0, rr is dominated by context switching
multi-level feedback
MLF
IO bound procs tend to remain high priority
batch jobs tend towards the bottom of the list
Example Schedulers
----------------
VAX/VMS
15 priorities for real-time, 15 for regulars
Realtime uses multi-level scheme (without feedback)
Regular processes
have a standard base priority
adjusted with other events in the system
eg. read completion will ++ a proc priority
reaching the end of runtime will -- proc priority
Windows 2000
similar to above except threads are scheduled rather than procs
# NOTE - scheduling threads could use the process (tgid struct) to keep track of the attained time - or some aggregate time
Minix
Multi-level queues
_task_ (kernel stuffs),
_server_, (memory, file, network management)
and _user_ levels
RR
Linux
no queues of threads/procs
sort by _goodness_
base and current goodness is tied to the quantum
each thread has its own variable-length quantum
like a countdown time
each thread
gets a base quantum (bq)
gets a remaining quantum (rq)
current goodness (g)
if (rq == 0) g = 0 # alloted quantum has been consumed
if (rq > 0) g = bq + rq
epochs
division of scheduler time
end when no threads can run
all threads have either exhausted their quanta or are blocked
reset with formula `rq = bq + (rq / 2)`
each thread gets its base quantum plus half of its remaining
naturally favors I/O - bound threads
realtime threads
always executed before normal threads
can be scheduled RR or FIFO
helps to know if deadline will be met _before_ execution
_schedule_ - an assignment of procs to processors
_feasible schedule_ - iff all deadlines are satisfied
_optimal scheduling method_ - resulting schedule meets all deadlines whenever a feasible schedule exists
periodic processes
each process i utilizes t_i/d_i of the CPU time
t_i is the total service time for process i for one period
d_i is the period length of process i
overall utilization is
U = sum_{i=1..n} (t_i / d_i)
schedule is feasible if U <= 1
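A tiny sketch of that feasibility test (the task set is made up):

```c
#include <stdio.h>

/* U = sum(t_i / d_i); the schedule is feasible iff U <= 1 */
static int feasible(const double t[], const double d[], int n)
{
        double U = 0.0;
        for (int i = 0; i < n; i++)
                U += t[i] / d[i];
        return U <= 1.0;
}

int main(void)
{
        double t[] = { 1.0, 2.0, 3.0 };   /* service time per period */
        double d[] = { 4.0, 8.0, 12.0 };  /* period lengths */
        printf("feasible? %s\n", feasible(t, d, 3) ? "yes" : "no");
        return 0;
}
```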
Rate Monotonic
will find a feasible schedule for U <~ 0.7
therefore, it is not optimal
Earliest-Deadline-First
will always find the optimal solution if there is one
Priority Inversion
-----------------------
Higher priority procs or threads can be delayed or blocked by lower priority ones
possible solutions
make critical sections non-preemptable
works well if critical sections are short and few
execute critical sections at the highest priority of any process that _might_ use them
suffers from overkill
leads to other forms of priority inversion
dynamic priority inheritance
when lower priority proc blocks a higher one,
the lower proc is assigned a priority = to the higher until
it exits the critical section.
allows for nested critical sections
blocking times bounded by critical section execution times
can be a compulsory requirement in real-time operating systems
POSIX focus
user can associate a _ceiling attribute_ with a mutex semaphore
integer specified
when mutex is activated, priority is automatically raised to the value of the ceiling
threads that use this mutex will always run at the same level and will not preempt each other
priority inheritance
Windows 2000 focus
every second, scan all ready threads
any that have been idle for more than 3 seconds
boost priority to max
extend quantum by 2x
no thread will be postponed indefinitely due to higher priority threads
Multiprocessor and Distributed Scheduling
====================
Two principal approaches to scheduling procs and threads on multicomputers
Use a single scheduler
all CPUs are in same resource pool
any proc can be run on any processor
Use multiple schedulers
machines are scheduled separately
often a choice for non-uniform multiprocessors
different CPUs have different characteristics and functions
eg. special machines for I/O, filing, data acquisition, fast Fourier transforms, matrix computations
users and apps tend to have greater control over performance
Case: _the Mach Scheduler_
each machine set has a global ready list of schedulable threads
dispatching
select highest priority thread from the local queue
when empty, select highest priority thread from global list
execute idle thread when both lists are empty
Case: _Windows 2000_
affinity scheduling
goal: keep a thread running on the same processor (if possible)
make use of caching
each thread is associated with two integers
ideal processor
last processor
when thread is ready
prefer running on
1. the ideal processor
2. the last processor
3. the processor currently running the scheduler
4. any other processor
idle processor looks for proc to run
1. last thread to run on the processor
2. the thread for which this is the ideal processor
3. then whatever the top priority thread is
Distributed systems
normally have separate schedulers at each site (cluster)
processes usually preallocated to clusters
communication costs time
costs message traffic
applications have threads that utilize a shared address space
distributed shared memory and migrating threads
practical ideas
realization is still slow
often have functional distribution
leads to different schedulers
a node might specialize
focus on email - cause node is an exit node
printing (node has access to printer), etc
RTB requires being really fricken close to exchanges, need to fill request fast
Rate monotonic - preemptive policy that dispatches according to period length
shorter the period, the higher the priority
Earliest-Deadline-First (EDF) - preemptive and dynamic scheme
usually designed for real-time applications
Neither of the above two are optimal for multiprocessors
often satisfactory heuristics though
feasible schedules produced, when they exist and when machines are not under heavy load
preallocation of procs is most common because of the complexity of trying to meet real-time constraints
Linux Scheduler
- time complexity depends on number of priorities (queues)
- time complexity does not depend on the number of processes
Calculating Timeslices
Quantum = (140 - static priority) * 20 if SP < 120
Quantum = (140 - static priority) * 5 if SP >= 120
static priority - think nice & real-time stuffs
note higher priority process get _longer_ quanta
important process should run longer
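The quantum formula as C (note the parenthesization - the whole (140 - SP) difference gets scaled):

```c
/* O(1)-scheduler base timeslice: nice -20 (SP 100) -> 800ms,
   nice 0 (SP 120) -> 100ms, nice 19 (SP 139) -> 5ms */
static int base_quantum_ms(int static_prio)
{
        return (140 - static_prio) * (static_prio < 120 ? 20 : 5);
}
```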
dynamic priority
calculate from static priority and avg sleep time
record time sleeping (up to a max value)
when the process is running, decrease value each timer tick
[0, 10] - rough estimate of % time sleeping recently
0 hurts priority by 5
5 is neutral
10 helps priority by 5
DP = max(100, min(SP-bonus + 5, 139))
_interactive_ - if bonus - 5 >= SP/4 - 28
low-priority processes have a hard time becoming interactive
default priority process becomes interactive when its sleep time is greater than 700ms
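And the dynamic-priority clamp as C (bonus in [0, 10], 5 neutral, per the notes above):

```c
/* DP = max(100, min(SP - bonus + 5, 139)) */
static int dynamic_prio(int static_prio, int bonus)
{
        int dp = static_prio - bonus + 5;
        if (dp < 100)
                dp = 100;
        if (dp > 139)
                dp = 139;
        return dp;
}
```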
Using Quanta
each tick decreases the quantum of _current_
process gets preempted when its quantum hits 0
if non-interactive -> send to expired list
if interactive -> send to end of current queue (scheduled again)
but running will decrease bonus (along with priority)
Avoiding Indefinite Overtaking
remember those 140 queues (_active_ and _expired_)
only run procs from active queues (put on expired when quanta == 0)
swap active and expired queues after running through all of them
pointers to current arrays
two arrays
- avoids the need for traditional aging
- why is aging bad? -> it's O(n) at each clock tick
- ergo, too costly for general-purpose process scheduling
Many OSs iterate over procs to update priorities
Linux is more efficient
- procs are touched on start and on stop only
- recalculate priorities, bonuses, quanta, and interactive status
- never loop over all procs!
Locking Runqueues
Sometimes, for queue rebalancing, the kernel locks queues
in order to move procs
Real-time scheduling
soft real-time scheduling
static priority [0, 99] are real-time
FCFS & RR
Sleeping and Waking
- proc needs to wait for events
- proc added to wait queue
- wakeups can happen too soon - need to check args and maybe re-sleep
- you dont wake a process
- you DO wake a wait queue
System calls
`nice()` - lower a process' static priority
`getpriority()/setpriority()` - change priorities of a process group
`sched_getscheduler()/sched_setscheduler()`
set scheduling policy and params
and others that start with `sched_` (see with man -k)
`scheduler_tick()`
called each timer tick to update quanta
`try_to_wake_up()`
attempt to wake a proc
put on run queue
rebalance loads
`recalc_task_prio()`
update average sleep time and dynamic priority
`schedule()`
pick the next process to run
`rebalance_tick()`
check if load-balancing is needed
Fair Share Scheduling
would need to
add new scheduler type
Timers and Time Management
=====================
Two basic functions
time of day - a service provided to apps
interval timers - something should happen in n ms from now
_Real-Time clock_
tracks time and date
can interrupt at a certain rate
can interrupt at a certain time
only used by Linux for time of day
_programmable interval timer_
generates periodic interrupts
usually based on HZ
some other special, less common timers
Things executed on timer ticks
keep track of time
invoke dynamic timer routines
call `scheduler_tick()`
system uses best timer available
`jiffies`
incremented each tick
roughly the number of ticks since sys boot (HZ per second)
32-bit counter - wraps around in ~50 days (at HZ=1000)
Time of Day
stored in xtime (`struct timespec`)
incremented once per tick
apparent tick rate can be adjusted slightly see `adjtimex()`
Kernel Timers
Dynamic timers
- call a routine after a particular interval
- specify an interval, a subroutine to call, and a param to pass the subroutine
- the param differentiates different instantiations of the same timer
- timers can be cancelled
Delay loops
- tight spin loops for very short delays (microseconds or nanoseconds)
- nothing else can use CPU during that time (except via interrupt)
- for when dynamic timers are too much overhead
System Calls
`time()` and `gettimeofday()`
`adjtimex()`
tweaks apparent clock rate (even the best crystals drift)
`setitimer()` and `alarm()`
interval timers for applications
user-mode interval timers
similar to kernel dynamic timers
System Calls
=====================
Allow user-space apps to
- access hardware
- create new procs
- communicate with existing ones
- request os resources
Layer between user programs and hardware
3 purposes
- abstracts hardware eg. reading/writing files doesn't care what media you are on
- security / stability
- virtualized system provided to the process
Without controlled access of sys resources
- couldn't multitask
- wouldn't be able to have virtual memory
Legal entry points to the kernel
- sys calls
- traps
- exceptions
Even device files, or files in /proc, are accessed via system calls
APIs, POSIX, and the C Library
-------------------------------
User-space programs are usually written against an API implemented in user-space, not directly against sys calls
Can create different APIs from building blocks provided by system calls
POSIX - series of standards from the IEEE (eye-triple-E)
Linux strives to be 'POSIX' and 'SUSv3'-compliant
A very common api is based on the POSIX standard
A meme related to interfaces in Unix is 'Provide mechanism, not policy'
Syscalls
---------
Typically accessed by function calls in the C library
System calls also provide a return value of type `long` that signifies success or error
usually a negative value denotes an error
usually 0 is a sign of success
`errno` - holds a special error code of the last error
`perror()` - used to print human-readable error code
tgid - the thread-group id. Equal to the pid for normal processes; for threads, it is the same across all threads of the same group
`asmlinkage`
- only look on the stack for the fns arguments
- required for all syscalls
in kernel, all syscalls are prefixed with `sys_`
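For example, the book shows `getpid()` implemented roughly like this:

```c
asmlinkage long sys_getpid(void)
{
        return task_tgid_vnr(current);  /* returns current->tgid */
}
```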
### System Call Numbers
Unique # used to reference a specific system call
User-space process identifies calls by #, not name
#, once assigned, cannot change - else compiled apps will break
### System Call Performance
Faster than many other os's
Cause
- fast context switching
- streamlined enter/exit
- simple call handler
System Call Handler
-------------------
Kernel fns exist in a protected memory space. User-space programs cannot call them directly
Method - use an interrupt: incur an exception and the system will switch to kernel mode, executing the exception handler (in this case, the syscall handler)
# TODO - difference between throwing an exception and 'trapping into the kernel'
Some processors have a special sysenter feature
### Denoting the Correct System Call
Architecture dependent, on x86, user-space app sets the eax register before causing the trap into the kernel
### Parameter passing
Sent to the kernel on registers `ebx, ecx, edx, esi, and edi`
Return value is also sent via register
on x86, the `eax` register is used
System Call Implementation
----------------------------
Must first define its purpose - exactly one
What are the new system call's arguments, return value, err codes
Should have a clean and simple interface
As small a # of args as possible
Remember, semantics and behaviour are important, they must not change
- Consider how the function use might change over time
- Can your method be changed while maintaining compatibility?
- Will bugs be fixable?
- Are you needlessly limiting the fn in some way?
- Purpose remains the same, but uses may change
- Don't make architectural changes
### Verifying the Params
Thorough checking is in order - security is imperative
Check valid and legal, but also correct
Cant trust user-space procs
Always check validity of pointers!!!
- must point to user-space memory
- must point to memory in the process's address space
- if reading, memory must be marked readable
- if writing, memory must be marked writable
- if executing, memory must be marked executable
there are helper fns to do this
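The helpers are `copy_from_user()` and `copy_to_user()`; both check the pointer and copy in one step, returning the number of bytes NOT copied (nonzero means fault). The book's silly_copy example goes roughly like this:

```c
/* kernel-space sketch: copy a word from one user pointer to another */
asmlinkage long sys_silly_copy(unsigned long __user *src,
                               unsigned long __user *dst)
{
        unsigned long buf;

        if (copy_from_user(&buf, src, sizeof(buf)))
                return -EFAULT;
        if (copy_to_user(dst, &buf, sizeof(buf)))
                return -EFAULT;
        return sizeof(buf);
}
```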
System Call Context
-----------------
In proc context
- Kernel can sleep
Makes available most kernel functionality
- is fully preemptible
Must make sure call is reentrant
otherwise trouble occurs when another higher priority proc makes same syscall in the middle of a lower prior proc
### Final Steps in Binding a System Call
Add entry to the end of the syscall table
- must be done for each arch
For each architecture, define the call number in `asm/unistd.h`
Compile the syscall into the kernel image (as opposed to compiling as a module)
# Note - interrupt handlers cannot sleep, and are therefore much more limited in what they can do than syscalls running in proc context
Since new sys calls will not be in the C libs, you can wrap access with macros. These handle setting up register contents and issuing the trap instructions. `_syscalln()` where n is between 0 and 6 (# of params)
_see this section for use of your new syscall_
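On a current system the simplest way to invoke a call with no libc wrapper is `syscall(2)` (here with the real gettid number; a brand-new call would use whatever number you were assigned):

```c
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        long tid = syscall(SYS_gettid);  /* raw syscall by number */
        printf("tid = %ld\n", tid);
        return 0;
}
```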
### Option of a New Syscall
#### Pros
Simple to implement, easy to use
Fast performance on Linux
#### Cons
Requires a syscall number, which needs to be officially assigned to you
Written in stone, interface cannot change
Must be explicitly supported by each architecture
Not easily used from scripts
Cannot be accessed directly from filesystem
Hard to maintain outside of the master kernel tree
Overkill for simple info exchanges
#### Alternatives
Implement a device node and `read()` and `write()` to it
use `ioctl()` to manipulate specific settings or retrieve specific info
Certain interfaces, such as semaphores, can be represented as file descriptors and manipulated as such
Add the information as a file to appropriate location in sysfs
# TODO - what is sysfs?
# Note: you can get a proc's processor affinity in linux with `sched_getaffinity`. In a group context, a warm cache of the processor is relevant. If someone is already familiar with some code (a warm cache), you probably want them to continue rather than ramping up someone else.
# Note: for humans: loading something onto disk takes time (learning something for the first time), disk to memory is faster (recall of past), recent tasks is faster still
Other
------
# note - preempt notifiers - let something know that a process has been preempted
# TODO what is - group leader
# TODO what is - slacktime
Kernel Data Structures
========================
Dont roll your own
Not inclusive, just the basics in this book
- linked lists
- queues
- maps
- binary trees
Linked Lists
-----------------
Allows storage of variable number of elements
Elements created at different times
- must be variable in length
- may occupy discontiguous regions in mem
Singly linked, doubly linked, or circularly linked
Linux Kernel LL is unique, but pretty much a circular, doubly linked list
### Moving Through a Linked List
Linear movement
Bad at
Random access
Good at
Dynamic addition / removal O(1) ftw
Iterating
### In Linux Kernel
Rather than including pointers in the data struct, include a pointer to a LL node instead
`linux/list.h`
Reference to a list_head in the data struct allows several macros to be used, like `list_add()` or `container_of()`, which take the list_head as an argument
Since the offset of the node pointer in a data struct is set at compile time, it is easy to get the containing structure of the list node
Presumably `const typeof( ... )` gets the type of the struct in a generic way, so `container_of()` doesn't need to know anything about it
Be sure it is embedded in the data struct, otherwise the macros aren't helpful
Need to init the list head on creation
Also, there are some safe(r) methods
Also, also, you may need to synchronize for exclusive list access
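Kernel-space sketch of the embedding pattern, using the book's fox example (not runnable outside the kernel):

```c
#include <linux/list.h>

struct fox {
        unsigned long tail_length;
        struct list_head list;    /* node embedded in the data struct */
};

static LIST_HEAD(fox_list);       /* statically initialized list head */

static void add_and_walk(struct fox *f)
{
        struct fox *cur;

        list_add(&f->list, &fox_list);   /* O(1) insert after the head */
        list_for_each_entry(cur, &fox_list, list) {
                /* cur is each containing struct fox in turn, recovered
                   from its embedded list_head via container_of() */
        }
}
```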
Queues
-----------------
Producers and consumers
see `kernel/kfifo.c` and `linux/kfifo.h`
Queues are one page-sized by default
Queues leave it up to the user to say
- How many bytes to copy in
- How many bytes to copy out
If a queue is nearly full, only a partial add will be done
If a queue is nearly empty, only a partial read will be done
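Kernel-space sketch of the kfifo API as the book presents it:

```c
#include <linux/kfifo.h>

static struct kfifo fifo;

static int fifo_demo(void)
{
        unsigned int i, val;

        if (kfifo_alloc(&fifo, PAGE_SIZE, GFP_KERNEL))  /* one page, the default size */
                return -ENOMEM;

        for (i = 0; i < 32; i++)
                kfifo_in(&fifo, &i, sizeof(i));         /* enqueue: copy bytes in */

        while (kfifo_out(&fifo, &val, sizeof(val)))     /* dequeue: copy bytes out */
                ;                                       /* val: 0, 1, 2, ... */

        kfifo_free(&fifo);
        return 0;
}
```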
Maps
------------
Aka associative array
Keys to values
Hashtables are a type of map - but not all maps are hashtables
Trees can be used as well (self-balancing)
Hash
- better average-case complexity
Trees
- better worst-case complexity
- can also maintain a sorted order
- can use any type of key, so long as it can define the <= operator
### In Linux
Map is used to map a UID to a pointer
Will also generate a UID for you
`idr` maps user-space UIDs to their kernel data structures
Need to preallocate a UID before getting a new one, cause memory and locking are hard
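Sketch of that two-step allocation (the older idr API, as in the book):

```c
#include <linux/idr.h>

static DEFINE_IDR(my_idr);

static int map_ptr(void *ptr)
{
        int id, ret;

        do {
                if (!idr_pre_get(&my_idr, GFP_KERNEL))  /* preallocate */
                        return -ENOSPC;
                ret = idr_get_new(&my_idr, ptr, &id);   /* UID -> ptr */
        } while (ret == -EAGAIN);

        return id;
}
```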
Binary Trees
----------------
Math says it is an 'acyclic, connected, directed graph' where each node has 0 or more outgoing edges
Binary tree is a tree with no more than 2 outgoing edges per node
_binary search tree (BST)_ - specific ordering on nodes
left of root is only nodes < root
right of root is only nodes > root
all subtrees are also binary search trees
Pros
fast search
fast ordered traversal
### Self-balancing BST
Depth of all leaves differs by at most 1
In other words, height is managed
#### Red-Black Tree
type of self-balancing bst
binary tree of choice in linux
6 properties are enforced
Ensures that the deepest leaf has a depth of no more than double that of the shallowest leaf
Tree is always semi-balanced
      B
     / \
    R   B
   / \ / \
  B nil nil B
It does this by rotating shit that's unbalanced, but don't worry about it unless you are an undergrad compsci student, which I'm not (anymore)
`rbtree` - `lib/rbtree.c` and `linux/rbtree.h`
inserts are always logarithmic
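Kernel-space sketch of an rbtree lookup, modeled on the book's page-cache search example (my_node/key are illustrative):

```c
#include <linux/rbtree.h>

struct my_node {
        unsigned long key;
        struct rb_node node;   /* embedded, like list_head */
};

static struct my_node *my_search(struct rb_root *root, unsigned long key)
{
        struct rb_node *n = root->rb_node;

        while (n) {
                struct my_node *cur = rb_entry(n, struct my_node, node);

                if (key < cur->key)
                        n = n->rb_left;    /* descend left */
                else if (key > cur->key)
                        n = n->rb_right;   /* descend right */
                else
                        return cur;        /* found */
        }
        return NULL;
}
```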
What Data Structure to Use, When
--------------------------------
What is primary access method?
LL
if iterating
when performance is not important
when interfacing with other LL consuming code
Queues
if producer / consumer pattern is used
if you know your buffer can hold your data
Maps
if UID to obj is needed
Rb tree
if storing lots of data is required
fast searching, with efficient in-order traversal
in-memory footprint is not bad
Radix tree
Algorithmic Complexity
----------------------
What is the _asymptotic behavior_ of the algorithm
Big-oh - upper bound
Big-theta - both an upper and lower bound
usually when big-O is dropped, they are really talking about the least such big-O (the big-theta)
Time Complexity - See AoA - Colby CS
Interrupts and Interrupt Handlers
================================
Communicating with hardware is a core responsibility
Aint got no time to wait for hardware
Rather than polling these devices, the responsibility is on the HW to interrupt the processor when it wants attention
Polling - check periodically
Interrupt - send a signal to the processor
Interrupt handlers - fn to handle interrupt
HW device interrupts
async w.r.t. processor clock
aggregated by a multiplexing interrupt controller
directly connected to the CPU
have a unique value - for differentiation
aka interrupt request (IRQ) lines
IRQ 0 on a classic PC is the timer interrupt
IRQ 1 on a classic PC is the keyboard
Not always dynamically assigned
PCI bus dynamically generates values
Exceptions
- synchronous ~interrupts~
- generated by the processor
- from software
Interrupts
- asynchronous interrupts
- generated by hardware
Interrupt Handlers
------------------
aka interrupt service routine (ISR)
An associated handler for every device
Function
One per device
part of the _driver_
normal C functions
run in _interrupt context_
aka _atomic context_
code cannot block
must run quickly
acknowledge hardware, at a minimum
Top Halves vs Bottom Halves
---------------------------
To balance large amount of work with response speed
### Top Half
- only the time critical stuff
### Bottom Half
- run at more convenient time
Network card example
- card has very little memory
- don't want buffer overruns
Registering an Interrupt Handler
--------------------------------
`request_irq()` in `linux/interrupt.h`
One driver for every device
One interrupt handler per driver
`request_irq()`
params
int irq
the interrupt number
usually dynamically allocated
irq_handler_t handler
pointer to actual handler fn
unsigned long flags
bitmask of one or more of the flags
IRQF_DISABLED
usually poor form to use
handler fn blocks other interrupts from occurring
IRQF_SAMPLE_RANDOM
use as source of entropy
helps with random number generation
obv. don't set for predictable device interrupts
IRQF_TIMER
this handler processes interrupts for the system timer
IRQF_SHARED
interrupt line can be shared among multiple interrupt handlers
const char *name
ascii text name of device
used by `/proc/irq` and `/proc/interrupts`
void *dev
used when lines are shared
provide a unique cookie for later identification on removal
the way the kernel tells which handler to remove for shared lines
common to use the driver's device structure
can sleep
so no calling in interrupt context
cause somewhere `kmalloc()` is called
careful to initialize hardware fully before registering handler
`void free_irq(unsigned int irq, void *dev)`
disconnects line iff (not shared || last registered)
must be called from process context
Unless your handler is old and crusty
good chance you will need to share
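Registration, in the style of the book's RTC example:

```c
/* request the line; fail the driver load if we cannot get it */
if (request_irq(rtc_irq, rtc_interrupt, IRQF_SHARED, "rtc",
                (void *)&rtc_port)) {
        printk(KERN_ERR "rtc: cannot register IRQ %d\n", rtc_irq);
        return -EIO;
}
```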
Writing an Interrupt Handler
-----------------------
`static irqreturn_t intr_handler(int irq, void *dev)`
params
int irq
line #
void *dev
same as passed to `request_irq()`
typically the `device` structure
b/c unique to each device
potentially useful to have within the handler
return
IRQ_NONE
handler's device was not the originator
IRQ_HANDLED
handler correctly invoked
IRQ_RETVAL(val)
macro for returning one of the above two based on val
used to determine if device is issuing _spurious interrupts_
spurious interrupts - unrequested
Interrupt handlers in Linux need not be reentrant
Processors do not receive interrupts on the same line while executing a handler for that line
Avoids dealing with nested interrupts
Code can be preempted, though, by interrupts on other lines
### Shared Handlers
- IRQF_SHARED flag must be set
- dev arg must be unique
- Interrupt handler must be able to distinguish what device actually generated an interrupt
hardware capability must exist
Invoked sequentially
Handler should quickly exit if its device was not the generator of the interrupt
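A sketch of that check (device, register, and flag names here are invented):

```c
static irqreturn_t my_handler(int irq, void *dev)
{
        struct my_device *d = dev;  /* the cookie passed to request_irq() */

        /* shared line: bail out fast if our device didn't interrupt */
        if (!(readl(d->status_reg) & MY_IRQ_PENDING))
                return IRQ_NONE;

        /* acknowledge the hardware, grab time-critical data ... */
        return IRQ_HANDLED;
}
```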
Real-time clock (RTC)
`drivers/char/rtc.c`
separate from the system timer
sets the system clock
provides an alarm
supplies a periodic timer
driver's init code registers the handler
Interrupt Context
-----------------
Not associated with a running process
`current` macro is not relevant
cannot sleep
other functions are also not relevant
Time-critical
can busy-wait, but don't
push as much as possible into the bottom half
Stack location varies
Handler should not care what stack setup is in use
Always use the absolute minimum amount of stack space
Processor jumps to a unique location for each interrupt line
executes code at that location
### `/proc/interrupts`
stats related to interrupts
line #
# of times executed
handling controller (eg `XT-PIC`)
name
_procfs_
a virtual file system
in kernel memory
mounted at `/proc`
reading / writing files
invokes kernel functions
seems like actually reading/writing files (but are just fns)
`XT-PIC`
std PC programmable interrupt controller
Interrupt Control
-----------------
Can enable/disable
a line for the whole machine
aka _masking out_
not used as often now since, increasingly, lines are shared
the interrupt system on a processor
`local_irq_disable()` && `local_irq_enable()`
or safer: `local_irq_save(flags)` && `local_irq_restore(flags)`
Reasons
need synchronization (disable interrupt preemption)
kernel will not be preempted
Does not prevent concurrent access from another processor
need a lock for that
No `cli()` (global disable interrupts)
means all synchronization must be specific and narrow
results in faster performance
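The save/restore form in use - safe even if interrupts were already off on entry:

```c
unsigned long flags;

local_irq_save(flags);     /* disable local interrupts, remember prior state */
/* ... touch data shared with an interrupt handler ... */
local_irq_restore(flags);  /* restore, rather than unconditionally enable */
```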
### Status of the Interrupt System
`in_interrupt()`
- nonzero if kernel is executing a handler
- nonzero if kernel is executing a bottom half
- if 0, kernel is in process context
`in_irq()`
- nonzero if kernel is executing a handler
Bottom Halves and Deferring Work
==========================
Top half is not enough
Why the TH needs to be fast
- TH interrupts all the code
- TH disables at least a line, if not all HW comms
- Timing critical, eg. small HW buffers
- Cannot block
There is a need to do less critical work later.
Bottom Halves
----------------
Do most work here. All but the critical.
You should probably process data copied in the TH, though.
Tips for dividing work:
perform in handler
- time sensitive
- HW related
- ensure another interrupt doesn't happen during the work
perform in bottom half
- all other
### Several Mechanisms
First approach - a static set of handlers
Second approach - task queues that a driver could append to
Third approach
Softirqs
- statically defined - compile-time
- can run simultaneously on any processor
- two of the same can run concurrently
(require more care)
Tasklets
- dynamically defined
- can run simultaneously on any processor
- two of the same can _NOT_ run concurrently
- build on softirqs
Task Queue replaced with Work Queue
- queue work for later
- runs jobs in process context
(kernel timers)
- after specific time, rather than 'any time but now'
Softirqs
------------
Rarely used directly. Tasklets are a much more common form.
`kernel/softirq.c`
Usually reserved for the most timing-critical / important bottom-half processing
Currently, only used by networking and block subsystems. But also, tasklets and kernel timers are built on top.
Needed for scalability. If you do not need to scale to infinitely many processors, consider a tasklet
Used for high frequency or highly threaded use.
Represented by the `softirq_action` structure (see `linux/interrupt.h`)
Each is registered in a 32-entry array
only nine exist now
A softirq never preempts another softirq
But it can run on a different processor
Only interrupt handlers can preempt
Must mark the softirq before it will run
aka _raising the softirq_
usually done by the interrupt handler
check for pending occurs
in return from HW interrupt code path
in the `ksoftirqd` kernel thread
any code w/ explicit checks
eg. networking subsystem
### Use
index
must be assigned (`linux/interrupt.h`)
used for priority - lower executes before higher
handler must be registered at run-time
`open_softirq()`
eg `open_softirq(NET_TX_SOFTIRQ, net_tx_action)`
To avoid the need to lock, try to use only per-processor data
After installed and init'd
can be raised using `raise_softirq(NET_TX_SOFTIRQ)`
Tasklets
-----------
Called by softirq logic
`HI_SOFTIRQ` - higher priority tasklets
`TASKLET_SOFTIRQ` - regular
Represented by the tasklet structure
`struct tasklet_struct`
`unsigned long state`
0, `TASKLET_STATE_SCHED`, or `TASKLET_STATE_RUN`
`count`
tasklet can only run if count is zero (nonzero = disabled)
`func`
called the same way `action` is in softirqs
### Scheduling tasklets
Multiplexed on two softirqs
_scheduled tasklets_
- equivalent of raised softirqs
- stored in two per-processor linked lists
`tasklet_vec` - regular tasklets
`tasklet_hi_vec` - high priority
- cannot be scheduled if already scheduled
System disables interrupts while tasklets are being sched'd
Raises the `TASKLET_SOFTIRQ` or `HI_SOFTIRQ` which will call
`tasklet_action()` and `tasklet_hi_action()` respectively
Loop over each pending tasklet in the retrieved list
if already running on another processor
skip it
else
mark it as being run, and run it
### Using Tasklets
#### Declare your tasklet
statically or dynamically
statically - use macros
use `DECLARE_TASKLET`
or `DECLARE_TASKLET_DISABLED` (count is initially 1, not 0)
both create `tasklet_struct`
dynamically create `tasklet_struct` and call `tasklet_init`
#### Write your handler
`tasklet_handler` passed an unsigned long data
Writing
Cannot sleep
no semaphores
no blocking functions
Interrupts are enabled
careful if you share data with one
#### Scheduling Your Tasklet
Call `tasklet_schedule` with a pointer to your `tasklet_struct`
If already scheduled, it only runs once
If already running (eg on another processor), tasklet is rescheduled
As an optimization, a tasklet always runs on the processor that scheduled it (hopefully making better use of cache)
Need to enable the tasklet if it is created disabled
Disabled tasklets are just skipped
You can kill a pending tasklet with `tasklet_kill` (sleeps)
Helpful when a tasklet frequently reschedules itself
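The whole lifecycle in one kernel-space sketch (handler and names invented):

```c
#include <linux/interrupt.h>

static void my_tasklet_handler(unsigned long data)
{
        /* softirq (atomic) context: no sleeping in here */
}

/* statically declared, enabled (count = 0), data = 0 */
DECLARE_TASKLET(my_tasklet, my_tasklet_handler, 0);

/* typically called from the top-half interrupt handler */
static void kick(void)
{
        tasklet_schedule(&my_tasklet);
}
```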
#### ksoftirqd
Per processor kernel threads
Help process softirqs and tasklets
Grew out of problem with heavy interrupt load
esp. with softirqs that rescheduled themselves
procs would starve for processor time
but softirqs need to be processed
Possible solution 1
Perform rescheduled softirqs before returning to procs
Might starve out processes under high load
User space needs to be run periodically
Possible solution 2
Dont handle reactivated softirqs
must wait for the next time softirqs are processed
doesn't take advantage of an idle system
may starve softirqs
Actual Solution
Don't immediately process reactivated softirqs
if softirqs are excessive, wake up some special kernel threads
special kernel threads
run at lowest priority
one per processor
named `ksoftirqd/n`
loop over pending softirqs
sleep to yield to other procs
cooperatively nice, too
#### The Old BH Mechanism
Old BH
Max 32
Must be statically defined
inaccessible to modules
Dynamically defined handlers must piggyback off an already defined one
Only one handler can execute at a time
(strictly serialized)
Did not scale well to multiple processors
network layer, in particular, suffered
Much like tasklets in other regards
Work Queues
-----------------
Defer work into a kernel thread
called _worker threads_
default ones named `events/n`
one per processor
must have a strong reason not to use default
eg. processor-intense
performance-critical
prevent starving out other default work
Always run in process context
Can sleep - allows
allocation of lots of memory
semaphores to be used
perform block I/O
Decision - if you need to sleep, use a work queue
otherwise, tasklet
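Sketch of deferring to the default `events/n` worker threads with the work-queue API the book describes:

```c
#include <linux/workqueue.h>

static void my_work_fn(struct work_struct *work)
{
        /* process context: may sleep, take semaphores, do block I/O */
}

static DECLARE_WORK(my_work, my_work_fn);

/* queue onto the default worker thread for this processor */
static void defer(void)
{
        schedule_work(&my_work);
}
```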
Alternatively - one could spin up a new kernel thread
but this is frowned upon
or punishable..
### Data Structures
#### Representing the Threads
`workqueue_struct`
one per _type_ of worker thread
interface / abstraction exposed to the rest of the system
holds a `cpu_workqueue_struct` per processor
`cpu_workqueue_struct`
#### Representing the Work