Created
March 27, 2014 11:56
willscripted/9805963
Because sometimes in business, your goals are tied to fiscal quarters
Linux - elitist attitude - only considers patches for real-world problems. Marketing bullets or one-off requests, such as pageable kernel memory, have received no consideration
Kernel Development | |
Introduction | |
============== | |
History | |
----------- | |
Unix - Elegant OS | |
(historically) | |
- only a few hundred system calls | |
- almost everything is a file | |
- written in C (portable) | |
- fast process creation time | |
- unique sys call 'fork' | |
(today) | |
- preemptive multitasking | |
- multi-threading | |
- virtual memory | |
- demand paging | |
- shared libraries w/ demand loading | |
- TCP/IP networking | |
- Many variants scale to hundreds of processors | |
- Some run on small embedded devices | |
Linux | |
- Born of frustration with Unix not being open source
- Implements the Unix API as defined by POSIX and the Single Unix Specification
- Not a direct descendant of Unix
- Modular design | |
- Can preempt itself (kernel preemption) | |
- Kernel threads | |
- Dynamic binary loading in Kernel image (kernel modules) | |
- Does not differentiate between threads and normal processes
- all procs are the same, some just share resources | |
- Hot pluggable events ? | |
- TODO - look into why linux ignores some common Unix features that kernel developers consider poorly designed, such as STREAMS, or standards that are impossible to cleanly implement (p 8) | |
System Components Usually Include | |
- Kernel | |
- C lib | |
- toolchain | |
- basic sys utils (eg, login process and shell) | |
OS Includes (Basic use and administration) | |
- Kernel (Manages hardware, distributes system resources) | |
- Interrupt handlers
- Scheduler - to share processor time btwn procs | |
- Memory Management system | |
- System services (eg. networking / inter-proc comms) | |
- Device drivers | |
- Boot loader | |
- Command shell / UI | |
- Basic file system utilities | |
Kernel | |
- Above user programs | |
- Privileged access (kernel space vs user space)
- C library provides higher level functions (although not always that different)
- Clibs make system calls
- Not all C-lib functions use system calls (eg `strcpy`)
- applications get work done through system calls | |
- in one of three states (always, even corner cases) | |
- User-space, executing process code | |
- Kernel-space, process context, executing system calls | |
- Kernel-space, interrupt context, handling interrupts
- Usually a monolithic static binary | |
- Usually require a system with a paged memory-management unit (MMU) | |
- enables memory protection | |
- each proc gets uniq addr space | |
- Monolithic servers (Linux) | |
- Simpler design | |
- Microkernels (Parts of OS X and Windows NT) | |
- Several 'server' processes that communicate via IPC (inter-proc com) | |
- Can swap out server instances | |
Interrupts
- stop processor directly | |
- in turn, stops the kernel | |
- interrupts have numbers associated w/ them (hardware) | |
- handlers process and respond | |
- do not run in process context | |
Getting Started w/ the Kernel | |
============================= | |
[Directories] - see comprehensive lists | |
usr - early user-space code (initramfs) ? | |
Built with flags | |
Binary flags | |
Tristates - don't build, build as module, or build into the kernel image
Configure flags with | |
make config (takes long time) | |
make menuconfig (ncurses) | |
make gconfig (gtk+) | |
or just edit `.config` | |
Build with | |
`make` | |
`make -jn` multi-proc build, where n is the # of jobs
Installing | |
Architecture and boot-loader dependent - see their docs | |
Modules are architecture independent | |
`make modules_install` | |
Warnings | |
Kernel lacks memory protection afforded to user-space | |
Clibs not available - too heavy / slow
see instead `linux/things.h` | |
also - inotify? | |
printk rather than printf | |
inline functions for shit that needs to be fast | |
inline functions usually declared in header files | |
`static inline void wolf(unsigned long tail_size)` | |
# TODO - what is asm() assembly compiler directive. | |
Usually only used in parts of the kernel unique to a specific architecture, because assembly is arch-specific and Linux aims to be cross-platform
Small per-process fixed-size stack | |
Dereferencing a NULL pointer (illegal memory access) | |
- Major kernel error | |
- No `SIGSEGV` errors | |
Kernel memory is not pageable. Every byte you eat is one less byte of available physical memory
# TODO - what does it mean to 'catch a trap' | |
Dont do floating point calcs | |
Kernel has small stack | |
Symmetrical multiprocessing (SMP) | |
Kernel must synchronize between scheduled tasks | |
Spinlocks and Semaphores usually used to avoid race conditions | |
Portability - handful of rules | |
Segregate system-specific code to appropriate dirs | |
Remain endian neutral | |
Be 64-bit clean, do not assume word or page size
Async interrupts, pre-emption, and SMP mean synchronization and concurrency are major concerns
3. Process Management | |
==================== | |
Process | |
- executing program code (the 'text section') | |
- Open files | |
- pending signals | |
- internal kernel data | |
- processor state | |
- memory address space (with one or more memory mappings) | |
- one or more threads of execution | |
- a data section containing global vars
Tasks kept in doubly linked list | |
Internally to the kernel known as a task | |
`task_struct` | |
- all the data needed by the kernel about a process | |
- allocated by the slab allocator | |
- the kernel stack holds the `thread_info` | |
- `thread_info` has a pointer to the `task_struct` | |
- there is a maximum # of procs, 32k on old machines (cause small ints)
- will wrap around | |
- can be as high as 4m set in /proc/sys/kernel/pid_max | |
- `current` macro to get the `task_struct` of the currently running proc | |
- open files | |
- proc's addr space | |
- pending signals | |
- proc state | |
TODO - revisit 'slab allocator' & thread_info | |
Threads | |
- objects of activity within a proc | |
includes | |
- unique program counter | |
- process stack | |
- set of processor registers
- the kernel schedules threads, not processes | |
- Linux does not differentiate between threads and procs | |
threads are just a special kind of proc | |
Procs | |
- procs provide two virtualizations
- virtual processor - proc thinks it is alone on the cpu | |
- virtual memory - as if it were alone in its use of memory | |
- Threads share virtual memory but not virtual processor | |
- a program is not a process | |
must also include related resources | |
Important system calls | |
fork() | |
- creates a new proc | |
- returns 2x, once in the parent, once in the child | |
exec() - function family | |
- creates a new addrs space and loads a new program | |
exit() | |
- terminates the process and frees all resources | |
- puts proc into zombie state | |
wait4() | |
- parent inquires about the status of a terminated child | |
wait() | |
- parent can resolve zombie state | |
waitpid() | |
- parent can resolve zombie state | |
Process State | |
------------- | |
Condition of the Proc (5 flags) | |
kernel function `set_task_state(task, state);` # see `linux/sched.h`
TASK_RUNNING | |
- either currently running or on a run queue | |
- only possible state for proc running in user-space | |
TASK_INTERRUPTIBLE | |
- sleeping, eg waiting for some condition | |
TASK_UNINTERRUPTIBLE | |
- sleeping, does not wake up and become runnable on signal | |
__TASK_TRACED | |
- trace, eg. by ptrace | |
__TASK_STOPPED | |
- not running, cannot run | |
- eg SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU received
Process Context | |
------------------ | |
Program code runs in user-space | |
System calls or exception triggering | |
- will enter kernel-space | |
- are the only interface to the kernel | |
- kernel is 'executing on behalf of the process' | |
- kernel is in 'process context' | |
The Process Family Tree | |
-------------------- | |
`init` process is process #1 | |
- started in last step of boot | |
- reads initscripts and executes more programs | |
Process relationships are stored in the `task_struct` | |
- `parent`-> parent's `task_struct` | |
- `children` -> list of children | |
Can navigate through tasks quite easily - see section for C samples | |
Process creation | |
----------------- | |
fork() | |
- creates a child proc that is a copy of the current task | |
- almost an exact clone | |
- copy on write, each process gets its own copied page if written to | |
- only overhead is dup of parent's page table & creation of new process descriptor for the child | |
exec() (a family) | |
- loads a new executable into the address space | |
Forking | |
---------- | |
most work done by `do_fork()` (see kernel/fork.c) which calls `copy_process()` | |
- PF_SUPERPRIV flag removed from child - no super user privileges | |
- Child runs first to avoid copy-on-write overhead if child is just `exec`ing | |
- vfork is a little strange, parent waits for child to exec or exit | |
- use seems to be for legacy reasons, or if you don't want to spend time copying the page tables
The Linux Implementation of Threads | |
----------------------------- | |
Shared memory space | |
Multiple execution capabilities | |
Enable 'concurrent programming' and 'parallelism' | |
Linux don't care, it scheds them like all procs
Differs from Windows or others whose processes are much heavier | |
Rather than 1 proc and 4 threads on Windows, linux has 4 procs w/ shared memory | |
clone() sys command | |
- passed flags indicating which resources are shared | |
See table here for clone flags and meanings | |
Kernel Threads | |
----------- | |
Processes run by the kernel in kernel-space | |
Schedulable and preemptable, just like normal processes
Do not have an address space | |
Things like 'flush' tasks and 'ksoftirqd' are delegated to kernel threads | |
See kernel threads with `ps -ef` | |
Can only be created by other kernel threads | |
Usually forked off the `kthreadd` kernel process
Process Termination
------------------- | |
By signal or call to `exit()` # see `kernel/exit.c` | |
Releases resources (eg file objects) whose usage counts drop to 0
# TODO - what is a file descriptor | |
Calls `schedule` to switch to a new process | |
Removing the Process Descriptor
------------------------ | |
After the parent has received info on terminated child, child's `task_struct` is deallocated | |
wait() and co | |
- use wait4() | |
- std behaviour is to suspend execution until child exits | |
`do_exit() -> exit_notify() -> forget_original_parent() -> find_new_reaper()` (assigns the orphaned processes a new parent) | |
`init` routinely calls wait() on its children, cleaning up any zombies assigned to it | |
Process Scheduling | |
================== | |
Fundamental decision: decide which process will run next given a set of runnable processes | |
Multitasking | |
------------- | |
Interleaving execution of multiple processes | |
Runs processes simultaneously on multi-core machines
Illusion of concurrency on single processor machines | |
Enables processes to block or sleep when work is unavailable | |
- can wait until some event (keyboard input, network data, passage of time) | |
- many processes in memory, but only one in a runnable state | |
Two flavors | |
- cooperative multitasking | |
- preemptive multitasking (most modern OSs) | |
- Scheduler decides when a process is to cease running and a new process is to begin running | |
- 'Preemption' - involuntarily suspending a running process | |
- Usually time a process runs is predetermined - called a 'timeslice' | |
- Linux's unique 'fair' scheduler does not employ timeslices _per se_ | |
- coop multitasking - proc does not stop running until it voluntarily decides to do so (yielding)
- hung process can bring down the whole system
- scheduler cannot make global decisions about how long processes run
- processes can monopolize the processor for longer than the user desires
Linux's Process Scheduler | |
---------------------- | |
O(1) scheduler | |
Constant time decisions on timeslice calculation and per-processor runqueues | |
Scaled well for large 'iron' systems with tens if not hundreds of processors | |
# TODO - define pathological, context: pathological failure | |
Latency-sensitive applications were not serviced well
Interactive processes - any application that users interact with | |
Rotating Staircase Deadline scheduler (most notable of several new algorithms) | |
- introduced `fair scheduling` (borrowed from queuing theory) | |
Policy | |
-------- | |
Behavior of sched that determines what runs when | |
Can determine the 'feel' of a system | |
CFS - Completely fair scheduler | |
### I/O-bound vs processor-bound Procs | |
I/O - bound | |
Spends most of its time submitting and waiting on i/o requests | |
Usually runnable for only short durations as it eventually blocks waiting on more i/o
(io here means any type of blockable resource) | |
eg most GUI applications | |
Processor - bound | |
Spend much of their time executing code | |
Tend to run until preemption | |
Don't need to run often (fast system response not required)
Run less frequently but for longer durations | |
Some processes change behavior characteristics | |
Try to satisfy two conflicting goals | |
fast process response time (low latency) | |
maximal system utilization (high throughput) | |
Linux favors i/o-bound processes | |
### Process Priority | |
Rank processes based on their worth and need for processor time. Procs with higher priority run first and lower, later | |
Processes with the same priority run round-robin | |
Both user and system can set a process's priority to influence scheduling behavior | |
Two priority ranges | |
'nice' values - in the range -20 to +19
larger nice values are lower priority (they are nicer to other procs)
lower numbers are higher priority
linux controls proportional timeslice size; others (like OS X) control absolute timeslice
Can see nice values using `ps -el` | |
# Note - after looking, things like kernel workers get maximal priority, PulseAudio has high priority for a user-space proc (want continuous play)
Real-time priority | |
By default range from 0 to 99
Higher # - higher priority | |
All real-time processes are at a higher priority than normal procs | |
Disjoint value spaces from nice value | |
See with `ps -eo state,uid,pid,ppid,rtprio,time,comm` under RTPRIO | |
Value of '-' means the proc is not real-time | |
# Note - after looking, only RT user-space thing was a wifibus? | |
Timeslice | |
----------- | |
How long a task can run before it is preempted
Too long => application doesn't feel interactive
Too short => lots of time wasted on context switching
I/O-bound procs do not need long timeslices
Processor-bound processes crave long timeslices to keep their caches hot | |
### On Linux | |
Procs get a proportion of processor time (determined by weight) | |
# NOTE - a weight is given. If things are to be added, they must be normalized because the processor use will always attempt to be 100% (1) | |
Nice value weights this proportion
It is preemptive - so if a higher priority task becomes runnable,
it will preempt the current process if it has not yet reached its
share of the cpu (as determined by its proportion)
### Eg | |
Text editor and video encoder | |
Want text editor to have a large amount of time available to it,
not because it needs it, but because we want it to have time available
Want text editor to preempt the video encoder when it wakes up | |
Video encoding can take more than its allotted time, esp b/c the text editor is usually blocked on io | |
The key is that linux gives the text editor processor time immediately when it requires it (because it has used less than its allotted 50%)
The Linux Scheduling Algorithm | |
------------------------------- | |
Has several scheduler classes | |
Scheduler is modular | |
Allows different algorithms to schedule different types of processes | |
Algorithms can coexist | |
Base scheduler | |
code is in `kernel/sched.c` | |
it iterates over each scheduler class in order of priority | |
the highest priority scheduler class that has a runnable process wins | |
Completely Fair Scheduler | |
SCHED_NORMAL in Linux | |
code in `kernel/sched_fair.c` | |
### Process Scheduling in Unix Systems | |
Unix exports priority to the user-space in the form of nice values | |
- leads to some problems
1. Mapping nice values onto timeslices requires a decision about what absolute timeslice to allot | |
Leads to sub-optimal switching behavior | |
When only 2 low priority processes run, each gets a very small timeslice. Many more context switches are required than if time were determined proportionally, rather than absolutely
2. Difference of 0 to 1 is much smaller than 18 to 19 | |
Inverse processor time allotments | |
Background tasks (very nice tasks) | |
- usually need more computation time | |
- in unix, low priority gets smaller, not larger timeslices | |
Normal priority processes tend to be user tasks | |
- need shorter, more frequent time slices | |
Solution: make nice values geometric instead of additive | |
Relative nice values are off too | |
Proportion of nice values does not map well to the proportion of processor time. | |
Eg NV0 = 100ms, NV1 = 95ms # almost a 50/50 split
Eg NV18 = 10ms, NV19 = 5ms # first gets double the time | |
Incrementing or decrementing a nice value has wildly different effects depending on what value you start at | |
3. Absolute timeslices must be a thing the kernel can measure | |
In many OSs this means some multiple of the timer tick, which varies widely by machine
10ms on one might be 1ms on another | |
# TODO - reread this section after ch 11 | |
Solution: Mapping nice values to timeslices using a measurement decoupled from the timer tick | |
4. Interactive process wake-up | |
Often, you want to give a proc a temporary boost in priority so it can run in spite of an expired time slice | |
Doing this might open up the OS to a program that tries to game the algorithm for unfair processor time | |
Real underlying problem is that | |
While it is nice to have a constant switching rate
Fairness becomes variable | |
Solution: do away with timeslices completely and assign processes a proportion of the processor | |
Results in a constant fairness but a variable switch rate | |
### Fair Scheduling | |
CFS is based on the concept: 'Model process scheduling as if the system had an ideal, perfectly multitasking processor' | |
Perfect Fairness and Perfect Multitasking | |
- Timeslices are infinitesimally small
- For _any_ span of time, each process would have run the exact same amount as any other process
- With 2 processes, this is like running 2 processes at the same time at 50% power | |
Unfortunately we can't actually multitask procs like this
And infinitely small timeslices are unreasonable | |
- b/c switching cost | |
- b/c cache penalty | |
# TODO - more often, think about the idea / continuous situation | |
CFS actually | |
- Calculates how long a process should run as a fn of the total number of runnable processes | |
- Runs procs round robin - selecting the process that has run the least (is this conflicting? or do allotments of time keep round-robin actually in order)
- Nice values _weight_ the proportion of processor a process is to receive | |
- Each process runs for a 'timeslice' proportional to its weight / total weight of all runnable threads | |
- Actual timeslice calculation | |
- Set a target for its approximation of 'infinitely small' scheduling duration in perfect multitasking (known as _targeted latency_) | |
- Note, smaller targets yield better interactivity at the expense of higher switching costs (and worse throughput) | |
- TODO - need paper | |
- If targeted latency is 20 ms, with n tasks of the same priority, each task will run for 20 / n milliseconds
- since high n will result in high switching costs using this model, CFS imposes a floor on the timeslice assigned to a process | |
- floor is called _minimum granularity_ - default of 1ms | |
- so system is not fair when n is high, because all procs will run for the minimum timeslice | |
- TODO - figure out how the targeted latency changes with high n. I think large n just means the minimum viable targeted latency is larger than the one set
- Nice value imposes weight penalty on procs | |
eg 2 procs might split the 20ms targeted latency 15 and 5 | |
CFS isn't perfectly fair, it just approximates perfect multitasking
It can place a lower bound on the unfairness, which scales with n for n runnable procs
# TODO - how would you define latency to someone | |
The Linux Scheduling Implementation | |
------------------------------------ | |
Actual implementation lives in `kernel/sched_fair.c` | |
4 Components | |
- Time Accounting | |
- Process Selection | |
- The Scheduler Entry Point | |
- Sleeping and Waking Up | |
### Time Accounting | |
Most unix | |
Assign a timeslice | |
Decrement timeslice by the tick period each tick | |
On 0, preempt the proc | |
#### The Scheduler Entity Structure | |
Used by CFS to keep track of process accounting | |
Embedded in the _process descriptor_ `task_struct` | |
member var `se` | |
#### The Virtual Runtime | |
`vruntime` | |
- stores the _virtual runtime_ of a proc | |
- time spent running normalized by the number of runnable procs | |
- unit: nanoseconds (decoupled from the timer tick) | |
- on a perfect multitasking system, all tasks of same priority would always have the same virtual runtime | |
- used to determine how long a proc has run and how much longer it ought to run | |
`update_curr(struct cfs_rq *cfs_rq)` manages the accounting
### Process Selection | |
Run the process with the lowest vruntime | |
CFS uses a red-black tree to manage list of runnable procs | |
Note - tree is not actually traversed in finding the left-most node b/c it is cached | |
### The Scheduler Entry Point | |
`schedule()` | |
- entry point for rest of the system | |
- in turn calls `deactivate_task()`, which removes a sleeping task from the runqueue
- generic wrt the scheduler classes | |
- finds the highest priority scheduler class with a runnable proc, asks it what to run next
- invokes pick_next_task() | |
- CFS is the scheduler for normal processes | |
### Sleeping and Waking Up | |
Special state that signals to the scheduler that a task does not wish to be run | |
When sleeping, a task is _always_ waiting for an event | |
Events include | |
- specified amount of time | |
- more data from a file I/O (common) | |
- some other hardware event | |
A task may be forced into sleep if it tries to obtain a contended semaphore
Two states associated with sleeping | |
`TASK_INTERRUPTIBLE` | |
- respond to signals | |
- _spurious wake up_ - a wake-up not caused by the occurrence of the event, so check the condition and handle signals
`TASK_UNINTERRUPTIBLE` - ignores signals | |
both sit on a wait queue waiting for an event | |
#### Wait Queues | |
Can be declared statically or dynamically | |
Processes put themselves on wait queues and declare themselves not runnable | |
When the event associated with the wait queue occurs, the procs on the queue are awakened | |
Race conditions can arise when sleeping and waking are incorrectly implemented | |
eg. don't go to sleep after the condition becomes true
There is a prescribed approach to putting a process on the wait queue - use it should you ever need to (p 61) | |
#### Waking Up | |
`wake_up()` | |
- marks TASK_RUNNING, enqueues it | |
- sets need_resched if the awakened task's priority is higher than the priority of the current task | |
- code that causes the event to occur typically calls wake_up() on everything in its queue waiting for the data
When a task awakens, it is not safe to assume the event has occurred. It may have been a spurious wakeup - need to check the event
Preemption and Context Switching | |
------------------------------- | |
`context_switch()` | |
- switches from one runnable task to another | |
- called by schedule() | |
2 jobs | |
- switch the virtual memory mapping from that of the old to the new (via `switch_mm`) | |
- switch processor state from the previous proc's to the current's | |
- save and restore stack info / processor registers / arch specifics | |
- via `switch_to()` | |
Kernel must know when to call schedule()
can't trust user-space programs to do it
`need_resched` flag used to signify needed reschedule | |
- by `scheduler_tick()` | |
- by `try_to_wake_up()` when the awakened proc has higher priority
has some accessor methods | |
user preemption - preemption when kernel is about to return to user-space | |
- from system call | |
- from interrupt handler | |
kernel preemption - no need to wait for kernel code running in user context to return, can reschedule any time it is safe to do so | |
Not safe | |
- when a lock is held (lock count is non-0) | |
- can do this because the kernel is SMP-safe | |
can occur | |
- when an interrupt handler exits, before returning to kernel-space | |
- when kernel code becomes preemptible again
- if a task in the kernel explicitly calls `schedule()`
- if a task in the kernel blocks (which results in a call to `schedule()`)
Real-Time Scheduling Policies | |
------------------------- | |
Two real-time scheduling policies | |
`SCHED_FIFO` | |
first in, first out - without timeslices | |
always scheduled over `SCHED_NORMAL` | |
runs until it explicitly blocks or yields | |
no timeslice, can run indefinitely | |
only higher priority real-time proc can preempt | |
two or more at the same priority run round-robin | |
`SCHED_RR` | |
same as above, except with a predetermined timeslice | |
(`SCHED_NORMAL` is non-realtime policy ) | |
Managed by a special real-time scheduler in `kernel/sched_rt.c`
'soft real-time' - kernel tries to make timing deadlines but no promises | |
'hard real-time' - guarantees on the capability to schedule real-time tasks | |
# TODO - not sure what it means for nice values and rt values to share a priority space, and how we got a range of 100 to 139 | |
Scheduler-Related System Calls | |
------------------------------- | |
Bunch of C library functions which call system calls, almost directly | |
can | |
set a procs nice value | |
only root can set a negative nice value | |
set sched policy | |
get sched policy | |
set rt priority | |
get rt priority | |
get max rt pri | |
get min rt pri | |
get proc timeslice | |
set proc processor affinity | |
attempts soft affinity by default? | |
bit-map enforce hard processor affinity | |
get proc processor affinity | |
yield the processor
# Note, much of sys call code is checking arguments, setup, and cleanup | |
There is a load balancer that juggles procs around processors when necessary | |
`sched_yield()` - expires a proc so it won't run for a while, more so than just moving it to the end of the run array
Process and Thread Scheduling (Guest ch from diff book) | |
=============================== | |
Scheduler | |
--------- | |
composed of two separate modules | |
Process scheduling - decision-making policies to determine the order in which active processes should compete for the use of processors | |
Process dispatcher - binding of selected process to a processor (remove from run-queue, change status, load state) | |
implementation | |
might be called through syscalls, or | |
might be an anonymous process that polls | |
sometimes in a master/slave configuration | |
_quantum_ - predefined period of time the process is allowed to run
_priority_ - used to decide 'who next' | |
P = Priority(p), where P is a number and p is a process
P is sometimes static, given p
divides procs into _priority levels_ | |
_ready list_ (RL), struct to hold levels | |
framework for scheduling | |
2 fundamental questions | |
when to schedule - when should the scheduler be active | |
who to schedule - what proc | |
governed by | |
_priority function_ - seen above | |
_arbitration rule_ - breaks ties
decision mode | |
when are priorities evaluated | |
non-preemptive | |
run until block or termination | |
preemptive | |
stop the presses, there is more important shit to print | |
_quantum-oriented_ - performed each quantum | |
when other proc priority > current proc | |
usually more costly than non-preemptive
priority function
uses diff attributes from proc and system | |
common attrs | |
service time so far | |
is real time? | |
deadline | |
periodicity | |
external priority | |
memory requirements | |
_job scheduling_ | |
long term scheduling | |
choose among batch jobs | |
arrival might be arrival in the system | |
departure might be proc termination | |
_process scheduling_ | |
select among currently ready processes to run next | |
arrival might be change to ready state | |
departure might be change to running or blocked | |
_attained service time_ - amount of time since arrival the proc used the CPU. (Often the most important parameter in scheduling) | |
_real time in system_ - total time in system (attained + waiting time) | |
attained and real time of a proc make it easy to compute ratio of runtime | |
total service time | |
attained service time at time of departure | |
some systems know this time in advance | |
usually true of repetitive service tasks | |
sometimes predicted based on proc's most recent behavior | |
procs that block frequently will have smaller total service times | |
deadline | |
point in real time by which a task must be completed | |
time-sensitive computation, eg weather forecast, video transmission
shorter deadline ~= higher priority | |
periodicity | |
repetition of a computation | |
sometimes used to model event driven processes | |
especially when events are polled | |
fixed period automatically imposes an implicit deadline | |
external priorities | |
explicitly assigned priorities | |
memory requirements | |
major scheduling criterion in batch operating systems | |
interactive systems, this is a good measure of swapping overhead | |
Arbitration Rule
---------------- | |
Resolves conflicts btwn procs of equal priority | |
Could be random choice, could be round robin
Common Scheduling Algorithms
------------------------- | |
FIFO | |
arrival time is only criteria | |
Shortest-Job-First | |
non-preemptive | |
based on total service time | |
priority is the inverse of total service time
Shortest-Remaining-Time | |
preemptive and dynamic version of SJF | |
P = -(total time - attained service time) | |
Round-Robin | |
uses fixed quantum | |
quantum might be different at different cpu loads | |
linux sets a floor for minimum quantum time | |
avoids lots of context switching at high load | |
Multi-Level Priority | |
Fixed set of priorities, 1 to n | |
Each process is assigned a priority e | |
P = e | |
e can be changed by authorized system calls | |
usually preemptive | |
at each level - RR if preemptive, else FIFO | |
Multi-Level Feedback | |
similar to MLP, but | |
processes migrate to lower levels as attained service time increases | |
each level gets a tP max, if a process exceeds that value, its priority is decreased a level | |
There is a formula: P = n - floor(log2(a/T + 1))
n - the number of priority levels
a - attained service time
T - max time at the highest level
Rate Monotonic | |
Real time systems | |
procs and threads are often _periodic_ | |
computation must be done before the start of the next service time | |
Preemptive policy that dispatches according to period length | |
the shorter the period, the higher the priority | |
Earliest-Deadline-First | |
preemptive and dynamic | |
highest priority assigned to the process with the smallest remaining time until its deadline | |
assumes all processes are periodic | |
assumes deadline is equal to the end of the current period | |
probably not great for humans? or, maybe is good for some humans | |
P = -(d - r%d) | |
r - time proc entered the system | |
d - period | |
r / d - number of completed periods | |
r % d - time already expired in the current period
(Table is pretty cool) | |
Comparison of Methods | |
--------------------- | |
FIFO, SJF, SRT - developed primarily for batch processing systems
If total service times are known (or can be predicted)
SJF or SRT are usually preferable to FIFO
b/c both SJF and SRT favor shorter comp time over long
users want fast service for short tasks | |
total time required | |
independent of process order | |
avg turnaround time | |
can be changed with diff process orders | |
one of the more important metrics used to compare sched algos | |
# NOTE - it seems like it would be easy to show change in process attributes for different scheduling algorithms | |
# probably in a visually appealing way | |
_response time_ | |
important in time share systems | |
if important | |
implies a preemptive algorithm | |
good algos | |
round robin | |
responsiveness depends on q | |
q -> inf, rr becomes fifo | |
q -> 0, rr is dominated by context switching | |
multi-level feedback | |
MLF | |
IO bound procs tend to remain high priority | |
batch jobs tend towards the bottom of the list | |
Example Schedulers | |
---------------- | |
VAX/VMS | |
15 priorities for real-time, 15 for regulars | |
Realtime uses multi-level scheme (without feedback) | |
Regular processes
have standard priorities
adjusted with other events in the system | |
eg. read completion will ++ a proc priority | |
reaching the end of runtime will -- proc priority | |
Windows 2000 | |
similar to above except threads are scheduled rather than procs | |
# NOTE - scheduling threads could use the process (tgid struct) to keep track of the attained time - or some aggregate time | |
Minix | |
Multi-level queues | |
_task_ (kernel stuffs), | |
_server_, (memory, file, network management) | |
and _user_ levels | |
RR | |
Linux | |
no queues of threads/procs | |
sort by _goodness_ | |
base and current goodness is tied to the quantum | |
each thread has its own variable-length quantum | |
like a countdown timer
each thread | |
gets a base quantum (bq) | |
gets a remaining quantum (rq) | |
current goodness (g) | |
if (rq == 0) g = 0 # alloted quantum has been consumed | |
if (rq > 0) g = bq + rq | |
epochs | |
division of scheduler time | |
end when no threads can run | |
all threads have either 'exhausted quanta' or are 'blocked'
reset with formula `rq = bq + (rq / 2)` | |
each thread gets its base quantum plus half of its remaining | |
naturally favors I/O - bound threads | |
realtime threads | |
always executed before normal threads | |
can be scheduled RR or FIFO
helps to know if deadline will be met _before_ execution | |
_schedule_ - an assignment of procs to processors
_feasible schedule_ - iff all deadlines are satisfied | |
_optimal scheduling method_ - resulting schedule meets all deadlines whenever a feasible schedule exists | |
periodic processes | |
utilize ti/di of the cpu time | |
ti is the total service time for process i for one period | |
di is the period length of process i | |
overall utilization is | |
U = sum(i=1..n) ti/di
schedule is feasible if U <= 1 | |
Rate Monotonic | |
will find optimal for U <~ .7 | |
therefore, it is not optimal | |
Earliest-Deadline-First | |
will always find the optimal solution if there is one
Priority Inversion | |
----------------------- | |
Higher priority procs or threads can be delayed or blocked by lower priority ones | |
possible solutions | |
make critical sections non-preemptable | |
works well if critical sections are short and few | |
execute critical sections at the highest priority of any process that _might_ use them | |
suffers from overkill | |
leads to other forms of priority inversion | |
dynamic priority inheritance | |
when lower priority proc blocks a higher one, | |
the lower proc is assigned a priority = to the higher until | |
it exits the critical section. | |
allows for nested critical sections | |
blocking times bounded by critical section execution times
can be a compulsory requirement in real-time operating systems
POSIX focus | |
user can associate a _ceiling attribute_ with a mutex semaphore | |
integer specified | |
when mutex is activated, priority is automatically raised to the value of the ceiling | |
threads that use this mutex will always run at the same level and will not preempt each other | |
priority inheritance | |
Windows 2000 focus | |
every second, scan all ready threads | |
any that have been idle for more than 3 seconds | |
boost priority to max
extend quantum by 2x | |
no thread will be postponed indefinitely due to higher priority threads | |
Multiprocessor and Distributed Scheduling | |
==================== | |
Two principal approaches to scheduling procs and threads on multicomputers | |
Use a single scheduler | |
all CPUs are in same resource pool | |
any proc can be run on any processor | |
Use multiple schedulers | |
machines are scheduled separately | |
often a choice for non-uniform multiprocessors | |
different CPUs have different characteristics and functions | |
eg. special machines for I/O, filing, data acquisition, fast Fourier transforms, matrix computations
users and apps tend to have greater control over performance | |
Case: _the Mach Scheduler_ | |
each machine set has a global ready list of schedulable threads | |
dispatching | |
select highest priority thread from the local queue | |
when empty, select highest priority thread from global list | |
execute idle thread when both lists are empty | |
Case: _Windows 2000_ | |
affinity scheduling | |
goal: keep a thread running on the same processor (if possible) | |
make use of caching | |
each thread is associated with two integers | |
ideal processor | |
last processor | |
when thread is ready | |
prefer running on | |
1. the ideal processor | |
2. the last processor | |
3. the processor currently running the scheduler
4. any other processor | |
idle processor looks for proc to run | |
1. last thread to run on the processor
2. the thread for which this is the ideal processor
3. then whatever the top priority thread is
Distributed systems | |
normally have separate schedulers at each site (cluster) | |
processes usually preallocated to clusters | |
communication costs time | |
costs message traffic | |
applications have threads that utilize a shared address space | |
distributed shared memory and migrating threads | |
practical ideas | |
realization is still slow | |
often have functional distribution | |
leads to different schedulers | |
a node might specialize | |
focus on email - cause node is an exit node | |
printing (node has access to printer), etc | |
RTB requires being really fricken close to exchanges, need to fill request fast | |
Rate monotonic - preemptive policy that dispatches according to period length
shorter the period, the higher the priority | |
Earliest-Deadline-First (EDF) - preemptive and dynamic scheme
usually designed for real-time applications | |
Neither of the above two are optimal for multiprocessors | |
often satisfactory heuristics though | |
feasible schedules produced, when they exist and when machines are not under heavy load | |
preallocation of procs is most common because of the complexity of trying to meet real-time constraints
Linux Scheduler | |
- time complexity depends on number of priorities (queues)
- time complexity does not depend on the number of processes
Calculating Timeslices | |
Quantum = (140 - static priority) * 20 if SP < 120
Quantum = (140 - static priority) * 5 if SP >= 120
static priority - think nice & real-time stuffs | |
note higher priority process get _longer_ quanta | |
important process should run longer | |
dynamic priority | |
calculate from static priority and avg sleep time | |
record time sleeping (up to a max value) | |
when the process is running, decrease value each timer tick | |
[0, 10] - rough estimate of % time sleeping recently | |
0 hurts priority by 5 | |
5 is neutral | |
10 helps priority by 5 | |
DP = max(100, min(SP-bonus + 5, 139)) | |
_interactive_ - if bonus - 5 >= SP/4 - 28
low-priority processes have a hard time becoming interactive | |
default priority process becomes interactive when its sleep time is greater than 700ms | |
Using Quanta | |
each tick decreases the quantum of _current_ | |
process gets preempted when clock hits 0
if non-interactive -> send to expired list | |
if interactive -> send to end of current queue (scheduled again) | |
but running will decrease bonus (along with priority) | |
Avoiding Indefinite Overtaking | |
remember those 140 queues (_active_ and _expired_) | |
only run procs from active queues (put on expired when quanta == 0) | |
swap active and expired queues after running through all of them | |
pointers to current arrays | |
two arrays | |
- avoids the need for traditional aging | |
- why is aging bad? -> it's O(n) at each clock tick | |
- ergo, not required for peoples scheduling | |
Many OSs iterate over procs to update priorities | |
Linux is more efficient | |
- procs are touched on start and on stop only | |
- recalculate priorities, bonuses, quanta, and interactive status | |
- never loop over all procs! | |
Locking Runqueues | |
Sometimes for queue rebalancing, kernel locks queue | |
in order to move procs | |
Real-time scheduling | |
soft real-time scheduling | |
static priority [0, 99] are real-time | |
FCFS & RR | |
Sleeping and Waking | |
- proc needs to wait for events | |
- proc added to wait queue | |
- wakeups can happen too soon - need to check args and maybe re-sleep | |
- you dont wake a process | |
- you DO wake a wait queue | |
System calls | |
`nice()` - lower a process' static priority | |
`getpriority()/setpriority()` - change priorities of a process group | |
`sched_getscheduler()/sched_setscheduler()` | |
set scheduling policy and params | |
and others that start with `sched_` (see with man -k) | |
`scheduler_tick()` | |
called each timer tick to update quanta | |
`try_to_wake_up()`
attempt to wake a proc | |
put on run queue | |
rebalance loads | |
`recalc_task_prio()` | |
update average sleep time and dynamic priority | |
`schedule()` | |
pick the next process to run | |
`rebalance_tick()` | |
check if load-balancing is needed | |
Fair Share Scheduling | |
would need to | |
add new scheduler type | |
Two basic functions | |
time of day - a service provided to apps | |
interval timers - something should happen in n ms from now | |
_Real-Time clock_ | |
tracks time and date | |
can interrupt at a certain rate | |
can interrupt at a certain time | |
only used by Linux for time of day | |
_programmable interval timer_ | |
generates periodic interrupts | |
usually based on HZ | |
some other special, less common timers
Things executed on timer ticks | |
keep track of time | |
invoke dynamic timer routines | |
call `scheduler_tick()` | |
system uses best timer available | |
`jiffies` | |
incremented each tick | |
roughly the number of HZ since sys boot | |
32 bit counter - wraps around in ~50 days at HZ=1000
Time of Day | |
stored in xtime (`struct timespec`) | |
incremented once per tick | |
apparent tick rate can be adjusted slightly see `adjtimex()` | |
Kernel Timers | |
Dynamic timers | |
- call routine after a particular interval
- specify an interval, subroutine to call, and a param to pass subroutine
- params differentiate different instantiations of the same timer
- timers can be cancelled | |
Delay loops | |
- tight spin loops for very short delays (micro seconds or nano seconds) | |
- nothing else can use CPU during that time (except via interrupt) | |
- for when dynamic timers are too much overhead | |
System Calls | |
`time()` and `gettimeofday()` | |
`adjtimex()` | |
tweaks apparent clock rate (even the best crystals drift) | |
`setitimer()` and `alarm()` | |
interval timers for applications | |
user-mode interval timers | |
similar to kernel dynamic timers | |
System Calls | |
===================== | |
Allow user-space apps to | |
- access hardware | |
- create new procs | |
- communicate with existing ones | |
- request os resources | |
Layer between user programs and hardware | |
3 purposes | |
- abstracts hardware eg. read/write files dont care what media you are on | |
- security / stability | |
- virtualized system provided to the process | |
Without controlled access of sys resources | |
- couldnt multitask | |
- wouldnt be able to have virtual memory | |
Legal entry points to the kernel | |
- sys calls | |
- traps | |
- exceptions | |
Even device files, or files in /proc, are accessed via system calls | |
APIs, POSIX, and the C Library | |
------------------------------- | |
User space programs usually programmed against an API implemented in user-space, not directly to sys calls
Can create different APIs from building blocks provided by system calls
POSIX - series of standards from the IEEE (eye-triple-E) | |
Linux strives to be 'POSIX' and 'SUSv3'-compliant | |
A very common api is based on the POSIX standard | |
A meme related to interfaces in Unix is 'Provide mechanism, not policy' | |
Syscalls | |
--------- | |
Typically accessed by function calls in the C library | |
System calls also provide a return value of type `long` that signifies success or error | |
usually negative value denotes an error | |
usually 0 is a sign of success | |
`errno` - holds a special error code of the last error | |
`perror()` - used to print human-readable error code | |
tgid - is the thread-group id. This is == pid for normal processes; all threads of the same group share one tgid
`asmlinkage` | |
- only look on the stack for the fns arguments | |
- required for all syscalls
in kernel, all syscalls are prefixed with `sys_` | |
### System Call Numbers | |
Unique # used to reference a specific system call | |
User-space process identifies calls by #, not name | |
#, once assigned, cannot change - else compiled apps will break | |
### System Call Performance | |
Faster than many other OSs
Cause | |
- fast context switching | |
- streamlined enter/exit
- simple call handler | |
System Call Handler | |
------------------- | |
Kernel fns exist in a protected memory space. User-space programs cannot call them directly | |
Method - use an interrupt: incur an exception and the system will switch to kernel mode, executing the exception handler (in this case, the syscall handler) | |
# TODO - difference between throwing an exception and 'trapping into the kernel' | |
Some processors have a special sysenter feature | |
### Denoting the Correct System Call | |
Architecture dependent, on x86, user-space app sets the eax register before causing the trap into the kernel | |
### Parameter passing | |
Sent to the kernel on registers `ebx, ecx, edx, esi, and edi` | |
Return value is also sent via register | |
on x86, the `eax` register is used | |
System Call Implementation | |
---------------------------- | |
Must first define its purpose - exactly one | |
What are the new system call's arguments, return value, err codes | |
Should have a clean and simple interface | |
Small # args possible | |
Remember, semantics and behaviour are important, they must not change | |
- Consider how the function use might change over time | |
- Can your method be changed while maintaining compatibility?
- Will bugs be fixable? | |
- Are you needlessly limiting the fn in some way? | |
- Purpose remains the same, but uses may change | |
- Dont make architectural changes
### Verifying the Params | |
Thorough checking is in order - security is imperative
Check valid and legal, but also correct | |
Cant trust user-space procs | |
Always check validity of pointers!!! | |
- must point to user-space memory | |
- must point to memory in the process's address space | |
- if reading, memory must be marked readable | |
- if writing, memory must be marked writable | |
- if executing, memory must be marked executable | |
there are helper fns to do this | |
System Call Context | |
----------------- | |
In proc context | |
- Kernel can sleep | |
Makes available most kernel functionality | |
- is fully preemptible | |
Must make sure call is reentrant | |
otherwise trouble occurs when another higher priority proc makes same syscall in the middle of a lower prior proc | |
### Final Steps in Binding a System Call | |
Add entry to the end of the syscall table | |
- must be done for each arch | |
For each architecture, define the call number in `asm/unistd.h` | |
Compile the syscall into the kernel image (as opposed to compiling as a module) | |
# Note - interrupt handlers cannot sleep, and are therefore much more limited in what they can do than syscalls running in proc context | |
Since new sys calls will not be in the C libs, you can wrap access with macros. This will handle setting up register contents and issuing the trap instructions. `_syscalln()` where n is between 0 and 6. (# of params) | |
_see this section for use of your new syscall_ | |
### Option of a New Syscall | |
#### Pros | |
Simple to implement, easy to use | |
Fast performance on Linux | |
#### Cons | |
Requires a syscall number, which needs to be officially assigned to you | |
Written in stone, interface cannot change | |
Must be explicitly supported by each architecture | |
Not easily used from scripts | |
Cannot be accessed directly from filesystem | |
Hard to maintain outside of the master kernel tree
Overkill for simple info exchanges | |
#### Alternatives | |
Implement a device node and `read()` and `write()` to it | |
use `ioctl()` to manipulate specific settings or retrieve specific info | |
Certain interfaces, such as semaphores, can be represented as file descriptors and manipulated as such | |
Add the information as a file to appropriate location in sysfs | |
# TODO - what is sysfs? | |
# Note: you can get a proc's processor affinity in linux with `sched_getaffinity`. In a group context, a warm cache of the processor is relevant. If someone is familiar with offer caching code, you probably want them to continue rather than ramping-up someone else.
# Note: for humans: loading something onto disk takes time (learning something for the first time), disk to memory is faster (recall of past), recent tasks is faster still | |
Other | |
------ | |
# note - preempt notifiers - let something know that a process has been preempted | |
# TODO what is - group leader | |
# TODO what is - slacktime | |
Kernel Data Structures | |
======================== | |
Dont roll your own | |
Not inclusive, just the basics in this book | |
- linked lists | |
- queues | |
- maps | |
- binary trees | |
Linked Lists | |
----------------- | |
Allows storage of variable number of elements | |
Elements created at different times
- must be variable in length | |
- may occupy discontiguous regions in mem | |
Single linked, double linked, or circular linked | |
Linux Kernel LL is unique, but pretty much a circular, doubly linked list | |
### Moving Through a Linked List | |
Linear movement | |
Bad at | |
Random access | |
Good at | |
Dynamic addition / removal O(1) ftw | |
Iterating | |
### In Linux Kernel | |
Rather than including pointers in the data struct, include a pointer to a LL node instead | |
`linux/list.h` | |
Reference to list_head in the data struct allows several macros, such as `list_add()` or `container_of()`, which take a list_head as an argument
Since the offset of a node pointer in a data struct is set at compile time, it is easy to get the containing structure of the list
Presumably `const typeof( ... )` will get the type of the struct in a generic way, so `container_of()` doesnt need to know anything about it | |
Be sure it is embedded in data_struct, otherwise macros arent helpful | |
Need to init the list head on creation | |
Also, there are some safe(r) methods
Also, also, you may need to synchronize for exclusive list access | |
Queues | |
----------------- | |
Producers and consumers | |
see `kernel/kfifo.c` and `linux/kfifo.h` | |
Queues are one page-sized by default | |
Queues leave it up to the user to say | |
- How many bytes to copy in | |
- How many bytes to copy out | |
If a queue is nearly full, only a partial add will be done | |
If a queue is nearly empty, only a partial read will be done | |
Maps | |
------------ | |
Aka associative array | |
Keys to values | |
Hashtables are a type of map - but not all maps are hashtables | |
Trees can be used as well (self-balancing) | |
Hash | |
- better average-case complexity | |
Trees | |
- better worst-case complexity | |
- can also maintain a sorted order | |
- can use any type of key, so long as it can define the <= operator | |
### In Linux | |
Map is used to map a UID to a pointer | |
Will also generate a UID for you | |
`idr` maps user-space UIDs to their kernel data structures | |
Need to preallocate a UID before getting a new one, cause memory and locking are hard | |
Binary Trees | |
---------------- | |
Math says it is an 'acyclic, connected, directed graph' where each node has 0 or more outgoing edges | |
Binary tree is a tree with no more than 2 outgoing edges per node
_binary search tree (BST)_ - specific ordering on nodes | |
left of root is only nodes < root | |
right of root is only nodes > root | |
all subtrees are also binary search trees | |
Pros | |
fast search | |
fast ordered traversal | |
### Self-balancing BST | |
Depth of all leaves differs by at most 1 | |
In other words, height is managed | |
#### Red-Black Tree | |
type of self-balancing bst | |
binary tree of choice in linux | |
6 properties are enforced | |
Ensures that the deepest leaf has a depth of no more than double that of the shallowest leaf | |
Tree is always semi-balanced | |
        B
      /   \
     R     B
    / \   / \
   B  nil nil B
It does this by rotating shit thats unbalanced, but dont worry about it unless you are an undergrad compsci student, which im not (anymore) | |
`rbtree` - `lib/rbtree.c` and `linux/rbtree.h` | |
inserts are always logarithmic | |
What Data Structure to Use, When | |
-------------------------------- | |
What is primary access method? | |
LL | |
if iterating | |
when performance is not important | |
when interfacing with other LL consuming code | |
Queues | |
if producer / consumer pattern is used | |
if you know your buffer can hold your data | |
Maps | |
if UID to obj is needed | |
Rb tree | |
if storing lots of data is required | |
fast searching, with efficient in-order traversal | |
in-memory footprint is not bad | |
Radix tree | |
Algorithmic Complexity | |
---------------------- | |
What is the _asymptotic behavior_ of the algorithm | |
Big-oh - upper bound | |
Big-theta - both an upper and lower bound | |
usually when big-o is dropped, they are really talking about the least such big-o (the big-theta)
Time Complexity - See AoA - Colby CS | |
Interrupts and Interrupt Handlers | |
================================ | |
Communicating with hardware is a core responsibility | |
Aint got no time to wait for hardware | |
Rather than polling these devices, the responsibility is on the HW to interrupt the processor when it wants attention
Polling - check periodically | |
Interrupt - send a signal to the processor | |
Interrupt handlers - fn to handle interrupt | |
HW device interrupts | |
async w.r.t. processor clock | |
aggregated by multiplexers
directly connected to the CPU | |
have a unique value - for differentiation | |
aka interrupt request (IRQ) lines | |
IRQ 0 on a classic PC is the timer interrupt | |
IRQ 1 on a classic PC is the keyboard | |
Not always dynamically assigned | |
PCI bus dynamically generates values | |
Exceptions | |
- synchronous ~interrupts~ | |
- generated by the processor | |
- from software | |
Interrupts | |
- asynchronous interrupts | |
- generated by hardware | |
Interrupt Handlers | |
------------------ | |
aka interrupt service routine (ISR) | |
An associated handler for every device | |
Function | |
One per device | |
part of the _driver_ | |
normal C functions | |
run in _interrupt context_ | |
aka _atomic context_
code cannot block | |
must run quickly | |
acknowledge hardware, at a minimum | |
Top Halves vs Bottom Halves | |
--------------------------- | |
To balance large amount of work with response speed | |
### Top Half
- only the time critical stuff | |
### Bottom Half
- run at more convenient time | |
Network card example | |
- card has very little memory | |
- dont want buffer overruns | |
Registering an Interrupt Handler | |
-------------------------------- | |
`request_irq()` in `linux/interrupt.h` | |
One driver for every device | |
One interrupt handler per driver | |
`request_irq()` | |
params | |
int irq | |
the interrupt number | |
usually dynamically allocated | |
irq_handler_t handler | |
pointer to actual handler fn | |
unsigned long flags | |
bitmask of one or more of the flags | |
IRQF_DISABLED | |
usually poor form to use | |
handler fn blocks other interrupts from occurring
IRQF_SAMPLE_RANDOM | |
use as source of entropy | |
helps with random number generation | |
obv. dont set for predictable device interrupts | |
IRQF_TIMER | |
this handler processes interrupts for the system timer
IRQF_SHARED | |
interrupt line can be shared among multiple interrupt handlers | |
const char *name
ascii text name of device | |
used by `/proc/irq` and `/proc/interrupts` | |
void *dev | |
used when lines are shared | |
provide a unique cookie for later identification on removal | |
the way the kernel tells which handler to remove for shared lines | |
common to use the driver's device structure | |
can sleep | |
so no calling in interrupt context | |
cause somewhere `kmalloc()` is called | |
careful to initialize hardware fully before registering handler | |
`void free_irq(unsigned int irq, void *dev)` | |
disconnects line iff (not shared || last registered) | |
must be called from process context | |
Unless your handler is old and crusty | |
good chance you will need to share | |
Writing an Interrupt Handler | |
----------------------- | |
`static irqreturn_t intr_handler(int irq, void *dev)` | |
params | |
int irq | |
line # | |
void *dev | |
same as passed to `request_irq()` | |
typically the `device` structure | |
b/c unique to each device | |
potentially useful to have within the handler | |
return | |
IRQ_NONE | |
handler's device was not the originator
IRQ_HANDLED | |
handler correctly invoked | |
IRQ_RETVAL(val) | |
macro for returning one of the above two based on val | |
used to determine if device is issuing _spurious interrupts_ | |
spurious interrupts - unrequested | |
Interrupt handlers in Linux need not be reentrant | |
Processors do not receive interrupts on the same line while executing a handler for that line
Avoids dealing with nested interrupts | |
Code can be preempted, though, by interrupts on other lines | |
### Shared Handlers | |
- IRQF_SHARED flag must be set
- dev arg must be unique | |
- Interrupt handler must be able to distinguish what device actually generated an interrupt | |
hardware capability must exist | |
Invoked sequentially | |
Handler should quickly exit if its device was not the generator of the interrupt | |
Real-time clock (RTC) | |
`drivers/char/rtc.c` | |
separate from the system timer | |
sets the system clock | |
provides an alarm | |
supplies a periodic timer
driver's init code registers the handler | |
Interrupt Context | |
----------------- | |
Not associated with a running process | |
`current` macro is not relevant | |
cannot sleep | |
other functions are also not relevant | |
Time-critical | |
can wait-loop, but dont | |
push as much as possible into the bottom half | |
Stack location varies | |
Handler should not care what stack setup is in use | |
Always use the absolute minimum amount of stack space | |
Processor jumps to a unique location for each interrupt line | |
executes code at that location | |
### `/proc/interrupts` | |
stats related to interrupts
line # | |
# of times executed | |
handling controller (eg `XT-PIC`) | |
name | |
_procfs_ | |
a virtual file system | |
in kernel memory | |
mounted at `/proc` | |
reading / writing files | |
invokes kernel functions | |
seems like actually reading/writing files (but are just fns) | |
`XT-PIC` | |
std PC programmable interrupt controller | |
Interrupt Control | |
----------------- | |
Can enable/disable | |
a line for the whole machine | |
aka _masking out_ | |
not used as often now since, increasingly, lines are shared | |
the interrupt system on a processor | |
`local_irq_disable()` && `local_irq_enable()` | |
or safer: `local_irq_save(flags)` && `local_irq_restore(flags)`
Reasons | |
need synchronization (disable interrupt preemption) | |
kernel will not be preempted | |
Does not prevent concurrent access from another processor | |
need a lock for that | |
No `cli()` (global disable interrupts) | |
means all synchronization must be specific and narrow | |
results in faster performance | |
### Status of the Interrupt System | |
`in_interrupt()` | |
- nonzero if kernel is executing a handler | |
- nonzero if kernel is executing a bottom half | |
- if 0, kernel is in process context | |
`in_irq()` | |
- nonzero if kernel is executing a handler | |
Bottom Halves and Deferring Work | |
========================== | |
Top half is not enough | |
Why need for fast | |
- TH interrupts all the code | |
- TH disables at least a line, if not all HW comms | |
- Timing critical, eg. small HW buffers | |
- Cannot block | |
There is a need to do less critical work later. | |
Bottom Halves | |
---------------- | |
Do most work here. All but the critical. | |
You should probably process data copied in the TH, though. | |
Tips for dividing work: | |
perform in handler | |
- perform in handler | |
- HW related | |
- ensure another interrupt doesnt happen | |
perform in bottom half | |
- all other | |
### Several Mechanisms | |
First approach - a static set of handlers | |
Second approach - task queues that a driver could append to | |
Third approach | |
Softirqs | |
- statically defined - compile-time
- can run simultaneously on any processor | |
- two of the same can run concurrently | |
(require more care) | |
Tasklets | |
- dynamically defined | |
- can run simultaneously on any processor | |
- two of the same can _NOT_ run concurrently | |
- build on softirqs | |
Task Queue replaced with Work Queue | |
- queue work for later | |
- runs jobs in process context | |
(kernel timers) | |
- after specific time, rather than 'any time but now' | |
Softirqs | |
------------ | |
Rarely used directly. Tasklets are a much more common form.
`kernel/softirq.c` | |
Usually reserved for the most timing-critical / important bottom-half processing
Currently, only used by networking and block subsystems. But also, tasklets and kernel timers are built on top. | |
Needed for scalability. If you do not need to scale to infinitely many processors, consider a tasklet
Used for high frequency or highly threaded use. | |
Represented by the `softirq_action` structure (see `linux/interrupt.h`)
Each is registered in a 32-entry array | |
only nine exist now | |
A softirq never preempts another softirq | |
But it can run on a different processor
Only interrupt handlers can preempt | |
Must mark the softirq before it will run | |
aka _raising the softirq_ | |
usually done by the interrupt handler | |
check for pending occurs | |
in return from HW interrupt code path | |
in the `ksoftirqd` kernel thread | |
any code w/ explicit checks | |
eg. networking subsystem | |
### Use | |
index | |
must be assigned (`linux/interrupt.h`) | |
used for priority - lower executes before higher | |
handler must be registered at run-time | |
`open_softirq()` | |
eg `open_softirq(NET_TX_SOFTIRQ, net_tx_action)` | |
To avoid the need for locking, try to use only per-processor data
Once installed and initialized
can be raised using `raise_softirq(NET_TX_SOFTIRQ)` | |
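Taken together, the steps above look roughly like this sketch; `MY_SOFTIRQ` and `my_softirq_action` are hypothetical names, and the two-argument `open_softirq()` matches the 2.6-era kernels these notes describe:

```c
#include <linux/interrupt.h>

/* Hypothetical handler. Runs with interrupts enabled and cannot sleep;
 * prefer per-processor data here to avoid locking. */
static void my_softirq_action(struct softirq_action *a)
{
	/* ... do the deferred work ... */
}

static int __init my_init(void)
{
	/* MY_SOFTIRQ is a hypothetical index that would have to be added
	 * to the enum in <linux/interrupt.h>; lower indexes run first. */
	open_softirq(MY_SOFTIRQ, my_softirq_action);
	return 0;
}

/* Later, typically from the interrupt handler: mark it pending. */
static irqreturn_t my_irq_handler(int irq, void *dev)
{
	raise_softirq(MY_SOFTIRQ);
	return IRQ_HANDLED;
}
```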
Tasklets | |
----------- | |
Called by softirq logic | |
`HI_SOFTIRQ` - higher priority tasklets | |
`TASKLET_SOFTIRQ` - regular | |
Represented by the tasklet structure | |
`struct tasklet_struct` | |
`unsigned long state` | |
0, `TASKLET_STATE_SCHED`, or `TASKLET_STATE_RUN` | |
`count`
tasklet can only run if zero (nonzero means disabled)
`func` | |
called the same way `action` is in softirqs | |
### Scheduling tasklets | |
Multiplexed on two softirqs | |
_scheduled tasklets_
- equivalent of raised softirqs
- stored in two linked lists per processor
`tasklet_vec` - regular tasklets | |
`tasklet_hi_vec` - high priority | |
- cannot be scheduled if already scheduled | |
System disables interrupts while tasklets are being sched'd | |
Raises the `TASKLET_SOFTIRQ` or `HI_SOFTIRQ` which will call | |
`tasklet_action()` and `tasklet_hi_action()` respectively | |
Loop over each pending tasklet in the retrieved list | |
if already running on another processor | |
skip it | |
else | |
mark it as being run, and run it | |
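The loop just described, slightly paraphrased from `kernel/softirq.c` (details vary by kernel version, and the list-head name below is a hypothetical stand-in for the per-CPU `tasklet_vec`):

```c
/* Simplified sketch of tasklet_action(): steal this CPU's pending list
 * with local interrupts disabled, then walk it with interrupts on. */
struct tasklet_struct *list;

local_irq_disable();
list = this_cpu_tasklet_vec_head;	/* hypothetical name for the list head */
this_cpu_tasklet_vec_head = NULL;
local_irq_enable();

while (list) {
	struct tasklet_struct *t = list;
	list = list->next;

	if (tasklet_trylock(t)) {		/* running on another CPU? skip it */
		if (!atomic_read(&t->count)) {	/* zero count: not disabled */
			clear_bit(TASKLET_STATE_SCHED, &t->state);
			t->func(t->data);	/* run the handler */
		}
		tasklet_unlock(t);
	}
	/* the real code requeues a busy tasklet and re-raises the softirq */
}
```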
### Using Tasklets | |
#### Declare your tasklet | |
statically or dynamically | |
statically - use macros | |
use `DECLARE_TASKLET` | |
or `DECLARE_TASKLET_DISABLED` (count is initially 1, not 0) | |
both create `tasklet_struct` | |
dynamically create `tasklet_struct` and call `tasklet_init` | |
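Both routes might look like this; `my_tasklet_handler` and the zero data value are hypothetical, and the three-argument macros match the kernels these notes cover:

```c
#include <linux/interrupt.h>

static void my_tasklet_handler(unsigned long data);

/* Static: count starts at 0, so the tasklet is enabled... */
DECLARE_TASKLET(my_tasklet, my_tasklet_handler, 0UL);
/* ...or at 1, so it stays disabled until tasklet_enable() is called. */
DECLARE_TASKLET_DISABLED(my_quiet_tasklet, my_tasklet_handler, 0UL);

/* Dynamic: fill in a tasklet_struct you already have. */
static struct tasklet_struct my_dyn_tasklet;

static int __init my_init(void)
{
	tasklet_init(&my_dyn_tasklet, my_tasklet_handler, 0UL);
	return 0;
}
```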
#### Write your handler | |
`tasklet_handler` is passed an `unsigned long` data argument
Writing | |
Cannot sleep | |
no semaphores | |
no blocking functions | |
Interrupts are enabled | |
careful if you share data with one | |
#### Scheduling Your Tasklet | |
Call `tasklet_schedule` with a pointer to your `tasklet_struct` | |
If already scheduled, it only runs once | |
If already running (eg on another processor), tasklet is rescheduled | |
As an optimization, a tasklet always runs on the processor that scheduled it (hopefully making better use of cache) | |
Need to enable the tasklet if it is created disabled | |
Disabled tasklets are just skipped | |
You can kill a pending tasklet with `tasklet_kill` (sleeps) | |
Helpful when a tasklet frequently reschedules itself | |
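A minimal scheduling sketch, assuming a hypothetical tasklet declared elsewhere with `DECLARE_TASKLET_DISABLED(my_tasklet, my_tasklet_handler, 0UL)`:

```c
#include <linux/interrupt.h>

static void my_start(void)
{
	tasklet_enable(&my_tasklet);	/* count back to 0: allowed to run */
	tasklet_schedule(&my_tasklet);	/* mark pending; runs soon on this CPU */
}

static void my_stop(void)
{
	tasklet_disable(&my_tasklet);	/* waits for a running instance to finish */
	tasklet_kill(&my_tasklet);	/* pull it off the pending list; may sleep */
}
```

`tasklet_disable_nosync()` also exists for the case where you do not want to wait for a running instance.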
#### ksoftirqd | |
Per processor kernel threads | |
Help process softirqs and tasklets | |
Grew out of problem with heavy interrupt load | |
esp. with softirqs that rescheduled themselves | |
procs would starve for processor time | |
but softirqs need to be processed | |
Possible solution 1 | |
Perform rescheduled softirqs before returning to procs | |
Might starve out processes under high load | |
User space needs to be run periodically
Possible solution 2 | |
Don't handle reactivated softirqs
must wait for the next time softirqs are processed | |
doesn't take advantage of an idle system
may starve softirqs | |
Actual Solution | |
Don't immediately process reactivated softirqs
if softirqs are excessive, wake up some special kernel threads | |
special kernel threads | |
run at lowest priority | |
one per processor | |
named `ksoftirqd/n` | |
loop over pending softirqs | |
sleep to yield to other procs | |
cooperatively nice to other work, too
#### The Old BH Mechanism | |
Old BH | |
Max 32 | |
Must be statically defined | |
inaccessible to modules
Dynamically defined handlers had to piggyback off an already defined one
Only one handler can execute at a time | |
(strictly serialized) | |
Did not scale well to multiple processors
network layer, in particular, suffered | |
Much like tasklets in other regards
Work Queues | |
----------------- | |
Defer work into a kernel thread | |
called _worker threads_ | |
default ones named `events/n` | |
one per processor | |
must have a strong reason not to use default | |
eg. processor-intense | |
performance-critical | |
prevent starving out other default work
Always run in process context | |
Can sleep - allows
allocating lots of memory
using semaphores
performing block I/O
Decision - if you need to sleep, use a work queue | |
otherwise, tasklet | |
Alternatively - one could spin up a new kernel thread | |
but this is frowned upon | |
or punishable.. | |
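For the common case, the default `events/n` threads are driven through a small API; the names below are hypothetical, and the two-argument `DECLARE_WORK()` matches roughly 2.6.20-and-later kernels:

```c
#include <linux/workqueue.h>

/* Hypothetical handler: runs in process context, so it may sleep,
 * allocate with GFP_KERNEL, take semaphores, and do block I/O. */
static void my_work_handler(struct work_struct *work)
{
	/* ... deferred, sleep-friendly work ... */
}

static DECLARE_WORK(my_work, my_work_handler);
static DECLARE_DELAYED_WORK(my_delayed_work, my_work_handler);

static void my_defer(void)
{
	schedule_work(&my_work);	/* run on a default events/n thread */
	schedule_delayed_work(&my_delayed_work, 5 * HZ); /* >= 5s from now */
}
```

`flush_scheduled_work()` can then be used to wait for the default queue to drain, e.g. before module unload.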
### Data Structures | |
#### Representing the Threads | |
`workqueue_struct` | |
one per _type_ of worker thread | |
interface / abstraction exposed to the rest of the system | |
holds a `cpu_workqueue_struct` per processor | |
`cpu_workqueue_struct` | |
#### Representing the Work |