
Table of Contents

Week 1

Introduction

Logistics

  • Info about Hillside
  • Class times

Material overview

  • Man pages
  • Outline
    • Listed in the printed material

Introduction to the kernel

Unix VM

  • Unix was about portability
    • The VM is built upon 5 properties
      • paged virtual addr space
        • If you want to allocate more than physical memory
      • software interrupts
        • Takes hardware events and turns them into software events
        • Catch signal interrupt and deal with things like division by zero
        • Call your signal handler
        • Unified interface for hardware events
      • timers and counters
      • identifiers for accounting, protection and scheduling
      • descriptor
        • Handle of some kind of I/O
        • Before that, each thing had its own kind of descriptor

Process

  • Fundamental to the unix philosophy
  • Input/output connected through pipes creates a pipeline of execution
  • Process is made of 4 main pieces:
    • CPU Time
      • Scheduling made by kernel
    • Asynchronous Events
      • Interrupts have to be dealt with: signals, network traffic
    • Memory
      • How virtual memory is managed
      • How to make the hardware provide an address space
    • I/O Descriptors
      • Any useful process is going to have I/O Descriptors
      • Most fundamental service provided by UNIX

System calls

  • Top half of the kernel
    • The top half has its own per-process stack, kept in real physical memory so it is there whenever the kernel wants it
  • More malloc and free in kernel code because only a limited stack is available
  • Bottom half
    • Scheduled by interrupts
    • The top half will be preempted when the bottom half interrupts
    • You can block, but generally don't, in the bottom half

Entering the kernel

  • Part 1:

    1. Hardware traps
    2. System calls
      • Special case of hardware traps
    3. Hardware interrupts
      • Async (disk io complete, user typed a key)
      • Higher priority
    4. Software-initiated traps or interrupts
      • Lower priority
  • Part 2: Switch to kernel mode

    1. Push PC and PSW onto the kernel per-process stack
    2. Push trap type or syscall number onto the stack
    3. Save general purpose registers
    4. Call handler (syscall()) for a syscall, trap() for a trap

Returning from the kernel

  1. Timers

    • Three per-process interval timers; an optional value is reloaded when one expires (see the setitimer() sketch after this list)
    • ITIMER_REAL
      • Decrements in real time
      • Sends SIGALRM on expiration
    • ITIMER_VIRTUAL
      • Decrements in user virtual time
      • Sends SIGVTALRM on expiration
    • ITIMER_PROF
      • Decrements in user + system virtual time
      • Sends SIGPROF on expiration
    • Select timer is independent from the interval timers
    • Kqueue can also create timers
  2. Timeouts

    • Arrange a timeout in the kernel
      • handle = timeout(function, arg, ticks)
    • Cancel it
      • untimeout(handle)
    • Compute the number of ticks
      • ticks = hzto(timevalp)
    • Timeout routines are called from softclock() at splsoftclock
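
A minimal userland sketch of the interval-timer interface above, assuming standard setitimer(2)/SIGALRM behavior (not code from the class):

```c
/* Arm ITIMER_REAL with setitimer(2) and catch SIGALRM. */
#include <sys/time.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t fired;

static void
on_alarm(int sig)
{
    (void)sig;
    fired = 1;            /* SIGALRM is delivered when the real-time timer expires */
}

int
main(void)
{
    struct itimerval itv = {
        .it_value    = { .tv_sec = 1, .tv_usec = 0 },  /* first expiration in 1s */
        .it_interval = { .tv_sec = 1, .tv_usec = 0 },  /* then reloaded every 1s */
    };

    signal(SIGALRM, on_alarm);
    setitimer(ITIMER_REAL, &itv, NULL);   /* decrements in real time */

    while (!fired)
        pause();                          /* sleep until the signal arrives */
    printf("timer fired\n");
    return 0;
}
```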

Week 2

Locking

Historic Synchronization

  • When a lock is not available, sleep on it
    • wchan (wait channel) receives a pointer to the resource
    • processes waiting for that specific resource will be woken

Lock Hierarchy

  • From the most simple to the most complex
    • Hardware
    • Spin Locks
    • Locks that block briefly
      • blocking mutex
      • pool mutex
      • reader-writer locks
      • read-mostly
    • Locks using the sleep-queue interface
    • Witness

Turnstiles

  • sys/sys/proc.h
    • data structures use locks to protect access to their fields
  • Designed for short periods
  • Used to protect read and write
  • Track current lock holder
    • If you hold a short-term lock and go to sleep, the kernel will panic
  • Priority propagation

Turnstiles implementation

  • Hash headers to quickly find locks
  • A turnstile is needed each time a thread blocks
    • A thread can only block on one thing

Sleep queues

  • Used by shared-exclusive locks, condition variables
  • No propagation
  • You are not allowed to own a turnstile when requesting a sleep-queue lock
  • The queue tracks the exclusive lock holder
  • May be recursive

Sleep interface

  • wchan and priority
    • msleep
  • sleep on lbolt (lightning bolt)
  • The wait syscall waits for children to exit, putting the parent to sleep in its own context
  • sigpause
    • The only way to exit a sigpause is a longjmp
  • sleep implementation
    • lock sleep hash header
    • record wchan, hash to find the sleep queue
    • set td_priority (for scheduling)
    • place on the appropriate sleep queue
    • call sched_switch() to run the next thread
    • sleep thread is not running until awakened
    • if PCATCH is set check for signals
  • wakeup implementation
    • awaken ALL sleepers on wchan starting from the front
  • wakeup_one
    • awaken first sleeper on wchan
  • To awaken a thread
    • remove from sleep queue
    • recompute kg_user_pri for long sleepers
    • if sleeping: make runnable, place on run queue
    • if stopped: "unsleep", remain stopped
  • request CPU rescheduling by setting NEEDRESCHED

Critical Sections

  • protected by locks

Hardware requirements for lock

  • Locking is not possible without assistance from the hardware
  • Test-and-set
    • Very expensive instructions
  • FreeBSD uses compare-and-swap
    • Instead of a simple flag in memory, the lock word is a pointer
    • Owner field for a free lock has MTX_UNOWNED
    • Owner field for a held lock has the pointer to owning thread
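
A minimal userland illustration of the compare-and-swap idea above, using C11 atomics; this is a sketch, not FreeBSD's actual mtx code, and all names are made up:

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_uintptr_t owner;   /* 0 == unowned, otherwise pointer to owning thread */
} cas_lock;

#define CAS_UNOWNED ((uintptr_t)0)

static void
cas_lock_acquire(cas_lock *l, uintptr_t self)
{
    uintptr_t expected = CAS_UNOWNED;

    /* Spin until we atomically swap UNOWNED -> self. */
    while (!atomic_compare_exchange_weak(&l->owner, &expected, self))
        expected = CAS_UNOWNED;   /* CAS wrote the current owner here; reset and retry */
}

static void
cas_lock_release(cas_lock *l)
{
    atomic_store(&l->owner, CAS_UNOWNED);
}
```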

Spin mutex

  • Exclusive access only
  • Loops waiting for the mutex to become available
  • Run inside the critical section
  • More expensive than a blocking mutex

Blocking mutex

  • Exclusive only
  • Adaptive spinning: only spins if the owner is running
  • All waiters are awakened when the lock is released
    • Cheaper to release an uncontested lock
    • Often end up scheduling sequentially

Pool Mutex

  • Used for small short-lived data structures
    • Just need a pointer to a mutex rather than a large mutex itself
    • Mutexes are preallocated, avoiding high creation and destruction costs
  • Example: the poll syscall, which needs a structure to track poll requests

Reader and Writer locks

  • Exclusive and shared mutex
  • Uses a turnstile so cannot be held when you sleep
  • Priority propagation for exclusive not for shared access
  • May specify permission to recurse

Read-Mostly lock

  • Same properties as reader-writer
  • Priority propagation for shared
  • Designed for fast access
    • Read without locking then check if write happened
    • if so redo the reading
  • The routing table is a good example of this lock
  • IBM holds a patent on RCU
    • GPL code is allowed to use it
    • FreeBSD is not GPL, so it cannot

Shared-Exclusive locks

  • Fastest and simplest of the locks that can sleep
  • Shared and exclusive
  • Permission to recurse
  • May request interruptions (PCATCH)
  • No priority propagation

Condition variables

  • wrapper on traditional sleep and wakeup
  • Allow wake one or all waiters
  • Must hold a mutex before awakening or waiting
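
A hedged kernel-style sketch of the mutex-plus-condition-variable pattern, using the interfaces documented in mutex(9) and condvar(9); the queue names and logic are made up for illustration, not taken from the class:

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

static struct mtx queue_mtx;
static struct cv  queue_cv;
static int        queue_len;          /* illustrative shared state */

static void
queue_setup(void)
{
    mtx_init(&queue_mtx, "queue lock", NULL, MTX_DEF);
    cv_init(&queue_cv, "queue cv");
}

static void
queue_put(void)
{
    mtx_lock(&queue_mtx);             /* must hold the mutex before awakening */
    queue_len++;
    cv_signal(&queue_cv);             /* wake one waiter (cv_broadcast wakes all) */
    mtx_unlock(&queue_mtx);
}

static void
queue_get(void)
{
    mtx_lock(&queue_mtx);             /* must hold the mutex before waiting */
    while (queue_len == 0)
        cv_wait(&queue_cv, &queue_mtx);  /* sleeps, releasing and reacquiring the mutex */
    queue_len--;
    mtx_unlock(&queue_mtx);
}
```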

Locking Manager locks

  • Most full-featured of the locks that can sleep
  • Downgrade (from exclusive to shared)
  • Upgrade (from shared to exclusive)
  • An exclusive upgrade prevents other exclusive locks
  • Can pass ownership of a lock from a thread to the kernel
  • You need to clean up before leaving the kernel
  • Ability to drain all accessing threads for deallocation
  • No propagation

Witness

  • Partial ordering requires:
    1. A thread acquires only one lock in a class
    2. A thread may acquire only a lock in a higher-numbered class
      • Class 1 > Class 2 - a thread with locks in class 1 can acquire class 2 locks
  • Programmers can define lock classes
  • Witness code observes actual lock ordering and complains when either rule is violated
  • Very, very expensive
  • Debate exists about enabling in production

Processes

Process Resources

  • CPU Time
    • What runs when
  • Async events
    • timers
    • hardware exceptions
    • external events
    • I/O
  • Memory
    • text/data
    • heap
    • runtime stack
  • I/O Descriptors
    • files
    • pipes
    • socket
    • stream

BSD Process structures

  • Process entry
    • All UNIX systems have one; Solaris and Linux process entries may be different
  • Process structures dynamically allocated
  • Placed in linked list

Open file information

  • filedesc structure
  • The open-file array is reallocated when it overflows

Other Process Substructures

  • Process limits (plimit)
    • array of rlimit
    • copied lazily
  • Process accounting/statistics (pstats)
    • rusage (Resource usage)
    • virtual-time timer
    • profiling
    • start time
  • Signal actions
    • Pending signals
      • sigacts: per-signal actions
      • signal stack info
      • current state during sigpause

Processes versus Threads

  • kthread model (started in BSD world)
    • One process structure
    • Multiple threads
  • Fork model (started in the Linux world)
    • Each thread has its own process structure
    • same model as that used by Linux
  • kthread context switches are way cheaper
  • The rfork model is slower at context switching

Week 3

Process

Process States

  • runnable
  • sleeping
  • stopped
    • stopped by job control or tracing
    • return to sleep or run
  • minor_states
    • new
    • zombie
      • Zombies won't close socket descriptors because TCP may still have data sitting around
      • Init's job is to wait for orphans left by other processes

Finding processes entries

  • by process ID
    • hashed, linked through p_hash index
    • pfind()
  • by state
    • run queue: td_runq
    • sleep queue: td_slpq
    • lock queue: td_lockq
  • by type
    • on one of allproc or zombproc

Process entries of related processes

  • The p_ prefix always refers to fields of the process structure

Scheduling classes

  • REALTIME
  • ULE
  • 4BSD
  • IDLE

Week 4

Security

Security mindset

  • UNIX was designed with some security in mind
  • Users have existed in UNIX since the beginning
    • Windows didn't have users until NT

Trusted Computing Base

  • Things that have to be secure
    • Kernel
    • Boot scripts
    • Core utilities (shell, login, ifconfig)
    • Libraries used by core utilities (libc)
  • Solid crypto base

Process credentials

  • process credentials (ucred structure)
    • effective uid (setuid)
    • group set (gid)
    • real uid, gid
    • saved uid, gid from set*id program
  • exported credentials (xucred structure) - outside kernel
    • effective uid
    • group set

Set-User-Identification Functionality

  • seteuid
  • setegid
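
A small hedged example of the set-user-ID mechanism using standard seteuid(2); the structure is illustrative, not code from the class:

```c
/* A set-user-ID program temporarily dropping and restoring its effective UID. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    uid_t real = getuid();        /* the invoking user */
    uid_t privileged = geteuid(); /* owner of the set-user-ID executable */

    /* Drop privileges: act as the real user for ordinary work. */
    if (seteuid(real) == -1) {
        perror("seteuid(real)");
        exit(1);
    }

    /* ... do unprivileged work ... */

    /* Regain privileges via the saved UID recorded at exec time. */
    if (seteuid(privileged) == -1) {
        perror("seteuid(privileged)");
        exit(1);
    }
    return 0;
}
```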

Immutable and Append-only flags

  1. Immutable limitation

    • Immutable files can be changed only in single-user mode
    • Append-only files can only be rotated in single-user mode
    • Direct hardware access is restricted
    • All startup activities need to be protected

Jails

  • Started from chroot
  • Jails were created by demand from the user base (ISPs)
  • vnet
  • vem0a - vem0b (master slave virtual net)

Jails rules

  • Permitted
    • running or signaling process within the jail
    • change files within jail
    • bind ports
    • access raw, divert, or routing sockets on the jail's virtual network
  • Not permitted
    • get info from processes outside jail
    • change kernel variables
    • mount or unmount filesystems
    • modify physical network interfaces
    • reboot

Random number generation

  • Applications access it via /dev/random
  • Uses Yarrow, a PRNG
  • Yarrow
    • reuse existing crypto primitives such as hash and block encryption
    • framework for entropy acquisition
    • an entropy accumulator
    • a reseed mechanism
    • a generation mechanism (SHA-256 and AES)
  • Yarrow will be replaced by Fortuna on FreeBSD 11

Access control lists (ACL)

  • File permission bits
  • ACL capabilities
    • Look up
    • Administration permissions
      • Change acl on the file
      • List of users
      • List of groups
    • Default ACL Inheritable

Access control list semantics

  • ZFS uses NFSv4 ACL model (compatible with windows ACL)
  • NFSv4 uses inheritable ACLs rather than the POSIX default ACL

Privileges (Kernel only, no userland interface)

  • Nearly 200 privileges
  • In the kernel, checks for uid == 0 also call priv_check()

Privileges applied

  • Jails uses privileges to control what a process can do
  • MAC is a loadable kernel module

Auditing

  • Used to find out what happened when things go bad
  • Good auditing should include intrusion detection
  • OpenBSM
  • Looks like truss or strace

Auditing handling

  • auditd daemon
    • starts a kernel thread that controls record distribution

Capsicum

  • Sandbox
    • More powerful than jails
  • Untrusted code runs in a separate process with limited access
  • cap_enter()
  • No access to filesystem

Sample Capsicum capability

  • Defined at /sys/sys/capability.h
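
A hedged sketch of capability mode using the interfaces described in capsicum(4), cap_rights_limit(2), and cap_enter(2); the file path is made up for illustration:

```c
#include <sys/capsicum.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    cap_rights_t rights;
    int fd = open("/var/log/example.log", O_RDONLY);   /* hypothetical path */
    if (fd == -1) { perror("open"); return 1; }

    /* Limit the descriptor to read-style operations. */
    cap_rights_init(&rights, CAP_READ, CAP_FSTAT);
    if (cap_rights_limit(fd, &rights) == -1) { perror("cap_rights_limit"); return 1; }

    /* Enter capability mode: no further global namespace access
     * (opening new paths now fails with ECAPMODE). */
    if (cap_enter() == -1) { perror("cap_enter"); return 1; }

    char buf[128];
    ssize_t n = read(fd, buf, sizeof(buf));   /* still allowed on the limited fd */
    printf("read %zd bytes inside the sandbox\n", n);
    return 0;
}
```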

Virtual Memory

Virtual Memory Layout

  • Top part of the address space used by the kernel
    • malloc() memory, kernel thread stacks
    • no page faults on kernel memory
    • load everything, really
  • Bottom part for userland
    • Context switch happens here
    • Trampoline code for signals
    • User stack pages are not executable
    • First page left invalid to catch null-pointer dereferences
  • Heap grows up, stack grows down
  • Shared libraries sit a couple of pages after the heap limit

Virtual memory hardware operation

  • Virtual address to Physical address
  • MMU
    • L1 cache will probably have a page cached
    • TLB - Translation Lookaside Buffer
    • Page fault - interrupt
    • The MMU must be informed when a new page is loaded
  • Context switches are expensive because a whole new page directory needs to be loaded

Virtual Memory Design

  • Process address space as regions
    • A region is a range of address space that is treated the same way
    • Shared memory
    • File Mapping
    • Copy on write after fork
  • The VM design is architecture independent

User interface to Virtual Memory

  • mmap()
    • maps file into memory
  • munmap()
    • removes mapping from memory
  • msync()
    • sync hardware caches to main memory
  • mprotect()
    • control access to parts of the address space
  • madvise()
    • advise the kernel on how an area of memory will be used
    • e.g., a garbage collector advises random access
  • mlock()
    • forces pages to be resident, not paged out
    • used by databases for transactions, preventing pages from going away
    • call msync() to sync main memory from caches
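
A hedged example of the mmap() interface described above; the mapped path is illustrative only:

```c
/* Map a file read-only with mmap(), touch its contents, then unmap. */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    const char *path = "/etc/motd";            /* illustrative file */
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

    /* Map the whole file; the kernel backs it with a vnode object. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    madvise(p, st.st_size, MADV_SEQUENTIAL);   /* hint: sequential access */
    fwrite(p, 1, st.st_size, stdout);          /* pages fault in on demand */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```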

Week 5

Virtual Memory

Physical memory and swap space

  • when malloc fails you probably used all your address space
  • linux does not enforce address space size

Virtual Memory organization

  • Data Structures
    • Per process: shared per address space, per thread
    • Per address space: collection of objects

Process Address Space

  • most of the code is machine independent
  • vmspace
    • vm_map
    • vm_pmap (p for physical)
    • statistics (about the vm system, page faults, workset size)
  • vm_map_entry
    • anonymous object
    • vnode / object
    • start and end address
  • The VM asks the object whether it has a page

Process Address Space Sharing

  • Private mapping
    • Shadow objects between the vm_entry and the file object/vnode
    • In low-memory scenarios, pages are taken from the vnode and put into memory

Shadow object chains

  • When a process with a MAP_PRIVATE mapping forks, its child must get its own shadow object

Machine-dependent virtual-memory description

  • On 32-bit systems the kernel can only have 2 GB of address space

Kernel memory allocator hierarchy

  • Memory in kernel space
    • submaps
    • vmem arena
    • slabs out of the arena
    • buckets out of slabs
    • buckets are per-CPU

Kernel Maps and Submaps

  • kernel described by map entries

Vmem Data Structures (Arena)

  • bt boundary tag
  • power-of-2 free list
  • BESTFIT or INSTANTFIT
  • hash list for used pages, pointed to by the boundary tags

The Slab Allocator

  • the slab allocator is responsible for allocating a zone of a particular thing
  • all objects are the same size

The Keg Allocator

  • manages a particular zone
  • track slab objects
  • list for full, partially allocated and empty slabs
  • request a new slab when everything is in use

The Zone Allocator

  • Lock per zone
  • Each CPU has its own lockless list protected by a critical section
  • Each CPU has its own bucket of memory
  • Two buckets per CPU

Fast Kernel malloc implementation

  • Memory pool characteristics
    • Size is fixed
    • Max size is no bigger than phys mem
  • Memory mapping hardware can be directly controlled
  • Known usage patterns
    • Most allocations are small
    • Many allocations are for a power-of-2 sized piece of memory

Interface look and feel

  • Simple and familiar, similar to C's malloc() and free()
  • Small allocation
    • use power of two
    • called using malloc and free
  • Large common structures
    • use zone-based allocation
    • more compact than power-of-two
    • avoid cache line problems
    • avoid initializing structures
    • must reclaim unused memory from zones

Types of pages

  • Wired - kernel
  • Active - In use
  • Inactive - valid contents, dirty
  • Cache - valid contents, clean
  • Free - invalid contents, ready for use
  • Necessary to have higher priority page access

Paging strategies

  • free pages for the free list
    • no longer freed explicitly
    • on process exit, anonymous pages go to the free list
  • To add inactive pages to the cache list
    • remove oldest page of the inactive list
    • if this is the first inspection, mark it scanned
    • if dirty, put it back on the inactive list
    • pages on the cache list need to be clean
  • Inactive and cache are LRU

Paging strategies (part II)

  • To add to inactive list
    • Remove oldest page from active list
    • if in use, increment active count and put on the end of active list
    • else not in use, decrement active count
      • if active count greater than zero, put on end of active list
      • else clear scanned flag and move to end of inactive list

Superpage Motivation

  • The problem is the limited size of the TLB
  • if TLB takes more than a cycle it is slow

Superpage Solution

  • 4k pages
  • 2Mb pages on PAE
  • 4Mb pages without PAE

Superpage Reservation

  • On FreeBSD the kernel decides when to use superpages
  • Others allow programs to decide
  • Reservation starts at the beginning (first fault), before small pages are already in use
    • Anonymous memory is always eligible for superpages
    • A mapped file of at least superpage size is eligible
  • Only save TLB misses

Superpage promotion

  • if you are touching all pages in your reservation
  • promotion to read-only superpage

Superpage fragmentation control

  • Kernel memory always uses superpages
    • The kernel does not page out its memory
  • Cache and free are kept in buddy lists (pair)
    • Aggregate small pages until a full superpage is available

Filesystem interaction with the virtual memory

  • the filesystem owns its "buffer cache" pages
  • an anonymous object owns its memory pages
  • anonymous objects don't have a filesystem
  • pages may be both mapped and cached (copy on write)
  • management of swap space
    • page cluster

Kernel structure

  • p 315 (book reference picture)

Week 6

I/O System Overview

Kernel I/O Structure

  • GEOM handles the disk IO
    • Can do all sorts of things: RAID, striping, etc.
  • The CAM layer handles disk I/O in a generic way
    • Individual drivers are much smaller
    • Based on SCSI but extended to ATA and SATA

I/O Descriptors

  • File Descriptor
    • An array of bytes
    • Move the file offset
    • Read from it
    • Write to it
  • Access to other hardware
    • Terminals
    • Tape drives
    • Printers
    • Disks
    • Modems
  • Pipes
    • Reliable byte stream

Descriptor I/O Structure

  • Per-process file descriptor index (integer array)

File Entry

  • One per open file, shared across dup()ed descriptors

File Entry Types

  • VNODE
  • SOCKET
  • PIPE
  • FIFO
  • KQUEUE, MQUEUE, etc.

Multiplexing I/O

  1. Use the select() or poll() system call (see the poll() sketch after this list)
    • Useful when the I/O usually blocks
    • Expensive when I/O succeeds
    • Stateless
  2. Use non-blocking mode
    • Useful when the I/O usually succeeds
    • Expensive when I/O blocks
  3. Use signal-driven I/O
    • Useful for infrequent events
    • Expensive for high-volume I/O
    • Context switch for each signal
  4. Use kqueue to monitor kevents
    • Useful for monitoring many events
    • Only in BSD and macOS
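
A hedged example of option 1 above using the standard poll(2) interface; the descriptors watched are just for illustration:

```c
/* Multiplex two descriptors with poll(2) instead of blocking on either one. */
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    struct pollfd fds[2];

    fds[0].fd = STDIN_FILENO;   /* watch stdin for readability */
    fds[0].events = POLLIN;
    fds[1].fd = STDOUT_FILENO;  /* watch stdout for writability */
    fds[1].events = POLLOUT;

    /* Block up to 5 seconds waiting for either descriptor to become ready. */
    int n = poll(fds, 2, 5000);
    if (n == -1) { perror("poll"); return 1; }
    if (n == 0) { printf("timed out\n"); return 0; }

    if (fds[0].revents & POLLIN)
        printf("stdin is readable\n");
    if (fds[1].revents & POLLOUT)
        printf("stdout is writable\n");
    return 0;
}
```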

Select Top-Level

  • Select will sit on top of the underlying bottom half to ask about states
  • Pool of mutexes

Select Bottom-Level

  • Select on async devices

Select Notification

  • Select notification when I/O becomes possible
  • selwakeup() is passed the selinfo, which is used to traverse the linked list
  • The device removes the horizontal list; the thread removes the vertical

Kqueues and Kevents

  • General event notification facility
  • Stateful interface
  • Register events of interest
    • May modify the event (add, delete, enable, disable)
    • Tracked events are deleted when resource is gone
  • Poll for events using kevent()
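
A hedged example of the stateful kqueue interface described above, using the standard kqueue(2)/kevent(2) calls; the monitored descriptor is illustrative:

```c
/* Register interest in readability of stdin and wait for one event. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int kq = kqueue();
    if (kq == -1) { perror("kqueue"); return 1; }

    /* Register the event of interest: stdin becomes readable. */
    struct kevent change;
    EV_SET(&change, STDIN_FILENO, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);
    if (kevent(kq, &change, 1, NULL, 0, NULL) == -1) { perror("kevent register"); return 1; }

    /* Poll for events; a NULL timeout blocks until something happens. */
    struct kevent event;
    int n = kevent(kq, NULL, 0, &event, 1, NULL);
    if (n == -1) { perror("kevent wait"); return 1; }
    if (n > 0)
        printf("fd %d readable, %ld bytes pending\n",
               (int)event.ident, (long)event.data);

    close(kq);
    return 0;
}
```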

Types of kqueue events

  • Supported kevents on FreeBSD
    • R/W - as in select
    • AIO - async I/O events
    • VNODE - file information has changed
    • PROC - status of processes
    • SIGNAL - posted to processes
    • TIMER - an event-based timer
    • USER - application-defined events
  • Each event has an identifier

Kevent notification

  • Each event specifies a filter flag
  • An event may be another kqueue descriptor

Kevent Implementation

  • Hash list for free form descriptors
  • Descriptors indexed in an array by descriptor number
  • List of pending events

Semantics of Non-Blocking

  • Non-block flag is stored in the file table

Uio Structure

  • IO vector
  • Increment Offset
  • Decrement Resid
  • The segment flag tells you where the I/O is coming from or going to
    • USERSPACE
    • SYSSPACE
    • NOCOPY
  • Pointer to the thread that requested the I/O
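
A hedged userland analogue of the uio idea: writev(2) gathers several iovec segments in one call, much as the kernel walks a struct uio, advancing the offset and decrementing the residual count. The buffers are made up for illustration:

```c
#include <sys/uio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char *hdr = "header: ";
    const char *body = "payload\n";

    struct iovec iov[2];
    iov[0].iov_base = (void *)hdr;     /* first segment */
    iov[0].iov_len  = strlen(hdr);
    iov[1].iov_base = (void *)body;    /* second segment */
    iov[1].iov_len  = strlen(body);

    /* One system call writes both segments in order. */
    ssize_t n = writev(STDOUT_FILENO, iov, 2);
    return (n == -1) ? 1 : 0;
}
```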

Level of Filesystem Interface

  • system call
    • open/close, read/write, link/unlink
  • file table
    • open/close, read/write, ioctl
  • virtual node (vnode)
    • VOP_
  • inode
    • namei,iget/iput
  • block I/O
    • bread/bwrite

Contents of a Vnode

  • type
  • pointer to type specific control block
  • flags
  • reference count
  • pointer to operations
  • pointer to containing file system
  • list of associated clean buffers
  • list of associated dirty buffers
  • count of pending I/O operations

Vnode interface - Name Translation

  • Allows but does not require state
  • Steps in translating pathname

Vnode interface - Vnode Management

  • Single system-wide vnode table
    • vn_inactive() - called when references drop back to zero
    • vn_reclaim() - recycles the vnode
  • A general set of utility routines is provided to manage vnodes for all filesystems

Vnode interface - Name Caching

  • Single system-wide soft-reference name cache with common routines
    • cache_lookup()
    • cache_enter()
    • cache_purge()
    • supports negative caching

Vnode linkages

Vnode Mount interface

  • /etc/mtab info is now maintained in the kernel
    • statfs(2)
    • getfsstat(2) returns the list of mounted filesystems
    • getmntinfo(3), a saner interface than the above
  • Generic mount information can be updated at any time

Buffers

  • Getting buffers
    • bp = bread(vnode_ptr, file_offset, size);
  • Releasing buffers
    • brelse(bp)
    • bwrite(bp)
    • bawrite(bp)
    • bdwrite(bp)

Stackable filesystems

  • Born in UCLA

Loopback mounts

  • allows arbitrary directories to be mounted anywhere else
  • implemented as a filesystem layer
  • basis for other filesystem implementations

Union mounts

  • Merges filesystems together instead of hiding the underlying filesystem

Union mount naming

Union mount issues

  • Remove files from lower layers
    • a whiteout is written in the top layer
  • A removed directory must be marked as opaque to whiteout directories

Union implementation

  • it basically deals with naming
  • all other operations are passed down to the lower layers

Dead file system - DFS

Week 7

Devices

Special Devices

  • Character devices
    • pseudo-terminals
    • frame buffer
    • printer
  • Network devices
    • network interfaces
  • Disk devices
    • disk management (GEOM)
    • filesystem interface

Device Naming and Access

  • Historic device nodes
  • /dev filesystem
  • devd

Character devices entry points

  • Operations supported by any character device

Line disciplines entry points

  • Used by tty devices
  • Divided into 2 parts
    • One for data going down from the kernel to the device
    • One for data coming up out of the device

Termios

  • Programmatic interface to terminal drivers
  • All state contained in one structure
  • struct termios{}

Changing the termios structure

  • tcgetattr()
  • tcsetattr()
  • command flags passed to tcsetattr()
    • TCSANOW - make change immediate
    • TCSADRAIN - drain output, then change
    • TCSAFLUSH - drain output, flush input, then change
  • TCSASOFT - only interested in soft state
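
A hedged example of changing the termios structure with tcgetattr(3)/tcsetattr(3); the raw-mode use case is illustrative, not taken from the class:

```c
/* Put the terminal into raw mode with cfmakeraw(), then restore it. */
#include <termios.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    struct termios saved, raw;

    if (tcgetattr(STDIN_FILENO, &saved) == -1) { perror("tcgetattr"); return 1; }

    raw = saved;
    cfmakeraw(&raw);                      /* disable echo, canonical mode, signals */

    /* TCSADRAIN: let pending output drain, then change the settings. */
    if (tcsetattr(STDIN_FILENO, TCSADRAIN, &raw) == -1) { perror("tcsetattr"); return 1; }

    char c = 0;
    read(STDIN_FILENO, &c, 1);            /* read a single raw keystroke */

    /* Restore the original settings before exiting. */
    tcsetattr(STDIN_FILENO, TCSADRAIN, &saved);
    printf("read character code %d\n", (int)c);
    return 0;
}
```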

Allocation of controlling terminal

  • Historically a side effect of open
  • Now an explicit ioctl: TIOCSCTTY
  • Only session leaders can acquire controlling terminal
  • only one per session

Revocation of controlling terminal

  • SIGHUP when the controlling process disappears
  • revoke the vnode that references the pty

Network driver entry points

  • Entry points in the ifnet structure
    • if_ operations

Network driver control messages

  • Control messages sent using if_ioctl
    • SIO* various operations of the network interface

Network packet reception

  • D buffer owned by device may be filled
  • K buffer owned by kernel and may be read
  • When K buffers are exhausted the kernel stops; when D buffers are exhausted the device drops packets

Disk volume management

  • GEOM layer
    • disk partitioning and labelling
    • disk aggregation
    • collection of I/O statistics
    • I/O optimization such as disk sorting

Mirroring a filesystem

  • GEOM mirror layer
    • mirrors the underlying partition
    • can be inserted to copy a filesystem

GEOM Operation

  • Downwards is handled by g_down thread (single threaded)
  • Upwards is handled by g_up thread
  • Module cannot sleep or compute excessively
  • Modules providing locking can allow direct dispatch
  • When the provider goes away, error is propagated
  • When provider changes, change is propagated up the stack

Disk sorting

Organization of buffers awaiting I/O for a disk

  • currently working on buffer for block 25
  • next buffer to be processed is at block 40
  • blocks to be done in next pass start at switch_point_list

Communication from the top to the bottom of the kernel

  • Acquire a queue mutex and block on CPU
  • Place request on queue
  • If device is idle, start it
  • If synchronous request, sleep
  • Release mutex

From bottom to the top

  • Acquire the mutex
  • Pop next request from queue
  • Release the mutex
  • Do required work
  • If synchronous, wakeup thread(s)
  • If asynchronous, free resources

FreeBSD Disk I/O Subsystem

  • Disk can connect in several ways
  • Non-disk USB and Firewire devices bypass CAM

Physical Drive Management

  • CAM Layer handles disk devices
    • routing through I/O busses to the drive
      • PCI
      • Cardbus
      • Firewire
    • tag queueing
    • serializing requests
    • allocating and freeing DMA resources
  • DMA map works similar to a TLB but much simpler

Autoconfiguration

  • policy
    • support widest possible range of CPU types and configuration
    • minimize information that must be specified in advance
  • strategy
    • decode CPU type
    • check memory controller type
    • find types and locations of possible main I/O busses
    • descend I/O busses, testing for presence of controllers and devices
    • attach and configure discovered controllers and devices

Autoconfiguration Functions

  • device_probe
  • device_identify
  • device_attach
  • device_detach
  • device_shutdown
  • device_suspend
  • device_resume

Device driver bidding

  • Bid process for drivers
  • Highest bid will get it
  • probe - bid - attach

Autoconfiguration data structures for pci0

  • the pci0 device scans its bus and finds an ATA disk controller
  • scanning its drivers, the "atapci" driver wins the auction
  1. Sample autoconfiguration hierarchy (devinfo)

Device virtualization

  • full virtualization lets guest operating system run without change
  • paravirtualization requires guest operating system to use virtualized devices and CPU features
  • the virtio model splits the device driver
    • guest operating system front end sends device commands to hypervisor
    • hypervisor back-end operates physical devices
    • communication is often done using shared memory rings

Bhyve virtualization

  • FreeBSD embeds bhyve to provide virtualized devices, versus Xen, which is a standalone hypervisor that allows guest operating systems to interact with each other
  • Bhyve exposes a virtual PCI bus to the guest
    • guest can probe and attach back-end devices

Xen virtualization

Xen Grant tables

  • grant tables are shared memory used to communicate between a guest operating system and Xen or another guest
  • grant table operations
    • permit_access
    • accept_transfer
    • may limit exported or imported pages to read-only
  • grant table states
    • reading
    • writing

Fast Filesystem

Fast filesystem history

Directory structure

  • all entries start on a 4-byte boundary
  • active entries are linked together

Extended attributes

  • It is a place to put metadata about the file
  • Header of each attribute has:
    • 4-bytes length
    • 1-byte namespace class
    • 1-byte content pad length
    • 1-byte name length
    • name padded so that the contents start on an 8-byte boundary
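
A hedged example of setting and reading an extended attribute through FreeBSD's extattr_set_file(2)/extattr_get_file(2); the file name and attribute name are made up:

```c
#include <sys/types.h>
#include <sys/extattr.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    const char *path = "notes.txt";          /* hypothetical file */
    const char *value = "lecture-notes";

    /* Store the attribute "category" in the user namespace. */
    if (extattr_set_file(path, EXTATTR_NAMESPACE_USER, "category",
        value, strlen(value)) == -1) {
        perror("extattr_set_file");
        return 1;
    }

    /* Read it back. */
    char buf[64];
    ssize_t n = extattr_get_file(path, EXTATTR_NAMESPACE_USER, "category",
        buf, sizeof(buf) - 1);
    if (n == -1) { perror("extattr_get_file"); return 1; }
    buf[n] = '\0';
    printf("category = %s\n", buf);
    return 0;
}
```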

File locking

  • Advisory file locking
    • no lock
    • shared access
    • exclusive access
  • locks are not tied to file access mode
  • locks may be ignored
  • locks may change after the file is open
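
A hedged example of advisory locking with flock(2); the file name is illustrative. A second process asking for LOCK_EX on the same file blocks until the lock is released, while processes that never call flock() are unaffected (the lock is advisory):

```c
#include <sys/file.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("shared.db", O_RDWR | O_CREAT, 0644);   /* hypothetical file */
    if (fd == -1) { perror("open"); return 1; }

    /* Take an exclusive advisory lock on the whole file. */
    if (flock(fd, LOCK_EX) == -1) { perror("flock"); return 1; }

    /* ... update the file while holding the lock ... */
    sleep(1);

    flock(fd, LOCK_UN);   /* release; also released automatically on close */
    close(fd);
    return 0;
}
```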

File locking implementation

Week 8

Filesystems

Quota

  • each "uid" may have quotas associated
  • quotas may be specified by filesystems

Semantics of quota

  • quota may be imposed on both users and groups
  • quota policy
    • Users should stay below their "quota"

Current UNIX Filesystems

  • process descriptor
    • open file entry
      • vnode
        • inode
          • disk: inode and data blocks

Inode

  • Contains all the metadata about the file
  • Symlink targets can be stored directly in the inode
  • direct blocks (point directly to data)
    • single indirect (pointer to a block of pointers to data)
    • double and triple indirect (additional levels of pointer blocks; see the capacity sketch below)
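
A hedged back-of-the-envelope sketch of how much data each level of indirection can address. The block size, pointer size, and direct-block count here (4 KB, 8 bytes, 12) are assumptions for illustration, not figures from the notes:

```c
#include <stdio.h>

int
main(void)
{
    const long long block  = 4096;            /* assumed block size */
    const long long nptr   = block / 8;       /* pointers per indirect block = 512 */
    const long long direct = 12;              /* assumed number of direct blocks */

    long long d  = direct * block;                   /* direct:  48 KB  */
    long long s  = nptr * block;                     /* single:   2 MB  */
    long long db = nptr * nptr * block;              /* double:   1 GB  */
    long long t  = nptr * nptr * nptr * block;       /* triple: 512 GB  */

    printf("direct %lld, +single %lld, +double %lld, +triple %lld bytes\n",
           d, s, db, t);
    return 0;
}
```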

A small filesystem

UNIX Filesystem

I/O in UNIX

  • align operations to block boundaries
  • inodes map logical block numbers to disk blocks
  • write to system buffer and copy back to blocks

Contents of a cylinder group

  • Superblock - static parameter of the filesystem
    • block size
    • fragment size
    • disk characteristics
    • layout policy
  • Cylinder group map - dynamic parameters of the cylinder group
    • free blocks and fragments
    • free inodes
    • allocation summaries
    • recent allocation history
  • Inode - per file information
    • type/owner
    • access/modification times
    • array of pointers to data blocks
  • Data blocks

Optimizing storage utilization

  • Time/space tradeoff
    • Time - big blocks reduce the number of reads/writes to access a file
    • Space - big blocks waste space for small files
  • UNIX files are predominantly small and so use a hybrid block size

Filesystem parametrization

  • Maintain system characteristics
    • time to schedule an I/O operation
    • ability to chain I/O operations
  • Use this information to do block layout ASSUMING sequential access.
  • Laying out blocks in a rotationally optimal fashion

Layout policies

  • need to localize allocation of resources that are accessed together
  • need to spread out unrelated allocations
  • Inode layout policy
    • localize
      • try to place all files contained in a directory in the same cylinder group as the directory
    • spread out
      • Place new directories in a cylinder group with a greater-than-average number of free inodes but the fewest directories. Cluster leaf directories.

Layout policies (part 2)

Data block layout policy

  • Localize
    • Place data blocks together in the same cylinder group as the inode resides
  • Spread
    • Redirect block allocation for a file to a new cylinder group every ten to twenty Megabytes
  • Policy aim
    • Keep some free blocks in every cylinder group
    • Localize all references when many small files

Dynamic block reallocation

  • The size of a file is not known when the file is opened

Implementation of dynamic block reallocation

Optimized Metadata Layout

  • the first 4% of the data area of each cylinder group is held for metadata

Implementation of allocation

  1. Use requested block
  2. Use closest forward block within cylinder group
  3. Use quadratic rehash to pick a different cylinder group
  4. Brute force search of cylinder group

To maintain reasonable options, 5%-8% of the disk is kept free

Filesystem Consistency

  • Metadata must be consistent
    • directories
    • inodes
    • bitmaps
  • Rules
    • Never point to a structure before it is initialized
    • Never reuse a resource before nullifying all previous pointers to it
    • Never reset an old pointer to a live resource before the new pointer has been set
  1. Keep metadata consistent 1

    • Synchronous writes
      • simple and effective
      • create/delete intensive is slow
    • Non-volatile RAM
      • usually runs all operations at memory speed
      • expensive hardware, unavailable on many machines
    • Atomic updates (journaling and logging)
      • create/remove do not slow down under heavy load
      • extra I/O generated, little speed-up for light loads
  2. Keep metadata consistent 2

    • Copy-on-write filesystems (LFS, ZFS, WAFL, etc)
      • write throughput, cheap snapshots, always consistent
      • disk fragmentation, memory overhead
    • Soft updates
      • most operations run at memory speed, reduced system I/O, instant recovery after a crash
      • complex code, background fsck, and increased memory loading

Adding journaling to soft updates

  • Only need to journal operations that orphan resources
  • Journal needs only 16Mbyte independent of filesystem size
  • Filesystem operations that require journaling
    • Increased link count
    • Decreased link count
    • Unlink while referenced
    • Change of directory offset
    • Cylinder-group updates of freed blocks and inodes

Additional requirements of journaling

  • Additional soft updates tracking
  • Reclaiming journal space

Tracking file removal dependencies

  • Ordering constraints
    • log the location in the directory that has the name to be deleted
    • the filename in the on-disk copy of the directory must be deleted
    • log the blocks to be deleted and deallocate the file, zeroing out its on-disk dinode
    • the blocks formerly referenced by the inode for the file must be released to the free-space bitmap
    • log the successful completion of removal

Journaled Crash recovery

  • Crash recovery is done by fsck
  • Recovery steps
    • Scan journal
    • Link count increases
    • Link count decreases
    • Free inodes with zero link count
    • Free inodes that were unlinked but busy
    • Free unallocated blocks

Snapshots

  • Create a copy-on-write image of a filesystem image
    1. suspend processes initiating system calls that modify the filesystem
    2. allow all modifications in progress to complete
    3. write out all dirty buffers to disk
    4. create an empty snapshot file the size of the filesystem partition
    5. mark the blocks that are currently in use
    6. resume write operations on the filesystem
    7. on each disk write, check whether the block has been copied, making a copy if the write is for an in-use block that has not yet been copied

Snapshot implementation

Week 9

ZFS filesystem

ZFS overview

  • Never over-write an existing block
  • It is always consistent
  • State atomically advances at checkpoints
  • Snapshots(read-only) and clones(read-write) are cheap
  • Metadata redundancy and data checksums
  • No need for fsck
  • Selective data compression and deduplication
  • Filesystems share a storage pool
  • Quota
  • Mirroring and multiple parity RAIDZ

ZFS Modules

  • ZFS is monolithic

Functional organization

  • GEOM used to provide access to disks
  • GEOM used also to present zvol as disks

Structural organization

  • Multiple layers
    • Meta-object-set
    • Object-set
      • Array of inodes or dnodes
  • The uberblock anchors the pool
  • Meta-object-set (MOS) describes array of filesystems, clones, snapshots, and ZVOLs
  • Each MOS object references an object-set that describes its objects
  • The space map is at the MOS layer; if an object-set needs more space it just asks the MOS

Uberblock

  • The uberblock describes the entire ZFS pool
  • Areas are used as a circular queue

Dnode

  • Equivalent of inode in FFS
  • Describe files and directories
  • But also snapshots, clones and etc.
  • A dnode is embedded with a znode when it describes a file
  • Blocks range in size from 4 KB to 128 KB

ZFS Block pointers

  • It is huge, much larger than an FFS block pointer

ZFS block management

  • Disk blocks are kept in the pool

ZFS Structure

  • Huge picture page 532-535

ZFS Relationships

  • Clones can only be taken of a snapshot
  • Can make multiple clone of same snapshot
  • Clones can be promoted to the filesystem
  • Can also take snapshots and clone ZVOLs

ZFS Checkpoints

  • Checkpoint affects all filesystems, clones, snapshots and ZVOL

Writing new data

  • Checkpoints are taken when one of:
    • five seconds have passed
    • 64 MB of dirty data accumulates
    • administrative task (like a snapshot)
  • To take a checkpoint
    • finish in-progress system calls
    • flush all dirty data
    • complete by writing new uberblock

Flushing dirty data

Logging

  • Handled by ZIL
  • Used to record changes between checkpoints
  • Not only metadata but content (a log, not a journal)
  • Data over 32 KB is written to disk and a pointer is put in the log

RAIDZ

  • Traditional RAID has fixed stripe size
  • RAIDZ has variable stripe size

RAIDZ Recovery

  • Rebuild by traversing all MOS objects and rebuilding their blocks
  • never need to recalculate and write parity

Storage pool allocator

  • Manages the space in the pool
  • It is a bitmap
  • Allocation proceeds as:
    • Find disk with the most free space
    • select the least-fragmented fixed-size map
    • allocate space closest to previous allocation
  • SPA also handles
    • compression
    • deduplication

Snapshots

  • take a checkpoint
  • allocate new dnode in the MOS
  • copy the block pointer from the fs dnode to the snapshot dnode
  • link the snapshot dnode into the head of filesystem's snapshot list
  • move the filesystem deadlist to the snapshot

Freeing filesystem blocks

  • Blocks are tracked using space maps, birth time and deadlist
  • When a block is allocated, its birth time is set to the current transaction number
  • Over time snapshots are taken which also reference the block
  • When a file is overwritten, truncated, or deleted, its blocks are released
  • For each freed block, the kernel must determine whether a snapshot still references it
    • if born after most recent snapshot, it can be freed
    • otherwise it is added to the filesystem's deadlist

Deduplication

  • It is done on a pool-wide basis
  • Duplicate blocks are identified by hash (using the checksums)

Remote replication

  • backup of pool either locally or remotely
  • can do full backup
    • stored as a single file
    • expanded back to its original form
  • can do incremental backup since backup of earlier snapshot
  • tracks the progress of partially completed backups

ZFS strengths

  • high write throughput
  • fast RAIDZ reconstruction on pools with less than 40% usage
  • avoid RAID "write hole"
  • blocks move between filesystems as needed
  • integration eases administration (mount points, exports, etc)

ZFS weakness

  • Slowly written files scattered on the disk
    • That is why it uses a lot of memory
  • Slow reconstruction on pools with greater than 40% usage
  • Block cache must fit in the kernel's address space, thus works well only on 64-bit
  • Needs under 75% utilization for good performance
  • RAIDZ has high overhead for small block size such as 4kb blocks typically used by databases and ZVOLs
  • blocks cached in memory are not part of the unified memory cache, so it is inefficient for files and executables using mmap() or sendfile()

Week 10

NFS

Sun Microsystems Network File System

  • Widely available
  • Divided into two halves
    • fileserver exports locally attached disks to client hosts
    • client hosts that import filesystems from central server
  • NFS basically started open source
  • Sun promoted connectathons to make sure every implementation was interoperable

NFS v3

  • stateless protocol
  • benefits
    • can continue to operate when some systems fail
    • fast recovery when failed components return
  • drawbacks
    • sync writes
    • high network traffic

NFS v4

  • stateful protocol
    • state is bounded to speed recovery
  • benefits
    • can continue limited operations when some systems fail
    • bounded recovery time when failed components return
    • minimal network traffic validating clients cache
    • less need for separate side protocols (e.g., mount and lock)
    • sessions (added in nfs 4.1) ensure idempotency of requests
  • drawbacks
    • sync writes
    • server required to maintain and manage states

NFS File identification

  • file handles
    • created by server to identify files when first looked up
    • used by client in all NFS requests after open
  • file handle contents
    • filesystem identifier
    • inode number
    • generation number

NFS v3 daemons

  • rpcbind (all)
  • mountd (server)
  • nfsd (server)
  • nfsiod (client)
  • lockd (all)
  • statd (all)

NFS v3 mount operation

NFS v3 I/O

NFS v4 goals

  • good performance on the internet
  • strong security with negotiation built into the protocol
  • good cross-platform interoperability
  • designed for protocol extension
  • provide protocol support for clustered-server deployment

NFS v4 changes

  • compound RPCs
  • lock operations included in the protocol
  • explicit open and close operations
  • pseudo-filesystem hierarchy to support multiple roots
  • extensive attribute support
    • required attributes (16)
    • recommended attributes (43)
    • extensible attributes (key/value pairs)
  • access control lists
  • delegation and callbacks (leases)

NFS v4 daemons

  • rpcbind (all)
  • nfsd (server)
  • nfscbd (client)
  • nfsuserd (client)

NFS v4 mount operation

NFS v4 sessions

NFS v4 delegation/recall

NFS v4 crash recovery

  • after client crash
    • client attempts its first mount from the server
    • server finds previous client state (if not timed out yet) and releases it
  • after server crash
    • server sends BAD_SESSION replies to out of date client IDs
    • client realizes the server has crashed and reestablishes its state
    • server gives clients time (default 15 seconds) to rebuild state before handing out new state

Week 11

Networking and Interprocess Communication

Graph of communicating processes

  • Domain A has P1 and P2 processes
  • Domain B has P2, P3 and P4
  • Protocols used for Addressing Domains
  • Domain addressing is the problem of finding a process within a domain
    • No centralized authority

Name server

  • Domain based distributed name server
  • gethostbyname() does an RPC to a name-server daemon, usually run on a gateway
  • BIND was made by one of his students, who got an A because it was deployable

The ISO networking model

  • Came up to create a standard way for interoperability
  • OSI layers came out to put in the communication protocol layer
  • The Internet in the EU was born from a guy buying modems and distributing them to friends, before the Internet protocols (UUCP)

Packet traffic

  • mbuf designed to hold network data

Socket system call interface

  • socket, bind, listen, write, read, etc…

Types of network address

  • local domain
  • IPv4
  • IPv6

Use of IPC/Networking facilities

  • Outline of a simple "remote write"
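
A hedged sketch of that "remote write" outline using the standard socket interface; the address and port are made up for illustration:

```c
/* Client: create a socket, connect to a server, and write a buffer. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);   /* TCP socket */
    if (s == -1) { perror("socket"); return 1; }

    struct sockaddr_in sin;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(12345);                       /* hypothetical port */
    inet_pton(AF_INET, "192.0.2.10", &sin.sin_addr);   /* hypothetical host */

    if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
        perror("connect");
        return 1;
    }

    const char *msg = "hello from the client\n";
    if (write(s, msg, strlen(msg)) == -1)              /* the "remote write" */
        perror("write");

    close(s);
    return 0;
}
```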

System layering

  • sockets
  • protocols
  • network interface and hardware

UDP

  • it is the absolute minimum you can get away with
  • layer 4 transport, datagram protocol
  • demultiplex on port number
  • optional checksum

TCP

  • layer 4, connection oriented
  • provides everything that UDP does not
  • flow control (windowing)
  • port to port connection
  • acknowledgements
  • may piggybacked, coalesced
  • 3-way handshake on open, 2-way handshake on close

SCTP

  • Reliable multiple-stream transport protocol
  • provides associations of up to 65k independent streams
  • good for multi-streams applications, no need for multiple handshakes

IP

  • layer 3
  • designed for packet-switched network
  • both IPv4 and IPv6
    • host-to-host addressing
    • routing
    • time to live
  • only IPv4
    • fragmentation and reassembly
    • type of service
    • options: source route, record route, timestamp

Mbufs

  • limited to 224 bytes of data
  • data can be written at the beginning or at the end of the buffer

Mbufs continued

  • mbufs come from a pool, allocated in a separate page map
  • pages are allocated when the mbuf is allocated and returned when the last reference is freed
  • the mbuf data segment was converted into other fields
  • mbuf data sits outside of the mbuf for larger data sizes
  • external mbufs

Socket data queuing

  • stream socket
    • packets are linked to the next
    • no boundaries without application interaction
  • datagram socket
    • are record based
    • if you read a datagram you get the entire datagram
  • 13 is the number of root servers because that is the number of servers that fit in a UDP packet

Socket interface protocols

  • pru*

Network interface addresses

  • ifnet
  • pf1_addr
  • pf2_addr

Network interface to link layer

  • driver, one per device type
  • link-layer drivers are like normal drivers but easier
    • there is not much in the link layer you really have to do

Net to protocol interface

  • input
    • common routines, copy or map data into mbufs
    • save interface id with packet

Week 12

The Network Layer

Internet Addressing - IP Version 4

  • IHL = IP header length
    • number of 32-bit words in the header
  • The ID field is mostly used by fragmentation
  • TTL
  • Header checksum is for the IP header only
  • 3 layer routing, network -> subnet -> host

Internet Addressing - Subnets

  • 22bit network

IP Multicast

  • Class D reserved for multicast
  • Interesting how multicast is created on gateways on the path of the publisher
  • Gateways have to have multicast support if they are doing multicast
  • IGMP
    • add, leave group
    • programming interface: setsockopt/getsockopt
    • on Ethernet group address is added to interface filter
  • Multicast agent
    • listens to IGMP
    • forwards multicasts to remote members

Address Resolution Protocol

  • Problem
    • how to map network-protocol-level addresses to a link-level address
  • Solution
    • broadcast an "ARP" request containing the host being sought
    • all the machines on the wire get the request
    • the machine that matches sends a response back to the broadcaster containing the link-level address to use
    • the network code keeps a cache of ARP translations (in the routing table)

Internet Addressing - IP Version 6

  • 128-bit addresses were planned way ahead
  • Autoconfiguration
    • router discovery
    • neighbor discovery
  • Security (IPSec)

IPv6 addressing

  • Version
  • traffic class and flow
  • next-header used to describe the type of routing among a lot of other things like authentication and encapsulation

IPv6 autoconfiguration

  • Router discovery
    • send router solicitation message to router "anycast" address
    • receive router advertisement messages containing the router IP address and router link-level address
    • mimicked in IPv4 by DHCP
  • Neighbor discovery
    • send neighbor solicitation
    • receive neighbor advertisement containing the neighbor IP address and neighbor link-level address
    • Mimicked in IPv4 by ARP

Internet Protocols - ICMP

  • part of IPv4, IPv6 has something similar
  • error messages (include header of problem + 8 bytes)
    • destination unreachable
    • service unavailable
    • time-to-live expired
    • header problem
  • control messages
    • redirect
    • source quench
  • low level network functions
    • echo (ping)
    • net address request/response
    • netmask request/response
    • timestamp request/response

ICMP Redirect processing

  • ICMP redirects cause direct modification of the kernel route, changing the gateway address
  • Redirects on FreeBSD accepted only from current router

IP Forwarding

  • First thing is to check if forwarding is enabled
  • check destination, no loopback, no broadcast
  • routing lookup
    • if the packet came from the same wire, an ICMP redirect message will be sent

Routing goals

From the 1980s:

  • most machines should only need one network interface
  • network topology should not be visible to users
  • routing should have dynamic reconfiguration
  • multi-interface machines should potentially be able to act as gateways

Separation of policy and mechanism

  • kernel implements mechanism
  • intelligent routing process determines policy
  • simple hosts may be simple
  • sophistication should be easily configurable
  • routing can be arbitrarily complex

Routing design

  • User level process: routed
  • routing daemons exchange information to determine best routes
  • routing daemons add, delete and replace routes in the kernel table to reflect best routing
  • kernel limited to looking up routes in its table

Kernel routing table

  • all routes are in a single hierarchical table organized as a radix tree
  • routes include variable-length destination, mask, and host or gateway
  • routing table holds/caches round-trip time, hop count, packet size
  • link layer may allocate space for link-layer information/cache (ifaddr structure augmented with link-layer calls for route management)
  • "cloning" routes create host-specific routes dynamically; may use external resolution agents

Routing: radix search algorithm

Example radix tree:

  • circles are internal node
  • bit position tested is shown in circle
  • leaf nodes are rectangles containing a key and mask
  • some interior nodes are associated with masks

– Picture here of radix tree routing algorithm

Kernel routing lookup

  • Callers set destination in route structure, then rtalloc() looks up an rtentry, returning a reference
  • rtalloc1() does a lookup, optionally cloning the target route if necessary

Kernel routing socket

  • Routing ioctl commands replaced with message-oriented routing socket
  • Changes to the kernel routing table are requested by messages via routing sockets
  • Responses include message ID, errno, are sent to all listeners
  • Routing socket also reports other routing changes, redirects, lookup failures, interface up/down

Routing strategies

  • DHCP
    • assigns IP address
    • installs route to gateway
  • Routed (RIP)
    • dynamic local routing
    • best strategy for clusters of local networks
  • Gated
    • OSPF
    • EGP
    • BGP

Routing Daemon (routed)

  • dynamic local routing using variant of Xerox NS routing information protocol
  • Bellman-Ford (vector-distance) algorithm
  • Gateways broadcast destination and hop for known routes
  • Routes may "count to infinity" when destinations become unreachable; infinity is chosen to be 16
  • Monitors network links for proper function, drops routes through down links
  • On host with one interface no updates are sent

OSPF routing protocol

  • Open Shortest-Path-First routing protocol
    • shortest path first, or link-state routing algorithm
    • each router maintains topology database, monitor link state
    • two-level hierarchy: routing areas connected by backbone
    • internal routers have complete topology for their area
    • backbone routers have complete information about areas and backbones

IPSec Goals

Goals:

  • Authentication - knowledge of a key proves you are who you claim to be
  • Confidentiality - data is encrypted across the connection
  • Replay protection - packets are sequenced at the application level so that replay attempts can be detected

IPSec transport mode

Key:

  • AH - authentication header
  • ESP - encapsulating-security payload
  • SPI - security-parameter index

IPSec tunnel mode

  • Router to router connection
  • Encryption happens only between the two connected routers
  • Host-to-router connections are possible; the host has to have the key

IPSec key management

  • Keys are stored in the kernel and managed with a key socket using these messages:
    • getspi
    • update
    • add
    • delete
    • get
    • acquire
    • register
    • expire
    • flush
    • dump

Packet-processing frameworks

  • Packet processing frameworks used to:
    • implement firewalls
    • NAT
    • debug network problems
    • provide framework for network-research testbeds
  • Available filter frameworks:
    • BPF - identify and copy selected packets
    • IP firewall (IPFW) - modify or drop selected packets
    • Dummynet - traffic shaping
    • Packet Filter (PF) - drop selected packets
    • Netgraph - provides data-flow between network processing modules
    • Netmap - direct access to network packet rings

Berkeley Packet Filter

  • Generally used via tcpdump, which uses pcap to express filters
  • Scans all incoming packets and copies ones that match the selection criterion
  • Never drops or modifies packets, but does not allow packets to be injected into the input stream
  • pcap compiles a high-level description to simple byte codes (which a JIT turns into native machine code)
  • Can avoid copying by using shared user-space/kernel buffers
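
A hedged sketch of the pcap path described above (compile a filter to BPF byte code and capture matching packets); the interface name and filter expression are made up:

```c
#include <pcap/pcap.h>
#include <stdio.h>

static void
handler(u_char *user, const struct pcap_pkthdr *hdr, const u_char *bytes)
{
    (void)user; (void)bytes;
    printf("captured %u bytes\n", hdr->caplen);
}

int
main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];

    /* Open an interface (hypothetical name) for live capture. */
    pcap_t *p = pcap_open_live("em0", 65535, 1, 1000, errbuf);
    if (p == NULL) { fprintf(stderr, "pcap_open_live: %s\n", errbuf); return 1; }

    /* Compile the high-level filter into BPF byte code and install it. */
    struct bpf_program prog;
    if (pcap_compile(p, &prog, "tcp port 80", 1, PCAP_NETMASK_UNKNOWN) == -1 ||
        pcap_setfilter(p, &prog) == -1) {
        fprintf(stderr, "filter: %s\n", pcap_geterr(p));
        return 1;
    }

    pcap_loop(p, 10, handler, NULL);   /* copy 10 matching packets */
    pcap_close(p);
    return 0;
}
```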

IPFW

  • Single central dispatch receives all incoming packets and each packet can be:
    • passed through unchanged
    • copied to additional packet streams
    • diverted to an alternate packet stream
    • subject to NAT
    • reassembled for further inspection
    • sent to dummynet and/or netgraph
    • dropped

Decisions on what to do with packets are described by rulesets loaded by the system administrator

Dummynet

Packet-processing framework that provides:

  • traffic shaping
  • packet delay emulation
  • packet scheduling

Pipe emulates a communications link with a scheduler arbitrating multiple independent queues that features:

  • programmable bandwidth and delay
  • selectable queue priorities
  • variable number of queues
  • scalable to many thousands of pipes and queues
  1. Dummynet scheduling policies

    • FIFO - constant running time with no service guarantees
    • weighted round robin - constant run time with limited service guarantees
    • quick fair queueing - constant run time with bounded service guarantees
    • weighted fair queueing - logarithmic run time (in number of queues) with optimal service guarantees

Packet filter (PF)

  • FreeBSD version is PFv1, PFv2 still to be ported
  • Similar functionality to IPFW
  • Developed under OpenBSD and ported to FreeBSD
  • Packets are normalized then run through a set of rules provided by the system administrator
  • Packets that pass the first set of rules run through another ruleset specific to their protocol type (TCP, UDP, ICMP, etc)
  • Packets that pass second set of rules are passed through for normal processing

Netgraph

Built around a data-flow model:

  • Nodes provide data processing
  • Edges provide packet flow between nodes

Currently 50 nodes available ranging from simple node that echos its input to a complex node that implements Bluetooth

Nodes have a separate control channel to receive configuration (as well as input and output)

Nodes can be added as loadable kernel modules

  1. Netgraph example

    • Ethernet bridge

Netmap

  • network interface packet rings directly mapped into the application address space
  • system calls to request packet transmission and be notified of packet reception
  • access via command to/from: /dev/netmap
    • NIOCREGIF command to attach to an interface
    • attaching stops the flow of packets into the kernel stack
    • software ring provided to selectively copy packets into stack and accept packets from the stack

Netmap receive ring

  • tail and head sync with kernel/hardware -> userland

Netmap transmit ring

  • similar to the receive ring; synchronizes with the kernel periodically to update head and tail on the ring
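
A hedged sketch of a netmap receive loop using the helpers from net/netmap_user.h (the interface name "em0" is illustrative):

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <poll.h>
    #include <stdio.h>

    int
    main(void)
    {
        struct nm_desc *d = nm_open("netmap:em0", NULL, 0, NULL);
        struct pollfd pfd;
        unsigned int i;

        if (d == NULL)
            return (1);
        pfd.fd = d->fd;
        pfd.events = POLLIN;
        for (;;) {
            poll(&pfd, 1, -1);          /* kernel syncs head/tail with the hardware ring */
            for (i = d->first_rx_ring; i <= d->last_rx_ring; i++) {
                struct netmap_ring *ring = NETMAP_RXRING(d->nifp, i);
                while (!nm_ring_empty(ring)) {
                    struct netmap_slot *slot = &ring->slot[ring->cur];
                    char *buf = NETMAP_BUF(ring, slot->buf_idx);    /* packet data, zero copy */
                    printf("received %u bytes at %p\n", (unsigned)slot->len, (void *)buf);
                    ring->head = ring->cur = nm_ring_next(ring, ring->cur);
                }
            }
        }
    }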

Week 13

The Transport Layer

Protocol Control Block

  • TCP example
  • No other connection may have the same four-tuple (local address, local port, foreign address, foreign port)
  • Connections are looked up in a hash table keyed by the four-tuple (see the sketch below)
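
A toy illustration of the four-tuple hash lookup idea (the structure, hash function, and names are made up for the example; they are not the kernel's in_pcb code):

    #include <stddef.h>
    #include <stdint.h>

    struct conn {
        struct conn *next;
        uint32_t laddr, faddr;      /* local and foreign addresses */
        uint16_t lport, fport;      /* local and foreign ports */
    };

    static struct conn *hashtab[1024];

    static unsigned int
    conn_hash(uint32_t laddr, uint16_t lport, uint32_t faddr, uint16_t fport)
    {
        return ((laddr ^ faddr ^ lport ^ fport) & 1023);
    }

    /* At most one connection owns a given four-tuple, so the first match wins. */
    static struct conn *
    conn_lookup(uint32_t laddr, uint16_t lport, uint32_t faddr, uint16_t fport)
    {
        struct conn *c;

        for (c = hashtab[conn_hash(laddr, lport, faddr, fport)]; c != NULL; c = c->next)
            if (c->laddr == laddr && c->lport == lport &&
                c->faddr == faddr && c->fport == fport)
                return (c);
        return (NULL);
    }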

Internet Protocol control block

  • Create control block
  • Bind local address and port
  • Set remote address (connect, sendto)
    • source is chosen according to outgoing network interface after routing lookup
  • Look up control block with matching address and port
  • Deallocate control block

TCP Sequence space

  • so_snd.sb_hiwat - Socket buffer high-water mark
    • maximum number of bytes accepted from the application
  • so_snd.sb_cc - Socket buffer character count
    • bytes written by the application that are still in the send buffer
  • snd_wnd - Send window
    • the window advertised by the receiver for flow control; the sender may not send beyond snd_una + snd_wnd
  • snd_nxt - next sequence number to send (data between snd_una and snd_nxt has been sent but not yet acked)
  • All flow is controlled by sequence numbers
  • The window cannot shrink; it can only move forward
  • Sequence numbers are negotiated during the handshake
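
A hedged sketch of how these fields bound what the sender may transmit (variable names mirror the fields above; the congestion window cwnd is added for completeness, and this is not the actual tcp_output() code):

    #include <stdint.h>

    long
    sendable_bytes(uint32_t snd_una, uint32_t snd_nxt, uint32_t snd_wnd,
        uint32_t cwnd, long sb_cc)
    {
        long offset = (long)(snd_nxt - snd_una);        /* data already in flight */
        long window = snd_wnd < cwnd ? snd_wnd : cwnd;  /* smaller of flow and congestion windows */
        long len = (sb_cc < window ? sb_cc : window) - offset;

        return (len > 0 ? len : 0);
    }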

TCP Packet header

  • source port
  • destination port
    • the IP header already carries the local and foreign addresses
  • sequence number
  • acknowledgment number
  • data offset
  • Urgent pointer only used when URG flag set
  • TCP protocol limits options to 40 bytes
  • Window scaling can be negotiated
    • a scaling factor is picked by which the advertised window is multiplied
  • Checksums

TCP timers

  • Five TCP timers
    • 2MSL (twice the maximum segment lifetime), 60 seconds, used while waiting for the close to complete
    • KEEP (keep alive) 75 seconds, send keep alive or drop dormant connection
    • PERSIST (persist) 5 to 60 seconds, force a connection to persist
    • REXMT (retransmit), 30 milliseconds to 64 seconds, called when retransmission is necessary
    • DELACK (delayed acknowledgement), 100 milliseconds, send a delayed ack to the peer

TCP Output states

  • Idle
    • no data has been sent that has not yet been acked, send queue is empty
  • Transmitting
    • data have been sent and not yet acked
    • REXMT timer is running
  • Persisting
    • Window is zero or too small to send
    • PERSIST timer is running

TCP Round-trip timing

RTT timing

  • Counter (rtt) is started when the first data segment is sent from the idle state
  • Counter is sampled when the timed sequence number is acked
  • Smoothed round-trip time (srtt) and variance (rttvar) are computed as moving averages of rtt samples using fixed-point arithmetic
  • Sample is discarded if timed data is retransmitted
  • If timestamps are in use, round-trip time of every packet can be measured
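
A minimal fixed-point sketch of the classic estimator (the 1/8 and 1/4 gains and the rule RTO = srtt + 4*rttvar follow Jacobson's algorithm; the real kernel keeps the values pre-scaled differently):

    #include <stdint.h>

    void
    rtt_sample(int32_t rtt, int32_t *srtt, int32_t *rttvar, int32_t *rto)
    {
        int32_t delta = rtt - *srtt;

        *srtt += delta >> 3;                /* srtt = 7/8 * srtt + 1/8 * rtt */
        if (delta < 0)
            delta = -delta;
        *rttvar += (delta - *rttvar) >> 2;  /* rttvar = 3/4 * rttvar + 1/4 * |delta| */
        *rto = *srtt + 4 * *rttvar;         /* retransmission timeout */
    }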

Connection establishment

  • Opening 3-way handshake
    • client sends SYN to server
    • server sends SYN+ACK to client
    • client sends ACK to server
  • Closing 2-way handshake
    • send FIN
    • receive ACK
    • FIN_WAIT_2 until other side does 2-way
  • During opening handshake each side defines:
    • initial sequence number
    • max segment size (MSS)
    • window size
  • Options both sides must offer in order to be used:
    • window scaling
    • timestamp request with units used
    • selective acknowledgement (SACK)

Socket connection queueing

  • Socket in accept mode
  • Complete and Incomplete list
  • sonewconn1() duplicates head socket
  • soqinsque() adds to a queue
  • soqremque() removes from queue
  • so_incomp links partially set up connections
  • so_comp links connections ready for acceptance
  • Need to cope with syn-flooding attacks

Handling SYN attacks

The SYN cache is a hash table holding SYN state

  • lookup when SYN received
    • if not found create entry
    • if found, resend SYN/ACK
  • never ACK any data sent in initial SYN packet

SYN Cookies used when SYN Cache is full

  • maintain no state
  • encode the local and foreign addresses and ports and the MSS, and place them in the initial sequence number of the SYN/ACK
  • if timestamps are supported, also encode the send and receive window scaling and SACK support, and place them in the timestamp field
  • the peer must respond within 16 seconds with a valid ACK (subtract 1 from the acknowledged sequence number, re-encode, and check that they match)

Connection metrics

Metrics cached about a host

  • rmx_mtu - MTU for this path
  • rmx_ssthresh - outbound gateway buffer limit
  • rmx_rtt - estimated rtt
  • rmx_rttvar - rtt variance
  • rmx_bandwidth - estimated bandwidth
  • rmx_sendpipe - outbound delay bandwidth product
  • rmx_recvpipe - inbound delay bandwidth product

Looked up at connection establishment

Updated at connection shutdown

Cached entries kept for one hour after last use

Silly window syndrome avoidance

  • Silly window syndrome - offering or accepting windows less than maximum segment size when max window is large
  • Receiver silly window avoidance: if the window that would be offered is less than 1/4 of the receive buffer and less than one max segment, offer a zero window (see the sketch below)
  • Sender silly window avoidance: if window is too small for all data to be sent, and window is less than max segment, wait for window update or persist timeout.
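
A minimal sketch of the receiver-side check (names are illustrative; the equivalent test lives in the kernel's tcp_output()):

    /* Return the receive window to advertise given the free buffer space. */
    long
    receive_window_to_offer(long space, long rcv_buf_size, long maxseg)
    {
        if (space < rcv_buf_size / 4 && space < maxseg)
            return (0);     /* too small to be useful: keep offering a zero window */
        return (space);
    }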

Delayed acknowledgement

  • Delayed ack is used to piggyback ack on window updates and application-level responses
  • Delayed ack also allows multiple packets to be acked at once
  • Implementation:
    • Delayed-ack flag is set when new data arrives
    • After process takes data, if window update would move window by 2 max-size segments, window update and ack are sent.
    • If data is returned, the ack is piggybacked on it
    • The DELACK timer runs every 100 ms and forces out acknowledgements not yet sent

Small packet avoidance (Nagle algorithm)

  • Problem: how to coalesce small segments into a larger one without delaying transmission too long
  • Alternative solution: set a "short" timer when the first character is received and transmit when the timer expires. (The timer isn't long enough to help slow nets, but is just enough to irritate users on local nets.)
  • Nagle algorithm: send immediately when a small amount of data is presented and nothing is outstanding; otherwise, hold additional data until a full packet can be sent or until the first segment is acked (see the sketch below)
  • Certain clients send small segments that receive no response but expect rapid action; an option (TCP_NODELAY) is needed to defeat the algorithm
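
A hedged sketch of the Nagle send decision (names are illustrative, not the kernel's):

    /* Return 1 if a segment should be sent now, 0 if the data should be held. */
    int
    nagle_ok_to_send(long bytes_ready, long maxseg, int unacked_data, int nodelay)
    {
        if (bytes_ready >= maxseg)
            return (1);     /* a full segment can be sent: always send */
        if (!unacked_data)
            return (1);     /* connection is idle: send the small segment immediately */
        if (nodelay)
            return (1);     /* TCP_NODELAY set: application defeated the algorithm */
        return (0);         /* otherwise hold until a full segment or the ACK arrives */
    }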

Slow start

  • Van Jacobson
  • Observations from Van: in steady state, ACKs from receiver should tell transmitter when to send new data
  • Problem: how to reach steady state.

Acknowledgement clocking

Congestion avoidance

  • Slow start doesn't solve congestion
  • Solution: maintain an estimate of the window size acceptable to the network (ssthresh)
  • Whenever a packet is lost, set ssthresh to half the current window
  • Slow start opens the window exponentially up to ssthresh
  • The window is opened linearly after slow start, probing network capacity (see the sketch below)
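
An illustrative per-ACK and per-loss window update in the spirit of the above (simplified; not the kernel's New Reno module):

    /* cwnd, ssthresh, and maxseg are all in bytes. */
    void
    on_ack(unsigned long *cwnd, unsigned long ssthresh, unsigned long maxseg)
    {
        if (*cwnd < ssthresh)
            *cwnd += maxseg;                    /* slow start: one segment per ACK (exponential per RTT) */
        else
            *cwnd += maxseg * maxseg / *cwnd;   /* congestion avoidance: about one segment per RTT */
    }

    void
    on_timeout(unsigned long *cwnd, unsigned long *ssthresh, unsigned long maxseg)
    {
        *ssthresh = *cwnd / 2;                  /* halve the estimate of acceptable window */
        if (*ssthresh < 2 * maxseg)
            *ssthresh = 2 * maxseg;
        *cwnd = maxseg;                         /* restart from slow start */
    }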

TCP retransmission

  • REXMT timer is set to an initial value when sending data
  • If timer expires:
    • snd_nxt is set to snd_una
    • ssthresh is set to half the current window
    • cwnd is set to 1 segment
    • tcp_output is called, transmits one segment
  • Timer is restarted, backed off at exponential rate
  • After 12 retransmissions the connection is dropped; during fast recovery, one new packet may be sent for each additional duplicate ack
  1. Fast retransmission

    • the receiver keeps acking the last in-order data until the missing packet is retransmitted; duplicate acks trigger the retransmission without waiting for the REXMT timer

Selective acknowledgement

  • Tells the sender which packets it has received beyond the currently ACKed data
    • Acked to 500
    • SACK indicates that 1000-2000 and 3000-3500 have been received
    • Sender will respond by sending 500-1000 and 2000-3000, then resume at 3500

Modular congestion control

  • Until FreeBSD 7, the kernel had a single congestion-control algorithm (New Reno from 4.4BSD)
  • Congestion control algorithms can now be added as a loadable kernel module
  • Currently five available:
    • New Reno (default)
    • Vegas
    • CUBIC
    • Hamilton Institute's delay based
    • H-TCP

Modular congestion control (II)

  • Each connection gets to choose its algorithm (using the TCP_CONGESTION option to setsockopt(); see the sketch after this list)
  • The chosen algorithm is invoked every time an ACK is received
  1. Algorithms

    • New Reno
      • detection based on retransmission timer
      • exponential backoff
    • Vegas
      • On each ACK measures the difference between expected and actual rate
      • if difference below threshold, increase the congestion window
      • if above, decrease
      • if within threshold, make no change
    • CUBIC
      • Designed to quickly find network capacity on long fast link (1Gbit US to Japan has 110ms RTT and takes 10 min to reach capacity)
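
A sketch of selecting a congestion-control module on a socket with the TCP_CONGESTION option (assumes the corresponding module, e.g. cc_cubic, is loaded):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>

    /* Ask the kernel to use the "cubic" algorithm for this connection. */
    int
    use_cubic(int sock)
    {
        const char *algo = "cubic";

        return (setsockopt(sock, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo)));
    }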

Basic TCP send policy

Stream Control Transmission Protocol

  • Reliable multiple-stream transport protocol
  • Provides associations between hosts with up to 65k independent streams
  • Each stream can be message- or stream-based
  • Associations setup with 4-way handshake to avoid SYN attack
  • No server state required until third handshake
  • Data can start flowing on third handshake
  • Streams setup within an association do not require handshake

SCTP chunks

  • Chunks look like TCP packets
  • with the addition of the multi-stream handling
  • transmission sequence number
  • stream identifier tells which stream
  • stream sequence number tells which packet within stream
  • though the length field is 16 bits, messages may be made up of multiple chunks

SCTP multiple paths

  • SCTP allows an association to have multiple addresses/routes using sctp_bindx() and sctp_connectx() (see the sketch below)
  • One route is selected for use
  • Backup routes send heartbeat packets to confirm viability
  • If the selected route fails, switch to an alternate
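
A hedged sketch of binding an SCTP socket to two local addresses with sctp_bindx() (the addresses and port are made-up examples):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <arpa/inet.h>
    #include <string.h>

    int
    bind_two_addresses(int sd)
    {
        struct sockaddr_in addrs[2];
        int i;

        memset(addrs, 0, sizeof(addrs));
        for (i = 0; i < 2; i++) {
            addrs[i].sin_family = AF_INET;
            addrs[i].sin_len = sizeof(addrs[i]);    /* FreeBSD sockaddrs carry a length */
            addrs[i].sin_port = htons(5000);
        }
        inet_pton(AF_INET, "192.0.2.10", &addrs[0].sin_addr);      /* example addresses */
        inet_pton(AF_INET, "198.51.100.10", &addrs[1].sin_addr);

        /* the association may now use either address/route */
        return (sctp_bindx(sd, (struct sockaddr *)addrs, 2, SCTP_BINDX_ADD_ADDR));
    }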

Week 14

System startup and shutdown

Booting overview

  • Firmware initializes hardware and loads the first-level bootstrap
  • First level bootstrap identifies disk label and loads second-level bootstrap (8kb to 64kb)
  • Second level bootstrap interprets filesystem to load full bootstrap
  • Final stage loads kernel and kernel modules and starts the kernel
  • Kernel runs autoconfig and starts remaining CPUs
  • Process 1 (/sbin/init)

Firmware bootstrap

  • Initialize low-level hardware
  • Optionally run hardware diagnostics
  • Pre-boot services such as RAID configuration and remote management
  • Select boot CPU and put remaining CPUs in suspend mode
  • Build list of attached I/O infrastructure to export to the kernel using ACPI or FDT
  • Identify possible boot devices and select one
  • Read in the first-level bootstrap from the front of the boot device and branch to it to start it running
  • Provide rudimentary I/O capabilities until kernel can operate devices by itself

First level bootstrap

  • Contains either a (historic) MBR or more modern GPT label
  • Use label to identify location of the FreeBSD bootable partition
  • Load first 64K of FreeBSD partition and branch to it to start it running

Second level bootstrap

  • gptboot to load from UFS
  • gptzfsboot to load from ZFS

Actions taken by the second-level bootstrap

  • enable access to full memory
  • fetch boot.config and interpret any flags
  • load /boot/loader

Final stage bootstrap

Full featured interactive loader

  • choices of kernel to load
  • inspect and set kernel env vars
  • preload kernel modules
  • preload security keys
  • preload memory-disk images
  • filesystem access
  • forth script interpreter

Default at startup is to run /boot/loader.4th

  • load /boot/kernel/kernel and any modules in /boot/loader.conf
  • at end of countdown, boot loaded kernel

Kernel loadable modules

Modules can be loaded into the kernel

  • when it is compiled
  • when it is loaded
  • when it is running (if not disabled)

Modules register an event handler

  • called at startup to initialize the module
  • called at shutdown to clean up the module

Each module has a version number to allow field upgrades. A module can register new system calls that it supports (see the sketch below).
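
A minimal sketch of a loadable module skeleton with its event handler (the module name "example" is illustrative):

    #include <sys/param.h>
    #include <sys/module.h>
    #include <sys/kernel.h>
    #include <sys/systm.h>

    static int
    example_modevent(module_t mod, int type, void *arg)
    {
        switch (type) {
        case MOD_LOAD:
            printf("example module loaded\n");      /* initialize the module */
            return (0);
        case MOD_UNLOAD:
            printf("example module unloaded\n");    /* clean up the module */
            return (0);
        default:
            return (EOPNOTSUPP);
        }
    }

    static moduledata_t example_mod = {
        "example",          /* module name */
        example_modevent,   /* event handler */
        NULL                /* extra data */
    };

    DECLARE_MODULE(example, example_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);
    MODULE_VERSION(example, 1);     /* version number allows field upgrades */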

Kernel boot

Stages of kernel startup

  • Assembly-language setup and enabling memory management unit
  • Platform-specific C-language startup of architecture specific interrupt vectors, stacks, and crafting first kernel process and thread
  • Setup basic kernel services: memory allocation, scheduling, locking
  • Run through kernel module initialization to initialize higher-level services (filesystem, network stack, start /sbin/init)

Process 1 will be forked from process 0 (the swapper)

Kernel module initialization

Register with:

  • SYSINIT(name, subsystem, order, function, identifier) to start
  • SYSUNINIT(name, subsystem, order, function, identifier) to stop

Ordering based on 2-level hierarchy:

  • subsystem is first level listed in /sys/sys/kernel.h
  • order is second level to differentiate modules in same subsystem

Module invoked as function(identifier)

Sorted at startup, then executed in order
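
A small sketch of registering startup and shutdown functions (the subsystem and order values are illustrative):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/systm.h>

    static void
    example_start(void *identifier)
    {
        printf("starting: %s\n", (char *)identifier);   /* invoked as function(identifier) */
    }

    static void
    example_stop(void *identifier)
    {
        printf("stopping: %s\n", (char *)identifier);
    }

    SYSINIT(example_start, SI_SUB_DRIVERS, SI_ORDER_ANY, example_start, "example");
    SYSUNINIT(example_stop, SI_SUB_DRIVERS, SI_ORDER_ANY, example_stop, "example");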

Kernel modules (part I)

Defined in /sys/sys/kernel.h

  • SI_SUB_TUNABLES
    • set subsystem parameters from loader tunable
  • SI_SUB_VM
    • early virtual memory
  • SI_SUB_KMEM
    • kernel malloc() initialization
  • SI_SUB_HYPERVISOR
    • Xen HVM hypervisor detection
  • SI_SUB_WITNESS
    • configure witness lock-debugging facility
  • SI_SUB_MTX_POOL_DYNAMIC
    • initialize dynamically allocated mutex pools
  • SI_SUB_LOCK
    • initialize statically declared or allocated locks
  • SI_SUB_EVENTHANDLER
    • register statically declared event handlers

Kernel modules (part II)

  • SI_SUB_VNET_PRELINK
  • SI_SUB_KLD
  • SI_SUB_CPU
  • SI_SUB_RACCT
  • SI_SUB_RANDOM
  • SI_SUB_KDTRACE
  • SI_SUB_MAC
  • SI_SUB_MAC_POLICY
  • SI_SUB_MAC_LATE

Kernel threads

  • SI_SUB_INTRINSIC - hand craft proc 0

  • SI_SUB_AUDIT - create audit worker thread

  • SI_SUB_CREATE_INIT - fork process 1

  • SI_SUB_SCHED_IDLE - create idle thread

  • SI_SUB_INTR - create device-driver threads

  • SI_SUB_SOFTINTR - create callout thread, netisr thread

  • SI_SUB_DRIVERS - GEOM thread, random device

  • SI_SUB_CLOCKS - create and start system clock threads

  • SI_SUB_KTHREAD_PAGE - create vm_pageout daemon thread

  • SI_SUB_KTHREAD_VM - create vm_daemon thread

  • SI_SUB_KTHREAD_BUF - create buf_daemon

  • SI_SUB_KTHREAD_UPDATE - create vnlru_proc daemon

  • SI_SUB_KTHREAD_UPDATE - create sched_sync daemon

  • SI_SUB_KTHREAD_INIT - schedule init process to run

Mounting the root and /dev filesystems

When the init thread starts running in the kernel, it must get the root filesystem mounted

  • Mount devfs filesystem as root ("/") filesystem
  • Search through "rootdevnames" for a viable root filesystem
  • Mount selected root filesystem on /root
  • Swap filesystems: /root => / and / => /dev

Find and load /sbin/init. Exit the kernel into the init process.

User-space startup

Load additional kernel modules. In single-user mode, init spawns a shell. In multi-user mode, init spawns a shell to run the commands in /etc/rc:

  • Check filesystems as needed and then mount them
  • Run rcorder to select order that scripts in /etc/rc.d and /usr/local/etc/rc.d should run based on REQUIRE and PROVIDE declarations in each script
  • Run the script using /etc/rc.conf and /etc/default/rc.conf to configure subsystems to be run (such as sshd) and the options with which they should run

System shutdown

Shutdown happens in 3 steps

  1. Shut down services that need to store data in the filesystem
  2. Shut down the filesystems
  3. Shut down the remaining services

Services register using EVENTHANDLER_REGISTER(name, function, argument, priority)

  • The name identifies which shutdown step
    1. shutdown_pre_sync
    2. shutdown_post_sync
    3. shutdown_final
  • The priority works like the order parameter of SYSINIT()
  • Event is invoked as function(argument)
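
A sketch of registering a handler for the post-sync shutdown step (the handler and its registration point are illustrative):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/systm.h>
    #include <sys/eventhandler.h>
    #include <sys/reboot.h>

    static void
    example_shutdown(void *arg, int howto)
    {
        printf("example: shutting down (howto=%d)\n", howto);
    }

    static void
    example_register(void *arg)
    {
        /* invoked as function(argument) at the shutdown_post_sync step */
        EVENTHANDLER_REGISTER(shutdown_post_sync, example_shutdown, NULL,
            SHUTDOWN_PRI_DEFAULT);
    }
    SYSINIT(example_shutdown_reg, SI_SUB_DRIVERS, SI_ORDER_ANY, example_register, NULL);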

Performance monitoring and system tuning (tools)

  1. Tuning the filesystem (tunefs(8) options)

    -m minimum percentage of free space
    -e maximum number of blocks (2048 = 32 Megabytes)
    -a enable or disable access control lists (disable)
    -l enable or disable MAC labels
    -j enable or disable journaling
    -s specify average number of files in a directory (64)
    -f specify average file size (16384)
    -o optimization to minimize system time or to minimize space used on disk (time)
    
  2. System monitoring with systat

    Output of "systat -vmstat"

    Load average should be less than the number of CPUs

  3. systat -ifstat

    Output of "systat -ifstat"

Kernel tracing

Ktrace and truss; this section is about ktrace

  • Provide kernel tracing logging
    • system calls
    • pathname translation
    • signal processing
    • I/O
  • Tracing can be flexibly applied
    • selectively for running processes
    • for individual commands
    • for process groups
    • through inheritance to all current or future children

Kernel trace display

  • Trace files are generated in binary format
  • Translation to readable form is done by kdump
  • Useful translations are optionally added
    • system call numbers to names
    • ioctl values
    • system errors with standard error string
  • output can display running time between and during system calls
  1. trace example

Administration of kernel information

  • most information available through sysctl()
  • extensible MIB-style interface
  • allows selective retrieval of kernel data structures
    • process table
    • routing table
    • vnode table
  • allow selective setting of kernel variables
    • networking: IP forwarding, redirect, enable, time-to-live
    • debugging: filesystem cluster enabling
    • limits: maxproc, maxfiles, maxvnodes

MIB = Management Information Base
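
A small user-level sketch of reading one of these variables with sysctlbyname(3):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
        int maxproc;
        size_t len = sizeof(maxproc);

        if (sysctlbyname("kern.maxproc", &maxproc, &len, NULL, 0) == -1) {
            perror("sysctlbyname");
            return (1);
        }
        printf("kern.maxproc = %d\n", maxproc);
        return (0);
    }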

Getting process information

  • Process structures are no longer contiguous, and other info is scattered.
  • Solution: define a kinfo_proc structure:
    • useful fields from proc
    • xucred
    • useful fields from vmspace
    • session and pgrp info (login, pgid, tty)
    • wchan message
  • Interface provides selective retrieval of the process table
    • by proc id
    • by proc group id
    • by session
    • by effective user id
    • by real user id
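
A sketch of retrieving the kinfo_proc entry for a single process by pid via sysctl(3):

    #include <sys/param.h>
    #include <sys/sysctl.h>
    #include <sys/user.h>       /* struct kinfo_proc */
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct kinfo_proc kp;
        size_t len = sizeof(kp);
        int mib[4] = { CTL_KERN, KERN_PROC, KERN_PROC_PID, getpid() };

        if (sysctl(mib, 4, &kp, &len, NULL, 0) == -1) {
            perror("sysctl");
            return (1);
        }
        printf("pid %d command %s\n", kp.ki_pid, kp.ki_comm);
        return (0);
    }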

Program interface

  • Goals
    • Want all programs that access kernel tables to work on a running or dead kernel
    • Want to keep machine dependent information in a single place
  • Provide a "kernel virtual memory" (kvm) library
    • uses sysctl for running kernels
    • knows how to extract tables for dead kernel
    • maintains a cache of kernel namelist entries for running kernel to avoid repeated nlist() calls
  • Provided that programs do not use machine-dependent fields in the tables they extract, they are portable from one architecture to another once the kvm library has been ported (see the sketch below)
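
A sketch of using the kvm(3) library to list processes; passing NULL for the kernel and core files selects the running kernel, while passing saved kernel/vmcore paths examines a crash dump (link with -lkvm):

    #include <sys/param.h>
    #include <sys/sysctl.h>
    #include <sys/user.h>
    #include <fcntl.h>
    #include <kvm.h>
    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        char errbuf[_POSIX2_LINE_MAX];
        struct kinfo_proc *kp;
        kvm_t *kd;
        int i, cnt;

        kd = kvm_openfiles(NULL, NULL, NULL, O_RDONLY, errbuf);
        if (kd == NULL) {
            fprintf(stderr, "kvm_openfiles: %s\n", errbuf);
            return (1);
        }
        kp = kvm_getprocs(kd, KERN_PROC_PROC, 0, &cnt);    /* whole process table */
        for (i = 0; i < cnt; i++)
            printf("%5d %s\n", kp[i].ki_pid, kp[i].ki_comm);
        kvm_close(kd);
        return (0);
    }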

Kernel debugger

FreeBSD does kernel debugging remarkably well

  • Better than Solaris
  • Better than Linux (mostly because of Linus's attitude of "read the code" and figure it out)

Low-level built-in debugger runs on console

High-level kernel debugging facility provides complete symbolic debugging using kgdb(1) or lldb(1)

  • symbolically examine and modify memory locations
  • set breakpoint in the kernel
  • single step the kernel

Usage - enable the debugger when booting, then stop:

  • at first instruction
  • when debugger is attached
  • on kernel trap/panic

Kernel crashes

Enable crash dumps

  • location specified using dumpon(8)
  • usually configured for primary swap area
  • default is "mini-dump" which saves only kernel memory

Recovering a crash dump

  • after reboot run savecore(8) to extract
  • can then run kgdb(1) or lldb(1) to peruse the crash dump
  • general hints given in crash(8)
  • wealth of information collected using crashinfo(8) and textdump(4)

Netstat memory statistics

netstat -m -N kernel.80 -M vmcore.80

System malloc memory statistic

vmstat -m -N kernel.80 -M vmcore.80

System malloc memory statistics (part II)

Continuation of the "vmstat -m" output

System zone memory statistics

vmstat -z -N kernel.80 -M vmcore.80

Example ps output

  • ps -axlw

Example fstat output

  • fstat | more

Example of netstat output

  • netstat -N kernel.80 -M vmcore.80

DTrace

DTrace provides detailed information throughout the application stack

  • system calls
  • libraries
  • application code

many handy one-line DTrace scripts: https://wiki.freebsd.org/DTrace/One-Liners

Performance monitoring counters

  • pmc
    • details of all activity on a CPU
    • activity of a process
  • Things that can be measured:
    • interrupts taken
    • instructions executed
    • branches executed and mispredicted
    • number of non-idle processor cycles
    • instructions cache misses and prefetches
    • data cache misses, prefetches and writebacks

Using PMC

  • pmc(3) describes types of counters
  • pmccontrol -L - lists names of available counters for current machine
  • pmcstat -T -S event-spec - gives a top-like display of the listed counters

Week 15

System measurement and performance tuning

Background

  • Understand the operating system
  • While reading about it do not turn a page without understanding it completely

Observability tools

  • Good observability tools will help you find unnecessary work
  • They help you eliminate this unnecessary work and tune parameters

Recap tools:

  • systat
  • pidstat
  • vmstat
  • iostat

Tracing tools

  • truss
  • tcpdump

Methodologies

  • Anti-pattern
  • Workload Identification
    • who
    • what
    • why
    • how
  • USE Method
    • Utilization
    • Saturation
    • Errors
  • Off-CPU

Benchmark

  • 100% of benchmarks are wrong
  • Use only for experimental analysis
  • Types:
    • micro benchmarks
    • macro benchmarks
    • bad benchmarks

Profiling

  • Great for understanding what a benchmark is doing
    • pmcstat
    • dtrace
  • pmcstat
    • Cycles per instruction
      • unhalted cycle count
      • instruction count
  • dtrace

Tracing

  • truss
  • ktrace
  • DTrace

Tuning

  • sysctl -aW
  • /etc/sysctl.conf
  • Treat tunable parameters as someone else's medicine cabinet