
Table of Contents

Week 1

Introduction

Logistics

  • Info about Hillside
  • Class times

Material overview

  • Man pages
  • Outline
    • Listed in the printed material

Introduction to the kernel

Unix VM

  • Unix was about portability
    • The VM is built upon 5 properties
      • paged virtual addr space
        • If you want to allocate more than physical memory
      • software interrupts
        • Takes hardware events and turns them into software events
        • Catch signal interrupt and deal with things like division by zero
        • Call your signal handler
        • Unified interface for hardware events
      • timers and counters
      • identifiers for accounting, protection and scheduling
      • descriptor
        • Handle of some kind of I/O
        • Before that, each thing had its own kind of descriptor

Process

  • Fundamental to the unix philosophy
  • Input/output connected through pipes creates a pipeline of execution
  • Process is made of 4 main pieces:
    • CPU Time
      • Scheduling made by kernel
    • Asynchronous Events
      • Interrupts have to be dealt with: signals, network traffic
    • Memory
      • How virtual memory is managed
      • How to make the hardware provide an address space
    • I/O Descriptors
      • Any useful process is going to have I/O Descriptors
      • Most fundamental service provided by UNIX

System calls

  • Top half of the kernel
    • The top half has its own per-process stack, kept in real physical memory so it is there whenever the kernel wants it
  • More malloc and free in kernel code because only a limited stack is available
  • Bottom half
    • Scheduled by interrupts
    • The top half will be preempted when the bottom half interrupts
    • You can block, but generally don't, in the bottom half

Entering the kernel

  • Part 1:

    1. Hardware traps
    2. System calls
      • Special case of hardware traps
    3. Hardware interrupts
      • Async (disk io complete, user typed a key)
      • Higher priority
    4. Software-initiated traps or interrupts
      • Lower priority
  • Part 2: Switch to kernel mode

    1. Push PC and PSW onto the kernel per-process stack
    2. Push trap type or syscall number onto the stack
    3. Save general purpose registers
    4. Call handler (syscall()) for a syscall, trap() for a trap

Returning from the kernel

  1. Timers

    • Three per-process interval timers; an optional value is reloaded when one expires (see the setitimer() sketch after this list)
    • ITIMER_REAL
      • Decrements in real time
      • Sends SIGALRM on expiration
    • ITIMER_VIRTUAL
      • Decrements in user virtual time
      • Sends SIGVTALRM on expiration
    • ITIMER_PROF
      • Decrements in user + system virtual time
      • Sends SIGPROF on expiration
    • Select timer is independent from the interval timers
    • Kqueue can also create timers
  2. Timeouts

    • Arrange a timeout in the kernel
      • handle = timeout(function, arg, ticks)
    • Cancel it
      • untimeout(handle)
    • Compute the number of ticks
      • ticks = hzto(timevalp)
    • Timeout routines are called from softclock() at splsoftclock
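
A minimal userland sketch of the interval-timer interface above, assuming standard setitimer(2)/SIGALRM behavior (not code from the class):

```c
/* Arm ITIMER_REAL with setitimer(2) and catch SIGALRM. */
#include <sys/time.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t fired;

static void
on_alarm(int sig)
{
    (void)sig;
    fired = 1;            /* SIGALRM is delivered when the real-time timer expires */
}

int
main(void)
{
    struct itimerval itv = {
        .it_value    = { .tv_sec = 1, .tv_usec = 0 },  /* first expiration in 1s */
        .it_interval = { .tv_sec = 1, .tv_usec = 0 },  /* then reloaded every 1s */
    };

    signal(SIGALRM, on_alarm);
    setitimer(ITIMER_REAL, &itv, NULL);   /* decrements in real time */

    while (!fired)
        pause();                          /* sleep until the signal arrives */
    printf("timer fired\n");
    return 0;
}
```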

Week 2

Locking

Historic Synchronization

  • When a lock is not available, sleep on it
    • wchan (wait channel) receives a pointer to the resource
    • processes waiting for that specific resource will be woken

Lock Hierarchy

  • From the most simple to the most complex
    • Hardware
    • Spin Locks
    • Locks that block briefly
      • blocking mutex
      • pool mutex
      • reader-writer locks
      • read-mostly
    • Locks using the sleep-queue interface
    • Witness

Turnstiles

  • sys/sys/proc.h
    • data structures use locks to protect access to their fields
  • Designed for short periods
  • Used to protect read and write
  • Track current lock holder
    • If you hold a short-term lock and go to sleep, the kernel will panic
  • Priority propagation

Turnstiles implementation

  • Hash headers to quickly find locks
  • A turnstile is needed each time a thread blocks
    • A thread can only block on one thing

Sleep queues

  • Used by shared-exclusive locks, condition variables
  • No propagation
  • You are not allowed to own a turnstile when requesting a sleep-queue lock
  • The queue tracks the exclusive lock holder
  • May be recursive

Sleep interface

  • wchan and priority
    • msleep
  • sleep on lbolt (lightning bolt)
  • The wait syscall waits for children to exit, putting the parent to sleep in its own context
  • sigpause
    • The only way to exit a sigpause is a longjmp
  • sleep implementation
    • lock sleep hash header
    • record wchan, hash to find the sleep queue
    • set td_priority (for scheduling)
    • place on the appropriate sleep queue
    • call sched_switch() to run the next thread
    • sleep thread is not running until awakened
    • if PCATCH is set check for signals
  • wakeup implementation
    • awaken ALL sleepers on wchan starting from the front
  • wakeup_one
    • awaken first sleeper on wchan
  • To awaken a thread
    • remove from sleep queue
    • recompute kg_user_pri for long sleepers
    • if sleeping: make runnable, place on run queue
    • if stopped: "unsleep", remain stopped
  • request CPU rescheduling by setting NEEDRESCHED

Critical Sections

  • protected by locks

Hardware requirements for lock

  • Locking is not possible without assistance from the hardware
  • Test-and-set
    • Very expensive instructions
  • FreeBSD uses compare-and-swap
    • Instead of a simple flag in memory, the lock word is a pointer
    • Owner field for a free lock has MTX_UNOWNED
    • Owner field for a held lock has the pointer to owning thread
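
A minimal userland illustration of the compare-and-swap idea above, using C11 atomics; this is a sketch, not FreeBSD's actual mtx code, and all names are made up:

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_uintptr_t owner;   /* 0 == unowned, otherwise pointer to owning thread */
} cas_lock;

#define CAS_UNOWNED ((uintptr_t)0)

static void
cas_lock_acquire(cas_lock *l, uintptr_t self)
{
    uintptr_t expected = CAS_UNOWNED;

    /* Spin until we atomically swap UNOWNED -> self. */
    while (!atomic_compare_exchange_weak(&l->owner, &expected, self))
        expected = CAS_UNOWNED;   /* CAS wrote the current owner here; reset and retry */
}

static void
cas_lock_release(cas_lock *l)
{
    atomic_store(&l->owner, CAS_UNOWNED);
}
```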

Spin mutex

  • Exclusive access only
  • Loops waiting for the mutex to become available
  • Run inside the critical section
  • More expensive than a blocking mutex

Blocking mutex

  • Exclusive only
  • Adaptive spinning: only spins if the owner is running
  • All waiters are awakened when the lock is released
    • Cheaper to release an uncontested lock
    • Often end up scheduling sequentially

Pool Mutex

  • Used for small short-lived data structures
    • Just need a pointer to a mutex rather than a large mutex itself
    • Mutexes are preallocated, avoiding high creation and destruction costs
  • Example: the poll syscall, which needs a structure to track poll requests

Reader and Writer locks

  • Exclusive and shared mutex
  • Uses a turnstile so cannot be held when you sleep
  • Priority propagation for exclusive not for shared access
  • May specify permission to recurse

Read-Mostly lock

  • Same properties as reader-writer
  • Priority propagation for shared
  • Designed for fast access
    • Read without locking then check if write happened
    • if so redo the reading
  • The routing table is a good example of this lock
  • IBM holds a patent on RCU
    • GPL code is allowed to use it
    • FreeBSD is not GPL, so it cannot

Shared-Exclusive locks

  • Fastest and simplest of the locks that can sleep
  • Shared and exclusive
  • Permission to recurse
  • May request interruptions (PCATCH)
  • No priority propagation

Condition variables

  • wrapper on traditional sleep and wakeup
  • Allow wake one or all waiters
  • Must hold a mutex before awakening or waiting
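
A hedged kernel-style sketch of the mutex-plus-condition-variable pattern, using the interfaces documented in mutex(9) and condvar(9); the queue names and logic are made up for illustration, not taken from the class:

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

static struct mtx queue_mtx;
static struct cv  queue_cv;
static int        queue_len;          /* illustrative shared state */

static void
queue_setup(void)
{
    mtx_init(&queue_mtx, "queue lock", NULL, MTX_DEF);
    cv_init(&queue_cv, "queue cv");
}

static void
queue_put(void)
{
    mtx_lock(&queue_mtx);             /* must hold the mutex before awakening */
    queue_len++;
    cv_signal(&queue_cv);             /* wake one waiter (cv_broadcast wakes all) */
    mtx_unlock(&queue_mtx);
}

static void
queue_get(void)
{
    mtx_lock(&queue_mtx);             /* must hold the mutex before waiting */
    while (queue_len == 0)
        cv_wait(&queue_cv, &queue_mtx);  /* sleeps, releasing and reacquiring the mutex */
    queue_len--;
    mtx_unlock(&queue_mtx);
}
```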

Locking Manager locks

  • Most full-featured of the locks that can sleep
  • Downgrade (from exclusive to shared)
  • Upgrade (from shared to exclusive)
  • An exclusive upgrade prevents other exclusive locks
  • Can pass ownership of a lock from a thread to the kernel
  • You need to clean up before leaving the kernel
  • Ability to drain all accessing threads for deallocation
  • No propagation

Witness

  • Partial ordering requires:
    1. A thread acquires only one lock in a class
    2. A thread may acquire only a lock in a higher-numbered class
      • Class 1 > Class 2 - a thread with locks in class 1 can acquire class 2 locks
  • Programmers can define lock classes
  • Witness code observes actual lock ordering and complains when either rule is violated
  • Very, very expensive
  • Debate exists about enabling in production

Processes

Process Resources

  • CPU Time
    • What runs when
  • Async events
    • timers
    • hardware exceptions
    • external events
    • I/O
  • Memory
    • text/data
    • heap
    • runtime stack
  • I/O Descriptors
    • files
    • pipes
    • socket
    • stream

BSD Process structures

  • Process entry
    • All UNIX systems have one; Solaris and Linux process entries may be different
  • Process structures dynamically allocated
  • Placed in linked list

Open file information

  • filedesc structure
  • The open-file array is reallocated when it overflows

Other Process Substructures

  • Process limits (plimit)
    • array of rlimit
    • copied lazily
  • Process accounting/statistics (pstats)
    • rusage (Resource usage)
    • virtual-time timer
    • profiling
    • start time
  • Signal actions
    • Pending signals
      • sigacts: per-signal actions
      • signal stack info
      • current state during sigpause

Processes versus Threads

  • kthread model (started in BSD world)
    • One process structure
    • Multiple threads
  • Fork model (started in the Linux world)
    • Each thread has its own process structure
    • same model as that used by Linux
  • kthread context switches are way cheaper
  • The rfork model is slower at context switching

Week 3

Process

Process States

  • runnable
  • sleeping
  • stopped
    • stopped by job control or tracing
    • return to sleep or run
  • minor_states
    • new
    • zombie
      • Zombies won't close socket descriptors because TCP may still have data sitting around
      • Init's job is to wait for orphans left by other processes

Finding processes entries

  • by process ID
    • hashed, linked through p_hash index
    • pfind()
  • by state
    • run queue: td_runq
    • sleep queue: td_slpq
    • lock queue: td_lockq
  • by type
    • on one of allproc or zombproc

Process entries of related processes

  • The p_ prefix always refers to fields of the process structure

Scheduling classes

  • REALTIME
  • ULE
  • 4BSD
  • IDLE

Week 4

Security

Security mindset

  • UNIX was designed with some security in mind
  • Users have existed in UNIX since the beginning
    • Windows didn't have users until NT

Trusted Computing Base

  • Things that have to be secure
    • Kernel
    • Boot scripts
    • Core utilities (shell, login, ifconfig)
    • Libraries used by core utilities (libc)
  • Solid crypto base

Process credentials

  • process credentials (ucred structure)
    • effective uid (setuid)
    • group set (gid)
    • real uid, gid
    • saved uid, gid from set*id program
  • exported credentials (xucred structure) - outside kernel
    • effective uid
    • group set

Set-User-Identification Functionality

  • seteuid
  • setegid
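
A small hedged example of the set-user-ID mechanism using standard seteuid(2); the structure is illustrative, not code from the class:

```c
/* A set-user-ID program temporarily dropping and restoring its effective UID. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    uid_t real = getuid();        /* the invoking user */
    uid_t privileged = geteuid(); /* owner of the set-user-ID executable */

    /* Drop privileges: act as the real user for ordinary work. */
    if (seteuid(real) == -1) {
        perror("seteuid(real)");
        exit(1);
    }

    /* ... do unprivileged work ... */

    /* Regain privileges via the saved UID recorded at exec time. */
    if (seteuid(privileged) == -1) {
        perror("seteuid(privileged)");
        exit(1);
    }
    return 0;
}
```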

Immutable and Append-only flags

  1. Immutable limitation

    • Immutable files can be changed only in single-user mode
    • Append-only files can only be rotated in single-user mode
    • Direct hardware access is restricted
    • All startup activities need to be protected

Jails

  • Started from chroot
  • Jails were created by demand from the user base (ISPs)
  • vnet
  • vem0a - vem0b (master slave virtual net)

Jails rules

  • Permitted
    • running or signaling process within the jail
    • change files within jail
    • bind ports
    • access raw, divert, or routing sockets on the jail's virtual network
  • Not permitted
    • get info from processes outside jail
    • change kernel variables
    • mount or unmount filesystems
    • modify physical network interfaces
    • reboot

Random number generation

  • Applications access it via /dev/random
  • Uses Yarrow, a PRNG
  • Yarrow
    • reuse existing crypto primitives such as hash and block encryption
    • framework for entropy acquisition
    • an entropy accumulator
    • a reseed mechanism
    • a generation mechanism (SHA-256 and AES)
  • Yarrow will be replaced by Fortuna on FreeBSD 11

Access control lists (ACL)

  • File permission bits
  • ACL capabilities
    • Look up
    • Administration permissions
      • Change acl on the file
      • List of users
      • List of groups
    • Default ACL Inheritable

Access control list semantics

  • ZFS uses NFSv4 ACL model (compatible with windows ACL)
  • NFSv4 uses inheritable ACLs rather than the POSIX default ACL

Privileges (Kernel only, no userland interface)

  • Nearly 200 privileges
  • In the kernel, checks for uid == 0 also call priv_check()

Privileges applied

  • Jails uses privileges to control what a process can do
  • MAC is a loadable kernel module

Auditing

  • Used to find out what happened when things go bad
  • Good auditing should include intrusion detection
  • OpenBSM
  • Looks like truss or strace

Auditing handling

  • auditd daemon
    • starts a kernel thread that controls record distribution

Capsicum

  • Sandbox
    • More powerful than jails
  • Untrusted code runs in a separate process with limited access
  • cap_enter()
  • No access to filesystem

Sample Capsicum capability

  • Defined at /sys/sys/capability.h
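
A hedged sketch of capability mode using the interfaces described in capsicum(4), cap_rights_limit(2), and cap_enter(2); the file path is made up for illustration:

```c
#include <sys/capsicum.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    cap_rights_t rights;
    int fd = open("/var/log/example.log", O_RDONLY);   /* hypothetical path */
    if (fd == -1) { perror("open"); return 1; }

    /* Limit the descriptor to read-style operations. */
    cap_rights_init(&rights, CAP_READ, CAP_FSTAT);
    if (cap_rights_limit(fd, &rights) == -1) { perror("cap_rights_limit"); return 1; }

    /* Enter capability mode: no further global namespace access
     * (opening new paths now fails with ECAPMODE). */
    if (cap_enter() == -1) { perror("cap_enter"); return 1; }

    char buf[128];
    ssize_t n = read(fd, buf, sizeof(buf));   /* still allowed on the limited fd */
    printf("read %zd bytes inside the sandbox\n", n);
    return 0;
}
```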

Virtual Memory

Virtual Memory Layout

  • Top part of the address space used by the kernel
    • malloc() memory, kernel thread stacks
    • no page faults on kernel memory
    • load everything, really
  • Bottom part for userland
    • Context switch happens here
    • Trampoline code for signals
    • User stack pages are not executable
    • First page left invalid to catch null-pointer dereferences
  • Heap grows up, stack grows down
  • Shared libraries sit a couple of pages after the heap limit

Virtual memory hardware operation

  • Virtual address to Physical address
  • MMU
    • L1 cache will probably have a page cached
    • TLB - Translation Lookaside Buffer
    • Page fault - interrupt
    • The MMU must be informed when a new page is loaded
  • Context switches are expensive because a whole new page directory needs to be loaded

Virtual Memory Design

  • Process address space as regions
    • A region is a range of address space that is treated the same way
    • Shared memory
    • File Mapping
    • Copy on write after fork
  • The VM design is architecture independent

User interface to Virtual Memory

  • mmap()
    • maps file into memory
  • munmap()
    • removes mapping from memory
  • msync()
    • sync hardware caches to main memory
  • mprotect()
    • control access to parts of the address space
  • madvise()
    • advise the kernel on how an area of memory will be used
    • e.g., a garbage collector advises random access
  • mlock()
    • forces pages to be resident, not paged out
    • used by databases for transactions, preventing pages from going away
    • call msync() to sync main memory from caches
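
A hedged example of the mmap() interface described above; the mapped path is illustrative only:

```c
/* Map a file read-only with mmap(), touch its contents, then unmap. */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    const char *path = "/etc/motd";            /* illustrative file */
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

    /* Map the whole file; the kernel backs it with a vnode object. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    madvise(p, st.st_size, MADV_SEQUENTIAL);   /* hint: sequential access */
    fwrite(p, 1, st.st_size, stdout);          /* pages fault in on demand */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```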

Week 5

Virtual Memory

Physical memory and swap space

  • when malloc fails you probably used all your address space
  • linux does not enforce address space size

Virtual Memory organization

  • Data Structures
    • Per process: shared per address space, per thread
    • Per address space: collection of objects

Process Address Space

  • most of the code is machine independent
  • vmspace
    • vm_map
    • vm_pmap (p for physical)
    • statistics (about the vm system, page faults, workset size)
  • vm_map_entry
    • anonymous object
    • vnode / object
    • start and end address
  • The VM asks the object whether it has a page

Process Address Space Sharing

  • Private mapping
    • Shadow objects between the vm_entry and the file object/vnode
    • In low-memory scenarios, pages are taken from the vnode and put into memory

Shadow object chains

  • When a process with a MAP_PRIVATE mapping forks, its child must get its own shadow object

Machine-dependent virtual-memory description

  • On 32-bit systems the kernel can only have 2 GB of address space

Kernel memory allocator hierarchy

  • Memory in kernel space
    • submaps
    • vmem arena
    • slabs out of the arena
    • buckets out of slabs
    • buckets are per-CPU

Kernel Maps and Submaps

  • kernel described by map entries

Vmem Data Structures (Arena)

  • bt boundary tag
  • power-of-2 free list
  • BESTFIT or INSTANTFIT
  • hash list for used pages, pointed to by the boundary tags

The Slab Allocator

  • the slab allocator is responsible for allocating a zone of a particular thing
  • all objects are the same size

The Keg Allocator

  • manages a particular zone
  • track slab objects
  • list for full, partially allocated and empty slabs
  • request a new slab when everything is in use

The Zone Allocator

  • Lock per zone
  • Each CPU has its own lockless list protected by a critical section
  • Each CPU has its own bucket of memory
  • Two buckets per CPU

Fast Kernel malloc implementation

  • Memory pool characteristics
    • Size is fixed
    • Max size is no bigger than phys mem
  • Memory mapping hardware can be directly controlled
  • Known usage patterns
    • Most allocations are small
    • Many allocations are for a power-of-2 sized piece of memory

Interface look and feel

  • Simple and familiar, similar to C's malloc() and free()
  • Small allocation
    • use power of two
    • called using malloc and free
  • Large common structures
    • use zone-based allocation
    • more compact than power-of-two
    • avoid cache line problems
    • avoid initializing structures
    • must reclaim unused memory from zones

Types of pages

  • Wired - kernel
  • Active - In use
  • Inactive - valid contents, dirty
  • Cache - valid contents, clean
  • Free - invalid contents, ready for use
  • Necessary to have higher priority page access

Paging strategies

  • free pages for the free list
    • no longer freed explicitly
    • on process exit, anonymous pages go to the free list
  • To add inactive pages to the cache list
    • remove oldest page of the inactive list
    • if this is the first inspection, mark it scanned
    • if dirty, put it back on the inactive list
    • pages on the cache list need to be clean
  • Inactive and cache are LRU

Paging strategies (part II)

  • To add to inactive list
    • Remove oldest page from active list
    • if in use, increment active count and put on the end of active list
    • else not in use, decrement active count
      • if active count greater than zero, put on end of active list
      • else clear scanned flag and move to end of inactive list

Superpage Motivation

  • The problem is the limited size of the TLB
  • if TLB takes more than a cycle it is slow

Superpage Solution

  • 4k pages
  • 2Mb pages on PAE
  • 4Mb pages without PAE

Superpage Reservation

  • On FreeBSD the kernel decides when to use superpages
  • Others allow programs to decide
  • Reservation starts at the beginning (first fault), before small pages are already in use
    • Anonymous memory is always eligible for superpages
    • A mapped file of at least superpage size is eligible
  • Only save TLB misses

Superpage promotion

  • if you are touching all pages in your reservation
  • promotion to read-only superpage

Superpage fragmentation control

  • Kernel memory always uses superpages
    • The kernel does not page out its memory
  • Cache and free are kept in buddy lists (pair)
    • Aggregate small pages until a full superpage is available

Filesystem interaction with the virtual memory

  • the filesystem owns its "buffer cache" pages
  • an anonymous object owns its memory pages
  • anonymous objects don't have a filesystem
  • pages may be both mapped and cached (copy on write)
  • management of swap space
    • page cluster

Kernel structure

  • p 315 (book reference picture)

Week 6

I/O System Overview

Kernel I/O Structure

  • GEOM handles the disk IO
    • Can do all sorts of things: RAID, striping, etc.
  • The CAM layer handles disk I/O in a generic way
    • Individual drivers are much smaller
    • Based on SCSI but extended to ATA and SATA

I/O Descriptors

  • File Descriptor
    • An array of bytes
    • Move the file offset
    • Read from it
    • Write to it
  • Access to other hardware
    • Terminals
    • Tape drives
    • Printers
    • Disks
    • Modems
  • Pipes
    • Reliable byte stream

Descriptor I/O Structure

  • Per-process file descriptor index (integer array)

File Entry

  • One per open file, shared across dup()ed descriptors

File Entry Types

  • VNODE
  • SOCKET
  • PIPE
  • FIFO
  • KQUEUE, MQUEUE, etc.

Multiplexing I/O

  1. Use the select() or poll() system call (see the poll() sketch after this list)
    • Useful when the I/O usually blocks
    • Expensive when I/O succeeds
    • Stateless
  2. Use non-blocking mode
    • Useful when the I/O usually succeeds
    • Expensive when I/O blocks
  3. Use signal-driven I/O
    • Useful for infrequent events
    • Expensive for high-volume I/O
    • Context switch for each signal
  4. Use kqueue to monitor kevents
    • Useful for monitoring many events
    • Only in BSD and macOS
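
A hedged example of option 1 above using the standard poll(2) interface; the descriptors watched are just for illustration:

```c
/* Multiplex two descriptors with poll(2) instead of blocking on either one. */
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    struct pollfd fds[2];

    fds[0].fd = STDIN_FILENO;   /* watch stdin for readability */
    fds[0].events = POLLIN;
    fds[1].fd = STDOUT_FILENO;  /* watch stdout for writability */
    fds[1].events = POLLOUT;

    /* Block up to 5 seconds waiting for either descriptor to become ready. */
    int n = poll(fds, 2, 5000);
    if (n == -1) { perror("poll"); return 1; }
    if (n == 0) { printf("timed out\n"); return 0; }

    if (fds[0].revents & POLLIN)
        printf("stdin is readable\n");
    if (fds[1].revents & POLLOUT)
        printf("stdout is writable\n");
    return 0;
}
```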

Select Top-Level

  • Select will sit on top of the underlying bottom half to ask about states
  • Pool of mutexes

Select Bottom-Level

  • Select on async devices

Select Notification

  • Select notification when I/O becomes possible
  • selwakeup() is passed the selinfo, which is used to traverse the linked list
  • The device removes the horizontal list; the thread removes the vertical

Kqueues and Kevents

  • General event notification facility
  • Stateful interface
  • Register events of interest
    • May modify the event (add, delete, enable, disable)
    • Tracked events are deleted when resource is gone
  • Poll for events using kevent()
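
A hedged example of the stateful kqueue interface described above, using the standard kqueue(2)/kevent(2) calls; the monitored descriptor is illustrative:

```c
/* Register interest in readability of stdin and wait for one event. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int kq = kqueue();
    if (kq == -1) { perror("kqueue"); return 1; }

    /* Register the event of interest: stdin becomes readable. */
    struct kevent change;
    EV_SET(&change, STDIN_FILENO, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);
    if (kevent(kq, &change, 1, NULL, 0, NULL) == -1) { perror("kevent register"); return 1; }

    /* Poll for events; a NULL timeout blocks until something happens. */
    struct kevent event;
    int n = kevent(kq, NULL, 0, &event, 1, NULL);
    if (n == -1) { perror("kevent wait"); return 1; }
    if (n > 0)
        printf("fd %d readable, %ld bytes pending\n",
               (int)event.ident, (long)event.data);

    close(kq);
    return 0;
}
```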

Types of kqueue events

  • Supported kevents on FreeBSD
    • R/W - as in select
    • AIO - async I/O events
    • VNODE - file information has changed
    • PROC - status of processes
    • SIGNAL - posted to processes
    • TIMER - an event-based timer
    • USER - application-defined events
  • Each event has an identifier

Kevent notification

  • Each event specifies a filter flag
  • An event may be another kqueue descriptor

Kevent Implementation

  • Hash list for free form descriptors
  • Descriptors indexed in an array by descriptor number
  • List of pending events

Semantics of Non-Blocking

  • Non-block flag is stored in the file table

Uio Structure

  • IO vector
  • Increment Offset
  • Decrement Resid
  • The segment flag tells you where the I/O is coming from or going to
    • USERSPACE
    • SYSSPACE
    • NOCOPY
  • Pointer to the thread that requested the I/O
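
A hedged userland analogue of the uio idea: writev(2) gathers several iovec segments in one call, much as the kernel walks a struct uio, advancing the offset and decrementing the residual count. The buffers are made up for illustration:

```c
#include <sys/uio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char *hdr = "header: ";
    const char *body = "payload\n";

    struct iovec iov[2];
    iov[0].iov_base = (void *)hdr;     /* first segment */
    iov[0].iov_len  = strlen(hdr);
    iov[1].iov_base = (void *)body;    /* second segment */
    iov[1].iov_len  = strlen(body);

    /* One system call writes both segments in order. */
    ssize_t n = writev(STDOUT_FILENO, iov, 2);
    return (n == -1) ? 1 : 0;
}
```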

Level of Filesystem Interface

  • system call
    • open/close, read/write, link/unlink
  • file table
    • open/close, read/write, ioctl
  • virtual node (vnode)
    • VOP_
  • inode
    • namei,iget/iput
  • block I/O
    • bread/bwrite

Contents of a Vnode

  • type
  • pointer to type specific control block
  • flags
  • reference count
  • pointer to operations
  • pointer to containing file system
  • list of associated clean buffers
  • list of associated dirty buffers
  • count of pending I/O operations

Vnode interface - Name Translation

  • Allows but does not require state
  • Steps in translating pathname

Vnode interface - Vnode Management

  • Single system-wide vnode table
    • vn_inactive() - called when references drop back to zero
    • vn_reclaim() - recycles the vnode
  • A general set of utility routines is provided to manage vnodes for all filesystems

Vnode interface - Name Caching

  • Single system-wide soft-reference name cache with common routines
    • cache_lookup()
    • cache_enter()
    • cache_purge()
    • supports negative caching

Vnode linkages

Vnode Mount interface

  • /etc/mtab info is now maintained in the kernel
    • statfs(2)
    • getfsstat(2) returns the list of mounted filesystems
    • getmntinfo(3), a saner interface than the above
  • Generic mount information can be updated at any time

Buffers

  • Getting buffers
    • bp = bread(vnode_ptr, file_offset, size);
  • Releasing buffers
    • brelse(bp)
    • bwrite(bp)
    • bawrite(bp)
    • bdwrite(bp)

Stackable filesystems

  • Born in UCLA

Loopback mounts

  • allows arbitrary directories to be mounted anywhere else
  • implemented as a filesystem layer
  • basis for other filesystem implementations

Union mounts

  • Merges filesystems together instead of hiding the underlying filesystem

Union mount naming

Union mount issues

  • Remove files from lower layers
    • a whiteout is written in the top layer
  • A removed directory must be marked as opaque to whiteout directories

Union implementation

  • it basically deals with naming
  • all other operations are passed down to the lower layers

Dead file system - DFS

Week 7

Devices

Special Devices

  • Character devices
    • pseudo-terminals
    • frame buffer
    • printer
  • Network devices
    • network interfaces
  • Disk devices
    • disk management (GEOM)
    • filesystem interface

Device Naming and Access

  • Historic device nodes
  • /dev filesystem
  • devd

Character devices entry points

  • Operations supported by any character device

Line disciplines entry points

  • Used by tty devices
  • Divided into 2 parts
    • One for data going down from the kernel to the device
    • One for data coming up out of the device

Termios

  • Programmatic interface to terminal drivers
  • All state contained in one structure
  • struct termios{}

Changing the termios structure

  • tcgetattr()
  • tcsetattr()
  • command flags passed to tcsetattr()
    • TCSANOW - make change immediate
    • TCSADRAIN - drain output, then change
    • TCSAFLUSH - drain output, flush input, then change
  • TCSASOFT - only interested in soft state
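
A hedged example of changing the termios structure with tcgetattr(3)/tcsetattr(3); the raw-mode use case is illustrative, not taken from the class:

```c
/* Put the terminal into raw mode with cfmakeraw(), then restore it. */
#include <termios.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    struct termios saved, raw;

    if (tcgetattr(STDIN_FILENO, &saved) == -1) { perror("tcgetattr"); return 1; }

    raw = saved;
    cfmakeraw(&raw);                      /* disable echo, canonical mode, signals */

    /* TCSADRAIN: let pending output drain, then change the settings. */
    if (tcsetattr(STDIN_FILENO, TCSADRAIN, &raw) == -1) { perror("tcsetattr"); return 1; }

    char c = 0;
    read(STDIN_FILENO, &c, 1);            /* read a single raw keystroke */

    /* Restore the original settings before exiting. */
    tcsetattr(STDIN_FILENO, TCSADRAIN, &saved);
    printf("read character code %d\n", (int)c);
    return 0;
}
```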

Allocation of controlling terminal

  • Historically a side effect of open
  • Now an explicit ioctl: TIOCSCTTY
  • Only session leaders can acquire controlling terminal
  • only one per session

Revocation of controlling terminal

  • SIGHUP when the controlling process disappears
  • revoke the vnode that references the pty

Network driver entry points

  • Entry points in the ifnet structure
    • if_ operations

Network driver control messages

  • Control messages sent using if_ioctl
    • SIO* various operations of the network interface

Network packet reception

  • D buffer owned by device may be filled
  • K buffer owned by kernel and may be read
  • When K buffers are exhausted the kernel stops; when D buffers are exhausted the device drops packets

Disk volume management

  • GEOM layer
    • disk partitioning and labelling
    • disk aggregation
    • collection of I/O statistics
    • I/O optimization such as disk sorting

Mirroring a filesystem

  • GEOM mirror layer
    • mirrors the underlying partition
    • can be inserted to copy a filesystem

GEOM Operation

  • Downwards is handled by g_down thread (single threaded)
  • Upwards is handled by g_up thread
  • Module cannot sleep or compute excessively
  • Modules providing locking can allow direct dispatch
  • When the provider goes away, error is propagated
  • When provider changes, change is propagated up the stack

Disk sorting

Organization of buffers awaiting I/O for a disk

  • currently working on buffer for block 25
  • next buffer to be processed is at block 40
  • blocks to be done in next pass start at switch_point_list

Communication from the top to the bottom of the kernel

  • Acquire a queue mutex and block on CPU
  • Place request on queue
  • If device is idle, start it
  • If synchronous request, sleep
  • Release mutex

From bottom to the top

  • Acquire the mutex
  • Pop next request from queue
  • Release the mutex
  • Do required work
  • If synchronous, wakeup thread(s)
  • If asynchronous, free resources

FreeBSD Disk I/O Subsystem

  • Disk can connect in several ways
  • Non-disk USB and Firewire devices bypass CAM

Physical Drive Management

  • CAM Layer handles disk devices
    • routing through I/O busses to the drive
      • PCI
      • Cardbus
      • Firewire
    • tag queueing
    • serializing requests
    • allocating and freeing DMA resources
  • DMA map works similar to a TLB but much simpler

Autoconfiguration

  • policy
    • support widest possible range of CPU types and configuration
    • minimize information that must be specified in advance
  • strategy
    • decode CPU type
    • check memory controller type
    • find types and locations of possible main I/O busses
    • descend I/O busses, testing for presence of controllers and devices
    • attach and configure discovered controllers and devices

Autoconfiguration Functions

  • device_probe
  • device_identify
  • device_attach
  • device_detach
  • device_shutdown
  • device_suspend
  • device_resume

Device driver bidding

  • Bid process for drivers
  • Highest bid will get it
  • probe - bid - attach

Autoconfiguration data structures for pci0

  • the pci0 device scans its bus and finds an ATA disk controller
  • scanning its drivers, the "atapci" driver wins the auction
  1. Sample autoconfiguration hierarchy (devinfo)

Device virtualization

  • full virtualization lets guest operating system run without change
  • paravirtualization requires guest operating system to use virtualized devices and CPU features
  • the virtio model splits the device driver
    • guest operating system front end sends device commands to hypervisor
    • hypervisor back-end operates physical devices
    • communication is often done using shared memory rings

Bhyve virtualization

  • FreeBSD embeds bhyve to provide virtualized devices, versus Xen, which is a standalone hypervisor that allows guest operating systems to interact with each other
  • Bhyve exposes a virtual PCI bus to the guest
    • guest can probe and attach back-end devices

Xen virtualization

Xen Grant tables

  • grant tables are shared memory used to communicate between a guest operating system and Xen or another guest
  • grant table operations
    • permit_access
    • accept_transfer
    • may limit exported or imported pages to read-only
  • grant table states
    • reading
    • writing

Fast Filesystem

Fast filesystem history

Directory structure

  • all entries start on a 4-byte boundary
  • active entries are linked together

Extended attributes

  • It is a place to put metadata about the file
  • Header of each attribute has:
    • 4-bytes length
    • 1-byte namespace class
    • 1-byte content pad length
    • 1-byte name length
    • name padded so that the contents start on an 8-byte boundary
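
A hedged example of setting and reading an extended attribute through FreeBSD's extattr_set_file(2)/extattr_get_file(2); the file name and attribute name are made up:

```c
#include <sys/types.h>
#include <sys/extattr.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    const char *path = "notes.txt";          /* hypothetical file */
    const char *value = "lecture-notes";

    /* Store the attribute "category" in the user namespace. */
    if (extattr_set_file(path, EXTATTR_NAMESPACE_USER, "category",
        value, strlen(value)) == -1) {
        perror("extattr_set_file");
        return 1;
    }

    /* Read it back. */
    char buf[64];
    ssize_t n = extattr_get_file(path, EXTATTR_NAMESPACE_USER, "category",
        buf, sizeof(buf) - 1);
    if (n == -1) { perror("extattr_get_file"); return 1; }
    buf[n] = '\0';
    printf("category = %s\n", buf);
    return 0;
}
```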

File locking

  • Advisory file locking
    • no lock
    • shared access
    • exclusive access
  • locks are not tied to file access mode
  • locks may be ignored
  • locks may change after the file is open
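
A hedged example of advisory locking with flock(2); the file name is illustrative. A second process asking for LOCK_EX on the same file blocks until the lock is released, while processes that never call flock() are unaffected (the lock is advisory):

```c
#include <sys/file.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("shared.db", O_RDWR | O_CREAT, 0644);   /* hypothetical file */
    if (fd == -1) { perror("open"); return 1; }

    /* Take an exclusive advisory lock on the whole file. */
    if (flock(fd, LOCK_EX) == -1) { perror("flock"); return 1; }

    /* ... update the file while holding the lock ... */
    sleep(1);

    flock(fd, LOCK_UN);   /* release; also released automatically on close */
    close(fd);
    return 0;
}
```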

File locking implementation

Week 8

Filesystems

Quota

  • each "uid" may have quotas associated
  • quotas may be specified by filesystems

Semantics of quota

  • quota may be imposed on both users and groups
  • quota policy
    • Users should stay below their "quota"

Current UNIX Filesystems

  • process descriptor
    • open file entry
      • vnode
        • inode
          • disk: inode and data blocks

Inode

  • Contains all the metadata about the file
  • Symlink targets can be stored directly in the inode
  • direct blocks (point directly to data)
    • single indirect (pointer to a block of pointers to data)
    • double and triple indirect (additional levels of pointer blocks; see the capacity sketch below)
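
A hedged back-of-the-envelope sketch of how much data each level of indirection can address. The block size, pointer size, and direct-block count here (4 KB, 8 bytes, 12) are assumptions for illustration, not figures from the notes:

```c
#include <stdio.h>

int
main(void)
{
    const long long block  = 4096;            /* assumed block size */
    const long long nptr   = block / 8;       /* pointers per indirect block = 512 */
    const long long direct = 12;              /* assumed number of direct blocks */

    long long d  = direct * block;                   /* direct:  48 KB  */
    long long s  = nptr * block;                     /* single:   2 MB  */
    long long db = nptr * nptr * block;              /* double:   1 GB  */
    long long t  = nptr * nptr * nptr * block;       /* triple: 512 GB  */

    printf("direct %lld, +single %lld, +double %lld, +triple %lld bytes\n",
           d, s, db, t);
    return 0;
}
```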

A small filesystem

UNIX Filesystem

I/O in UNIX

  • align operations to block boundaries
  • inodes map logical block numbers to disk blocks
  • write to system buffer and copy back to blocks

Contents of a cylinder group

  • Superblock - static parameter of the filesystem
    • block size
    • fragment size
    • disk characteristics
    • layout policy
  • Cylinder group map - dynamic parameters of the cylinder group
    • free blocks and fragments
    • free inodes
    • allocation summaries
    • recent allocation history
  • Inode - per file information
    • type/owner
    • access/modification times
    • array of pointers to data blocks
  • Data blocks

Optimizing storage utilization

  • Time/space tradeoff
    • Time - big blocks reduce the number of reads/writes to access a file
    • Space - big blocks waste space for small files
  • UNIX files are predominantly small and so use a hybrid block size

Filesystem parametrization

  • Maintain system characteristics
    • time to schedule an I/O operation
    • ability to chain I/O operations
  • Use this information to do block layout ASSUMING sequential access.
  • Laying out blocks in a rotationally optimal fashion

Layout policies

  • need to localize allocation of resources that are accessed together
  • need to spread out unrelated allocations
  • Inode layout policy
    • localize
      • try to place all files contained in a directory in the same cylinder group as the directory
    • spread out
      • Place new directories in a cylinder group with a greater-than-average number of free inodes but the fewest directories. Cluster leaf directories.

Layout policies (part 2)

Data block layout policy

  • Localize
    • Place data blocks together in the same cylinder group as the inode resides
  • Spread
    • Redirect block allocation for a file to a new cylinder group every ten to twenty Megabytes
  • Policy aim
    • Keep some free blocks in every cylinder group
    • Localize all references when many small files

Dynamic block reallocation

  • The size of a file is not known when the file is opened

Implementation of dynamic block reallocation

Optimized Metadata Layout

  • the first 4% of the data area of each cylinder group is held for metadata

Implementation of allocation

  1. Use requested block
  2. Use closest forward block within cylinder group
  3. Use quadratic rehash to pick a different cylinder group
  4. Brute force search of cylinder group

To maintain reasonable options, 5%-8% of the disk is kept free

Filesystem Consistency

  • Metadata must be consistent
    • directories
    • inodes
    • bitmaps
  • Rules
    • Never point to a structure before it is initialized
    • Never reuse a resource before nullifying all previous pointers to it
    • Never reset an old pointer to a live resource before the new pointer has been set
  1. Keep metadata consistent 1

    • Synchronous writes
      • simple and effective
      • create/delete intensive is slow
    • Non-volatile RAM
      • usually runs all operations at memory speed
      • expensive hardware, unavailable on many machines
    • Atomic updates (journaling and logging)
      • create/remove do not slow down under heavy load
      • extra I/O generated, little speed-up for light loads
  2. Keep metadata consistent 2

    • Copy-on-write filesystems (LFS, ZFS, WAFL, etc)
      • write throughput, cheap snapshots, always consistent
      • disk fragmentation, memory overhead
    • Soft updates
      • most operations run at memory speed, reduced system I/O, instant recovery after a crash
      • complex code, background fsck, and increased memory loading

Adding journaling to soft updates

  • Only need to journal operations that orphan resources
  • Journal needs only 16Mbyte independent of filesystem size
  • Filesystem operations that require journaling
    • Increased link count
    • Decreased link count
    • Unlink while referenced
    • Change of directory offset
    • Cylinder-group updates of freed blocks and inodes

Additional requirements of journaling

  • Additional soft updates tracking
  • Reclaiming journal space

Tracking file removal dependencies

  • Ordering constraints
    • log the location in the directory that has the name to be deleted
    • the filename in the on-disk copy of the directory must be deleted
    • log the blocks to be deleted and deallocate the file, zeroing out its on-disk dinode
    • the blocks formerly referenced by the inode for the file must be released to the free-space bitmap
    • log the successful completion of removal

Journaled Crash recovery

  • Crash recovery is done by fsck
  • Recovery steps
    • Scan journal
    • Link count increases
    • Link count decreases
    • Free inodes with zero link count
    • Free inodes that were unlinked but busy
    • Free unallocated blocks

Snapshots

  • Create a copy-on-write image of a filesystem image
    1. suspend processes initiating system calls that modify the filesystem
    2. allow all modifications in progress to complete
    3. write out all dirty buffers to disk
    4. create an empty snapshot file the size of the filesystem partition
    5. mark the blocks that are currently in use
    6. resume write operations on the filesystem
    7. on each disk write, check whether the block has been copied, making a copy if the write is for an in-use block that has not yet been copied

Snapshot implementation

Week 9

ZFS filesystem

ZFS overview

  • Never over-write an existing block
  • It is always consistent
  • State atomically advances at checkpoints
  • Snapshots(read-only) and clones(read-write) are cheap
  • Metadata redundancy and data checksums
  • No need for fsck
  • Selective data compression and deduplication
  • Filesystems share a storage pool
  • Quota
  • Mirroring and multiple parity RAIDZ

ZFS Modules

  • ZFS is monolithic

Functional organization

  • GEOM used to provide access to disks
  • GEOM used also to present zvol as disks

Structural organization

  • Multiple layers
    • Meta-object-set
    • Object-set
      • Array of inodes or dnodes
  • The uberblock anchors the pool
  • Meta-object-set (MOS) describes array of filesystems, clones, snapshots, and ZVOLs
  • Each MOS object references an object-set that describes its objects
  • The space map is at the MOS layer; if an object-set needs more space it just asks the MOS

Uberblock

  • The uberblock describes the entire ZFS pool
  • Areas are used as a circular queue

Dnode

  • Equivalent of inode in FFS
  • Describe files and directories
  • But also snapshots, clones and etc.
  • A dnode is embedded with a znode when it describes a file
  • Blocks range in size from 4 KB to 128 KB

ZFS Block pointers

  • It is huge, much larger than an FFS block pointer

ZFS block management

  • Disk blocks are kept in the pool

ZFS Structure

  • Huge picture page 532-535

ZFS Relationships

  • Clones can only be taken of a snapshot
  • Can make multiple clone of same snapshot
  • Clones can be promoted to the filesystem
  • Can also take snapshots and clone ZVOLs

ZFS Checkpoints

  • Checkpoint affects all filesystems, clones, snapshots and ZVOL

Writing new data

  • Checkpoints are taken when one of:
    • five seconds have passed
    • 64 MB of dirty data accumulates
    • administrative task (like a snapshot)
  • To take a checkpoint
    • finish in-progress system calls
    • flush all dirty data
    • complete by writing new uberblock

Flushing dirty data

Logging

  • Handled by ZIL
  • Used to record changes between checkpoints
  • Not only metadata but content (a log, not a journal)
  • Data over 32 KB is written to disk and a pointer is put in the log

RAIDZ

  • Traditional RAID has fixed stripe size
  • RAIDZ has variable stripe size

RAIDZ Recovery

  • Rebuild by traversing all MOS objects and rebuilding their blocks
  • never need to recalculate and write parity

Storage pool allocator

  • Manages the space in the pool
  • It is a bitmap
  • Allocation proceeds as:
    • Find disk with the most free space
    • select the least-fragmented fixed-size map
    • allocate space closest to previous allocation
  • SPA also handles
    • compression
    • deduplication

Snapshots

  • take a checkpoint
  • allocate new dnode in the MOS
  • copy the block pointer from the fs dnode to the snapshot dnode
  • link the snapshot dnode into the head of filesystem's snapshot list
  • move the filesystem deadlist to the snapshot

Freeing filesystem blocks

  • Blocks are tracked using space maps, birth time and deadlist
  • When a block is allocated, its birth time is set to the current transaction number
  • Over time snapshots are taken which also reference the block
  • When a file is overwritten, truncated, or deleted, its blocks are released
  • For each freed block, the kernel must determine whether a snapshot still references it
    • if born after most recent snapshot, it can be freed
    • otherwise it is added to the filesystem's deadlist

Deduplication

  • It is done on a pool-wide basis
  • Duplicate blocks are identified by hash (using the checksums)

Remote replication

  • backup of pool either locally or remotely
  • can do full backup
    • stored as a single file
    • expanded back to its original form
  • can do incremental backup since backup of earlier snapshot
  • tracks the progress of partially completed backups

ZFS strengths

  • high write throughput
  • fast RAIDZ reconstruction on pools with less than 40% usage
  • avoid RAID "write hole"
  • blocks move between filesystems as needed
  • integration eases administration (mount points, exports, etc)

ZFS weakness

  • Slowly written files scattered on the disk
    • That is why it uses a lot of memory
  • Slow reconstruction on pools with greater than 40% usage
  • Block cache must fit in the kernel's address space, thus works well only on 64-bit
  • Needs under 75% utilization for good performance
  • RAIDZ has high overhead for small block size such as 4kb blocks typically used by databases and ZVOLs
  • blocks cached in memory are not part of the unified memory cache, so it is inefficient for files and executables using mmap() or sendfile()

Week 10

NFS

Sun Microsystems Network File System

  • Widely available
  • Divided into two halves
    • fileserver exports locally attached disks to client hosts
    • client hosts that import filesystems from central server
  • NFS basically started open source
  • Sun promoted connectathons to make sure every implementation was interoperable

NFS v3

  • stateless protocol
  • benefits
    • can continue to operate when some systems fail
    • fast recovery when failed components return
  • drawbacks
    • sync writes
    • high network traffic

NFS v4

  • stateful protocol
    • state is bounded to speed recovery
  • benefits
    • can continue limited operations when some systems fail
    • bounded recovery time when failed components return
    • minimal network traffic validating clients cache
    • less need for separate side protocols (e.g., mount and lock)
    • sessions (added in nfs 4.1) ensure idempotency of requests
  • drawbacks
    • sync writes
    • server required to maintain and manage states

NFS File identification

  • file handles
    • created by server to identify files when first looked up
    • used by client in all NFS requests after open
  • file handle contents
    • filesystem identifier
    • inode number
    • generation number

NFS v3 daemons

  • rpcbind (all)
  • mountd (server)
  • nfsd (server)
  • nfsiod (client)
  • lockd (all)
  • statd (all)

NFS v3 mount operation

NFS v3 I/O

NFS v4 goals

  • good performance on the internet
  • strong security with negotiation built into the protocol
  • good cross-platform interoperability
  • designed for protocol extension
  • provide protocol support for clustered-server deployment

NFS v4 changes

  • compound RPCs
  • lock operations included in the protocol
  • explicit open and close operations
  • pseudo-filesystem hierarchy to support multiple roots
  • extensive attribute support
    • required attributes (16)
    • recommended attributes (43)
    • extensible attributes (key/value pairs)
  • access control lists
  • delegation and callbacks (leases)

NFS v4 daemons

  • rpcbind (all)
  • nfsd (server)
  • nfscbd (client)
  • nfsuserd (client)

NFS v4 mount operation

NFS v4 sessions

NFS v4 delegation/recall

NFS v4 crash recovery

  • after client crash
    • client attempts its first mount from the server
    • server finds previous client state (if not timed out yet) and releases it
  • after server crash
    • server sends BAD_SESSION replies to out of date client IDs
    • client realizes the server has crashed and reestablishes its state
    • server gives clients time (default 15 seconds) to rebuild state before handing out new state

Week 11

Networking and Interprocess Communication

Graph of communicating processes

  • Domain A has P1 and P2 processes
  • Domain B has P2, P3 and P4
  • Protocols used for Addressing Domains
  • Domain addressing is the problem of finding a process within a domain
    • No centralized authority

Name server

  • Domain based distributed name server
  • gethostbyname() does an RPC to a name-server daemon, usually run on a gateway
  • BIND was made by one of his students, who got an A because it was deployable

The ISO networking model

  • Came up to create a standard way for interoperability
  • OSI layers came out to put in the communication protocol layer
  • The Internet in the EU was born from a guy buying modems and distributing them to friends, before the Internet protocols (UUCP)

Packet traffic

  • mbuf designed to hold network data

Socket system call interface

  • socket, bind, listen, write, read, etc…

Types of network address

  • local domain
  • IPv4
  • IPv6

Use of IPC/Networking facilities

  • Outline of a simple "remote write"
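
A hedged sketch of that "remote write" outline using the standard socket interface; the address and port are made up for illustration:

```c
/* Client: create a socket, connect to a server, and write a buffer. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);   /* TCP socket */
    if (s == -1) { perror("socket"); return 1; }

    struct sockaddr_in sin;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(12345);                       /* hypothetical port */
    inet_pton(AF_INET, "192.0.2.10", &sin.sin_addr);   /* hypothetical host */

    if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
        perror("connect");
        return 1;
    }

    const char *msg = "hello from the client\n";
    if (write(s, msg, strlen(msg)) == -1)              /* the "remote write" */
        perror("write");

    close(s);
    return 0;
}
```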

System layering

  • sockets
  • protocols
  • network interface and hardware

UDP

  • it is the absolute minimum you can get away with
  • layer 4 transport, datagram protocol
  • demultiplex on port number
  • optional checksum

TCP

  • layer 4, connection oriented
  • provides everything that UDP does not
  • flow control (windowing)
  • port to port connection
  • acknowledgements
  • may piggybacked, coalesced
  • 3-way handshake on open, 2-way handshake on close

SCTP

  • Reliable multiple-stream transport protocol
  • provides associations of up to 65k independent streams
  • good for multi-streams applications, no need for multiple handshakes

IP

  • layer 3
  • designed for packet-switched network
  • both IPv4 and IPv6
    • host-to-host addressing
    • routing
    • time to live
  • only IPv4
    • fragmentation and reassembly
    • type of service
    • options: source route, record route, timestamp

Mbufs

  • limited to 224 bytes of data
  • data can be written at the beginning or at the end of the buffer

Mbufs continued

  • mbufs come from a pool, allocated in a separate page map
  • pages are allocated when the mbuf is allocated and returned when the last reference is freed
  • the mbuf data segment was converted into other fields
  • mbuf data sits outside of the mbuf for larger data sizes
  • external mbufs

Socket data queuing

  • stream socket
    • packets are linked to the next
    • no boundaries without application interaction
  • datagram socket
    • are record based
    • if you read a datagram you get the entire datagram
  • 13 is the number of root servers because that is the number of servers that fit in a UDP packet

Socket interface protocols

  • pru*

Network interface addresses

  • ifnet
  • pf1_addr
  • pf2_addr

Network interface to link layer

  • driver, one per device type
  • link-layer drivers are like normal drivers but easier
    • there is not much in the link layer you really have to do

Net to protocol interface

  • input
    • common routines, copy or map data into mbufs
    • save interface id with packet

Week 12

The Network Layer

Internet Addressing - IP Version 4

  • IHL = IP header length
    • number of 32-bit words in the header
  • The ID field is mostly used by fragmentation
  • TTL
  • Header checksum is for the IP header only
  • 3 layer routing, network -> subnet -> host

Internet Addressing - Subnets

  • 22bit network

IP Multicast

  • Class D reserved for multicast
  • Interesting how multicast is created on gateways on the path of the publisher
  • Gateways have to have multicast support if they are doing multicast
  • IGMP
    • add, leave group
    • programming interface: setsockopt/getsockopt
    • on Ethernet group address is added to interface filter
  • Multicast agent
    • listens to IGMP
    • forwards multicasts to remote members

Address Resolution Protocol

  • Problem
    • how to map network-protocol-level addresses to a link-level address
  • Solution
    • broadcast an "ARP" request containing the host being sought
    • all the machines on the wire get the request
    • the machine that matches sends a response back to the broadcaster containing the link-level address to use
    • the network code keeps a cache of ARP translations (in the routing table)

Internet Addressing - IP Version 6

  • 128-bit addresses were planned way ahead
  • Autoconfiguration
    • router discovery
    • neighbor discovery
  • Security (IPSec)

IPv6 addressing

  • Version
  • traffic class and flow
  • next-header used to describe the type of routing among a lot of other things like authentication and encapsulation

IPv6 autoconfiguration

  • Router discovery
    • send router solicitation message to router "anycast" address
    • receive router advertisement messages containing the router IP address and router link-level address
    • mimicked in IPv4 by DHCP
  • Neighbor discovery
    • send neighbor solicitation
    • receive neighbor advertisement containing the neighbor IP address and neighbor link-level address
    • Mimicked in IPv4 by ARP

Internet Protocols - ICMP

  • part of IPv4, IPv6 has something similar
  • error messages (include header of problem + 8 bytes)
    • destination unreachable
    • service unavailable
    • time-to-live expired
    • header problem
  • control messages
    • redirect
    • source quench
  • low level network functions
    • echo (ping)
    • net address request/response
    • netmask request/response
    • timestamp request/response

ICMP Redirect processing

  • ICMP redirects cause direct modification of the kernel route, changing the gateway address
  • Redirects on FreeBSD accepted only from current router

IP Forwarding

  • First thing is to check if forwarding is enabled
  • check destination, no loopback, no broadcast
  • routing lookup
    • if the packet came from the same wire, an ICMP redirect message will be sent

Routing goals

From the 1980s:

  • most machines should only need one network interface
  • network topology should not be visible to users
  • routing should have dynamic reconfiguration
  • multi-interface machines should potentially be able to act as gateways

Separation of policy and mechanism

  • kernel implements mechanism
  • intelligent routing process determines policy
  • simple hosts may be simple
  • sophistication should be easily configurable
  • routing can be arbitrarily complex

Routing design

  • User level process: routed
  • routing daemons exchange information to determine best routes
  • routing daemons add, delete and replace routes in the kernel table to reflect best routing
  • kernel limited to looking up routes in its table

Kernel routing table

  • all routes are in a single hierarchical table organized as a radix tree
  • routes include variable-length destination, mask, and host or gateway
  • routing table holds/caches round-trip time, hop count, packet size
  • link layer may allocate space for link-layer information/cache (ifaddr structure augmented with link-layer calls for route management)
  • "cloning" routes create host-specific routes dynamically; may use external resolution agents

Routing: radix search algorithm

Example radix tree:

  • circles are internal node
  • bit position tested is shown in circle
  • leaf nodes are rectangles containing a key and mask
  • some interior nodes are associated with masks

– Picture here of radix tree routing algorithm

Kernel routing lookup

  • Callers set destination in route structure, then rtalloc() looks up an rtentry, returning a reference
  • rtalloc1() does a lookup, optionally cloning the target route if necessary

Kernel routing socket

  • Routing ioctl commands replaced with message-oriented routing socket
  • Changes to the kernel routing table are requested by messages via routing sockets
  • Responses include message ID, errno, are sent to all listeners
  • Routing socket also reports other routing changes, redirects, lookup failures, interface up/down

Routing strategies

  • DHCP
    • assigns IP address
    • installs route to gateway
  • Routed (RIP)
    • dynamic local routing
    • best strategy for clusters of local networks
  • Gated
    • OSPF
    • EGP
    • BGP

Routing Daemon (routed)

  • dynamic local routing using variant of Xerox NS routing information protocol
  • Bellman-Ford (vector-distance) algorithm
  • Gateways broadcast destination and hop for known routes
  • Routes may "count to infinity" when destinations become unreachable; infinity is chosen to be 16
  • Monitors network links for proper function, drops routes through down links
  • On host with one interface no updates are sent

OSPF routing protocol

  • Open Shortest-Path-First routing protocol
    • shortest path first, or link-state routing algorithm
    • each router maintains topology database, monitor link state
    • two-level hierarchy: routing areas connected by backbone
    • internal routers have complete topology for their area
    • backbone routers have complete information about areas and backbones

IPSec Goals

Goals:

  • Authentication - knowledge of a key proves you are who you claim to be
  • Confidentiality - data is encrypted across the connection
  • Replay protection - packets are sequenced at the application level so that replay attempts can be detected

IPSec transport mode

Key:

  • AH - authentication header
  • ESP - encapsulating-security payload
  • SPI - security-parameter index

IPSec tunnel mode

  • Router to router connection
  • Encryption happens only between the two connected routers
  • Host-to-router connections are possible; the host has to have the key

IPSec key management

  • Keys are stored in the kernel and managed with a key socket using these messages:
    • getspi
    • update
    • add
    • delete
    • get
    • acquire
    • register
    • expire
    • flush
    • dump

Packet-processing frameworks

  • Packet processing frameworks used to:
    • implement firewalls
    • NAT
    • debug network problems
    • provide framework for network-research testbeds
  • Available filter frameworks:
    • BPF - identify and copy selected packets
    • IP firewall (IPFW) - modify or drop selected packets
    • Dummynet - traffic shaping
    • Packet Filter (PF) - drop selected packets
    • Netgraph - provides data-flow between network processing modules
    • Netmap - direct access to network packet rings

Berkeley Packet Filter

  • Generally used via tcpdump, which uses pcap to express filters
  • Scans all incoming packets and copies ones that match the selection criterion
  • Never drops or modifies packets, but does not allow packets to be injected into the input stream
  • pcap compiles a high-level description to simple byte codes (which a JIT turns into native machine code)
  • Can avoid copying by using shared user-space/kernel buffers
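
A hedged sketch of the pcap path described above (compile a filter to BPF byte code and capture matching packets); the interface name and filter expression are made up:

```c
#include <pcap/pcap.h>
#include <stdio.h>

static void
handler(u_char *user, const struct pcap_pkthdr *hdr, const u_char *bytes)
{
    (void)user; (void)bytes;
    printf("captured %u bytes\n", hdr->caplen);
}

int
main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];

    /* Open an interface (hypothetical name) for live capture. */
    pcap_t *p = pcap_open_live("em0", 65535, 1, 1000, errbuf);
    if (p == NULL) { fprintf(stderr, "pcap_open_live: %s\n", errbuf); return 1; }

    /* Compile the high-level filter into BPF byte code and install it. */
    struct bpf_program prog;
    if (pcap_compile(p, &prog, "tcp port 80", 1, PCAP_NETMASK_UNKNOWN) == -1 ||
        pcap_setfilter(p, &prog) == -1) {
        fprintf(stderr, "filter: %s\n", pcap_geterr(p));
        return 1;
    }

    pcap_loop(p, 10, handler, NULL);   /* copy 10 matching packets */
    pcap_close(p);
    return 0;
}
```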

IPFW

  • Single central dispatch receives all incoming packets and each packet can be:
    • passed through unchanged
    • copied to additional packet streams
    • diverted to an alternate packet stream
    • subject to NAT
    • reassembled for further inspection
    • sent to dummynet and/or netgraph
    • dropped

Decisions on what to do with packets are described by rulesets loaded by the system administrator

Dummynet

Packet-processing framework that provides:

  • traffic shaping
  • packet delay emulation
  • packet scheduling

Pipe emulates a communications link with a scheduler arbitrating multiple independent queues that features:

  • programmable bandwidth and delay
  • selectable queue priorities
  • variable number of queues
  • scalable to many thousands of pipes and queues
  1. Dummynet scheduling policies

    • FIFO - constant running time with no service guarantees
    • weighted round robin - constant run time with limited service guarantees
    • quick fair queueing - constant run time with bounded service guarantees
    • weighted fair queueing - logarithmic run time (in number of queues) with optimal service guarantees

Packet filter (PF)

  • FreeBSD version is PFv1, PFv2 still to be ported
  • Similar functionality to IPFW
  • Developed under OpenBSD and ported to FreeBSD
  • Packets are normalized then run through a set of rules provided by the system administrator
  • Packets that pass the first set of rules run through another ruleset specific to their protocol type (TCP, UDP, ICMP, etc)
  • Packets that pass second set of rules are passed through for normal processing

Netgraph

Built around a data-flow model:

  • Nodes provide data processing
  • Edges provide packet flow between nodes

Currently 50 nodes available ranging from simple node that echos its input to a complex node that implements Bluetooth

Nodes have a separate control channel to receive configuration (as well as input and output)

Nodes can be added as loadable kernel modules

  1. Netgraph example

    • Ethernet bridge

Netmap

  • network interface packet rings directly mapped into the application address space
  • system calls to request packet transmission and be notified of packet reception
  • access via command to/from: /dev/netmap
    • NIOCREGIF command to attach to an interface
    • attaching stops the flow of packets into the kernel stack
    • software ring provided to selectively copy packets into stack and accept packets from the stack

Netmap receive ring

  • tail and head sync with kernel/hardware -> userland

Netmap transmit ring

  • similar to the receive ring; synchronizes with the kernel periodically to update head and tail on the ring
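
A hedged sketch of a netmap receive loop using the helpers from net/netmap_user.h (the interface name "em0" is illustrative):

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <poll.h>
    #include <stdio.h>

    int
    main(void)
    {
        struct nm_desc *d = nm_open("netmap:em0", NULL, 0, NULL);
        struct pollfd pfd;
        unsigned int i;

        if (d == NULL)
            return (1);
        pfd.fd = d->fd;
        pfd.events = POLLIN;
        for (;;) {
            poll(&pfd, 1, -1);          /* kernel syncs head/tail with the hardware ring */
            for (i = d->first_rx_ring; i <= d->last_rx_ring; i++) {
                struct netmap_ring *ring = NETMAP_RXRING(d->nifp, i);
                while (!nm_ring_empty(ring)) {
                    struct netmap_slot *slot = &ring->slot[ring->cur];
                    char *buf = NETMAP_BUF(ring, slot->buf_idx);    /* packet data, zero copy */
                    printf("received %u bytes at %p\n", (unsigned)slot->len, (void *)buf);
                    ring->head = ring->cur = nm_ring_next(ring, ring->cur);
                }
            }
        }
    }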

Week 13

The Transport Layer

Protocol Control Block

  • TCP example
  • No other connection may have the same four-tuple (local address, local port, foreign address, foreign port)
  • Connections are looked up in a hash table keyed by the four-tuple (see the sketch below)
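
A toy illustration of the four-tuple hash lookup idea (the structure, hash function, and names are made up for the example; they are not the kernel's in_pcb code):

    #include <stddef.h>
    #include <stdint.h>

    struct conn {
        struct conn *next;
        uint32_t laddr, faddr;      /* local and foreign addresses */
        uint16_t lport, fport;      /* local and foreign ports */
    };

    static struct conn *hashtab[1024];

    static unsigned int
    conn_hash(uint32_t laddr, uint16_t lport, uint32_t faddr, uint16_t fport)
    {
        return ((laddr ^ faddr ^ lport ^ fport) & 1023);
    }

    /* At most one connection owns a given four-tuple, so the first match wins. */
    static struct conn *
    conn_lookup(uint32_t laddr, uint16_t lport, uint32_t faddr, uint16_t fport)
    {
        struct conn *c;

        for (c = hashtab[conn_hash(laddr, lport, faddr, fport)]; c != NULL; c = c->next)
            if (c->laddr == laddr && c->lport == lport &&
                c->faddr == faddr && c->fport == fport)
                return (c);
        return (NULL);
    }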

Internet Protocol control block

  • Create control block
  • Bind local address and port
  • Set remote address (connect, sendto)
    • source is chosen according to outgoing network interface after routing lookup
  • Look up control block with matching address and port
  • Deallocate control block

TCP Sequence space

  • so_snd.sb_hiwat - Socket buffer high-water mark
    • maximum number of bytes accepted from the application
  • so_snd.sb_cc - Socket buffer character count
    • bytes written by the application that are still in the send buffer
  • snd_wnd - Send window
    • the window advertised by the receiver for flow control; the sender may not send beyond snd_una + snd_wnd
  • snd_nxt - next sequence number to send (data between snd_una and snd_nxt has been sent but not yet acked)
  • All flow is controlled by sequence numbers
  • The window cannot shrink; it can only move forward
  • Sequence numbers are negotiated during the handshake
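
A hedged sketch of how these fields bound what the sender may transmit (variable names mirror the fields above; the congestion window cwnd is added for completeness, and this is not the actual tcp_output() code):

    #include <stdint.h>

    long
    sendable_bytes(uint32_t snd_una, uint32_t snd_nxt, uint32_t snd_wnd,
        uint32_t cwnd, long sb_cc)
    {
        long offset = (long)(snd_nxt - snd_una);        /* data already in flight */
        long window = snd_wnd < cwnd ? snd_wnd : cwnd;  /* smaller of flow and congestion windows */
        long len = (sb_cc < window ? sb_cc : window) - offset;

        return (len > 0 ? len : 0);
    }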

TCP Packet header

  • source port
  • destination port
    • the IP header already carries the local and foreign addresses
  • sequence number
  • acknowledgment number
  • data offset
  • Urgent pointer only used when URG flag set
  • TCP protocol limits options to 40 bytes
  • Window scaling can be negotiated
    • a scaling factor is picked by which the advertised window is multiplied
  • Checksums

TCP timers

  • Five TCP timers
    • 2MSL (twice the maximum segment lifetime), 60 seconds, used while waiting for the close to complete
    • KEEP (keep alive) 75 seconds, send keep alive or drop dormant connection
    • PERSIST (persist) 5 to 60 seconds, force a connection to persist
    • REXMT (retransmit), 30 milliseconds to 64 seconds, called when retransmission is necessary
    • DELACK (delayed acknowledgement), 100 milliseconds, send a delayed ack to the peer

TCP Output states

  • Idle
    • no data has been sent that has not yet been acked, send queue is empty
  • Transmitting
    • data have been sent and not yet acked
    • REXMT timer is running
  • Persisting
    • Window is zero or too small to send
    • PERSIST timer is running

TCP Round-trip timing

RTT timing

  • Counter (rtt) is started when the first data segment is sent from the idle state
  • Counter is sampled when the timed sequence number is acked
  • Smoothed round-trip time (srtt) and variance (rttvar) are computed as moving averages of rtt samples using fixed-point arithmetic
  • Sample is discarded if timed data is retransmitted
  • If timestamps are in use, round-trip time of every packet can be measured
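
A minimal fixed-point sketch of the classic estimator (the 1/8 and 1/4 gains and the rule RTO = srtt + 4*rttvar follow Jacobson's algorithm; the real kernel keeps the values pre-scaled differently):

    #include <stdint.h>

    void
    rtt_sample(int32_t rtt, int32_t *srtt, int32_t *rttvar, int32_t *rto)
    {
        int32_t delta = rtt - *srtt;

        *srtt += delta >> 3;                /* srtt = 7/8 * srtt + 1/8 * rtt */
        if (delta < 0)
            delta = -delta;
        *rttvar += (delta - *rttvar) >> 2;  /* rttvar = 3/4 * rttvar + 1/4 * |delta| */
        *rto = *srtt + 4 * *rttvar;         /* retransmission timeout */
    }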

Connection establishment

  • Opening 3-way handshake
    • client sends SYN to server
    • server sends SYN+ACK to client
    • client sends ACK to server
  • Closing 2-way handshake
    • send FIN
    • receive ACK
    • FIN_WAIT_2 until other side does 2-way
  • During opening handshake each side defines:
    • initial sequence number
    • max segment size (MSS)
    • window size
  • Options both sides must offer in order to be used:
    • window scaling
    • timestamp request with units used
    • selective acknowledgement (SACK)

Socket connection queueing

  • Socket in accept mode
  • Complete and Incomplete list
  • sonewconn1() duplicates head socket
  • soqinsque() adds to a queue
  • soqremque() removes from queue
  • so_incomp links partially set up connections
  • so_comp links connections ready for acceptance
  • Need to cope with syn-flooding attacks

Handling SYN attacks

The SYN cache is a hash table holding SYN state

  • lookup when SYN received
    • if not found create entry
    • if found, resend SYN/ACK
  • never ACK any data sent in initial SYN packet

SYN Cookies used when SYN Cache is full

  • maintain no state
  • encode the local and foreign addresses and ports and the MSS, and place them in the initial sequence number of the SYN/ACK
  • if timestamps are supported, also encode the send and receive window scaling and SACK support, and place them in the timestamp field
  • the peer must respond within 16 seconds with a valid ACK (subtract 1 from the acknowledged sequence number, re-encode, and check that they match)

Connection metrics

Metrics cached about a host

  • rmx_mtu - MTU for this path
  • rmx_ssthresh - outbound gateway buffer limit
  • rmx_rtt - estimated rtt
  • rmx_rttvar - rtt variance
  • rmx_bandwidth - estimated bandwidth
  • rmx_sendpipe - outbound delay bandwidth product
  • rmx_recvpipe - inbound delay bandwidth product

Looked up at connection establishment

Updated at connection shutdown

Cached entries kept for one hour after last use

Silly window syndrome avoidance

  • Silly window syndrome - offering or accepting windows less than maximum segment size when max window is large
  • Receiver silly window avoidance: if the window that would be offered is less than 1/4 of the receive buffer and less than one max segment, offer a zero window (see the sketch below)
  • Sender silly window avoidance: if window is too small for all data to be sent, and window is less than max segment, wait for window update or persist timeout.
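
A minimal sketch of the receiver-side check (names are illustrative; the equivalent test lives in the kernel's tcp_output()):

    /* Return the receive window to advertise given the free buffer space. */
    long
    receive_window_to_offer(long space, long rcv_buf_size, long maxseg)
    {
        if (space < rcv_buf_size / 4 && space < maxseg)
            return (0);     /* too small to be useful: keep offering a zero window */
        return (space);
    }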

Delayed acknowledgement

  • Delayed ack is used to piggyback ack on window updates and application-level responses
  • Delayed ack also allows multiple packets to be acked at once
  • Implementation:
    • Delayed-ack flag is set when new data arrives
    • After process takes data, if window update would move window by 2 max-size segments, window update and ack are sent.
    • If data is returned, the ack is piggybacked on it
    • The DELACK timer runs every 100 ms and forces out acknowledgements not yet sent

Small packet avoidance (Nagle algorithm)

  • Problem: how to coalesce small segments into a larger one without delaying transmission too long
  • Alternative solution: set a "short" timer when the first character is received and transmit when the timer expires. (The timer isn't long enough to help slow nets, but is just enough to irritate users on local nets.)
  • Nagle algorithm: send immediately when a small amount of data is presented and nothing is outstanding; otherwise, hold additional data until a full packet can be sent or until the first segment is acked (see the sketch below)
  • Certain clients send small segments that receive no response but expect rapid action; an option (TCP_NODELAY) is needed to defeat the algorithm
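
A hedged sketch of the Nagle send decision (names are illustrative, not the kernel's):

    /* Return 1 if a segment should be sent now, 0 if the data should be held. */
    int
    nagle_ok_to_send(long bytes_ready, long maxseg, int unacked_data, int nodelay)
    {
        if (bytes_ready >= maxseg)
            return (1);     /* a full segment can be sent: always send */
        if (!unacked_data)
            return (1);     /* connection is idle: send the small segment immediately */
        if (nodelay)
            return (1);     /* TCP_NODELAY set: application defeated the algorithm */
        return (0);         /* otherwise hold until a full segment or the ACK arrives */
    }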

Slow start

  • Van Jacobson
  • Observations from Van: in steady state, ACKs from receiver should tell transmitter when to send new data
  • Problem: how to reach steady state.

Acknowledgement clocking

Congestion avoidance

  • Slow start doesn't solve congestion
  • Solution: maintain an estimate of the window size acceptable to the network (ssthresh)
  • Whenever a packet is lost, set ssthresh to half the current window
  • Slow start opens the window exponentially up to ssthresh
  • The window is opened linearly after slow start, probing network capacity (see the sketch below)
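
An illustrative per-ACK and per-loss window update in the spirit of the above (simplified; not the kernel's New Reno module):

    /* cwnd, ssthresh, and maxseg are all in bytes. */
    void
    on_ack(unsigned long *cwnd, unsigned long ssthresh, unsigned long maxseg)
    {
        if (*cwnd < ssthresh)
            *cwnd += maxseg;                    /* slow start: one segment per ACK (exponential per RTT) */
        else
            *cwnd += maxseg * maxseg / *cwnd;   /* congestion avoidance: about one segment per RTT */
    }

    void
    on_timeout(unsigned long *cwnd, unsigned long *ssthresh, unsigned long maxseg)
    {
        *ssthresh = *cwnd / 2;                  /* halve the estimate of acceptable window */
        if (*ssthresh < 2 * maxseg)
            *ssthresh = 2 * maxseg;
        *cwnd = maxseg;                         /* restart from slow start */
    }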

TCP retransmission

  • REXMT timer is set to an initial value when sending data
  • If timer expires:
    • snd_nxt is set to snd_una
    • ssthresh is set to half the current window
    • cwnd is set to 1 segment
    • tcp_output is called, transmits one segment
  • Timer is restarted, backed off at exponential rate
  • After 12 retransmissions the connection is dropped; during fast recovery, one new packet may be sent for each additional duplicate ack
  1. Fast retransmission

    • the receiver keeps acking the last in-order data until the missing packet is retransmitted; duplicate acks trigger the retransmission without waiting for the REXMT timer

Selective acknowledgement

  • Tells the sender which packets it has received beyond the currently ACKed data
    • Acked to 500
    • SACK indicates that 1000-2000 and 3000-3500 have been received
    • Sender will respond by sending 500-1000 and 2000-3000, then resume at 3500

Modular congestion control

  • Until FreeBSD 7, the kernel had a single congestion-control algorithm (New Reno from 4.4BSD)
  • Congestion control algorithms can now be added as a loadable kernel module
  • Currently five available:
    • New Reno (default)
    • Vegas
    • CUBIC
    • Hamilton Institute's delay based
    • H-TCP

Modular congestion control (II)

  • Each connection gets to choose its algorithm (using the TCP_CONGESTION option to setsockopt(); see the sketch after this list)
  • The chosen algorithm is invoked every time an ACK is received
  1. Algorithms

    • New Reno
      • detection based on retransmission timer
      • exponential backoff
    • Vegas
      • On each ACK measures the difference between expected and actual rate
      • if difference below threshold, increase the congestion window
      • if above, decrease
      • if within threshold, make no change
    • CUBIC
      • Designed to quickly find network capacity on long fast link (1Gbit US to Japan has 110ms RTT and takes 10 min to reach capacity)
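
A sketch of selecting a congestion-control module on a socket with the TCP_CONGESTION option (assumes the corresponding module, e.g. cc_cubic, is loaded):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>

    /* Ask the kernel to use the "cubic" algorithm for this connection. */
    int
    use_cubic(int sock)
    {
        const char *algo = "cubic";

        return (setsockopt(sock, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo)));
    }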

Basic TCP send policy

Stream Control Transmission Protocol

  • Reliable multiple-stream transport protocol
  • Provides associations between hosts with up to 65k independent streams
  • Each stream can be message- or stream-based
  • Associations setup with 4-way handshake to avoid SYN attack
  • No server state required until third handshake
  • Data can start flowing on third handshake
  • Streams setup within an association do not require handshake

SCTP chunks

  • Chunks look like TCP packets
  • with the addition of the multi-stream handling
  • transmission sequence number
  • stream identifier tells which stream
  • stream sequence number tells which packet within stream
  • though the length field is 16 bits, messages may be made up of multiple chunks

SCTP multiple paths

  • SCTP allows an association to have multiple addresses/routes using sctp_bindx() and sctp_connectx() (see the sketch below)
  • One route is selected for use
  • Backup routes send heartbeat packets to confirm viability
  • If the selected route fails, switch to an alternate
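
A hedged sketch of binding an SCTP socket to two local addresses with sctp_bindx() (the addresses and port are made-up examples):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <arpa/inet.h>
    #include <string.h>

    int
    bind_two_addresses(int sd)
    {
        struct sockaddr_in addrs[2];
        int i;

        memset(addrs, 0, sizeof(addrs));
        for (i = 0; i < 2; i++) {
            addrs[i].sin_family = AF_INET;
            addrs[i].sin_len = sizeof(addrs[i]);    /* FreeBSD sockaddrs carry a length */
            addrs[i].sin_port = htons(5000);
        }
        inet_pton(AF_INET, "192.0.2.10", &addrs[0].sin_addr);      /* example addresses */
        inet_pton(AF_INET, "198.51.100.10", &addrs[1].sin_addr);

        /* the association may now use either address/route */
        return (sctp_bindx(sd, (struct sockaddr *)addrs, 2, SCTP_BINDX_ADD_ADDR));
    }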

Week 14

System startup and shutdown

Booting overview

  • Firmware initializes hardware and loads the first-level bootstrap
  • First level bootstrap identifies disk label and loads second-level bootstrap (8kb to 64kb)
  • Second level bootstrap interprets filesystem to load full bootstrap
  • Final stage loads kernel and kernel modules and starts the kernel
  • Kernel runs autoconfig and starts remaining CPUs
  • Process 1 (/sbin/init)

Firmware bootstrap

  • Initialize low-level hardware
  • Optionally run hardware diagnostics
  • Pre-boot services such as RAID configuration and remote management
  • Select boot CPU and put remaining CPUs in suspend mode
  • Build list of attached I/O infrastructure to export to the kernel using ACPI or FDT
  • Identify possible boot devices and select one
  • Read in the first-level bootstrap from the front of the boot device and branch to it to start it running
  • Provide rudimentary I/O capabilities until kernel can operate devices by itself

First level bootstrap

  • Contains either a (historic) MBR or more modern GPT label
  • Use label to identify location of the FreeBSD bootable partition
  • Load first 64K of FreeBSD partition and branch to it to start it running

Second level bootstrap

  • gptboot to load from UFS
  • gptzfsboot to load from ZFS

Actions taken by the second-level bootstrap

  • enable access to full memory
  • fetch boot.config and interpret any flags
  • load /boot/loader

Final stage bootstrap

Full featured interactive loader

  • choices of kernel to load
  • inspect and set kernel env vars
  • preload kernel modules
  • preload security keys
  • preload memory-disk images
  • filesystem access
  • forth script interpreter

Default at startup is to run /boot/loader.4th

  • load /boot/kernel/kernel and any modules in /boot/loader.conf
  • at end of countdown, boot loaded kernel

Kernel loadable modules

Modules can be loaded into the kernel

  • when it is compiled
  • when it is loaded
  • when it is running (if not disabled)

Modules register an event handler

  • called at startup to initialize the module
  • called at shutdown to clean up the module

Each module has a version number to allow field upgrades. A module can register new system calls that it supports (see the sketch below).
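
A minimal sketch of a loadable module skeleton with its event handler (the module name "example" is illustrative):

    #include <sys/param.h>
    #include <sys/module.h>
    #include <sys/kernel.h>
    #include <sys/systm.h>

    static int
    example_modevent(module_t mod, int type, void *arg)
    {
        switch (type) {
        case MOD_LOAD:
            printf("example module loaded\n");      /* initialize the module */
            return (0);
        case MOD_UNLOAD:
            printf("example module unloaded\n");    /* clean up the module */
            return (0);
        default:
            return (EOPNOTSUPP);
        }
    }

    static moduledata_t example_mod = {
        "example",          /* module name */
        example_modevent,   /* event handler */
        NULL                /* extra data */
    };

    DECLARE_MODULE(example, example_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);
    MODULE_VERSION(example, 1);     /* version number allows field upgrades */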

Kernel boot

Stages of kernel startup

  • Assembly-language setup and enabling memory management unit
  • Platform-specific C-language startup of architecture specific interrupt vectors, stacks, and crafting first kernel process and thread
  • Setup basic kernel services: memory allocation, scheduling, locking
  • Run through kernel module initialization to initialize higher-level services (filesystem, network stack, start /sbin/init)

Process 1 will be forked from process 0 (the swapper)

Kernel module initialization

Register with:

  • SYSINIT(name, subsystem, order, function, identifier) to start
  • SYSUNINIT(name, subsystem, order, function, identifier) to stop

Ordering based on 2-level hierarchy:

  • subsystem is first level listed in /sys/sys/kernel.h
  • order is second level to differentiate modules in same subsystem

Module invoked as function(identifier)

Sorted at startup, then executed in order
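
A small sketch of registering startup and shutdown functions (the subsystem and order values are illustrative):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/systm.h>

    static void
    example_start(void *identifier)
    {
        printf("starting: %s\n", (char *)identifier);   /* invoked as function(identifier) */
    }

    static void
    example_stop(void *identifier)
    {
        printf("stopping: %s\n", (char *)identifier);
    }

    SYSINIT(example_start, SI_SUB_DRIVERS, SI_ORDER_ANY, example_start, "example");
    SYSUNINIT(example_stop, SI_SUB_DRIVERS, SI_ORDER_ANY, example_stop, "example");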

Kernel modules (part I)

Defined in /sys/sys/kernel.h

  • SI_SUB_TUNABLES
    • set subsystem parameters from loader tunable
  • SI_SUB_VM
    • early virtual memory
  • SI_SUB_KMEM
    • kernel malloc() initialization
  • SI_SUB_HYPERVISOR
    • Xen HVM hypervisor detection
  • SI_SUB_WITNESS
    • configure witness lock-debugging facility
  • SI_SUB_MTX_POOL_DYNAMIC
    • initialize dynamically allocated mutex pools
  • SI_SUB_LOCK
    • initialize statically declared or allocated locks
  • SI_SUB_EVENTHANDLER
    • register statically declared event handlers

Kernel modules (part II)

  • SI_SUB_VNET_PRELINK
  • SI_SUB_KLD
  • SI_SUB_CPU
  • SI_SUB_RACCT
  • SI_SUB_RANDOM
  • SI_SUB_KDTRACE
  • SI_SUB_MAC
  • SI_SUB_MAC_POLICY
  • SI_SUB_MAC_LATE

Kernel threads

  • SI_SUB_INTRINSIC - hand craft proc 0

  • SI_SUB_AUDIT - create audit worker thread

  • SI_SUB_CREATE_INIT - fork process 1

  • SI_SUB_SCHED_IDLE - create idle thread

  • SI_SUB_INTR - create device-driver threads

  • SI_SUB_SOFTINTR - create callout thread, netisr thread

  • SI_SUB_DRIVERS - GEOM thread, random device

  • SI_SUB_CLOCKS - create and start system clock threads

  • SI_SUB_KTHREAD_PAGE - create vm_pageout daemon thread

  • SI_SUB_KTHREAD_VM - create vm_daemon thread

  • SI_SUB_KTHREAD_BUF - create buf_daemon

  • SI_SUB_KTHREAD_UPDATE - create vnlru_proc daemon

  • SI_SUB_KTHREAD_UPDATE - create sched_sync daemon

  • SI_SUB_KTHREAD_INIT - schedule init process to run

Mounting the root and /dev filesystems

When the init thread starts running in the kernel, it must get the root filesystem mounted

  • Mount devfs filesystem as root ("/") filesystem
  • Search through "rootdevnames" for a viable root filesystem
  • Mount selected root filesystem on /root
  • Swap filesystems: /root => / and / => /dev

Find and load /sbin/init. Exit the kernel into the init process.

User-space startup

Load additional kernel modules. In single-user mode, init spawns a shell. In multi-user mode, init spawns a shell to run the commands in /etc/rc:

  • Check filesystems as needed and then mount them
  • Run rcorder to select order that scripts in /etc/rc.d and /usr/local/etc/rc.d should run based on REQUIRE and PROVIDE declarations in each script
  • Run the script using /etc/rc.conf and /etc/default/rc.conf to configure subsystems to be run (such as sshd) and the options with which they should run

System shutdown

Shutdown happens in 3 steps

  1. Shut down services that need to store data in the filesystem
  2. Shut down the filesystems
  3. Shut down the remaining services

Services register using EVENTHANDLER_REGISTER(name, function, argument, priority)

  • The name identifies which shutdown step
    1. shutdown_pre_sync
    2. shutdown_post_sync
    3. shutdown_final
  • The priority works like the order parameter of SYSINIT()
  • Event is invoked as function(argument)
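
A sketch of registering a handler for the post-sync shutdown step (the handler and its registration point are illustrative):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/systm.h>
    #include <sys/eventhandler.h>
    #include <sys/reboot.h>

    static void
    example_shutdown(void *arg, int howto)
    {
        printf("example: shutting down (howto=%d)\n", howto);
    }

    static void
    example_register(void *arg)
    {
        /* invoked as function(argument) at the shutdown_post_sync step */
        EVENTHANDLER_REGISTER(shutdown_post_sync, example_shutdown, NULL,
            SHUTDOWN_PRI_DEFAULT);
    }
    SYSINIT(example_shutdown_reg, SI_SUB_DRIVERS, SI_ORDER_ANY, example_register, NULL);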

Performance monitoring and system tuning (tools)

  1. Tuning the filesystem (tunefs(8) options)

    -m minimum percentage of free space
    -e maximum number of blocks (2048 = 32 Megabytes)
    -a enable or disable access control lists (disable)
    -l enable or disable MAC labels
    -j enable or disable journaling
    -s specify average number of files in a directory (64)
    -f specify average file size (16384)
    -o optimization to minimize system time or to minimize space used on disk (time)
    
  2. System monitoring with systat

    Output of "systat -vmstat"

    Load average should be less than the number of CPUs

  3. systat -ifstat

    Output of "systat -ifstat"

Kernel tracing

Ktrace and truss; this section is about ktrace

  • Provide kernel tracing logging
    • system calls
    • pathname translation
    • signal processing
    • I/O
  • Tracing can be flexibly applied
    • selectively for running processes
    • for individual commands
    • for process groups
    • through inheritance to all current or future children

Kernel trace display

  • Trace files are generated in binary format
  • Translation to readable form is done by kdump
  • Useful translations are optionally added
    • system call numbers to names
    • ioctl values
    • system errors with standard error string
  • output can display running time between and during system calls
  1. trace example

Administration of kernel information

  • most information available through sysctl()
  • extensible MIB-style interface
  • allows selective retrieval of kernel data structures
    • process table
    • routing table
    • vnode table
  • allow selective setting of kernel variables
    • networking: IP forwarding, redirect, enable, time-to-live
    • debugging: filesystem cluster enabling
    • limits: maxproc, maxfiles, maxvnodes

MIB = Management Information Base
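
A small user-level sketch of reading one of these variables with sysctlbyname(3):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
        int maxproc;
        size_t len = sizeof(maxproc);

        if (sysctlbyname("kern.maxproc", &maxproc, &len, NULL, 0) == -1) {
            perror("sysctlbyname");
            return (1);
        }
        printf("kern.maxproc = %d\n", maxproc);
        return (0);
    }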

Getting process information

  • Process structures are no longer contiguous, and other info is scattered.
  • Solution: define a kinfo_proc structure:
    • useful fields from proc
    • xucred
    • useful fields from vmspace
    • session and pgrp info (login, pgid, tty)
    • wchan message
  • Interface provides selective retrieval of the process table
    • by proc id
    • by proc group id
    • by session
    • by effective user id
    • by real user id
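
A sketch of retrieving the kinfo_proc entry for a single process by pid via sysctl(3):

    #include <sys/param.h>
    #include <sys/sysctl.h>
    #include <sys/user.h>       /* struct kinfo_proc */
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct kinfo_proc kp;
        size_t len = sizeof(kp);
        int mib[4] = { CTL_KERN, KERN_PROC, KERN_PROC_PID, getpid() };

        if (sysctl(mib, 4, &kp, &len, NULL, 0) == -1) {
            perror("sysctl");
            return (1);
        }
        printf("pid %d command %s\n", kp.ki_pid, kp.ki_comm);
        return (0);
    }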

Program interface

  • Goals
    • Want all programs that access kernel tables to work on a running or dead kernel
    • Want to keep machine dependent information in a single place
  • Provide a "kernel virtual memory" (kvm) library
    • uses sysctl for running kernels
    • knows how to extract tables for dead kernel
    • maintains a cache of kernel namelist entries for running kernel to avoid repeated nlist() calls
  • Provided that programs do not use machine-dependent fields in the tables they extract, they are portable from one architecture to another once the kvm library has been ported (see the sketch below)
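
A sketch of using the kvm(3) library to list processes; passing NULL for the kernel and core files selects the running kernel, while passing saved kernel/vmcore paths examines a crash dump (link with -lkvm):

    #include <sys/param.h>
    #include <sys/sysctl.h>
    #include <sys/user.h>
    #include <fcntl.h>
    #include <kvm.h>
    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        char errbuf[_POSIX2_LINE_MAX];
        struct kinfo_proc *kp;
        kvm_t *kd;
        int i, cnt;

        kd = kvm_openfiles(NULL, NULL, NULL, O_RDONLY, errbuf);
        if (kd == NULL) {
            fprintf(stderr, "kvm_openfiles: %s\n", errbuf);
            return (1);
        }
        kp = kvm_getprocs(kd, KERN_PROC_PROC, 0, &cnt);    /* whole process table */
        for (i = 0; i < cnt; i++)
            printf("%5d %s\n", kp[i].ki_pid, kp[i].ki_comm);
        kvm_close(kd);
        return (0);
    }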

Kernel debugger

FreeBSD does kernel debugging remarkably well

  • Better than Solaris
  • Better than Linux (mostly because of Linus's attitude of "read the code" and figure it out)

Low-level built-in debugger runs on console

High-level kernel debugging facility provides complete symbolic debugging using kgdb(1) or lldb(1)

  • symbolically examine and modify memory locations
  • set breakpoint in the kernel
  • single step the kernel

Usage - enable the debugger when booting, then stop:

  • at first instruction
  • when debugger is attached
  • on kernel trap/panic

Kernel crashes

Enable crash dumps

  • location specified using dumpon(8)
  • usually configured for primary swap area
  • default is "mini-dump" which saves only kernel memory

Recovering a crash dump

  • after reboot run savecore(8) to extract
  • can then run kgdb(1) or lldb(1) to peruse the crash dump
  • general hints given in crash(8)
  • wealth of information collected using crashinfo(8) and textdump(4)

Netstat memory statistics

netstat -m -N kernel.80 -M vmcore.80

System malloc memory statistic

vmstat -m -N kernel.80 -M vmcore.80

System malloc memory statistics (part II)

Continuation of the "vmstat -m" output

System zone memory statistics

vmstat -z -N kernel.80 -M vmcore.80

Example ps output

  • ps -axlw

Example fstat output

  • fstat | more

Example of netstat output

  • netstat -N kernel.80 -M vmcore.80

DTrace

DTrace provides detailed information throughout the application stack

  • system calls
  • libraries
  • application code

many handy one-line DTrace scripts: https://wiki.freebsd.org/DTrace/One-Liners

Performance monitoring counters

  • pmc
    • details of all activity on a CPU
    • activity of a process
  • Things that can be measured:
    • interrupts taken
    • instructions executed
    • branches executed and mispredicted
    • number of non-idle processor cycles
    • instructions cache misses and prefetches
    • data cache misses, prefetches and writebacks

Using PMC

  • pmc(3) describes types of counters
  • pmccontrol -L - lists names of available counters for current machine
  • pmcstat -T -S event-spec - gives a top-like display of the listed counters

Week 15

System measurement and performance tuning

Background

  • Understand the operating system
  • While reading about it do not turn a page without understanding it completely

Observability tools

  • Good observability tools will help you find unnecessary work
  • They help you eliminate this unnecessary work and tune parameters

Recap tools:

  • systat
  • pidstat
  • vmstat
  • iostat

Tracing tools

  • truss
  • tcpdump

Methodologies

  • Anti-pattern
  • Workload Identification
    • who
    • what
    • why
    • how
  • USE Method
    • Utilization
    • Saturation
    • Errors
  • Off-CPU

Benchmark

  • 100% of benchmarks are wrong
  • Use only for experimental analysis
  • Types:
    • micro benchmarks
    • macro benchmarks
    • bad benchmarks

Profiling

  • Great for understanding what a benchmark is doing
    • pmcstat
    • dtrace
  • pmcstat
    • Cycles per instruction
      • unhalted cycle count
      • instruction count
  • dtrace

Tracing

  • truss
  • ktrace
  • DTrace

Tuning

  • sysctl -aW
  • /etc/sysctl.conf
  • Treat tunable parameters as someone else's medicine cabinet