@eklitzke
Created February 17, 2011 01:04
percona live notes

Percona Server with XtraDB

Introduction

  • in 2010 InnoDB 1.1 was announced; it will first ship with MySQL 5.5
  • Historically InnoDB development lags while MySQL is trying to GA
  • lots of things fixed in InnoDB since MySQL 5.0
  • important note -- plugin version of InnoDB is not the default in 5.1
  • percona server uses innodb plugin by default
  • xtradb is a fork of the innodb plugin
  • the on-disk format of xtradb does not change by default, you can switch back and forth; some extra features require changing the on-disk format, however
  • percona xtradb releases closely tied to upstream mysql releases
  • release model is short release cycle, rebased against new mysql releases
  • historically there is an average of a new release every 1-2 months

How InnoDB Works

  • in general, innodb has fewer opportunities for reordering writes than an operating system, since it must be more reliable
  • innodb uses fsync, myisam does not
  • for a basic select, first innodb queries the buffer pool, then the tablespace
  • for a basic update, first buffer pool is updated, then the log file is updated, and the update to the tablespace is delayed
  • the optimization here for updates is based on the theory that sequential writes to log files are cheap
  • logs are only used during recovery; to figure out which pages need to be flushed or evicted there are a flush list and an LRU list, both of which are in memory
  • background process for syncing dirty pages is normally referred to as a "checkpoint"
  • the innodb log file size can be increased to cause innodb to flush to disk less frequently (and accumulate more data in the logs); see the config sketch after this list
  • on a commit, only log files need to be synced, nothing else
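
A minimal my.cnf sketch of the knobs this section touches on (the option names are stock MySQL/InnoDB settings; the values are illustrative assumptions, not recommendations):

    [mysqld]
    # larger redo logs let innodb accumulate more changes before dirty pages
    # must be flushed back, smoothing background I/O (at the cost of longer
    # crash recovery)
    innodb_log_file_size           = 256M
    innodb_log_files_in_group      = 2
    # pages are read and modified in the buffer pool before the background
    # checkpoint process writes them back to the tablespace
    innodb_buffer_pool_size        = 4G
    # 1 = fsync the log on every commit (only the log needs syncing at commit)
    innodb_flush_log_at_trx_commit = 1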

Percona Server with XtraDB

  • Fast Index Creation (inherited from the plugin)
  • adding an index just adds the index, doesn't rebuild the table
  • same with dropping index
  • does not work for utf-8 (bug #???) or primary key
  • I/O scalability (via plugin)
  • --innodb_io_capacity, tell mysql how many io operations per second your disk is capable of; 100 iops is typical for a 7200rpm disk
  • --innodb_read_io_threads and --innodb_write_io_threads -- tell innodb how many threads to use for reading and writing
  • CPU scalability (via plugin)
  • mostly consists of 3-4 patches that transparently make innodb work better with multi-core CPUs
  • lots of x-locks converted to read/write-locks
  • lots of locks broken into more fine-grained locks
  • Adaptive Flushing (invented by percona server)
  • teaches mysql to do background work more aggressively when log space is about to run out, so that TPM (transactions per minute) doesn't suddenly drop right when a log file is rotated
  • see also: adaptive checkpointing, which is similar but different
  • Fast Crash Recovery (invented by percona)
  • see mysql bug #29847, related to innodb recovery being slow (e.g. a larger buffer pool can increase innodb recovery time)
  • in many cases 10x faster than builtin
  • Improved Buffer Pool Scalability
  • buffer pool mutex becomes four mutexes: flush list, lru, free, and hash mutexes
  • Data Dictionary control
  • once an innodb table is open it's never freed from the in-memory data dictionary
  • adds --innodb_dict_size_limit
  • Undo Slots
  • by default only 1024 transactions can be open at once, some statements require 2 undo slots
  • in xtradb it's expanded to up to 4072 with --innodb_extra_undoslots=1
  • Rollback Segments
  • makes it possible to have more than one rollback segment, and also reduces mutex contention on the rollback segment
  • this is also fixed upstream in mysql 5.5
  • Fast Checksums
  • binary format incompatible
  • Different Page Sizes
  • instead of 16k, you can have 4k or 8k page sizes (probably useful only for older systems)
  • Separate Purge Thread
  • cleans up long history list length faster
  • speaker thinks that the best features are usability improvements, not performance enhancements
  • ability to show buffer pool contents
  • ability to save contents of the buffer pool (actually saves offsets in the innodb files instead of data contents, so 8 bytes not 16k bytes for each thing); can help with restarting mysql faster; stores to file called 'ib_lru_dump'; it's possible to dump the lru from one server and load it on another server, but be super careful
  • transactional replication (also available in mysql 5.5); slaves store position in relay-log.info, which isn't fsynced; this makes recovery from a slave crash more reliable
  • import/export tables, makes it easy to just restore one table at a time from the ibd files; import/export still has to be done with xtrabackup, however
  • better handling of corrupt tables, a corrupt table (i.e. totally bogus data) won't crash mysql, instead the table is just marked as corrupt and the server continues starting up
  • the slow query log for percona has hella more stuff if you increase verbosity; mk-query-digest also understands the new percona columns
  • user statistics, new tables that show up in the information_schema database. for instance, allows you to find indexes that are unused
  • xtrabackup can report on fragmentation of indexes (runs very slowly, so do it offline); if an index is super fragmented, you can drop and re-add the index
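
A few of the features above sketched as SQL (table, index, and schema names are made up for illustration; INDEX_STATISTICS is a Percona Server information_schema extension and is only populated when user-statistics collection is enabled):

    -- fast index creation: adding/dropping a secondary index does not
    -- rebuild the whole table (does not apply to the primary key)
    ALTER TABLE orders ADD INDEX idx_customer (customer_id);
    ALTER TABLE orders DROP INDEX idx_customer;

    -- I/O scalability: tell innodb roughly how many IOPS the storage can do
    SET GLOBAL innodb_io_capacity = 600;

    -- user statistics: indexes that never show up here (or show no reads)
    -- are candidates for dropping
    SELECT * FROM information_schema.INDEX_STATISTICS
     WHERE TABLE_SCHEMA = 'mydb';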

Improvements To Performance in MySQL 5.5

  • main areas of improvement: SQL level, innodb, windows, performance schema

SQL Level

  • LOCK_open fixed; it was a global lock used in a lot of places
  • MySQL 5.5 introduces MDL (Meta Data Locking) subsystem; optimizes for both number of operations, and reduces contention
  • smaller fixes
  • LOCK_alarm mutex, replaced by posix-specific apis where available
  • LOCK_thread_count mutex: usage reduced from 1-2 times per query to only at connect/disconnect
  • THR_LOCK_charset mutex, hotspot for setting up character sets, completely removed

InnoDB Optimizations

  • improved log sys mutex; had been used for both log records and lsn, and also something about mtr
  • this has been changed to 2 separate mutexes, but contention can still be observed for some heavy log write workloads
  • separate flush_list mutex, so that the flush list mutex is different now from the buffer pool mutex; the buffer pool mutex is very hot
  • ability to create multiple buffer pools, using innodb_buffer_pool_instances; however, no ability for users to specify how to map table/index constructs to buffer pools
  • some internal operations need to query all buffer pools, so for certain things internal overhead can be increased
  • multiple rollback segments
  • now there are 128 segments, giving up to 128k concurrent transactions
  • innodb_fast_shutdown=0 needed for this change
  • asynchronous i/o on linux
  • for many years only windows was using native async i/o
  • on linux it was simulated using helper threads
  • can make crash recovery up to 1.5x faster
  • assumes usage of O_DIRECT (which is good to use anyway)
  • change buffering
  • insert buffer has existed for a long time, speeds up inserts for i/o bound workloads
  • however, updates and deletes were unable to use this feature
  • mysql 5.5 supports "delete mark" and "purge" change buffering
  • it handles update case as well as delete because secondary indexes are never updated in place
  • up to 40x performance for large deletes
  • separate purge thread
  • before mysql 5.5 purging of old records is done by the main thread
  • use innodb_purge_threads=1 to use separate purge thread
  • adaptive flushing
  • not a new feature in mysql 5.5, but was updated to deal with multiple buffer pools, and got lots of small tunings
  • mysql 5.5 should be a lot "smoother" on many workloads
  • will probably continue to improve as mysql 5.5 matures, as the mysql team is still looking at it
  • performance schema
  • includes many innodb probes
  • can be used by developers to better understand sources of performance problems
  • is mysql 5.5 faster? peter says it's not faster across the board, but there are no big specific regressions for now
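
A hedged my.cnf sketch of the MySQL 5.5 options described above (values are illustrative assumptions only):

    [mysqld]
    # split the buffer pool into several instances to reduce contention on
    # the (very hot) buffer pool mutex
    innodb_buffer_pool_size      = 8G
    innodb_buffer_pool_instances = 4
    # move purging of old row versions out of the main thread
    innodb_purge_threads = 1
    # buffer inserts, delete-marks and purges for secondary indexes
    innodb_change_buffering = all
    # native linux asynchronous I/O, which pairs well with O_DIRECT
    innodb_use_native_aio = 1
    innodb_flush_method   = O_DIRECT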

  • remaining issues:

  • per index lock; use partitioning as partial workaround
  • kernel mutex; seems to be a big focus for innodb in mysql 5.6
  • dropping tablespace with innodb file per table
  • log mutex
  • group commit
  • peter notes that probably mysql 5.5 is generally better than percona server 5.1
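
For the per-index lock in the list above, partitioning is only a partial workaround; a hypothetical sketch (assuming id is the table's primary key, since the partitioning column must be part of every unique key):

    -- each partition gets its own index trees, so the per-index lock is
    -- effectively split eight ways for this table
    ALTER TABLE events
        PARTITION BY HASH (id)
        PARTITIONS 8;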

InnoDB Internals

  • innodb plugin is default in 5.5

  • improved cpu scalability, fast index creation, buffer pool tablescan resistance, fast crash recovery, multiple buffer pools

  • 5.5 plugin is innodb 1.1.x, 5.1 plugin is 1.0.x

  • the server communicates with the storage engine through a row-based interface

  • foreign keys handled by engine backend

  • features: crash safe, mvcc, full index reading, change buffering

  • new plugin supports compression and full hot backup support

  • commands:

  • SHOW ENGINE INNODB STATUS
  • SHOW ENGINE INNODB MUTEX
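
The per-second averages in the status output cover the interval since the status was last generated, which you don't fully control (see the next section), so a common pattern is to take a throwaway sample first:

    -- first call mostly just resets the averaging window
    SHOW ENGINE INNODB STATUS\G
    -- wait 20-30 seconds, then take the sample you actually analyze; the
    -- header says "Per second averages calculated from the last N seconds"
    SHOW ENGINE INNODB STATUS\G
    -- spin and wait counters for the individual internal mutexes
    SHOW ENGINE INNODB MUTEX;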

Show Engine Innodb Status

  • make sure that the window over which the per-second averages are calculated is at least 20-30 seconds; you don't fully control what window you get
  • rollback is determined based on which transaction has the least number of "rollback" operations to do
  • whenever you see two numbers together, they could be really representing one 64-bit integer (all ints are 32 bit in output)
  • note to self: read more about innodb concurrency tickets
  • http://everythingisdata.wordpress.com/2009/10/17/numbers-everyone-should-know/
  • make sure you have noatime enabled, so the redo logs aren't constantly having their atime updated
  • data in tablespace is organized at page level, not row level
  • communication always happens with the buffer pool before it hits the tablespace
  • on a write, buffer pool is marked dirty, the log file is updated, and the log file is fsynced
  • so: pages in the tablespace are written in a delayed manner
  • logs are only read from when doing a recovery
  • contrast to oracle which uses log files for reading and doing fancy things like replication
  • dirty pages are tracked with two in-memory lists: the flush list and the LRU
  • all log activities are assigned an LSN (log sequence number), which is monotonically increasing
  • LSN value is stored in tablespace after each checkpoint
  • oracle calls their logs "redo logs", but it means the same thing
  • mysql does fuzzy checkpointing, because after a checkpoint there may still be dirty pages in the buffer pool
  • outside of the buffer pool, changing innodb_log_file_size is one of the most important settings, to smooth out background I/O for longer
  • by default up to 90% of the data in the buffer pool can be dirty
  • if log files are larger than the buffer pool, you're going to be fucked when doing a recovery
  • you can also change innodb_flush_log_at_trx_commit to 0 or 2 to reduce durability (good for slaves!)
  • innodb_log_buffer_size is useful for writing big TEXT/BLOB changes
  • look for innodb_log_waits going up over time
  • tuning the log file size is a tradeoff between performance and recovery time; tuning this must be contingent on your requirements for recovery time
  • these numbers are shown in show innodb status:
  • log sequence number
  • log flushed up to
  • last checkpoint at
  • you can track the amount of log you're generating (e.g. per hour) by tracking the delta of the LSN
  • status variables in percona server also show this
  • rollback segments are used to get consistent read view, and also used to rollback uncommitted changes during recovery
  • innodb_read_io_threads changes how many threads the buffer pool can use to read data from tablespace
  • innodb_max_dirty_pages_pct variable to change how much of the buffer pool can be marked dirty
  • innodb_buffer_pool_wait_free in "show global status" means innodb is spending time waiting for free pages in the buffer pool
  • when you write to a file you are writing to a buffer; the buffer exists in your process; flushing a buffer moves it to the os cache; anything flushed will survive the process crashing; data synced will survive power outage
  • by default, innodb_flush_log_at_trx_commit=1. this is really expensive, which is why it's important to have a battery backed write cache or ssd. This is the only mode in which innodb is fully ACID compliant
  • if we have innodb_flush_log_at_trx_commit=2 then on every commit innodb writes the log buffer to the log file (the os cache); once each second innodb syncs the log to disk
  • if you run this on the master you could have slaves receiving rows that were actually rolled back on the master; manifests itself as a "duplicate key error" on all slaves
  • innodb_flush_log_at_trx_commit=0 is bad times
  • some synchronization thing called XA: http://dev.mysql.com/doc/refman/5.1/en/xa.html
  • everything is stored in innodb table space
  • by default, tablespace is one contiguous logical file called ibdata1; you can instruct it to be multiple physical files, but only the last one can be autoextend
  • every page is filled 15/16 for innodb, no way to configure it (called the "fill factor" in oracle)
  • ext3 bad for mysql, xfs better, due to fsync behavior
  • using raw partitions is "here be dragons" and makes backups really fucking hard
  • you can configure --innodb-file-per-table to have one physical file per table, but some things are still stored in the main tablespace
  • per table is usually better since you can reclaim space whenever you TRUNCATE/DROP a table, and therefore reduce backup size
  • some of the new options like compression and off-page storage of blobs from mysql 5.5 only work with file per table
  • cons:
  • uses more file descriptors
  • need to tune something
  • may be more fragmentation
  • complicated internal synchronization when starting up, so starting up mysql can take a long time (but would probably only be an issue if you had like 50k or 100k tables)
  • but pro: works better with ext3 and O_DIRECT
  • as well as table data, table space contains:
  • tablespace header
  • innodb-specific data dictionary
  • undo information
  • the doublewrite buffer
  • the insert buffer
  • you can turn on/off file-per-table, but existing tables will remain how they were (in ibdata1 or in separate file) until you do alter table or optimize table
  • most internal allocations are done using extents
  • innodb_autoextend_increment affects how big to expand the tablespace at a time
  • everything fits into pages, which are 16 KiB
  • a data page has a header, rows, and a trailer
  • rows are normally in variable width format, there is a fixed width format but it is very rarely used
  • header says what kind (data or index), has a checksum, the LSN, the last value of the LSN as flushed to disk
  • some things in the trailer: part of the lsn, checksum
  • checksum is always checked when reading page to innodb, if checksum is incorrect then innodb will crash (unless using percona server)
  • checksums are updated when writing pages back to disk
  • xtradb has fast checksum option for super fast disks (e.g. fusion i/o), but changes xtradb compatibility
  • data row format is like:
  • transaction id (for when row was created)
  • roll pointers
  • field pointers
  • fields
  • most important things: transaction id, storage of pointer to older version of row in the UNDO space
  • exact format is different -- in 5.0 there is a COMPACT row format where null values are stored more efficiently
  • all fields are stored inline except BLOBs, which are a special case
  • rows cannot be split across pages
  • innodb sets a restriction that two rows must fit in one page
  • so a single row cannot be more than 8 KiB
  • this can cause VARCHAR to move to off-page locations
  • 20 byte pointer to overflow page for VARCHAR/TEXT/BLOB for case where blob does not fit on the same page as the row
  • for lots of internal buffers (including buffer pool), space allocated for varchar or whatever is based on the maximum possible size of that field
  • an 800 KB row can take up to 1.5 MB on disk for various technical reasons
  • if the whole row fits in ~8000 bytes then the blob is always stored on the page, otherwise just the first 768 bytes (antelope) or a 20 byte pointer (barracuda)
  • what does log file look like? records are 512 byte aligned, does not store a complete change, just the parts that changed, i.e. Space ID + Page ID + Offset + Payload
  • undo information is stored in the rollback segment
  • xtradb supports multiple, innodb just supports one
  • the rollback segment is always in the global tablespace
  • if you get error message about too many open transactions then you ran out of undo slots, you should increase number of rollback segments
  • besides buffer pool, in memory:
  • meta data for accessed tables (never released, happens whenever any table is opened), a.k.a. data dictionary
  • lock information
  • adaptive hash index
  • sql statements
  • thread stacks
  • innodb_use_sys_malloc (default 1) disables the additional memory pool
  • regardless of built-in or plugin, leaving innodb_additional_mem_pool_size at the default setting is usually fine
  • best thing evar: --innodb-buffer-pool-size
  • you may be surprised at how much space insert buffer and undo buffers take up!
  • warning: leave space for metadata, it always consumes an additional 5-10% from the buffer pool
  • some things like binary logs and relay logs depend on OS caching, so don't use all memory
  • you can change innodb_max_dirty_pages_pct
  • with innodb plugin you can set io capacity to also affect how quickly dirty pages are flushed
  • show innodb status shows how many "modified db pages"
  • for blobs maybe tweak innodb_log_buffer_size to be as big as the largest blob you ever expect to write
  • if innodb_log_waits in GLOBAL STATUS is high then the log buffer isn't big enough
  • locks are held for the duration of a transaction
  • for updates/deletes, more rows than you may expect may end up being locked
  • "index level locking" more apt name than "row level locking"
  • mysql creates next-key locks to prevent inserts from happening on slaves when statements are committed out of order
  • two common lock types: shared and exclusive
  • an S lock never occurs for reads because of MVCC; S locks are normally only visible with foreign key constraints
  • in innodb everything is an index
  • data is stored in a clustered index organized by the primary key; in the absence of a primary key, the first unique not null key is selected
  • other indexes are stored as secondary indexes
  • clustered index means that data is stored alongside index, in myisam the index just stores a pointer to the actual data
  • in innodb primary key lookup is faster than myisam, but secondary index is slower: secondary index points to primary key, so need to do two index lookups
  • all access to the data is always done by primary key
  • once mysql uses a RANGE, it cannot use anything else in the index (e.g. an index on (date, customer_id) where usually date is more selective, but if you ever want to do a query by customer_id and a date range then you're screwed; see the SQL sketch after this list)
  • just says don't use UUIDs for primary key on innodb
  • never piggy-back the primary key index into the end of a composite index or covering index - it is already included for free
  • mk-duplicate-key-checker will find this
  • innodb has an adaptive hash
  • if you have a workload that has "hotspots" then innodb can internally build a hash index for values that you access frequently. this hash index exists only in memory, and there's no control over it from the end-user standpoint.
  • the hash index does not cover the whole index (out of time)
  • for insert buffer, when you make multiple updates to a single page, they can be buffered into the insert buffer, and then data on disk is merged with the in-memory insert buffers
  • mysql 5.5 also has this for deletes
  • may want to disable for ssd
  • can measure efficiency for innodb 5.1 and later with show innodb status
  • double write buffer helps to protect against partially written/corrupt pages
  • always 2MB, two sets of 64 pages
  • you can turn the double write buffer off if you have a transactional filesystem like ZFS
  • adaptive flushing is in mysql 5.5
  • by default 200 iops (assumes raid 10 and 4 disks), change with innodb_io_capacity
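
A sketch of the composite-index range limitation mentioned earlier (the item about (date, customer_id)); the table and queries are hypothetical:

    CREATE TABLE orders (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        customer_id INT UNSIGNED NOT NULL,
        order_date  DATE NOT NULL,
        total       DECIMAL(10,2) NOT NULL,
        KEY idx_date_customer (order_date, customer_id),
        KEY idx_customer_date (customer_id, order_date)
    ) ENGINE=InnoDB;

    -- once the optimizer uses the range on order_date in idx_date_customer,
    -- it cannot also descend by customer_id inside that index
    EXPLAIN SELECT * FROM orders
     WHERE order_date BETWEEN '2011-01-01' AND '2011-02-01'
       AND customer_id = 42;

    -- with the column order reversed, the equality on customer_id comes
    -- first and the date range is still usable within the same index
    EXPLAIN SELECT * FROM orders
     WHERE customer_id = 42
       AND order_date BETWEEN '2011-01-01' AND '2011-02-01';

    -- note: the primary key never needs to be appended to a secondary index;
    -- innodb secondary indexes already carry the primary key internally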

Building a MySQL HA Solution on EC2

  • this is a more generic solution, not a MySQL-specific solution; and there are other ways of doing this
  • speaker (yves trudeau) is a former mysql and sun consultant, has a phd, english is not his first language
  • the plan
  • the challenges
  • the solution
  • building the solution
  • other solutions
  • challenges:
  • cost! people don't want to have a big standby instance since it costs money; different from traditional HA solutions where people have two big servers, one standby and one active;
  • no static ips on the private network, and you have no control over which ip it receives
  • no vip / no {broad,multi}cast
  • the solution

  • there's a small monitoring instance, typically the smallest instance type available

  • there's also a larger instance

  • farm of web or application servers connected to the private ip of the large instance

  • the large instance stores the mysql data on an ebs volume that is used as permanent storage

  • if there is a failure, the small instance will detect the failure and will initiate the failover procedure

  • small instance will create a new large instance, the large instance will mount the EBS volume, and recovery will initiate

  • web/app servers still don't know where to connect

  • small instance will connect to each web/app server and update their configuration to point to the new large instance

  • technology pieces

  • cluster manager

  • unicast communication
  • lack of broadcast/multicast
  • setup cluster partner ip for new host
  • reconfiguring service customers
  • Pacemaker

  • allow port 694 UDP within security group

  • heartbeat or corosync for communication, unicast based

  • scripts type resource on the monitoring host (type anything), on remote end is a no-op script

  • location weight favoring the large instance

  • contents of /etc/ha.cf (left side of screen is cut off):

        autojoin any
        ucast eth0
        warntime 5
        deadtime 15
        initdead 60
        keepalive 2
        crm respawn

  • be careful to set values high, since amazon's network can be laggy which can cause spurious heartbeat failures

  • custom ec2 images that are pre-configured with pacemaker/heartbeat and mysql

  • aws tools; parsing can be difficult --> helper scripts

  • failover steps:

  • detect failure
  • kill the faulty instance
  • free the shared resource
  • launch the new instance
  • reassign shared resource
  • reconfigure heartbeat
  • spread connection info
  • RDS, pros/cons, only mysql
  • continuent
  • rightscale (eip based)
  • scalarium
  • scalebase (beta)
  • others?