morgo/percona.live.notes.md

## percona.live.notes.md

      
    Raw
  

              percona.live.notes.md
            
          
    Percona Server with XtraDB

Introduction


2010(2009?) InnoDB 1.1 announced, will first ship with MySQL 5.5.
Historically InnoDB development lags while MySQL is trying to GA
lots of things fixed in InnoDB since MySQL 5.0
important note -- plugin version of InnoDB is not the default in 5.1
percona server uses innodb plugin by default
xtradb is a fork of the innodb plugin
the on-disk format of xtradb does not change by default, you can switch back
and forth; some extra features require changing the on-disk format, however
percona xtradb releases closely tied to upstream mysql releases
release model is short release cycle, rebased against new mysql releases
historically there is an average of a new release every 1-2 months

How InnoDB Works


in general, operating system has less opportunity to merge/reorder innodb writes, since InnoDB keeps durability.  InnoDB itself must change it's algorithms to produce more sequential IO.
innodb uses fsync, myisam does not
for a basic select, first innodb queries the buffer pool, then the tablespace
for a basic update, first buffer pool is updated, then the log file is
writen to, and the update to the tablespace is delayed as part of a background thread.
optimization here for updates is based around the theory that sequential
updates to log files is cheap
logs are only used during recovery, to figure out which pages need to be
evicted there is a flush list and LRU, both of which are in memory
background process for syncing dirty pages is normally referred to as a
"checkpoint"
innodb buffer size can be increased to cause innodb to flush to disk less
frequently (and store more data in the logs).  Log file size must also be larger enough that it can hold all the pending changes.
on a commit, only log files need to be synced, nothing else

Percona Server with XtraDB


Fast Index Creation (inherited from the plugin)


adding an index just adds the index, doesn't rebuild the table
same with dropping index
does not work for utf-8 (bug #33650) or primary key


I/O scalability (via plugin)


--innodb_io_capacity, tell mysql how many io operations per second your
disk is capable of; 100 iops is typical for a 7200rpm disk
--innodb_read_io_threads and --innodb_write_io_threads -- tell innodb how
many threads to use for reading and writing


CPU scalability (via plugin)


mostly consists of 3-4 patches that transparently make innodb work better with multi-core CPUs
lots of x-locks converted to read/write-locks
lots of locks broken into more fine-grained locks


Adapting Flushing (invented by percona server - also in InnoDB plugin)


teaches mysql to do background work more aggressively when log space is about
to run out, so that TPM (transactions per minute) doesn't suddenly drop right
when a log runs out of free space.
see also: adaptive checkpointing, which is similar but different


Fast Crash Recovery (invented by percona; current implimentation independently written by InnoDB team)


see mysql bug #29847, related to innodb recovery being slow (e.g. a larger
buffer pool can increase innodb recovery time)
in many cases 10x faster than builtin


Improved Buffer Pool Scalability


buffer pool mutex becomes four mutexes: flush list, lru, free, and hash mutexes


Data Dictionary control


once an innodb table is open it's never freed from the in-memory data dictionary
adds --innodb_dict_size_limit


Undo Slots


by default only 1024 transactions can be open at once, some statements require 2 undo slots


in xtradb it's expanded to up to 4072 with --innodb_extra_undoslots=1
Rollback Segments


makes it possible to have more htan one rollback segment, and also reduces mutex convention on the rollback segment
this is also fixed upstream in mysql 5.5


Fast Checksums


binary format incompatible


Different Page Sizes


instead of 16k, you can have 4k or 8k page sizes (probably useful only for older systems)


Separate Purge Thread


cleans up long history list length faster


speaker thinks that the best features are usability improvements, not
performance enhancements


ability to show buffer pool contents


ability to save contents of the buffer pool (actually saves offsets in the
innodb files instead of data contents, so 8 bytes not 16k bytes for each
thing); can help with restarting mysql faster; stores to file called
'ib_lru_dump'; it's possible to dump the lru from one server and load it on
another server, but be super careful


transactional replication (also available in mysql 5.5). [by default] slaves store position
in relay-log.info, which isn't fsynced.  Transactional replication stores these coordinates in an InnoDB table to make slave crashe proof.


import/export tables, makes it easy to just restore one table at a time from
the ibd files; import/export still has to be done with xtrabackup, however.  Destination must be percona server, source can be MySQL/Percona.


better handling of corrupt tables, a corrupt table (i.e. totally bogus data)
won't crash mysql, instead the table is just marked as corrupt and the server
continues starting up


the slow query log for percona has hella more stuff if you increase verbosity;
mk-query-digest also understands the new percona columns


user statistics, new tables that show up in the information_schema
database. for instance, allows you to find indexes that are unused


xtrabackup can report on fragmentation of indexes (runs very slowly, you
should do offline); if an index is super fragmented, you can drop and re-add
the index


Improvements To Performance in MySQL 5.5


main areas of improvement: sl level, innodb, windows, performance schema

SQL Level


LOCK_open fixed, used in a lot of places where there are global locks
MySQL 5.5 introduces MDL (Meta Data Locking) subsystem; optimizes for both
number of operations, and reduces contention
smaller fixes


LOCK_alarm mutex, replaced by posix-specific apis where available
LOCK_thread_count mutex reduced usage to connect/disconnect from 1-2 times
per query
THR_LOCK_charset mutex, hotspot for setting up character sets, completely
removed

InnoDB Optimizations


improved log sys mutex; had been used for both log records and lsn, and also
something about mtr
this has been changed to 2 separate mutexes, but contention can still be
observed for some heavy log write workloads
separate flush_list mutex, so that the flush list mutex is different now from
the buffer pool mutex; the buffer pool mutex is very hot
ability to create multiple buffer pools, using innodb_buffer_pool_instances;
however, no ability for users to specify how to map table/index constructs to
buffer pools
some internal operations need to query all buffer pools, so for certain things
internal overhead can be increased
multiple rollback segments


now there are 128 segmetns, giving up to 128k transactions
innodb_fast_shutdown=0 needed for this change


asynchronous i/o on linux


for many years only windows was using native async i/o
on linux it was simulated using helper threads
can make crash recovery up to 1.5x faster
assumes usage of O_DIRECT (which is good to use anyway)


change buffering


insert buffer has existed for a long time, speeds up inserts for i/o bound
workloads
however, updates and deletes were unable to use this feature
mysql 5.5 supports "delete mark" and "purge" change buffering
it handles update case as well as delete because secondary indexes are never
updated in place
up to 40x performance for large deletes


separate purge thread


before mysql 5.5 purging of old records is done by the main thread
use innodb_purge_threads=1 to use separate purge thread


adaptive flushing


not a new feature in mysql 5.5, but was updated to deal with multiple bufer
pools, and got lots of small tunings
mysql 5.5 should be a lot "smoother" on many workloads
will probably continue to improve as mysql 5.5 matures, as the mysql team is
still looking at it


performance schema


includes many innodb probes
can be used by developers to better understand sources of performance
problems


is mysql 5.5 faster? peter says no it's not across the board, but there are no
big specific regressions for now


remaining issues:


per index lock; use partitioning as partial workaround
kernel mutex; seems to be a big focus for innodb in mysql 5.6
dropping tablespace with innodb file per table
log mutex
group commit


peter notes that probably mysql 5.5 is generally better than percona server
5.1

InnoDB Internals


innodb plugin is default in 5.5


improved cpu scalability, fast index creation, buffer pool tablescan
resistance, fast crash recovery, multiple buffer pools


5.5 plugin is innodb 1.1.x, 5.1 plugin is 1.0.x


communicates via the row-based relationship


foreign keys handled by engine backend


features: crash save, mvcc, full index reading, change buffering


new plugin supports compression and full hot backup support


commands:


SHOW ENGINE INNODB STATUS
SHOW ENGINE INNODB MUTEX

Show Engine Innodb Status


make sure that the averages calculated time is at least 20-30 seconds, you
have no control over what value you get
rollback is determined based on which transaction has the least number of
"rollback" operations to do
whenever you see two numbers together, they could be really representing one
64-bit integer (all ints are 32 bit in output)
note to self: read more about innodb concurrency tickets
http://everythingisdata.wordpress.com/2009/10/17/numbers-everyone-should-know/
make sure you have noatime enabled, so the redo logs aren't constantly having
their atime update
data in tablespace is organized at page level, not row level
communication always happens with the buffer pool before it hits the
tablespace
on a wite, buffer pool is marked dirty, the log file is updated, and the log
file is fsynced
so: pages in the tablespace are written in a delayed manner
logs are only read from when doing a recovery


contrast to oracle which uses log files for reading and doing fancy things
like replication


log contains two kinds of data: the flush list, and the LRU
all log activities are assign a LSN (log sequence number), which is
monotonically increasing
LSN value is stored in tablespace after each checkpoint
oracle calls their logs "redo logs", but it meanst the same thing
mysql does fuzzy checkpointing, because after a checkpoint there may still
be dirty pages in the buffer file
outside of the buffer pool, changing innodb_log_file_size is one of the most
important settings, to smooth out background I/O for longer
by default up to 90% of the data in the buffer pool can be dirty
if log files are larger than the buffer pool, you're going to be fucked when
doing a recovery
you can also change innodb_flush_log_at_trx_commit to 0 or 2 to reduce
durability (good for slaves!)
innodb_log_buffer_size is useful for writing big TEXT/BLOB changes


look for innodb log weights going up over time


tuning log buffer time is a tradeoff between performance and recovery time;
tuning this must be contingent on your requirements for recovery time
these numbers are show in show innodb status:


log sequence number
log flushed up to
last checkpoint at


you can track the amount of log you're generating (e.g. per hour) by tracking
the delta of the LSN


status variables in percona server also show this


rollback segments are used to get consistent read view, and also used to
rollback uncommitted changes during recovery
innodb_read_io_threads changes how many threads the buffer pool can use to
read data from tablespace
innodb_max_dirty_page_pct variable to change how much of the buffer pool can
be marked dirty
innodb_buffer_pool_wait_free in "show global status" means innodb is
spending time waiting for free pages in the buffer pool
when you write to a file you are writing to a buffer; the buffer exists in
your pocess; flushing a buffer moes it to the os cache; anything flushed will
survive the process crashing; data synced will survive power outage
by default, innodb_flush_log_at_trx_commit=1. this is really expensive,
which is why it's important to have a battery backed write cache or ssd. This
is the only mode in which innodb is fully ACID compliant
if we have innodb_flush_log_at_trx_commit=2 then on every commit (or once
each second), innodb will flush to the innodb log buffer; every second innodb
will sync the log buffer to disk.


if you run on the master you could have slaves receiving rows that were
actually rolled back on the master; manifests itself as a "duplicate key
error" on all slaves


innodb_flush_log_at_trx_commit=0 is bad times
some synchronization thing called XA: http://dev.mysql.com/doc/refman/5.1/en/xa.html
everything is stored in innodb table space
by default, tablespace is one contiguous logical file called ibdata1; you
can instruct it to be multiple physical files, but only the last one can be
autoextend
every page is filled 15/16 for innodb, no way to configure it (called the
"fill factor" in oracle)
ext3 bad for mysql, xfs better, due to fsync behavior
using raw partitions is "here be dragons" and makes backups really fucking
hard
you can configure --inodb-file-per-table to have one physical file per
table, but some things are still stored in the main in the main thing
per table is usually better since you can reclaim space whenever you
TRUNCATE/DROP a table, and therefore reduce back size
some of the new options like compression and off-page storage of logs from
mysql 5.5 only work with file per table
cons:


uses more file descriptors
need to tune something
may frbe more gragmentation
complicated internal synchronization when starting up, so starting up mysql
can take a long time (but would probably on be an issue if you had like 50k
or 100k tables)
but pro: works better with ext3 and O_DIRECT


as well as table data, table space contains:


tablespace header
innodb-specific data dictionary
undo information
the doublewrite buffer
the insert buffer


you can turn on/off file-per-table, but existing tables will remain how they
were (in ibdata1 or in separate file) until you do alter table or optimize
table
most internal allocations are done using extents
innodb_autoextend_increment affects how big to expand the tablespace at a
time
everything fits into pages, which are 16 Kib
a data page has a header, rows, and a trailer


rows are normally in variable width format, there is a fixed width format but
it is very rarely used
header sys what kind (data or index), has a checksum, the LSN, the last value
of the LSN as flushed to disk
some things in the trailer: part of the lsn, checksum


checksum is always checked when reading page to innodb, if checksum is
incorrect then innodb will crash (unless using percona server)
checksums are updated when writing pages back to disk


xtradb has fast checksum option for super fast disks (e.g. fusion i/o), but
changes xtradb compatibility


data row format is like:


transaction id (for when row was created)
roll pointers
field pointers
fields


most important things: transactoin id, storage of pointer to older version of
row in the UNDO space
exact format is different -- in 5.0 there is a COMPACT row format where null
values are stored more efficiently
all field are stored inline except Blobs, which are a special case
rows cannot be split across pages
innodb sets a restriction that two rows must fit in one page


so a single row cannot be more than 8 KiB
this can cause VARCHAR to move to off-page locations


20 byte pointer to overflow page for VARCHAR/TEXT/BLOB for case where
blob does not fit on the same page as the row
for lots of internal buffers (including buffer pool), space allocated for
varchar or whatever is based on the maximum possible size of that field
800 Kb row can take up to 1.5M on disk for various technical reasons
if the whole row firts in ~8000 bytes then the blob is always stored on the
page, otherwise just first 768 bytes (antelope) or 20 byte pointer (barracuda)
what does log file look like? records are 512 byte aligned, does not store a
complete change, just the parts that changed, i.e. Space ID + Page ID + Offset + Payload
undo information is stored in the rollback segment


xtradb supports multiple, innodb just supports one


the rollback segment is always in the global tablesapce
if you get error message about too many open transactions then you ran out of
undo slots, you should increase number of rollback segments
besides buffer pool, in memory:


meta data for accessed tables (never released, happens whenever any table is
copied in), a.k.a. data dictionary
lock information
adaptive hash index
sql statements
thread stacks


innodb_use_sys_malloc (default 1) disavles the additional memory pool
regardless of built-in or plugin, leaving innodb_additional_mem_pool_size at
the default setting is usually fine
best thing evar: --innodb-buffer-pool-size


you may be surprised at how much space insert buffer and undo buffers take
up!


warning: leave space for metadata, alwasy consumes an additional 5-10% from buffer pool
some things like binary logs and relay logs depend on OS caching, so don't use
all memory
you can change innodb_max_pct_dirty_pages


with innodb plugin you can set io capacity to also affect how quickly dirty
pages are flushed
show innodb status shows how many "modified db pages"


for blobs maybe tweak innodeb_log_buffer_size to be as big as the largest
blob you ever expect to write
if innodb_log_waits in GLOBAL STATUS is high then the log buffer isn't big
enough
locks are held for the duration of a transaction
for updates/deletes, more rows than you may expect may end up being locked
"index level locking" more apt name than "row level locking"
mysql creates next-key locks to prevent inserts from happening on slaves when
statements are committed out of order
two common lock types: shared and exclusive
an S lock never ocfurs for reads because of MVCC; S locks are normally only
visible with foreign key constraints
in inndob everything is an index


data is stored in a clustered index organized by the primary key; in the
absence of a primary key, the first unique not null key is selected
other indexes are stored as secondary indexes


clustered index means that data is stored alongside index, in myisam the index
just stores a pointer to the actual data
in innodb primary key lookup is faster than myisam, but secondary index is
slower: secondary index points to primary key, so need to do two index lookups
all access to the data is always done by primary key
once mysql uses a RANGE, it cannot use anything else in the index
(e.g. example of (date, customer_id) where usually date is more selective,
but if you ever want to do a query by customer_id and a date range then
you're screwed)
just says don't use UUIDs for primary key on innodb
never piggy-back the primary key index into the end of a composite index or
covering index - it is already included for free


mk-duplicate-key-checker will find this


innodb has an adaptive hash
if you have a workload that has "hotspots" then innodb can internally build a
hash index for values that you access frequently. this hash index exists only
in memory, and there's no control over it from the end-user standpoint.


the hash index does not cover the whole index
(out of time)


for insert buffer, when you make multiple updates to a single page, they can
be buffered into the insert buffer, and then data on disk is merged with the
in-memroy insert buffers
mysql 5.5 also has this for deletes
may want to disable for ssd
can measure efficiency for innodb 5.1 and later with show innodb status
double write buffer helps to protect against partially written/corrupt pages
always 2MB, two sets of 64 pages
you can turn the double write buffer off if you have a transactional
filesystem like ZFS
adaptive flushing is in mysql 5.5
by default 200 iops (assumes raid 10 and 4 disks), change with
innodb_io_capacity

Building a MySQL HA Solution on EC2


this is a more generic solution, not MySQL solution; and there are other ways
of doing this
speaker (yves trudeau) is a former mysql and sun consultant, has a phd,
english is not his first language
the plan


the challenges
the solution
building the solution
other solutions


challenges:


cost! people don't want to have a big standby instance since it costs money;
different from traditional HA solutions where people have two big servers,
one standby and one active;
no static ips on the private network, and you have no control over which ip
it receives
no vip / no {broad,multi}cast


the solution


there's a small monitoring instance, typically the smallest instance type
available


there's also a larger instance


farm of web or application servers connected to the private ip of the large
instance


the large instance stores the mysql data on an ebs volume that is used as
permanent storage


if there is a failure, the small instance will detect the failure and will
initiate the failover procedure


small instance will create a new large instance, the large instance will mount
the EBS volume, and recovery will initiate


web/app servers still don't know where to connect


small instance will connect to each web/app server and update their
configuration to point to the new large instance


technology pieces


cluster manager


pacemaker, other could work too http://www.linux-ha.org/wiki/Pacemaker


unicast communication


lack of broadcast/multicast
setup cluster partner ip for new host
reconvfiguring service customers


Pacemaker


allow port 694 UDP within security group


heartbeart or corosync for communication, unicast based


scripts type resource on the monitoring host (type anything), on remote end
is a no-op script


localiztion weight favoring the large instance


contents of /etc/ha.cf (left side of screen is cut off)
autojoin any
ucast eth0
warntime 5
deadtime 15
initdead 60
keepalive 2
crm respawn


be careful to set values high, since amazon's network can be laggy which can
cause spurious heartbeat failures


custom ec2 images that are pre-configured with pacemaker/heartbeat and mysql


aws tools; parsing can be difficult --> helper scripts


failover steps:


detect failure
kill the faulty instance
free the shared resource
launch the new instance
reassign shared resource
reconfigure heartbeat
spread connection info


ec2-run-instance has an interesting feature for user data script which is
run by root on startup; used to seed the new instance with cluster parnter ip


don't use pacemaker to move resources because of some weird synchronization
thing


be careful about timing


series of blog posts at http://www.mysqlperformanceblog.com/2010/06/17/high-availability-for-mysql-on-amazon-ec2-part-1-intro/


other solutions:


RDS, pros/cons, only mysql
continuent
rightscale (eip based)
scalarium
scalebase (beta)
others?