The InnoDB Storage Engine for MySQL

<Insert Picture Here>

The InnoDB Storage Engine for MySQL 
Morgan Tocker, MySQL Community Manager 
http://www.tocker.ca/

Safe Harbor Statement
The
following
is
intended
to
outline
our
general
product
direction.

It
is
intended
for
information
purposes
only,
and
may
not
be

incorporated
into
any
contract.
It
is
not
a
commitment
to
deliver

any
material,
code,
or
functionality,
and
should
not
be
relied
upon

in
making
purchasing
decisions.

 

The
development,
release,
and
timing
of
any
features
or

functionality
described
for
Oracle’s
products
remains
at
the
sole

discretion
of
Oracle.

4 Years of MySQL Innovation
MySQL Cluster 7.3

MySQL Workbench 6.0

MySQL Migration Wizard
<Insert Picture Here>

MySQL 5.5

MySQL 5.6

Windows installer & Tools MySQL 5.7 M y S Q L
Cluster
MySQL
MySQL Enterprise Monitor 2.3 & e r
M a n a g 3.0
Applier for
Hadoop

MySQL Enterprise Backup
Security
MySQL Utilities
MySQL Workbench 5.2 & 6.0
Scalability
MySQL Cluster 7.2
HA
MySQL Enterprise
MySQL Cluster 7.1
O r a c l e C e r t i f i c a t i o n s Audit

Hello and Welcome!
• I will be talking about InnoDB’s internal behaviour.
• Not talking (much) about MySQL.
• Aim of this talk is to give you X-ray vision.
• i.e. not so many direct takeaways, but one day it
will help you debug a problem.

Prerequisites

Copyright
©
2012
Oracle
and/or
its
affiliates.
All
rights
reserved.

IO Performance
L1 cache reference
Branch mispredict
L2 cache reference
Mutex lock/unlock
Main memory reference
Compress 1K bytes with Zippy
Send 2K bytes over 1 Gbps network
Read 1 MB sequentially from memory
Round trip within same datacenter
Disk seek
Read 1 MB sequentially from disk
Send packet CA->Netherlands->CA

0.5 ns!
5 ns!
7 ns!
25 ns!
100 ns!
3,000 ns!
20,000 ns!
250,000 ns!
500,000 ns!
10,000,000 ns!
20,000,000 ns!
150,000,000 ns

See: http://www.linux-mag.com/cache/7589/1.html and Google http://
www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf

IO Performance (cont.)
• 5-10ms per disk IO.
• Maybe 50us for a high end SSD.
• Still not “memory speed”.

Buffered IO
• Operating Systems compensate well already.
• Reads are cached with free memory.
• Writes don’t happen instantly.
• A step is introduced to rewrite and merge.
Block 9, 10, 1, 4, 200, 5.

Block 1, 4, 5, 9, 10, 200

fsync

Synopsis
#include <unistd.h>
int fsync(int fd);
int fdatasync(int fd);

!

Description
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by
the file descriptor fd to the disk device (or other permanent storage device) where that file resides. The call blocks
until the device reports that the transfer has completed. It also flushes metadata information associated with the file
(see stat(2)).

Basic
Operation

Copyright
©
2012
Oracle
and/or
its
affiliates.
All
rights
reserved.

Buffer Pool

Data Dictionary Cache

Buffer Pool Flush List

Adaptive Hash Indexes

Log Group
ibdata1
space 0

Storage

Buffer Pool LRU

Page Cache

Additional Mem Pool

Log Buffer

Transaction
System

Caching

InnoDB High Level Overview

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd

In Memory
On Disk

Query (pages not in buffer pool)
mysqld

SELECT * FROM a
WHERE id = 10;

Buffer Pool

Buffer Pool LRU




Page Cache

Log Group
ibdata1
space 0

Storage

InnoDB
Additional Mem Pool

Log Buffer

Transaction
System

Caching

Not Found

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd

Query (pages in buffer pool)
mysqld

SELECT * FROM a
WHERE id = 10;

Buffer Pool




Log Group
ibdata1
space 0

Storage

Buffer Pool LRU

Page Cache

Additional Mem Pool

Log Buffer

Transaction
System

Caching

InnoDB

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd

Update Query in a Transaction (simplified)
mysqld

UPDATE a SET col1 =
‘new’ WHERE id = 10;
commit;

Buffer Pool




Log Group
ibdata1
space 0

Storage

Buffer Pool LRU

Page Cache

Additional Mem Pool

Log Buffer

Transaction
System

Caching

InnoDB

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd

Log files
• Provide recovery.
• Only written to in regular operation.
• Read only required if there is a crash.
• Are rewritten over-and-over again.
• Think of it like a tank tread.

Log files (cont.)
• Are an optimization!
• 512B aligned sequential writes.
• Tablespace writes are 16KiB random writes.
• Tablespace writes to same pages in close time
window can be merged.
• Just need a large enough log file.

Checkpoint (Background Activity)
mysqld

(nothing)

Buffer Pool




Log Group
ibdata1
space 0

Storage

Buffer Pool LRU

Page Cache

Additional Mem Pool

Log Buffer

Transaction
System

Caching

InnoDB

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd

FAQ
• Q: What do we write to the log - is it committed data
only, or can we write uncommitted data as well?
• A: Both.

FAQ
• Q: How do you unapply transactions?
• A: UNDO space. 
Think of it like a hidden table internally stored in
ibdata1.

Update Query (More Accurate*)
mysqld

UPDATE a SET col1 =
commit;

Page Cache

Buffer Pool




Additional Mem Pool

Redirect older version
Update any indexes
Modify row in place
of row to undo space
to hold both versions.
Log Group

ibdata1
space 0

Storage

Buffer Pool LRU

Page Cache

Log Buffer

Transaction
System

Caching

InnoDB

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd

Update Query (cont.)
• Background purge process is able to clean old rows
from UNDO as soon as oldest transaction advances
forward.

Summarized Performance
Characteristics
• Log Files:
• Are short sequential writes.
• They permit InnoDB to delay tablespace writes enabling more merging/optimization.
• Buffer Pool:
• “In memory version of the tablespace”.
• Loading/unloading via modified LRU algorithm.

Index Structure
• Indexes and “data” in InnoDB are B+Trees.
• Clustered Index design means that data itself is
stored in an index.

Index Structure (cont.)

Empty root
Inﬁmum
Level 0
Root

Page 3
Next
Record

Supremum


Insert: 1
Inﬁmum
Level 0
Root

Page size is 16KB
B* is 2KB

Page 3

1
B*

Supremum


Insert: 1 to 7
I
Level 0
Root

S

Page 3

1
B*

2
B*

3
B*

4
B*

5
B*

6
B*

7
B*

Insert: 8
Inﬁmum

Page 3

Level 1
Root

1
4

I
Level 0
Leaf

Allocate new page and link in root
Move records to new page
Split new page

Supremum

S

Page 4

1
B*

2
B*

3
B*

4
B*

5
B*

6
B*

7
B*

Insert: 8 (Cont.)
Inﬁmum
Level 1
Root

1
4

I
Level 0
Leaf

S

Page 4

1
B*

Split at the middle of original page

2
B*

3
B*

Supremum

Page 3

4
5

I

S

Page 5

4
B*

5
B*

6
B*

7
B*

Insert: 9 and 10
Inﬁmum
Level 1
Root

1
4

I
Level 0
Leaf

S

Page 4

1
B*

2
B*

3
B*

Supremum

Page 3

4
5

I

S

Page 5

4
B*

5
B*

6
B*

7
B*

8
B*

9
B*

10
B*

Insert: 11
Inﬁmum
Level 1
Root

4
5

1
4

I
Level 0
Leaf

Page 3

S

Page 4

I

Supremum
11
6

S

Page 5

I

S
Page 6

1
B*

2
B*

3
B*

Insert leads to a split at the insertion point

4
B*

5
B*

6
B*

7
B*

8
B*

9
B*

10
B*

11
B*

Index Structure
Level 2
Root

Inﬁmum

Level 1
Internal

Page 6

0
A

1
B

S

Page 4

≥0
→
6

I
Level 0
Leaf

Next
Record

≥0
→
4

I

S

Supremum

Page 3

≥4
→
5

I

Next Page

≥2
→
7

≥4
→
8

Prev Page

I

Page 7

2
C

3
D

S

I

Page 8

4
E

5
F

S

Page 5

S

≥6
→
9

I

Page 9

6
G

7
H

S

Page Format

0
38

FIL Header (38)
Other headers and page data,
depending on page type.
Total usable space: 16,338 bytes.

16376
16384

FIL Trailer (8)

Row Format

N-5

N-4

N-2
N
N+k
N+k+6
N+k+13
N+k+13+j

Variable ﬁeld lengths (1-2 bytes per var. ﬁeld)
Info Flags (4 bits)
Number of Records Owned (4 bits)
Order (13 bits)
Record Type (3 bits)
Next Record Offset (2)
Cluster Key Fields (k)
Transaction ID (6)
Roll Pointer (7)
Non-Key Fields (j)

Conclusion
• Page is basic unit of storage.
• Default is 16KiB
• Rows of variable length.

Two more useful features
• Adaptive hash - Partial hash index to accelerate
secondary key lookups.
• Change buffering - when non-unique indexes are
not in memory, changes can be temporarily buffered
until they are.

Query (by secondary key)
mysqld

SELECT * FROM a
WHERE b_key = 10;

Buffer Pool




Log Group
ibdata1
space 0

Storage

Buffer Pool LRU

Page Cache

Additional Mem Pool

Log Buffer

Transaction
System

Caching

InnoDB

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd

Update Query (large table)
mysqld

UPDATE a SET col1 =
commit;

Buffer Pool

Caching

InnoDB
Buffer Pool LRU




Page Cache

Additional Mem Pool

Not Required
Log Group
ibdata1
space 0

Storage

Transaction
System

Log Buffer

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd

MySQL 5.5+
•
•
•
•
•
•
•
•
•

IO Scalability
Async IO
Multiple Buffer Pools
Adaptive Flushing
Scan Resistant LRU
Compressed Pages
CPU Scalability
Improved Atomics
Spin Loops with PAUSE

MySQL 5.6+
• LRU Dump and
Restore
• Improved Group
Commit
• Fulltext Search
• Fast Read-only
Transactions
• Memcached Interface
• Information Schema
metadata tables

•
•
•
•

Persistent Statistics
Variable Page Size
Online DDL
Transportable
Tablespace
• Transactional
Replication 
Using InnoDB

MySQL 5.7+
• Faster Temporary Tables
• Index Lock Contention Reduction
• More Online DDL
• Extend VARCHAR
• Rename Index
• Improved Read-Only Transactions
• Improved CPU Scalability

The Top 3
1. innodb-buffer-pool-size
2. innodb-log-file-size
3. innodb_flush_log_at_trx_commit

innodb-buffer-pool-size
• Really only one major buffer/cache settings to set.
• Responsible for all pages types (data, indexes, undo,
insert buffer..)

innodb-buffer-pool-size (cont.)
• Recommendation is 50-80% of RAM.
• Default is 128M of RAM.
• Please allow 5-10% on top for other meta data to
grow.

innodb-log-file-size
• Log files are on disk, but this contributes to how
many unflushed (dirty) pages you can hold in
memory.
• In theory larger log files = longer crash recovery.
• In MySQL 5.5 -2G max.
• In MySQL 5.6 - 4G is usually safe.
• Early versions should be much smaller.
• Default is 48M Log Files.

innodb_flush_log_at_trx_commit
• Default is full ACID Compliance (=1)
• Can be set to 0/2 if you do not mind some data loss.

innodb_flush_log_at_trx_commit
• 0 = Log buffer written + synced once per second.
Nothing done at commit.
• 1 = Log buffer written + synced once per second +
written and synced on commit.
• 2 = Log buffer written + synced once per second +
written (not synced) on commit.
!
2 is a slightly safer version of 0.

Requires about 5-10% of buffer pool
size as overhead (not directly
configurable).

Basic Configuration

Buffer Pool

InnoDB
Buffer Pool LRU


Additional Mem Pool

innodb_log_file_size. Typical
values 256M+. Default of 2 files
(innodb_log_files_in_group).
Log Group

iblogfile0

ibdata1
space 0

innodb_log_buffer_size. Typical
Doublewrite Buffer
values 1-8M. Flushing
IBUF_HEADER
IBUF_TREE
Block 1 (64 pages)
behaviour influenced by
TRX_SYS
Block 2 (64 pages)
FIRST_RSEG
innodb_flush_log_at_trx_commit.
DICT_HDR
Storage



Page Cache

Log Buffer

Transaction
System

Caching

Innodb_buffer_pool_size.
Recommendation is 50-80% RAM.

iblogfile1

iblogfile2

Tables with
file_per_table

Data Dict.
SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

A.ibd
B.ibd
C.ibd

innodb_file_per_table
(Default: ON in 5.6+)

The ~Top 10
1. innodb-buffer-pool-size
2. innodb-log-file-size
3. innodb-log-buffer-size
4. innodb_flush_log_at_trx_commit
5. innodb_flush_method
6. innodb_flush_neighbors
7. innodb_io_capacity, innodb_io_capacity_max,
innodb_lru_scan_depth
8. innodb-buffer-pool-instances
9. innodb_read_io_threads and innodb_write_io_threads

Less-likely to need configuration
•
•
•
•
•

innodb_thread_concurrency
innodb_concurrency_tickets
innodb_max_pct_dirty_pages
innodb_use_native_aio (always on)
innodb_old_blocks_time (5.6 default: 1000)

Deprecated Settings
• Typically “remove on sight” from config files:
• innodb_additional_mempool_size
• innodb_use_sys_malloc

Credits
• InnoDB Architecture Diagrams via https://github.com/
jeremycole/innodb_diagrams
• Available under (3-clause) BSD license 
Copyright (c) 2013, Twitter, Inc. 
Copyright (c) 2013, Jeremy Cole <jeremy@jcole.us> 
Copyright (c) 2013, Davi Arnaut
<darnaut@gmail.com>

The InnoDB Storage Engine for MySQL

The InnoDB Storage Engine for MySQL

More Related Content

What's hot

What's hot (20)

Similar to The InnoDB Storage Engine for MySQL

Similar to The InnoDB Storage Engine for MySQL (20)

More from Morgan Tocker

More from Morgan Tocker (20)

The InnoDB Storage Engine for MySQL