Benchmarking Tungsten Parallel Replication

Tungsten parallel apply on slaves, or parallel replication for short, has been available for about a year.   Until recently we did not have many formal benchmarks of its performance.  Fortunately the excellent Percona Live Conference in London accepted my talk on Tungsten parallel replication (slides available here), so Giuseppe Maxia and I finally allocated a block of time for systematic performance testing.

In a nutshell, the results were quite good. In the best cases Tungsten parallel apply out-performs single-threaded native replication by about 4.5 to 1.  Both Giuseppe and I have verified this using slightly different test methodologies, which helps avoid dumb counting mistakes.  Our results also match field tests at a customer site over the previous summer, so we regard them as fairly robust.  In the remainder of this article I would like to expand a bit on the details of the benchmarks as well as the results.  The results shown here are from my tests.

Benchmark Test Design

Both Giuseppe and I used a similar testbed for replication testing:    
  • HP Proliant server, dual Xeon L5520 CPUs with hyper-threading enabled, 72Gb of RAM
  • 1TB HP Smart Array RAID 1+0 
  • Centos 5.6
  • XFS file system
  • MySQL 5.1.57 with InnoDB buffer pool set to 10Gb and using O_DIRECT purge method 
  • Tungsten Replicator 2.0.5 build 347  
For convenience we use MySQL sandbox to set up a master with two slaves, as shown in the following diagram.  It turns out that for measuring replication throughput there is no reason to set up on separate hosts, as the master does little or nothing during the test and we only operate one slave at a time.  



The Tungsten Slave is configured as described in a previous article in this blog, except that there are 30 channels instead of 10.   The exact installation command is given at the end of this article.

The test run uses sysbench to spread transactions evenly across 30 databases of identical size, then measure time to process them.  This is also known as a replication catch-up test. 
  1. Load all MySQL servers with an identical dataset consisting of 30 databases pre-populated with data from sysbench.  Giuseppe wrote a new tool called Large Data Generator that is very helpful for capturing and loading such datasets.  
  2. With the slaves shut down, store the master binlog start position and then run 30 sysbench oltp test processes against the master to update and read from all schemas simultaneously for one hour.  
  3. Start the MySQL slave from the stored master binlog position and measure time to process the sysbench transactions. Shut down the MySQL slave at the end of the test. 
  4. Start the Tungsten slave from the stored master binlog position and measure time to process the sysbench transactions using Tungsten Replicator with 30 channels (i.e. threads). 
Test Results

Database performance is substantially different depending on whether data are fully resident in the buffer pool (cache-resident) or largely read from disk (I/O-bound).  Tungsten parallel replication over 30 databases varies from 1.8 to 4.5 depending on which case you look at, as shown in the following table.   Processing times are in minutes (m).

Test Scenario
Rows/Db
Data Size
MySQL Slave
Tungsten Slave
Ratio
Cache-resident
10K
430Mb
30m 
17m
1.8
I/O-Bound
10M
68Gb
228m
51m
4.5

Let's look at the results in detail.  In the cache-resident test the base dataset is relatively small and loads fully into the buffer cache within a minute or two.  Both MySQL and Tungsten slaves complete in well under an hour.  Here is a graph showing throughput as measured in bytes of binlog processed per 10 second increment.  

Cache-Resident Slave Catch-Up - MySQL vs. Tungsten Replicator, 30 Databases
In the cache-resident case there are virtually no reads from disk as all data are fully resident in the InnoDB buffer pool.  Tungsten Replicator is faster because multiple writes can occur in parallel but the speed-up versus native replication is not especially large.  Note that Tungsten processes around 40Mb every 10 seconds or about 1Gb of binlog every four minutes.

With I/O bound workloads, on the other hand, we see a profound difference in performance.  Tungsten Replicator is at least 6x slower than in the cache-resident case, but still processes updates faster than the master (51 minutes on the slave vs. 60 minutes on the master).   Buffer cache loading is correspondingly fast and Tungsten reaches steady-state performance within about 20 minutes.  MySQL native replication on the other hand is far slower. The slave not only does not catch up, but it would quickly lag far behind  the master under this workload.  It takes about 90 minutes for native replication even to achieve steady state performance after buffer pool loading.
I/O-Bound Slave Catch-Up - MySQL vs. Tungsten Replicator, 30 Databases
Overall we can safely say that single-threaded native replication is likely non-workable in the I/O-bound case without going to some combination of SSDs and/or slave pre-fetch.  


Further Improvements and Caveats

The current results of parallel replication benchmarks on Tungsten are gratifying especially when you consider that two years ago Tungsten Replicator performance was around 10% of the speed of MySQL replication.  Nevertheless, these benchmarks are not the final word.  It is clear there is room for optimization as we observe that Tungsten processes the cache-bound binlog at least 6 times faster than the I/O bound workload.  Much of the difference seems to be time spent reading from disk.   If this could be improved, Tungsten would go even faster.

During the London conference Yoshinori Matsunobu published some excellent performance results using slave pre-fetch, which has encouraged us to build pre-fetch into Tungsten as well.   I am curious to see if we can further boost throughput by adding pre-fetching on each parallel thread, though other people at the conference such as Domas Mituzas were not optimistic.  Either way, I am certain we will improve performance, if not using pre-fetch then with other tricks like batching inserts.

Finally, some caveats.  Our sysbench load is nice because it is evenly distributed across schemas of exactly the same size.  Most application workloads do not behave this way, though some do come very close.  The slides for my talk discuss practical issues in maximizing performance in real applications.  I suspect that a combination of parallelization with pre-fetch will in fact turn out to be a very good solution for a wide variety of workloads.

Fine Print

If you would like to repeat our results (or attack them as fraudulent), here are some parameters that may help.  The database settings in the MySQL sandbox instances are as follows:
default-storage-engine=InnoDB
innodb-additional-mem-pool-size=100M
innodb-flush-method=O_DIRECT
innodb-log-buffer-size=4M
innodb-log-file-size=50M
innodb-thread-concurrency=0
innodb_buffer_pool_size=10G
innodb_file_format=barracuda
innodb_file_per_table=1
innodb_flush_log_at_trx_commit=2
innodb_strict_mode=1log-bin=mysql-binmax-connections=500
max_allowed_packet=48M
skip_slave_start
sync_binlog=0

Next, here is the sysbench command used to generate load on each schema.  We run 30 of these simultaneously varying the database name for each invocation.  This example is for the I/O-bound case.

sysbench --test=oltp --db-driver=mysql --mysql-db=${db} \  --mysql-user=msandbox --mysql-password=msandbox \
  --mysql-host=127.0.0.1 --mysql-port=33306 \
  --oltp-read-only=off --oltp-table-size=10000000 \
  --oltp-index-updates=4 --oltp-non-index-updates=2 \
  --max-requests=200000 \
  --max-time=3600 --num-threads=5 run

The replicator configuration is given in the slides for the talk, but here it is again.   Options in red are required for sandboxes.  Production installations are therefore simpler than what is shown here.

tools/tungsten-installer tools/tungsten-installer --direct -a \
  --service-name=parallel --native-slave-takeover \
  --master-host=127.0.0.1 --master-port=33306  \
  --master-user=msandbox --master-password=msandbox  \
  --slave-host=127.0.0.1 --slave-port=33307  \
  --slave-user=msandbox --slave-password=msandbox  \
  --home-directory=/opt/continuent \
  --property=replicator.store.parallel-queue.maxOfflineInterval=5 \
  --svc-parallelization-type=disk --buffer-size=100 \
  --channels=30 --thl-port=2115 --rmi-port=10010 \
  --skip-validation-check=MySQLPermissionsCheck \
  --skip-validation-check=MySQLApplierServerIDCheck \
  --start-and-report

To equal the results shown above you will also need to assign databases explicitly to channels in the shard.list file.   Otherwise, databases will be assigned  channels using a hashing function, which tends to result in somewhat uneven distributions.  Look in the comments of the shard.list file for instructions on how to do this. 

Finally, all of our tests depend on two excellent tools from Giuseppe Maxia:  MySQL Sandbox and the new Large Data Generator program in the Tungsten Toolbox.  Once you get the hang of them you will become completely addicted as they make test setup both reliable as well as quick.