Let's start by picking a representative software-as-a-service (SaaS) application: call center automation. Call center software integrates with a local PBX or VOIP to allow agents to answer and make phone calls for telemarketing campaigns in a systematic and automated way using standard procedures known as "agent scripts." Admins set up agent scripts and define lists of people to call as well as marketing campaigns. Finally and perhaps most importantly, managers receive a wide variety of detailed reports that allow them to optimize current work, examine historical performance, and make predictions about the future for planning purposes. Here is a typical application architecture.
Bear in mind that this is a greatly simplified view. Like most business applications, call center automation systems may contain hundreds of database tables and many types of user services. There are also practical complications that go beyond the application itself. Call center automation is vital to the businesses that use it. Customers want assurance they can continue processing on another site if a SaaS vendor site goes dark. This means we have to think about maintaining applications and data on multiple sites.
The ideal solution for most SaaS vendors would be to have call center data and applications for all customers live on multiple sites at all times. Multiple live sites mean that failover is instantaneous since both applications and database servers are already up and running. Constant update means there is little or no data loss on failure. Customers could connect to the nearest site. Here is a picture of that dream that includes two sites and two customers, Acme Inc. and Pinnacle, Ltd.
This solution has only one problem. It is impossible to build. Readers may nod wisely and say this is because of CAP Theorem limitations, but that somewhat misses the point. Let's say we use a NoSQL DBMS like Cassandra that permits updates in multiple locations and reconciles the data using eventual consistency. However, there's a catch: as we saw above, much of the value of call center automation is in operational queries and reports. That drives you back to an RDBMS with cross-table joins, aggregation functions, referential integrity, and convenient SQL-based report writing tools. For this reason alone, Cassandra is a non-starter for call center automation.
What about a SQL DBMS? MySQL obviously has all the features you need for query-intensive solutions on smallish data sets (e.g. hundreds of millions of rows, not many billions or trillions). The problem is multi-master replication. Updating the same table from two or more places on a LAN is already quite difficult: witness the complexity of Oracle RAC or MySQL Cluster. The problem becomes intractable when you combine complex SQL transactions, referential integrity, and high-latency WAN connections. If you want full SQL semantics you cannot have updates on multiple sites. This is a serious dilemma and not just for call center automation. The same problems or worse affect a multitude of valuable business applications including market automation, credit card processing, customer relationship management (CRM), time/expense tracking, accounting, and many others.
Fortunately we are not really stuck. If we give up some requirements customers do not really want anyway, there is a perfectly good solution that will work for a wide range of problems. Data warehousing architects long ago developed the notion of a system of record. Bill Inmon's classic Building the Data Warehouse defined system of record as follows:
System of record thus meets the original requirement of having data on multiple sites, which was to handle a site failure. We can store data economically using off-the-shelf MySQL. We can update copies within and across sites using master/slave replication. We can shard customer data into independent schemas. The result looks like the following. Acme has a master in San Jose, whereas Pinnacle has its master in New York.
Using the system of record approach simplifies other problems as well. Standard backup and restore techniques still work. If you mess up a customer copy, you re-provision from the master shard. You can implement failover across sites and also fail over locally onto slave copies, which can be complete copies containing data for all customers.
Meanwhile, most users are fine with a single site. Pinnacle is close to New York, which is why the SaaS vendor puts Pinnacle's data there and gives them the New York site DNS for login. It is also possible to run reports on the cross-site copies as well. You can even run full applications provided you forward writes to the system of record, as shown above for Acme.
The real issue in implementing system of record architectures is that existing replication and clustering tools are not quite up to the job of handling cross-site applications build on system of record. We are extending Tungsten to handle some of the obvious problems in building these types of systems using MySQL.
The need for availability is pushing an increasing number of SaaS vendors and other application providers to operate systems across multiple sites. Applications like call center automation depend on the features of SQL and cannot be implemented using NoSQL DBMS's like Cassandra. The system of record architecture eliminates replication conflicts and enables multi-master updates to work on ordinary SQL databases between sites. If you are building complex SQL applications and thinking about going multi-site, this design pattern should be in your toolbox.
Figure 1: Call Center Application Architecture |
The ideal solution for most SaaS vendors would be to have call center data and applications for all customers live on multiple sites at all times. Multiple live sites mean that failover is instantaneous since both applications and database servers are already up and running. Constant update means there is little or no data loss on failure. Customers could connect to the nearest site. Here is a picture of that dream that includes two sites and two customers, Acme Inc. and Pinnacle, Ltd.
Figure 2: Dream Architecture for Call Center Automation |
This solution has only one problem. It is impossible to build. Readers may nod wisely and say this is because of CAP Theorem limitations, but that somewhat misses the point. Let's say we use a NoSQL DBMS like Cassandra that permits updates in multiple locations and reconciles the data using eventual consistency. However, there's a catch: as we saw above, much of the value of call center automation is in operational queries and reports. That drives you back to an RDBMS with cross-table joins, aggregation functions, referential integrity, and convenient SQL-based report writing tools. For this reason alone, Cassandra is a non-starter for call center automation.
What about a SQL DBMS? MySQL obviously has all the features you need for query-intensive solutions on smallish data sets (e.g. hundreds of millions of rows, not many billions or trillions). The problem is multi-master replication. Updating the same table from two or more places on a LAN is already quite difficult: witness the complexity of Oracle RAC or MySQL Cluster. The problem becomes intractable when you combine complex SQL transactions, referential integrity, and high-latency WAN connections. If you want full SQL semantics you cannot have updates on multiple sites. This is a serious dilemma and not just for call center automation. The same problems or worse affect a multitude of valuable business applications including market automation, credit card processing, customer relationship management (CRM), time/expense tracking, accounting, and many others.
Fortunately we are not really stuck. If we give up some requirements customers do not really want anyway, there is a perfectly good solution that will work for a wide range of problems. Data warehousing architects long ago developed the notion of a system of record. Bill Inmon's classic Building the Data Warehouse defined system of record as follows:
The definitive and singular source of operational data. If data element abc has a value of 25 in a database record but a value of 45 in the system of record, by definition the first value is incorrect and must be reconciled.System of record applies to multi-master systems in the form of a simple rule. We just assert that every customer has master data in one and only one location and copies everywhere else. When particular customers update information they do so on their own master. Customers can have masters on different hosts or sites, but the system of record rule says that no customer has one in two places. This eliminates conflicts between masters, and multi-master replication now works without a lot of difficulty.
System of record thus meets the original requirement of having data on multiple sites, which was to handle a site failure. We can store data economically using off-the-shelf MySQL. We can update copies within and across sites using master/slave replication. We can shard customer data into independent schemas. The result looks like the following. Acme has a master in San Jose, whereas Pinnacle has its master in New York.
System of Record Architecture for Call Center Automation |
Meanwhile, most users are fine with a single site. Pinnacle is close to New York, which is why the SaaS vendor puts Pinnacle's data there and gives them the New York site DNS for login. It is also possible to run reports on the cross-site copies as well. You can even run full applications provided you forward writes to the system of record, as shown above for Acme.
The real issue in implementing system of record architectures is that existing replication and clustering tools are not quite up to the job of handling cross-site applications build on system of record. We are extending Tungsten to handle some of the obvious problems in building these types of systems using MySQL.
- Locating the customer master and connecting applications to it.
- Moving the customer master from one location to another. This happens more often than you would think, for example to minimize multi-master replication which can introduce problems beyond conflicts.
- Detecting accidental updates to copies and preventing them from either reaching the DBMS and/or preventing them from propagating to other locations.
- Proving a clean failover model that works on both cross-site as well as local copies of data.
- Recovering corrupted copies of customer data from the master.
The need for availability is pushing an increasing number of SaaS vendors and other application providers to operate systems across multiple sites. Applications like call center automation depend on the features of SQL and cannot be implemented using NoSQL DBMS's like Cassandra. The system of record architecture eliminates replication conflicts and enables multi-master updates to work on ordinary SQL databases between sites. If you are building complex SQL applications and thinking about going multi-site, this design pattern should be in your toolbox.