AKF Partners

Abbott, Keeven & Fisher Partners | Partners In Hyper Growth


Multi-paradigm Data Storage Architectures

Clients often ask us about one or more of the NoSQL technologies as a potential replacement for their RDBMS (typically MySQL) to simplify scaling. With these NoSQL and SQL storage systems, what makes much more sense is an AND instead of an OR discussion. Consider implementing a multi-paradigm data storage layer that provides the appropriate persistent store for each type of data in your application. This approach is similar to our RFM approach to data storage: ask how often you need the data, how quickly you need it, how relational it is, and so on. There are at least four benefits to this multi-paradigm approach: simpler scaling, improved application performance, easier application development, and reduced cost.
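To make the RFM-style questions concrete, here is a minimal sketch of how an application might route a class of data to a storage paradigm. The function name, categories, and thresholds are all illustrative assumptions, not a prescription:

```python
# Sketch: pick a storage paradigm from the questions in the text --
# how relational is the data, how often is it read, how fast must it be.
# Thresholds are made up for illustration.

def choose_store(relational: bool, reads_per_sec: int, latency_ms: float) -> str:
    """Return an appropriate storage paradigm for a class of data."""
    if latency_ms < 1 and reads_per_sec > 10_000:
        return "in-memory cache"        # e.g. Memcached
    if relational:
        return "RDBMS"                  # e.g. MySQL
    if reads_per_sec < 1:
        return "archive object store"   # e.g. S3
    return "document store"            # e.g. MongoDB

print(choose_store(relational=True, reads_per_sec=100, latency_ms=50))
```

In a real system this decision is made per data type at design time, not at runtime; the point is simply that each answer to the RFM questions maps to a different store.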

Scaling
The AKF Scale Cube provides a straightforward way to scale any relational database along three axes, but we know that splitting data relationships once they have been established isn't easy. It requires work and a lot of coordination between teams. Limiting what gets stored relationally to the minimum required means fewer splits along any axis. Many of the NoSQL technologies provide auto-sharding and asynchronous replication, and re-distributing keys onto another node is much simpler than migrating tables into another database.
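The key-based splits that NoSQL stores automate can be sketched in a few lines. This is an assumed, simplified Z-axis split by hashed key; production systems typically use consistent hashing so that adding a node moves only a fraction of the keys:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (a Z-axis style split)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Rebalancing keys onto more nodes is a matter of re-hashing,
# not of untangling foreign-key relationships between tables.
print(shard_for("user:42", 4))
```

Contrast this with a relational split, where every table that joins against the moved rows has to be considered.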

Performance
While relational databases can deliver great performance, unless the table is pinned in memory or the query results are cached, an in-memory data store will always outperform SQL. In many applications we could make use of in-memory or memory-mapped solutions like Memcached or MongoDB to improve the performance of retrieving high-value data.
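The usual way to get that in-memory speed in front of a relational store is a cache-aside pattern. Below is a minimal sketch using a plain dict as a stand-in for Memcached; the function names and TTL are illustrative:

```python
import time

cache = {}  # stand-in for a Memcached client

def query_database(user_id):
    """Placeholder for the slow, relational path (a real SQL query)."""
    return {"id": user_id, "name": "example"}

def get_profile(user_id, ttl=60):
    """Cache-aside read: hit memory first, fall back to the database."""
    entry = cache.get(user_id)
    if entry is not None and entry[1] > time.time():
        return entry[0]                      # cache hit: no SQL issued
    profile = query_database(user_id)        # cache miss: pay the SQL cost once
    cache[user_id] = (profile, time.time() + ttl)
    return profile
```

High-value, frequently read data pays the database cost once per TTL window instead of once per request.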

Application Development
As Debasish Ghosh states in his article Multiparadigm Data Storage for Enterprise Applications, storing data the way it is used in an application simplifies programming and makes it easier to decentralize data processing. If the application treats the data as a document, why break it apart to store it relationally when we have viable document storage engines? Storing the data in a more native format allows for quicker development.
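As a sketch of "store it the way the application uses it," the example below persists an order as one JSON document rather than normalizing it into orders, line-item, and customer tables. SQLite here is only a stand-in for a real document store such as MongoDB, and the order fields are invented for illustration:

```python
import json
import sqlite3

# One document per order, stored whole -- no joins needed to read it back.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, doc TEXT)")

order = {
    "id": "o-1001",
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}
db.execute("INSERT INTO orders VALUES (?, ?)", (order["id"], json.dumps(order)))

# Reading it back yields the same structure the application works with.
doc = json.loads(
    db.execute("SELECT doc FROM orders WHERE id = ?", ("o-1001",)).fetchone()[0]
)
```

The relational version of this would need three tables, two foreign keys, and a join to reassemble what the application considered a single object all along.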

Cost
For data that's not needed often, cache it in other places (such as a CDN) or lazy-load it from a low-cost storage tier such as Amazon's S3. This works especially well for applications hosted in the cloud. The benefit is a lower cost per byte stored, especially when considering all costs, including the administrators required for more complex data storage systems such as relational databases.
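Lazy loading from a cheap tier can be as simple as deferring the fetch until first access. The sketch below assumes a hypothetical fetch callable standing in for an S3-style GET; the key name is made up:

```python
class LazyAsset:
    """Defer fetching a rarely used blob until something actually reads it."""

    def __init__(self, key, fetch):
        self._key = key
        self._fetch = fetch      # e.g. a call into an S3-style object store
        self._value = None

    @property
    def value(self):
        if self._value is None:
            # Pay the retrieval cost only on first access.
            self._value = self._fetch(self._key)
        return self._value

fetched = []
def fake_s3_get(key):
    fetched.append(key)
    return b"archived report bytes"

report = LazyAsset("reports/2010/q3.pdf", fake_s3_get)
assert fetched == []   # nothing retrieved yet; storage cost only, no transfer
data = report.value    # first access triggers the single fetch
```

Cold data sits on the cheap tier at pennies per gigabyte; the application only pays retrieval cost for the small fraction that is ever read.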

A final step in implementing a multi-paradigm data storage layer is an asynchronous message queue for data that needs to move up and down the stack. Implementing ActiveMQ or RabbitMQ to asynchronously move data from one layer to another as needed relieves the application of this burden. As an example, consider an application that routes picking baskets for inventory in a warehouse. This is typically modeled as a graph, with bins of inventory as nodes and the aisles between them as edges. You could store this in a graph database such as Neo4j for ease of development and retrieval performance, asynchronously persist the maps in a MySQL database for reporting, and archive older versions in an S3 bucket. This combination provides faster performance, easier development, simpler scaling, and reduced cost.
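The warehouse example can be sketched without a graph database at all: bins as nodes, aisles as edges, and a breadth-first search for the shortest picking route. The layout below is invented for illustration; a graph store like Neo4j gives you the same traversal as a query instead of hand-written code:

```python
from collections import deque

# Warehouse as a graph: inventory bins are nodes, aisles are edges.
aisles = {
    "dock":  ["bin_a", "bin_b"],
    "bin_a": ["dock", "bin_c"],
    "bin_b": ["dock", "bin_c"],
    "bin_c": ["bin_a", "bin_b"],
}

def picking_route(start, goal):
    """Shortest picking route by breadth-first search over the aisle graph."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in aisles[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None  # no route between the bins
```

Once a route or map version is computed, the application would drop it on the queue (RabbitMQ, ActiveMQ) and move on; consumers persist it to MySQL for reporting and S3 for archive without blocking the picking workflow.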


Why We Hate Stored Procedures

Here's a little color on our love-hate relationship with stored procedures.

Okay, we really don't hate stored procedures; we have actually used them extensively on past projects. But we so often tell people to remove them that we've been accused of hating them. Here's the deal: as we mentioned in a past post, two scalability best practices are 1) put as little business logic in the database layer as possible and 2) use as few features of the RDBMS as possible. Stored procedures violate both if they contain any business logic. If the stored procedures are simply instantiations of SQL data statements, then you're probably okay.

Expanding on the two reasons for avoiding stored procedures mentioned above, we'll first discuss why business logic doesn't belong in the database layer. The biggest reason is that the database server is likely the most expensive server in your system and the most difficult to scale. Adding computational work to it makes it the most expensive layer on which to perform that work, and forces you to upgrade the hardware or split the database sooner than necessary. On a cost-per-computation basis, it is much cheaper to run this business logic on commodity application servers.
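The distinction looks like this in practice: keep the SQL to a plain data statement and compute the business rule in the application tier. The discount rule and amounts below are invented for illustration, with SQLite standing in for the RDBMS:

```python
import sqlite3

# The database does only data retrieval; no stored procedure, no logic.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 120.0), (2, 40.0)])

def order_total_with_discount(order_id):
    """Business rule (10% off orders over 100) lives in the app tier,
    which is cheap to scale horizontally."""
    (amount,) = db.execute(
        "SELECT amount FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return round(amount * 0.9, 2) if amount > 100 else amount
```

If the discount rule changes, you redeploy cheap application servers; the expensive database server keeps doing only what it must.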

Avoiding the use of RDBMS features is important for remaining vendor agnostic and for getting the best possible price during negotiations. Using features specific to a particular database vendor locks you in to that vendor. While you may never want to switch database vendors, lock-in reduces your negotiating strength: your best alternative to a negotiated agreement (BATNA) is weak, and their sales staff knows it.

While we don't hate stored procedures, we do see them as an impediment to scaling. If you already have them in place, you may be hesitant to spend the money moving them into the application logic. One way to do so at minimal incremental cost is to start today with the policy that any engineer who touches a stored procedure must move it. The cost is small, because the code is already being touched, and spread out, because it isn't all done at once.


Data Access Layers

Ways of abstracting the storage of data have been around for a long time. In data warehouses, engineers abstract data into business or domain objects that can be manipulated in reports. In object-oriented programming, engineers can use the active record pattern to create a wrapper class representing a table, with methods for inserting, updating, and deleting. The manipulation of database rows is thus abstracted into the object-oriented parlance of classes and methods. This layer is known as a Data Access Layer (DAL), and it hides the complexity of the underlying data store from engineers who do not need to be bothered with those details.
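A bare-bones version of the active record pattern fits in a few lines. This is a sketch, not a real ORM: one class wraps one table, and the methods hide the SQL from the caller. SQLite, the table, and the class are all assumed for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

class User:
    """Active-record-style wrapper: the row's data plus its persistence methods."""

    def __init__(self, name, id=None):
        self.id = id
        self.name = name

    def save(self):
        # Insert on first save, update thereafter; the caller never writes SQL.
        if self.id is None:
            cur = db.execute("INSERT INTO users (name) VALUES (?)", (self.name,))
            self.id = cur.lastrowid
        else:
            db.execute("UPDATE users SET name = ? WHERE id = ?",
                       (self.name, self.id))
        return self

    @classmethod
    def find(cls, id):
        row = db.execute("SELECT id, name FROM users WHERE id = ?",
                         (id,)).fetchone()
        return cls(row[1], id=row[0]) if row else None
```

Frameworks like Ruby's Active Record and Hibernate generate this plumbing for every table, which is where the development-time savings come from.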

There are many frameworks and object-relational mapping (ORM) tools for creating DALs in different programming languages, such as Active Record for Ruby and Hibernate for Java. We often hear from development teams that have adopted an ORM that the pros include shorter development time, better-designed code, and a reduction in the amount of code.

However, a quick search will show you that not everyone is sold on the benefits of a DAL. The reduction in code is debatable when constructing complex queries, or when you consider that Hibernate is over 800 KLOC of Java and XML. There are also concerns about the ability to scale effectively when using DALs. While an ORM makes it possible to scale on the X-axis, for example with MySQL master-slave replication, Y- and Z-axis splits can become much more complicated.

I am a fan of DALs for their centralization of data objects and their abstraction of relational data into objects. To me these advantages speed development and testing and improve quality. Additionally, given the sophistication of today's open source ORMs, I think it makes sense to consider using one as a framework. If you do, however, you need to consider ahead of time how you would shard your data along the other axes; using an ORM moves that decision earlier. Think of the D-I-D approach, where the cost of a change made during the Design phase is negligible compared to one made after Deployment.
