Friday, March 2, 2012

At the edges: Relational, Hierarchical, and Key-Value models in LSMB using PostgreSQL

Introduction

It is no secret that I think of NoSQL as a kind of cool set of technologies applicable to a fairly narrow niche market.  My view is that relational databases are still (and will be for the long term) the bread and butter of data management.  NoSQL provides a useful set of tools for cases which have very different data requirements.  There are edge cases where many NoSQL techniques make sense in a relational environment though.


Data in an ERP application can be conceptualized as sets of sections of tables, each of which has various write vs read load balances.  Some sections may be updated only rarely, others may prevent updates and deletes, but allow inserts and selects, and receive data constantly.  Data may migrate from one area to another over time, such as when the books are closed.

LedgerSMB tends to currently use a variety of models for different sections of the application.  Settings and menu attributes (what a menu item does) are modelled using key-value modelling because that best semantically fits the sort of data being stored.  The menu structure is modelled hierarchical, as (in 1.4) are projects, departments, and account classifications.  Hierarchical modelling is done currently through a simple self-join.

Virtually all of the rest of the data is modelled using simple relations with no self-joins.  I don't see that changing.  However, for the sections that either do self-joins or key-value modelling, I am always interested in seeing how these can be optimized, and so when I read a blog post on NoSQL data modelling, it occurred to me that as time goes on some of the models discussed there may eventually become more applicable to relational systems.  And so I will go over my thoughts as to how these may be helpful or not.

As PostgreSQL gets native JSON support, some parts may in fact be moved to that.

Key-Value and Schemaless Modelling

Key-value modelling is generally something to be avoided in relational design.   There are, however, times when key-value pairs are in fact what one is modelling, and hence this is a good way forward.  The best example in LedgerSMB is the settings table (called "defaults").  Here the data is semantically keys and values and hence there is no other clean way to model the data.

There are some other cases where data can be modelled well using key-value modelling, but where this is not quite so semantically clear cut.  Consider the LedgerSMB menu, for example:

CREATE TABLE menu_node (
    id serial NOT NULL,
    label character varying NOT NULL,
    parent integer,
    "position" integer NOT NULL
);

and

 CREATE TABLE menu_attribute (
    node_id integer NOT NULL references menu_node(id),
    attribute character varying NOT NULL,
    value character varying NOT NULL,
    id serial NOT NULL,
    primary key(node_id, attribute)
);

The second table effectively models the keys and values submitted back to the application when  link on the menu is clicked.  While this works ok with KVM, it is not really clean.  Ideally any common attributes would be moved to menu_node and free-form ones would be added to this table.

But here's a case where JSON would be cleaner, per node.  The data here is not getting updated frequently (unlike defaults).  Indeed it is typically written once unless a fix needs to be made.  The fixes would be cleaner if it was all in one field.  So here, a NoSQL model would work very well, albeit incorporated into a relational database.

Hierarchical Modelling

Since LedgerSMB 1.3, we have started to do hierarchical modelling in the database.  See the above menu_node table as an example.  1.4 adds hierarchical business reporting units (for projects and departments) as well.

Currently these are entirely handled through simple self-joins and no clustered indexes are added.  When the menu is generated in 1.3, we use connectby() in tablefunc (we support PostgreSQL 8.3 and above), but for trunk (will be 1.4) we have gone to WITH RECURSIVE which has shown a significant speed improvement (2-70 times depending on various factors).

Initially when I read the article, I thought maybe some of the ideas might be applicable to my work with LedgerSMB, but as I think about it I am not sure.  It may be that PostgreSQL's approach may perform well enough not to ever have to move to something else.  WITH RECURSIVE performs very well, and so the major options for improving performance (namely clustered indexes) may be sufficient.

Outside of clustering on indexes, I am hard-pressed to think of ways to optimize hierarchical lookups in ways that would significantly beat other forms of hierarchical storage.  Perhaps someone could explain what the benefits are to storing child node id's etc in the table.  I sure don't see it.


OLAP

In additional to standard relational operations on a well-normalized database, there are also cases where analytics need to be run relatively quickly and in semi-flexible manners, over increasingly large datasets, and they must perform well.

Unfortunately in an ERP application, there is rarely a hard distinction between analytic and transactional workloads.  Bank account reconciliation may enter data, but the workload depends on getting good data regarding reconciled transactions from the past, and that means aggregating potentially large datasets, and those datasets could increase in size over time.

The first forays into this area for LedgerSMB in fact were aimed to solve mixed transactional/reporting workflows.    Our current approach uses "checkpoints" which effectively close the books on all previous dates.   When one closes the books the checkpoints are created, and from that point forward, no transactions can be entered for previous dates.

This allows us eventually to tie a number of other information on to checkpoints. Various reporting views could be materialized on closing (asynchronously even), and thus provide better performance for all kinds of reports on these static data sets.

These areas have not really been fully explored in LedgerSMB yet.  However for some upcoming features (such as graphs in reports), I think it will be necessary to offer this sort of thing.  Referential integrity constraints can tie a set of reports to a specific checkpoint so that the reports cannot get out of sync when books are re-opened (which happens on rare occasion).

No comments:

Post a Comment