Leaky abstraction, SQL & ORM

Abstraction is a way of building software that generalizes a business process or a functionality without having to write separate logic for specific use cases and without giving away details of implementation. By doing this a module of software can do many things with one, can be reused by many, and can hide complexity.

A data modeler for instance is abstracting away business rules inside a schema. Similarly, the implementation of an SQL engine is an abstraction of datasets that are warped in normalization principles . Regardless of how a database is designed a user can query it with SQL that works elsewhere, just the table names and columns have to be changed. Also, you don’t tell SQL engine how to do something, you tell it what you want. This is the premise of declarative programming — more on this below.

Abstractions is said to leak when it forces the consumer of abstraction to write more code or invest more effort to properly utilize the abstraction, which defeats its own purpose.

ORM is a classic example of having leaky abstraction.

What are ORMs?

Object Relational Mapper is an object-relational interface that implements an object oriented interface (an abstraction) to relational databases. Having this interface helps application developers read/write from databases without having to think in terms of relational data organization.

ORMs are standard accessory for programming language frameworks. Rails for Ruby, Spring for Java all have ORMs.

A typical of ORM example might look like this:

ResultSet = DatabaseObject.QueryAll(“MY_TABLE”).Project(“COL1", “COL2"). Filter (“COL3 < 100")

This interface to data sounds trivial but it’s not — without this it’s a paradigm shift for app developers to think in SQL terms. Tom Kyte and many others have pointed out: SQL requires developers to think in sets. App developers don’t (have to) think that way because they think in a procedural or object oriented fashion (which is perfect for what they do). This is why many applications have poorly written database code. ORMs are supposed to mitigate this by providing an object oriented interface to the set theoretic database operations.

A note on why SQL is different

SQL is different from other programming languages like Java, Ruby or Perl. There are two distinctive features:

1. SQL is a declarative programming language

Declarative programming implies that the programmer doesn’t delve on the algorithm to fetch data. They only tell SQL engine what they want. But you can ask the SQL engine information on how it chose to do the “what” — the algorithm. This is called the query plan. Query plan tells you that the engine chose to do certain things out of the many possible options, for instance:

- index scan based on a where clause
- index range scan based on the quantity or placement of data requested
- nested loop join instead of a hash join due to many possible reasons
- full table scan instead of an index scan perhaps based on the requested volume etc.

As an SQL programmer you don’t specify any of these. You just say “give me the data” by applying a few set theoretic operations.

2. SQL is an abstraction of set theoretic applications on datasets

SQL has a programming paradigm that forces you to think in sets. If this doesn’t occur to you, SQL will never be your favorite tool. And this is where ORMs play a role in helping application developers. I have a theory that the facile adoption of NoSQL and Big Data systems to every performance problem in data is an unfortunate result of application developers getting too accustomed with ORM lexicon — and not necessarily consequential, but — never acquiring the “set theoretic mindset”.

Tables are sets. Result sets are sets. Sub queries are sets. CTEs are sets. You apply relational algebraic operations on these sets to get a different set, and may be apply more of such operations to get another set and so on.. all in a single SQL statement. This is a bit much to stomach if you don’t approach it with set mindset.

Back to ORMs

ORMs hide the complexity (or simplicity, depending on whether you ask an app developer or a database developer) of a relational schema giving you the semblance of NoSQL flexibility on SQL rigidity. However the flexibility that’s abstracted away in ORM gets shaky when you get serious with SQL.

Consider a web service developer wanting to reduce the payload by letting the database engine do much of the data processing so the web service get as less data as possible. This could end up having to write somewhat big or complex SQL.

Ten-way joins, subqueries, CTEs, windowing clauses, all work perfectly well in a single large SQL on a well designed database. But ORMs aren’t just made for set theoretic operations because they are not SQL engines.

In hindsight ORMs could have been the bridge between procedural mindset and set theoretic mindset but I’m not sure. The abstraction presented by ORMs get leaky when the query gets complex, forcing the app developer to either wrap the whole SQL in a string to send to database (essentially overriding ORM’s abstraction) or process the data inside application (by fetching a larger unprocessed data through ORM’s abstraction) which increases payload.

This is not a “do this, not that” post. To quote people smarter than me, “there are no solutions, there are only trade offs”.

For enterprise web services reading from well managed operational data stores or data warehouses, my preference is the database exposing data as a service through stored procedures or other database techniques. Apps handling database code inside their services works well until the smartest full stack developer leaves and the project gets contracted out.

I write because I forget.