19 Sep 2022

SQL as the Data Analysis' Volapük

Lately I’ve started to work with Apache Druid that is “real-time database to power modern analytics applications”. This is great piece of software, being in it’s late version 0.23.0, providing blazing-fast queries to your data. Surely enough, you always pay for the speed: sometimes with old good money (e.g. buy more resources from the powerhouses), sometimes with time needed to prepare the data for being queries (e.g. indexing), sometimes with absence of generality (e.g. make your highly intricate specific DB and go struggle when business needs something unsupported), sometimes just with your brainpower needed to understand and implement new stuff. At the moment, I think Druid is being paid mainly with two latter currencies.

So, having a bit of experience in Data warehouse project (5 years ago? eternity!), I was really interested how exactly the speed is being gained. If you’re interested, please take a look at this FAQ page (spoiler: inverted indexes and intricate metadata model). This post is not about it. This post is about SQL that is currently being used as defacto standard for thinking about data – even when “real” data is not stored in RDBMS.

As most of the industry-grade programming languages, SQL has lots of qualities from different areas: it’s mainly declarative, but has procedural mechanisms. It’s made for queries but creates databases and tables. It’s based on relational algebra, but It’s structured but sometimes it’s hard to find the structure behind 30 screens of the statement. At last, it has been designed for managers and now DB-team is usually sitting very far away from the commercial department. This is how evolution works, nothing unusual here.

But, being widely used, SQL get into the fields, where I was surprised to see it – for instance in the NoSQL world. Check out how similar is Cassandra’s CQL, even though Cassandra is famous NoSQL DB! Excel files could be used as SQL data bases! Even very non-usual languages like Q are calling the some data-manipulating functions q-sql, using different syntax, but at least partially sharing same mental model of the data manipulation.

This leads me to the point, at which I should make a bold statement: I don’t think SQL is a good programming language. I mean, it’s good, but it’s not that good, it’s not the default language when you write or think about the data. Its main feature is universality, sure; but being wide-spread across the industry, it effectively wipes out all the other possible ways of thinking around the data. It’s liek bad english of the internetz: you want to say I love you so much, but sending 🔥🔥🔥 instead; you want to know who wrote the hit and writing

SELECT song_id FROM charts FETCH FIRST 1 ROWS ONLY

instead, panicking how long will become the join to the composers table. It’s really similar to Volapük in many contexts: it’s as artificial, it’s designed to be widely used, it’s good but not necessary best, and it may be scorched out with some new technology (Esperanto in Volapük’s case). Not sure if I will see the dawn of SQL, but it could be a wonderful view!