A friend told me about their sharding scheme last night, and it made me very curious about how others are handling this problem. This question about database design turns into a devops issue, so it’s something really the entire development group and devops and DBAs need to be aware of and concerned about. And it’s not a problem exclusive to Postgres.
They’re using Postgres’ table inheritance to constrain the properties of the sharded tables. And I’m deliberately using ‘sharding’ because this ends up being a functional grouping, rather than, say, partitioning by date. Groups of customers live on each shard, and can be moved around.
In theory, this is awesome. Everything inherited is in lockstep, you never have to worry about one shard’s tables being different from any other shard.
But that’s dubious, because you can change or add columns to child tables. The only columns that are constrained are the ones defined by the parent.
And… the problems I’ve seen with this setup are when you need to make a schema change on a column that’s in a parent table. Typically, devs (and sometimes DBAs) give up, and just add columns to each shard’s table individually. Because they can’t get the downtime they need to modify the tables across all shards.
In this case, we’re talking about 1024 tables for each sharded table, and an ACCESS EXCLUSIVE lock needs to be acquired on them all before the change can be applied.
There are some simple things one can do to get around this, but acquiring that lock is a significant undertaking on a busy system. In one case, the table being modified is an audit table. (why this is problematic, exercise for reader, etc)
And I still have scars from working on a system that had 100k+ inherited tables.
So, my thought was: just don’t use inheritance for sharded designs. For schema changes, not using inheritance gets you:
- Only one ACCESS EXCLUSIVE lock required at a time
- The ability to apply a change per-shard, instead of globally
- Preparation for the day when you move a shard to a separate system entirely
If you’re using 9.0 or later, you can use CREATE TABLE … LIKE instead of using INHERITS, if you’re deploying shards with SQL commands.
Giving up inheritance is a pain because:
- Now you have to ensure that your tables remain in sync across shards without inheritance’s help (but again, dubious help!)
- You can no longer write queries against the parent table that will pull data from all child tables (but I’d say – that’s for your data warehouse, not your prod OLTP database)
- You’re no longer using inheritance, which is a pretty cool feature
I’d really like to know what others are doing. Tell me in the comments.
Some links you might be interested in
And relevant, but doesn’t mention but about 5 months old: Instagram’s sharding technique
I’ve used sharding with inheritance in the past but on a warehousing backend where we did not need to worry about locks.
Using LIKE to define child tables creates a maintenance burden but there are ways to work around it in your application code which can be made ‘version’ aware. In effect you would never update old shards and only add those columns in the newer shards – the application code being smart enough to use defaults. Just a thought.
Currently developing an app where we put each customer in their own schema. It’s a b2b web app, and allows each customer to add their own tables, views and columns. I hadn’t thought about inheritance, but now I’m wondering as we have a base schema, for say a contact, and while customers can add their own columns, the can not change the base (all currently handled at the app layer). I understand the issue with the locks, and that would seem to be a +1 for the schemas, but my hunch is we will run into similar problems with the performance of pg_catalogs. It’s too soon for us to optimize for this, so we just figure at some point a server will only handle x of our schemas and then we just get another one, rinse & repeat. Just wondering if x schemas is less than x inherited tables and I I should even care this early in the app cycle.
Some indexes were added to the catalogs to fix performance issues with the use case you’re describing.. I think in version 8.4. So, it’s unlikely that your limiting factor will be the performance of the catalogs (barring something truly pathological).
I’ve been down this road before of allowing customers to add columns. Dangerous. 🙂 I don’t recommend it.
Robert’s comment below about multi-node is really the most succinct way of telling the whole story I told above! Once you know you’ve got to do multi-node, inheritance is a largely unworkable.
Generally speaking, the efforts needed to maintain a sharded architecture only make sense in the context of a multi-node system, and once you know you are going there, inheritance isn’t going to be available to you, so there is no sense trying to find ways to work around it’s issues; just get rid of it. If you still want multi-tenancy and the ability to move customers from machine to machine, and you want to segment their data out specifically, use schemas for that, and manage them appropriately. (I mentioned this in a blog post a few years back… http://www.xzilla.net/blog/2009/Jul/Three-Cheers-for-the-search_path!!!.html#extended)
Word. Much more succinct way of making my point! 🙂 Thanks, Robert.