I spent some time thinking about what things in the Postgres environment (and specifically for crash-stats.mozilla.com) make me happy, and which things bother me so much that I feel like something is pretty wrong until they are fixed or monitored.
I’m planning to go through each of these items and talk about how we address them in the Web Engineering team, and that will include implementing some new things over the next couple of quarters that we haven’t had in the past.
One thing that didn’t surprise me about this list was how much documentation is needed to keep environments running smoothly. By smoothly, I mean that other people on the team can jump in and fix things, not just a single domain expert.
Sometimes docs come in the form of scripts or code. However, some prose and explanation of the thinking behind the way things work is often also necessary. I frequently underestimate how much domain knowledge I have that I really ought to be sharing for the sake of my team.
There’s an awful lot of knowledge needed to run a large PostgreSQL deployment. Dunno that there’s any way around it.
One workaround could be to pay for PG support from a good company.
I think there are some great targets for automation/tooling and additions to core Postgres in that list. 🙂
Agreed. I wonder how much of the monitoring is available as open source already? Seems ripe for a script to track the things you mention and push them to Graphite.
Nearly all of it is available as Nagios plugins or check_postgres.pl queries. I’ll see about documenting what’s available and what we currently have.
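For anyone curious what the "push it to Graphite" half of that script idea might look like, here is a minimal sketch in Python. It only assumes Graphite's plaintext protocol (one `path value timestamp` line per metric, normally on port 2003). The host name, the `postgres.db1` prefix, and the metric names are hypothetical placeholders, and collecting the numbers themselves (from check_postgres.pl or your own queries) is left out.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    # Graphite's plaintext protocol: "<metric.path> <value> <unix-timestamp>\n"
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def build_batch(metrics, prefix, timestamp=None):
    # Turn a {name: value} dict into protocol lines under a common prefix.
    # Sorting keeps the output stable, which makes it easier to test.
    return [
        graphite_line(f"{prefix}.{name}", value, timestamp)
        for name, value in sorted(metrics.items())
    ]


def send_to_graphite(lines, host="graphite.example.com", port=2003, timeout=5.0):
    # Assumption: a carbon line receiver is listening at host:port.
    # The default host here is a placeholder, not a real server.
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

Usage would be something like `send_to_graphite(build_batch({"connections": 42}, prefix="postgres.db1"))`, run from cron or whatever scheduler you already have. Keeping the formatting separate from the socket code means the formatting can be checked without a Graphite server around.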
The above link to ‘check_postgres.pl’ appears to be broken.
Your mention of Socorro near performance indicators is going to confuse some people.
Ah yeah, I fixed that. 🙂 Thanks.
Great post. Have you written a blueprint for a dedicated PostgreSQL distro?
Hah! As we add some things to our environment, I’ll see how much we need to do that’s truly custom. A coworker (@limed) wrote a great Puppet module, so that’s our starting point for everything.
Did you notice that a lot on that list isn’t actually all that PostgreSQL- or database-specific?
Most of the things on this list apply to any backing store or service.