Runtime All-Hands June 2017 Summary

All of Mozilla met in San Francisco last week for a work week. Unlike the last few All-Hands, we spent the week mostly informally and not in meetings — hacking in rooms together on near-term work.

The Runtime engineering team was focused on landing patches for Quantum Flow, Quantum DOM and Quantum Networking efforts. We had exciting changes related to Speedometer v2, both in improving how we measure and landing key patches. The Security Engineering team invited the Tor Project to join and deep dive into the Android version of the browser (based on Fennec, and called OrFox). The rest of the Runtime team was landing patches, reconnecting with colleagues across the org, and making exciting, measurable progress toward a great launch of Firefox 57.

I asked several team leads to send me their highlights from the week. I’ve summarized this below. If I missed something that was important to you, please get in touch.

Project Quantum highlights

“Watching my laptop race HTTP network queries against the disk cache and seeing that it was choosing the right transactions to have the network actually be faster.” -Patrick McManus

  • QF team fixed 26 Quantum Flow bugs since last Friday, June 23
  • Landed (preffed off, going to do a pref experiment for rollout) budget-based background tab throttling (meta bug)
  • Joel Maher and his “army of automation” has helped correct Speedometer reporting.
  • Got a bunch of people from different teams in a room and figured out the easiest/best architecture for supporting the moz-page-thumbs protocol in e10s (i.e. the protocol that supports everything you see when you open a new tab). Same, for nsITraceableListener support (which is must for 57: needed to support the NoScript addon).
  • Incremental table sweeping bug fixes landed that should reduce GC pause times.
  • Byte code cache landed and is on for 5% of Nightly population — this project was in progress for more than a year.
  • We now have a name for almost every runnable in Firefox.

Security/Privacy Highlights

“At Mozilla all hands this week. They are excited to work with us.” –Mike Perry, Tor Project

  • Tor Browser for Android was updated during the workweek to be based on Firefox 52 (from 45). The update is in QA now.
  • Patch written (and being rewritten) for constant blinding in the JIT.
  • A [patch for integrating Tor into Focus was hacked up][8] for discussion.
  • Got the TLS Canary (tool for testing changes to our crypto stack on Alexa-top-100 websites) running in TaskCluster.
  • Had first successful use of OneCRL administrative workflow

Other Runtime Highlights

“The culture of focusing on performance is in effect! Performance was a big part of every discussion and review.” -Andrew Overholt

  • “Making my first interoperable handshake and encrypted data for Mozilla’s IETF QUIC.” -Patrick McManus
  • [JavaScript classes][9] are done and fully optimized.
  • [GeckoView example now being tested][9] in automation.
  • Added security certificate information to GeckoView for use in PWA and Custom Tabs.
  • Taught a bunch of people how to profile at the two Quantum Flow profiler office hours sessions.

Thanks everyone for a productive week!

Everyday Postgres: Describing an “ideal” Postgres Operational Environment

I spent some time thinking about what things in the Postgres environment (and specifically for crash-stats.mozilla.com) make me happy, and which things bother me so much that I feel like something is pretty wrong until they are fixed or monitored.

Here’s what I came up with:

Ideal Postgres environment

Documentation

  • Documented replication topology
  • Documented network topology
  • Documented interface topology - including users, passwords, connection estimates, load balancers, connection proxies
  • Documented procedure, schedule for failover and testing
  • Documented procedure, schedule for disaster recovery and testing
  • Documented procedure, schedule for maintenance, upgrades
  • Documented procedure, schedule for data expiration
  • Docuemnted procedure, schedule for backups and testing
  • Documented schedule for system upgrades

Automation

  • Automated maintenance
  • Automated disaster recovery testing
  • Automated backup testing
  • Automated stage environment setup
  • Automated data expiration
  • Automated failover*
  • Automated user management
  • Configuration change management

Monitoring

  • Monitoring of key Postgres performance indicators: query duration, query counts, IO utilization, cache hits, hot tables, locks, cancelled queries, vacuum frequency and duration, large shared buffer "churn" events, size of tables, size of database, bloat, tables with high sequential scan counts (indexing targets), network throughput
  • Monitoring of key application performance indicators: query plans that scan a high % of partitioned tables, duration of stored procedures called regularly
  • System monitoring: CPU, memory, IO, swap, DB connections

Configuration

  • Automated pg_hba.conf and user management through configuration management tool
  • postgresql.conf managed through configuration management tool
  • recovery.conf maintained with configuration management tool

Operations

  • n+1 topology for replication/failover
  • reporting system on 1 hr replication delay (to allow for extended BI queries)
  • PITR archive for recovery from operator/developer error
  • Backup testing completed weekly
  • Failver, Disaster recovery testing completed once per quarter
  • Uptime target for all systems defined and agreed to by users

Re: automation of failover -- Can be varying levels of automation depending on the environment. Most important bit is once the system or operator has decided failover is necessary, next steps should be automatic to avoid errors.

I’m planning to go through each of these items and talk about how we address them in the Web Engineering team, and that will include implementing some new things over the next couple of quarters that we haven’t had in the past.

One thing that didn’t surprise me about this list was how much documentation is needed to keep environments running smoothly. By smoothly, I mean that other people on the team can jump in and fix things, not just a single domain expert.

Sometimes docs come in the form of scripts or code. However, some prose and explanation of the thinking behind the way things works is often also necessary. I frequently underestimate how much domain knowledge I have that I really aught to be sharing for the sake of my team.

Everyday Postgres: Specifying all your INSERT columns

Postgres has so many convenient features, including the ability to not provide a list of columns to an INSERT.

For example:

CREATE TABLE temp_product_versions ( LIKE product_versions );
INSERT INTO temp_product_versions ( SELECT * from product_versions ); 

That’s pretty badass.

However, you may encounter trouble in paradise later if you use this kind of shortcut in production code.

See if you can spot the error in this code sample below.

Here’s the error message:

ERROR:  column "is_rapid_beta" is of type boolean but expression is of type citext
LINE 10:     repository
             ^
HINT:  You will need to rewrite or cast the expression.
QUERY:  INSERT INTO releases_recent
SELECT 'MetroFirefox',
    version,
    beta_number,
    build_id
    update_channel,
    platform,
    is_rapid,
    is_rapid_beta,
    repository
FROM releases_recent
    JOIN products
        ON products.product_name = 'MetroFirefox'
WHERE releases_recent.product_name = 'Firefox'
    AND major_version_sort(releases_recent.version)
        >= major_version_sort(products.rapid_release_version)
CONTEXT:  PL/pgSQL function update_product_versions(integer) line 102 at SQL statement

And here’s the code (long!)

CREATE OR REPLACE FUNCTION update_product_versions(
product_window integer DEFAULT 30
)
RETURNS boolean
LANGUAGE plpgsql
SET work_mem TO '512MB'
SET maintenance_work_mem TO '512MB'
AS $$
BEGIN
-- daily batch update function for new products and versions
-- reads data from releases_raw, cleans it
-- and puts data corresponding to the new versions into
-- product_versions and related tables
-- is cumulative and can be run repeatedly without issues
-- * covers FennecAndroid and ESR releases
-- * now only compares releases from the last 30 days
-- * restricts to only the canonical "repositories"
-- * covers webRT
-- * covers rapid betas, but no more final betas
-- create temporary table, required because
-- all of the special cases
create temporary table releases_recent
ON commit drop
AS
select COALESCE ( specials.product_name, products.product_name )
AS product_name,
releases_raw.version,
releases_raw.beta_number,
releases_raw.build_id,
releases_raw.update_channel,
releases_raw.platform,
(major_version_sort(version) >= major_version_sort(rapid_release_version))
AS is_rapid,
is_rapid_beta(releases_raw.update_channel, version, rapid_beta_version::major_version)
AS is_rapid_beta,
releases_raw.repository
FROM releases_raw
JOIN products ON releases_raw.product_name = products.release_name
JOIN release_repositories
ON releases_raw.repository = release_repositories.repository
LEFT OUTER JOIN special_product_platforms AS specials
ON releases_raw.platform::citext = specials.platform
AND releases_raw.product_name = specials.release_name
AND releases_raw.repository = specials.repository
AND releases_raw.update_channel = specials.release_channel
AND major_version_sort(version) >= major_version_sort(min_version)
WHERE
build_date(build_id) > (current_date - product_window)
AND version_matches_channel(releases_raw.version,
releases_raw.update_channel::citext);
-- fix ESR versions
UPDATE releases_recent
SET update_channel = 'esr'
WHERE update_channel ILIKE 'release'
AND version ILIKE '%esr';
-- insert WebRT "releases", which are copies of Firefox releases
-- insert them only if the FF release is greater than the first
-- release for WebRT
INSERT INTO releases_recent
SELECT 'WebappRuntime',
version,
beta_number,
build_id,
update_channel,
platform,
is_rapid,
is_rapid_beta,
repository
FROM releases_recent
JOIN products
ON products.product_name = 'WebappRuntime'
WHERE releases_recent.product_name = 'Firefox'
AND major_version_sort(releases_recent.version)
>= major_version_sort(products.rapid_release_version);
-- insert WebRTmobile "releases", which are copies of Fennec releases
-- insert them only if the Fennec release is greater than the first
-- release for WebRTmobile
INSERT INTO releases_recent
SELECT 'WebappRuntimeMobile',
version,
beta_number,
build_id,
update_channel,
platform,
is_rapid,
is_rapid_beta,
repository
FROM releases_recent
JOIN products
ON products.product_name = 'WebappRuntimeMobile'
WHERE releases_recent.product_name = 'Fennec'
AND major_version_sort(releases_recent.version)
>= major_version_sort(products.rapid_release_version);
-- insert MetroFirefox "releases", which are copies of Firefox releases
-- insert them only if the FF release is greater than the first
-- release for WebRT
INSERT INTO releases_recent
SELECT 'MetroFirefox',
version,
beta_number,
build_id
update_channel,
platform,
is_rapid,
is_rapid_beta,
repository
FROM releases_recent
JOIN products
ON products.product_name = 'MetroFirefox'
WHERE releases_recent.product_name = 'Firefox'
AND major_version_sort(releases_recent.version)
>= major_version_sort(products.rapid_release_version);
-- now put it in product_versions
-- first releases, aurora and nightly and non-rapid betas
insert into product_versions (
product_name,
major_version,
release_version,
version_string,
beta_number,
version_sort,
build_date,
sunset_date,
build_type,
has_builds,
build_type_enum
)
select releases_recent.product_name,
to_major_version(version),
version,
version_string(version, releases_recent.beta_number),
releases_recent.beta_number,
version_sort(version, releases_recent.beta_number),
build_date(min(build_id)),
sunset_date(min(build_id), releases_recent.update_channel),
releases_recent.update_channel::citext,
(releases_recent.update_channel IN ('aurora', 'nightly')),
releases_recent.update_channel::build_type_enum as build_type_enum
from releases_recent
left outer join product_versions ON
( releases_recent.product_name = product_versions.product_name
AND releases_recent.version = product_versions.release_version
AND releases_recent.beta_number IS NOT DISTINCT FROM product_versions.beta_number )
where is_rapid
AND product_versions.product_name IS NULL
AND NOT releases_recent.is_rapid_beta
group by releases_recent.product_name, version,
releases_recent.beta_number,
releases_recent.update_channel::citext, releases_recent.update_channel;
-- insert rapid betas "parent" products
-- these will have a product, but no builds
insert into product_versions (
product_name,
major_version,
release_version,
version_string,
beta_number,
version_sort,
build_date,
sunset_date,
build_type,
is_rapid_beta,
has_builds,
build_type_enum
)
select products.product_name,
to_major_version(version),
version,
version || 'b',
0,
version_sort(version, 0),
build_date(min(build_id)),
sunset_date(min(build_id), 'beta' ),
'beta',
TRUE,
TRUE,
'beta'
from releases_recent
join products ON releases_recent.product_name = products.release_name
left outer join product_versions ON
( releases_recent.product_name = product_versions.product_name
AND releases_recent.version = product_versions.release_version
AND product_versions.beta_number = 0 )
where is_rapid
and releases_recent.is_rapid_beta
and product_versions.product_name IS NULL
group by products.product_name, version;
-- finally, add individual betas for rapid_betas
-- these need to get linked to their master rapid_beta
insert into product_versions (
product_name,
major_version,
release_version,
version_string,
beta_number,
version_sort,
build_date,
sunset_date,
build_type,
rapid_beta_id,
build_type_enum
)
select products.product_name,
to_major_version(version),
version,
version_string(version, releases_recent.beta_number),
releases_recent.beta_number,
version_sort(version, releases_recent.beta_number),
build_date(min(build_id)),
rapid_parent.sunset_date,
'beta',
rapid_parent.product_version_id,
'beta'
from releases_recent
join products ON releases_recent.product_name = products.release_name
left outer join product_versions ON
( releases_recent.product_name = product_versions.product_name
AND releases_recent.version = product_versions.release_version
AND product_versions.beta_number = releases_recent.beta_number )
join product_versions as rapid_parent ON
releases_recent.version = rapid_parent.release_version
and releases_recent.product_name = rapid_parent.product_name
and rapid_parent.is_rapid_beta
where is_rapid
and releases_recent.is_rapid_beta
and product_versions.product_name IS NULL
group by products.product_name, version, rapid_parent.product_version_id,
releases_recent.beta_number, rapid_parent.sunset_date;
-- add build ids
-- note that rapid beta parent records will have no buildids of their own
insert into product_version_builds
(product_version_id, build_id, platform, repository)
select distinct product_versions.product_version_id,
releases_recent.build_id,
releases_recent.platform,
releases_recent.repository
from releases_recent
join product_versions
ON releases_recent.product_name = product_versions.product_name
AND releases_recent.version = product_versions.release_version
AND releases_recent.update_channel = product_versions.build_type
AND ( releases_recent.beta_number IS NOT DISTINCT FROM product_versions.beta_number )
left outer join product_version_builds ON
product_versions.product_version_id = product_version_builds.product_version_id
AND releases_recent.build_id = product_version_builds.build_id
AND releases_recent.platform = product_version_builds.platform
where product_version_builds.product_version_id is null;
drop table releases_recent;
RETURN TRUE;
END;
$$;

I’m sure quite a few of you found the problem right away. For the rest of us…

Here’s the error message you get if you specify the columns for the INSERT:

ERROR:  INSERT has more target columns than expressions
LINE 10:     repository
             ^
QUERY:  INSERT INTO releases_recent (
    product_name,
    version,
    beta_number,
    build_id,
    update_channel,
    platform,
    is_rapid,
    is_rapid_beta,
    repository
)
SELECT 'MetroFirefox',
    version,
    beta_number,
    build_id
    update_channel,
    platform,
    is_rapid,
    is_rapid_beta,
    repository
FROM releases_recent
    JOIN products
        ON products.product_name = 'MetroFirefox'
WHERE releases_recent.product_name = 'Firefox'
    AND major_version_sort(releases_recent.version)
        >= major_version_sort(products.rapid_release_version)
CONTEXT:  PL/pgSQL function update_product_versions(integer) line 112 at SQL statement

Now, it should be completely obvious. There’s a missing comma after build_id.

Implicit columns for INSERT are a convenient feature when you’re getting work done quickly, they are definitely not a best practice when writing production code. If you know of a linting tool for plpgsql that calls this kind of thing out, I’d love to hear about it and use it.

Everyday Postgres: INSERT with SELECT

This is a continuation of a series of posts about how I use Postgres everyday.

One of the most pleasant aspects of working with Postgres is coming across features that save me lots of typing. Whenever I see repetitive SQL queries, I now tend to assume there is a feature available that will help me out.

One such feature is INSERT using a SELECT, and beyond that, using the output of a SELECT statement in place of VALUES.

Take for example:

INSERT into foo_bar (foo_id, bar_id) VALUES ((select id from foo where name = 'selena'), (select id from bar where type = 'name'));    
INSERT into foo_bar (foo_id, bar_id) VALUES ((select id from foo where name = 'funny'), (select id from bar where type = 'name'));
INSERT into foo_bar (foo_id, bar_id) VALUES ((select id from foo where name = 'chip'), (select id from bar where type = 'name'));

I think a lot of people know that this is possible. There are a few problems with it – like if the result of the WHERE clause isn’t unique in both cases, you’d get an error. In this case, id in both tables were surrogate keys, with both name and type being unique.

What some people don’t realize is that you can SELECT, and then directly insert that into a table:

INSERT into foo_bar (foo_id, bar_id) ( 
  SELECT foo.id, bar.id FROM foo CROSS JOIN bar 
    WHERE type = 'name' AND name IN ('selena', 'funny', 'chip') 
);

If the values you wanted to take from the table bar were not all the same, the query would be considerably more complex. Given that I only am interested in a single value from bar, and I want it joined with a series of explicitly selected values from foo, this version of the query saves me a lot of typing.

The bigger picture, however, was pointed out in the comments by Marko:

VALUES is just a special type of SELECT and that INSERT writes the
result of an arbitrary SELECT statement into the table. Consider:

SELECT 1; vs. VALUES (1);

SELECT * FROM (SELECT 1) sq; vs. SELECT * FROM (VALUES (1)) sq;

INSERT INTO quix VALUES (1); vs. INSERT INTO quix SELECT 1;

The reason VALUES is often used with INSERT is that many RDMBSs don’t
support SELECT without a FROM clause, so using VALUES is more
convenient. It’s also handy if you have a list of data you want to
SELECT, e.g. VALUES (..), (..), (..);

I may have referenced this feature a few times when breaking down functions used for reports in Socorro. It’s super convenient and saves quite a bit of typing! You can put any valid SQL query in there, including CTEs. The documentation for INSERT provides a few more examples.

Everyday Postgres: Tuning a brand-new server (the 10-minute edition)

Server tuning is a topic that consumes many books, blog posts and wiki pages.

Below is some practical advice for getting low-hanging fruit out of the way if you’re new to tuning Postgres and just want something that will likely work well-enough on low volume systems. I’d say looking at this list and making changes on a new system should take 10 minutes or less.

Run pgtune

Greg Smith open sourced a utility for making a first pass at tuning Postgres for a local system with pgtune. This tool is easy to run – just copy it to a target system and then point it at your existing Postgres config. It puts its changes into a new file at the very bottom.

Use XFS

Filesystem choice matters. Greg Smith goes into some detail on why ext3 is a terrible performance choice for a database filesystem in his talk Righting Your Writes. At this point, XFS is the filesystem that should be your default choice. If you want to explore ext4 or zfs (if that’s an option for you), that may be worth looking at. It is “safe” however to choose XFS. Depending on your disk situation, recreating your filesystem might take a bit longer than 10 minutes, but hopefully this will save you time and bad performance in the future!

Increase your readahead buffer

On Linux, the readahead buffer (brief explanation) is set way to small for most database systems. Increase this to about 1 MB with blockdev -setra 2048 [device].

For further performance analysis

I wrote this performance checklist a while back for assessing a system’s health. I’d say a review of all the things on that list would take probably half a day. Following up and making the changes could take a day or more. These kinds of analysis are worth exploring periodically to ensure you haven’t missed important changes in your environment or your application over time.

Everyday Postgres: How I write queries using psql: Common Table Expressions

This this series of posts about using Postgres every day. The last post was about \ commands in psql.

I’m now going to share in a series of posts my workflow for writing queries, and some of the things about working with Postgres that I take for granted in writing queries.

Shortcuts I can’t live without

Three important shortcuts you should learn are:

  • \e: Pulls the last query you executed into a buffer in your favorite editor
  • \df+ [function]: This displays [function] information, and the + dumps the function itself to STDOUT
  • \ef [function]: This pulls [function] into a buffer in your favorite editor. This is the most convenient way to grab a copy of an individual function for me.
  • \ef: This opens your favorite editor and puts a template for a function (in any supported procedural language) in a buffer

I’ll talk about writing functions in a future post.

Thinking in CTEs

In searching through my recent psql history, I found quite a few WITH queries. These are Common Table Expressions, a useful feature supported by many databases that allows you to embed subqueries in your SQL in a very readable format. CTEs have a lot more interesting features and properties, like RECURSIVE.

However, I tend to just use CTEs as a more convenient form of a subquery. This allows me to break apart long queries into smaller, testable chunks. I usually will write a subquery so that it’s in my command history, generate some fake data for testing, and go back to that query in my history to test edge cases.

I iterate on the smaller tables until I have a set of understandable “paragraphs” of SQL. This makes it easier for me to explain the logic of the query with others, and makes testing each piece easier in the event that something breaks. Usually, when a CTE breaks, I’ve made an assumption about incoming data that’s incorrect.

The composability of SQL is often terrible. CTEs help break apart the complexity visually. There’s some warnings about CTEs not performing well under certain circumstances. My approach is to design with CTEs and optimize for performance only if needed.

Other advantages of CTEs

In case you’re not yet convinced CTEs are worth learning, I made a bullet list of advantages from some useful comments about how others are using CTEs:

  • Alternative to throwaway VIEWs and temporary tables when querying replicas (comment from bma)
  • Variable declaration – to emulate DECLARE in SQL Server, for example
  • Easier to understand queries and faster development time (ME)

An example of the kinds of queries I write

Something you’d see a lot in my command history are queries that look like this:

WITH crashes AS (                                                               
    SELECT uptime_string AS category                                                      
        , sum(report_count) AS report_count                                     
    FROM signature_summary_uptime                                               
        JOIN signatures USING (signature_id)                                       
    WHERE                                                                           
        signatures.signature = 'Fake Signature #1'                                             
        AND report_date >= '2013-08-05T00:00:00+00:00'::timestamptz             
        AND report_date < '2013-08-12T00:00:00+00:00'::timestamptz              
        AND product_name IN ('Firefox')  AND version_string IN ('1')            
    GROUP BY category                                                           
),                                                                              
totals AS (                                                                     
    SELECT                                                                      
        category                                                                
        , report_count                                                          
        , sum(report_count) OVER () as total_count                              
    FROM crashes                                                                
)                                                                               
SELECT category                                                                 
    , report_count                                                              
    , round((report_count * 100::numeric)/total_count,3)::TEXT                  
as percentage                                                                   
FROM totals                                                                     
ORDER BY report_count DESC                                                      
;

You’ll see that I have one or more WITH clauses, and then a query that performs a final summary query using the data from the CTEs.

This query probably was asked for something like this:

Please provide counts of crashes with the same uptime, for Firefox version 1, and the signature ‘Fake Signature #1’ for the last week, including a percentage of all of the sampled crashes.

While I’m sure there are better ways to write the query above, I wanted to show how I have made a pattern for myself to speed up query writing. I’m not always interested in the best possible query. Hopefully, the Postgres planner makes up for many of my sins as a developer!

What I am interested in is finding answers to problems quickly for my coworkers.

In answering the question I was asked, I first dig out an appropriate summary table (we have quite a few in Socorro). I found the signature_summary_uptime table, and fortunately it has product_name and version_string available in the table. I only need to join signatures to fulfill the request. (Yay for denormalized data that supports the kinds of queries we often run!)

Next, I see that I’m being asked for a total percentage, so I need to calculate a sum across all the rows that I retrieve. That can be very slow, so I create a second CTE that uses data from the first CTE (rather than doing two full table scans to calculate the total). I use a window function instead of SUM() here because I’ve done experiments to see which tends to be faster.

And, finally once I have all the data together, I run my final query using my two CTE tables.

How CTEs and breaking down this process have helped me

So, I’ve had about a year to practice. A query like this today takes me 10-15 minutes to assemble and test. They are typically slightly more complex — with more dependencies, and maybe 2-3 more tables involved in JOINs. But they follow the same basic pattern.

Most queries on my data sets conform to recognizable patterns.

After a few months, we recognized that moving JSON for crash data into Postgres also would be a win, and was easy to process using very similar queries.

That’s all helped make finding answers about Firefox crashes easier and faster!

Everyday Postgres: Top 10 psql ‘\’ commands I use

I have been thinking about the kinds of questions people have about Postgres if they’re mostly users of MySQL. One thing that comes up a lot is how to use the psql command-line.

I’m going to do a series of posts based on what I actually do every day with Postgres. This isn’t going to be an exhaustive look at all the features, but just the kinds of things I find useful.

Here’s a look at the kinds of commands I regularly use on a production system:

selena@wuzetian:~ #1642 15:13: awk '{print $1}' /tmp/cmds   | uniq -c | sort -n -r
     47 \e
     22 \d
     13 \x
     12 \df+
     10 \q
      9 \df
      6 \ef
      6 \d+
      5 \o
      5 \h

Here’s the kinds of commands I use on my local system:

selena@wuzetian:~ #1645 15:15: awk '{print $1}' /tmp/local_cmds  | uniq -c | sort -n -r
     89 \d
     43 \e
     28 \df+
     14 \x
     14 \d+
     13 \df
     11 \c
     10 \h
      4 \a
      3 \ef+

There’s not a whole lot of difference between the two. I pretty clearly use the database locally to look at schema definitions over and over again!

Here’s what each of these commands do:

  • \d+: Examine a table, by default in 9.2 prints the table name, followed by the columns, their types, keys, indexes and constraints. The plus will cause all child tables that inherit from a parent to be listed.
  • \e: Opens an editor defined by your EDITOR environment variable, and put the most recent command entered in psql into the buffer. You can define a non-command line editor here!
  • \df+: Prints information about a User Defined Function, including the function’s whole definition (that’s what the + does), best when combined with \x and probably \a as well
  • \q: Quits psql. You can also quit with ^D
  • \ef [function]: Opens up your editor, and puts the function into the buffer. Without a function, it provides a convenient template for creating a new function.
  • \o [filename]: Open a local file for writing the output of whatever commands you run next. Stop writing to the file with another \o
  • \h: Help for SQL commands
  • \c [databasename]: Connect to [databasename] on local database cluster
  • \a: Print output “unaligned”, or without adding whitespace to make columns align. Good when trying to print machine-readable output to the terminal.
  • \x: Print output “expanded”. This causes output to be printed out like: “Column: Value”, rather than the normal tabular/spreadsheet style. Useful in lots of contexts, especially when you’ve got some columns that have a very large text field.

And here’s a few useful commands that didn’t make the top 10 lists:

  • \?: Help for \ commands
  • \timing: Turn timing of all commands on, reports in ms.
  • \s: print out your psql history to STDOUT.
  • \i [filename]: execute the contents of [filename]
  • \! [command]: execute a command in the local shell

Finally, when you start up psql, you have a few options. My favorite combination when generating machine-readable output is to add -AX -qt (axe cutie! hat tip to Greg Sabino Mullane for that mneumonic). Another very useful psql extension is -e, which causes the SQL commands used to produce output to also be printed out. This will help you learn about information_schema items and all the internal tables used to provide system information.

The shortcuts really worth spending a bit of time exploring are \e* and \d*. Both provide quite a bit of useful functionality, with relatively easy to remember letter combinations.

Eliminating duplicate code: our backfill functions for Socorro

Last Friday, I spent some time refactoring a user defined function in Socorro that was taking a little too long to run each day.

This meant splitting up one function into about 8 separate functions. Our functions are designed to backfill themselves when a failure occurs. However, if we need to remove an incorrect daily report and re-run the functions from scratch, we’ve typically written a special function for every report called backfill_REPORTNAME that handles the cleanup work.

This means we’ve got a lot of boilerplate code, that it would really be nice to replace. So, I took this opportunity to create a utility function and hopefully never have to write another backfill_REPORTNAME function again!

Here it is:

CREATE OR REPLACE FUNCTION backfill_named_table(tablename text, updateday date) 
    RETURNS boolean
    LANGUAGE plpgsql
AS $function$
DECLARE
    update_proc_name TEXT := 'update_' || tablename;
BEGIN

-- Check if requested table for backfilling exists
PERFORM 1 FROM information_schema.tables WHERE table_name=tablename;
IF NOT FOUND THEN
    RAISE INFO 'table: % not found', tablename;
    RETURN FALSE;
END IF;

-- Check that requested function for update exists
PERFORM 1 FROM pg_proc WHERE proname = update_proc_name;
IF NOT FOUND THEN
    RAISE INFO 'proc: % not found', update_proc_name;
    RETURN FALSE;
END IF;

EXECUTE format('DELETE FROM %I WHERE report_date = %L', tablename, updateday);

EXECUTE format('SELECT %I(%L, FALSE)', update_proc_name, updateday);

RETURN TRUE;

END;
$function$
;

Here’s the file with the code.

I’ve been trying to switch over to using format() instead of || in my queries, because it tends to be much more readable.

You’ll see that I’ve got a check for the existence of the table, and that the user defined function for the update exists. The type checking in the function handles ensuring that updateday is a valid date. If you think there’s any improvements I could make on this, definitely let me know in the comments.

JOINing against VIEWs can be harmful

I had a recent code review problem that was very curious at first glance, but came down to the use of complex VIEW in an even more complicated and frequently used reporting query.

I’ll just paste a edited version of the review below.

tl;dr: Don’t use product_info (a view, not a table) in this query, move WHERE clauses for product_name and version_string into the infos CTE, strictly limit the number of columns in tables being joined

This query is unfortunately doomed because it is using product_info — a view which already contains data from product_versions. There are four other tables which we don’t care about for the query that are included in the view.

As a result, you get a self-join many times over. A hint at the horrors of what Postgres decides to do with this is here:

Unique  (cost=10248.32..10248.35 rows=1 width=294)
   CTE infos
     ->  Hash Right Join  (cost=301.82..1683.83 rows=40195 width=96)
           Hash Cond: (pvb.product_version_id = pv.product_version_id)
           ->  Seq Scan on product_version_builds pvb  (cost=0.00..768.71 rows=42271 width=16)
           ->  Hash  (cost=282.46..282.46 rows=1549 width=84)
                 ->  Hash Right Join  (cost=218.53..282.46 rows=1549 width=84)
                       Hash Cond: (pv.product_version_id = pi.product_version_id)
                       ->  Seq Scan on product_versions pv  (cost=0.00..40.29 rows=1629 width=35)
                       ->  Hash  (cost=199.17..199.17 rows=1549 width=53)
                             ->  Subquery Scan on pi  (cost=179.81..199.17 rows=1549 width=53)
                                   ->  Sort  (cost=179.81..183.68 rows=1549 width=62)
                                         Sort Key: product_versions.product_name, product_versions.version_string
                                         ->  Hash Join  (cost=5.70..97.73 rows=1549 width=62)
                                               Hash Cond: ((product_versions.product_name = product_release_channels.product_name) AND (product_versions.build_type = product_release_channels.release_channel))
                                               ->  Seq Scan on product_versions  (cost=0.00..40.29 rows=1629 width=52)
                                               ->  Hash  (cost=5.03..5.03 rows=45 width=42)
                                                     ->  Hash Join  (cost=2.34..5.03 rows=45 width=42)
                                                           Hash Cond: (product_release_channels.release_channel = release_channels.release_channel)
                                                           ->  Hash Join  (cost=1.23..3.29 rows=45 width=34)
                                                                 Hash Cond: (product_release_channels.product_name = products.product_name)
                                                                 ->  Seq Scan on product_release_channels  (cost=0.00..1.45 rows=45 width=22)
                                                                 ->  Hash  (cost=1.10..1.10 rows=10 width=12)
                                                                       ->  Seq Scan on products  (cost=0.00..1.10 rows=10 width=12)
                                                           ->  Hash  (cost=1.05..1.05 rows=5 width=8)
                                                                 ->  Seq Scan on release_channels  (cost=0.00..1.05 rows=5 width=8)

Whenever you see so many nested joins, subquery sorts and sequence scans mushed together in a staircase, that’s a signal that we should investigate whether the query we’re running is really what we thought it was.

While @peterbe dug through code with me, he mentioned that product_info was a view! Now all the self-JOINs made sense and I started refactoring.

The product_info view was being deconstructed into it’s component parts, which already included product_versions (resulting in a self-join) and including a bunch of junk that for the purposes of this query, we don’t really care about. So, as the first step, I just made a copy of the SELECT query from the view (you can get that by running \d+ product_info in psql or you can dig it out of the socorro/external/postgresql/procs/views section of our code.

Here’s my proposal for what should go into infos:

         SELECT 
                product_versions.product_version_id
                , product_versions.version_string
                , 'new'::text AS which_table
                , product_versions.product_name
                , product_versions.release_version
                , product_versions.build_type
                , product_version_builds.build_id
                , product_versions.is_rapid_beta
                , product_versions.rapid_beta_id
                , product_versions.version_sort
        FROM product_versions
                LEFT JOIN product_version_builds USING (product_version_id)
        WHERE  %(product name and versions)s

We really need to move the product name and version filtering to this portion of the query because otherwise we end up doing a horrible self join on a 42,000 row table! :watch:

Here’s what the self-join looks like in the EXPLAIN:

   ->  Sort  (cost=8564.48..8564.49 rows=1 width=294)
         Sort Key: i1.version_sort, i1.product_version_id, i1.product_name, i1.version_string, i1.which_table, i1.release_version, i1.build_type, i1.build_id, i1.is_rapid_beta, i2.is_rapid_beta, ((((i2.product_nam
e)::text || ':'::text) || (i2.version_string)::text))
         ->  Merge Join  (cost=7755.52..8564.47 rows=1 width=294)
               Merge Cond: ((i1.product_name = i2.product_name) AND (i1.release_version = i2.release_version) AND (i1.build_type = i2.build_type))
               Join Filter: (((i1.product_name = 'Firefox'::citext) AND (i1.version_string = '26.0a2'::citext) AND (i1.version_string = i2.version_string)) OR ((i1.rapid_beta_id = i2.product_version_id) AND (i2.pr
oduct_name = 'Firefox'::citext) AND (i2.version_string = '26.0a2'::citext) AND (i2.is_rapid_beta IS TRUE)))
               ->  Sort  (cost=3877.76..3978.25 rows=40195 width=233)
                     Sort Key: i1.product_name, i1.release_version, i1.build_type
                     ->  CTE Scan on infos i1  (cost=0.00..803.90 rows=40195 width=233)
               ->  Sort  (cost=3877.76..3978.25 rows=40195 width=133)
                     Sort Key: i2.product_name, i2.release_version, i2.build_type
                     ->  CTE Scan on infos i2  (cost=0.00..803.90 rows=40195 width=133)

sad_kitten

This is pretty sad. The Sort at the top of Mt. Sadness. There are a series of sorts further down that are just HUGE because we’re tossing 45k records that must be joined to each other, and the width of the query is 294 — 294 columns in addition to our 45k rows.

The obvious (but sadly not always effective) thing to try is to see if we can filter our rows out earlier. Because we’re using infos, conveniently, that looks possible without too much trouble.

That just leaves sorting out the rapid beta self-join, which based on my tests should be pretty easy to continue to do in the body of the main SELECT, at line 125.

With the changes I proposed, the estimated duration of this query is ~200 ms in stage and the query plan looks like:

                                                                                QUERY PLAN                                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=37.07..37.08 rows=1 width=294) (actual time=221.131..221.149 rows=31 loops=1)
   CTE infos
     ->  Nested Loop Left Join  (cost=0.00..35.18 rows=26 width=64) (actual time=0.136..0.459 rows=150 loops=1)
           ->  Index Scan using product_version_version_key on product_versions  (cost=0.00..7.27 rows=1 width=52) (actual time=0.111..0.112 rows=1 loops=1)
                 Index Cond: ((product_name = 'Firefox'::citext) AND (version_string = '26.0a2'::citext))
           ->  Index Only Scan using product_version_builds_key on product_version_builds  (cost=0.00..27.58 rows=33 width=16) (actual time=0.019..0.268 rows=150 loops=1)
                 Index Cond: (product_version_id = product_versions.product_version_id)
                 Heap Fetches: 150
   ->  Hash Join  (cost=0.84..1.86 rows=1 width=294) (actual time=0.943..47.334 rows=22500 loops=1)
         Hash Cond: (i1.product_version_id = i2.product_version_id)
         Join Filter: ((i1.version_string = i2.version_string) OR ((i1.rapid_beta_id = i2.product_version_id) AND (i2.is_rapid_beta IS TRUE)))
         ->  CTE Scan on infos i1  (cost=0.00..0.52 rows=26 width=233) (actual time=0.141..0.236 rows=150 loops=1)
         ->  Hash  (cost=0.52..0.52 rows=26 width=69) (actual time=0.778..0.778 rows=150 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 8kB
               ->  CTE Scan on infos i2  (cost=0.00..0.52 rows=26 width=69) (actual time=0.002..0.664 rows=150 loops=1)
 Total runtime: 221.321 ms
(16 rows)