The problems with copyright re-assignment

While I was in NYC (eating awesome food, riding my bike across the Brooklyn Bridge in the rain!), I spent time catching up with free software advocates. One issue that we talked about was copyright assignment. H Online recently published an article about this. Their description of the Linux kernel’s policy pretty much matches PostgreSQL’s policy:

Ownership of free software is a difficult area, and one that is resolved simply by the Linux kernel project. The code belongs to everyone and no-one, and the copyright for each individual piece of code belongs to the original coder, so that any future reassignment of the licence or the code for the Linux kernel requires the agreement of every other contributor.

I haven’t contributed code to projects other than PostgreSQL in a long time, but one aspect of contribution that I used to give little thought to is copyright assignment. Now that I have spent a little time thinking about it, my preference is to contribute to projects that do not require copyright re-assignment.

Copyright came up in a conversation about dual-licensing, because it is copyright assignment that provides the opportunity for a codebase to be re-licensed. But more important to me than the possibility of re-licensing is the chilling effect copyright re-assignment agreements have on communities. The intent can be to hedge a company’s bets against contributor interference, and ultimately to be able to assert complete control over a codebase. If we agree that the collaborative production of software is a social good, this type of hedging can only be seen as anti-social, and ultimately destructive to a software community. In practice, I’ve seen projects which require contributor agreements effectively shun all non-corporate contributions, or actively engage in “ornamental sourcing”.

For a business owner who invests in free and open source software, this is an unsustainable position. The advantage of accessing source code is not just the code, but the people who know the code. And while I’m sure there are some exceptions, I doubt most people consider themselves experts in a codebase without having contributed significant patches to it.

Given all that, copyright assignment to the Free Software Foundation or to Canonical has been a contentious issue. But if an organization is committed in its charter to maintaining software freedom, then perhaps copyright assignment serves a social good, giving an organization like the FSF the authority to pursue legal action if the terms of a license are violated.

Training lessons learned: Code dojo, whiteboards, interactivity

Training can be an incredibly boring, frustrating exercise. I have friends who often don’t bother to attend sessions or tutorials during conferences. Instead, they cherry-pick friends and colleagues to work on code, gossip or brainstorm with in the hallway, while others sit passively in lectures. When I think about it now, knowing this about my friends is what motivated me to start Open Source Bridge.

The PostgreSQL training I gave to Ondo State was specifically targeted at developers. I used material End Point had from previous trainings, and added a few new things designed to meet the needs of fledgling database developers. The high points I wanted to hit were: schema design basics, user-defined function development, and developer-friendly features of Postgres that they should be aware of.

One big obstacle for me was that they would all be using Windows as their primary operating system. I develop exclusively on UNIX-based platforms, and so I had to spend a little time getting re-acquainted with Windows tools. pgAdmin III was essential, and I was happy that a new version was released along with version 8.4 of Postgres.

Also, while the concepts are the same, the built-in monitoring tools for Windows are quite a bit different, so I drew on freely available material from my Postgres colleagues who support Windows for a couple-hour tutorial on interactive troubleshooting.

When trying to explain concepts – like replication, or basic database terms – it really helps to have a whiteboard. I was working with a group of people with diverse IT backgrounds, and often I asked individuals to explain various terms (like “transaction”) in their own words. This engaged the students in a way that simply stating definitions can’t. Watching their fellow students struggle with terminology helped them generate their own questions, and I saw the great results the next day, when students could immediately define terms that had taken five minutes to work through the day before.

Finally, one important request from the client was that some time be spent mentoring developers on standards, best practices for development, and coding style. To accomplish this with fourteen students in such a limited period of time, I decided to conduct a series of coding sessions where the students and I took turns at the keyboard, programming as a group. This is called a coding dojo, a concept built on Dave Thomas’s Code Katas.

Overall, I prefer interactive training, where students are not only encouraged, but forced to interact with each other and the instructor.

When I sent out the CFP for MySQL Conf yesterday, lots of people asked me for suggestions on talk topics. In general, I recommend that speakers focus on a particular take-away for the audience, and mention specifically what a person sitting in is going to learn *and* apply immediately. Not every talk can be interactive, or give people chunks of code. But *every* talk should have a clear goal, and leave the audience educated. The best leave them inspired!

PostgreSQL at MySQL Users’ Conference 2011

So, I’m on the committee for MySQL Conf this year, and the committee is specifically seeking talks about PostgreSQL. The idea is to broaden the scope of the conference to include a lot of different open source database technology, including a bunch more about Postgres.

The theme of the conference is “the ecosystem and beyond”, which was chosen specifically because the open source database world has exploded and grown so much in the last three years. Below is a slide from a presentation I made last year at LinuxConf AU about the growth in free and open source (FOSS) databases:

We’ve seen a half-dozen forks of MySQL appear, exponential growth among “NoSQL” databases, and now a powerful release from PostgreSQL. It may seem odd that the name of the conference didn’t change to reflect the change in focus – but this is the largest FOSS database conference I know of, weighing in at over 2000 people last year. Given the community that’s grown around it, I understand why they are keeping the name.

The content will still largely focus on MySQL — the core, the many forks, and the community around it. But we’ll also hear from many new, successful database projects, and definitely hear from PostgreSQL. To do that, though, I need you to submit talks!

The submission deadline for all proposals is October 25, 2010!

Topics for consideration include:

  • Innovative uses of Postgres
  • Data warehousing and BI
  • Architectures based on Postgres
  • PostGIS
  • Government + Postgres
  • [your favorite web framework] + Postgres
  • Performance and optimization
  • Security and database administration
  • “In the cloud”
  • Business and case studies

If you’ve got an idea, submit a proposal today!

Please contact me directly for feedback, help with submissions, or help generating ideas. And if you’re submitting, please drop me a line to let me know! I’d love to hear from everyone who is interested.

PostgreSQL 9.0.1 released, includes security fix & maintenance releases for 6 other versions

The PostgreSQL Global Development Group released new maintenance versions today: 9.0.1, 8.4.5, 8.3.12, 8.2.18, 8.1.22, 8.0.26 and 7.4.30. This is the final update for PostgreSQL versions 7.4 and 8.0. There’s a security issue in there involving procedural languages, and a detailed description of the vulnerability is on our wiki. A key thing to remember is that the issue primarily affects people who use SECURITY DEFINER along with a procedural language function. PL/pgSQL is not affected, but any other procedural language with a “trusted” mode is. This includes PL/Perl, PL/Tcl, PL/Python (7.4 or earlier) and others. The new versions fix the issues in PL/Perl and PL/Tcl. A patch for PL/PHP is currently in the works.

Most developers feel that the security issue is relatively obscure. If you aren’t using a procedural language with some mechanism for altering privileges (SET ROLE or SECURITY DEFINER, for example), you aren’t vulnerable to the security issue and can upgrade Postgres during your next regularly scheduled downtime. If you *are* vulnerable, we recommend investigating the use of the functions that may be vulnerable, and taking steps to prevent their exploitation by upgrading as soon as you can.
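If you’re not sure whether you have anything to audit, a catalog query along these lines is one way to start. This is only a sketch – it just lists SECURITY DEFINER functions written in a procedural language, which is the combination described above:

SELECT n.nspname, p.proname, l.lanname
FROM pg_proc p
JOIN pg_language l ON l.oid = p.prolang
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE p.prosecdef                                             -- SECURITY DEFINER functions...
  AND l.lanname NOT IN ('internal', 'c', 'sql', 'plpgsql');   -- ...in a PL other than PL/pgSQL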

From the FAQ:

What is the level of risk associated with this exploit?

Low. It requires all of the following:

  • An attacker must have an authenticated connection to the database server.
  • The attacker must be able to execute arbitrary statements over that connection.
  • The attacker must have a strong knowledge of PostgreSQL.
  • Your application must include procedures or functions in an external procedural language.
  • These functions and procedures must be executed by users with greater privileges than the attacker, using SECURITY DEFINER or SET ROLE, and using the same connection as the attacker.

This was also the first release for which I generated release notes! 😀

Here was my list of interesting changes for the announcement:

  • Prevent show_session_authorization() from crashing within autovacuum processes, backpatched to all supported versions;
  • Fix connection leak after duplicate connection name errors, fix handling of connection names longer than 62 bytes and improve contrib/dblink’s handling of tables containing dropped columns, backpatched to all supported versions;
  • Defend against functions returning setof record where not all the returned rows are actually of the same rowtype, backpatched to 8.0;
  • Fix possible duplicate scans of UNION ALL member relations, backpatched to 8.2;
  • Reduce PANIC to ERROR on infrequent btree failure cases, backpatched to 8.2;
  • Add hstore(text, text) function to contrib/hstore, to support migration away from the => operator, which was deprecated in 9.0. Function support backpatched to 8.2 (see the quick example after this list);
  • Treat exit code 128 as non-fatal on Win32, backpatched to 8.2;
  • Fix failure to mark cached plans as transient, which caused indexes built with CREATE INDEX CONCURRENTLY not to be used right away, backpatched to 8.3;
  • Fix evaluation of the inner side of an outer join when the inner side is a sub-select with non-strict expressions in its output list, backpatched to 8.4;
  • Allow full SSL certificate verification to succeed in the case where both host and hostaddr are specified, backpatched to 8.4;
  • Improve parallel restore’s ability to cope with selective restore (-L option), backpatched to 8.4 with caveats;
  • Fix failure of “ALTER TABLE t ADD COLUMN c serial” when done by non-owner, 9.0 only;
  • Several bugfixes for join removal, 9.0 only.

If you have a look at a new tool that Robert Haas and Tom Lane committed to the repo called git_changelog, you can use it to find the commit IDs for the various features (you need the whole source tree to do it :)).

You’ll find that there are a lot of commits in these sets. We haven’t had a minor release since May 2010, so they kind of added up.

Any other changes in there you think we should have mentioned in the announcement? Let me know in the comments.

Download the new versions now at postgresql.org.

Custom aggregates: a couple tips and ORDER BY in 9.0

A friend asked about a way to report the first three semesters that a group of students were documented as being present, and report those values each in a column.
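For the examples below, assume an attendance table shaped something like this – my guess at the minimal shape, and the actual test file I used is attached at the end of the post:

CREATE TABLE assoc (
    student  char(2),   -- student identifier
    semester char(3)    -- semester code; sorts in chronological order
);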

The tricky thing is that the semesters students attend are rarely the same. I started out with a very naive query just to get some initial results:


SELECT student,
       (SELECT semester FROM assoc a2 WHERE a2.student = a1.student
        ORDER BY semester LIMIT 1) AS sem1,
       (SELECT semester FROM assoc a2 WHERE a2.student = a1.student
        ORDER BY semester LIMIT 1 OFFSET 1) AS sem2,
       (SELECT semester FROM assoc a2 WHERE a2.student = a1.student
        ORDER BY semester LIMIT 1 OFFSET 2) AS sem3
FROM assoc a1
WHERE student IN (SELECT student FROM assoc GROUP BY student HAVING count(*) > 2)
GROUP BY student;

That query pretty much sucks, requiring five sequential scans of ‘assoc’:

                                     QUERY PLAN                                     
 HashAggregate  (cost=3913.13..315256.94 rows=78 width=2)
   ->  Hash Semi Join  (cost=1519.18..3718.08 rows=78017 width=2)
         Hash Cond: (a1.student = assoc.student)
         ->  Seq Scan on assoc a1  (cost=0.00..1126.17 rows=78017 width=2)
         ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)
               ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)
                     Filter: (count(*) > 2)
                     ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=2)
   SubPlan 1
     ->  Limit  (cost=1326.21..1326.22 rows=1 width=3)
           ->  Sort  (cost=1326.21..1328.71 rows=1000 width=3)
                 Sort Key: a2.semester
                 ->  Seq Scan on assoc a2  (cost=0.00..1321.21 rows=1000 width=3)
                       Filter: (student = a1.student)
   SubPlan 2
     ->  Limit  (cost=1331.22..1331.22 rows=1 width=3)
           ->  Sort  (cost=1331.21..1333.71 rows=1000 width=3)
                 Sort Key: a2.semester
                 ->  Seq Scan on assoc a2  (cost=0.00..1321.21 rows=1000 width=3)
                       Filter: (student = a1.student)
   SubPlan 3
     ->  Limit  (cost=1334.14..1334.14 rows=1 width=3)
           ->  Sort  (cost=1334.14..1336.64 rows=1000 width=3)
                 Sort Key: a2.semester
                 ->  Seq Scan on assoc a2  (cost=0.00..1321.21 rows=1000 width=3)
                       Filter: (student = a1.student)

So, he reminded me about custom aggregates! I did a little searching and found an example function, to which I added an extra CASE condition that stops the aggregate from adding more than three items to the returned array:


CREATE FUNCTION array_append_not_null(anyarray, anyelement)
RETURNS anyarray
AS $$
    SELECT CASE WHEN $2 IS NULL THEN $1                 -- ignore NULLs
                WHEN array_upper($1, 1) > 2 THEN $1     -- already have three items
                ELSE array_append($1, $2)
           END
$$
LANGUAGE sql IMMUTABLE RETURNS NULL ON NULL INPUT;

And finally, I declared an aggregate:


CREATE AGGREGATE three_semesters_not_null (
    sfunc = array_append_not_null,
    basetype = anyelement,
    stype = anyarray,
    initcond = '{}'
);
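A quick sanity check with made-up values – a VALUES list is scanned in the order written, the NULL is skipped (the transition function is strict), and the cap kicks in after three items:

SELECT three_semesters_not_null(x)
FROM (VALUES ('s1'), (NULL), ('s2'), ('s3'), ('s4')) AS v(x);
-- expected result: {s1,s2,s3}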

One problem though – we want the array returned to contain only the first three semesters, rather than any three semesters a student has a record for. That means we need to sort the information passed to the aggregate function. We could do this inside the aggregate itself (bubble sort, anyone?) or we can presort the input! I chose presorting, to avoid writing a really ugly CASE statement.

My query (compatible with 8.3 or higher):


SELECT sorted.student, three_semesters_not_null(sorted.semester)
FROM (SELECT student, semester FROM assoc ORDER BY semester) AS sorted
WHERE sorted.student IN
      (SELECT a.student FROM assoc a GROUP BY a.student HAVING count(*) > 2)
GROUP BY sorted.student;

Which yields the much nicer query plan, requiring just two sequential scans:

                                      QUERY PLAN                                      
 HashAggregate  (cost=11722.96..11725.46 rows=200 width=64)
   ->  Hash Semi Join  (cost=10052.32..11570.82 rows=30427 width=64)
         Hash Cond: (assoc.student = a.student)
         ->  Sort  (cost=8533.14..8728.18 rows=78017 width=5)
               Sort Key: assoc.semester
               ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=5)
         ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)
               ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)
                     Filter: (count(*) > 2)
                     ->  Seq Scan on assoc a  (cost=0.00..1126.17 rows=78017 width=2)

I ran my queries by Magnus, and he reminded me that what I really needed was ORDER BY in my aggregate! Fortunately, 9.0 has exactly this feature:


SELECT student,
       three_semesters_not_null(semester ORDER BY semester ASC) AS first_three_semesters
FROM assoc
WHERE student IN (SELECT student FROM assoc GROUP BY student HAVING count(*) > 2)
GROUP BY student;

Which results in the following plan:

                                        QUERY PLAN                                        
 GroupAggregate  (cost=11125.05..11711.15 rows=78 width=5)
   ->  Sort  (cost=11125.05..11320.09 rows=78017 width=5)
         Sort Key: public.assoc.student
         ->  Hash Semi Join  (cost=1519.18..3718.08 rows=78017 width=5)
               Hash Cond: (public.assoc.student = public.assoc.student)
               ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=5)
               ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)
                     ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)
                           Filter: (count(*) > 2)
                           ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=2)

A final alternative would be to transform the IN query into a JOIN:


SELECT a.student,
       three_semesters_not_null(a.semester ORDER BY a.semester ASC) AS first_three_semesters
FROM assoc a
JOIN (SELECT student FROM assoc GROUP BY student HAVING count(*) > 2) AS b
     ON b.student = a.student
GROUP BY a.student;

And the plan isn’t much different:

                                        QUERY PLAN                                        
 GroupAggregate  (cost=11125.05..11711.15 rows=78 width=5)
   ->  Sort  (cost=11125.05..11320.09 rows=78017 width=5)
         Sort Key: a.student
         ->  Hash Join  (cost=1519.18..3718.08 rows=78017 width=5)
               Hash Cond: (a.student = assoc.student)
               ->  Seq Scan on assoc a  (cost=0.00..1126.17 rows=78017 width=5)
               ->  Hash  (cost=1518.20..1518.20 rows=78 width=32)
                     ->  HashAggregate  (cost=1516.26..1517.42 rows=78 width=2)
                           Filter: (count(*) > 2)
                           ->  Seq Scan on assoc  (cost=0.00..1126.17 rows=78017 width=2)

Any other suggestions for this type of query?

I’ve attached the file I was using to test this out.
custom_aggregates.sql

PostgreSQL 9.0: contributions!

New releases are opportunities for reflection!

As PostgreSQL grows, I would like to know how many people are contributing at any time. This is difficult to measure, given how many people contribute in ways not visible to the internet – advocating for PostgreSQL at work, sharing information about PostgreSQL offline in any way, or developing code related to PostgreSQL that isn’t shared directly back.

PostgreSQL developers have a habit of mentioning the people involved in the development of features in the commit logs. This includes people who discuss topics on the mailing list, who report bugs, provide test cases or send in patches. I spent a bit of time digging through the commit logs and pulling out unique names that are mentioned. This is a lossy process, as the log files are long, names are not always easy to spot, and I only spent 6 hours going through it.

I’ve made it through the 9.0 (16163 lines of logs) and 8.4 logs (21257 lines of logs) so far.

Here’s some basic information about them:

  • 8.4 logs mention about 230 unique people (11 committers)
  • 9.0 logs mention about 275 unique people (16 committers)
  • 8.4 development contained 2293 commits, with commits per author broken down below:
  • 9.0 development contained 1703 commits, with commits per author broken down below:

I’m working on graphs about number of lines inserted or deleted by each author, but need more time to work out the information presentation. Some interesting trends emerge about what the role of each committer is – particularly that there are a couple people who seem to be “gardeners” of the code – removing a lot of lines, sometimes more than they are adding. With a project as old as ours (first commit in 1996!), this maintenance work is critical.

I also did some grepping for key words in the commit messages:

word          times in 8.3   times in 8.4   times in 9.0
review                  24             14             49
cute                    29             26             25
tom lane               904            901            635
gripe                   37             48             26
hot standby              0              5             48
replication             18              4             52

You’re welcome to explore our git repo at git.postgresql.org. Thanks to all the folks who worked on the git migration over the past few months, and finally made our transition from CVS to git complete last night!

PostgreSQL 9.0 media coverage roundup

As you probably noticed, we released PostgreSQL 9.0 today!

There were quite a few stories, and here’s what I’ve got so far:

CouchCamp 2010: yay!


Max in a tree! Talking about GeoCouch

I was at CouchCamp last week out at the Walker Creek Ranch – a bit disconnected (no cell service, and spotty internet), but fully immersed in the CouchDB community.

I was there to give a talk on MVCC in PostgreSQL. I forgot to mention it during my talk, but it was a fitting topic given that I first talked with JChris after a talk he gave in Portland, where I basically trolled him about compaction and MVCC in CouchDB. My goal was to show people the benefits of CouchDB’s built-in MVCC, to point out some places where core developers can learn from PostgreSQL and avoid some of the traps we’ve fallen into over the years. I’ve got more to say about the talk some other day, but I wanted to just reflect on CouchCamp for a moment.

One comment a friend made was, “Wow, these people are just so nice!” And it’s true. Every hacker meetup I attend is full of people who are overwhelmingly kind and thoughtful, and CouchCamp was more of the same.

CouchDB is at a critical point in their development – 1.0 is out the door, and developers are already building cool apps on top of it. CouchApps + Evently are an interesting and fun way to get started building things on top of a couch. And replication parties – seriously awesome. Ward Cunningham is rumored to be considering a CouchDB wiki to drive the patterns repository wiki (And here it is! Thanks, Max!), and CouchCamp was overflowing with ideas and implementations (distributed social, a replacement for email, UbuntuOne).

So what did I learn at CouchCamp? I learned how to hack on a CouchApp (Thanks for the help, JChris!). I learned about what Max Ogden is up to, and am so excited for him and the lucky folks that get to work with him. (and he’s running a hack/project night next weekend you should TOTALLY GO TO!)

I heard about the successes and tribulations of running CouchDB on the desktop, and about the launch of UbuntuOne, from Stuart Langridge. During his talk, Stuart brought up the idea of a general replication API – something that I also believe is important to the growth of open source databases and critical to enabling data freedom. I met a real, live Pick user/admin/developer, and talked about the inability to move off that system, and the possibility of interfacing something like CouchDB with it. I got to chat with Rebecca Murphey about Javascript, MVCC and quality booze. I saw bunnies, foxes, deer, raccoons, and tons of bright stars late at night. And I saw Damien Katz perform a brief interpretive dance.

I also was pointed to a retrospective on Couch 1.0 development by Ted Leung. I don’t know Noah Slater, but wow, what a testimonial. Noah’s comments about why he continues to contribute to CouchDB mirror a recent thread about PostgreSQL contribution — we work on these open source projects because of the incredible community that develops around them.

Thanks, Mikael, JChris, Jan and Damien, and all the CouchDB folks for creating a community that so many people want to contribute and become a part of. I certainly want to be a part of it, and look forward to finding ways of contributing more.

And thanks for bringing us all together in person. From the squirt guns in the welcome bag, to the campfire and sing-alongs, to the very late night Android libc storytelling by Aaron… These are the moments that glue us all together, and make all that work we do to connect up with one another through software completely worth it.

Explaining MVCC in Postgres: system-defined columns

I’m playing around with some diagrams for explaining MVCC that I’ll be posting here over the next few days. Not sure if I’ll end up giving up on slides and just use a whiteboard for the talk. I made an illustrated shared buffers deck to go along with Greg Smith’s excellent talk on shared buffers a while back. This is the beginning of a talk that I hope will emulate that.

Here are my first few slides, showing the system-defined columns. The next few slides will describe optimizations PostgreSQL has for managing the side effects of our pessimistic rollback strategy, and for reducing IO during vacuuming and index updates.
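In the meantime, here’s the kind of thing the slides walk through – every Postgres table carries hidden system columns that you can select by name (the table below is just a throwaway example):

CREATE TABLE t (val integer);
INSERT INTO t VALUES (1);

-- Hidden from SELECT *, but available by name:
SELECT xmin, xmax, cmin, cmax, ctid, val FROM t;
-- xmin: ID of the transaction that inserted this row version
-- xmax: ID of the transaction that deleted or updated it (0 while live)
-- ctid: the row version's physical location (block, offset)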

Variable substitution with psql

Updated: Thanks @johto for s/:bar/:foo/. 🙂

A coworker asked about variable substitution with psql using \set, and so I looked into it a bit further.

You definitely can do things like this:

16:55 sdeckelmann@[local]:5432|postgres=> \set test 'select * from :foo limit 10;'
16:56 sdeckelmann@[local]:5432|postgres=> \set foo 'test'
16:56 sdeckelmann@[local]:5432|postgres=> :test
myint
-------
1
2
3
4
5
6
7
8
9
10
(10 rows)

But, what about something like this:


=> \set test 'select * from :var limit 10;'
=> :test mytable

Unfortunately, this isn’t supported.

The best you could do is something pathological like:

=> \set s 'select * from '
=> \set pr ' limit 10;'
=> :s mytable :pr
=> :s test :pr
myint
-------
1
2
3
4
5
6
7
8
9
10
(10 rows)