PostgreSQL 8.5 recently learned how to handle "inline functions" through the DO statement. Further discussion is here, but the basic idea is that within certain limitations, you can write ad hoc code in any language that supports it, without having to create a full-fledged function. One of those limitations is that you can't actually return anything from your function. Another is that the language has to support an "inline handler".
PostgreSQL procedural languages all have a language handler function, which gets called whenever you execute a stored procedure in that language. An inline handler is a separate function, somewhat slimmed down from the standard language handler. PostgreSQL gives the inline handler an argument containing, among other things, the source text passed in the DO block, which the inline handler simply has to parse and execute.
As of when the change was committed in PostgreSQL, only PL/pgSQL supported inline functions. Other languages may now support them; today I spent the surprisingly short time needed to add the capability to PL/LOLCODE. Here's a particularly useless example:
DO $$ HAI VISIBLE "This is a test of INLINE stuff" KTHXBYE $$ language pllolcode;
I'm in Seattle for the PostgreSQL Conference West today! I just finished giving a talk on Bucardo, a master-slave and multi-master replication system for Postgres.
The talk was full, and had lots of people who've used Slony in the past, so I got lots of great questions. I realized we should publish some "recommended architectures" for setting up the Bucardo control database, and provide more detailed diagrams for how replication events actually occur. I also talked to someone interested in using Bucardo to show DDL differences between development databases and suggested he post to the mailing list. Greg has created scripts to do similar things in the past, and it would be really cool to have Bucardo output runnable SQL for applying changes.
I also made a hard pitch for people to start a SEAPUG, and it sounds like some folks from the Fred Hutchinson Cancer Research Center are interested. (I'm naming names, hoping that we can actually do it this time :D). If you are from the Seattle area, go ahead and subscribe to the seapug@postgresql.org mailing list (pick 'seapug' from the list dropdown menu)!
Thanks everyone who attended, and I'm looking forward to having lunch with a bunch of PostgreSQL users here in Seattle!
Recently I've been working on a database with many multi-column indexes, and I've wondered how often all the columns of the index were used. Many of the indexes in question are primary key indexes, and I need all the columns to guarantee uniqueness, but for non-unique indexes, it would make sense to remove as many indexes from the column as possible. Especially with PostgreSQL 8.3 or greater, where I can take advantage of heap-only tuples[1], leaving columns out of the index would be a big win. PostgreSQL's statistics collector will already tell me how often an index is scanned. That shows up in pg_stat_all_indexes. But for a hypothetical index scanned 100 times, there's no way to know how many of those 100 scans used all the columns of the index, or, for instance, just the first column.
First, an example. I'll create a table with three integer columns, and fill it with random data:
5432 josh@josh# CREATE TABLE a (i INTEGER, j INTEGER, k INTEGER); CREATE TABLE 5432 josh@josh*# INSERT INTO a SELECT i, j, k FROM (SELECT FLOOR(RANDOM() * 10) AS i, FLOOR(RANDOM() * 100) AS j, FLOOR(RANDOM() * 1000) AS k, GENERATE_SERIES(1, 1000)) f; INSERT 0 1000 5432 josh@josh*# CREATE INDEX a_ix ON a (i, j, k); CREATE INDEX 5432 josh@josh*# ANALYZE a; ANALYZE 5432 josh@josh*# COMMIT; COMMIT
This leaves me with a three-column index on 1000 rows of the following:
5432 josh@josh*# SELECT * FROM a LIMIT 10; i | j | k ---+----+----- 3 | 6 | 380 7 | 94 | 933 1 | 73 | 326 2 | 86 | 224 2 | 59 | 336 9 | 44 | 220 9 | 48 | 694 3 | 27 | 268 3 | 0 | 410 8 | 25 | 337 (10 rows)
Now I need to make a query that will use the index. That's easy enough, with these two queries. As shown by the index condition, the first query uses all three columns of the index, and the second, only two.
5432 josh@josh# EXPLAIN SELECT * FROM a WHERE i > 8 AND j > 80 AND k > 800;
QUERY PLAN
-------------------------------------------------------------------
Bitmap Heap Scan on a (cost=5.64..10.74 rows=4 width=12)
Recheck Cond: ((i > 8) AND (j > 80) AND (k > 800))
-> Bitmap Index Scan on a_ix (cost=0.00..5.64 rows=4 width=0)
Index Cond: ((i > 8) AND (j > 80) AND (k > 800))
(4 rows)
5432 josh@josh*# EXPLAIN SELECT * FROM a WHERE i > 8 AND j > 80;
QUERY PLAN
--------------------------------------------------------------------
Bitmap Heap Scan on a (cost=5.37..10.67 rows=20 width=12)
Recheck Cond: ((i > 8) AND (j > 80))
-> Bitmap Index Scan on a_ix (cost=0.00..5.36 rows=20 width=0)
Index Cond: ((i > 8) AND (j > 80))
(4 rows)
Inside PostgreSQL, these queries result in a call to _bt_first() inside src/backend/access/nbtree/nbtsearch.c. This function two parameters: an IndexScanDesc object called scan, which describes the index to scan, the key to look for, and some other stuff, and a ScanDirection parameter to tell _bt_first() which direction to scan the index. It's this call that tells the statistics collector about each index scan, and it's this call that we'll instrument with SystemTap. I'm interested in the value in scan->numberOfKeys, which tells me how many of the index's keys will be considered in each scan. SystemTap makes getting this information really easy. I gave an introduction to SystemTap and using it with PostgreSQL in an earlier post; the following assumes familiarity with that material.
Since PostgreSQL doesn't come with a DTrace probe built into the _bt_first() function, I'll use SystemTap's ability to probe directly into a function. Conveniently, SystemTap also allows access to the values of variables in the function's memory space at runtime. Note that the technique shown below requires a PostgreSQL binary built with --enable-debug. Without debug information in the binary, different techniques are used, and the information is harder to get.
The test script I used is as follows:
probe process("/usr/local/pgsql/bin/postgres").function("_bt_first")
{
/* Time of call */
printf ("_bt_first at time %dn", get_cycles())
/* Number of scan keys */
printf("%d scan keysn", $scan->numberOfKeys)
/* OID of index being scanned */
printf("%u index oidnn", $scan->indexRelation->rd_id)
}
Note that the script above accesses variables in the _bt_first() function just as standard C functions would. The script has the following output:
[josh@localhost ~]$ sudo /usr/local/bin/stap -v test.d Pass 1: parsed user script and 59 library script(s) in 130usr/70sys/196real ms. Pass 2: analyzed script: 2 probe(s), 3 function(s), 0 embed(s), 0 global(s) in 50usr/50sys/103real ms. Pass 3: translated to C into "/tmp/stapzCCwZE/stap_1854c2da59908c3e3633d6385ca6ce52_2782.c" in 120usr/80sys/209real ms. Pass 4, preamble: (re)building SystemTap's version of uprobes. Pass 4: compiled C into "stap_1854c2da59908c3e3633d6385ca6ce52_2782.ko" in 2240usr/3270sys/8102real ms. Pass 5: starting run. _bt_first at time 49379911010213 1 scan keys 2703 index oid _bt_first at time 49379982691988 1 scan keys 2684 index oid _bt_first at time 49379987397126 1 scan keys 2684 index oid
You'll note several indexes get scanned immediately. These are indexes from the PostgreSQL catalog. The index we created above has OID 16388. First, I'll run the query with three scan keys, followed by the query with two keys:
_bt_first at time 50357469430819 3 scan keys 16388 index oid _bt_first at time 50363763650571 2 scan keys 16388 index oid
As expected, SystemTap reported first three and then two scan keys used, along with the OID of the a_ix index I created. With a technique like this I could, at least theoretically, get an exact usage profile for each index, and determine whether they need all the columns they have.
[1] See, for example, this page.
During some routine checking on a slony cluster, Greg noticed something curious. Replication was still happening between the master and a couple slaves, but we were seeing our indicator for lag inside of slony increasing.
To check out the status of slony replication, you will typically take a look at the view ?sl_status?:
mydatabase=# select * from sl_status;
st_origin | st_received | st_last_event | st_last_event_ts | st_last_received | st_last_received_ts | st_last_received_event_ts | st_lag_num_events | st
_lag_time
-----------+-------------+---------------+----------------------------+------------------+----------------------------+---------------------------+-------------------+---------
----------------
2 | 1 | 2697511 | 2008-04-30 02:40:06.034144 | 2565031 | 2008-04-14 15:31:32.897165 | 2008-04-14 16:24:08.81738 | 132480 | 15 days
10:16:03.060499
(1 row)
This view pulls data out of sl_event and sl_confirm, two tables that keep track of the forward progress of replication. Every time there is an event - SYNCs, DDL changes, slony administrative events - a row is added to sl_event. Slony is very chatty and so all of the slaves send events to each other, as well as the master. (That statement is a simplification, and it is possible to make some configuration changes that reduce the traffic, but in general, this is what people who set up slony will see.)
Broken down, the columns are:
st_origin: the local slony system
st_received: the slony instance that sent an event
st_last_event: the sequence number of the last event received from that origin/received pair
st_last_event_ts: the timestamp on the last event received
st_last_received: the sequence number of the last sl_event + sl_confirm pair received
st_last_received_ts: the timestamp on the sl_confirm in that pair
st_last_received_event_ts: the timestamp on the sl_event in that pair
st_lag_num_events: difference between st_last_event and st_last_received
st_lag_time: difference between st_last_event_ts and st_last_received_ts
Depending on the type of event, a row might be added to sl_confirm immediately (by the same thread that created the event), or this may be created separately by another process. The important thing here is that there is a separation between sl_event and sl_confirm, so it is possible for sl_event SYNCs (replication events) to continue to come through and be applied to the server, without the sl_confirm rows being eventually created.
We have a monitor which checks the status of replication by looking at a recently added value on the master and comparing that to what is on the slave. This works well for workloads that are primarily append-only. So, that monitor thought replication was working fine, even though the lag was increasing steadily.
sl_event and sl_confirm tables are periodically cleaned up by cleanupEvent(), automatically by slony. Typically, this function is run every 100 seconds. When the slon process kicks it off, it checks to see what the newest confirmed events are, deletes old event records, and old confirm rows.
When confirms stop coming through, sl_events can?t be cleaned up on the affected server (because they haven?t been confirmed!). Depending on how active your servers are, this will eat up disk space. But you?ve got disk space monitors in place, right? :)
So, how do you fix the problem when the confirms stop coming through?
I had a look at process tables on all the slon slaves, and noticed that on the two lagged systems, there was no incoming connection from the master slony system. The fix: restart slony on the master so that it could reconnect.
There?s a couple things I wished that slony would have told me:
- Notification on the slave that it no longer had its connection back to the master. We?ll set up our own monitors to detect that this connection no longer exists, but it would be much nicer for slony to warn about this. Additionally, it would be nice to be able to re-connect to a single slave without restarting slon entirely.
- More explanation about sl_confirm and likely causes of failed confirmations. I hope I?ve shed a little light with this blog post.
The documentation for setting up slony is very good, but the troubleshooting information is lacking around events and confirmations, and how each type of event and confirmation actually happens. I?m happy to be proven wrong -- so please leave pointers in the comments!
Wednesday was the first official day of OSCON, and I spent it elbow deep in filesystems. The morning was kicked off with Val Aurora delivering a great overview of Btrfs, a new fileystem currently in development. Some of the features include:
- Copy on write filesystem
- Cheap, easy filesystem snapshots
- Dynamically resizable partitions
- Indexed directory structure
- Very simple administration
Val demonstrated basic functionality, including creating snapshots and creating a Btrfs filesystem on top of an ext3 filesystem. Cool stuff! The filesystem is still under heavy development, but seems very promising.
Next I saw Theodore Ts'o, the primary developer behind ext4, talk about the future of filesystems and storage. He referenced a great paper that dives deep into the economics behind SSD (solid state drives) and platter hard drive manufacturing. One interesting calculation was that even if we could convert all the silicon fabs to manufacture flash, would only be able to covert about 12% of the world-wide capacity of hard drive production. Because of this, Theodore believes that it is going to be challenging for the cost of SSDs to drop to the point where it becomes cost competitive with hard drives.
Other observations from Theodore concerned the slowing of innovation around hard drives, and companies like Seagate cutting back in their R&D departments. He sees opportunity for software and filesystem innovation in this environment, and so far that is playing out in the rapid development of new filesystems for Linux (Nilfs2, POMELFS, and EXOFS as three recent new examples). One open issue he brought up is the need for more and better benchmarking tools.
In the afternoon, I presented Linux Filesystem Performance for Databases. I've uploaded the slides to the conference site. I talked about the work that the Portland PostgreSQL Performance Pad team did on filesystem testing with some hardware donated from HP. I also included results from some recent DBT-2 tests Mark had run with PostgreSQL, using pgtune and then refining a few key parameters.
There were quite a few interesting questions, and I talked to one of the Wikia admins about a recent change he'd made to use SSDs instead of hard drives in some of their servers. I mentioned that it would be great to see a case study and data from his experience.
Yesterday, Brian Aker and Eric Day presented pgGearman: A distributed worker queue for PostgreSQL during the OSCON/SFPUG PgDay.
Gearman is a distributed worker queuing system that allows you to farm work out to a collection of servers, and basically run arbitrary operations. The example they presented was automating and distributing the load of image processing for Livejournal. For example, everyone loves to share pictures of their kittens, but once an image is uploaded, it may need to be scaled or cropped in different ways to display in different contexts. Gearman is a tool you can use to farm these types of jobs out.
So, in anticipation of the talk, I worked with Eric Day on a set of C-language user defined functions for Postgres that allow client connections to a Gearman server.
You can try out the pgGearman 0.1 release on Launchpad!
Today is my last day in Nigeria. I hop in a car in a couple hours and head off to visit the university in Akure, and then I will be driving to Lagos to catch a plane home.
My students are pictured above. We covered a great deal of material this week. They learned about the PostgreSQL project, basic database administration, how to develop a schema from forms and application requirements, how to write procedural code inside the database, and how to use the pgAdmin and psql interfaces.
I learned about how many of the officials and IT workers I met (both in the class and outside of it), had worked very hard on the court case that led to the change in government in Ondo State three months ago. There had been systematic election fraud, and they were able to prove it in court using some clever IT and forensic analysis work. The members of SITEDEC believe very strongly in the importance of IT in increasing government accountability and transparency, a belief re-affirmed by their recent successes.
I'm looking forward to hearing about how the work progresses on their census and voter registration databases. Of course, I want to come back to Nigeria. It's a beautiful country, and I didn't have nearly enough time here to appreciate it.
When specifying a password for the Windows PostgreSQL one-click installer, you get this message:
"
If you have already installed Postgres as a service, you will need to enter the current user postgres service user password to get past the password dialog box. Meaning, if you're logged in to Windows as 'selena', you need to enter selena's password. As a non-Windows user, this baffled me, and a few other people on this thread.
Otherwise, you can just enter a password that will be used for the 'postgres' database user. Hope this helps someone!
Update:
Further explanation from Dave Page, the maintainer of the windows package:
Selena: It's not the password for the user that you are logged in as that you need to enter, it's the password for the service account (ie. postgres).
Unlike *nix & Mac, service accounts on Windows need to have passwords so unfortunately we need to ensure we have the correct password to install the service. Hence, if there's an existing postgres account, we need the existing password, otherwise the account will be created with whatever password you specify.
In all OSs, we use the password entered on that page as the database superuser password.
One of the more valuable features of any conference is the so-called "hall track", or in other words, the opportunity to talk to all sorts of people about all sorts of things. PGCon was no exception, and I found the hall track particularly interesting because of suggestions I was able to gather regarding multi-column statistics, not all of which boiled down to "You're dreaming -- give it a rest". One of the problems I'd been trying to solve was where, precisely, to put the code that actually applies the statistics to a useful problem. There are several candidate locations, and certainly quite a few places where we could make use of such statistics. The lowest-hanging fruit, however, seems to be finding groups of query clauses that aren't as independent as we would normally assume. Between PGCon sessions one day, Tom Lane pointed me to a place where we already do something very similar: clausesel.c
"Clause selectivity" means much the same thing as any other selectivity: it's the proportion of rows from a relation that an operation will return. A "clause", in this case, is a filter on a relation, such as the "X = 1" and the "Y < 10" in "WHERE X = 1 AND Y < 10". PostgreSQL uses functions in clausesel.c to find clauses whose combined selectivity differs from the product of their individual selectivities. For instance, in "WHERE X < 4 AND X < 5", the "X < 5" is redundant; the clauses' combined selectivity is simply that of "X < 4". With "WHERE Y > 4 AND Y < 10", clausesel.c can determine that we really want the selectivity of the clause "4 < Y < 10". It's also smart enough to recognize "pseudo-constants": values from non-volatile functions, or from the outer relation of a nested loop. Although these values aren't truly constants, they remain constant at the level of the query where the clause will be applied, and can be treated as constants.
With any luck, one day clausesel.c will also know enough to notice cases where, for instance, although "foo.x = 3" and "foo.y > 10" are individually true for much of table "foo", there are very few rows where both conditions are true.
Those familiar with PostgreSQL know it has supported DTrace since version 8.2. The 8.4beta2 includes support for several new DTrace probes. But for those of us using platforms on which DTrace doesn't exist, this support hasn't necessarily meant much. SystemTap is a relatively new, Linux-based package with similar purpose to DTrace, available on Linux, and is under heavy development. As luck would have it, PostgreSQL's DTrace probes work with SystemTap as well.
A few caveats: it helps to run a very new SystemTap version (I used one I pulled from SystemTap's git repository today), and in order for SystemTap to have access to userspace software, your kernel must support utrace. I don't know precisely what kernel versions include the proper patches; my Ubuntu 8.04 laptop didn't have the right kernel, but the Fedora 10 virtual machine I just set up does.
Step 1 was to build SystemTap. This was a straightforward ./configure, make, make install, once I got the correct packages in place. Step 2 was to build PostgreSQL, including the --enable-dtrace option. This also was straightforward. Note that PostgreSQL won't build with the --enable-dtrace option unless you've already installed SystemTap. Finally, I initialized a PostgreSQL database cluster and started the database.
Here's where the fun starts. SystemTap's syntax differs from DTrace syntax. Here's an example probe SystemTap would accept:
probe process("/usr/local/pgsql/bin/postgres").function("eqjoinsel")
{
printf ("%dn", pid())
}
This tells SystemTap to print out the process ID (which comes from the SystemTap pid() function) each time the PostgreSQL eqjoinsel function is called. That's the function to estimate join selectivity with most equality operators, and gets called a lot, so it's a decently useful test. It also shows that SystemTap can probe inside programs without an explicitly defined probe. I saved this file as test.d, and ran it like this:
[josh@localhost ~]$ sudo stap -v test.d Pass 1: parsed user script and 52 library script(s) in 160usr/220sys/641real ms. Pass 2: analyzed script: 1 probe(s), 1 function(s), 1 embed(s), 0 global(s) in 40usr/60sys/331real ms. Pass 3: translated to C into "/tmp/stapDD5a4p/stap_c0b737cdffdb48cec3fd55b631bb0656_1057.c" in 30usr/160sys/211real ms. Pass 4, preamble: (re)building SystemTap's version of uprobes. Pass 4: compiled C into "stap_c0b737cdffdb48cec3fd55b631bb0656_1057.ko" in 1510usr/3430sys/8052real ms. Pass 5: starting run. 4521 4521 4521 4521
4521 is the process ID of the PostgreSQL backend I'm connected to, and it gets printed every time I type "dt" in my psql session.
Now for something more interesting. Although SystemTap lets me probe whatever function I want, it's nice to be able to use the defined DTrace probes, because that way I don't have to find the function name I'm interested in, in order to trace something. Here are some examples I added to my test.d script, pulled more or less at random from the list of available DTrace probes in the PostgreSQL documentation. Note that whereas the documentation lists the probe names with dashes (or are these hyphens?), to make it work with SystemTap, I needed to use double-underscores, so "transaction-start" in the docs becomes "transaction__start" in my script.
probe process("/usr/local/pgsql/bin/postgres").mark("transaction__start")
{
printf("Transaction start: %dn", pid())
}
probe process("/usr/local/pgsql/bin/postgres").mark("lwlock__condacquire") {
printf("lock wait start at %d for process %d on cpu %dn", gettimeofday_s(), pid(), cpu())
}
probe process("/usr/local/pgsql/bin/postgres").mark("sort__start") {
printf("transaction abort at %d for process %d on cpu %dn", gettimeofday_s(), pid(), cpu())
}
probe process("/usr/local/pgsql/bin/postgres").mark("smgr__md__write__done") {
printf("smgr-md-write-done at %d for process %d on cpu %dn", gettimeofday_s(), pid(), cpu())
}
...which resulted in something like this when I ran pgbench:
[josh@localhost ~]$ sudo stap -v test.d Pass 1: parsed user script and 52 library script(s) in 130usr/150sys/286real ms. Pass 2: analyzed script: 7 probe(s), 4 function(s), 2 embed(s), 0 global(s) in 30usr/30sys/120real ms. Pass 3: translated to C into "/tmp/stapW9yfAQ/stap_f6f3ffd834ef5b249edcf7d1ca19dce2_3025.c" in 10usr/150sys/163real ms. Pass 4, preamble: (re)building SystemTap's version of uprobes. Pass 4: compiled C into "stap_f6f3ffd834ef5b249edcf7d1ca19dce2_3025.ko" in 1380usr/2690sys/4155real ms. Pass 5: starting run. Transaction start: 4894 Transaction start: 4894 lock wait start at 1243552147 for process 4907 on cpu 0 Transaction start: 4907 Transaction start: 4907 lock wait start at 1243552147 for process 4907 on cpu 0 Transaction start: 4907 lock wait start at 1243552174 for process 2770 on cpu 0 smgr-md-write-done at 1243552174 for process 2770 on cpu 0 smgr-md-write-done at 1243552174 for process 2770 on cpu 0 smgr-md-write-done at 1243552174 for process 2770 on cpu 0
This could be a very interesting way of profiling, performance testing, debugging, troubleshooting, and who knows what else. I'm interested to see SystemTap become more ubiquitous. I should note that I have no idea how SystemTap compares to DTrace or whether it will manage to do for Linux what DTrace can do on other operating systems. Time will tell, I guess.
UPDATE: As has been pointed out in the comments, compiling PostgreSQL with --enable-dtrace is only necessary if I want to use the built-in "taps" (the SystemTap word, apparently, for its equivalent of DTrace probes). Probing by function call, or any of the other probe methods SystemTap supports, works without --enable-dtrace.
UPDATE 2: It's important to note that the defined DTrace probes include sets of useful variables that DTrace and SystemTap scripts might be interested in. For instance, it's possible to get the transaction ID within the transaction__start probe. In SystemTap, these variables are referenced as $arg1, $arg2, etc. So in a transaction__start probe, you could say:
printf("Transaction with ID %d startedn", $arg1)
Last time I described the design and major functions of Bucardo version 1 in detail. A natural question to ask about Bucardo 1 is, why didn't I use something else already out there? And that's a very good question.
I had no desire to create a new replication system and work out the inevitable kinks that would come with that. However, nothing then available met our needs, and today still nothing I'm familiar with quite would. So writing something new was necessary. Writing an asynchronous multimaster replications system for Postgres was not trivial, but turned out to be easier than I had expected thanks to Postgres itself -- with the caveats noted in the last post.
But, back to the landscape. What follows is a survey of the Postgres replication landscape as it looked in mid-2002 when I first needed multimaster replication for PostgreSQL 7.2.
pgreplicator
PostgreSQL Replicator is probably the most similar project to Bucardo 1. It was released in 2001 and does not appear to have had any updates since October 2001. I don't recall why I didn't use this, but from reviewing the documentation I suspect it was because it hadn't been updated for PostgreSQL 7.2, it used PL/Tcl, and required a daemon to run on every node. But the asynchronous store-and-forward approach, the use of triggers and data storage tables is similar to Bucardo 1.
dbmirror
I don't remember whether this was around in 2002, but it's part of PostgreSQL contrib now. It is master/slave replication only.
Slony-I
I don't think Slony-I existed in 2002 -- version 1.0 was released in 2004. But in any case, it only does master/slave replication.
Slony2
There has been no code released from this project and the website is now gone.
erserver
Master/slave replication, abandoned in favor of Slony-I. Website is now gone.
Postgres-R
This was a research project that worked with PostgreSQL 6.4. Some Postgres-R design documents were published. An effort to port it to PostgreSQL 7.2 (the pgreplication project) did not appear to have gotten very far. In 2008 it seems to have been partially revived. I don't know what the current status is.
PGCluster
This didn't exist in 2002. I'm not sure where it's at now. I believe it uses synchronous replication.
pgpool
This isn't the kind of "replication" I wanted; it's database load balancing and multiplexing. The pgpool listener is a single point of failure, and all databases must be accessible or data will be lost on a database server that is down.
Usogres
Master/slave replication for backup purposes.
Mammoth PostgreSQL + Replication
This didn't exist in 2002. It is only master/slave replication. It began as proprietary software but I believe is open source now.
EnterpriseDB Replication Server
A proprietary offering that came out in 2005 or 2006, for master/slave replication only. Has apparently been replaced by Slony, or perhaps was always rebranded Slony.
pgComparator
An rsync-like tool for comparing databases. Didn't exist in 2002. Probably much better than Bucardo 1's compare operation.
DBBalancer
Kind of like pgpool, more of a connection pooler. Hasn't been updated since 2002.
DRAGON
"Database Replication based on Group Communication." Links to this project were defunct.
DBI-Link
DBI-Link isn't about replication.
(Summary)
I assembled this list some time back and have made some updates to it. I'm sure there are more to consider today. Please comment if you have any corrections or additions.
Since PGCon 2009 begins next week, I thought it would be a good time to start publishing some history of the Bucardo replication system for PostgreSQL. Here I will cover only Bucardo version 1 and leave Bucardo versions 2 and 3 for a later post.
Bucardo 1 is an asynchronous multi-master and master/slave database replication system. I designed it in August-September 2002, to run in Perl 5.6 using PostgreSQL 7.2. It was later updated to support PostgreSQL 7.4 and 8.1, and changes in DBD::Pg's COPY functionality. It was built for and funded by Backcountry.com, and various versions of Bucardo have been used in production as a core piece of their infrastructure from September 2002 to the present.
Bucardo's design is simple, relying on the consistently correct behavior of the underlying PostgreSQL database software. It made some compromises on ideal behavior in order to have a working system in a reasonable amount of time, but the compromises are few and are mentioned below.
General design
Bucardo 1 needed to:
- Support asynchronous multimaster replication.
- Support asynchronous master/slave replication of full tables and changes to tables.
- Leave frequency of replication up to the administrator, which came by default since each replication event is a separate run of the program.
- Preserve transaction atomicity and isolation across databases.
- Continue collecting change information even when no replication process is running.
- Be fairly efficient in storing changes and in bandwidth usage sending them to the other database.
- Have a default "winner" in collision situations, with special handling possible for certain tables where more intelligent collision merges could be done.
- Not require any database downtime for maintenance, upgrades, etc.
- Be fairly simple to understand and support.
- Support a data flow arrangement such that the replicator is behind a firewall and reaches out to an external database, but doesn't require inbound access to the internal database.
Operations
There are four types of database operations Bucardo 1 can perform:
- peer - synchronize changes in one or more tables between two peer databases (multi-master)
- pushdelta - copy only changed rows from a table or set of tables from a master database to a slave database
- push - copy an entire table or set of tables from a master database to a slave database
- compare - compare all rows of one or more tables between two databases
I will discuss each of these operations in turn.
Peer sync
The peer sync operation is the most groundbreaking feature of Bucardo 1. The much smaller Backcountry.com of 2002 wanted to have an internal master database in their office, which housed their customer service and warehouse employees, buyers, and management. Their office had a low-bandwidth and not entirely reliable Internet connection. Their e-commerce web, application, and database servers were at a colocation facility with a fast Internet connection, and they wanted an identical master database to reside there, so that in the case of any disruption in connectivity between their office and colocation facility, both locations could continue to function independently, and their databases would automatically synchronize after connectivity was restored.
To summarize, what they needed is multi-master replication. Their needs would be satisfied with asynchronous multi-master replication. That meant that it was acceptable for the databases to be current with each other with 1-2 minutes of lag time. (Synchronous multi-master replication requires a continuous connection between the two master databases, and transactions are not allowed to commit until the transaction is completed on both databases.)
I want to review some of the features that are required for multi-master replication to work. First, it needs to have ACID properties just as the underlying database itself. The most relevant properties for our multi-master replication system are atomicity and isolation. A transaction must be entirely visible on a given database, or not visible at all.
For example, let us imagine that a customer ecommerce order consists of exactly 1 row in the "orders" table, which references 1 row in the "users" table, and the following tables may have 0 or more rows pointing to the "orders" table:
- order_lines
- order_notes
- credit_cards
- payments
- gift_certificates
- coupon_uses
- affiliate_commissions
- inventory
To add an order to the source database, a transaction is started, rows are added to relevant tables, the transaction is committed, and then those rows will all appear to other database users at once. Until the transaction is committed, no changes are visible. If an error occurs, the entire transaction rolls back, and it will never have been seen by any other database user.
This ensures that warehouse employees, customer service representatives, etc. will never see a partial order. This is especially important since we don't want to ship an order that is missing some of its line items, or double-charge a credit card because we didn't have a payment record yet. And an order without its associated inventory records would have trouble shipping at the warehouse.
This is all standard ACID stuff. But since I was writing a multi-master replication system from scratch, I had to assure the same properties across two database clusters, for which PostgreSQL had no facilities.
Changes are tracked by having a "delta table" paired with every table that's part of the multi-master replication system. The table has three columns: the primary key in the table being tracked, the wallclock timestamp, and an indicator of whether the change was due to an insert, update, or delete. Every change in the table being tracked is recorded by rules and triggers that insert a corresponding row in the delta table.
This is what the delta table for "orders" looks like (simplified a bit for readability):
Table "public.orders_delta"
Column | Type | Modifiers
---------------+-------------+-----------------------------------------
delta_key | varchar(14) | not null
delta_action | char(1) | not null
last_modified | timestamp | not null default timeofday()::timestamp
Check constraints:
"delta_action_valid" CHECK (delta_action IN ('I','U','D'))
Triggers:
orders_delta_last_modified BEFORE INSERT OR UPDATE ON orders_delta
FOR EACH ROW EXECUTE PROCEDURE update_last_modified()
The new row data itself in the tracked table is not copied, because the data is right there for the taking. It is enough to note that a change was made. If multiple changes are made, only the most recent version of the row is available, but that is fine because that's the only one we need to replicate.
Because nothing outside of the database is required to track changes, the tracking continues even when Bucardo 1 is not running. As long as the delta table exists and can be written to, and the tracking rules and triggers are in place on the tracked table, the changes will be recorded.
Bucardo 1 achieves atomicity and isolation of the replication transaction with this process:
- Open a connection to the first database, set transaction isolation to serializable, and disable triggers and rules.
- Open a connection to the second database, set transaction isolation to serializable, and disable triggers and rules.
- For each table to be synchronized in this group:
- Verify that the table's column names and order match in the two databases.
- Walk through the delta table on the first database, making identical changes to the second database. Empty the delta table when done.
- Walk through the delta table on the second database, making identical changes to the first database. Empty the delta table when done.
- Make a note of any changes that were made to the same rows on both databases ("conflicts"). By default, we resolve the conflicts silently by allowing the designated "winner" database's change be the one that remains. For certain tables such as "inventory", appropriate table-specific conflict resolution code was added that merged the changes instead of designating a winner and loser version of the row.
- Once all changes have succeeded, commit transactions on both databases.
This last step of the process does not satisfy the ACID durability requirement. Since Bucardo 1 was designed on PostgreSQL 7.2, with no 2-phase commit possible, there is a chance that one database will fail to commit its transaction after the other database already did, and the changes will be lost on one side only. This has never happened in practice, mostly due to the fact that committing a transaction in PostgreSQL is a nearly instantaneous operation, since the data is already in place and no separate rollback or log tables need to be modified. But it is certainly possible that it could happen, and it is an undesirable risk. With real 2-phase commit now available in PostgreSQL, complete durability could be achieved.
All of a sudden, the changes on each side are now available to the other side, all at once. Only entire orders are visible, never partial orders.
ACID consistency is achieved by assuming that due to PostgreSQL's integrity checks on the source database, the data was already consistent there, and it is copied verbatim to the destination database where it will still be consistent. Thus, CHECK constraints, referential integrity constraints, etc. are expected to be identical between the two databases. Bucardo 1 does not propagate database schema changes.
Thus the main principles to provide fairly reliable replication are:
- All related tables must be synchronized within the same transaction.
- Synchronization must always be done in both directions in the same transaction, so that the code can detect simultaneous change conflicts.
- The most recent change to a given row must of course be the last change, so changes should be replayed in order. (We optimize this by not copying over row changes that we know will be deleted later in the same transaction.)
Things to consider with multi-master replication:
- Conflicts are less likely the more often the synchronization is performed. But conflicts can still happen, and must be resolved somehow. Creating a generic conflict resolution mechanism is difficult, but declaring a "winning" database is easy and special conflict resolution logic can be added for certain tables where lost changes would be troublesome.
-
Very large change sets can take a long time to synchronize. For example, consider an unintentionally large update like this:
UPDATE inventory SET quantity = quantity + 5
That may change hundreds of thousands of rows, all in a single transaction. Our replication system need to make all those changes in a single transaction to the other database, but it must do so over a comparatively slow Internet connection. As transactions run longer, they often encounter locks from other concurrent database activity, and rollback. Then the process must start over, but now there are even more changes to copy over, so it takes even longer. In the worst situations, the synchronization simply cannot complete until other concurrent database activity is temporarily stopped, so that no locks will conflict. And that means downtime of applications, and manual intervention of the system administrator.
Perhaps you could ship over all the data to the other database server ahead of time, then begin transactions on both databases and make the changes based on the local copy of the data, and expect the changes to be accepted more quickly since the network is no longer a bottleneck. But the destination database won't have been idle during that copying, which needs to be accounted for.
Statement replication does not have this same weakness, but it has many weaknesses of its own.
- Sequences need to be set up to operate independently without collisions on the two servers in a peer sync. Two easy ways to do this are:
- Set up sequences to cover separate ranges on each server. For example, MAXVALUE 999999 on the first server, and MINVALUE 1000000 on the second server. Make sure to spread the ranges far enough apart that they'll never likely collide.
- Set up sequences to supply odd numbers on one server, and even on the other. For example, START 1 INCREMENT 2 on the first server, and START 2 INCREMENT 2 on the second server.
- A primary key is required. Currently, it must be a single column, and must be the first column in the table.
- Because each table's primary key may be of a different datatype, and to keep queries on delta tables as simple as possible, Bucardo 1 uses a separate delta table for each table being tracked.
- A more pluggable system for adding table-specific collision handling would be nice.
- The delta table column "delta_action" isn't actually necessary -- inserts and updates are already handled identically, and deletes can be inferred from the join on the tracked table. The "delta_action" is perhaps a nice bit of diagnostic information, and not burdensome as a CHAR(1), but otherwise could be removed.
- It's important that the delta table's "last_modified" column be based on wallclock time, not transaction start time, because we only keep the most recent change, and if all changes within a transaction are tagged by transaction start time, we'd end up with an arbitrary row as the "most recent" one, resulting in inconsistent data between the databases.
Pushdelta
The pushdelta operation uses the same kind of delta tables and associated triggers and rules that the peer sync uses, but is a one-way push of the changed rows from master to slave. It is useful for large tables that don't have a high percentage of changed rows.
The pushdelta operation currently only supports a single target database. The ability to use pushdelta from a master to multiple slaves would be useful.
Push
The push operation very simply copies entire tables from the master to one or more slaves, for each table in a group. It requires no delta tables, triggers, or rules.
Table pushes can optionally supply a query that will be used instead of a bare "SELECT *" on the source table. Any query is allowed that will result in matching columns for the target table. We've used this to push out only in-stock inventory, rather than the whole inventory table, for example.
No primary key is required on tables that are pushed out in full.
The push operation uses DELETE to empty the target table. It would be good to optionally specify that TRUNCATE be used instead, and to take advantage of the PostgreSQL 8.1 multi-table truncate feature on tables with foreign key references.
Compare
The compare operation compares every row of the tables in its group, and displays any differences. It is a read-only operation. It can be used to make sure that tables to be used in multi-master replication start out identical, and later, to verify correct functioning of peer, pushdelta, and push operations.
The compare operation is fairly slow. It reads in all primary keys from both tables first, then fetches each row in turn. It could be made much more efficient.
Options
Optionally, tables can be vacuumed and/or analyzed after each operation.
In earlier versions of Bucardo 1, there was also an option to drop and rebuild all indexes automatically, to reduce index bloat, but beginning with PostgreSQL 7.3, primary key indexes could not be dropped when foreign keys required them, and the index bloat problem was dramatically reduced in PostgreSQL 7.4, mostly eliminating the need for the feature.
Limitations
Some of these are limitations that could easily be lifted, but no need had arisen. Some are minor annoyances, and others are major feature requests.
- For peer, pushdelta, and compare operations, a primary key is required. There are currently limitations on that key:
- Only single-part primary keys are supported.
- The primary key is assumed to be the first column. It would be easy to allow specifying another column as the primary key, or to interrogate the database schema directly to determine the key column, but we've never needed it.
- If an operation of one type is already underway, other operations of the same type will be rejected. It would be much more convenient for the users to add the newly requested operation to a queue and perform it when the current operation has finished.
- The program stands alone, performing a single operation and exiting. It was designed to run from cron. A persistent daemon that accepts requests in a queue or by message passing could better handle the many operations needed on a busy server.
- The program could use PostgreSQL's LISTEN and NOTIFY feature to learn of changes in a table and run a peer sync based on that notification, instead of being run on a timed schedule or on demand.
- Delta tables and triggers must be created or removed manually, though our helper script makes that fairly easy. It would be nice to have Bucardo automatically create delta tables and triggers as needed, or remove them when no longer needed (so that the overhead of tracking changes isn't incurred).
- Delta tables clutter the schema of the tables they are connected to. PostgreSQL didn't yet have the schema (namespace) feature when Bucardo 1 was created, but it would be nice to centralize the delta tables and functions in a separate schema.
- The datatypes of the fields in tables being replicated are not compared; only the names and order are compared.
- The configuration file syntax is fairly unpleasant.
- Only tables can be synchronized. It would be good to add support for views, sequences, and functions as first-class objects that could be pushed from master to slave or synchronized between two masters.
- It would be more convenient, and could reduce the chance of trouble due to misconfiguration, if Bucardo would interrogate the database to learn of all foreign key relationships between tables so that it could automatically create groups of tables that need to be processed together. Trigger functions and rules can cause changes to one table's row to modify rows in other table(s), in an opaque way that is resistant to introspection, but Bucardo could offer a location for users to declare what other tables a function can affect, and use that in building its dependency tree.
- There is no unit test suite.
- The insert trigger and update_last_modified function are written in PL/pgSQL, and are the only dependency on PL/pgSQL. They are both simple functions and should work fine as plain SQL functions, but it seems like there was a reason I had to use PL/pgSQL -- I just can't remember why anymore.
- In Bucardo 1, permission to insert to the various delta tables must be granted to any user that would change the base tables, or changes will be prevented by PostgreSQL. For a database with many users of varying access levels, this is a pain. It would be better to define the function to run as SECURITY DEFINER, and create the function as the superuser. Then no explicit permission would need to be granted on any delta table, and the delta tables would be inaccessible except through the Bucardo 1 API (except to the superuser). That would necessitate a change to using functions for updates and deletes, which currently are tracked by rules.
Future
Bucardo 1 performed admirably for Backcountry.com for over 4 years. The most serious problems, already mentioned above, have been the lack of a queue for push and pushdelta requests, limitations of running one-off processes from cron, limited row collision resolution, and bogging under a large insert or update that happens inside a single transaction.
Greg Sabino Mullane then created Bucardo 2, which is a rearchitected system built around all new code. It has all the important features of Bucardo 1, addressed most of Bucardo 1's deficiencies, and added many of the desired features listed above. We hope to publish some design notes about Bucardo 2 in the near future.
The Name
I originally gave Bucardo 1 the fairly descriptive but uninspiring name "sync-tables". Greg Sabino Mullane came up with the name Bucardo, a reference to the logo of this program's patron, Backcountry.com. You can read about attempts to clone the extinct bucardo in the Wikipedia articles Bucardo and Cloning.
Two separate posts taken from two separate mailing lists I'm on have gotten me thinking about PostgreSQL data types and operator classes today. The first spoke of a table where the poster had noticed that there was no entry in the pg_stats table for a particular column using the point data type. The second talks about Bucardo failing when trying to select DISTINCT values from a polygon type column. I'll only talk about the first, here, but both of these behaviors stem from the fact that the data types in question lack a few things more common types always have.
The first stems from the point type's lack of a default b-tree operator class and lack of an explicitly-declared analyze function. What are those, you ask? In the pg_type table, the column typanalyze contains the OID of a function that will analyze the data type in question, so when you call ANALYZE on a table containing that data type, that function will be run. In a default installation of PostgreSQL, all rows contain 0 in this column, meaning use the default analyze function.
This default analyze function tries, among other things, to build a histogram of the data in the column. Histograms depend on the values in a table having a defined one-dimensional ordering (e.g. X <> Y, like numbers on a number line or words in alphabetical order). Now it gets a bit more complex. Index access methods define "strategies", which are numbers that correspond to the function of a particular index. Per this page, the b-tree access method defines the following:
| Operation | Strategy Number |
|---|---|
| less than | 1 |
| less than or equal | 2 |
| equal | 3 |
| greater than or equal | 4 |
| greater than | 5 |
To build a histogram we might use strategies 1, 3, and 5, to determine whether two given values are equal, or which is greater. So having found that there's an appropriate operator class for this data type, the analyze function would finally look in the pg_amop table to get the operators it needs to build its histogram. pg_amop matches these strategy numbers with actual function OIDs to find the functions it should actually call.
This whole line of thought stemmed from the point data type not having these functions. B-tree indexes try to sort their data in some order, as determined by the functions talked about above. But point types don't have an obvious one-dimensional ordering, so the b-tree index isn't really appropriate for them. So there's no b-tree operator class, and thus no statistics from columns of point type.
All that said, if you can think of a nice set of statistics ANALYZE might get from point data that would be useful for later query planning, you might implement a custom analyze function to fill the pg_stats table, and selectivity estimation functions to consume the data you generate, to make queries on point data that much better...
UPDATE: Those interested in the guts of a type-specific analyze function might take a look at ts_typanalyze, which is in 8.4. Note that on its own, the typanalyze function doesn't do any good -- it needs selectivity functions, defined in this file, which also were committed in 8.4. Both patches courtesy of Jan Urbanski, and various reviewers.
A few days ago a patch was committed to improve PostgreSQL's performance when hash joining tables too large to fit into memory. I found this particularly interesting, as I was a minor participant in the patch review.
A hash join is a way of joining two tables where the database partitions each table, starting with the smaller one, using a hash algorithm on the values in the join columns. It then goes through each partition in turn, joining the rows from the first table with those from the second that fell in the same partition.
Things get more interesting when the set of partitions from the first table is too big to fit into memory. As the database partitions a table, if it runs out of memory it has to flush one or more partitions to disk. Then when it's done partitioning everything, it reads each partition back from the disk and joins the rows inside it. That's where the "Multi-Batch" in the title of those post comes in -- each partition is a batch. The database chooses the smaller of the two tables to partition first to help guard against having to flush to disk, but it still needs to use the disk for sufficiently large tables.
In practice, there's one important optimization: after partitioning the first table, even if some partitions are flushed to disk, the database can keep some of the partitions in memory. It then partitions the second table, and if a row in that second table falls into a partition that's already in memory, the database can join it and then forget about it. It doesn't need to read in anything else from disk, or hang on to the row for later use. But if it can't immediately join the row with a partition already in memory, the database has to write that row to disk with the rest of the partition it belongs to. It will read that partition back later and join the rows inside. So when the partitions of the first table get too big to fit into memory, there are performance gains to be had if it intelligently chooses which partitions go to disk. Specifically, it should keep in memory those partitions that are more likely to join with something in the second table.
How, you ask, can the database know which partitions those are? Because it has statistics describing the distribution of data in every column of every table: the histogram. Assume it wants to join tables A and B, as in "SELECT * FROM A JOIN B USING (id)". If B.id is significantly skewed -- that is, if some values show up noticeably more frequently than others -- PostgreSQL can tell by looking its statistics for that column, assuming we have an adequately large statistics_target on the column and have analyzed the table appropriately. Using the statistics, PostgreSQL can determine approximately what percentage of the rows in B have a particular value in the "id" column. So when deciding to flush a partition to disk while partitioning table A, PostgreSQL now knows enough to hang on those partitions containing values that show up most often in B.id, resulting in a noticeable speed improvement in common cases.
