Wednesday, 20 October 2010

Extensions in 9.1+

One of the most exciting developments in 9.1 will be the new CREATE EXTENSION feature that Dimitri Fontaine has been designing. Dimitri has been working on them for some time now, proposing them in detail at the 2010 Developer Meeting. Recently he's been coding them and the first patch arrived a few days ago. Reviews should happen in November, more than enough time to happen for the next release.

What do extensions do? Make it easier to add and remove plugins, datatypes, functions etc with minimum fuss and without needing to compile things yourself. It turns out that there's a great many aspects to this and much more complex than I've made it sound.

Dimitri has himself designed a number of very cool add-ons for PostgreSQL and its from those that he's gained insight into what's required. I've been trying to follow it myself, but I'm a bit lost on some aspects of it just yet.

Anyway, looks like there will be a few websites and repositories from which you can download extensions, so the hope is that it will be much easier to both obtain and install add-in components in future versions of PostgreSQL.

We're especially lucky that funding has been made available from the 4CaaST project, a Cloud Computing project funded by the European Union (under FP7). More on that another time.

Tuesday, 5 October 2010

Cleaning your archive

Whether you're thinking of using PostgreSQL 9.0 yet or not, you maybe interested in a small new utility that is designed to help you clean out old WAL archive files.

pg_archivecleanup is a standalone program which simply removes all WAL files older than a certain filename. You could use it like this:

pg_archivecleanup /my/archive 000000010000DEAD0000BEEF

The good thing here is that the tool works just fine for any release of PostgreSQL, not just 9.0. Mostly useful for bash or Windows scripts, since people will no doubt confirm it's trivial to do this in Python or perl etc..

The utility was designed as a dual purpose tool, since with 9.0 pg_standby is no longer required, yet you still need some way to clear down an archive. So you can also include it in the archive_cleanup_command, like so

archive_cleanup_command = 'pg_archivecleanup /my/archive %r'

which gets called regularly with changing values of %r to clean out your archive directories.

Not big, but its useful.

Friday, 1 October 2010

PostgreSQL Admin Book: Last Edit

I've been working on a PostgreSQL Administration book for some time now and am very happy to report that I'm done with the last edit on the last chapter. Hannu finished his a few days earlier.


The full title of the book is "PostgreSQL 9 Administration Cookbook" and has been written by Hannu Krosing and myself. We've written it together over the course of 9 months or so. I guess it's no surprise to some, but these things always take more time than you expect.

The book is in the form of short, punchy recipes that tell you what you need to know on a topic quickly with examples, then we describe how it works later. Some of the recipes are really basic to make sure we address the frequent questions, but around half are advanced topics. It's certainly taken some time to express both the basic and the advanced topics succinctly.

An advanced database system needs advanced books to show it off in the best light. I very much hope it is up to the high standards of the PostgreSQL community and moves us all forwards.

I'll write some more about it later. All I can really express now is relief!

You can pre-order the book here:

Thursday, 24 June 2010

Hot Bugs & Cold Beer

2ndQuadrant is sponsoring a number of Bugs & Beer events at PostgreSQL user groups around the world.

The idea is to get together with your friends and see if we can shake out any more bugs in Hot Standby prior to release.

Particular focus on these areas:
* transactions holding locks before/during/after base backup taken
* prepared transactions
* correctness of results
* VACUUMs on master
* any other ways you can find to break it

Please write to me at if you'd like funding for a local event.

All the best,

Thursday, 17 June 2010

What's a VLF?

I read a good blog called

The funniest thing about it for me was the artful blogger talked about having 87,000 VLFs, taking up 1.5TB, though didn't mention what a VLF was.

Anyway, I checked - VLFs are WAL files in SQLServer. In case, like me, you didn't know, VLF stands for Virtual Log File. So it's a virtual physical thing. Got it. :-\

Which means the blogged-about related to 1.5TB worth of WAL files in the equivalent of pg_xlog. Ouch!

Ah! That reminds me: on both SQLServer and DB2 the log contains undo information, which means that the transaction log can't reuse files while transactions are still active. So long running write transactions causes the log to grow in size and can eventually fill the disk. This can be a real pain for DBAs and requires workarounds. That effect doesn't occur with PostgreSQL, so pg_xlog never overflows for that reason. With PostgreSQL, undo information is effectively stored in the database itself in the form of previous row versions. So a database grows as changes occur. Long running write statements can have a similar effect in PostgreSQL, but not long running transactions. So, one more problem that us Postgres-people get to miss out on.

And just to complete the circle, I will avoid explaining what a WAL file is here.

Smoothing replication

At first, streaming replication looked pretty raw though I've been doing a thorough review of the code. I guess that's probably how Hot Standby looked at first as well. Now, I'm getting quite excited at the way things are shaping up.

Originally the WALSender process waited for 200ms after sending each chunk of data. That's now been fixed so that WALSender will stream continually until no outstanding WAL data remains.

Also, the max chunk size has been reduced to 128kB, which is now the same default size used by DRBD.

On Monday, I noticed that the WALSender was sending WAL data before it had been fsynced. So in case of a crash that could cause a problem. That only really matters if you're using synchronous_commit = off or fsync = off. (Though fsync = off and replication just have no business being configured simultaneously).

Lots of smaller changes have also come through in recent months, so now we have the ability to get log messages when replication connects or disconnects. We also now have special messages if the replication connection is refused by the primary via the pg_hba.conf.

From here, it's looking like streaming replication is fast, efficient and low latency. It's also easy to use and easy to troubleshoot. It's an impressive community effort.

I'm also happy to say that almost all of the features of pg_standby have been assimilated into core now, so the only external module you'll need is something called pg_archivecleanup which is new in this release.

Next few months we're going to see designs emerge for synchronous replication. That is going to require some subtle coding to get good. It sounds like a couple of people are planning prototypes of how that should work.

Monday, 17 May 2010

Bollywood Features

PostgreSQL 9.0 is in the can and is coming to the end of post-production. Still looking forward to the usual last minute special effects by ILM (Industrial Lane & Magic) to pull some extra zing into performance.

So what next? Well, three genres attract my attention: High performance ("OLTP"), Replication and Data Warehousing. The first two are action features, while the last one is more of a romance for me. Some of the stunts for the first two are similar, so we'll be looking to reduce production costs with some clever thinking.

No blockbusters this year, though hopefully some worthwhile features.

High performance stuff we hope for in 9.1 will be

* Group commit
* MERGE related stuff - production costs stalled the first attempt to bring this to the big screen, though some Euro sponsorship should make this viable. Hoping that Greg Smith will take the lead role on this cos its a big performance critical project, with some hard bits.
* Some surprise features!


* Synchronous replication
* Relay replication
* Fast switchover - an art house flick with a small, yet good following

Data Warehousing features for 9.1 will be

* Bit map indexes - has had its script rewritten a few times and the actors have changed as well, though it really needs to happen this time around cos its just so cool.
I also expect to keep my eye on Partitioning features to ensure it actually works for the use cases we care about, which is Big Data.

Dancing is a key element of Bollywood and the best bits are when everyone gets involved in huge Busby Berkeley numbers. Expecting the dancing to be particularly intense for Synchronous Replication though the dancers are also fairly well trained, so we're hoping for a happy ending with a cool soundtrack.

Unsure if its a promise I can keep, but no more features with the word "Hot" in the title, especially since the dance routines are so long.

Not going to Cannes this year, but I will be appearing at the Brits:

Tuesday, 27 April 2010

Tuning Hot Standby

Focus for the last month has been on tuning the new PostgreSQL 9.0 features. The latest set of changes have been around Hot Standby.

Erik Rijkers published some good performance test results, very helpful, thanks. Mark Kirkwood also published some insightful results, also thanks.

After the latest set of patches the results show that we get equivalent performance between master and standby under heavy load of very fast indexes SELECTs. The peak reported transaction rate was 28,000 tps (transactions per second), showing that the internal mechanics of queries during Hot Standby are very efficient. This is great news and has taken much work from myself and the usual suspects.

What we have now is a very tight set of code that should provide good performance and scalability from all angles. Changes from the master are applied using minimal locking, so that a high rate of user queries should be possible even under a mixed workload of changes from master and many standby queries. What we don't yet have good independent performance tests to confirm that, or do further tuning.

Of course, there's no magic there. If we give the standby the same workload of changes that the master had and then we add a whole new workload we can probably bring a server to its knees. That's just CPU physics, so nothing surprising there. What we've done is relax the internal locking to allow multiple concurrent workloads to hit the server from both directions, and also allow PostgreSQL to scale well to high numbers of CPUs.

Anyway, bold words. I'm happy to hear from people that have got a good test environment, publish controlled tests and can answer detailed questions to help us isolate any strangeness in the tests. Particularly keen to have Dtrace installed, so either Solaris or SystemTap. pgbench is a good benchmarking tool for what we need, though other controlled testing is also welcome.

We need independent measurements of CPU overheads and processing rates of
* standby when no queries running, just replaying transactions
* standby when we have a mix of changes and queries
especially on multi-socket servers. It would also be useful to have comparisons of how moving from a single master to master plus slave can improve total cluster throughput.

Summary is that things are looking good so far. Is that the end? Doubt it. What I'm saying is that these new features can now be taken seriously enough to spend time on in-depth benchmarking of your workloads to prove them for yourselves and also to see if you can find some performance issues in the current code. Plan it now, but wait for beta, which is soon...

If you want to hear more about these types of issues, come to CHAR(10) conference:

Thursday, 1 April 2010

The Empty Set?

April 1st, so time to goof around a little.

I discovered recently that you can have a table called " ". That's one space character or ascii #32. Just make sure you enclose it in quotes. You can also have a table with no columns. So try


which works fine. What does that look like?

postgres=# select * from " ";
(0 rows)

Can you have rows in this almost non-existent table?


fails, but let's try


Works! And in fact you can insert multiple rows this way.

How about an index? Surely not?? Yep, we can build an index on it, even if it has no columns, as long as it references an immutable function with constant input

CREATE INDEX ON " " (exp(1));

The name of the index is " _exp_idx"! And for my coup de grace, yes, it can be a unique index


which then enforces that there can only be a single row in our table (called " "). I've tried and failed to create an index that would enforce that our empty table called " " with no columns should have zero rows. Any takers?

So how about a Schema? Surely not? Yep.


and a table called " " in the " " schema? Oh yes.

CREATE TABLE " "." " ();

I also tried VACUUM VERBOSE " "." ";
INFO: vacuuming " . "
INFO: " ": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.

Nothing practical comes out of that at all, just some fun. Needless to say, we only cover practical and useful things on 2ndQuadrant's courses...serious stuff.

Friday, 26 March 2010

Tuning Replication

Performance of streaming replication had been bothering me for some time. I'd noticed that the CPU utilisation was extremely high, so that the WALSender process was guzzling CPU even when nothing was happening.

When we think we have a clue about something and we were right we call that "intuition" and it sounds all mystical and clever. When we are wrong about something we call those "preconceptions". Strange difference of wording. Anyway, I had an idea that turned out to be a preconception. I started by looking for a smoking gun: one part of the system that would turn out to be using more CPU than another. There was no smoking gun, so that trail was a deadend. So next idea was just to look at the bgwriter code to see if there were any clear differences.

Anyway, the real culprit was something trivial. WALSender was supposed to sleep every 200ms by default. What it was actually doing was sleeping for 200 microseconds and then seeing if there was anything to send. So all it took to improve performance was to multiply by 1000 for the call to pg_usleep(). If only tuning databases was ever that simple!

That leaves me to say that the streaming replication now has a clean bill of health from an overall performance perspective. As far as I can see, YMMV, caveat emptor etc..

Sunday, 14 March 2010


I'm told today is Pi day, in celebration of our friendly transcendental constant and because today is the 14th March, which is written as 3.14 in some locales....

So I thought I'd celebrate with a look at Postgres' mathematical musculature.

postgres=# select pi();

Cool! That matches on all the digits, so we're rocking. So what datatype is this?

postgres=# select pg_typeof(pi());
double precision

and as the manual says, we support at least 15 digits for this datatype. Even better. So let's flex those long dormant trigonometry muscles:

postgres=# select sin(90);

Oh no! Surely sin() of 90 degrees is 1.0? Perhaps the default is radians and I just forgot. Of course, doh! The manual doesn't actually say, which seems like a flaw. Here comes a doc patch. So lets supply radians instead.

postgres=# select sin(radians(90));

Phew! That works. Now lets try some advanced stuff. There are 2*pi() radians in a circle, so pi() radians is half way round. So the sin() of that should be 0, and the cos() should be 1.

postgres=# select sin(pi()), cos(pi());
sin | cos
1.22464679914735e-16 | -1

Hmmm. That is somewhat strange, but I guess that pi is transcendental so any representation with a fixed number of digits will always be slightly wrong by exactly that number of digits. That makes sense, but I guess I was expecting sin(pi()) to return 0.

Just to put this in perspective, in comparison to the diameter of the earth that is a precision of about one millionth of a millimetre. So that is very accurate for most applications.

Let's quickly check "e", another well known transcendental. We don't provide a function for e as we do for pi, but its easy to calculate.

postgres=# select exp(1);

That looks good. So lets play some games with that also.

postgres=# select ln(exp(1));

Thank goodness for that. All good. I notice in the docs that exp() takes either a double precision or a numeric input, so which one did we just try there? Let's see

postgres=# select ln(exp(1::float));

and lets also try

postgres=# select ln(exp(1)::numeric);

Hmmm. Looks like Postgres is being irrational. Or should that be insufficiently irrational? It's been a long day. I can't wait for "sqrt(2) day".

Thursday, 4 February 2010

Parallel Query (1)

I recently returned from a lunch meeting of the UK ex-Teradatans to see old friends and colleagues. Some people know that I spent time with Teradata when it was in startup mode, what seems like a very long time ago now. Anyway, that's left me with good knowledge and interest in parallel database systems. And that's why I know Greenplum's Chief Architect Chuck McDevitt and hence why I've been using Greenplum on and off since 2005. Greenplum have also funded some of the developments I've done for PostgreSQL.

I'm disappointed we've not made much progress with parallel operations and partitioning in core Postgres in last few releases. Recent Greenplum results show we have much work to do in improving things.
Some people may think I should be sad at that, though the way I see it, Greenplum is very close to being PostgreSQL. It just happens to have some good performance enhancements of great use in Data Warehousing. A few other enhanced versions of Postgres exist also.

Some other recent results also show that MonetDB and Infobright don't fare any better by comparison either.

Having seen the above results I'm thinking about projects for next release now. Anybody want to fund some additional Data Warehousing features in Postgres core? I'm determined that next release we will get Bitmap Indexes in core, at least.

There's some more to discuss on parallel query, such as "How does this all relate to Hot Standby?", so I'll follow up later with another blog.

Thursday, 28 January 2010

9.0 Replication Features

We're getting close to deadline now for the next release of PostgreSQL.

Tensions running high as usual.

My concerns at this stage are always that we get some rounded features in before the doors close. It's always a shame to ship software exactly on time if that misses out on a few essential extras. Mind you without a hard deadline things do tend to slip for weeks and months on any project, not just PostgreSQL.

We're all working hard on the various replication features while we still can.

Greg Smith's been doing a great job improving pg_standby for use in file-based WAL shipping. I'm working on conflict resolution and other essential internals for Hot Standby. Hannu Krosing has been looking at improvements in Londiste replication, though that's not strictly tied to the release of PostgreSQL core.

We'll get there.

Monday, 18 January 2010

Mating Elephants

I've just discovered that we'll not be including Synchronous Replication in the next version of PostgreSQL, which is a surprise after so much work. (We will have streaming replication, however).

Remembering that the PostgreSQL symbol is an elephant (apart from in Japan), there is an old office joke that seems strangely appropriate. It goes like this:

Getting anything done around here is like mating elephants
* Everything gets decided at a high level
* There's a lot of stomping and trumpeting
* And it takes two years to get any results

Anyway, I'm sure we'll get there in the end so we can get multiple elephants synchronised.

Monday, 11 January 2010

Hybrid Row/Column Datastores

One of the things I wanted to talk about in more detail at PGday Europe in November was hybrid row/column stores for PostgreSQL. I've just read some more stuff about column stores and that's made me remember this again.

I'd read lots of hot air about how databases organised as column stores are much better than row stores and quite frankly I was getting a little tired of it. Sometime in 2007, I came up with a great solution but it really is taking quite a while to get going with this. Let's start with the background.

The column store argument is that if you separate out each column in a table into its own slice of data then this can be stored as bitmaps. These achieve very good compression ratios for each column. To answer a query you just link together all the individual columns required for the query and you've finished. So query cost is O(n) on the number of (compressed) columns in the query, rather than a row store for which data access is O(m) on the number of columns in the table. Column stores are great at doing things like count(*) but not very fast at retrieving data. So there are lots of queries that would show how good column stores can be, plus lots of examples of really bad queries as well. It all really depends upon what your exact query mix will be.

My main issue with column stores is how will you know ahead of time what your data warehouse query mix will be? If you are forced to pick one database product or another before you've even run a query then it seems a very risky decision. Would you base your data warehousing strategy on that? What we really want is a hybrid row/column datastore where you can have some data in row form and some data in column form.

Nobody seemed to have noticed that the column store data is exactly the same format as bitmap indexes on a single column. So, if we had a bitmap index on every column in a table we would be able to use exactly the same algorithm to extract data as column store databases do. The only implementation barriers would be that we don't yet have index-only scans, nor do we have the ability to have columns that are stored *only* in the index. This latter capability would mean that the heap row might actually hold only the visibility information and all other data would be held in indexes. I admit it sounds crazy, but it is exactly what column stores already do to extract data from the individual columns; we can do that also, it would be just an additional step in the executor to retrieve columns and assemble a result row.

Now the beauty of this approach is that you don't have to make choices between using a row and a column store database. You can choose, column-by-column if you wish, which columns will be "column-oriented" and which columns are "row-oriented", or you can have both.

So the things we need to make row/column hybrids work in PostgreSQL are

* Bitmap indexes
* Index-only scans
* Index-only columns

So core Postgres is at least one release away from that.

There are a couple of other thoughts alongside that as well. First is "join removal", now in 8.5, which is an optimizer technique for removing tables that we don't actually need to answer queries. I was very interested in that because it allows us to speed up queries by providing a mechanism for vertically partitioned tables - splitting a table into many tables each with a subset of columns. If you slice them up correctly then you avoid significant I/O when accessing tables when you only want a few of the columns. It was the first step in the right direction and I pursued it initially because it required only optimizer changes to make it work. I'll explain more about that on another post.

Last thought is about Greenplum and of course the new Single Node Edition. The latest version now has column-oriented storage as well as row oriented storage. So you have the option of choosing the design that will work best for your needs. It's a step towards where I'd like to see us go, plus GP is at least a year ahead of core Postgres in getting some form of column orientation. Initial indications are that it is faster also, though I've yet to do any objective performance tests. Been busy.

Tuesday, 5 January 2010

Extra Cost Option

Oracle's version of the Hot Standby concept is called Active Data Guard. You may know that Oracle11g was the first release of Oracle to have this functionality, so as of PostgreSQL 8.5, we will be on par. Probably worth adding that IBM DB2 doesn't have anything similar.

Someone just pointed out to me that Active Data Guard is an extra cost option with Oracle. Please point this out to your IT Manager!

Oracle suggests the following benefits for Active Data Guard
  • Increase performance—Offload unpredictable workloads to an up-to-date replica of the production database
  • Simplify operations—Eliminate management complexity that accompanies traditional replication solutions
  • Eliminate compromise—The reporting replica is up to date and online at all times, which is not possible with traditional storage mirroring technology
  • Reduce cost—An Oracle Active Data Guard physical standby database can also provide disaster recovery and/or serve as a test database—no additional storage or servers required
I think the first one definitely applies for PostgreSQL. The second and third depend upon what you're comparing against. I don't really understand the last claim: yes, the standby provides disaster recovery, but then so did Inactive(?) Data Guard/Warm Standby. And this idea about test databases seems a little strange, since the standby is read-only its not going to be a very useful test database.