PostgreSQL: Partitioning big tables (Part 2)

Alongside various strategies for handling large amounts of data with indexes, PostgreSQL provides another feature: splitting the table with inheritance. This separation of a table’s data into different child tables is called “partitioning” in PostgreSQL.

In part one, we described the pro and cons of this method and its conditions of use. In this second part we describe the steps required to actually configure a partition for production. A third blog post shows the specifics to consider in a JPA Hibernate project. The proposed instructions are intended for use with a PostgreSQL version 9.2 database but may work on all versions from 8.4.

Preparation

Several steps are necessary for successful partitioning. Assuming a master table already exists, the first and simplest step is the creation of child tables. They can be generated by the inheritance function of PostgreSQL with the keyword INHERITS in the CREATE TABLE statement. The tables can be distributed in different schema, in a clustered environment and even on different devices. Attention: The parameter “sql_inheritance” has to be set to “on” (default value)! The criteria to distinguish the child tables are set with the command CHECK(condition). These two steps can be combined while executing the CREATE TABLE statement:

CREATE TABLE child01
 (
 CONSTRAINT child01_pkey PRIMARY KEY (id),
 CONSTRAINT child01_selectionfield _check CHECK
 (selectionfield>= 0 AND selectionfield < 1000)
 )
INHERITS (master);

Check-Conditions

The conditions for the selection mustn’t overlap and have to be unique. Gaps can cause trouble later. For example the combination of the conditions CHECK(ZIP BETWEEN 10000 AND 20000) and CHECK(ZIP BETWEEN 20000 AND 30000) is an error because they are ambiguous: it’s not clear where to put the records with ZIP 20000.

Alongside ranges with from-to-conditions, lists are also allowed. Example: CHECK ( state IN ( ‘Minnesota’, ‘Wisconsin’, ‘Michigan’)) and CHECK ( state IN (‘Oregon’, ‘Washington’, ‘Idaho’))…

Each child table should at least be indexed on the partition parameter value. It is not obligatory, but strongly recommended. Other than that, the primary key and the indexes should be applied as for the origin master table.

You should pay attention when converting values, because CHECK conditions have the same rules as WHERE clauses: if the field value is mapped, for example a String value is compared as an integer (CHECK(ZIP::Integer >10000)), the planner cannot optimize the query. You might be able to use a functional index in this case. The same is true for “upper” and “lower” conversions. A typical error is the comparison of a date value with a string field.

Trigger and functions

Inheritance alone is not sufficient to fill the partitions from the master table. Functions are also required to distribute the data on the tables as defined in the CHECK conditions. These functions are managed by triggers on the master table. The plpgsql module must be installed in order to use the procedural language PL/pgSQL for this.

CREATE OR REPLACE FUNCTION master_to_child_insert_trigger()
RETURNS TRIGGER AS $BODY$
BEGIN
IF (NEW.selectionfield >= 0
AND NEW.selectionfield < 1000) THEN INSERT INTO child01 VALUES (NEW.*); ELSIF (NEW.selectionfield >= 1000
AND NEW.selectionfield < 2000)
THEN
INSERT INTO child02 VALUES (NEW.*);
ELSE
RAISE EXCEPTION 'selection field out of range!';
END IF;
RETURN NULL;
END;
$BODY$
LANGUAGE plpgsql;

The “RETURN NULL” is necessary to avoid accidentally inserting a returned value into the master table. The “RAISE EXCEPTION” for unexpected input handles values which would otherwise disappear unnoticed.

To call the function at the right time, you use a trigger:

CREATE TRIGGER master_trigger
BEFORE INSERT ON master
FOR EACH ROW EXECUTE PROCEDURE master_to_child_insert_trigger();

With this trigger in place, the partition is ready to be used. Instead of triggers you can also use rules such as “CREATE RULE… AS ON INSERT TO… WHERE… DO INSTEAD INSERT INTO… VALUES(NEW.*)“, but these have more overhead than triggers and do not work with copy operations.

A PostgreSQL module exists to actively manage a partition: the extension “pg_partman“. It is useful if the partition is based on time or ID values and continuous rebalancing of the tables is required.

Conclusion

In many cases involving large amounts of data, partitioning offers one possible approach to optimize the database. The installation is however costly and error-prone. The implementation should be performed with care and exhaustive error control. In addition, there are limitations such as the opaque constraint behavior and the requirement for central search criteria, which limit its usability. The adoption of this technique should therefore be preceded by thorough discussion and should be supervised by experienced database experts.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s