Alongside various strategies for handling large amounts of data with indexes, PostgreSQL provides another feature: splitting the table with inheritance. This separation of a table’s data into different child tables is called “partitioning” in PostgreSQL. Because it does require a bit of effort to implement, and because there are also limitations in how it can be used, you need to take some points into consideration when creating a partition.
This first part of the article presents the pro and cons and what to consider when using partitioning to improve performance. In the second part we will describe the steps to create a production-ready partition. A third blog post shows the specifics to consider in a JPA Hibernate project.
One sign that you should consider partitioning is when you have poor performance despite effective indexing. The number of records can provide a hint that the poor response time is caused by the table’s size: depending on the structure of the index, the situation can start to become critical at five to eight million rows. As soon as the indexes themselves become too large to be searched quickly, or memory becomes limited, you have to start looking at other solutions. One possibility is to manipulate the STATISTICS parameter (globally with default_statistics_target, for one row with ALTER TABLE … ALTER COLUMN … SET STATISTICS) to reduce memory consumption, but this may reduce the effectiveness of the indexation. In a partitioned table, the index is always built on one partition only, i.e. on just one part of the records.
The most important requirement for successful partitioning is to find an attribute that can be used to define equally distributed blocks of records. The division of data is most effective when no one table is bigger than the others. So using a year or a date range is usually a bad idea, because one child table will likely become much bigger than the others over time. However, one thing that can work is to separate the data by month, so long as the data is distributed regularly throughout the year and it doesn’t look likely that one table will pass the five million records limit.
Furthermore, the partitioned table provides the most efficient search when the partition attribute is also the search criteria. If the partition boundaries are not part of the search, the database must look in all child tables. For an optimal result it’s absolutely necessary to use an attribute that can also be used as a search criterium. So instead of using temporal attributes, a geographic or organizational reference makes more sense. A combination of attributes is also feasible.
In addition to equal distribution, you can also look at how frequently a certain category of data in a table is accessed. Grouping the most frequently queried data together in one partition automatically improves the chance of fast hits, even if this partition is bigger than the others. So it makes sense to pay attention to the context and logical relations of the content when deciding on the parameters for delimiting partitions.
Even a very large amount of data remains manageable if you can separate it into smaller parts. The author had a result from a join over tables with 30 and 8 million records respectively in a fraction of a second. Another advantage particular to PostgreSQL is with bulk import or bulk delete: a partition can be inserted or deleted very quickly as a single entity. And delete doesn’t force a VACUUM in this case!
A partitioned table does lose an important feature: constraints are no longer valid over the whole dataset. In particular, foreign keys which can only be used on the master table do not work anymore. A “channeling” of conditions from the main, empty table in front to the child tables that actually contain the data is not possible. This is also true for indexes on the master table: neither constraints nor indexes will be inherited; they have to be installed on each partition separately. To enable the optimizer to work with partitions, the database parameter constraint_exclusion must not be set to “off”. The default value in postgresql.conf is “partition”.
A special situation occurs if the value of the attribute which is used to define the distribution changes. If an existing record is inserted again, but with a different value for the partition parameter field, there will be no primary key error and the same primary key will be present in the database twice. And if the partition parameter value is updated and gets a value outside the range of the current partition, the record will “disappear” because it will not be automatically moved. It must be deleted and reinserted in order to change partitions.