In the database, in order to improve query efficiency, when dealing with large amounts of data, indexes can be created to accelerate database query performance. This article will discuss PostgreSQL indexes. Although in most cases, using the CREATE INDEX
syntax can handle common development scenarios, PostgreSQL actually provides many different types of indexes.
Introduction
Recently, because of my small project, I wanted to implement features like full-text search and location search. I found that PostgreSQL is really convenient, with full-text search, powerful GIS support (with PostGIS), trigger functions, useful window functions, hstore, and more. Therefore, while implementing the features, I also documented them. However, since it is just a personal project, I find it amusing to create indexes for small data sets, so there is no rigorous benchmark here.
Table of Contents
- Basic principles of indexes
- How to create indexes in PostgreSQL
- Considerations when using indexes
- How to maintain indexes
- Index types in PostgreSQL
- B Tree
- GIN
- GiST
- BRIN
- Bloom
Basic Principles of Indexes
The most important aspect of a database is querying data. When data is stored on a hard disk, the performance of I/O operations needs to be considered. In a database, if no indexes are created on a table, sequential scans will be used to query the data. There is no significant performance difference when the data volume is small, and it may even be faster than creating indexes. However, as the data volume increases, sequential scans can potentially cause slow reading.
How Indexes Solve the Issue of Slow Sequential Scans
By creating an additional index table on the hard disk, we can use the index table as a directory. When querying the database, the database will use the given index table to find the corresponding index. Then, it follows the index to access the actual data stored on the hard disk.
B Tree
B Tree is a balanced tree optimized for disk storage. Unlike a regular binary tree, a B Tree can have multiple branches, effectively reducing the depth of the tree. The principles of B Tree require a lengthy explanation, so it will be omitted here.
Indexes in PostgreSQL
Creating Indexes
In PostgreSQL, indexes can be created using SQL syntax.
CREATE INDEX index_name ON table_name (column_name) USING method;
In PostgreSQL, if no algorithm is specified, the default is to use B-Tree
to create the index. Indexes can also be created on a specific column. It is also possible to create indexes on multiple columns at once.
CREATE INDEX index_name ON table_name (col1, col2);
"Index Scan using index_subscribers_on_email on subscribers (cost=0.28..8.29 rows=1 width=76)"
" Index Cond: ((email)::text = 'xxx@yahoo.co.jp'::text)"
"Seq Scan on subscribers (cost=0.00..37.38 rows=1 width=76)"
" Filter: ((email)::text = 'xxx@gamil.co.jp'::text)"
We can use EXPLAIN
to observe the performance of the query. It can be seen that using the B-Tree method successfully reduces the query time.
Note:
- When using
CREATE INDEX
to create an index, the entire table will be locked, which may take a considerable amount of time for large data sets. - To ensure that the table is not locked when creating an index,
CREATE INDEX CONCURRENTLY
can be used. However, to ensure the integrity of the index table,CONCURRENTLY
takes more time to create the index.
When this option is used, PostgreSQL will build the index without taking any locks that prevent concurrent inserts, updates, or deletes on the table; whereas a standard index build locks out writes (but not reads) on the table until it's done. There are several caveats to be aware of when using this option — see Building Indexes Concurrently.
Appropriate Indexing
To explain when the database uses an index, let's create an index on a table with only two records.
CREATE INDEX index_users_on_email ON users (email);
EXPLAIN SELECT email FROM users WHERE email = 'xxx@gmail.com';
"Seq Scan on users (cost=0.00..1.02 rows=1 width=32)"
" Filter: ((email)::text = 'xxx@gmail.com'::text)"
For small data sets, even if an index is created, the database will still use sequential scans for querying. This is because the cost of random I/O is higher than that of sequential scans. Therefore, for fixed and small data sets (e.g., counties, postal codes), creating an index may not necessarily increase query time but instead occupy unnecessary disk space.
Index Size
Creating an index on a data set with more than 1300 records.
CREATE INDEX index_subscribers_on_email on subscribers (email);
SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size;//80K
Using pg_relation_size
to obtain the size of the related relation.
- Deleting an index:
DROP INDEX IF EXISTS index_name
.
To ensure the integrity of the index, PostgreSQL locks the entire table when creating an index and only releases the lock after the index is created. Therefore, for frequently queried or modified tables, it may cause service interruption. For data with a considerable number of records, creating an index may take several minutes or even tens of minutes.
Maintaining Indexes
https://wiki.postgresql.org/wiki/Index_Maintenance
First, delete some data from the database. By default, when data is deleted from a table, it is not actually released from the hard disk but marked as "deleted". For indexes, deleting data does not reduce the size of the index.
DELETE FROM subscribers WHERE id >1350;
SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size; // 80K
At this point, we can recreate the index using REINDEX
. REINDEX INDEX index_name
REINDEX INDEX index_subscibers_on_email;
SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size; // 72K
Rebuilding the index removes unused indexes, reducing the size of the index table.
From the above, we can see that for frequently updated or deleted data tables, regularly rebuilding indexes can reduce the size of the index table. If a large amount of data needs to be deleted, it is also recommended to rebuild the index to maintain the size of the index table; otherwise, it may waste disk space.
However, REINDEX
also locks the table. If it is not possible to allow service maintenance or interruption, a new index can be created directly using CREATE INDEX CONCURRENTLY
. After deleting the old index, the new index can be renamed. This method takes more time than REINDEX
and is more cumbersome, but it ensures that the table is not locked.
Indexing Specific Data
Sometimes, we only need to frequently query specific data conditions. In this case, it is not necessary to create indexes on all data. By using the CREATE INDEX index_name ON table_name WHERE ...
syntax, we can create indexes on the required data.
Index Sorting
By default, when creating an index without specifying the sort order, ASC
is used. However, in some scenarios, we may prefer to use DESC
for sorting, such as ranking or article publishing dates. In these cases, creating an index with DESC
can be more efficient.
CREATE INDEX index_articles_on_published_date ON articles (published_date DESC);
Rebuilding Indexes
B-Tree indexes are only reused when all index pages are cleared. Therefore, deleting a value from the database does not reduce the size of the index table. With the original data set of 1400 records, deleting 50 records does not change the size of the index.
DELETE FROM subscribers WHERE id > 1350;
SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size; // 80K
Recreate the index:
REINDEX INDEX index_subscibers_on_email;
SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size;
72K
By allowing unused indexes to be removed, disk space can be freed up, avoiding unnecessary indexes occupying the entire disk. However, it should be noted that REINDEX
also locks the table. If service cannot be interrupted, it is possible to consider creating a new index directly, deleting the original index, and renaming the new index.
Index Types in PostgreSQL
Note: The following content has not been benchmarked yet.
After explaining PostgreSQL indexes and their usage, let's now discuss the common index types in PostgreSQL. If no specific method is specified when creating an index, PostgreSQL uses B Tree as the default index type.
Btree
B Tree supports the following query operations:
- <
- <=
- =
- >=
- >
GIN (Generalized Inverted Index)
GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items.
GIN is used for full-text search or for indexing composite values, such as arrays. As the name suggests, it is an inverted index that uses values as index values and finds the corresponding indexes based on the values.
For example:
this is cat
this is an apple.
cat meows.
// build inverted index
"this": {(0, 0), (1,0)}
"is": {(0,1), (1,1)}
"an": {(1,2)}
"apple: {(1,3)}
"cat": {(0,2), (2,0)}
"meows": {(2,1)}
Through the index table above, if we want to search for the keyword cat, we can find the corresponding indexes from the value with the key "cat". It can be seen that "cat" appears as the second word in the first sentence and as the first word in the third sentence.
Suitable for:
- Full-text search
- Arrays
Using GIN
in PostgreSQL:
CREATE INDEX index_on_document on sentences using gin(document);
GiST (Generalized Search Tree)
GiST is actually more of a general interface. It can be used to define our own index implementation methods (such as B-Tree, R-Tree, etc.). Generally, it is less commonly used. However, for spatial data structures, using B-Tree may not efficiently accelerate searches for containment, adjacency, and intersection. Therefore, for spatial data structures, such as in PostGIS, GiST is used to create indexes with better query performance.
Suitable for:
- Implementing custom index interfaces
- Spatial structures
- Data structures that cannot be efficiently handled by B-Tree
BRIN (Block Range Indexes)
Unlike B-Tree, which indexes each row, BRIN indexes a range of values. Although the performance is still inferior to B-Tree, the size is much smaller.
If data is frequently queried based on a range, such as log files, a range of transaction orders, or billing data, these data sets usually have a considerable volume. Creating indexes using B-Tree may consume a significant amount of disk space. However, these data sets are often not frequently accessed for precise queries but are processed in batches using range queries. In this case, using BRIN can achieve good performance while saving a considerable amount of space.
Suitable for:
- Log file analysis
- Transaction order analysis
- Large data sets frequently accessed using range queries
Bloom
Bloom Filter is designed to solve the space and query time issues of hash tables. The algorithm itself can quickly determine whether an element appears in a set, but false positives may occur. Therefore, in PostgreSQL, a second check is required.
CREATE INDEX bloom_index_event ON events USING bloom (startTime, endTime, name, gifts)
WITH (length=80, startTime=2, endTime=2, name=4, gifts=2);
The length
parameter represents the length of each signature and defaults to 80, with a maximum of 4096. The following parameters specify how many bits are mapped to each column name. The minimum is 2, and the maximum is 4096.
Suitable for:
- Precise searching of multiple columns, such as
SELECT * FROM events WHERE startTime=20171111 and endTime=20171231 and name=christmas and gifts="gift_special"
. Using a bloom filter can quickly filter out results that are not in the set.
Summary
In general, B-Tree can handle almost all usage scenarios. However, PostgreSQL provides a variety of index types, making it easier to query and index data, and providing developers with greater flexibility. This is one of the main reasons why I really like PostgreSQL.
- PostgreSQL offers several index algorithms for use, each with its own suitable scenarios.
- When using
CREATE INDEX
, the entire table is locked, which may have an impact when creating indexes for large amounts of data.CREATE INDEX CONCURRENTLY
can be used to solve this issue, but it takes more time to create the index. - If data is frequently updated or deleted, it may result in redundant indexes. It is recommended to periodically rebuild indexes to reduce disk space usage. Pay attention to the fact that
REINDEX
locks the table. If service interruption is not allowed, a new index can be created directly, the old index can be deleted, and the new index can be renamed to avoid table locking. CREATE UNIQUE INDEX
can be used to ensure the uniqueness of values.INDEX
can be created with aWHERE
condition to index specific columns based on specific usage scenarios.- Indexes can be created with sorting (default is ASC) to match specific usage scenarios (e.g., article publishing dates).
- Indexes are not a silver bullet. It can be observed that even with indexes, sequential scans are still used for small data sets. This is because the cost of random I/O is much higher than that of sequential scans. Therefore, for fixed and small data sets (e.g., counties, postal codes), creating an index may not necessarily increase query time but instead occupy unnecessary disk space.