If you have any questions or feedback, pleasefill out this form
This post is translated by ChatGPT and originally written in Mandarin, so there may be some inaccuracies or mistakes.
In a database, to improve query efficiency, especially when dealing with large datasets, indexes can be created to accelerate query performance. This article will explore indexes in PostgreSQL. While in most cases, using the CREATE INDEX
syntax is sufficient for common development scenarios, PostgreSQL actually offers a variety of index types.
Introduction
Recently, I wanted to add full-text search and location search functionalities to my side project and found that PostgreSQL is incredibly convenient. It supports full-text search, has robust GIS capabilities (with PostGIS), trigger functions, user-friendly window functions, hstore, and more. Therefore, as I implemented these features, I decided to document my process. However, since this is just a personal project with a small dataset, it seems a bit amusing to create indexes, and I won't be conducting rigorous benchmarks.
Table of Contents
- Basic Principles of Indexing
- How to Create an Index in PostgreSQL
- Considerations When Using Indexes
- How to Maintain Indexes
- Index Types in PostgreSQL
- B-Tree
- GIN
- GiST
- BRIN
- Bloom
Basic Principles of Indexing
The most crucial aspect of a database is querying data. When data is stored on a hard drive, we must consider how to handle I/O performance issues. External hard drives, due to their hardware limitations, take significantly longer to read compared to direct memory access. In a database, if no index is created on a table, it will perform sequential scans to retrieve data. For small datasets, the performance difference may not be apparent, and sequential scans may even be faster than creating an index. However, as the data volume increases, sequential scans can lead to slow read times.
How Indexes Solve the Problem of Slow Sequential Scans
By creating an additional index table on the hard drive, we can use this index table as a directory. When querying, the database will refer to the specified index table to find the corresponding index, which then points to the actual data location on the hard drive.
B-Tree
B-Tree is a balanced tree specifically optimized for hard drives. Unlike regular binary trees, B-Trees can have multiple branches, effectively reducing the tree's depth. The principles of B-Trees require extensive explanation, so I will skip that here.
Indexes in PostgreSQL
Creating an Index
In PostgreSQL
, indexes can be created using SQL syntax.
CREATE INDEX index_name ON table_name (column_name) USING method;
In PostgreSQL, if no specific algorithm is indicated, it defaults to using B-Tree
for index creation. You can create an index on a specific column or on multiple columns at once.
CREATE INDEX index_name ON table_name (col1, col2);
"Index Scan using index_subscribers_on_email on subscribers (cost=0.28..8.29 rows=1 width=76)"
" Index Cond: ((email)::text = 'xxx@yahoo.co.jp'::text)"
"Seq Scan on subscribers (cost=0.00..37.38 rows=1 width=76)"
" Filter: ((email)::text = 'xxx@gmail.co.jp'::text)"
We can use EXPLAIN
to observe the performance of the query. It turns out that using B-Tree successfully reduces the query time.
Note:
- Using
CREATE INDEX
to create an index will lock the entire table, which may take a considerable amount of time with large datasets. - When creating an index, you can add
CREATE INDEX CONCURRENTLY
to ensure the table does not get locked. However, to ensure the integrity of the index table,CONCURRENTLY
will take longer to create the index.
When this option is used, PostgreSQL will build the index without taking any locks that prevent concurrent inserts, updates, or deletes on the table; whereas a standard index build locks out writes (but not reads) on the table until it's done. There are several caveats to be aware of when using this option — see Building Indexes Concurrently.
Appropriately Using Indexes
To illustrate when to use an index, let's create an index on a table with only two entries.
CREATE INDEX index_users_on_email ON users (email);
EXPLAIN SELECT email FROM users WHERE email = 'xxx@gmail.com';
"Seq Scan on users (cost=0.00..1.02 rows=1 width=32)"
" Filter: ((email)::text = 'xxx@gmail.com'::text)"
For small datasets, even with an index created, the database will still opt for a sequential scan.
This occurs because the cost of random I/O is higher than that of sequential scans. Therefore, when the dataset is small, creating an index does not enhance query performance and may instead waste hard disk space.
Index Size
Creating an index
on a dataset with over 1300 entries.
CREATE INDEX index_subscribers_on_email on subscribers (email);
SELECT pg_relation_size('index_subscribers_on_email')/1024 || 'K' AS size; // 80K
Use pg_relation_size
to fetch the size of the related relation.
- To delete an index, use
DROP INDEX IF EXISTS index_name
.
To maintain index integrity, PostgreSQL will apply an exclusive lock on the entire table during index creation and will only release it after the index is built. This can potentially disrupt services for frequently modified tables, with index creation possibly taking several minutes or even longer for large datasets.
Maintaining Indexes
https://wiki.postgresql.org/wiki/Index_Maintenance
First, let's delete some entries in the database. By default, the deletion method in the database does not actually free up space on the hard drive; instead, it marks entries as "deleted." For indexes, deleting data does not reduce the size of the index.
DELETE FROM subscribers WHERE id > 1350;
SELECT pg_relation_size('index_subscribers_on_email')/1024 || 'K' AS size; // 80K
At this point, we can use REINDEX
to rebuild the index. REINDEX INDEX index_name
REINDEX INDEX index_subscribers_on_email;
SELECT pg_relation_size('index_subscribers_on_email')/1024 || 'K' AS size; // 72K
Rebuilding the index will remove unused entries, thus reducing the size of the index table.
From the above, we can understand that for tables with frequent insertions and deletions, regularly rebuilding the index can help reduce its size. If a large volume of data needs to be deleted, it's also advisable to rebuild the index to maintain its size, or else it may lead to wasted disk space.
However, REINDEX
will also lock the table. If service interruptions for maintenance are not allowed, you can create a new index concurrently using CREATE INDEX CONCURRENTLY
, then delete the old index and rename the new one. This method takes longer and can be a bit more cumbersome but ensures that the table remains unlocked.
Indexing Specific Data
Sometimes, we only frequently query data with specific conditions, in which case it may not be necessary to index all data. By using CREATE INDEX index_name ON table_name WHERE ...
, we can create an index on only the needed data.
Index Sorting
When creating an index, if no specific sorting is declared, it will default to using ASC
. However, in certain scenarios, we might often prefer DESC
sorting, such as for rankings or article publication dates. In these cases, using DESC
to create an index can be more efficient.
CREATE INDEX index_articles_on_published_date ON articles (published_date DESC);
Rebuilding Indexes
B-Tree indexes are only reused when the entire index page is cleared, so deleting a value from the database does not reduce the size of the index table. For example, if there were originally 1400 entries and 50 were deleted, the index size remains unchanged.
DELETE FROM subscribers WHERE id > 1350;
SELECT pg_relation_size('index_subscribers_on_email')/1024 || 'K' AS size; // 80K
Rebuilding the index:
REINDEX INDEX index_subscribers_on_email;
SELECT pg_relation_size('index_subscribers_on_email')/1024 || 'K' AS size; // 72K
This allows unused indexes to be removed, freeing up hard disk space and preventing unnecessary index usage from occupying the entire disk. However, it's important to note that REINDEX
will also lock the table. If the service does not allow for maintenance shutdowns, consider creating a new index, deleting the original index, and renaming the new index to avoid locking the table.
Index Types in PostgreSQL
Note: The following content has not been benchmarked.
Having introduced PostgreSQL indexes and their usage, let's now look at common index types in PostgreSQL. If no specific method is indicated when creating an index, PostgreSQL will default to using B-Tree.
B-Tree
B-Tree supports the following query operations:
- <
- <=
- =
-
=
-
GIN (Generalized Inverted Index)
GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items.
GIN is used for full-text searches or for types like arrays. As its name suggests, an inverted index utilizes values as index keys to find corresponding occurrences.
For example:
this is cat
this is an apple.
cat meows.
// build inverted index
"this": {(0, 0), (1,0)}
"is": {(0,1), (1,1)}
"an": {(1,2)}
"apple": {(1,3)}
"cat": {(0,2), (2,0)}
"meows": {(2,1)}
Using the above index table, if we want to search for the keyword cat, we can find the corresponding indexes by looking up the key for cat. We see that cat appears as the second word in the first sentence and the first word in the third sentence.
Applicable Scenarios:
- Full-text search
- Arrays
In PostgreSQL, you can use GIN
as follows:
CREATE INDEX index_on_document on sentences using gin(document);
GiST (Generalized Search Tree)
GiST is more of a generic interface that allows us to define our own index implementation methods (B-Tree, R-Tree, etc.). It's generally less frequently used. However, for spatial data structures, B-Tree may struggle to accelerate searches for containment, adjacency, or intersection. Therefore, PostGIS uses GiST to create better-performing indexes for spatial queries.
Applicable Scenarios:
- Custom index implementation
- Spatial structures
- When B-Tree may not suit the data structure of the use case
BRIN (Block Range Indexes)
Unlike B-Tree, which indexes every row, BRIN indexes ranges. While performance-wise, B-Tree still reigns superior, BRIN is significantly smaller in size.
If data is often queried by range, such as log files, transaction orders, or billing data, these datasets can be immense. Using B-Tree for indexing may consume a considerable amount of disk space. However, since these datasets are typically not used for precise searches but rather for range queries, BRIN can provide decent performance while saving substantial space.
Applicable Scenarios:
- Log file analysis
- Transaction order analysis
- Large datasets with frequent range searches
Bloom
Bloom Filter is designed to address space and query time issues in hash tables. The algorithm can quickly determine whether an element exists in the set but may result in false positives, requiring a secondary check in PostgreSQL.
CREATE INDEX bloom_index_event ON events USING bloom (startTime, endTime, name, gifts)
WITH (length=80, startTime=2, endTime=2, name=4, gifts=2);
The length parameter specifies the length of each signature, with a default of 80 and a maximum of 4096. The subsequent parameters map the corresponding column names to a certain number of bits, with a minimum of 2 and a maximum of 4096.
Applicable Scenarios:
- Precise searches across multiple columns
SELECT * FROM events WHERE startTime=20171111 and endTime=20171231 and name=christmas and gifts="gift_special"
, where a bloom filter can quickly filter out results not in the set.
Conclusion
In general, B-Tree can handle most use cases effectively. However, PostgreSQL offers a rich variety of index types, making data querying and indexing more efficient while providing developers with greater flexibility. This is one of the reasons I appreciate PostgreSQL so much.
- PostgreSQL provides several algorithms for indexing, each suitable for specific scenarios.
- Using
CREATE INDEX
will lock the entire table, which can significantly impact performance when indexing large datasets. This can be mitigated withCREATE INDEX CONCURRENTLY
, although it will take longer to create the index. - Frequent updates and deletions can lead to redundant indexes. Regularly using
REINDEX
to rebuild indexes can minimize disk space usage. Keep in mind thatREINDEX
will lock the table. To avoid locking, consider creating a new index, deleting the old one, and renaming the new index. CREATE UNIQUE INDEX
can ensure the uniqueness of values.INDEX
can be created with aWHERE
condition to target specific columns for particular use cases.INDEX
can also be sorted (default is ASC) to fit specific use cases (e.g., article publication dates).- Indexes are not a silver bullet; for small datasets, you may find that even with an index, the database still performs a sequential scan due to the higher cost of random I/O compared to sequential scans. Thus, for fixed and small datasets (e.g., counties, postal codes), creating an index may not necessarily improve query times and could waste disk space.
If you found this article helpful, please consider buying me a coffee ☕ It'll make my ordinary day shine ✨
☕Buy me a coffee