PostgreSQL Notes — INDEX

Written byKalanKalan
💡

If you have any questions or feedback, pleasefill out this form

This post is translated by ChatGPT and originally written in Mandarin, so there may be some inaccuracies or mistakes.

In a database, to improve query efficiency, especially when dealing with large datasets, indexes can be created to accelerate query performance. This article will explore indexes in PostgreSQL. While in most cases, using the CREATE INDEX syntax is sufficient for common development scenarios, PostgreSQL actually offers a variety of index types.

Introduction

Recently, I wanted to add full-text search and location search functionalities to my side project and found that PostgreSQL is incredibly convenient. It supports full-text search, has robust GIS capabilities (with PostGIS), trigger functions, user-friendly window functions, hstore, and more. Therefore, as I implemented these features, I decided to document my process. However, since this is just a personal project with a small dataset, it seems a bit amusing to create indexes, and I won't be conducting rigorous benchmarks.

Table of Contents

  • Basic Principles of Indexing
  • How to Create an Index in PostgreSQL
  • Considerations When Using Indexes
  • How to Maintain Indexes
  • Index Types in PostgreSQL
    • B-Tree
    • GIN
    • GiST
    • BRIN
    • Bloom

Basic Principles of Indexing

The most crucial aspect of a database is querying data. When data is stored on a hard drive, we must consider how to handle I/O performance issues. External hard drives, due to their hardware limitations, take significantly longer to read compared to direct memory access. In a database, if no index is created on a table, it will perform sequential scans to retrieve data. For small datasets, the performance difference may not be apparent, and sequential scans may even be faster than creating an index. However, as the data volume increases, sequential scans can lead to slow read times.

How Indexes Solve the Problem of Slow Sequential Scans

By creating an additional index table on the hard drive, we can use this index table as a directory. When querying, the database will refer to the specified index table to find the corresponding index, which then points to the actual data location on the hard drive.

B-Tree

B-Tree is a balanced tree specifically optimized for hard drives. Unlike regular binary trees, B-Trees can have multiple branches, effectively reducing the tree's depth. The principles of B-Trees require extensive explanation, so I will skip that here.

Indexes in PostgreSQL

Creating an Index

In PostgreSQL, indexes can be created using SQL syntax.

CREATE INDEX index_name ON table_name (column_name) USING method;

In PostgreSQL, if no specific algorithm is indicated, it defaults to using B-Tree for index creation. You can create an index on a specific column or on multiple columns at once.

CREATE INDEX index_name ON table_name (col1, col2);
"Index Scan using index_subscribers_on_email on subscribers  (cost=0.28..8.29 rows=1 width=76)"
"  Index Cond: ((email)::text = 'xxx@yahoo.co.jp'::text)"

"Seq Scan on subscribers  (cost=0.00..37.38 rows=1 width=76)"
"  Filter: ((email)::text = 'xxx@gmail.co.jp'::text)"

We can use EXPLAIN to observe the performance of the query. It turns out that using B-Tree successfully reduces the query time.

Note:

  • Using CREATE INDEX to create an index will lock the entire table, which may take a considerable amount of time with large datasets.
  • When creating an index, you can add CREATE INDEX CONCURRENTLY to ensure the table does not get locked. However, to ensure the integrity of the index table, CONCURRENTLY will take longer to create the index.

When this option is used, PostgreSQL will build the index without taking any locks that prevent concurrent inserts, updates, or deletes on the table; whereas a standard index build locks out writes (but not reads) on the table until it's done. There are several caveats to be aware of when using this option — see Building Indexes Concurrently.

source

Appropriately Using Indexes

To illustrate when to use an index, let's create an index on a table with only two entries.

CREATE INDEX index_users_on_email ON users (email);
EXPLAIN SELECT email FROM users WHERE email = 'xxx@gmail.com';

"Seq Scan on users  (cost=0.00..1.02 rows=1 width=32)"
"  Filter: ((email)::text = 'xxx@gmail.com'::text)"

For small datasets, even with an index created, the database will still opt for a sequential scan.

This occurs because the cost of random I/O is higher than that of sequential scans. Therefore, when the dataset is small, creating an index does not enhance query performance and may instead waste hard disk space.

Index Size

Creating an index on a dataset with over 1300 entries.

CREATE INDEX index_subscribers_on_email on subscribers (email);
SELECT pg_relation_size('index_subscribers_on_email')/1024 || 'K' AS size; // 80K

Use pg_relation_size to fetch the size of the related relation.

  • To delete an index, use DROP INDEX IF EXISTS index_name.

To maintain index integrity, PostgreSQL will apply an exclusive lock on the entire table during index creation and will only release it after the index is built. This can potentially disrupt services for frequently modified tables, with index creation possibly taking several minutes or even longer for large datasets.

Maintaining Indexes

https://wiki.postgresql.org/wiki/Index_Maintenance

First, let's delete some entries in the database. By default, the deletion method in the database does not actually free up space on the hard drive; instead, it marks entries as "deleted." For indexes, deleting data does not reduce the size of the index.

DELETE FROM subscribers WHERE id > 1350;
SELECT pg_relation_size('index_subscribers_on_email')/1024 || 'K' AS size; // 80K

At this point, we can use REINDEX to rebuild the index. REINDEX INDEX index_name

REINDEX INDEX index_subscribers_on_email;
SELECT pg_relation_size('index_subscribers_on_email')/1024 || 'K' AS size; // 72K

Rebuilding the index will remove unused entries, thus reducing the size of the index table.

From the above, we can understand that for tables with frequent insertions and deletions, regularly rebuilding the index can help reduce its size. If a large volume of data needs to be deleted, it's also advisable to rebuild the index to maintain its size, or else it may lead to wasted disk space.

However, REINDEX will also lock the table. If service interruptions for maintenance are not allowed, you can create a new index concurrently using CREATE INDEX CONCURRENTLY, then delete the old index and rename the new one. This method takes longer and can be a bit more cumbersome but ensures that the table remains unlocked.

Indexing Specific Data

Sometimes, we only frequently query data with specific conditions, in which case it may not be necessary to index all data. By using CREATE INDEX index_name ON table_name WHERE ..., we can create an index on only the needed data.

Index Sorting

When creating an index, if no specific sorting is declared, it will default to using ASC. However, in certain scenarios, we might often prefer DESC sorting, such as for rankings or article publication dates. In these cases, using DESC to create an index can be more efficient.

CREATE INDEX index_articles_on_published_date ON articles (published_date DESC);

Rebuilding Indexes

B-Tree indexes are only reused when the entire index page is cleared, so deleting a value from the database does not reduce the size of the index table. For example, if there were originally 1400 entries and 50 were deleted, the index size remains unchanged.

DELETE FROM subscribers WHERE id > 1350;

SELECT pg_relation_size('index_subscribers_on_email')/1024 || 'K' AS size; // 80K

Rebuilding the index:

REINDEX INDEX index_subscribers_on_email;
SELECT pg_relation_size('index_subscribers_on_email')/1024 || 'K' AS size; // 72K

This allows unused indexes to be removed, freeing up hard disk space and preventing unnecessary index usage from occupying the entire disk. However, it's important to note that REINDEX will also lock the table. If the service does not allow for maintenance shutdowns, consider creating a new index, deleting the original index, and renaming the new index to avoid locking the table.

Index Types in PostgreSQL

Note: The following content has not been benchmarked.

Having introduced PostgreSQL indexes and their usage, let's now look at common index types in PostgreSQL. If no specific method is indicated when creating an index, PostgreSQL will default to using B-Tree.

B-Tree

B-Tree supports the following query operations:

  • <
  • <=
  • =
  • =

GIN (Generalized Inverted Index)

GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items.

GIN is used for full-text searches or for types like arrays. As its name suggests, an inverted index utilizes values as index keys to find corresponding occurrences.

For example:

this is cat
this is an apple.
cat meows.
// build inverted index
"this": {(0, 0), (1,0)}
"is": {(0,1), (1,1)}
"an": {(1,2)}
"apple": {(1,3)}
"cat": {(0,2), (2,0)}
"meows": {(2,1)}

Using the above index table, if we want to search for the keyword cat, we can find the corresponding indexes by looking up the key for cat. We see that cat appears as the second word in the first sentence and the first word in the third sentence.

Applicable Scenarios:

  • Full-text search
  • Arrays

In PostgreSQL, you can use GIN as follows:

CREATE INDEX index_on_document on sentences using gin(document);

GiST (Generalized Search Tree)

GiST is more of a generic interface that allows us to define our own index implementation methods (B-Tree, R-Tree, etc.). It's generally less frequently used. However, for spatial data structures, B-Tree may struggle to accelerate searches for containment, adjacency, or intersection. Therefore, PostGIS uses GiST to create better-performing indexes for spatial queries.

Applicable Scenarios:

  • Custom index implementation
  • Spatial structures
  • When B-Tree may not suit the data structure of the use case

BRIN (Block Range Indexes)

Unlike B-Tree, which indexes every row, BRIN indexes ranges. While performance-wise, B-Tree still reigns superior, BRIN is significantly smaller in size.

If data is often queried by range, such as log files, transaction orders, or billing data, these datasets can be immense. Using B-Tree for indexing may consume a considerable amount of disk space. However, since these datasets are typically not used for precise searches but rather for range queries, BRIN can provide decent performance while saving substantial space.

Applicable Scenarios:

  • Log file analysis
  • Transaction order analysis
  • Large datasets with frequent range searches

Bloom

Bloom Filter is designed to address space and query time issues in hash tables. The algorithm can quickly determine whether an element exists in the set but may result in false positives, requiring a secondary check in PostgreSQL.

CREATE INDEX bloom_index_event ON events USING bloom (startTime, endTime, name, gifts)
 WITH (length=80, startTime=2, endTime=2, name=4, gifts=2);

The length parameter specifies the length of each signature, with a default of 80 and a maximum of 4096. The subsequent parameters map the corresponding column names to a certain number of bits, with a minimum of 2 and a maximum of 4096.

Applicable Scenarios:

  • Precise searches across multiple columns SELECT * FROM events WHERE startTime=20171111 and endTime=20171231 and name=christmas and gifts="gift_special", where a bloom filter can quickly filter out results not in the set.

Conclusion

In general, B-Tree can handle most use cases effectively. However, PostgreSQL offers a rich variety of index types, making data querying and indexing more efficient while providing developers with greater flexibility. This is one of the reasons I appreciate PostgreSQL so much.

  1. PostgreSQL provides several algorithms for indexing, each suitable for specific scenarios.
  2. Using CREATE INDEX will lock the entire table, which can significantly impact performance when indexing large datasets. This can be mitigated with CREATE INDEX CONCURRENTLY, although it will take longer to create the index.
  3. Frequent updates and deletions can lead to redundant indexes. Regularly using REINDEX to rebuild indexes can minimize disk space usage. Keep in mind that REINDEX will lock the table. To avoid locking, consider creating a new index, deleting the old one, and renaming the new index.
  4. CREATE UNIQUE INDEX can ensure the uniqueness of values.
  5. INDEX can be created with a WHERE condition to target specific columns for particular use cases.
  6. INDEX can also be sorted (default is ASC) to fit specific use cases (e.g., article publication dates).
  7. Indexes are not a silver bullet; for small datasets, you may find that even with an index, the database still performs a sequential scan due to the higher cost of random I/O compared to sequential scans. Thus, for fixed and small datasets (e.g., counties, postal codes), creating an index may not necessarily improve query times and could waste disk space.

If you found this article helpful, please consider buying me a coffee ☕ It'll make my ordinary day shine ✨

Buy me a coffee