Kalan's Blog

Kalan 頭像照片,在淡水拍攝,淺藍背景

四零二曜日電子報上線啦!訂閱訂起來

Software Engineer / Taiwanese / Life in Fukuoka
This blog supports RSS feed (all content), you can click RSS icon or setup through third-party service. If there are special styles such as code syntax in the technical article, it is still recommended to browse to the original website for the best experience.

Current Theme light

我會把一些不成文的筆記或是最近的生活雜感放在短筆記,如果有興趣的話可以來看看唷!

Please notice that currenly most of posts are translated by AI automatically and might contain lots of confusion. I'll gradually translate the post ASAP

PostgreSQL Notes — INDEX

In the database, in order to improve query efficiency, when dealing with large amounts of data, indexes can be created to accelerate database query performance. This article will discuss PostgreSQL indexes. Although in most cases, using the CREATE INDEX syntax can handle common development scenarios, PostgreSQL actually provides many different types of indexes.

Introduction

Recently, because of my small project, I wanted to implement features like full-text search and location search. I found that PostgreSQL is really convenient, with full-text search, powerful GIS support (with PostGIS), trigger functions, useful window functions, hstore, and more. Therefore, while implementing the features, I also documented them. However, since it is just a personal project, I find it amusing to create indexes for small data sets, so there is no rigorous benchmark here.

Table of Contents

  • Basic principles of indexes
  • How to create indexes in PostgreSQL
  • Considerations when using indexes
  • How to maintain indexes
  • Index types in PostgreSQL
    • B Tree
    • GIN
    • GiST
    • BRIN
    • Bloom

Basic Principles of Indexes

The most important aspect of a database is querying data. When data is stored on a hard disk, the performance of I/O operations needs to be considered. In a database, if no indexes are created on a table, sequential scans will be used to query the data. There is no significant performance difference when the data volume is small, and it may even be faster than creating indexes. However, as the data volume increases, sequential scans can potentially cause slow reading.

How Indexes Solve the Issue of Slow Sequential Scans

By creating an additional index table on the hard disk, we can use the index table as a directory. When querying the database, the database will use the given index table to find the corresponding index. Then, it follows the index to access the actual data stored on the hard disk.

B Tree

B Tree is a balanced tree optimized for disk storage. Unlike a regular binary tree, a B Tree can have multiple branches, effectively reducing the depth of the tree. The principles of B Tree require a lengthy explanation, so it will be omitted here.

Indexes in PostgreSQL

Creating Indexes

In PostgreSQL, indexes can be created using SQL syntax.

CREATE INDEX index_name ON table_name (column_name) USING method;

In PostgreSQL, if no algorithm is specified, the default is to use B-Tree to create the index. Indexes can also be created on a specific column. It is also possible to create indexes on multiple columns at once.

CREATE INDEX index_name ON table_name (col1, col2);
"Index Scan using index_subscribers_on_email on subscribers  (cost=0.28..8.29 rows=1 width=76)"
"  Index Cond: ((email)::text = 'xxx@yahoo.co.jp'::text)"

"Seq Scan on subscribers  (cost=0.00..37.38 rows=1 width=76)"
"  Filter: ((email)::text = 'xxx@gamil.co.jp'::text)"

We can use EXPLAIN to observe the performance of the query. It can be seen that using the B-Tree method successfully reduces the query time.

Note:

  • When using CREATE INDEX to create an index, the entire table will be locked, which may take a considerable amount of time for large data sets.
  • To ensure that the table is not locked when creating an index, CREATE INDEX CONCURRENTLY can be used. However, to ensure the integrity of the index table, CONCURRENTLY takes more time to create the index.

When this option is used, PostgreSQL will build the index without taking any locks that prevent concurrent inserts, updates, or deletes on the table; whereas a standard index build locks out writes (but not reads) on the table until it's done. There are several caveats to be aware of when using this option — see Building Indexes Concurrently.

source

Appropriate Indexing

To explain when the database uses an index, let's create an index on a table with only two records.

CREATE INDEX index_users_on_email ON users (email);
EXPLAIN SELECT email FROM users WHERE email = 'xxx@gmail.com';

"Seq Scan on users  (cost=0.00..1.02 rows=1 width=32)"
"  Filter: ((email)::text = 'xxx@gmail.com'::text)"

For small data sets, even if an index is created, the database will still use sequential scans for querying. This is because the cost of random I/O is higher than that of sequential scans. Therefore, for fixed and small data sets (e.g., counties, postal codes), creating an index may not necessarily increase query time but instead occupy unnecessary disk space.

Index Size

Creating an index on a data set with more than 1300 records.

CREATE INDEX index_subscribers_on_email on subscribers (email);
SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size;//80K

Using pg_relation_size to obtain the size of the related relation.

  • Deleting an index: DROP INDEX IF EXISTS index_name.

To ensure the integrity of the index, PostgreSQL locks the entire table when creating an index and only releases the lock after the index is created. Therefore, for frequently queried or modified tables, it may cause service interruption. For data with a considerable number of records, creating an index may take several minutes or even tens of minutes.

Maintaining Indexes

https://wiki.postgresql.org/wiki/Index_Maintenance

First, delete some data from the database. By default, when data is deleted from a table, it is not actually released from the hard disk but marked as "deleted". For indexes, deleting data does not reduce the size of the index.

DELETE FROM subscribers WHERE id >1350;
SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size; // 80K

At this point, we can recreate the index using REINDEX. REINDEX INDEX index_name

REINDEX INDEX index_subscibers_on_email;
SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size; // 72K

Rebuilding the index removes unused indexes, reducing the size of the index table.

From the above, we can see that for frequently updated or deleted data tables, regularly rebuilding indexes can reduce the size of the index table. If a large amount of data needs to be deleted, it is also recommended to rebuild the index to maintain the size of the index table; otherwise, it may waste disk space.

However, REINDEX also locks the table. If it is not possible to allow service maintenance or interruption, a new index can be created directly using CREATE INDEX CONCURRENTLY. After deleting the old index, the new index can be renamed. This method takes more time than REINDEX and is more cumbersome, but it ensures that the table is not locked.

Indexing Specific Data

Sometimes, we only need to frequently query specific data conditions. In this case, it is not necessary to create indexes on all data. By using the CREATE INDEX index_name ON table_name WHERE ... syntax, we can create indexes on the required data.

Index Sorting

By default, when creating an index without specifying the sort order, ASC is used. However, in some scenarios, we may prefer to use DESC for sorting, such as ranking or article publishing dates. In these cases, creating an index with DESC can be more efficient.

CREATE INDEX index_articles_on_published_date ON articles (published_date DESC);

Rebuilding Indexes

B-Tree indexes are only reused when all index pages are cleared. Therefore, deleting a value from the database does not reduce the size of the index table. With the original data set of 1400 records, deleting 50 records does not change the size of the index.

DELETE FROM subscribers WHERE id > 1350;

SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size; // 80K

Recreate the index:

REINDEX INDEX index_subscibers_on_email;
SELECT pg_relation_size('index_subscibers_on_email')/1024 || 'K' AS size;
72K

By allowing unused indexes to be removed, disk space can be freed up, avoiding unnecessary indexes occupying the entire disk. However, it should be noted that REINDEX also locks the table. If service cannot be interrupted, it is possible to consider creating a new index directly, deleting the original index, and renaming the new index.

Index Types in PostgreSQL

Note: The following content has not been benchmarked yet.

After explaining PostgreSQL indexes and their usage, let's now discuss the common index types in PostgreSQL. If no specific method is specified when creating an index, PostgreSQL uses B Tree as the default index type.

Btree

B Tree supports the following query operations:

  • <
  • <=
  • =
  • >=
  • >

GIN (Generalized Inverted Index)

GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items.

GIN is used for full-text search or for indexing composite values, such as arrays. As the name suggests, it is an inverted index that uses values as index values and finds the corresponding indexes based on the values.

For example:

this is cat
this is an apple.
cat meows.
// build inverted index
"this": {(0, 0), (1,0)}
"is": {(0,1), (1,1)}
"an": {(1,2)}
"apple: {(1,3)}
"cat": {(0,2), (2,0)}
"meows": {(2,1)}

Through the index table above, if we want to search for the keyword cat, we can find the corresponding indexes from the value with the key "cat". It can be seen that "cat" appears as the second word in the first sentence and as the first word in the third sentence.

Suitable for:

  • Full-text search
  • Arrays

Using GIN in PostgreSQL:

CREATE INDEX index_on_document on sentences using gin(document);

GiST (Generalized Search Tree)

GiST is actually more of a general interface. It can be used to define our own index implementation methods (such as B-Tree, R-Tree, etc.). Generally, it is less commonly used. However, for spatial data structures, using B-Tree may not efficiently accelerate searches for containment, adjacency, and intersection. Therefore, for spatial data structures, such as in PostGIS, GiST is used to create indexes with better query performance.

Suitable for:

  • Implementing custom index interfaces
  • Spatial structures
  • Data structures that cannot be efficiently handled by B-Tree

BRIN (Block Range Indexes)

Unlike B-Tree, which indexes each row, BRIN indexes a range of values. Although the performance is still inferior to B-Tree, the size is much smaller.

If data is frequently queried based on a range, such as log files, a range of transaction orders, or billing data, these data sets usually have a considerable volume. Creating indexes using B-Tree may consume a significant amount of disk space. However, these data sets are often not frequently accessed for precise queries but are processed in batches using range queries. In this case, using BRIN can achieve good performance while saving a considerable amount of space.

Suitable for:

  • Log file analysis
  • Transaction order analysis
  • Large data sets frequently accessed using range queries

Bloom

Bloom Filter is designed to solve the space and query time issues of hash tables. The algorithm itself can quickly determine whether an element appears in a set, but false positives may occur. Therefore, in PostgreSQL, a second check is required.

CREATE INDEX bloom_index_event ON events USING bloom (startTime, endTime, name, gifts)
 WITH (length=80, startTime=2, endTime=2, name=4, gifts=2);

The length parameter represents the length of each signature and defaults to 80, with a maximum of 4096. The following parameters specify how many bits are mapped to each column name. The minimum is 2, and the maximum is 4096.

Suitable for:

  • Precise searching of multiple columns, such as SELECT * FROM events WHERE startTime=20171111 and endTime=20171231 and name=christmas and gifts="gift_special". Using a bloom filter can quickly filter out results that are not in the set.

Summary

In general, B-Tree can handle almost all usage scenarios. However, PostgreSQL provides a variety of index types, making it easier to query and index data, and providing developers with greater flexibility. This is one of the main reasons why I really like PostgreSQL.

  1. PostgreSQL offers several index algorithms for use, each with its own suitable scenarios.
  2. When using CREATE INDEX, the entire table is locked, which may have an impact when creating indexes for large amounts of data. CREATE INDEX CONCURRENTLY can be used to solve this issue, but it takes more time to create the index.
  3. If data is frequently updated or deleted, it may result in redundant indexes. It is recommended to periodically rebuild indexes to reduce disk space usage. Pay attention to the fact that REINDEX locks the table. If service interruption is not allowed, a new index can be created directly, the old index can be deleted, and the new index can be renamed to avoid table locking.
  4. CREATE UNIQUE INDEX can be used to ensure the uniqueness of values.
  5. INDEX can be created with a WHERE condition to index specific columns based on specific usage scenarios.
  6. Indexes can be created with sorting (default is ASC) to match specific usage scenarios (e.g., article publishing dates).
  7. Indexes are not a silver bullet. It can be observed that even with indexes, sequential scans are still used for small data sets. This is because the cost of random I/O is much higher than that of sequential scans. Therefore, for fixed and small data sets (e.g., counties, postal codes), creating an index may not necessarily increase query time but instead occupy unnecessary disk space.

Prev

How to design user friendly table

Next

Analysis of Array.sort

If you found this article helpful, please consider buy me a drink ☕️ It'll make my ordinary day shine✨

Buy me a coffee