PostgreSQL: About Internals & Tuning

I attended Karen Jex’s talk on databases at DjangoCon in Edinburgh 2023. That sparked my curiousity. I find databases fascinating—they don’t change as fast and as much as other software, such as frameworks, yet they’re used in nearly every system we rely on.

I decided to spend this rainy weekend in Prague exploring the internals of PostgreSQL and writing down some notes from my learnings.

How Is Data Stored?

As someone who regularly works with databases, I know data is stored in tables, but how exactly are these tables stored?

  • Tables: Stored as heap files on disk.
  • Heap Files: Consist of data pages (also called blocks), each typically 8 KB in size.
  • Data Pages: The basic unit of storage in PostgreSQL, containing tuples (rows) and header information.
  • Tuples: Each tuple consists of a header (metadata) and the actual data (column values). Tuples are not stored in any particular order.

Structure of a Data Page

A data page has the following components:

  1. Page Header: Contains metadata about the page.
  2. Line Pointer (Item Identifier) Array: An array of pointers to the tuples stored within the page.
  3. Tuple Data Area: The actual storage area where tuples (rows) are stored.
  4. Free Space: Unused space available for new tuples.
Page architecture
Page architecture. Source: interdb.jp/pg


Tuple architecture
Tuple architecture. Source: interdb.jp/pg

Line Pointer Array

Now that we’ve introduced data pages, we can explain the line pointer array:

  • An array of pointers located immediately after the page header in each data page.
  • Each entry in this array is called a “line pointer.”
  • Each line pointer corresponds to a tuple (row) stored within the data range.
  • It points to the offset within the page where a tuple starts.

What Happens When a New Row is Inserted?

  1. PostgreSQL places the new row into a data page.
  2. The tuple for this row has a header (metadata) and the data itself (column values).
  3. The line pointer array contains a pointer to this tuple’s location within the page.
Tuple insertion
Tuple insertion. Source: interdb.jp/pg

What Happens When a Row is Deleted?

  1. The row is marked as logically deleted (dead) by updating its metadata.
  2. Its visibility is changed so future transactions cannot see it.
  3. The line pointer still points to the row’s location, but it is marked as ‘dead.’
  4. The data remains on the page until a cleanup process (VACUUM) occurs.

What Happens When a Row is Updated?

  1. A new version of the tuple is created and may be placed in a different part of the page or even a different page.
  2. The line pointer array is updated to point to the new location of the tuple.

Internals

With an understanding of how data is stored, let’s check out other key concepts of PostgreSQL internals.

PostgreSQL follows a client-server architecture where each client connection is handled by a separate backend process. The database consists of various subsystems responsible for processing queries, managing memory, storing data, and many more.

Process architecture
Process architecture. Source: interdb.jp/pg


Memory Architecture

PostgreSQL’s memory architecture consists of:

  • Local Memory: Allocated by each backend process for its own use.
  • Shared Memory: Used by all processes.
Memory architecture
Memory architecture. Source: interdb.jp/pg


Indexes

Indexes improve data retrieval speed by providing quick access paths to data. The most commonly used underlying structure is a B-Tree.

Index Lookup Process:

  1. Search Index: Find matching keys.
  2. Retrieve CTIDs: Get the physical locations.
  3. Access Data Pages: Use CTIDs to fetch tuples directly.

Without an index, PostgreSQL performs a sequential scan, checking each row in every page.

Difference Between an Index and a Line Pointer

  • Index is an external structure to the table while a line pointer is internal to each data page.
  • Index provides a mapping from column values to CTIDs, while line pointer points to the physical offset of tuples within the page

Query Path

Let’s see how a query is processed:

  1. Connection Established: A process is created per user.
  2. Query Sent: The client sends a plain text query to the backend.
  3. Parsing: The parser checks the query syntax and creates a query tree. It performs semantic analysis to understand the tables, functions, and operators referenced.
  4. Rewrite System: Applies rules to the query tree.
  5. Planning: The planner generates a query plan, choosing the optimal execution path.
  6. Execution: The executor steps through the plan tree and retrieves rows in the way represented by the plan.

VACUUM

Maintains database health by:

  • Removing dead tuples.
  • Freezing transaction IDs.

Two types of VACUUM:

  1. Concurrent VACUUM: Removes dead tuples while allowing other transactions to read the table.
  2. Full VACUUM: Removes dead tuples and defragments live tuples in the entire file, blocking access to the table during the process.

Write-Ahead Logging (WAL)

  • Ensures no data is lost even in case of system failure.
  • WAL records (XLOG records) are written to in-memory buffers and then immediately to a WAL segment file upon transaction commit or abort.
  • PostgreSQL starts recovery from the REDO point, the location of the latest checkpoint.

How Does Replication Work?

The primary server continuously sends WAL data to standby servers, which then replay the received data.

How to Tune Postgres?

Now that you know the basics, you might be wondering if there’s anything you need to tune or if PostgreSQL just works as is. If you look at the documentation, you’ll find over 300 parameters, which can be pretty overwhelming. I came across Karen’s talk at DjangoCon, where she shared a list of the most important parameters to change from their default values. Here are my notes from her talk. If you’re interested in tuning PostgreSQL, I recommend watching the full talk!

Setting Parameters

  • Cluster level: postgresql.conf or ALTER SYSTEM SET parameter=value;
  • Database level: ALTER DATABASE db1 SET parameter=value;
  • User/Role level: ALTER ROLE bob SET parameter=value;
  • Session level: SET parameter=value;
  • Transaction level: SET LOCAL parameter=value;

Contexts

  • Internal: Set during cluster creation, can’t be changed.
  • Postmaster: Requires server restart.
  • SIGHUP: Requires config file reload.
  • Superuser-backend: Affects only new connections.
  • Backend: Affects existing connections.
  • Superuser/User: Can be set in the session.

Where to Find Information About Parameters?

  • Documentation.
  • Default postgresql.conf.
  • pg_settings view. Example query: SELECT * FROM pg_settings WHERE name=’max_connections’;.

Connection and Session Parameters

  • listen_addresses:
    • Description: Addresses allowed to connect.
    • Default: localhost
    • Suggested: * or 0.0.0.0 (all IPv4), :: (all IPv6)
  • max_connections:
    • Description: Maximum concurrent connections.
    • Default: 100
    • Suggested: No more than 500. Use connection pooling if you have over hundreds of connections.
  • idle_in_transaction_session_timeout:
    • Description: Max idle time between queries in a transaction.
    • Default: 0 (disabled)
    • Suggested: 30 minutes

Memory Parameters

  • shared_buffers:
    • Description: Memory for caching data.
    • Default: 128MB
    • Suggested: 25-40% of available memory.
  • work_mem:
    • Description: Max memory per operation before spilling to disk.
    • Default: 4MB
    • Suggested: 10MB. Adjust per session for large operations.
  • maintenance_work_mem:
    • Description: Memory for maintenance tasks like VACUUM.
    • Default: 64MB
    • Suggested: ~5% of total RAM.

Logging Parameters

  • log_min_duration_statement:
    • Description: Logs statements exceeding a duration.
    • Default: -1 (disabled)
    • Suggested: 1s or as per system needs.
  • log_line_prefix:
    • Description: Information prefixed to each log line.
    • Default: %m [%p] (timestamp and process ID)
    • Suggested: %t:%r:%u@%d:[%p]: (timestamp, host, user, database, process ID).

WAL Parameters

  • wal_buffers:
    • Description: Memory for WAL before syncing to disk.
    • Default: -1 (auto-calculated)
    • Suggested: 32MB for many concurrent connections.
  • checkpoint_timeout:
    • Description: Max time between automatic WAL checkpoints.
    • Default: 5 minutes
    • Suggested: 10-30 minutes to balance crash recovery and IO intensity.
  • max_wal_size:
    • Description: Sets WAL size that triggers a checkpoint.
    • Default: 1GB
    • Suggested: ½ to ⅔ of disk space in the WAL directory.

Query Tuning

  • effective_cache_size:
    • Description: Approximate memory available for operations.
    • Default: 4GB
    • Suggested: 50-70% of total memory.
  • random_page_cost:
    • Description: Cost estimate for fetching a disk page.
    • Default: 4
    • Suggested: 1.1 for SSDs, 2.0 for fast spinning disks.

Parameters to Avoid Changing

  • fsync:
    • Can cause database corruption if disabled.
  • synchronous_commit:
    • Risk of data loss.
  • autovacuum:
    • Essential for cleaning up dead tuples. Monitor with log_autovacuum_min_duration=0.

What More to Pay Attention to?

I recently attended another one of Karen’s talks at PyCon UK called “How to Make Your Database Happy.” In addition to the parameters from above, she mentioned the importance of backups and high availability. Here are my notes from the talk.

Backups

If your data is important and you want to be able to recover it, you need:

  • Somewhere to store it: A backup repository (preferably on a different server).
  • Backup tool: Use tools like pgBackRest or Barman.
  • Scheduling: Ensure regular backups.
  • Online backups: Implement WAL archiving to save WAL files in a separate location.
    • Enable archive mode and set up an archive command.
    • Use pgBackRest to take regular backups.
    • Backup = Copy of data + WAL.

High Availability

  • Streaming Replication: Use PostgreSQL streaming replication to push data from the primary to replicas.
  • Failover Management: Use tools like Patroni to monitor your cluster. If the primary fails, a standby can take over as the new primary.

Sources

Mia Bajić's Picture

About Mia Bajić

I’m a Prague-based software engineer passionate about knowledge sharing & community building. I’m the main organizer of Prague Python Pizza & Prague Python meetups, and a co-organizer of EuroPython & PyCon CZ.