If you have ever tried to detect PostgreSQL schema drift by diffing two pg_dump outputs, you have probably run into the same frustrating problem: the diff says the schema changed, but nothing actually did.
A column seems to have moved. A constraint looks deleted and re-added. An index appears different for no real reason. The output changes, even though the schema is logically the same.
That is not really a bug in pg_dump. It is a side effect of what pg_dump was built to do.
pg_dump is great for backup and restore. It is not great for deterministic comparison.
If your goal is schema drift detection, querying pg_catalog directly is usually the more reliable approach. With explicit ordering, you can produce stable snapshots that are much easier to diff, hash, and compare over time.
Why pg_dump creates noisy diffs
The key issue is that pg_dump is designed to recreate a database, not to produce stable text output for comparison.
That distinction matters.
When pg_dump generates schema-only SQL, its job is to emit valid DDL in an order that works for restore. It does not promise that objects will always appear in the same order across runs. In simple databases, you might get identical output twice in a row. But as schemas grow more complex, that becomes less reliable.
Take a small example:
CREATE TABLE orders (
id bigserial PRIMARY KEY,
customer_id bigint NOT NULL,
total_cents integer NOT NULL,
status text NOT NULL DEFAULT 'pending',
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX idx_orders_customer ON orders (customer_id);
CREATE INDEX idx_orders_status ON orders (status);
Now run:
pg_dump --schema-only mydb > dump1.sql
pg_dump --schema-only mydb > dump2.sql
diff dump1.sql dump2.sql
Sometimes you will get no diff. Sometimes you will get a noisy one. The more tables, foreign keys, indexes, and constraints you add, the more likely it is that ordering differences start to show up.
This gets worse when you compare environments. Production has years of history behind it. Staging may have been recreated last week. Even if the logical schema is the same, the catalog layout underneath may not be. That can be enough to produce different output ordering.
A common failure mode
Imagine a table with multiple check constraints:
CREATE TABLE payments (
id bigserial PRIMARY KEY,
amount_cents integer NOT NULL,
currency char(3) NOT NULL,
status text NOT NULL,
CONSTRAINT chk_amount CHECK (amount_cents > 0),
CONSTRAINT chk_currency CHECK (currency IN ('USD', 'EUR', 'GBP')),
CONSTRAINT chk_status CHECK (status IN ('pending', 'processed', 'failed'))
);
In one dump, the constraints may appear in this order: chk_amount, chk_currency, chk_status. In another, they may come back in a different order.
A text diff will flag that as a change, even though the schema has not actually changed at all.
Multiply that across hundreds of tables and constraints, and you end up with pages of noise. The real drift signal gets buried inside false positives.
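The instability is easy to reproduce, because catalog rows have no inherent order. A quick illustration against the payments table above:

SELECT conname, contype
FROM pg_catalog.pg_constraint
WHERE conrelid = 'payments'::regclass;
-- No ORDER BY: PostgreSQL makes no guarantee about row order here,
-- and the physical order can shift after updates, vacuums, or a restore.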
Why pg_catalog works better
pg_catalog is PostgreSQL's system catalog. It stores metadata about tables, columns, constraints, indexes, functions, types, and more.
The advantage is simple: you can query it directly and apply your own ORDER BY.
That gives you deterministic output.
If the schema has not changed, the query result will come back in the same order every time. That makes it much better for drift detection than comparing raw pg_dump output.
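For example, even a minimal table listing becomes stable once you sort it explicitly (the catalog tables used here are covered in the next section):

SELECT n.nspname AS schema_name,
       c.relname AS table_name
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
  AND c.relkind IN ('r', 'p')  -- ordinary and partitioned tables
ORDER BY n.nspname, c.relname;  -- explicit ordering keeps output stable across runs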
Core catalog tables worth querying
For schema snapshots, these are the most useful catalog tables:
| Catalog table | What it contains |
|---|---|
| pg_namespace | Schemas |
| pg_class | Tables, views, indexes, sequences |
| pg_attribute | Columns |
| pg_constraint | Primary keys, foreign keys, unique and check constraints |
| pg_index | Index metadata |
| pg_proc | Functions and procedures |
| pg_trigger | Triggers |
| pg_extension | Installed extensions |
| pg_type | Custom types and enums |
Deterministic column snapshot
A column snapshot query looks like this:
SELECT
n.nspname AS schema_name,
c.relname AS table_name,
a.attname AS column_name,
a.attnum AS ordinal_position,
pg_catalog.format_type(a.atttypid, a.atttypmod) AS data_type,
a.attnotnull AS is_not_null,
pg_catalog.pg_get_expr(d.adbin, d.adrelid) AS column_default
FROM pg_catalog.pg_attribute a
JOIN pg_catalog.pg_class c ON a.attrelid = c.oid
JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
LEFT JOIN pg_catalog.pg_attrdef d
ON a.attrelid = d.adrelid AND a.attnum = d.adnum
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
AND c.relkind IN ('r', 'p')
AND a.attnum > 0
AND NOT a.attisdropped
ORDER BY n.nspname, c.relname, a.attnum;
The important part is not just what you query, but how you order it. Once the ordering is explicit, the output becomes stable enough for reliable comparison.
Constraints and indexes follow the same pattern
Constraints:
SELECT
n.nspname AS schema_name,
c.relname AS table_name,
con.conname AS constraint_name,
con.contype AS constraint_type,
pg_catalog.pg_get_constraintdef(con.oid, true) AS definition
FROM pg_catalog.pg_constraint con
JOIN pg_catalog.pg_class c ON con.conrelid = c.oid
JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY n.nspname, c.relname, con.conname;
Indexes:
SELECT
n.nspname AS schema_name,
c.relname AS table_name,
i.relname AS index_name,
pg_catalog.pg_get_indexdef(ix.indexrelid) AS index_definition
FROM pg_catalog.pg_index ix
JOIN pg_catalog.pg_class c ON ix.indrelid = c.oid
JOIN pg_catalog.pg_class i ON ix.indexrelid = i.oid
JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
ORDER BY n.nspname, c.relname, i.relname;
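Functions extend the same idea. A sketch against pg_proc, assuming PostgreSQL 11 or later for the prokind column:

SELECT
    n.nspname AS schema_name,
    p.proname AS function_name,
    pg_catalog.pg_get_function_identity_arguments(p.oid) AS arguments,
    pg_catalog.pg_get_functiondef(p.oid) AS definition
FROM pg_catalog.pg_proc p
JOIN pg_catalog.pg_namespace n ON p.pronamespace = n.oid
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND p.prokind = 'f'  -- plain functions only; pg_get_functiondef rejects aggregates
ORDER BY n.nspname, p.proname,
    pg_catalog.pg_get_function_identity_arguments(p.oid);  -- arguments disambiguate overloads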
A fingerprint is even better than a raw diff
Once your snapshot output is deterministic, you can go one step further and compute a schema fingerprint.
The idea is straightforward:
- Capture the ordered metadata
- Convert it into a canonical string
- Hash it with SHA-256
If the fingerprint is unchanged, the schema is unchanged. If it differs, you know something moved and you can run a deeper diff.
That approach is much more efficient for continuous monitoring. Most of the time you only need to compare a short hash instead of full schema text.
You can compute the fingerprint directly in SQL:
-- Requires PostgreSQL 11+ for the built-in sha256();
-- convert_to() turns the aggregated text into the bytea it expects.
SELECT encode(
    sha256(
        convert_to(
            string_agg(row_to_text, E'\n' ORDER BY row_to_text),
            'UTF8'
        )
    ),
    'hex'
) AS schema_fingerprint
FROM (
    SELECT format('%s.%s.%s.%s.%s.%s',
        n.nspname, c.relname, a.attname, a.attnum,
        pg_catalog.format_type(a.atttypid, a.atttypmod),
        a.attnotnull
    ) AS row_to_text
    FROM pg_catalog.pg_attribute a
    JOIN pg_catalog.pg_class c ON a.attrelid = c.oid
    JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
    WHERE n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
      AND c.relkind IN ('r', 'p')
      AND a.attnum > 0
      AND NOT a.attisdropped
) sub;
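If you want history rather than a one-off check, you can record each digest as you capture it. A minimal sketch; schema_fingerprints is a hypothetical bookkeeping table, not part of the queries above:

CREATE TABLE IF NOT EXISTS schema_fingerprints (
    captured_at timestamptz NOT NULL DEFAULT now(),
    fingerprint text NOT NULL  -- the 64-character hex digest from the query above
);
-- Each scan appends one row: wrap the fingerprint query above in
-- INSERT INTO schema_fingerprints (fingerprint) SELECT ... ;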
Performance is another advantage
This approach is not just cleaner. It is often faster.
On a database with 350 tables, 1,200 columns, 400 constraints, and 500 indexes:
| Approach | Time | Output size | Deterministic |
|---|---|---|---|
| pg_dump --schema-only | 1.8s | 245 KB | No |
| pg_catalog queries | 0.3s | 82 KB | Yes |
| pg_catalog + SHA-256 | 0.4s | 64 bytes | Yes |
pg_dump has to resolve dependencies, order DDL for restore, and format everything as valid SQL. Catalog queries skip that overhead and pull only the metadata you actually need.
The fingerprint comparison is the key insight for continuous monitoring at scale. You are comparing a 64-character string, not 245 KB of schema text. If it matches, you are done in milliseconds. If it differs, you run the full queries to find what changed.
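With history in place (again assuming the hypothetical schema_fingerprints table sketched earlier), spotting drift is a single window-function query:

SELECT captured_at,
       fingerprint,
       fingerprint IS DISTINCT FROM
           lag(fingerprint) OVER (ORDER BY captured_at) AS changed_since_last_scan
FROM schema_fingerprints
ORDER BY captured_at DESC;  -- the oldest row has no predecessor and reads as changed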
A practical Java approach
At Arcnull, this pattern is implemented in Java through a CatalogService that queries pg_catalog, builds canonical snapshots, and computes a fingerprint:
import com.google.common.hash.Hashing;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

@Service
public class CatalogService {

    private final JdbcTemplate jdbc;

    public CatalogService(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Same deterministic column query as above, scoped to a single schema.
    public List<ColumnSnapshot> captureColumns(String schema) {
        return jdbc.query("""
            SELECT
                n.nspname AS schema_name,
                c.relname AS table_name,
                a.attname AS column_name,
                a.attnum AS ordinal_position,
                pg_catalog.format_type(a.atttypid, a.atttypmod) AS data_type,
                a.attnotnull AS is_not_null,
                pg_catalog.pg_get_expr(d.adbin, d.adrelid) AS column_default
            FROM pg_catalog.pg_attribute a
            JOIN pg_catalog.pg_class c ON a.attrelid = c.oid
            JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
            LEFT JOIN pg_catalog.pg_attrdef d
                ON a.attrelid = d.adrelid AND a.attnum = d.adnum
            WHERE n.nspname = ?
              AND c.relkind IN ('r','p')
              AND a.attnum > 0
              AND NOT a.attisdropped
            ORDER BY n.nspname, c.relname, a.attnum
            """,
            (rs, rowNum) -> new ColumnSnapshot(
                rs.getString("schema_name"),
                rs.getString("table_name"),
                rs.getString("column_name"),
                rs.getInt("ordinal_position"),
                rs.getString("data_type"),
                rs.getBoolean("is_not_null"),
                rs.getString("column_default")
            ),
            schema
        );
    }

    public String computeFingerprint(String schema) {
        // captureConstraints and captureIndexes (not shown) follow the same
        // pattern as captureColumns: an explicit ORDER BY and a snapshot record.
        String combined = Stream.of(
                captureColumns(schema).stream()
                    .map(ColumnSnapshot::toCanonicalString),
                captureConstraints(schema).stream()
                    .map(ConstraintSnapshot::toCanonicalString),
                captureIndexes(schema).stream()
                    .map(IndexSnapshot::toCanonicalString)
            )
            .flatMap(Function.identity())
            .collect(Collectors.joining("\n"));
        // Guava's Hashing: SHA-256 over the canonical snapshot text.
        return Hashing.sha256()
            .hashString(combined, StandardCharsets.UTF_8)
            .toString();
    }
}
The design choice that matters: each snapshot object produces a stable string representation. Once you have that, the fingerprint is simple. Capture ordered metadata, serialize consistently, hash it, compare against the previous scan, and only run a full semantic diff when the fingerprint changes.
When to use each tool
The real takeaway is not that pg_dump is bad. It solves a different problem.
Use pg_dump when you need to:
- Create backups for disaster recovery
- Move schemas between PostgreSQL versions
- Generate SQL for manual inspection
- Clone or restore databases
Use pg_catalog queries when you need to:
- Detect schema drift between environments
- Compute deterministic fingerprints
- Build automated schema comparison pipelines
- Monitor production schemas for untracked changes
These tools are complementary. Problems start when pg_dump is used for something it was never meant to do.
Final takeaway
The issue with pg_dump is not something you can fully fix with better diff tooling. The noise comes from the nature of the output itself.
If you want reliable schema snapshots, start closer to the source. Query pg_catalog, order the results explicitly, and compare deterministic output instead of restore-oriented SQL.
That gives you cleaner diffs, fewer false alarms, and a much more dependable foundation for drift detection.
If you are building this yourself, start there. If you want it production-ready with scanning, diffing, CI integration, and alerts, that is what Drift Scanner is built to handle.
Your future self debugging a 3 AM deployment failure will thank you.