Technical Project Documentation

Complete Writeup Collection

Project Overview

This is a collection of technical writeups covering various projects in software development, reverse engineering, and system analysis. Each writeup provides detailed methodology, tools used, and key learnings.

Documentation Structure

  • Development: Software projects and implementation details
  • Analysis: System analysis and reverse engineering studies
  • Tools: Utilities for creating new documentation

Use the sidebar navigation to explore writeups for each project. Click on any project name to view its detailed documentation.


🗄️ PostgreSQL Query Performance Optimization

Identifying and Resolving Slow Query Performance

🎯 Objective

Investigate and optimize a production PostgreSQL database experiencing significant performance degradation on a critical reporting query that processes customer order data.

💡 Key Considerations

  • EXPLAIN ANALYZE provides detailed execution plans and actual runtime statistics
  • Sequential scans on large tables are often performance bottlenecks
  • Proper indexing strategy can reduce query time by orders of magnitude
  • Table statistics must be kept up-to-date for optimal query planning

📚 Background Information

PostgreSQL uses a cost-based query optimizer that evaluates multiple execution strategies and chooses the plan with the lowest estimated cost. The optimizer relies on table statistics collected by the ANALYZE command to estimate row counts and selectivity of predicates.

When a query performs poorly, the execution plan often reveals the root cause. Common issues include missing indexes causing full table scans, outdated statistics leading to poor plan choices, or inefficient join strategies. The EXPLAIN ANALYZE command executes the query and returns both the planned and actual execution details, making it invaluable for performance tuning.
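As a minimal illustration using the orders table from this case study, plain EXPLAIN shows only the planner's estimates, while the ANALYZE option actually runs the statement:

-- Estimates only; the statement is not executed:
EXPLAIN
SELECT count(*) FROM orders WHERE order_date >= '2024-01-01';

-- Executes the statement and adds actual times, row counts, and buffer usage:
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM orders WHERE order_date >= '2024-01-01';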

Indexes are auxiliary data structures that allow the database to quickly locate rows matching specific criteria without scanning the entire table. However, indexes come with overhead - they consume disk space and slow down INSERT/UPDATE/DELETE operations. Therefore, indexing strategy requires careful consideration of query patterns and data access requirements.
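One way to keep an eye on that trade-off is PostgreSQL's standard statistics views; this sketch (not part of the original investigation) reports the disk footprint and usage of every index on orders:

-- Size and scan counts per index on the orders table:
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan AS times_used
FROM pg_stat_user_indexes
WHERE relname = 'orders';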

📝 Problem Description

Situation: The monthly sales report query was taking over 45 seconds to complete, causing timeouts in the web application. The query joins three tables: orders (5M rows), customers (500K rows), and products (50K rows).

The query filters orders by a date range and aggregates by customer region and product category; it previously ran in under 2 seconds, before data volume grew significantly.

🔧 Analysis Steps

1. Baseline Performance Measurement

First, we capture the current query execution time and plan:

-- Run inside psql (connect with: psql -d production_db)
EXPLAIN ANALYZE
SELECT 
    c.region,
    p.category,
    COUNT(o.order_id) as order_count,
    SUM(o.total_amount) as total_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date >= '2024-01-01' 
  AND o.order_date < '2025-01-01'
GROUP BY c.region, p.category
ORDER BY total_revenue DESC;
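For quick wall-clock numbers during an interactive session, psql's \timing meta-command is a lightweight complement to reading the plan's total time:

\timing on
-- psql replies "Timing is on." and prints elapsed time after each statement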

Initial Results:

  • Execution time: 45,234 ms (45.2 seconds)
  • Sequential scan on orders table: 5,000,000 rows scanned
  • Hash joins used for customer and product lookups
  • No indexes utilized for the WHERE clause filter

2. Analyzing the Execution Plan

The EXPLAIN ANALYZE output reveals the bottleneck:

HashAggregate  (cost=524891.23..524891.89 rows=52 width=72) 
               (actual time=45231.445..45231.467 rows=156 loops=1)
  ->  Hash Join  (cost=3875.00..512391.23 rows=834000 width=40) 
                (actual time=89.234..43891.234 rows=823456 loops=1)
        Hash Cond: (o.product_id = p.product_id)
        ->  Hash Join  (cost=1250.00..483266.00 rows=834000 width=36) 
                      (actual time=45.123..41234.567 rows=823456 loops=1)
              Hash Cond: (o.customer_id = c.customer_id)
              ->  Seq Scan on orders o  (cost=0.00..465891.00 rows=834000 width=20) 
                                       (actual time=0.045..39876.123 rows=823456 loops=1)
                    Filter: ((order_date >= '2024-01-01') AND (order_date < '2025-01-01'))
                    Rows Removed by Filter: 4176544
              ->  Hash  (cost=875.00..875.00 rows=30000 width=24)
                       (actual time=44.876..44.876 rows=500000 loops=1)

Critical Discovery: The sequential scan on the orders table is reading all 5 million rows and then filtering, rather than using an index to directly access the relevant date range. This accounts for the majority of the execution time!
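The pg_stat_user_tables view (listed under Tools below) corroborates this from the server's own counters; a quick check of how often the table is read sequentially versus via an index:

SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE relname = 'orders';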

3. Examining Current Indexes

We check what indexes already exist on the orders table:

SELECT 
    indexname,
    indexdef
FROM pg_indexes
WHERE tablename = 'orders';

Results show:

  • orders_pkey: PRIMARY KEY on order_id
  • No index on order_date
  • No index on customer_id (foreign key)
  • No index on product_id (foreign key)

4. Creating Strategic Indexes

Based on the query pattern, we create targeted indexes:

-- Index on order_date for range queries
CREATE INDEX idx_orders_order_date 
ON orders(order_date);

-- Indexes on foreign keys for join optimization
CREATE INDEX idx_orders_customer_id 
ON orders(customer_id);

CREATE INDEX idx_orders_product_id 
ON orders(product_id);

-- Composite index for the specific query pattern
CREATE INDEX idx_orders_date_customer_product 
ON orders(order_date, customer_id, product_id);

Index Creation Time:

  • idx_orders_order_date: 12 seconds
  • idx_orders_customer_id: 15 seconds
  • idx_orders_product_id: 14 seconds
  • idx_orders_date_customer_product: 28 seconds

Note: CREATE INDEX CONCURRENTLY was used in production to avoid blocking writes.
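For reference, the concurrent form of the first build looks like this; it avoids an exclusive lock on the table, but takes longer and cannot run inside a transaction block:

CREATE INDEX CONCURRENTLY idx_orders_order_date
ON orders (order_date);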

5. Updating Table Statistics

After creating indexes, we update statistics for the query planner:

ANALYZE orders;
ANALYZE customers;
ANALYZE products;

6. Re-testing Query Performance

Running the same query with new indexes:

EXPLAIN ANALYZE
SELECT 
    c.region,
    p.category,
    COUNT(o.order_id) as order_count,
    SUM(o.total_amount) as total_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date >= '2024-01-01' 
  AND o.order_date < '2025-01-01'
GROUP BY c.region, p.category
ORDER BY total_revenue DESC;

The new execution plan shows:

HashAggregate  (cost=18234.56..18234.78 rows=52 width=72) 
               (actual time=1247.123..1247.145 rows=156 loops=1)
  ->  Hash Join  (cost=3875.00..14234.56 rows=834000 width=40) 
                (actual time=67.234..989.567 rows=823456 loops=1)
        Hash Cond: (o.product_id = p.product_id)
        ->  Hash Join  (cost=1250.00..9876.23 rows=834000 width=36) 
                      (actual time=34.123..756.234 rows=823456 loops=1)
              Hash Cond: (o.customer_id = c.customer_id)
              ->  Index Scan using idx_orders_date_customer_product on orders o
                    (cost=0.43..7234.56 rows=834000 width=20) 
                    (actual time=0.098..456.789 rows=823456 loops=1)
                    Index Cond: ((order_date >= '2024-01-01') AND (order_date < '2025-01-01'))
              ->  Hash  (cost=875.00..875.00 rows=30000 width=24)
                       (actual time=33.876..33.876 rows=500000 loops=1)

Performance Improvement:

  • Before: 45,234 ms (45.2 seconds) - Sequential Scan
  • After: 1,247 ms (1.2 seconds) - Index Scan
  • Improvement: 97.2% reduction in execution time
  • Rows Scanned: Reduced from 5M to 823K (only relevant rows)

7. Monitoring and Validation

After deployment, we monitored the query performance:

  1. Verified consistent sub-2-second response times over 7 days
  2. Confirmed negligible impact on INSERT operations (< 5 ms added latency)
  3. Set up automated ANALYZE jobs to run weekly during off-peak hours (a statistics freshness check is sketched below)
  4. Documented the indexing strategy for the development team

✓ Optimization Complete: Query performance improved by 36x
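The weekly ANALYZE jobs can be sanity-checked against the server's own bookkeeping; n_mod_since_analyze counts row changes since each table's last ANALYZE:

SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname IN ('orders', 'customers', 'products');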

🎓 Key Learnings

  • EXPLAIN ANALYZE is Essential: Always use EXPLAIN ANALYZE to understand actual query execution, not just estimated plans. The difference between estimates and reality often reveals hidden issues.
  • Index Foreign Keys: Foreign key columns used in JOIN operations should almost always be indexed. PostgreSQL doesn't automatically create these indexes.
  • Date Range Queries Need Indexes: Filtering on date columns without indexes forces sequential scans. B-tree indexes on date columns enable efficient range scans.
  • Composite Indexes for Complex Queries: When queries consistently filter on multiple columns, composite indexes can eliminate the need for additional table lookups (see the covering-index sketch after this list).
  • Keep Statistics Current: The query planner relies on table statistics. Regular ANALYZE operations ensure optimal plan selection, especially after bulk data changes.
  • Monitor Index Impact: While indexes speed up reads, they have overhead on writes. Always measure the impact on INSERT/UPDATE/DELETE operations in production workloads.
  • Use CONCURRENTLY in Production: CREATE INDEX CONCURRENTLY prevents table locks but takes longer. This trade-off is essential for zero-downtime deployments.
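To make the composite-index point concrete: a covering variant (hypothetical here, not built in this project) carries every orders column the report reads, letting the planner satisfy the scan from the index alone as an index-only scan:

-- INCLUDE (PostgreSQL 11+) stores extra columns in the index leaf pages
-- without making them part of the search key:
CREATE INDEX idx_orders_date_covering
ON orders (order_date)
INCLUDE (customer_id, product_id, order_id, total_amount);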

🛠️ Tools Used

  • psql - PostgreSQL command-line interface
  • EXPLAIN ANALYZE - Query execution plan analysis
  • pg_indexes - System catalog for index information
  • pg_stat_user_tables - Table statistics and usage metrics
  • pgAdmin - GUI tool for visual query plan analysis

📊 Additional Optimization Considerations

Further Optimizations Evaluated

  • Partitioning: Considered table partitioning by order_date but decided against it since current performance met requirements
  • Materialized Views: For this monthly report, a materialized view refreshed daily could reduce real-time query load (sketched below)
  • Query Caching: Application-level caching could serve repeated queries without hitting the database
  • Connection Pooling: Verified that pgBouncer was properly configured to handle concurrent report requests
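To make the materialized-view option concrete, here is a minimal sketch under assumed names (sales_report_summary is illustrative, not from the project):

-- Precompute the report aggregation once, keyed by month:
CREATE MATERIALIZED VIEW sales_report_summary AS
SELECT
    date_trunc('month', o.order_date) AS report_month,
    c.region,
    p.category,
    COUNT(o.order_id) AS order_count,
    SUM(o.total_amount) AS total_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
GROUP BY 1, 2, 3;

-- Refresh from a daily scheduled job:
REFRESH MATERIALIZED VIEW sales_report_summary;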