How to Choose Between Star and Snowflake Schemas in PostgreSQL
When designing an analytics database in PostgreSQL, choosing the right schema can significantly impact query performance, storage efficiency, and data loading processes. Two common dimensional modeling patterns for Online Analytical Processing (OLAP) workloads are the star schema and the snowflake schema. This tutorial will guide you through implementing both, evaluating their performance, and making an informed decision for your data analytics needs.
Step 1: Understanding the Star Schema
A star schema features a central "fact" table surrounded by "dimension" tables, resembling a star. The key principle is denormalization: all attributes related to a specific dimension are stored in a single, flat table.
Core Components
- Fact Table: Stores measurable events (e.g., sales, page views) with foreign keys linking to dimensions and numerical measures.
- Dimension Tables: Provide descriptive context (e.g., product attributes, customer details). In a star schema, these are denormalized; all attributes for a concept are in one table (e.g.,
product_name,category_name,subcategory_namein a singledim_producttable).
Star Schema DDL Example
For a retail sales scenario:
CREATE TABLE dim_date ( date_key INT PRIMARY KEY, full_date DATE NOT NULL, day_of_week SMALLINT NOT NULL, month_name VARCHAR(20) NOT NULL, quarter SMALLINT NOT NULL, year SMALLINT NOT NULL);CREATE TABLE dim_product ( product_key INT PRIMARY KEY, product_id VARCHAR(50) NOT NULL, product_name VARCHAR(100) NOT NULL, category_name VARCHAR(50) NOT NULL, subcategory_name VARCHAR(50) NOT NULL);CREATE TABLE dim_customer ( customer_key INT PRIMARY KEY, customer_id VARCHAR(50) NOT NULL, customer_name VARCHAR(100) NOT NULL, city VARCHAR(50) NOT NULL, state VARCHAR(50) NOT NULL, country VARCHAR(50) NOT NULL);CREATE TABLE fact_sales ( sale_key BIGSERIAL PRIMARY KEY, date_key INT NOT NULL REFERENCES dim_date(date_key), product_key INT NOT NULL REFERENCES dim_product(product_key), customer_key INT NOT NULL REFERENCES dim_customer(customer_key), quantity INT NOT NULL, price NUMERIC(10, 2) NOT NULL, total_amount NUMERIC(10, 2) NOT NULL);CREATE INDEX idx_fact_sales_date_key ON fact_sales (date_key);CREATE INDEX idx_fact_sales_product_key ON fact_sales (product_key);CREATE INDEX idx_fact_sales_customer_key ON fact_sales (customer_key);Aggregating sales by product category would join fact_sales to only dim_product.
Step 2: Understanding the Snowflake Schema
The snowflake schema extends the star schema by normalizing dimension tables into multiple related tables, creating a branching "snowflake" pattern.
Core Components
- Fact Table: Similar to star schema, storing measurable events and foreign keys to primary dimension tables.
- Normalized Dimension Tables: Dimension tables are normalized. For instance,
dim_productmight link todim_subcategory, which then links todim_category. This reduces data redundancy and improves referential integrity.
Snowflake Schema DDL Example
Using the same retail sales scenario:
CREATE TABLE dim_date ( date_key INT PRIMARY KEY, full_date DATE NOT NULL, day_of_week SMALLINT NOT NULL, month_name VARCHAR(20) NOT NULL, quarter SMALLINT NOT NULL, year SMALLINT NOT NULL);CREATE TABLE dim_category ( category_key INT PRIMARY KEY, category_name VARCHAR(50) NOT NULL UNIQUE);CREATE TABLE dim_subcategory ( subcategory_key INT PRIMARY KEY, category_key INT NOT NULL REFERENCES dim_category(category_key), subcategory_name VARCHAR(50) NOT NULL UNIQUE);CREATE TABLE dim_product_snowflake ( product_key INT PRIMARY KEY, subcategory_key INT NOT NULL REFERENCES dim_subcategory(subcategory_key), product_id VARCHAR(50) NOT NULL, product_name VARCHAR(100) NOT NULL);CREATE TABLE dim_customer ( customer_key INT PRIMARY KEY, customer_id VARCHAR(50) NOT NULL, customer_name VARCHAR(100) NOT NULL, city VARCHAR(50) NOT NULL, state VARCHAR(50) NOT NULL, country VARCHAR(50) NOT NULL);CREATE TABLE fact_sales_snowflake ( sale_key BIGSERIAL PRIMARY KEY, date_key INT NOT NULL REFERENCES dim_date(date_key), product_key INT NOT NULL REFERENCES dim_product_snowflake(product_key), customer_key INT NOT NULL REFERENCES dim_customer(customer_key), quantity INT NOT NULL, price NUMERIC(10, 2) NOT NULL, total_amount NUMERIC(10, 2) NOT NULL);CREATE INDEX idx_fact_sales_snowflake_date_key ON fact_sales_snowflake (date_key);CREATE INDEX idx_fact_sales_snowflake_product_key ON fact_sales_snowflake (product_key);CREATE INDEX idx_fact_sales_snowflake_customer_key ON fact_sales_snowflake (customer_key);Aggregating sales by product category now requires joining fact_sales_snowflake to dim_product_snowflake, then to dim_subcategory, and finally to dim_category.
Step 3: Comparing Query Performance
The main performance difference lies in the number of joins. More joins generally mean more work for PostgreSQL's query planner.
Query Example
Let's compare total sales by product category for a specific year:
Star Schema Query:
EXPLAIN ANALYZESELECT dp.category_name, SUM(fs.total_amount) AS total_salesFROM fact_sales fsJOIN dim_product dp ON fs.product_key = dp.product_keyJOIN dim_date dd ON fs.date_key = dd.date_keyWHERE dd.year = 2023GROUP BY dp.category_name ORDER BY total_sales DESC;Snowflake Schema Query:
EXPLAIN ANALYZESELECT dc.category_name, SUM(fss.total_amount) AS total_salesFROM fact_sales_snowflake fssJOIN dim_product_snowflake dps ON fss.product_key = dps.product_keyJOIN dim_subcategory dsc ON dps.subcategory_key = dsc.subcategory_keyJOIN dim_category dc ON dsc.category_key = dc.category_keyJOIN dim_date dd ON fss.date_key = dd.date_keyWHERE dd.year = 2023GROUP BY dc.category_name ORDER BY total_sales DESC;Analyzing EXPLAIN ANALYZE Output
Running these queries with EXPLAIN ANALYZE reveals the query plan and execution times.
- Star Schema: Typically fewer hash join operations. Hash joins are efficient but consume memory (
work_mem). Fewer joins generally mean less memory and faster completion, especially with large fact tables. - Snowflake Schema: Requires additional joins to traverse the normalized hierarchy. Each join adds overhead. If intermediate results exceed
work_mem, performance can degrade significantly due to disk spills. The performance gap often widens with larger fact tables and lowerwork_mem.
Step 4: Optimizing for Analytics Workloads
Optimizing PostgreSQL for OLAP workloads involves specific configurations and indexing strategies.
Indexing Strategies
- B-tree Indexes: Standard for primary and foreign keys, excellent for equality and range lookups, and joining tables.
- BRIN Indexes: Ideal for large, append-only fact tables with sequentially ordered data (e.g., by date). BRIN indexes are smaller and faster to maintain than B-trees for such data, perfect for time-series filtering. Example:
CREATE INDEX idx_fact_sales_date_brin ON fact_sales USING BRIN (date_key);
PostgreSQL Configuration
Tune these parameters for analytics:
work_mem: Sets the maximum memory for query operations (sorts, hash tables) before spilling to disk. Increase this to prevent disk spills for complex multi-join queries.max_parallel_workers_per_gather: Controls parallel workers for a singleSELECTquery. Increasing this leverages multiple CPU cores for large scans and aggregations.
These are typically configured at the cluster level, often via a control panel or API in managed environments. Refer to official PostgreSQL documentation for tuning details. PostgreSQL Runtime Configuration.
Conclusion
Choosing between a star and snowflake schema for your PostgreSQL analytics database involves balancing query speed and ETL simplicity (star) against storage efficiency and referential integrity (snowflake).
Consider these factors:
- Fact Table Size: Star schema benefits are more pronounced for very large fact tables due to fewer joins.
- Query Patterns: Star schemas excel at simple aggregations; snowflake schemas may be more natural for deep hierarchical analysis but can be slower.
- ETL Complexity: Star dimensions can be simpler to load, but updates require careful handling. Snowflake dimensions are easier to maintain for updates.
- Storage Constraints: Snowflake schemas typically have a smaller storage footprint.
Start with a star schema if query speed and simpler ETL are primary. Opt for snowflake if storage and referential integrity are higher priorities. Understanding these patterns and optimizing PostgreSQL will help you build a robust analytics database. To build robust online stores that generate the kind of sales data you can analyze with these schemas, explore Yammbo Store at https://store.yammbo.com.