Dimensional Modeling

Palash Chaudhari
3 min readDec 29, 2023

--

Dimensional modeling, as part of the Business Dimensional Lifecycle methodology developed by Ralph Kimball, is an approach to data warehouse design that encompasses a set of methods, techniques, and concepts. The primary goal of dimensional modeling is to structure and organize data in a way that facilitates efficient querying, reporting, and analysis within a data warehouse.

Benefits of Dimensional Modeling

  • Business User-Friendly Design: Dimensional modeling aims to create a structure that is intuitive and easily understandable for business users, allowing them to navigate and analyze data without requiring extensive technical knowledge.
  • Optimized Query Performance: The design of dimensional models is geared towards optimizing query performance, ensuring that analytical queries can be executed efficiently to support decision-making processes.
  • Iterative and Collaborative Development: The Business Dimensional Lifecycle methodology, including dimensional modeling, encourages an iterative and collaborative development process. It involves ongoing communication and feedback between technical teams and business users to refine and enhance the data warehouse design.

Tables in Dimensional Modelling:

  • Dimension Tables: These tables contain descriptive attributes that provide context to the data in fact tables. Dimension tables help in filtering, grouping, and labeling the data. Examples of dimensions include time, geography, products, and customers.
  • Fact Tables: These tables hold quantitative, numeric data representing business metrics or measures. Fact tables are typically the focus of analysis and reporting and contain foreign key references to related dimension tables.
  • Star Schema or Snowflake Schema: The organization of tables follows either a star schema, where a central fact table is connected to dimension tables, resembling the shape of a star, or a snowflake schema, where dimension tables may be further normalized into sub-dimensions.

Dimension Tables:

Slowly Changing Dimensions (SCDs) are a concept in data warehousing that refers to the way dimensional attributes change over time. There are several types of Slowly Changing Dimensions, denoted as Type 0, Type 1, Type 2, and Type 3, each handling changes to dimension data in a different manner.

  • Type 0: In Type 0 SCD, there is no tracking of historical changes. Once a dimension record is established, it remains fixed and does not change over time. This type is suitable when historical changes are irrelevant, and the dimension is considered static.
  • Type 1: Type 1 SCD involves updating the existing dimension record with new information when a change occurs. This approach overwrites the old data with the new data, and there is no preservation of historical values. It’s a simple and efficient method but doesn’t retain a history of changes.
  • Type 2: Type 2 SCD addresses the need to track historical changes. When a change occurs, a new record is added to the dimension table with a new surrogate key, preserving the historical state. This allows for analysis over time but can result in a larger dimension table.
  • Type 3: In Type 3 SCD, a dimension table includes columns for both the current and previous versions of an attribute. When a change occurs, the current attribute is updated, and the previous attribute is maintained in a separate column. This approach strikes a balance between historical tracking and table size.

Fact Tables

Facts are something that happens. Facts are not slowly changing dimension which makes them easier to model than dimension tables.

What makes Fact table design hard?

  1. Fact data is usually much bigger (10–100 times) than dimension data. For example, Facebook has 2 billion users and sends 50 billion notifications daily. The dimension table tracks 2 billion users, while the fact table shows 50 billion user interactions.
  2. Analyzing fact tables effectively requires a lot of context.
  3. Fact tables often have more duplicate entries than dimension tables.

How to work with high volume Fact data:

  1. Sampling:

Sampling in data is choosing a smaller, representative portion from a larger dataset for analysis. It helps save time and resources while providing insights into the entire dataset. Different methods exist, like random or stratified sampling, chosen based on research goals. Sampling is widely used in statistics, market research, and data analysis to draw conclusions from a subset of data.

  1. Data Bucketing: When working with a large set of data, data bucketing involves grouping continuous values into discrete intervals. This simplifies the data and can make it more manageable for analysis.
  • Organizing data based on user IDs.
  • Bucketing joins are often quicker than shuffle joins.
  • Sort-Merge bucket (SMB) can perform joins without the need for shuffling.

--

--

Palash Chaudhari

A professional Data Engineer who helps data to reach its destination.