In this article, we’re going to model a real-world scenario using Data Vault 2.0. If you’ve never heard of Data Vault, or don’t know why you should care, here are some quick hits:
- It’s a component of an enterprise data warehouse (EDW)
- Like third-normal form (3NF) or star/snowflake schemas, it’s a way to model your data
- It’s really well-suited to auditing and providing historical snapshots, because:
- when designed properly, it contains all of the source data, just transformed
- its structure adapts elegantly as your source data models change, without loss of data
- rows are immutable (you only do inserts, never updates or deletes)
- It is not intended to be queried directly, but its tables can be extended or transformed to another format for this purpose
So, if you are dealing with large datasets from multiple sources, and you want a scalable, flexible model that provides historical snapshots of your data across all those sources, Data Vault 2.0 is worth a look.
Our (Simplified) Real-World Scenario
I only really learn how to model data if I can see a scenario I already understand, and apply it to the model. First, we’re going to build a simple OLTP model for a web app where users can search for flights. Here is the basic use case:
- A user enters a date, a departure city/state, and an arrival city/state, and clicks “Search”
- Example: 6/1/2019, from San Diego, CA to New York, NY
- The web app returns a list of matching flights, and displays:
- the airline name (“Southwest”)
- the flight number, prefixed with the airline code (“WN 1673”)
- the departure airport code (“SAN”)
- the arrival airport code (“JFK”)
- the scheduled departure date/time, in UTC (“2019-06-01 15:05:00”)
- many databases support a date/time data type that includes time zone information, which could be a good option
- the scheduled arrival date/time, in UTC (“2019-06-01 20:08:00”)
This is a simplified use case, so let’s assume that the system only shows one-way flights, and that overnight flights will be returned as long as their scheduled departure time is on the date the user entered. Note that the system may also display flights that land at LaGuardia (LGA) or Newark, New Jersey (EWR) since they are in (or near) the departure city the user specified. With that information, we can design our OLTP model.
OLTP Data Model
Here is a naïve first pass at an OLTP data model to satisfy the requirements above. It may not be the best model for the use case (a more denormalized model would be faster to search since it would avoid joins), but it’s a good model for us to use to show how we might translate it to a Data Vault 2.0 model.
By the way, the lovely diagram above was generated by https://dbdiagram.io. It’s awesome. Check it out.
I’d like to point out a few things:
- In the Airline table, AirlineCode is a unique two-character string governed by the International Air Transport Association https://en.wikipedia.org/wiki/International_Air_Transport_Association (IATA). For example, American Airlines’ IATA code is “AA,” and Southwest Airlines’ is “WN.” We use it as a natural primary key.
- In the Airport table, AirportCode is a unique three-character string, also governed by IATA. San Diego is “SAN,” Chicago Midway is “MDW.” We use it as a natural primary key here also.
- In the Flight table, FlightNumber is a one to four-digit number, stored as a string because it might have leading zeroes. You will often see it prefixed with the airline code, like “WN 1673.”
- Flight numbers are not guaranteed unique! Some airlines reuse them when they have multiple flights between the same two cities in a single day. So, you might have three Delta Airlines flights from San Diego to Chicago, one that leaves at 8 am, one that leaves at 11 am, and one that leaves at 2 pm, and they all might have the same flight number. This will be important during our Data Vault modeling exercise.
- We should add latitude and longitude to both the Airport and Location tables, to allow us to calculate the distance between airports and cities, so “EWR (Newark, NJ)” would be returned when we search for “New York, NY” because it’s within a reasonable search radius. But it’s not super-relevant to the point I’m trying to make, so I won’t clutter the model with it.
Our OLTP model is serving us well, but we need to extend it because the business now wants to know whether planes take off and land on schedule. To do this, we will extend the Flight table to include two new columns, ActualDepartureTime and ActualArrivalTime.
But now we are starting to extend the OLTP model beyond the original use case for our users, and there are reasons to put the new columns in their own table (e.g., they are updated much later than the rest of the flight data, and updates to the table might impact search performance), and reasons to keep them in the Flight table (e.g., we can easily exclude flights in the search results that are already in the air by checking if ScheduledDepartureTIme is not null, without the need to join to a separate table).
At this point, we might consider an alternative way to store the information, such as a read replica, or a more traditional data warehouse structure. We could create a star or snowflake schema with “number of minutes late” as the key piece of data in the fact table, with dimensions for time, day, airport, airline, region, and more. But we are going to step right over Mr. Kimball, perhaps giving him a swift kick to the gut as we do, so we can embrace the added complexity (and flexibility) that is Data Vault 2.0!
Data Vault 2.0 Basics
To understand how Data Vault 2.0 (DV2) might help us meet our new requirements, let’s cover some basics.
What It is
- All of your source data preserved exactly, but transformed into new tables
- A comprehensive record of everything that happened in the database, perfect for auditing
- A flexible way to represent source data, since the model can adapt when the source tables change, without loss of data or meaning
- A transitional stage in the data pipeline that usually feeds other downstream systems
What It Isn’t
- It isn’t a replacement for other data modeling techniques, but enhances them to serve a specific purpose (flexible historical record of all source data)
- It isn’t usually queried directly, but can be extended or transformed to meet downstream needs
Key Concepts: Hubs, Links, and Satellites
Hubs, Links, and Satellites are the three primary table types used in Data Vault 2.0, each with a clear purpose, and rules governing their structure.
Hub tables contain unique “business keys.” A business key is any value that has meaning for the business (so, a natural key, not a surrogate key). Our Airport table above has a unique three-letter airport code as its primary key. That’s a perfect business key because it is unique and has inherent meaning to the business.
For our Flight table, the unique business key is a combination of the AirlineCode, the FlightNumber, and the ScheduledDepartureTime. Since flight numbers can be shared by different flights (even for the same airline), we need all three pieces of data to uniquely identify a flight with information that makes sense to the business.
All Hub tables (and in fact all Link and Satellite tables too) contain two important columns:
- LoadDateTime, which stores the date and time (timestamp) the record was initially loaded
- This might be stored in UTC time, or if the database supports it, in a data type that stores date, time, and time zone offset.
- RecordSource, which tells us where the data came from
- Part of the beauty of Data Vault 2.0 is that it can provide an audit trail of data stored in many source systems. So, a record source might be an OLTP database/schema/table (e.g., “FlightSearchDb.dbo.Airport”), a streaming system (e.g., “Kafka.<host>.<port>.Topic”), or something else entirely.
All Hub tables also have a primary key that is a hash of its business key columns. In our model, all hash keys will be suffixed with -HashKey, and are char(64) columns because we use SHA-256 as our hashing algorithm. Here’s why:
- Historically, MD5 and/or SHA-1 were used, but both have been deprecated; SHA-256 is more secure, and less likely to have collisions
- However, 64 characters is a lot for a key, MD5 is secure enough for these kinds of models, and 32 characters may be sufficient to avoid collisions; you will have to weigh these options for your model
- Hashed keys allow us to determine whether data has changed without joining to another table to do a comparison
- However, hashed keys can confuse query optimizers since they are non-sequential; there are workarounds for this, but they are not discussed in this article
- Hashed keys play nice with Hadoop
Here is our simplest hub table, for an airport:
- HubAirportHashKey is our primary key, which in this simple example is just a hash of the airport code, since airport codes are unique.
- LoadDateTime is the initial date and time when the record was loaded. We will assume the database allows us to store the time zone offset in the column. Precision of this column is generally to the whole second, but can be greater if desired.
- RecordSource tells us where the data came from.
- AirportCode is the familiar three-letter airport code, and is our “business key.” It is a unique natural key with business meaning.
And here is our slightly more complicated FlightHub table, which requires three different columns to represent the unique business key (the airline code, the flight number, and the scheduled departure time). In this case, our HubFlightHashKey is a hash of all three business key columns:
Important: to ensure data integrity and flexibility are retained, Hub tables should NEVER have foreign keys!
Link tables are pretty simple. They link two (or more) Hubs together, because they represent the relationship between two (or more) business keys. They always act as the join table in a many-to-many relationship, for flexibility.
Quick sidebar: in an OLTP data model, have you ever had to decide between using a one-to-many or a many-to-many relationship, and gotten stuck? I have… Here’s an example: I needed to store cities and counties because users could search for a city and see its county, or search for a county and see all the cities within it. Simple: a county can have many cities, so I had a City table and a County table, and a foreign key to the CountyID in the City table. Then I found out that a city can be in more than one county. The city of Dallas, Texas is in FIVE counties! So now, to support my use cases, I had to add a CityCounty join table and turn it into a many-to-many relationship. This happens a lot in relational data modeling… we avoid the join complexity of a many-to-many relationship because we don’t think we’ll need it, but then we do. Data Vault 2.0 recognizes this, and makes every relationship between two Hubs a many-to-many relationship, with the Link table as the join table. That way, support for storing and auditing multiple counties for a single city is baked in from the beginning.
Link tables always have a foreign key to each Hub table’s primary key (which are hashes). Link tables also have their own hashed primary key, which is a hash of the two hashed foreign keys! In Data Vault 2.0, the recommendation is for Link tables to include the business keys from each of its associated Hub tables, to simplify lookups. As mentioned before, all tables have a LoadDateTime and a RecordSource, so here’s what our Link table looks like that joins the airport and flight Hub tables:
We are linking a flight to an airport, which makes sense because it represents a meaningful business relationship (flights always take off from an airport, and land at an airport, or at least we hope they do).
So, what do we have?
- We have our primary key (LinkFlightAirportHashKey).
- We have our standard columns (LoadDateTime and RecordSource).
- We have our foreign keys to both Hubs (HubFlightHashKey and HubAirportHashKey).
- We have the unique business keys for both Hub tables duplicated in the Link table.
We could do something similar for the Airline OLTP table… create a HubAirline table, and a LinkFlightAirline table. But notice that we already have the AirlineCode in the HubFlight table, so it’s not even necessary to create a HubAirline table! We just have to make sure that all of the data in the Airline table goes somewhere. We’ll see where when we talk about Satellites. And remember that the whole point of this approach is flexibility. If we need Hub and Link tables to associate flights with airlines down the road, we can just add them!
Important: to ensure data integrity and flexibility are retained, Link tables should NEVER have their own unique business keys, or begin and end dates!
Satellite tables are the closest thing to a traditional data warehouse table in DV2, and are the most important part of a DV2 model. They capture ALL changes to descriptive data over time, and in that regard, they are very similar to Type 2 slowly-changing dimensions. Satellite tables contain all of the columns from the source table (or tables) we haven’t yet put somewhere else, and they can be associated with a Hub or a Link table. Let’s take a look at a Satellite table and its parent Hub. On the left are the source tables from the OLTP database. On the right, the DV2 Hub and Satellite:
Every column in the tables on the left is represented in the tables on the right, including the LocationId surrogate key (since it describes the relationship between Airport and Location). In this example, we’ve combined the columns from two source tables into one Satellite table, since the data in all of those columns doesn’t change very often. Sure, airport names can change (and city names can too) but the changes are infrequent, and orthogonal to the primary flight data being captured and audited. So we can safely group them together for our Satellite.
Important! Just as we would do for Kimball dimensional modeling, we want the data in our Satellite tables to have the same granularity. So it’s a good practice to group Satellite data by its rate of change. And it’s fine to have multiple Satellites connected to the same Hub, especially if the data in each Satellite changes at different rates. We’ll see that in action shortly. First, let’s dig into our Satellite table columns:
This column is called HubAirportHashKey because it is a foreign key to the HubAirport table. It is also the ONLY foreign key that should ever exist in a Satellite table, for data integrity and flexibility.
The key doesn’t change very often, but the Satellite table data might, so we need another column as part of a composite key, which brings us to…
The LoadDateTime column is perfect as a composite key with the HashKey, and it works exactly like a Type 2 slowly-changing dimension (foreign key + unique date/time). As with Hubs and Links, it contains the date and time the record was added to the Satellite table.
The Satellite table is our warehouse table, so we want a way to indicate date ranges during which a record was valid. To do this, we have an EndDateTime in every Satellite table. When a new record is added, the EndDateTime is NULL. When another new record is added after that, the previous record (the one with a NULL end time) is updated so that its EndDateTime matches the LoadDateTime of the new record exactly. This ensures there is always a valid date range associated with every record.
As with all other tables, we capture the source of the record here. Spoiler alert: in the next Satellite table we create, the source will be different!
We already talked about grouping Satellite fields by their rate of change. And we expect our Satellite table data to change over time. We can simplify our data loading process if we hash all of the descriptive columns (in this case, AirportName through CountryCode in the image above) of the incoming data and compare it to the hash of the same data in the most recent row of the Satellite table. If they are the same, we don’t need to add a record to the Satellite table, because nothing has changed! If they differ, we add a new record for the same HubAirportHashKey, but with a new LoadDateTime (for our composite key).
All of these descriptive (or non-metadata) fields are NULLABLE, because we might add columns to (or drop columns from) the source database at any time. If a column is dropped from a source table, we would still need it in the DV2 model to access the historical data. It would just be NULL in the DV2 table starting on the date it was dropped from the source table.
Fun Fact: a Satellite table can connect to a Link table, if the Satellite table is comprised of descriptive data about the association/relationship of the two Hubs that the Link joins.
With all of that, let’s take a look at our SatelliteFlight table, with its Hub. On the left is our source Flight table, and on the right, our DV2 tables:
While most of this is similar to our previous example, there are a few things to highlight:
- The SatelliteFlight table contains the FlightId from the source table; it’s a surrogate key, so it doesn’t belong in the Hub table. But we have to put ALL of our source columns somewhere, and the Satellite is the only other place for it to go. As mentioned before, retaining surrogate keys like FlightId is useful because they describe a relationship in the source tables, and an audit would be incomplete without them.
- There is an AirlineName column, but where did it come from? We pulled it in from the Airline source table, because we didn’t create a Hub or Satellite for the Airline table (we didn’t need one), and ALL source columns have to go somewhere. All flights are associated with an airline, so the SatelliteFlight table is a good fit.
- We have duplicated the ScheduledDepartureTime in the SatelliteFlight table (it is part of our unique business key in the HubFlight table). We don’t have to do this, but it makes lookups a bit easier. The purpose of a DV2 model is to serve your auditing and downstream needs, so outside of the hard and fast rules, feel free to extend it to make your life easier.
- The ActualDepartureTime and ActualArrivalTime columns are not in the SatelliteFlight table. But wait, didn’t you just say all source columns have to go somewhere? And don’t worry, they will. But they are updated out-of-band with the rest of the data in SatelliteFlight, so we are going to put them somewhere else.
- The data in SatelliteFlight is added when an airline registers a new flight with the FAA, but it shouldn’t change after that (although in rare circumstances, it could)
- The ActualDepartureTime only changes once, at the moment a flight takes off
- The ActualArrivalTime only changes once, at the moment a flight lands
BUT WAIT! What’s this? A late breaking change from the client?
Now, in addition to the actual departure and arrival times, they want to capture the estimated departure and arrival times too? These kinds of changes are common, so let’s flesh out the scenario and see how we might handle it.
The client says they just bought some fancy new technology that receives streaming updates directly from the plane every minute. It updates the estimated departure time while the plane is still on the ground, and the estimated arrival time while the plane is in the air. Changes in weather or other unexpected events might significantly affect the estimated departure or arrival time, so we might see a scenario like this (assuming our scheduled arrival time is 5:00 PM):
Over the course of the flight, we estimated that we might have been 12 minutes early, or 16 minutes late, but ultimately we got there right on time. The client really wants to track this information over time to see if they can spot helpful trends. We’ll be receiving data from a new source (the data stream from the plane), so we will need to capture that in the RecordSource column.
There are a few different ways to address this new requirement. We could store the estimated departure and arrival times in the SatelliteFlight table, but that would go against our rule of keeping columns together that have the same rate of change (SatelliteFlight columns seldom change; estimates may change frequently during a flight). We could create separate Satellite tables for departure estimates and arrival estimates, but that feels like unnecessary complexity for how the columns change. So here is a possible solution:
In addition to storing the estimated times, we store the actual times in separate columns. This is because the actual departure and arrival times are valuable pieces of data on their own. By having the ActualDepartureTime be NULL until the flight takes off, and the ActualArrivalTime be NULL until the flight lands, we have more flexibility in how we can query the data.
Another option we have is to add numeric columns for DepartureMinutesLate and ArrivalMinutesLate, which would be calculated by subtracting the ScheduledDepartureTime from the ActualDepartureTime, and the ScheduledArrivalTime from the ActualArrivalTime. The columns would store “zero” if the flight was on time, a positive integer if the flight was late, and a negative integer if the flight was early. There’s nothing wrong with adding those to the SatelliteFlight table and sending them downstream, but it’s also fine to leave that calculation for later in the data pipeline.
We’ve also added a ReasonForNewEstimate column, to capture why an estimated departure or arrival time changed. We are assuming this will be provided by the data stream from the plane, which is a pretty bold assumption.
So that’s our Data Vault 2.0 model. It’s one way to organize the data. There’s probably a better way. But we’ve at least thought through some of the implications, and we’ve followed the overall DV2 modeling guidelines.
Summary of Rules
We’ve covered a lot, so here is a quick summary of important rules to remember when modeling with Data Vault 2.0:
- Every column in the source database should be represented somewhere in the Data Vault 2.0 model
- Every table will have LoadDateTime and RecordSource; Satellite tables also have an EndDateTime, to capture the range during which a record was valid
- Besides the HashKey, LoadDateTime, and RecordSource, Hubs only have natural key(s)
- Hubs never have foreign keys!
- Links only exist to connect Hubs, and should always act like the join table in a many-to-many relationship
- Links should never have natural keys or date ranges
- Satellites should be separated based on the rate of change of their column data; keep columns that change infrequently in one Satellite, and columns that change often in another
- Satellites should never have foreign keys (except to the Hub parent table), and they should only have one parent Hub!
This article just scratches the surface of what Data Vault 2.0 can do. We didn’t cover more advanced topics like Point-in-Time tables and Bridge tables, but the resources below are a good place to learn more. Thanks for reading!