When populating dimension table records, I use Integer Identity columns to generate meaningless surrogate primary keys. I usually also create default “catch-all” records for “No Value Specified” or “Invalid Value Specified”
In other words let’s say my source data for a Gender Dimension looks like this:
Male |
Female |
<null> |
Male |
3 |
I would want to have a dimension table that looks like this:
-1 |
Invalid |
0 |
No Gender |
1 |
Male |
2 |
Female |
The way I usually accomplish this with SSIS is with a Execute SQL Task immediately before the Data Flow task:
The Execute SQL Task will have code like this:
SET IDENTITY_INSERT dbo.dim_Gender ON
INSERT dbo.dim_Gender (GenderID, GenderDescription) VALUES (0, ‘No Gender’)
SET IDENTITY_INSERT dbo.dim_Gender OFF
which injects the default record.
Then we pull directly from the source column, usually with a SELECT DISTINCT, and whatever translation or cleansing code is necessary in the Data Flow task.
An interesting thing was happening in a recent test: The results we were getting were:
0 |
No Gender |
2 |
Male |
3 |
Female |
In this case we were only inserting the one default record; we were not inserting a default “Invalid” dimension record, so I would have expected to see:
0 |
No Gender |
1 |
Male |
2 |
Female |
Two points:
1. These are meaningless surrogate keys. We never refer to these values anywhere, so technically it shouldn’t matter what the actual Primary Key values are
2. There are other ways to accomplish this type of operation, for example with a Derived Column or with a Union All SSIS DataFlow object
Nevertheless I decided I wanted to make the dimension record key values populate as I expected them to. The issue is with how SQL Server manages Identity values.
There are some good blog entries that cover this in greater detail here and here:
This is the code I ultimately wound up using was:
SET IDENTITY_INSERT dbo.dim_Gender ON
INSERT dbo.dim_Gender (GenderID, GenderDescription) VALUES (0, ‘No Gender’)
SET IDENTITY_INSERT dbo.dim_Gender OFF
GO
DECLARE @nextident INT
SELECT @nextident=MAX(IDENTITYCOL) FROM dim_Gender
DBCC CHECKIDENT (dim_Gender, RESEED, @nextident )
This last 3 lines force SQL to reseed the Identity Value. This code makes several assumptions:
1. That I have exclusive use of the table – there are not other people inserting/deleting simultaneously as I am running
2. That my incremental value is 1, and we are incrementing, not decrementing (Did you know that you can decrement an Identity column by specifying a negative incremental value?)
This accomplishes what I want, but strikes me as a bit clunky. Have you done this or even better, have you come up with a cleaner approach?