Perl ETL: Surrogate Key Lookup Caching

Here’s a tip I picked up from a co-worker (thanks Matt H.) a few years back during a big ETL rewrite he and I were doing for the company I worked for. First off, I don’t recommend doing ETL in Perl. Perl is interpreted, which makes it a little slower, and it also can become a maintenance nightmare. I love Perl but coding ETL in Perl is exhausting. I’d recommend using something like DataStage, Informatica, or Data Integrator instead. Those programs are expensive though, and if your company is on a tight budget they might not spring for any of those tools. Perl is free and that really is one of the cheapest ways to go about this. We’ll proceed under the premise that you’re left with no other options.

This code snippet can be used for a fact table load job. I’m presuming a working knowledge of Perl, ETL, SQL, and dimensional modeled (star schema) databases here. This also presumes you have knowledge of using Perl’s DBI to structure queries against a database. If not, then you’ll probably want to get those basics first.

So here’s the set up; you have some kind of source (most likely either a flat file or a database staging table) that contains rows of data that need to be inserted or updated in your data warehouse’s fact table. These rows contain one or more alternate keys from the source OLTP system, along with one or more facts for the target fact table. Dimension table loads have already been completed so what we need to do now is lookup the surrogate key for each row’s alternate key before we attempt an insert or update against the fact table.

At its basic level what we’ll do in this situation is go to the dimension table, lookup our surrogate key in the warehouse dimension table using the source record’s alternate key, and then use that surrogate key when we attempt an insert or update. This works, but all those database lookups can be slow. For instance, there might be some network latency between Perl and your database or you might be looking up against a “monster dimension” that’s not properly indexed. Maybe you have fifteen or twenty surrogate keys to look up for your fact table record or maybe your database server is just plain slow. Regardless of the reason you might find yourself waiting longer than you’d like while Perl looks up all those surrogate keys each and every time for every single record you need to load into your target fact table.

The trick here is to avoid this lookup, whenever possible, by using a local, in memory cache. The approach is to use a hash array to hold your surrogate key once you’ve looked it up so you don’t have to hit the database for it again. This won’t necessarily be a time saver when you have a lot of distinct dimensions, but it can save time when looking up against smaller dimension tables where the same key is used for many fact records. For example, let’s say you have a large number of fact records that require a product code surrogate key lookup. Since there are usually a smaller, finite number of product keys in a dimension table then we could save time by caching items we’ve already looked up since we can expect to do a lot of duplicate lookups. The same argument could be made for dates, locations, or times.

The approach is simple; first we check our cache to see if the alternate key exists. If it does, we use the value stored in the hash during our fact table insert or update. We skip the database lookup. If it doesn’t, we go look it up in the database, store it in the local hash, then use the value in the hash when we do our insert or update. The hash lookup is all in memory and will normally be faster than hitting the database for the same information. As long as your hash doesn’t exceed the local machine’s physical memory you can do this for most or all of your surrogate key lookups. How many you use it for is up to you and will be based on the type of data in your dimension table.

Of course you’re using the “strict” pragma.

Code:
use strict;

First we’ll want to instantiate an empty hash array to serve as our surrogate key cache.

Code:
my $surrogateKeyCache;

Then, during our loop through the source records (either from reading a flat file or from a staging table) we’ll want to perform our lookup in the hash before we try the database. I’m wrapping this in an if statement so that if the value isn’t there we can go lookup the surrogate key from the database. If we do find it then we just skip all the logic used to lookup up the key in the database and go directly to the insert or update.

Code:
if ( ! $surrogateKeyCache{$alternateKey} ) {

(The $alternateKey scalar has already been set during our loop through the source record.) If we can’t find the alternate key in the hash we’ll then go lookup our surrogate key in the database. In this example we’ve already created a SQL statement handle to perform the lookup, called $getSurrogateKey_sth.

Code:
$getSurrogateKey_sth->execute($alternateKey);

We’ll run a fetchrow_array against the statement handle to get our key back. This presumes only one key was returned (which a good lookup query should do anyway). You can probably do this with a fetchrow as well.

Code:
my $surrogateKey = $getSurrogateKey_sth->fetchrow_array();

If we don’t find it then we need to perform some kind of error handling. This is up to you, the ETL developer, to decide. You could, for instance, set the surrogate key’s value to a “no dimension” dimension record, log the error, etc.

Code:
if ( ! $surrogateKey ) {
    # Error handling here
    $surrogateKey = -1;
    print LOGFILE "Unable to retrieve surrogate key\\n";
}#end if

Now, presuming we found the key, we’ll populate the hash with the key/value pair from the lookup and close up our if statement.

Code:
    $surrogateKeyCache{$alternateKey} = $surrogateKey;
}#end if

Once we’ve dropped out of the if statement we’ve either found the key in the hash or looked it up in the database (or set it during error handling). We’ll then execute the insert or update statement to load the record into the target fact table.

Code:
$insert_sth->execute($surrogateKeyCache{$alternateKey},$qtyFact,$costFact);

Perl’s hash arrays are a great way to accomplish the caching required for this approach. They live entirely in memory (so lookups are fast) and are comprised of a key/value pair. This is perfect for caching alternate key/surrogate key pairs from your warehouse. If you have to code your ETL in Perl then hopefully this might increase the efficiency of your code and give you a little bump in performance.


2 Responses to “Perl ETL: Surrogate Key Lookup Caching”

  1. smita Says:

    Which type of value is stored by the surrogate key in a data warehouse?

  2. Brian Says:

    If you mean which data type is used to store surrogate keys then most often we see integers used. Sometimes in the case of the date dimension we’ll see the actual date itself being used (Kimball teaching is leaning this way now). I’ve also seen GUIDs used but they take up a lot more room and don’t normally bring a lot to the table. Personally I always use integers.

    As far as what the keys mean, barring the date dimension they’re only unique values within the table. There is no meaning assigned; they’re just unique keys to identify the record.

Leave a Reply