Skip to Content

early arriving facts

good day everyone,

what's the best or most common way to handle the case when the fact data arrive early than dimension data.  how do you handle for instance a sales fact row that salesmanid does not yet exists on salesman dimension table?

Thanks in advance.

Add comment
10|10000 characters needed characters exceeded

  • Get RSS Feed

1 Answer

  • Jan 14, 2015 at 10:50 AM

    The approach hasn't changed for over more than 10 years:

    Add comment
    10|10000 characters needed characters exceeded

    • I don't do dummy rows in dimension tables. I add extra columns to my facts table, one for each natural foreign key from the source.

      When loading the facts table, do the normal processing, i.e. lookup the surrogate key from the dimension (or use an outer join). For every dimension record that does not exist (the loookup returns null), store the natural key in the corresponding column.

      Build a 2nd dataflow that uses the facts table both as source and target. Make sure all normal row types are changed into updates (map operation, autocorrect load). Select only those records that have a not-null value in at least one of the extra columns. Do a lookup of the surrogate key for every not-null natural key.

      Run that datflow at any later point in time:

      1/. The dimension record has been added in the meantime: the lookup will return the natural key; put in its place and blank out the extra column.

      2/. The dimension record is not there yet: leave your fact record unchanged and cross your fingers, because it may arrive late and you can process it next time.

      There's almost no processing overhead, assuming the number of early arriving facts is low.

      There's almost no storage penalty when using column store tables (HANA 😉).