一、General
1. Concept
DM / Dimensional Modeling / 维度模型 |
The process and outcome of designing logical database schemas created to support OLAP and data warehousing solutions. |
Dimensional data structure |
The target of the ETL process; it includes fact tables, dimension tables, and surrogate key mapping tables. |
Dimension / 维 |
Descriptive attributes used to constrain and label queries, e.g. CCY, region, customer, date, gender. A dimension table holds the data that describes the facts; it is typically a denormalized flat table whose data seldom changes. |
Fact / 事实 |
Business measures. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables. |
Metadata 元数据 |
All the information in the data warehouse that is not the actual data itself. |
grain / granularity / hierarchy / 粒度 |
Fine grain: e.g. the number of individual deposit and withdrawal records; coarse grain: e.g. assets and liabilities. |
2. Flow
Identify reporting grain;
Identify dimensions that apply to each fact table;
Identify measures that will populate the fact tables.
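The three steps above can be sketched with a tiny star schema. This is only an illustration using Python's built-in sqlite3 module; the table and column names (dim_date, dim_region, fact_sales) and the sample rows are assumptions, not taken from any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 2: dimensions (descriptive attributes for constraining and labeling).
CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
CREATE TABLE dim_region (region_key INTEGER PRIMARY KEY, region_name TEXT);

-- Step 1: the grain here is one row per date per region.
-- Step 3: quantity and amount are the measures.
CREATE TABLE fact_sales (
    date_key   INTEGER REFERENCES dim_date(date_key),
    region_key INTEGER REFERENCES dim_region(region_key),
    quantity   INTEGER,
    amount     REAL
);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(1, "2010-01-13", "2010-01"), (2, "2010-01-14", "2010-01")])
conn.executemany("INSERT INTO dim_region VALUES (?, ?)",
                 [(1, "North"), (2, "South")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, 20.0), (1, 2, 1, 10.0), (2, 1, 3, 30.0)])

# Measures come from the fact table; labels come from the dimension table.
totals = conn.execute("""
    SELECT r.region_name, SUM(f.quantity), SUM(f.amount)
    FROM fact_sales f
    JOIN dim_region r ON r.region_key = f.region_key
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
print(totals)
```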
二、Dimension
1. Models
star dimension model |
1 fact to many dimensions |
snowflaked dimension model |
1 fact to many dimensions/bridge tables; 1 dimension/bridge to many subdimensions (fact-dim, dim-subdim). |
parent-child |
One field has a 1-to-many relationship with another field in the same dimension table (seldom seen). |
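A parent-child dimension can be pictured as rows that point at other rows of the same table. The sketch below assumes a hypothetical employee dimension in which manager_key refers back to the same table; all names and data are made up for illustration.

```python
# Hypothetical employee dimension: manager_key points at another row
# of the same table (None marks the top of the hierarchy).
dim_employee = {
    1: {"name": "Ann", "manager_key": None},
    2: {"name": "Bob", "manager_key": 1},
    3: {"name": "Cid", "manager_key": 1},
    4: {"name": "Dee", "manager_key": 2},
}

def path_to_root(key):
    """Return the names from the given member up to the top of the hierarchy."""
    path = []
    while key is not None:
        row = dim_employee[key]
        path.append(row["name"])
        key = row["manager_key"]
    return path

def children_of(key):
    """Return the keys of all rows whose parent is the given key."""
    return sorted(k for k, row in dim_employee.items() if row["manager_key"] == key)

print(path_to_root(4))
print(children_of(1))
```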
2. Points
1. In contrast to a fact table, dimension tables are usually small and change relatively slowly (because a DIM establishes one-to-many relationships with facts, changes to a DIM force the OLAP cube to be rebuilt, so I believe SCDs (especially types 1 and 3) mainly occur in the fact table).
2. Dimension tables are seldom keyed to date.
3. Solutions for a rapidly changing or large DIM: 1) split it (e.g. separate the rapidly changing part, such as demographics); 2) treat it as a fact (no foreign key is based on it; this may be an old-fashioned solution).
4. Not every dimension needs to be created, as that may lead to too many foreign keys. It depends on the query needs.
5. Kimball suggests using surrogate keys (1, 2, 3 …) instead of date values (20100113, 20100114 …) as the key of the time dimension.
6. A multidimensional model is usually stored within a relational database.
3. Main types
1) Conformed dimension
Normal dimension, cuts across many facts.
2) Junk dimension
Combines several low-cardinality flags and attributes into a single dimension table rather than modeling them as separate dimensions. The attributes are not closely related.
A junk dimension is simply miscellaneous data that does not fit in any base dimension and is therefore stored in a separate table.
Characteristics: 1. How the attributes are grouped depends on their correlation. 2. It holds what remains after the obvious dimensions have been identified (of little value, yet a pity to discard).
Function: reduces the number of foreign keys in a fact table.
Reason: if every yes/no flag is represented as a single-level hierarchy dimension, you may end up with 30 or more foreign keys for one fact table. Clearly, this is an overly complex (cluttered) design.
e.g. comments, yes/no, true/false. To handle a situation with more than 30 dimension attributes, combine them or use a junk dimension.
Low cardinality (低基数) means the column has very few distinct values, e.g. gender. If an index is needed on such a column, a bitmap index is the better choice.
Ideally, we'd keep the size of the junk dimension to less than 100,000 rows.
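One way to build a junk dimension is to enumerate every combination of the low-cardinality flags and give each combination a surrogate key, so the fact table carries one foreign key instead of several. The flag names and values below are hypothetical, and pre-generating all combinations is only one option (another is to insert combinations as they appear in the source data).

```python
from itertools import product

# Hypothetical low-cardinality flags that would otherwise each need a foreign key.
flags = {
    "is_gift": ["N", "Y"],
    "is_online": ["N", "Y"],
    "payment_type": ["cash", "card", "voucher"],
}

# One junk-dimension row per combination, keyed by a surrogate key starting at 1.
names = list(flags)
junk_dim = {}
for key, combo in enumerate(product(*flags.values()), start=1):
    junk_dim[combo] = key

def junk_key(**attrs):
    """Look up the single foreign key that replaces the individual flags."""
    return junk_dim[tuple(attrs[n] for n in names)]

print(len(junk_dim))  # 2 * 2 * 3 combinations
print(junk_key(is_gift="N", is_online="Y", payment_type="card"))
```

The row count is the product of the flag cardinalities, which is why the attributes should stay low-cardinality if the table is to remain small.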
3) Role-playing dimension / dimensional roles
A dimension attached multiple times to the same fact table (e.g. COL_CLARK and COL_MGR both map to the same employee dim). The solution is one table with multiple views.
When one dimension can be referenced by multiple fact tables, should we build several dimension tables or reference the same one?
Kimball's answer is to build one dimension table and derive multiple views from it.
e.g. a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery", or "Date of Hire".
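The "one table, multiple views" idea can be sketched with sqlite3 as follows. The view and column names (dim_sale_date, dim_delivery_date, fact_order) are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);

-- One physical dimension table, one view per role.
CREATE VIEW dim_sale_date AS
    SELECT date_key AS sale_date_key, full_date AS sale_date FROM dim_date;
CREATE VIEW dim_delivery_date AS
    SELECT date_key AS delivery_date_key, full_date AS delivery_date FROM dim_date;

CREATE TABLE fact_order (
    order_no          TEXT,
    sale_date_key     INTEGER,
    delivery_date_key INTEGER,
    amount            REAL
);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                 [(1, "2010-01-13"), (2, "2010-01-14")])
conn.execute("INSERT INTO fact_order VALUES ('A-1', 1, 2, 99.0)")

# Each role is joined through its own view, so column names never clash.
row = conn.execute("""
    SELECT f.order_no, s.sale_date, d.delivery_date
    FROM fact_order f
    JOIN dim_sale_date s     ON s.sale_date_key = f.sale_date_key
    JOIN dim_delivery_date d ON d.delivery_date_key = f.delivery_date_key
""").fetchone()
print(row)
```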
4) Degenerate dimension (DD,退化维)
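Definition: a dimension key in the fact table that does not have its own dimension table. It is something we want to keep in the fact table but that is not a measure.
1. A degenerate dimension supports the same operations as an ordinary dimension, such as roll-up, slice, and dice.
2. When degenerate dimensions exist, the ETL process becomes easier.
3. It can make operations such as GROUP BY faster.
e.g. order no, ticket, credit card transaction, check no.
Use it when the fact and the dimension would have the same granularity and joining them would be a huge join; keeping the key in the fact table gives better performance.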
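As a small illustration, the order number below lives directly in the fact table and is grouped on without any join. The fact_order_line table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# order_no is a degenerate dimension: a key in the fact table with no dimension table.
conn.execute("""
    CREATE TABLE fact_order_line (
        order_no    TEXT,
        product_key INTEGER,
        quantity    INTEGER,
        amount      REAL
    )
""")
conn.executemany("INSERT INTO fact_order_line VALUES (?, ?, ?, ?)",
                 [("SO-1", 10, 1, 5.0), ("SO-1", 11, 2, 8.0), ("SO-2", 10, 4, 20.0)])

# Roll up to order level straight from the fact table; no join is needed.
per_order = conn.execute("""
    SELECT order_no, COUNT(*), SUM(amount)
    FROM fact_order_line
    GROUP BY order_no
    ORDER BY order_no
""").fetchall()
print(per_order)
```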
5) Slowly Changing Dimensions (SCDs, 缓慢维度变化)
A dimension that changes over time. There are 3 types:
SCD 1 |
Overwrite old data with new data; (overwrite, master table, in-place update) |
SCD 2 |
Tracks historical data by creating multiple records in the dimension table (partitioning history, ACDH, row, history table), e.g. with an effective date. Surrogate keys required! Example rows (id, effect_dt, active_ind): (id1, 2006-09-29, N), (id1, 2006-12-02, Y). |
SCD 3 |
Tracks changes using separate columns, so the history is limited by the number of columns we design (separate column, ACDM, column, rarely used). |
* SCD types can be combined (hybrid SCD).
* SCDs can be handled with SQL MERGE, ROW_NUMBER, etc.
* SCD 2 is the most popular. In short: SCD 1 updates in place, SCD 2 inserts a row, SCD 3 adds a column.
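A minimal SCD 2 sketch in plain Python, reusing the id / effect_dt / active_ind columns from the example rows above. The city attribute, the sk surrogate key column, and the apply_scd2 helper are assumptions added for illustration; a real load would more likely use SQL MERGE or an ETL tool.

```python
from datetime import date

# Hypothetical customer dimension: sk is the surrogate key, id is the business key.
dim_customer = [
    {"sk": 1, "id": "id1", "city": "Paris",
     "effect_dt": date(2006, 9, 29), "active_ind": "Y"},
]

def apply_scd2(dim, business_id, new_city, change_dt):
    """Expire the active row and insert a new row when the tracked attribute changes."""
    current = next(
        (r for r in dim if r["id"] == business_id and r["active_ind"] == "Y"),
        None,
    )
    if current is not None and current["city"] == new_city:
        return  # nothing changed, keep the active row
    if current is not None:
        current["active_ind"] = "N"  # history is kept, not overwritten
    next_sk = max((r["sk"] for r in dim), default=0) + 1
    dim.append({"sk": next_sk, "id": business_id, "city": new_city,
                "effect_dt": change_dt, "active_ind": "Y"})

apply_scd2(dim_customer, "id1", "Lyon", date(2006, 12, 2))
apply_scd2(dim_customer, "id1", "Lyon", date(2007, 1, 1))  # no change, no new row

for r in dim_customer:
    print(r["sk"], r["id"], r["city"], r["effect_dt"], r["active_ind"])
```

The surrogate key is what lets two rows share the same business key, which is why SCD 2 requires it.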
4. Other classifications
Big dimension |
(e.g. commercial customer, customer profile table) often has millions of records and a hundred or more fields in each record |
Small dimensions |
(e.g. transaction type) are often unique to each source system and thus do not need to be conformed |
multilevel hierarchies |
e.g. country (1st col.), state (2nd col.), city (3rd col.), postal code (4th col.) |
single level hierarchy |
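Vice versa |
By nature, dimensions can be grouped into organization, personnel, product, time, transaction, ledger account, customer, contract, etc.
5. Examples
CCY, region, customer, date, gender, status, code, category, title, weight, area, role
Job title, organization, ID number, code, postal code, number, grade, condition, status, category, flag, remark, name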
三、Fact
1. Main types
1) Conformed fact
The basic fact table.
2) Factless fact
Factless fact table: a fact table with no measures (it contains only dimension keys). The 1st type records an event, e.g. student attendance; the 2nd type is a coverage table.
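Because a factless fact table has no measure column, analysis comes down to counting rows. A small sketch for the attendance case, with made-up surrogate keys and rows:

```python
from collections import Counter

# Hypothetical factless fact rows: (date_key, student_key, class_key).
# There is no measure; each row only records that the event happened.
fact_attendance = [
    (1, 1, 100),
    (1, 2, 100),
    (2, 1, 100),
    (2, 1, 200),
]

# Attendance per class is simply the number of rows for that class.
attendance_per_class = Counter(class_key for _, _, class_key in fact_attendance)
print(dict(attendance_per_class))
```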
3) Early-arriving fact
Normally the dimensions are loaded first, then the facts. An early-arriving fact is a fact that has arrived while its dimension is not yet ready.
2. Other classifications
1. Transaction grain (e.g. retail sales transaction; the largest) |
2. Periodic snapshot (e.g. balance, monthly grain; surrogate keys required!) |
3. Accumulating snapshot (e.g. order fulfillment; a definite beginning and end, a large number of date foreign keys), e.g. an order that is created, committed, returned …; submitted date, approval date, processed date, settlement date … |
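An accumulating snapshot keeps one row per process instance and fills in its date columns as milestones are reached. The sketch below uses the four dates named above as columns; the record_milestone helper, the order number, and the date values are hypothetical.

```python
# One date column per milestone of the order fulfillment process.
MILESTONES = ["submitted_date", "approval_date", "processed_date", "settlement_date"]

# Hypothetical accumulating snapshot: one row per order, keyed by order number.
fact_order_fulfillment = {}

def record_milestone(order_no, milestone, date_value):
    """Create the row on first sight, then update its milestone dates in place."""
    if milestone not in MILESTONES:
        raise ValueError(f"unknown milestone: {milestone}")
    row = fact_order_fulfillment.setdefault(order_no, dict.fromkeys(MILESTONES))
    row[milestone] = date_value

record_milestone("SO-1", "submitted_date", "2010-01-13")
record_milestone("SO-1", "approval_date", "2010-01-14")

# Milestones not yet reached stay empty (None) until they happen.
print(fact_order_fulfillment["SO-1"])
```

Unlike a transaction-grain table, rows here are revisited and updated, which is what distinguishes this fact type.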
3. Examples
quantity, count, amount, percent
四、CDC
Change Data Capture (CDC). Several approaches are summarized below.
1. timestamp / version / status
2. record / key-words compare
3. log scanner
4. trigger on tables
5. Tools: Attunity, Oracle GoldenGate, IBM InfoSphere Change Data Capture …
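Approach 2 (record compare) can be sketched by diffing two snapshots of a source table by key. The diff_snapshots helper and the sample rows are assumptions for illustration; log scanners and triggers avoid this full comparison, at the cost of depending on the source database.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots (dicts keyed by primary key) and classify the changes."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserted, updated, deleted

# Hypothetical snapshots of a customer table taken on two consecutive loads.
yesterday = {1: ("Ann", "Paris"), 2: ("Bob", "Rome"), 3: ("Cid", "Oslo")}
today = {1: ("Ann", "Lyon"), 2: ("Bob", "Rome"), 4: ("Dee", "Bonn")}

inserted, updated, deleted = diff_snapshots(yesterday, today)
print(inserted, updated, deleted)
```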
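五、Recommendations
Kimball suggests putting free-form text into a separate dim rather than carrying it on every fact record.
Besides a junk dimension, another way to combine multiple dimensions is to build a snapshot table of the dimensions that stores the information redundantly and is fully refreshed at regular intervals.
Convert snowflake models to star models wherever possible.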
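Turning a snowflake into a star mostly means copying sub-dimension attributes onto the dimension rows, trading redundancy for fewer joins. A minimal sketch with hypothetical product and category tables:

```python
# Hypothetical snowflaked tables: the product dimension points at a category sub-dimension.
dim_category = {
    1: {"category_name": "Food"},
    2: {"category_name": "Toys"},
}
dim_product = [
    {"product_key": 10, "product_name": "Apple", "category_key": 1},
    {"product_key": 11, "product_name": "Robot", "category_key": 2},
]

# Star version: denormalize by copying the category attribute onto each product row,
# so queries join the fact table to a single flat dimension.
dim_product_star = [
    {
        "product_key": p["product_key"],
        "product_name": p["product_name"],
        "category_name": dim_category[p["category_key"]]["category_name"],
    }
    for p in dim_product
]
print(dim_product_star)
```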