Data Modeling for Hierarchical Relationships
Hierarchical relationships are a little convoluted to model than normal relationships. This is because traversing a hierarchy programmatically could involve recursive traversals. Database designer must think of:
- the clearest logical path to express hierarchy in the data model
- make it easy and performant for programs (SQL) to traverse the hierarchy.
Where do we see hierarchies?
Think of these common examples: Organization hierarchy, Product hierarchy, Animal hierarchy. In any of these examples, the term hierarchy implies that there are sub-categories in that group (sub-category of employees, products or animals) where each sub-category is related to another sub-category. That relationship between sub-categories is what defines the hierarchy.
Why do some hierarchies involve recursion?
Employee has a Manager. Manager is an Employee and in turn reports to an Executive who is also an Employee. As you can see the sub-categorization based on employee designation doesn’t keep them from being Employees, they are all essentially Employees. This is recursive.
In ER Modeling parlance, the only way to avoid recursion here is if you duplicate Employee information into separate tables — one table for Employees who are at the bottom of hierarchy, a second table for Managers, a third for Executives etc. This is a bad design for Employee data because all employees essentially share the same attributes (name, designation, salary, address etc.)
Comprehending types of hierarchies
- Recursive 1:M
- Recursive M:M
- Non-recursive normalized (Nested)
- Non-recursive de-normalized
Hierarchies can be modeled in different ways so it is essential to understand how the sub-categorization of data is done within business.
- Recursive 1:M
Design
Employee and Manager hierarchy is a recursive 1:M relationship. One Manager may have many employees, but an Employee has only one Manager. Each Employee record has a foreign key which points their Manager that is another record on the same Employee table.
Access
To get Employee names and their Manager names:
SELECT e.Employee_Name, m.Employee_Name
FROM Employee e, Employee m
WHERE e.Manager_ID = m.Employee_ID (+)
To get Employees, their Managers and Executives (Managers’ Managers):
SELECT e.Employee_Name, m.Employee_Name, x.Employee_Name
FROM Employee e, Employee m, Employee x
WHERE e.Manager_ID = m.Employee_ID (+)
AND m.Manager_ID = x.Employee_ID (+)
Pros
- Easy to design.
- Easy to access via self-joined SQL.
- Ragged hierarchies, where one record may have just one parent and at the same time another record may have one parent plus a grand parent (and a great grand parent), can be easily represented without altering schema. The higher most parent will have a NULL for its parent. In other words hierarchy is represented through data itself, not through structure.
Cons
- If the height of hierarchy is too long, the number of self-joins increase proportionately. For a very large table this is quite an SQL. For example, if there are 8 levels from a low-level Employee to CEO and you want to find an Employee’s 7th level Manager, SQL becomes unwieldy with 7 self joins.
2. Recursive M:M
Design
We all work for 2 managers at a time once in a while. If we put this situation into a data model, we have a M:M situation. Each Employee can have one or more Managers. Managers certainly have one or more reporting Employees. The universal method to model M:M’s, adding a bridge table, applies here as well.
Note, startDate is part of Primary key of bridge table to track Supervision history. If we don’t want to track when the Employee’s assignment with a Manager started, we won’t end up adding a new record for the same Employee-Manager combination into the bridge table every time an Employee is assigned to that Manager. If that’s the case, EmployeeID+ManagerID combination will suffice for primary key.
Access
To get Employee names and their Manager names:
WITH Manager AS
(SELECT m.employeeID as EmpID, m.managerID as MrgID, e.lastName as MgrName
FROM Employees e, Supervisions m
WHERE e.employeeID = m.managerID)
SELECT e.lastName as EmployeeName, m.MgrName as ManagerName
FROM Employee e, Manager m
WHERE e.managerID = m.EmpID (+)
Pros
- Easy to design.
- If Employee and Manager played a different role in each assignment (or has other attributes specific to each assignment/supervision), those attributes can be added to the bridge table. Just that the PK of bridge table would have to be expanded to have that role (or start date like in the picture above).
- Ragged hierarchies can be easily represented.
Cons
- Same as in 1:M.
3. Non-recursive normalized (Nested)
Design
Recursive methods are called Adjacency Lists. When you have a fixed depth hierarchy you can use a nested tree structure.
Access
To get Product Groups and Product Types belonging to them:
SELECT pg.ProductGroupName, pt.ProductTypeName
FROM ProdGroup pg, ProdSubGroup psg, ProdType pt
WHERE pg.pgKey = psg.pgKey
AND psg.psgKey = pt.psgKey
Pros
- Each hierarchical entity can have different attributes. It’s flexible that way.
- JOINs could be less exhaustive than recursive joins (self joins) because records are split into smaller entities.
Cons
- Adding a new hierarchy involves adding structural changes (schema changes). In recursive models adding a new hierarchy may be achieved through simply introducing a new value to the column that defines hierarchy type. For Employee it’s probably a designation column.
- Ragged hierarchies cannot be represented without violating normalization. For example in A → B → C where A is the upper most parent, a ragged hierarchy is either impossible to represent or may require an additional relationship between A → C which breaks normalization (or breaks business rules for data quality, because the grain of C involves keys from either A or B).
4. Non-recursive de-normalized
Design
These are created by flattening out the nested structure above into a single table. That is, if we JOINed all the 3 product tables above and SELECTed all their attributes with no additional filtering conditions, the tabular result set obtained becomes the content of this de-normalized hierarchical table. This will look like a mapping table where each record has a Product Group, Product Subgroup and a Product Type.
Access
To get Product Groups, Subgroups and Types belonging to them:
SELECT ph.ProductGroupName, ph.ProductSubGroupName ph.ProductTypeName FROM ProductHierarchy ph
Pros
- For low volume hierarchies the access works like a charm. There are no JOINs. High volume hierarchies might work well with proper indexing.
Cons
- Careful maintenance and isolation is required. Screwing up the hierarchy is just one DML away.
All of these hierarchical implementations can benefit from using additional attributes to explain the hierarchy. For example, adding a level attribute will make SQL easier and lucid.