Data Governance
May 5, 2018
Data Governance (DG in the rest of the story) is a broad term dealing with guidelines on managing the data of an enterprise. Depending on the organization, it might include some or all of the following areas.
- Data modeling
- Master data
- Reference data
- Metadata
- Taxonomy and Business Glossary
- Naming standards
- Review, approval and periodic audit of project artifacts (such as BRD, LLD, HLD, LDM and PDM)
- Data Quality
- Data Lineage
- Data security
- Define ownership & boundaries (stewardship)
Here’s a brief description of each of these areas.
- Data modeling is a broad topic with respect to governance. It could mean how we organize data enterprise-wide (domain-specific vs. canonical), how we control master data, how rigid normalization should be for each app or business case, what drives the broader modeling technique (normalized vs. de-normalized vs. data vault), whether to choose schema-less or NoSQL databases over SQL databases, etc.
- Master data is the core data that defines the actors and objects in a business (e.g. Customer, Physician or Drug). The events, transactions and logs created by the interplay of actors and objects may not qualify as master data (e.g. a credit transaction of $12, a drug prescription or a shipping approval). Master data sometimes deserves tight control through an MDM system; at other times a less expensive approach, such as incorporating it into data warehouse dimensions, is enough.
- Reference data covers types, codes and similar lookup values. It is usually looked up to add meaning to master data or transactional data. If there is a vendor table (master data), there is usually a vendor type table (reference data), which is the parent of the vendor table in a normalized database. There might also be a table of state codes (AL for Alabama, and so on) for the states in which the vendor operates. The DG team has a say in defining type values, code standards, code translations etc.
- Metadata exists in different areas. Data integration statistics (ETL metadata), the error codes of an exception process and data about business entities are all metadata.
- Taxonomy is the classification of business terms under larger business areas, based on industry-standard names and classifications. A business glossary is built out of this knowledge; it can be a simple spreadsheet, a content management system or a tool like IBM Business Glossary. This is an invaluable asset. Every project in an organization can refer to this shared glossary to answer its business questions without having to go to a business person each time. It is an implementation of business metadata.
- Naming standards for entities and attributes have to be controlled by the DG team. Standards should be defined based on class words, for example AMT for amount fields and PCT for percentage fields. There has to be a standard set of abbreviations, along with rules on what is abbreviated and when abbreviation is allowed, rules on casing, allowable special characters, maximum length of attribute names etc.
- Review and audit means the DG team reviews and signs off on project artifacts that create new data elements, periodically audits the organization’s project artifacts, and reports any non-compliance. Project managers are held responsible for non-compliance.
- Data Quality rules for individual and critical data elements have to be established in coordination with the business. For example, a vendor type can be either ‘Time & Material’ or ‘Statement of Work’, but a vendor can never be both and the type can never be empty. Profile important data elements from time to time and publish the results to a DQ dashboard (number of nulls, percentage of one value vs. the rest of the values for a field, data type analysis, case mismatches etc.).
- Data Lineage shows, graphically (when tools are used) or in a spreadsheet, where data elements originate, which databases and applications they pass through, and whether any transformations are applied along the way before they finally reach a target database or application. Often in a production crisis, business users look at an attribute value in the application and ask “Can anyone tell me where this attribute is coming from and how it is transformed?” If there are multiple hops, the data lineage document or diagram is where you should look. It should be managed by the data architect or the DG team.
- Data security is a realm of shared ownership, where the weakest link (e.g. developers sharing a database password, or a database hosted in a public subnet of a public cloud) becomes the point of greatest exposure. Encryption, Role Based Access, Multi Factor Authentication, the Principle of Least Privilege, Masking and Test Data Management are examples of data security in practice.
- Stewardship is defining and documenting who owns what data in the taxonomy. Not every single attribute, down to the transaction tables, will have a data steward. Assigning data stewards is very helpful for getting answers and assigning accountability in a crisis or during an audit or compliance change.
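To make the naming standards area a little more concrete, here is a minimal sketch of an automated name check. The class words AMT and PCT come from the description above; the other class words, the 30-character limit and the upper-case-with-underscores rule are assumptions made up for illustration, not a recommended standard.

```python
import re

# AMT and PCT come from the article; the other class words are hypothetical.
CLASS_WORDS = {"AMT", "PCT", "CD", "DT", "NM", "ID"}
MAX_LENGTH = 30  # hypothetical maximum attribute-name length
# Hypothetical casing rule: upper-case words separated by single underscores.
NAME_PATTERN = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")


def check_attribute_name(name: str) -> list[str]:
    """Return the naming-standard violations for a name (empty if compliant)."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    if not NAME_PATTERN.match(name):
        problems.append("not upper-case words separated by underscores")
    # The last word of the name is expected to be an approved class word.
    if name.rsplit("_", 1)[-1] not in CLASS_WORDS:
        problems.append("does not end in an approved class word")
    return problems


for candidate in ["INVOICE_AMT", "VENDOR_TYPE", "discountPercent"]:
    print(candidate, check_attribute_name(candidate))
```

A check like this could run as part of the artifact review, so that non-compliant names are flagged before a data model is signed off.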
None of these is a one-time activity. Because the enterprise changes year after year and project after project, the data governance team continuously evaluates, defines and approves changes in all of the areas above.
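As one illustration of the data quality area described earlier, the sketch below profiles a vendor type field against the two allowed values from the example. The function name, the shape of the result and the sample values are assumptions for illustration; a real profile would read from the actual table and feed the DQ dashboard.

```python
from collections import Counter

# Allowed values taken from the vendor type example in the article.
ALLOWED_VENDOR_TYPES = {"Time & Material", "Statement of Work"}


def profile_vendor_type(values):
    """Profile vendor type values; None or a blank string counts as null."""
    total = len(values)
    present = [v for v in values if v is not None and v.strip() != ""]
    counts = Counter(present)
    allowed_lower = {a.lower() for a in ALLOWED_VENDOR_TYPES}
    invalid = sorted(v for v in counts if v not in ALLOWED_VENDOR_TYPES)
    return {
        "total": total,
        "nulls": total - len(present),
        # Share of each distinct value as a percentage of all rows.
        "value_pct": {v: round(100 * n / total, 1) for v, n in counts.items()},
        "invalid": invalid,
        # Invalid values that differ from an allowed value only by letter case.
        "case_mismatches": [v for v in invalid if v.lower() in allowed_lower],
    }


# Hypothetical sample data, not taken from any real system.
sample = ["Time & Material", "time & material", None, "Statement of Work", ""]
print(profile_vendor_type(sample))
```

Running the same profile on a schedule and comparing results over time is one simple way to show whether quality is improving or drifting.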