A Deep Dive into Data Cataloging
A data catalog is one of the more accessible outputs of a data governance project. It is valuable to anyone in your organization who needs data and wants to spend as little time as possible tracking it down.
Remember a couple of blog posts ago when we said data problems are people problems? Developing a data catalog can also be something of a diplomatic mission, resulting in a resource that will save your coworkers’ time and demonstrate some of the benefits of good data governance.
Because the concept of data governance and its related terminology are multifaceted at best (and slippery at worst), let’s pause to establish some definitions used in this post. For our purposes, a data catalog is a central source of basic information about an organization’s data resources. You could also think of it as an inventory or a directory. By “resources,” we mean any assets containing data that an organization creates or collects and that people need to do their jobs. While the focus of data catalogs is often individual data sets formatted as tables, you can (and, we would argue, should) catalog other resources created with that data, such as dashboards, presentations, and written documents, which we will collectively call “reports.”
At Inciter, data cataloging is central to how we create an Impact Blueprint because we need to understand what data is important and how it’s being used across the organization.
We approach cataloging as an iterative process of collecting examples of reports and data, consulting the people working with the data and developing the reports, documenting what we’ve heard from data stewards and observed from the examples, and validating that we’ve accurately represented what we learned. The difference between the cataloging we do for an Impact Blueprint and the cataloging an organization would do as part of a data governance project is that the former is a snapshot. A data catalog maintained by an organization should be a living document with (and here’s where the governance part comes in) defined roles, accountability structures, and processes for maintaining it.
To develop a catalog, you first need to determine which resources need cataloging. How you identify the resources your organization creates or uses will vary depending on its culture and on your familiarity with, and level of access to, its systems and data. If you feel confident that you generally know what’s out there, you may opt to start by listing all of the data sets you can think of, along with the representations or configurations of one or more of those data sets. However, if you are unsure what data people are using, ask around to identify what reports people need and who puts them together.
Why focus on reports? Typically, more people in an organization interact with parts of a data set - such as exports from a database - than with the entire “raw” data set with all columns and rows visible. While you eventually want to catalog both, starting with the “data products” can help:
- Engage more people in your data governance project by talking about data in formats they recognize and value.
- Create a resource that is useful to anyone at your organization who interacts with data, whether they are trying to find the appropriate data to analyze or simply trying to track down the link to that quarterly financial report their boss misplaced again.
- Identify organizational trends and outliers in the way data is accessed and used, which will come in handy for developing standards around data preparation and quality.
Talking to the people who put these reports together about what each report is for, where its data comes from, and where it is stored will help you capture the basic information needed for cataloging and build relationships with data stewards in your organization. For any reports you can review and catalog without help, you can always engage data stewards by asking them to check your work (while also making them aware that they now have a central location to see what other reports are out there and where to find them!).
Creating the Catalog
Once you know what to start cataloging, you have to decide how to catalog it. Data catalog entries should, at a minimum, briefly explain what the report or data set is, where to find it, and who is responsible for maintaining it. The best design for a data catalog is whatever people in your organization will actually use and keep using. This can be as complex as dedicated software (stay tuned for our next blog post on technology) or as simple as a bulleted list. We are partial to spreadsheets because they are quick to set up and they organize the metadata you create in a format that is easy to filter and, later, to move into a more complex database.
Let’s take a look at an example made in Google Sheets by cataloging two resources developed using fake data about the distribution of grant funding for workplace injuries. The first contains data about grant money distributed by state, and the second uses that funding data as part of a dashboard.
Resource #1: A spreadsheet of fake data detailing the distribution of grant funding by a hypothetical organization.
Resource #2: A dashboard displaying grant distribution data by state.
First, establish resource classifications: Using the terms from this post, ours would be “table” and “report.” These terms can vary between organizations and could also be expanded to include, for example, a separate category called “tool” for interactive resources like applications. Including standard classifications allows you to list all of your resources in a flat format that can be sorted or filtered.
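To illustrate why a classification column makes a flat list easy to work with, here is a minimal Python sketch. The resource titles are made up for illustration (only “table” and “report” come from this post), and a spreadsheet filter would do the same job without any code.

```python
# A flat list of resources, each tagged with one standard classification.
# Titles are hypothetical examples, not real catalog entries.
resources = [
    {"title": "Grant Distribution Dashboard", "classification": "report"},
    {"title": "Grant Funding by State", "classification": "table"},
    {"title": "Annual Funding Summary", "classification": "report"},
]


def by_classification(rows, classification):
    """Return the rows with the given classification, sorted by title."""
    matches = [row for row in rows if row["classification"] == classification]
    return sorted(matches, key=lambda row: row["title"])


reports = by_classification(resources, "report")
print([row["title"] for row in reports])
# ['Annual Funding Summary', 'Grant Distribution Dashboard']
```

Adding a new classification such as “tool” would only require using that value in the classification column; the filtering logic stays the same.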
Next, decide what information about your resources (metadata) will be required and how to break it down. We would suggest:
- Resource Title: What do most people call it? If it’s an acronym, adding the full name is also recommended. An example would be: JSR (Job Status Report).
- Description: One to two sentences on what the resource is and what it is for.
- Resource Owner: Which unit, department, or role is responsible for maintaining it?
- Owner Contact Information: This will likely be an email address.
- Access Instructions: This could be a link to a document or website for something public facing or widely accessible within the organization. For anything involving more sensitive data, additional information about how to request access may be included here.
From there, determine whether you want to leave space for other information that is useful for understanding the scope and use of the resource but not essential for finding it, such as links to other documentation or more technical details about data quality and maintenance standards.
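If you ever want to move beyond a spreadsheet, the fields above could be sketched as a simple record in Python. This is only one possible shape, not how our template is built: the field names, the optional `notes` column, the `problems` helper, and the example.org addresses are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields

# Classifications used in this post; extend the set if you add e.g. "tool".
CLASSIFICATIONS = {"table", "report"}
# Fields that may be left blank (an assumption for this sketch).
OPTIONAL_FIELDS = {"notes"}


@dataclass
class CatalogEntry:
    classification: str
    title: str
    description: str
    owner: str
    owner_contact: str
    access_instructions: str
    notes: str = ""  # space for extra documentation links or quality details


def problems(entry):
    """List what is wrong with an entry: bad classification or blank required fields."""
    issues = []
    if entry.classification not in CLASSIFICATIONS:
        issues.append(f"unknown classification: {entry.classification}")
    for field in fields(entry):
        if field.name in OPTIONAL_FIELDS:
            continue
        if not getattr(entry, field.name).strip():
            issues.append(f"missing {field.name}")
    return issues


def to_csv(entries):
    """Render entries as CSV text that can be opened in any spreadsheet tool."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(CatalogEntry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()


entries = [
    CatalogEntry(
        classification="table",
        title="Grant Funding by State",
        description="Fake data on grant money distributed by state.",
        owner="Grants Team",
        owner_contact="grants-team@example.org",
        access_instructions="https://example.org/grant-funding-table",
    ),
    CatalogEntry(
        classification="report",
        title="Grant Distribution Dashboard",
        description="Dashboard displaying grant distribution data by state.",
        owner="Data Team",
        owner_contact="data-team@example.org",
        access_instructions="",  # left blank on purpose to show the check
    ),
]

for entry in entries:
    print(entry.title, problems(entry))
print(to_csv(entries))
```

A check like `problems` is a lightweight way to catch incomplete entries before they reach the shared catalog.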
Finally, store the data catalog somewhere people can find it and tell everyone - multiple times - that it exists!
Be sure to establish roles and a process for regular updates to the catalog so that it continues to be a relevant resource. Our example has a tab to schedule and monitor updates.
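As a rough sketch of what monitoring updates could look like outside of a spreadsheet tab, the Python below flags entries that have not been reviewed within a chosen interval. The 90-day interval, the review dates, and the function name are assumptions for illustration; pick whatever cadence suits your organization.

```python
from datetime import date, timedelta


def overdue_entries(last_reviewed, review_every_days, today):
    """Return titles last reviewed before the cutoff date, oldest review first."""
    cutoff = today - timedelta(days=review_every_days)
    overdue = [
        (reviewed, title)
        for title, reviewed in last_reviewed.items()
        if reviewed < cutoff
    ]
    return [title for reviewed, title in sorted(overdue)]


# Hypothetical review dates for the two example resources.
last_reviewed = {
    "Grant Funding by State": date(2024, 1, 15),
    "Grant Distribution Dashboard": date(2024, 5, 1),
}

print(overdue_entries(last_reviewed, review_every_days=90, today=date(2024, 6, 1)))
# ['Grant Funding by State']
```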
(And by the way, if you would like a copy of this template, send an email to firstname.lastname@example.org)
Additional data catalog examples
- Although the Consumer Financial Protection Bureau’s Public Data Inventory is a web-based resource, it uses a simple tabular format similar to our Google Sheets example. Note the different data type options and the addition of update frequency information.
- Data.CMS.gov is a great example of a highly detailed catalog that includes more technical information such as links to data dictionaries and related resources.
- Take a look at the left sidebar of the Urban Institute’s public data catalog to see a breakdown of how they use categories, content types, and tags to sort their data.
Other ways to build good documentation habits
Getting in the habit of embedding context into data resources is a good idea whether that information gets compiled into a data catalog or not. Ways to do this include:
- Establishing a naming convention for files that includes information about where a resource came from and/or when it was last updated
- Taking advantage of knowledge management features in your file storage system, such as the ability to write descriptions in Google Drive folder and file details or to create a list in Microsoft 365
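A naming convention is easiest to keep when it is generated rather than typed. Here is a minimal Python sketch of one possible convention (source, title, then last-updated date); the pattern and the `catalog_filename` helper are our own illustration, not a standard.

```python
import re
from datetime import date


def slug(text):
    """Lowercase the text and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def catalog_filename(source, title, updated, extension):
    """Build a file name of the form source_title_YYYY-MM-DD.extension."""
    return f"{slug(source)}_{slug(title)}_{updated.isoformat()}.{extension}"


print(catalog_filename("Grants DB", "Funding by State", date(2024, 3, 1), "csv"))
# grants-db_funding-by-state_2024-03-01.csv
```

Putting the date in ISO format (year first) means files sort chronologically when listed by name.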
BI tools also have documentation features you can use for individual files (workbooks) and in their web-based portals. For example, Inciter’s BI tool of choice, Metabase, has a data browser feature that essentially functions as a data catalog for any resources stored there.
Here is the metadata for the workplace injury grants demo data from earlier:
From there, you can add information about individual tables…
…and even build out a more detailed data dictionary by including field (column) level descriptions.
We hope this deep dive into data cataloging has inspired and empowered you to take a look at your own organization’s data and reports! Keep an eye out for our next post, where we continue the broader conversation around data governance with a focus on technologies.
And if you haven’t already signed up for our newsletter, you can do so here to receive the next post in your inbox!