Big Data leaves a wake as it flows. Every dataset already carries metadata: data about data. This may include when the dataset was created, its format, its number of records, and who (or what) created it. In addition, every data transform, process, or access generates even more metadata.
This metadata is crucial for data governance. To put it to work, I designed Tracker at Cask, a Big Data startup. Tracker provides a view into the many dimensions of metadata for data that has been ingested by Hydrator – an extract, transform, and load (ETL) application – and used in the Cask Data Application Platform (CDAP).
I worked closely with the CTO, engineering team, front-end developers, and visual designer on this application. I led the UX process, which involved a series of whiteboard sessions, design reviews, informal testing, feature demos, and detailed component specifications in JIRA, an agile project tracking tool.
Tracker automatically captures all technical, business, and operational metadata. The simple design allows users to easily search and explore all forms of metadata associated with entities such as datasets and streams.
The search screen provides high-level analytics on all entities. Users enter a query term, view the results, and then explore the details of an entity. There they find many expressions of metadata, such as the schema and associated tags.
Two important dimensions here are lineage and the audit log. Lineage, like a family tree, lets users explore where an entity originated and where it flows. This view also shows which programs access the entity and how. A full listing of all programs acting upon an entity can be found in the audit log.
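To make the two ideas concrete, here is a minimal sketch in Python of how lineage and an audit log could both be derived from a list of access records. It is not Tracker's or CDAP's actual implementation or API. The Access record, the function names, and the entity and program names are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One hypothetical access record: a program touching an entity."""
    program: str
    entity: str
    mode: str  # "read" or "write"


def build_downstream(accesses):
    """Map each entity to the entities written by programs that read it."""
    reads = defaultdict(set)   # program -> entities it reads
    writes = defaultdict(set)  # program -> entities it writes
    for a in accesses:
        target = reads if a.mode == "read" else writes
        target[a.program].add(a.entity)
    downstream = defaultdict(set)
    for program, sources in reads.items():
        for src in sources:
            downstream[src] |= writes.get(program, set())
    return downstream


def invert(graph):
    """Reverse every edge, turning a downstream map into an upstream map."""
    inverted = defaultdict(set)
    for src, targets in graph.items():
        for dst in targets:
            inverted[dst].add(src)
    return inverted


def reachable(graph, start):
    """Breadth-first walk: every entity reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # ignore self-loops back to the start
    return seen


def audit_log(accesses, entity):
    """List (program, mode) for every recorded access to the entity."""
    return [(a.program, a.mode) for a in accesses if a.entity == entity]


# Hypothetical access records for three made-up programs.
accesses = [
    Access("ingest_pipeline", "raw_events", "write"),
    Access("clean_job", "raw_events", "read"),
    Access("clean_job", "clean_events", "write"),
    Access("report_job", "clean_events", "read"),
    Access("report_job", "daily_report", "write"),
]

downstream = build_downstream(accesses)
flows_to = reachable(downstream, "raw_events")
originates_from = reachable(invert(downstream), "daily_report")
log = audit_log(accesses, "clean_events")
```

In this sketch, lineage is a walk over the graph in either direction (where an entity flows to, and where it came from), while the audit log is simply the filtered list of accesses for one entity.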
Rich metadata, presented well, helps enterprises monitor data quality, which is critical for ensuring data integrity and security.