Data Catalog and Lineage Tracking for Compliance Reporting
Background
A major international bank needed to quickly and reliably track data as it flowed through its systems and data sources. This was required both to improve operational efficiency and to comply with government regulations.
Challenge
Like many other organizations of its size, the bank had a complex and diverse infrastructure with thousands of systems and data sources, which made it challenging to understand the connections between data sources and their role in the bank’s processes and activities.
The bank operated thousands of disparate database silos across multiple systems, divisions and geographies. As a result, business and technical stakeholders had limited visibility into what data and information resided in which databases, how the databases were related, and who was using the data. This led to major operational and regulatory issues.
Further, the firm had started to offload its historical data from Oracle into a Hadoop data lake to minimize the cost of storing large amounts of historical data. To support this process, it needed to ensure that data in the data lake was well organized, understood and stored in a way that made it easy to load datasets back into the operational databases should this be needed for reporting. It also needed to keep track of the milestones and schedules for offloading different data sources.
Solution
The bank used TopBraid EDG to catalog and connect data, technical, and enterprise assets at the business and technical levels.
Figure 1 shows TopBraid EDG LineageGram™, a tool that allowed the firm to quickly navigate, understand and assess information lineage. The figure also depicts the connections between a regulatory compliance business activity, the reports needed to perform that activity, and the two databases used to produce the reports.
Aspects of the firm’s data and application landscape captured by TopBraid EDG included:
1. Data assets and their relationships.
2. Logical entities and business terms, including their connection to physical data elements.
3. Technical assets (software executables) and the inputs and outputs of these executables.
4. Enterprise assets such as business activities, policies, processes and reports.
All these assets are interconnected using the graph technology underlying TopBraid EDG. These connections make it possible to identify all aspects of technical and business lineage to the extent the client finds necessary.
Information was brought into TopBraid EDG from many sources using flexible import and discovery services. Sources that fed into TopBraid EDG included relational databases (RDBMS), Erwin models, Business Objects reports, spreadsheets describing business systems, and business glossaries.
To make it easier for users to see end-to-end connectivity, TopBraid EDG presents a higher-level, rolled-up view of lineage. At the same time, users can drill down into the connections between assets to see more detailed information. An example of drilling down to understand a link between two business applications is shown in Figure 2.
Results
Connections between enterprise assets were used to offer comprehensive data lineage. They were also used for impact analysis, for example, to identify which business activities may be impacted if an application is decommissioned or a server suffers an outage.
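Impact analysis of this kind amounts to a downstream traversal of the lineage graph. The following is a minimal sketch in Python, not TopBraid EDG's own API: the asset names, the FEEDS mapping and both function names are hypothetical, chosen only to illustrate how a server outage could be traced to the business activities that depend on it.

```python
from collections import deque

# Hypothetical lineage edges (illustrative only, not taken from the case study).
# Each key "feeds" the assets listed for it, i.e. they depend on it.
FEEDS = {
    "server:db-host-01": ["database:trades"],
    "database:trades": ["application:risk-engine"],
    "application:risk-engine": ["report:liquidity-report"],
    "database:customers": ["report:liquidity-report"],
    "report:liquidity-report": ["activity:regulatory-compliance"],
}


def impacted_assets(start, feeds=FEEDS):
    """Return every asset downstream of `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dependent in feeds.get(node, []):
            if dependent not in seen:  # guard against cycles and duplicates
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


def impacted_activities(start, feeds=FEEDS):
    """Keep only the business activities among the downstream assets."""
    return [a for a in impacted_assets(start, feeds) if a.startswith("activity:")]


if __name__ == "__main__":
    for asset in impacted_assets("server:db-host-01"):
        print(asset)
```

Calling impacted_activities on a server or application in this toy graph lists the business activities that would be affected by an outage or decommissioning. In a real catalog the edges would come from the lineage graph itself rather than from a hard-coded dictionary.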
Additionally, the bank used the solution to harmonize big data tables with enterprise data structures and data types. TopBraid EDG used information about database schemas to automatically generate Avro schemas for the datasets to be exported from the databases. The archiving policies were also stored in EDG so that the firm knew when data could be archived.
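To illustrate what generating an Avro schema from database schema information can involve, here is a minimal sketch in Python. It is not TopBraid EDG's implementation: the column-metadata format, the type mapping, the example.archive namespace and the function names are all assumptions made for the example. The output follows the Avro specification's JSON form for record schemas, nullable unions and the decimal and timestamp-millis logical types.

```python
import json


def avro_type(col):
    """Map one (hypothetical) Oracle column description to an Avro type."""
    sql_type = col["type"].upper()
    if sql_type in ("VARCHAR2", "CHAR", "CLOB"):
        base = "string"
    elif sql_type == "NUMBER":
        if col.get("scale", 0) == 0:
            base = "long"  # assumption: integral NUMBER columns fit in 64 bits
        else:
            base = {
                "type": "bytes",
                "logicalType": "decimal",
                "precision": col["precision"],
                "scale": col["scale"],
            }
    elif sql_type in ("DATE", "TIMESTAMP"):
        base = {"type": "long", "logicalType": "timestamp-millis"}
    else:
        raise ValueError(f"no Avro mapping defined for {sql_type}")
    # Nullable columns become a union with "null" listed first.
    return ["null", base] if col.get("nullable", True) else base


def avro_schema(table, columns, namespace="example.archive"):
    """Build an Avro record schema (as a dict) for one table."""
    fields = []
    for col in columns:
        field = {"name": col["name"], "type": avro_type(col)}
        if col.get("nullable", True):
            field["default"] = None  # serialized as JSON null
        fields.append(field)
    return {"type": "record", "name": table, "namespace": namespace, "fields": fields}


if __name__ == "__main__":
    columns = [
        {"name": "TRADE_ID", "type": "NUMBER", "precision": 18, "scale": 0, "nullable": False},
        {"name": "AMOUNT", "type": "NUMBER", "precision": 18, "scale": 2},
        {"name": "BOOKED_AT", "type": "DATE"},
    ]
    print(json.dumps(avro_schema("TRADES", columns), indent=2))
```

Keeping the type mapping in one function makes it easier to apply the same rules when loading archived datasets back into the operational databases.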