Q&A from the webinar: “Data Cataloging with Knowledge Graphs”
We fielded several questions during our recent webinar, Data Cataloging with Knowledge Graphs (recording and slides available here).
The list below includes the questions we answered during the webinar as well as those we did not get to.
1. How does TopBraid fit with other data governance platforms like Collibra, Alation, Informatica, Data.World, Immuta, etc.? Are they exclusive or complementary?
TopBraid EDG is a data governance platform with data cataloging capabilities. Each of the tools you have listed has a different combination of features, so the degree of overlap with TopBraid EDG differs from tool to tool. Further, different users use these tools in different ways. We definitely have some customers that use one of these tools in addition to TopBraid EDG. In these cases, EDG is typically used as the overall knowledge graph and catalog, importing some of the information from the other tool, which is used in a more siloed and targeted way.
2. What tool do you use to generate a data dictionary from a physical database?
TopBraid EDG has a built-in import from a JDBC connection. In the webinar, we gave an example of importing a dataset file. In a similar way, EDG can import a data dictionary from a JDBC connection or from a DDL file. When the import uses the JDBC connector, in addition to the standard data dictionary information (table and column names and datatypes), EDG can also profile the data, collecting data statistics, and import some data samples.
This short video goes into some detail on setting up the JDBC connector for importing information from a physical database, although it does not actually run the import. When the import runs, it populates tables, views and columns. Primary keys, foreign keys, nullable fields and a lot of other rich information are gathered from the source.
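To make the kind of information gathered more concrete, here is a minimal sketch in Python. It is not how EDG implements its import: it uses an in-memory SQLite database as a stand-in for a JDBC source, and the function name `build_data_dictionary`, the sample tables, and the output field names are all assumptions made for illustration. It collects the standard data dictionary information (tables, columns, datatypes), key and nullability details, and a few simple profiling statistics and data samples.

```python
import sqlite3


def build_data_dictionary(conn):
    """Return {table: [column descriptions]} with basic profiling per column."""
    dictionary = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # Map each foreign-key column to the (table, column) it references.
        foreign_keys = {
            row[3]: (row[2], row[4])
            for row in conn.execute(f'PRAGMA foreign_key_list("{table}")')
        }
        columns = []
        table_info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for _cid, name, datatype, notnull, _default, pk in table_info:
            # Simple profiling: row count, non-null count, distinct count.
            total, non_null, distinct = conn.execute(
                f'SELECT COUNT(*), COUNT("{name}"), COUNT(DISTINCT "{name}") '
                f'FROM "{table}"').fetchone()
            # Up to three distinct non-null values as data samples.
            samples = [row[0] for row in conn.execute(
                f'SELECT DISTINCT "{name}" FROM "{table}" '
                f'WHERE "{name}" IS NOT NULL LIMIT 3')]
            columns.append({
                "name": name,
                "datatype": datatype,
                "nullable": not notnull,
                "primary_key": bool(pk),
                "references": foreign_keys.get(name),
                "null_count": total - non_null,
                "distinct_count": distinct,
                "samples": samples,
            })
        dictionary[table] = columns
    return dictionary


# Hypothetical sample database standing in for a real physical source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        email TEXT
    );
    CREATE TABLE purchase (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount REAL
    );
    INSERT INTO customer VALUES (1, 'Ada', 'ada@example.com'), (2, 'Grace', NULL);
    INSERT INTO purchase VALUES (10, 1, 9.5), (11, 1, 20.0), (12, 2, 9.5);
""")

data_dictionary = build_data_dictionary(conn)
for table, columns in data_dictionary.items():
    for column in columns:
        print(table, column)
```

With a real database, the same structure would apply, with the driver's own metadata calls replacing the SQLite-specific `PRAGMA` statements used here.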
3. Isn't a data dictionary limited to RDBMSs?
No, the idea of a data dictionary is not limited to RDBMSs. Any data source can have a data dictionary describing the fields present in it. In the webinar example, we showed a data dictionary for a spreadsheet. A data dictionary obtained from an RDBMS will typically contain more information.
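As a rough illustration of a data dictionary for a non-relational source, the sketch below derives one from spreadsheet data exported as CSV. This is an assumption-laden example, not EDG's import logic: the function names, the three-way datatype guess (integer, decimal, string), and the sample data are all made up for illustration. Note how much less it can say than a database catalog: there are no declared keys or constraints, only what can be inferred from the values.

```python
import csv
import io


def infer_datatype(values):
    """Guess the narrowest of integer, decimal, or string that fits all non-empty values."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "unknown"
    for label, cast in (("integer", int), ("decimal", float)):
        try:
            for value in non_empty:
                cast(value)
            return label
        except ValueError:
            continue
    return "string"


def spreadsheet_data_dictionary(csv_text):
    """Describe each column of a CSV export: field name, guessed datatype, nullability."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        # Missing trailing cells come back as None, so normalize to "".
        values = [(row[name] or "") for row in rows]
        entries.append({
            "field": name,
            "datatype": infer_datatype(values),
            "nullable": any(not v.strip() for v in values),
        })
    return entries


# Hypothetical spreadsheet content.
sample = (
    "city,population,area_km2,mayor\n"
    "Springfield,30000,65.5,\n"
    "Shelbyville,25000,48,Smith\n"
)

fields = spreadsheet_data_dictionary(sample)
for entry in fields:
    print(entry)
```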
4. Do ontologies play any role with respect to data catalogues? If yes, in which way, and how is it possible to generate the corresponding ontologies semi-automatically?
Yes. TopBraid EDG comes with pre-built ontologies describing different types of data assets. For example, there are classes for Database Table, Database Column, Dataset, etc. A rich set of properties is defined for these classes. Overall, EDG ontologies contain over 300 classes and well over 1,000 properties. The cataloging process creates instances of these classes and populates their properties, that is, the metadata. See this blog for an overview of the built-in EDG ontologies.
There is no need to generate ontologies to support data cataloging since they are already built in. In the process of cataloging data sources, EDG auto-generates metadata values. If you find that you need additional classes or properties, you can extend the EDG ontologies. Technical metadata that can be automatically gathered from a source is already defined. Most of the business and operational metadata is also defined, but every organization is different, and yours may want to capture additional information. EDG is highly flexible in supporting the addition of the metadata that you require.
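The idea of cataloging as "creating instances of classes and populating properties" can be sketched with plain subject-predicate-object triples. Everything named below is hypothetical: the `http://example.org/catalog#` namespace, the class names `DatabaseTable` and `DatabaseColumn`, and the properties `datatype`, `columnOf`, and `dataSteward` are stand-ins chosen for illustration and are not the identifiers used in the EDG ontologies. The last step mimics an organization adding its own property next to the automatically generated technical metadata.

```python
EX = "http://example.org/catalog#"  # hypothetical namespace for illustration
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def catalog_table(table_name, columns):
    """Create triples for one table instance and its column instances."""
    table = EX + table_name
    triples = [
        (table, RDF_TYPE, EX + "DatabaseTable"),
        (table, RDFS_LABEL, table_name),
    ]
    for column_name, datatype in columns:
        column = f"{table}.{column_name}"
        triples += [
            (column, RDF_TYPE, EX + "DatabaseColumn"),
            (column, RDFS_LABEL, column_name),
            (column, EX + "datatype", datatype),
            (column, EX + "columnOf", table),
        ]
    return triples


def instances_of(graph, class_iri):
    """List the subjects typed with the given class, in insertion order."""
    return [s for s, p, o in graph if p == RDF_TYPE and o == class_iri]


# Technical metadata, as it might be generated from a source.
graph = catalog_table("customer", [("id", "INTEGER"), ("email", "TEXT")])

# An organization-specific extension: a made-up "dataSteward" property.
graph.append((EX + "customer", EX + "dataSteward", "Jane Doe"))

print(instances_of(graph, EX + "DatabaseColumn"))
```

In a real deployment the triples would live in an RDF store and the class and property definitions would come from the built-in ontologies rather than being invented in code.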
Separately from the cataloging process, and in support of other capabilities, EDG can generate classes and properties from RDF data, and there is a wizard that lets you generate ontology property definitions from spreadsheet data.