Q&A from the EDW 2021 Presentation: “Data Cataloging with Knowledge Graphs” – Part 2
TopQuadrant participated in the recent Enterprise Data World conference, where our CEO, Irene Polikoff, gave a presentation on “Data Cataloging with Knowledge Graphs”. This is part 2 of the blog series that explores interesting questions we received following the talk.
If you attended the conference but missed the talk, recordings are available at the conference site. If you did not have a chance to attend, we invite you to listen to our recent webinar on the same topic. The webinar was not identical to the talk, but it covered some of the same content.
The questions below start with number 6 because we already covered five questions in part 1 of the series, available here. We hope you will find these questions and answers interesting. We would like to hear your thoughts on them, as well as any additional questions you may have on this topic. Write to us; we always appreciate your input!
6. Can you use a knowledge graph to disseminate knowledge to different platforms like Confluence, Alation, SharePoint, etc., so that users have different ways to access the information that fit their needs?
Yes, absolutely.
Information in TopBraid EDG is readily available through APIs. It can be accessed from other tools in real time. Triggers and listeners can be set up to listen and alert or act on any changes. Information can also be exported and then imported into another tool.
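To illustrate the idea of listeners that act on changes, here is a minimal, generic publish/subscribe sketch. It does not use the TopBraid EDG API; `ChangeEvent`, `ChangeNotifier`, the property names, and the target tools are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical description of one change to a catalog entry."""
    asset_id: str
    property_name: str
    old_value: str
    new_value: str


Listener = Callable[[ChangeEvent], None]


class ChangeNotifier:
    """In-memory publish/subscribe registry (illustrative only)."""

    def __init__(self) -> None:
        self._listeners: DefaultDict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, property_name: str, listener: Listener) -> None:
        # Register a listener for changes to one property.
        self._listeners[property_name].append(listener)

    def publish(self, event: ChangeEvent) -> int:
        # Call every listener registered for the changed property
        # and return how many were notified.
        listeners = self._listeners.get(event.property_name, [])
        for listener in listeners:
            listener(event)
        return len(listeners)


# Usage: queue an update for a (hypothetical) wiki page whenever a definition changes.
outbox: List[str] = []
notifier = ChangeNotifier()
notifier.subscribe(
    "definition",
    lambda e: outbox.append(f"wiki: refresh page for {e.asset_id}"),
)
notifier.publish(ChangeEvent("customer_table", "definition", "old text", "new text"))
```

In a real integration, the listener body would call the target tool's own API instead of appending to a list.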
7. Would a “business object” be part of the controlled vocabulary?
The answer to this question depends on the context. For example, “business object” may be a term that is commonly used in your organization. If so, it would be contained in a business glossary of terms. A good example of what may be managed as a controlled vocabulary is an enumeration of the types of databases you use, e.g., Oracle, MySQL, Sybase, etc. When a database is cataloged, the database type is included in the metadata describing it. The value for it would come from a controlled vocabulary.
For convenience, TopBraid EDG includes a set of pre-built (and fully modifiable) controlled vocabularies for describing data sources to be captured in the catalog.
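The database type example above can be sketched in a few lines of Python. This is only an illustration of validating a metadata value against a controlled vocabulary, not how TopBraid EDG stores its vocabularies; `DatabaseType`, `DatabaseEntry`, and `catalog_database` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class DatabaseType(Enum):
    """Controlled vocabulary of database types (values from the example above)."""
    ORACLE = "Oracle"
    MYSQL = "MySQL"
    SYBASE = "Sybase"


@dataclass(frozen=True)
class DatabaseEntry:
    """Hypothetical catalog record for one database."""
    name: str
    database_type: DatabaseType


def catalog_database(name: str, database_type: str) -> DatabaseEntry:
    """Create a catalog entry, accepting only values from the vocabulary."""
    try:
        return DatabaseEntry(name, DatabaseType(database_type))
    except ValueError:
        allowed = ", ".join(t.value for t in DatabaseType)
        raise ValueError(
            f"Unknown database type {database_type!r}; expected one of: {allowed}"
        ) from None


entry = catalog_database("crm_prod", "Oracle")
```

Because every entry draws its type from the same list, catalog users can filter and group databases reliably instead of dealing with spelling variants.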
8. Isn’t data modeling one of the steps for creating a data catalog? Or is this also done automatically?
In the presentation, we described the following common steps for creating a data catalog:
1. Identify the initial Data Sources to be Inventoried
2. Decide What Metadata is Important
3. Establish Controlled Vocabularies for Metadata Items
4. Decide on the Cataloging processes, including:
   - Automating capture of Technical Metadata
   - Defining roles, responsibilities and processes for enriching the catalog with business and operational metadata
5. Test with users, iterate and extend
Deciding on what metadata is important to capture (step 2) will typically include some data modeling activities.
As described in part 1 of this blog series, TopBraid EDG comes with pre-built models. They provide a convenient and comprehensive base for a data cataloging initiative. Pre-built importers populate these models with the information gathered from the cataloged sources.
Having said this, every organization is different, and most will want to customize the models to some extent. Establishing controlled vocabularies can also be considered part of the data modeling activity for a catalog.
9. It sounds like the controlled vocabulary ties into Reference Data Management. Is that correct?
To some extent, yes. Reference Data Management deals with a subtype of controlled vocabularies.
Controlled vocabulary is a broad term. Reference data is a set of codes, for example, country codes, currency codes, or product codes. Thus, each item in a reference dataset must have a code that is unique locally (within the dataset). This code is typically a short string. For example, the two-character code for the United States is US. The code may also be numeric or contain a combination of numbers and letters. The important part is that it is unique within a reference dataset. A term in a controlled vocabulary, on the other hand, does not necessarily have such a unique code.
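The uniqueness requirement can be expressed as a simple check. The following is a minimal sketch, assuming a reference dataset is supplied as (code, label) pairs; `build_reference_dataset` is a hypothetical helper, not part of any product.

```python
from typing import Dict, Iterable, Tuple


def build_reference_dataset(items: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Build a code -> label mapping, rejecting codes that repeat within the dataset."""
    dataset: Dict[str, str] = {}
    for code, label in items:
        if code in dataset:
            raise ValueError(f"Duplicate code {code!r} in reference dataset")
        dataset[code] = label
    return dataset


# Reference data: every item carries a locally unique code.
countries = build_reference_dataset([("US", "United States"), ("CA", "Canada")])

# A controlled vocabulary in the broader sense may be just a set of terms without codes.
database_types = {"Oracle", "MySQL", "Sybase"}
```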
Enterprise reference data is managed to ensure the consistency and quality of the operational data that uses it and to facilitate integration of that operational data. Different operational data sources may use different sets of codes. In this case, it is important to create and maintain crosswalks between them. It is also important that codes don’t get deleted. If they should no longer be used, they get retired. If they were replaced or merged, this gets captured. This is because reference data is used to categorize other, operational data, e.g., people born in the US or packages shipped to the US. If a code is deleted, the operational data loses information.
Check out our Reference Data Management page for additional information on management of reference data.
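As a rough illustration of retiring codes instead of deleting them, and of crosswalks between code sets, consider the sketch below. All class and function names, the product codes, and the "shipping system" are hypothetical; the sketch only shows that old operational records remain interpretable after a code is retired or replaced.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ReferenceCode:
    code: str
    label: str
    retired: bool = False
    replaced_by: Optional[str] = None


class ReferenceDataset:
    """Codes are never deleted; they are retired and optionally point to a replacement."""

    def __init__(self) -> None:
        self._codes: Dict[str, ReferenceCode] = {}

    def add(self, code: str, label: str) -> None:
        if code in self._codes:
            raise ValueError(f"Code {code!r} already exists")
        self._codes[code] = ReferenceCode(code, label)

    def retire(self, code: str, replaced_by: Optional[str] = None) -> None:
        # Keep the entry so existing operational data can still be interpreted.
        if replaced_by is not None and replaced_by not in self._codes:
            raise KeyError(replaced_by)
        entry = self._codes[code]
        entry.retired = True
        entry.replaced_by = replaced_by

    def resolve(self, code: str) -> ReferenceCode:
        # Follow the replacement chain to the code that should be used today.
        entry = self._codes[code]
        seen = {entry.code}
        while entry.replaced_by is not None:
            if entry.replaced_by in seen:
                raise ValueError(f"Replacement cycle at {entry.replaced_by!r}")
            entry = self._codes[entry.replaced_by]
            seen.add(entry.code)
        return entry


def translate(crosswalk: Dict[str, str], local_code: str) -> str:
    """Map a source system's code to the enterprise code via a crosswalk."""
    return crosswalk[local_code]


products = ReferenceDataset()
products.add("P-100", "Widget, small")
products.add("P-200", "Widget")
products.retire("P-100", replaced_by="P-200")  # merged into P-200, not deleted

# Crosswalk: a hypothetical shipping system uses three-letter country codes.
shipping_to_enterprise = {"USA": "US", "CAN": "CA"}
```

An old order that still references P-100 can be resolved to P-200, and a package recorded as shipped to USA can be matched with people born in US.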
In the context of a data catalog, controlled vocabularies are about aggregating and presenting unified information about diverse data sources, as opposed to integrating data from those sources. There will be some overlap with the reference data used by the operational data within a data source. For example, when describing the geographical scope of a cataloged dataset, you may use the same country codes that you use in address or shipping data.
TopBraid EDG can support your reference data management and data cataloging needs in a single, extensible and open environment.