Knowledge Graph Database for AI/ML

In my last blog post, Using Your Knowledge Graph Database for Analytics, I discussed how knowledge graphs can provide analytics and actionable intelligence, and I explained the easy things we can do with a knowledge graph. In this post, we will take the knowledge graph to the next level by using it as the data source for Machine Learning (ML) solutions.

Knowledge graphs can have a robust list of properties and attributes associated with each node, which allows for new connections between nodes. The edge types (relationships) between the nodes can also carry a list of properties, including temporal and geospatial descriptors. This connected data allows us to take on the questions of “What can be extrapolated from the data?” and “What can we predict will happen based on our understanding of the graph?” The answers to these questions are paramount to a successful ML solution.
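As a rough illustration of nodes and relationships that both carry properties, here is a minimal sketch in Python. The class names, field names, and sample values (Node, Edge, WORKS_AT, the coordinates, and the date) are all made up for illustration and do not come from any particular graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the graph with an open-ended set of properties."""
    node_id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A typed relationship between two nodes; it can carry properties too."""
    source: str
    target: str
    relationship: str
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: a person, an office, and the relationship between them.
person = Node("n1", "Person", {"name": "Pat Example", "birth_year": 1990})
office = Node("n2", "Place", {"name": "Example Office", "lat": 40.0, "lon": -75.0})

# The edge holds a temporal descriptor; the target node holds a geospatial one.
works_at = Edge("n1", "n2", "WORKS_AT", {"since": "2019-06-01"})

print(works_at.relationship, works_at.properties["since"])
```

Because the properties are plain dictionaries, new attributes can be added to a node or relationship without changing a schema, which is the flexibility the paragraph above describes.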

Before we get started with ML using a knowledge graph, we need to make sure all of the entities in the knowledge graph are disambiguated and all associations are correct (to the best of our ability). For example, if we search for the word “cardinal,” we need to distinguish between the bird, the football team, the baseball team, and the leaders of the Catholic Church. Imagine attaching data about the baseball team to the football team. You would quickly lose customers because your facts are incorrect. It is better to leave data unlinked than to link nodes incorrectly, because incorrect associations in the knowledge graph undermine our ability to make AI/ML predictions.

As we organically grow our knowledge graph and ingest new data from disparate data sources, we need to find a way to associate the new data automatically and accurately with the existing nodes. Link prediction identifies pairs of nodes that are not currently connected but plausibly should be. We see this lightweight form of ML every time we log in to Facebook or LinkedIn and see “People you may know.” These simple algorithms find potential relationships based on connections to other nodes and on the properties stored on the nodes. For example, suppose you are connected to your secondary school. Based on your age, the algorithm can try to connect you with others who are connected to the same school and are the same age as you.
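The school-and-age example above can be sketched in a few lines of Python. This is a simple rule-based illustration, not how any particular platform works; the people, schools, and the `max_age_gap` parameter are hypothetical.

```python
from itertools import combinations

# Hypothetical node properties: each person has a school and a birth year.
people = {
    "alice": {"school": "Lincoln High", "birth_year": 1990},
    "bob": {"school": "Lincoln High", "birth_year": 1990},
    "carol": {"school": "Lincoln High", "birth_year": 1991},
    "dave": {"school": "Lincoln High", "birth_year": 1975},
    "erin": {"school": "Roosevelt High", "birth_year": 1990},
}

# Relationships that already exist in the graph (undirected).
known_links = {frozenset(("alice", "bob"))}


def suggest_links(people, known_links, max_age_gap=1):
    """Suggest unconnected pairs who share a school and are close in age."""
    suggestions = []
    for a, b in combinations(sorted(people), 2):
        if frozenset((a, b)) in known_links:
            continue  # already connected, nothing to predict
        same_school = people[a]["school"] == people[b]["school"]
        age_gap = abs(people[a]["birth_year"] - people[b]["birth_year"])
        if same_school and age_gap <= max_age_gap:
            suggestions.append((a, b))
    return suggestions


suggestions = suggest_links(people, known_links)
print(suggestions)
```

A production system would typically score candidate pairs rather than apply a hard rule, but the idea is the same: shared connections and similar properties hint at a missing relationship.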

We need to be aware of and address many challenges as part of the knowledge graph’s care, feeding, and maintenance. For example, how do we handle mergers and acquisitions of companies, the rebranding or renaming of products and services, or other consolidation of entities (nodes)? There is no single correct way to address these challenges, but how you handle them is critical to the success of the solution.

You have created a robust and growing knowledge graph; now what? Here is where having data stored as a graph allows you to truly leverage all of your data and the relationships between the entities. In addition, this robust entity/relationship solution enables you to link unstructured data, such as documents, audio, and video, to your graph via ML, further expanding the power of your graph database.

Now let’s discuss deep learning. Deep learning is a branch of machine learning that uses multi-layered neural networks, loosely inspired by the human brain, to learn patterns from data. These types of systems help make data refining, autonomous vehicles, and voice assistants (Siri and Alexa) possible. Imagine you are receiving thousands of documents per day. It is your job to tag each document with the correct metadata, associate it with other similar documents, derive the sentiment of the article, and create new tags when needed. As humans, we can do this … for a little while before we get bored and make mistakes. If “trained” properly, computers can do this with high and consistent accuracy for both structured and unstructured data (documents, images, audio files, video files, etc.), and they can do it without breaks or fatigue. Not only can these deep learning algorithms complete the tagging and association tasks more consistently and efficiently, but they can also scale from thousands to hundreds of millions of records per day for the cost of additional CPU and memory.

Now that you have seen how machine learning and deep learning can enhance your knowledge graph, it is time to leverage that enhanced information by building predictive models. There are several things to remember when building your predictive models. The first is to beware of bias entering your models. Seven commonly cited types of bias can skew the results of your model: sample, exclusion, measurement, recall, observer, racial, and association.

Second, beware of overfitting and underfitting your models. Overfitting occurs when your model matches the training data so closely that it captures noise and fails to generalize to new data. To reduce overfitting, consider using an ensemble modeling method. Underfitting occurs when the model is too simplistic to represent the data, most likely because you left out parameters or did not effectively sample the data.
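To make the overfitting warning concrete, the sketch below compares a model that simply memorizes its training data with a plain straight-line fit. The data is invented for illustration (a trend of y = 2x with hand-picked noise), and the `memorizer` and `line` models are stand-ins chosen for brevity, not recommended techniques.

```python
# Hypothetical training data: the trend y = 2x plus hand-picked noise.
train_x = [0, 1, 2, 3, 4, 5, 6, 7]
noise = [1.5, -1.0, 0.5, -1.5, 1.0, -0.5, 1.5, -1.0]
train_y = [2 * x + n for x, n in zip(train_x, noise)]

# Held-out points that follow the underlying trend (an assumption of this sketch).
test_x = [0.4, 1.6, 3.4, 5.6, 6.6]
test_y = [2 * x for x in test_x]


def memorizer(x):
    """Overfit model: return the training answer of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train_x, train_y)


def line(x):
    """Simpler model: the fitted straight line."""
    return intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error of a model over a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


for name, model in [("memorizer", memorizer), ("line", line)]:
    print(name, "train:", mse(model, train_x, train_y),
          "held-out:", mse(model, test_x, test_y))
```

The memorizer scores perfectly on the data it has already seen and noticeably worse on the held-out points, while the line gives up a little training accuracy and generalizes better. Comparing training error with held-out error in this way is a common first check for overfitting.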

Another major thing to remember is to review the model regularly to make sure it is still valid. Often you will experience model drift, which starts slowly but can have a significant impact on the organization’s success. A recent example of model drift can be seen in the issues with Zillow Offers[1]. Because the models were not reviewed regularly enough, the algorithms started to drift, causing the model to overestimate home values and make poor purchase investments, which led to a substantial financial loss for the company.
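One simple way to review a model on a schedule is to compare its recent prediction error with the error measured when it was validated. The sketch below assumes made-up home prices (in thousands) and a hypothetical `tolerance` multiplier; real monitoring would use your own error metric and thresholds.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, ignoring direction."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def mean_signed_error(actual, predicted):
    """Average error with direction; positive means the model overestimates."""
    return mean(p - a for a, p in zip(actual, predicted))


def drift_detected(baseline_error, recent_error, tolerance=1.5):
    """Flag drift when recent error exceeds the baseline by the given factor."""
    return recent_error > baseline_error * tolerance


# Hypothetical sale prices and model estimates, in thousands of dollars.
baseline_actual = [300, 410, 250, 520]
baseline_predicted = [305, 400, 255, 510]
recent_actual = [310, 395, 260, 500]
recent_predicted = [340, 430, 290, 545]

baseline_mae = mean_absolute_error(baseline_actual, baseline_predicted)
recent_mae = mean_absolute_error(recent_actual, recent_predicted)
recent_bias = mean_signed_error(recent_actual, recent_predicted)

print("baseline MAE:", baseline_mae)
print("recent MAE:", recent_mae)
print("recent bias:", recent_bias)
print("drift detected:", drift_detected(baseline_mae, recent_mae))
```

Tracking the signed error alongside the absolute error shows not only that the model has degraded but also in which direction, for example a consistent tendency to overestimate.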

When creating your models, the final thing to remember is to choose the correct modeling technique, because datasets are unique and the questions you hope to answer vary widely. There are five main modeling techniques to choose from based on the model you are building: classification, clustering, forecasting, outlier, and time series. Let’s take a quick look at each one and see when we should use one over the others.

  • Classification models are used when you need to group data into a small number of categories. For example, this modeling technique is best used when finding a binary answer to a question.
  • Clustering models seek to find data clusters and tag them based on their attributes. Clustering is best used for targeted marketing campaigns.
  • Forecasting models use past performance and known factors about the future (weather, events, etc.) to predict future patterns. A restaurant chain may use this information to predict sales over a holiday weekend when it is expected to snow.
  • Outlier models focus on studying the anomalies in the data set. Credit card companies and banks use outlier models to detect fraudulent transactions.
  • Time-Series models use data points sequenced by time to make predictions. This modeling technique uses rich data history to account for seasonality and for one-time events that can skew the results.
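As one concrete instance from the list above, an outlier model for transaction amounts can be sketched with a modified z-score based on the median and the median absolute deviation (MAD). The transaction values are invented, the 3.5 cutoff is a commonly used rule of thumb rather than a requirement, and real fraud detection uses far richer features than the amount alone.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold.

    The modified z-score uses the median and MAD instead of the mean and
    standard deviation, so a single extreme value does not hide itself.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread around the median; nothing can be scored
    return [v for v in values if abs(0.6745 * (v - center) / mad) > threshold]


# Hypothetical card transactions in dollars; one is far from the rest.
amounts = [42, 38, 45, 40, 41, 39, 43, 950]

print(find_outliers(amounts))
```

The median-based score is used here because, in a sample this small, one extreme value inflates the standard deviation enough that an ordinary z-score cutoff of 3 would miss it.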

Because of the flexible schema of a graph database and the ability to have robust linkages and attributes associated with both nodes and relationships, graph databases provide a better data source for AI/ML solutions than relational databases. If possible, consider converting your relational database to a graph database to find those new connections in your data. Once you have a robust graph database solution, then utilize that newfound information to your advantage via predictive and prescriptive analytics.


[1] Inside Big Data, December 13, 2021 – The $500mm+ Debacle at Zillow Offers – What Went Wrong with the AI Models?


About the Author

Jim McHugh is the Vice President of National Intelligence Service – Emerging Markets Portfolio. Jim is responsible for the delivery of Analytics and Data Management to the Intelligence Community.