GraphStorm is a low-code enterprise graph machine learning (GML) framework for building, training, and deploying graph ML solutions on complex enterprise-scale graphs in days instead of months. With GraphStorm, you can build solutions that directly consider the structure of relationships or interactions between billions of entities that are inherently embedded in most real-world data, including fraud detection scenarios, recommendations, community detection, search/retrieval problems, and more.
Today, we released GraphStorm 0.3, adding native support for multi-task learning on graphs. Specifically, GraphStorm 0.3 allows you to define multiple training targets for different nodes and edges within a single training loop. In addition, GraphStorm 0.3 adds new APIs for customizing GraphStorm pipelines. Now, only 12 lines of code are required to implement a custom node classification training loop. To help you get started with the new APIs, we have published two example Jupyter notebooks, one for node classification and one for link prediction tasks. We also released a comprehensive study on joint training of language models (LMs) and graph neural networks (GNNs) on large-scale graphs with rich text features, using the Microsoft Academic Graph (MAG) dataset from the KDD 2024 paper. The study showcases the performance and scalability of GraphStorm on text-rich graphs, as well as best practices for configuring GML training loops for better performance and efficiency.
Native support for multi-task learning on graphs
Many enterprise applications have graph data associated with multiple tasks on different nodes and edges. For example, retail organizations want to perform fraud detection for both sellers and buyers. Scientific publishers want to find more related research to cite in their papers and need to select the right subject to make their publications discoverable. To better model such applications, our customers have asked us to support multi-task learning on graphs.
GraphStorm 0.3 supports graph multi-task learning for the six most common tasks: node classification, node regression, edge classification, edge regression, link prediction, and node feature reconstruction. You can specify the training targets through a YAML configuration file. For example, a scientific publisher can use a single YAML configuration to simultaneously define a paper subject classification task on paper nodes and a link prediction task on paper-citing-paper edges, covering both of the publisher use cases described above in one training loop, as sketched in the configuration below.
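The following is a minimal sketch of what such a multi-task configuration could look like. The multi_task_learning block and keys such as target_ntype, label_field, num_classes, train_etype, num_negative_edges, and task_weight follow the layout described in the GraphStorm multi-task learning documentation, but the exact key names, label fields, and values shown here are illustrative assumptions and should be verified against the documentation for your GraphStorm version.

```yaml
# Sketch of a GraphStorm multi-task YAML configuration (key names and values
# are assumptions; see the GraphStorm multi-task learning docs for the exact schema).
version: 1.0
gsf:
  multi_task_learning:
    # Task 1: classify the subject of paper nodes.
    - node_classification:
        target_ntype: "paper"
        label_field: "paper_subject"   # hypothetical label field
        num_classes: 10
        task_weight: 1.0               # relative weight in the joint loss
    # Task 2: link prediction on paper-citing-paper edges.
    - link_prediction:
        train_etype:
          - "paper,citing,paper"
        num_negative_edges: 4
        exclude_training_targets: true
        task_weight: 0.5
```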
For more information about how to perform graph multi-task learning using GraphStorm, see Multi-task Learning with GraphStorm in the documentation.
New API for customizing GraphStorm pipelines and components
Since GraphStorm was released in early 2023, customers have primarily used the command-line interface (CLI), which abstracts the complexities of graph ML pipelines and allows you to quickly build, train, and deploy models using common recipes. However, customers have told us they want an interface that lets them more easily customize GraphStorm training and inference pipelines to their specific requirements. Based on customer feedback on the experimental API we released in GraphStorm 0.2, GraphStorm 0.3 introduces a refactored graph ML pipeline API. With the new API, only 12 lines of code are needed to define a custom node classification training pipeline, as shown in the following example:
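The example below sketches what such a pipeline looks like with the new API. The class names (gs.dataloading.GSgnnData, gs.dataloading.GSgnnNodeDataLoader, gs.eval.GSgnnClassificationEvaluator, and gs.trainer.GSgnnNodePredictionTrainer) follow the GraphStorm 0.3 Python API, but the argument names, file paths, feature and label fields, and the RgcnNCModel helper class (defined in the example notebooks) are illustrative assumptions; see the published notebooks for the exact 12-line version.

```python
import graphstorm as gs

gs.initialize()  # set up the GraphStorm (distributed) runtime

# Load a partitioned GraphStorm dataset (path and field names are placeholders).
data = gs.dataloading.GSgnnData(part_config="./acm_gs_1p/acm.json")

# Mini-batch dataloader over the training nodes of type "paper".
train_loader = gs.dataloading.GSgnnNodeDataLoader(
    dataset=data,
    target_idx=data.get_node_train_set(ntypes=["paper"]),
    node_feats="feat", label_field="label",
    fanout=[20, 20], batch_size=64, train_task=True)

# A small RGCN node classification model; RgcnNCModel is a hypothetical helper
# standing in for the model class defined in the GraphStorm example notebooks.
model = RgcnNCModel(g=data.g, num_hid_layers=2, hid_size=128, num_classes=14)

# Trainer and evaluator drive the custom node classification training loop.
trainer = gs.trainer.GSgnnNodePredictionTrainer(model)
trainer.setup_evaluator(gs.eval.GSgnnClassificationEvaluator(eval_frequency=100))
trainer.setup_device(gs.utils.get_device())

trainer.fit(train_loader=train_loader, num_epochs=5, save_model_path="./ckpt")
```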
To help you get started with the new API, we’ve also released new Jupyter notebook examples on our documentation and tutorials page.
A comprehensive study of LM+GNN for large graphs with rich text features
Many enterprise applications have graphs with text features. For example, in a retail search application, shopping log data provides insights into how text-rich product descriptions, search queries, and customer behavior are related. Large language models (LLMs) alone are not suitable for modeling such data, because the distributions and relationships in the underlying data do not match what LLMs have learned from their pre-training corpora. GML, on the other hand, is ideal for modeling related data (graphs), but until now GML practitioners have had to manually combine GML models with LLMs to model text features and get the best performance for their use cases. This manual effort is difficult and time-consuming, especially when the underlying graph dataset is large.
In GraphStorm 0.2, we introduced built-in techniques for efficiently training Language Models (LM) and GNN models at scale on large text-rich graphs. Since then, customers have been asking for guidance on how to use GraphStorm’s LM+GNN techniques to optimize performance. To address this, in GraphStorm 0.3, we released LM+GNN benchmarks on two standard graph ML tasks (node classification and link prediction) using Microsoft Academic Graph (MAG), a large-scale graph dataset. The graph dataset is a heterogeneous graph, containing hundreds of millions of nodes and billions of edges, with most nodes assigned rich text features. Detailed statistics of the dataset are provided in the following table.
Dataset | Number of nodes | Number of edges | Number of node/edge types | Number of nodes in the NC training set | Number of edges in the LP training set | Number of nodes with text features |
MAG | 484,511,504 | 7,520,311,838 | 4/4 | 28,679,392 | 1,313,781,772 | 240,955,156 |
GraphStorm benchmarks two major LM+GNN methods: pre-trained BERT+GNN, a widely adopted baseline, and fine-tuned BERT+GNN, introduced by the GraphStorm developers in 2022. In the pre-trained BERT+GNN method, we first use a pre-trained BERT model to compute embeddings for node text features, and then train a GNN model for prediction. In the fine-tuned BERT+GNN method, we first fine-tune a BERT model on the graph data, then use the fine-tuned BERT model to compute embeddings that are in turn used to train a GNN model for prediction. GraphStorm offers different ways to fine-tune a BERT model depending on the task type: for node classification, we fine-tune the BERT model on the training set with the node classification task; for link prediction, we fine-tune it with the link prediction task. In our experiments, we use eight r5.24xlarge instances for data processing and four g5.48xlarge instances for model training and inference. The fine-tuned BERT+GNN approach achieves up to 40% better performance (link prediction on MAG) compared to pre-trained BERT+GNN.
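As a rough illustration, the snippet below sketches how the language model portion of a GraphStorm configuration can be expressed for the fine-tuned BERT+GNN setup. The node_lm_models block and options such as lm_type, model_name, node_types, lm_train_nodes, and lm_tune_lr are based on the GraphStorm language model documentation, but should be treated as assumptions and checked against your GraphStorm version; the HuggingFace checkpoint and node type are placeholders.

```yaml
# Sketch of the LM-related portion of a GraphStorm YAML config (keys are
# assumptions; see the GraphStorm language model docs for the exact schema).
lm_model:
  node_lm_models:
    -
      lm_type: bert
      model_name: "bert-base-uncased"   # placeholder HuggingFace checkpoint
      gradient_checkpoint: true
      node_types:
        - paper                         # node type carrying text features
hyperparam:
  lm_train_nodes: 10      # nodes sampled per batch to fine-tune the LM
  lm_tune_lr: 0.0001      # learning rate for the LM fine-tuning stage
```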
The following table shows the model performance of the two methods and the overall computation time of the entire pipeline starting from data processing and graph construction. NC means node classification, LP means link prediction. LM time cost means the time spent on computing BERT embeddings and fine-tuning the BERT model for pre-trained BERT+GNN and fine-tuned BERT+GNN, respectively.
Dataset | Task | Data processing time | Target | Pre-trained BERT+GNN: LM time cost | Pre-trained BERT+GNN: one-epoch time | Pre-trained BERT+GNN: metric | Fine-tuned BERT+GNN: LM time cost | Fine-tuned BERT+GNN: one-epoch time | Fine-tuned BERT+GNN: metric |
MAG | NC | 553 min | Paper subject | 206 min | 135 min | Accuracy: 0.572 | 1423 min | 137 min | Accuracy: 0.633 |
MAG | LP | 553 min | Cite | 198 min | 2195 min | MRR: 0.487 | 4508 min | 2172 min | MRR: 0.684 |
We also performed benchmarking on large synthetic graphs to demonstrate the scalability of GraphStorm. We generated three synthetic graphs with 1 billion, 10 billion, and 100 billion edges; the corresponding training set sizes are 8 million, 80 million, and 800 million, respectively. The following table shows the computation time for graph preprocessing, graph partitioning, and model training. Overall, GraphStorm enables graph construction and model training on graphs at the 100-billion-edge scale within a few hours.
Graph size | Data preprocessing: # instances | Data preprocessing: time | Graph partitioning: # instances | Graph partitioning: time | Model training: # instances | Model training: time |
1B | 4 | 19 min | 4 | 8 min | 4 | 1.5 min |
10B | 8 | 31 min | 8 | 41 min | 8 | 8 min |
100B | 16 | 61 min | 16 | 416 min | 16 | 50 min |
For benchmark details and results, see the KDD 2024 paper.
Conclusion
GraphStorm 0.3 is released under the Apache-2.0 license to help you tackle large-scale graph ML challenges, and now provides native support for multi-task learning and new APIs for customizing pipelines and other components of GraphStorm. To get started, see the GraphStorm GitHub repository and documentation.
About the Authors
Xiang Song is a Senior Applied Scientist with AWS AI Research and Education (AIRE), where he develops deep learning frameworks including GraphStorm, DGL, and DGL-KE. He led the development of Amazon Neptune ML, a feature of Neptune that uses graph neural networks on graphs stored in graph databases. He currently leads the development of GraphStorm, an open source graph machine learning framework for enterprise use cases. He received his PhD in Computer Systems and Architecture from Fudan University, Shanghai, in 2014.
Jian Zhang is a Senior Applied Scientist who has used machine learning techniques to help customers solve various problems, such as fraud detection and decorated image generation. He has successfully developed graph-based machine learning solutions, especially with graph neural networks, for customers in China, the US, and Singapore. As an evangelist for AWS graph capabilities, Zhang has given many public presentations on GNNs, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.
Florian Saupe is a Principal Technical Product Manager for AWS AI/ML Research, supporting science teams such as the graph machine learning group and ML systems teams working on large-scale distributed training, inference, and fault tolerance. Prior to joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems/robotics scientist, a field in which he holds a PhD.