GraphStorm is a low-code enterprise graph machine learning (GML) framework for building, training, and deploying graph ML solutions on complex enterprise-scale graphs in days instead of months. With GraphStorm, you can build solutions that directly consider the structure of relationships or interactions between billions of entities that are inherently embedded in most real-world data, including fraud detection scenarios, recommendations, community detection, search/retrieval problems, and more.
Today, we released GraphStorm 0.3, adding native support for multi-task learning on graphs. Specifically, GraphStorm 0.3 allows you to define multiple training targets for different nodes and edges within a single training loop. In addition, GraphStorm 0.3 adds new APIs for customizing GraphStorm pipelines. Now, only 12 lines of code are required to implement a custom node classification training loop. To help you get started with the new APIs, we have published two example Jupyter notebooks, one for node classification and one for link prediction tasks. We also released a comprehensive study on joint training of language models (LMs) and graph neural networks (GNNs) on large-scale graphs with rich text features, using the Microsoft Academic Graph (MAG) dataset from the KDD 2024 paper. The study showcases the performance and scalability of GraphStorm on text-rich graphs, as well as best practices for configuring GML training loops for better performance and efficiency.
Native support for multi-task learning on graphs
Many enterprise applications have graph data associated with multiple tasks on different nodes and edges. For example, retail organizations want to perform fraud detection for both sellers and buyers. Scientific publishers want to find more related research to cite in their papers and need to select the right subject to make their publications discoverable. To better model such applications, our customers have asked us to support multi-task learning on graphs.
GraphStorm 0.3 supports graph multi-task learning for the six most common tasks: node classification, node regression, edge classification, edge regression, link prediction, and node feature reconstruction. You can specify the training targets through a YAML configuration file. For example, a scientific publisher can use a single YAML configuration to simultaneously define a paper subject classification task on paper nodes and a link prediction task on paper-citing-paper edges, covering both of the publisher use cases described above in one training loop, as sketched in the configuration below.
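The following is a minimal sketch of what such a multi-task configuration could look like. The multi_task_learning block and keys such as target_ntype, label_field, num_classes, train_etype, num_negative_edges, and task_weight follow the layout described in the GraphStorm multi-task learning documentation, but the exact key names, label fields, and values shown here are illustrative assumptions and should be verified against the documentation for your GraphStorm version.

```yaml
# Sketch of a GraphStorm multi-task YAML configuration (key names and values
# are assumptions; see the GraphStorm multi-task learning docs for the exact schema).
version: 1.0
gsf:
  multi_task_learning:
    # Task 1: classify the subject of paper nodes.
    - node_classification:
        target_ntype: "paper"
        label_field: "paper_subject"   # hypothetical label field
        num_classes: 10
        task_weight: 1.0               # relative weight in the joint loss
    # Task 2: link prediction on paper-citing-paper edges.
    - link_prediction:
        train_etype:
          - "paper,citing,paper"
        num_negative_edges: 4
        exclude_training_targets: true
        task_weight: 0.5
```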
For more information about how to perform graph multi-task learning using GraphStorm, see Multi-task Learning with GraphStorm in the documentation.
New API for customizing GraphStorm pipelines and components
Since GraphStorm was released in early 2023, customers have primarily used the command-line interface (CLI), which abstracts the complexities of graph ML pipelines and allows you to quickly build, train, and deploy models using common recipes. However, customers have told us they want an interface that lets them more easily customize GraphStorm training and inference pipelines to their specific requirements. Based on customer feedback on the experimental API we released in GraphStorm 0.2, GraphStorm 0.3 introduces a refactored graph ML pipeline API. With the new API, only 12 lines of code are needed to define a custom node classification training pipeline, as shown in the following example:
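The example below sketches what such a pipeline looks like with the new API. The class names (gs.dataloading.GSgnnData, gs.dataloading.GSgnnNodeDataLoader, gs.eval.GSgnnClassificationEvaluator, and gs.trainer.GSgnnNodePredictionTrainer) follow the GraphStorm 0.3 Python API, but the argument names, file paths, feature and label fields, and the RgcnNCModel helper class (defined in the example notebooks) are illustrative assumptions; see the published notebooks for the exact 12-line version.

```python
import graphstorm as gs

gs.initialize()  # set up the GraphStorm (distributed) runtime

# Load a partitioned GraphStorm dataset (path and field names are placeholders).
data = gs.dataloading.GSgnnData(part_config="./acm_gs_1p/acm.json")

# Mini-batch dataloader over the training nodes of type "paper".
train_loader = gs.dataloading.GSgnnNodeDataLoader(
    dataset=data,
    target_idx=data.get_node_train_set(ntypes=["paper"]),
    node_feats="feat", label_field="label",
    fanout=[20, 20], batch_size=64, train_task=True)

# A small RGCN node classification model; RgcnNCModel is a hypothetical helper
# standing in for the model class defined in the GraphStorm example notebooks.
model = RgcnNCModel(g=data.g, num_hid_layers=2, hid_size=128, num_classes=14)

# Trainer and evaluator drive the custom node classification training loop.
trainer = gs.trainer.GSgnnNodePredictionTrainer(model)
trainer.setup_evaluator(gs.eval.GSgnnClassificationEvaluator(eval_frequency=100))
trainer.setup_device(gs.utils.get_device())

trainer.fit(train_loader=train_loader, num_epochs=5, save_model_path="./ckpt")
```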
To help you get started with the new API, we’ve also released new Jupyter notebook examples on our documentation and tutorials page.
A comprehensive study of LM+GNN for large graphs with rich text features
Many enterprise applications have graphs with text features. For example, in a retail search application, shopping log data provides insights into how text-rich product descriptions, search queries, and customer behavior are related. Large language models (LLMs) alone are not suitable for modeling such data, because the distributions and relationships in the underlying data do not match what LLMs have learned from their pre-training corpora. GML, on the other hand, is ideal for modeling related data (graphs), but until now GML practitioners have had to manually combine GML models with LLMs to model text features and get the best performance for their use cases. This manual effort is difficult and time-consuming, especially when the underlying graph dataset is large.
In GraphStorm 0.2, we introduced built-in techniques for efficiently training Language Models (LM) and GNN models at scale on large text-rich graphs. Since then, customers have been asking for guidance on how to use GraphStorm’s LM+GNN techniques to optimize performance. To address this, in GraphStorm 0.3, we released LM+GNN benchmarks on two standard graph ML tasks (node classification and link prediction) using Microsoft Academic Graph (MAG), a large-scale graph dataset. The graph dataset is a heterogeneous graph, containing hundreds of millions of nodes and billions of edges, with most nodes assigned rich text features. Detailed statistics of the dataset are provided in the following table.
Dataset | Number of nodes | Number of edges | Number of node/edge types | Number of nodes in the NC training set | Number of edges in the LP training set | Number of nodes with text features |
MAG | 484,511,504 | 7,520,311,838 | 4/4 | 28,679,392 | 1,313,781,772 | 240,955,156 |
GraphStorm benchmarks two major LM+GNN methods: pre-trained BERT+GNN, a widely adopted baseline, and fine-tuned BERT+GNN, introduced by the GraphStorm developers in 2022. In the pre-trained BERT+GNN method, we first use a pre-trained BERT model to compute embeddings for node text features, and then train a GNN model for prediction. In the fine-tuned BERT+GNN method, we first fine-tune a BERT model on the graph data, then use the fine-tuned BERT model to compute embeddings that are in turn used to train a GNN model for prediction. GraphStorm offers different ways to fine-tune a BERT model depending on the task type: for node classification, we fine-tune the BERT model on the training set with the node classification task; for link prediction, we fine-tune it with the link prediction task. In our experiments, we use eight r5.24xlarge instances for data processing and four g5.48xlarge instances for model training and inference. The fine-tuned BERT+GNN approach achieves up to 40% better performance (link prediction on MAG) compared to pre-trained BERT+GNN.
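As a rough illustration, the snippet below sketches how the language model portion of a GraphStorm configuration can be expressed for the fine-tuned BERT+GNN setup. The node_lm_models block and options such as lm_type, model_name, node_types, lm_train_nodes, and lm_tune_lr are based on the GraphStorm language model documentation, but should be treated as assumptions and checked against your GraphStorm version; the HuggingFace checkpoint and node type are placeholders.

```yaml
# Sketch of the LM-related portion of a GraphStorm YAML config (keys are
# assumptions; see the GraphStorm language model docs for the exact schema).
lm_model:
  node_lm_models:
    -
      lm_type: bert
      model_name: "bert-base-uncased"   # placeholder HuggingFace checkpoint
      gradient_checkpoint: true
      node_types:
        - paper                         # node type carrying text features
hyperparam:
  lm_train_nodes: 10      # nodes sampled per batch to fine-tune the LM
  lm_tune_lr: 0.0001      # learning rate for the LM fine-tuning stage
```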
The following table shows the model performance of the two methods and the overall computation time of the entire pipeline starting from data processing and graph construction. NC means node classification, LP means link prediction. LM time cost means the time spent on computing BERT embeddings and fine-tuning the BERT model for pre-trained BERT+GNN and fine-tuned BERT+GNN, respectively.
Dataset | Task | Data processing time | Target | Pre-trained BERT+GNN: LM time cost | Pre-trained BERT+GNN: one-epoch time | Pre-trained BERT+GNN: metric | Fine-tuned BERT+GNN: LM time cost | Fine-tuned BERT+GNN: one-epoch time | Fine-tuned BERT+GNN: metric |
MAG | NC | 553 min | Paper subject | 206 min | 135 min | Accuracy: 0.572 | 1423 min | 137 min | Accuracy: 0.633 |
MAG | LP | 553 min | Cite | 198 min | 2195 min | MRR: 0.487 | 4508 min | 2172 min | MRR: 0.684 |
We also performed benchmarking on large synthetic graphs to demonstrate the scalability of GraphStorm. We generated three synthetic graphs with 1 billion, 10 billion, and 100 billion edges; the corresponding training set sizes are 8 million, 80 million, and 800 million, respectively. The following table shows the computation time for graph preprocessing, graph partitioning, and model training. Overall, GraphStorm enables graph construction and model training on graphs at the 100-billion-edge scale within a few hours.
Graph size | Data preprocessing: # instances | Data preprocessing: time | Graph partitioning: # instances | Graph partitioning: time | Model training: # instances | Model training: time |
1B | 4 | 19 min | 4 | 8 min | 4 | 1.5 min |
10B | 8 | 31 min | 8 | 41 min | 8 | 8 min |
100B | 16 | 61 min | 16 | 416 min | 16 | 50 min |
For benchmark details and results, see the KDD 2024 paper.
Conclusion
GraphStorm 0.3 is released under the Apache-2.0 license to help you tackle large-scale graph ML challenges, and now provides native support for multi-task learning and new APIs for customizing pipelines and other components of GraphStorm. To get started, see the GraphStorm GitHub repository and documentation.
About the Authors
Xiang Song is a Senior Applied Scientist with AWS AI Research and Education (AIRE), where he develops deep learning frameworks including GraphStorm, DGL, and DGL-KE. He led the development of Amazon Neptune ML, a feature of Neptune that uses graph neural networks on graphs stored in graph databases. He currently leads the development of GraphStorm, an open source graph machine learning framework for enterprise use cases. He received his PhD in Computer Systems and Architecture from Fudan University, Shanghai, in 2014.
Jian Zhang is a Senior Applied Scientist who has used machine learning techniques to help customers solve various problems, such as fraud detection and decorated image generation. He has successfully developed graph-based machine learning solutions, especially with graph neural networks, for customers in China, the US, and Singapore. As an evangelist for AWS graph capabilities, Zhang has given many public presentations on GNNs, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.
Florian Saupe is a Principal Technical Product Manager for AWS AI/ML Research, supporting science teams such as the graph machine learning group and ML systems teams working on large-scale distributed training, inference, and fault tolerance. Prior to joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems/robotics scientist, a field in which he holds a PhD.