The de facto models for training representations on graph-structured data are graph neural networks (GNNs). As a result, they began to be implemented in production systems. These models pass messages along the edges in the given graph with nonlinearities between different layers, updating node embeddings iteratively. Computed node embeddings for l layers include details of the l-hop neighborhood of the seed vertex. GNN models must be trained on billions of graphs to be used in production. Even on distributed systems, training these models can take days or even weeks. Although more difficult in this situation, minibatch training of GNNs is more efficient than using deep neural networks (DNNs) in general.

Graphs in the real world usually have a minimum diameter. The neighborhood explosion phenomenon (NEP), also known as the l-hop neighborhood, can very well span the entire graph if l is large. When there are l layers, this dependence spans the node’s l-hop neighborhood, since the embeddings of nodes in a GNN depend recursively on their set of neighbor embeddings. The researchers proposed sampling from a subgraph of the l-hop neighborhood of the batch to deal with these problems. Node-based, layer-based, and subgraph-based methods are the three main types of strategies. Node-based sampling techniques take separate iterative samples from each node.

Node-based methods were found to sample subgraphs that were too shallow or with a small number of edges to nodes. So, layer-based sampling techniques were proposed, where sampling is done collectively for each layer. In contrast, subgraph sampling methods typically use the same subgraph for all layers, rather than the layer-by-layer recursive sampling scheme used in node- and layer-based sampling methods. While some of these sampling techniques cache historical embeddings to reduce the variance of estimated embeddings, others consider embedding magnitudes.

There are techniques for selecting popular vertices from a vertex cache. Most of these methods can be combined with other sampling algorithms and are orthogonal. NEP affects node-based sampling methods the most. Still, they guarantee a good approximation for any embedding by ensuring that each vertex has k neighbors, the only hyperparameter of the sampling algorithm that minimizes the NEP. Since the number of vertices selected is a hyperparameter, layer-based sampling methods do not suffer as much from NEP. However, they cannot guarantee that any vertex approximation is sufficient, and their hyperparameters are difficult to reason about. The number of sampling nodes at each layer depends drastically on the structure of the graph. Methods that use subgraph sampling typically have higher bias rates than their node- and layer-based equivalents. As a result, they concentrate on node-based and layer-based sampling techniques in this paper and combine their advantages.

The following is a list of the main contributions of this work:

  • Using Poisson Sampling, they propose a new algorithm called LABOR that combines the advantages of neighbor and stratum sampling approaches. As a result of correlating the sampling techniques of the given seed nodes, a significant reduction in computation, memory and communication is achieved for the selected vertices from different seeds. Additionally, LABOR can be used as a surrogate because it shares the same hyperparameters as neighbor sampling.
  • They demonstrate through experimental verification of their results that their proposed sampling algorithm, LABOR, outperforms neighbor and layer sampling approaches.

The code implementation of this concept can be found as a pending pull request to the famous Deep Graph library on GitHub.

This Article is written as a research summary article by Marktechpost Staff based on the research paper '(LA)YER-NEIGH(BOR) SAMPLING: DEFUSING NEIGHBORHOOD EXPLOSION IN GNNS'. All Credit For This Research Goes To Researchers on This Project. Check out the paper and github link.
Please Don't Forget To Join Our ML Subreddit

Aneesh Tickoo is a Consultant Intern at MarktechPost. He is currently pursuing a BSc in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing and he is passionate about building solutions around it. He likes to connect with people and collaborate on interesting projects.

Researchers at Georgia Tech Propose ‘LABOR’ (LAyer-neighBOR sampling), A New Sampling Algorithm-Based on Machine Learning