Graph Attention Networks Using Intel® Optimization for TensorFlow* on Intel® Xeon® Scalable Processors

Overview

Much of the world's information is represented by graphs. Uses of graphs include social networks, brain connectivity networks (sMRI), protein structures, molecular or chemical graphs, financial transactions, and so on.

There are real-life examples of applying statistical machine learning to graph structures. Recent research has focused on applying deep learning to graphs, for example Graph Convolutional Networks (GCNs) and Graph Recurrent Networks (GRNs). There are challenges in applying convolutional neural networks, which were defined for images, to graphs: convolutions assume compositionality. A basic assumption when applying ConvNets to graph-structured data is that non-Euclidean data is locally stationary and manifests hierarchical structures.
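
To make the graph convolution idea concrete, a single GCN layer can be written as a normalized sparse-matrix product. The sketch below follows the widely used propagation rule of Kipf and Welling; it is a minimal NumPy/SciPy illustration under that assumption, not the model evaluated in these experiments, and the function name gcn_layer is hypothetical.

    import numpy as np
    import scipy.sparse as sp

    def gcn_layer(adj, features, weights):
        # One graph convolution: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
        # Add self-loops so every node retains its own features.
        adj_hat = adj + sp.eye(adj.shape[0])
        # Symmetric degree normalization.
        deg = np.asarray(adj_hat.sum(axis=1)).flatten()
        d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
        adj_norm = d_inv_sqrt @ adj_hat @ d_inv_sqrt
        # Aggregate neighbor features, project, and apply ReLU.
        return np.maximum(adj_norm @ features @ weights, 0.0)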

The adjacency matrix of a graph can be very large for real-world problems. A graph is sparse or dense depending on the number of edges relative to the number of vertices, and in many practical cases the adjacency matrix is sparse. The memory required to load such a large, sparse adjacency matrix can run into terabytes for some real-world problems. The input data is formed from a graph consisting of nodes and edges, and its sparsity leads to inefficiencies on systems with limited memory (less than 32 GB). A set of experiments was conducted to evaluate the performance of CPUs for training on such sparse data and to identify ideal settings for the training workload. The recommendations from these experiments are documented below as general guidance for sparse workloads.
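
As an illustration of how such a matrix can be kept sparse in memory, the sketch below builds an adjacency matrix from a hypothetical edge list in SciPy's coordinate format and hands it to TensorFlow* as a tf.SparseTensor, so storage grows with the number of edges rather than the square of the number of nodes.

    import numpy as np
    import scipy.sparse as sp
    import tensorflow as tf

    # Hypothetical edge list for a small undirected graph with 4 nodes.
    edges = np.array([[0, 1], [1, 2], [2, 3], [3, 0]])
    num_nodes = 4

    # Store only the non-zero entries (one per edge) instead of the
    # full num_nodes x num_nodes dense matrix.
    adj = sp.coo_matrix(
        (np.ones(len(edges), dtype=np.float32), (edges[:, 0], edges[:, 1])),
        shape=(num_nodes, num_nodes))
    adj = (adj + adj.T).tocoo()  # symmetrize for an undirected graph

    # Feed the matrix to TensorFlow* without densifying it.
    indices = np.stack([adj.row, adj.col], axis=1).astype(np.int64)
    adj_tf = tf.SparseTensor(indices, adj.data, dense_shape=adj.shape)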

The neural networks are trained on private or proprietary datasets to increase the accuracy of predictions.

Hardware Configuration

The following is the hardware configuration used for comparison:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel® Xeon® Platinum 8153 processor @ 2.00GHz
Stepping: 4
CPU MHz: 1100.312
BogoMIPS: 4000
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 22528K

Software Used

The following is the software configuration used:

Python* Version: 3.6
TensorFlow* Version: 1.10

Steps to Achieve Maximum Performance

  1. Intel® Distribution for Python* along with TensorFlow* helps achieve maximum performance on the CPU. To activate the Intel® Distribution for Python*, follow the steps provided in Reference 1.
  2. The general guidance (Reference 2) of setting inter_op_parallelism_threads equal to the number of sockets does not appear to hold for sparse workloads. A value of 8 gave the best training time on a two-socket system. Multiple configurations were tried before arriving at the optimal value for this specific workload (the combined settings are sketched after this list).
  3. The values of OMP_NUM_THREADS and intra_op_parallelism_threads were set to the number of physical cores, as per the general guidance. It was observed that setting inter_op_parallelism_threads and OMP_NUM_THREADS to the same value yields better results, which is a deviation from the general guidance.
  4. A block time (KMP_BLOCKTIME) of 2 performed better for this workload.
  5. A source build of TensorFlow* provides a better speedup than a conda install.
  6. Applying numactl --interleave=all provided a further speedup for the current workload.
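
The sketch below shows how these settings could be combined, assuming the two-socket, 32-physical-core system described above and a hypothetical training script named train.py; the thread counts are the empirically chosen values from this workload and may need retuning for others.

    import os

    # Set the OpenMP variables before TensorFlow* is imported (steps 3 and 4).
    os.environ["OMP_NUM_THREADS"] = "32"  # number of physical cores
    os.environ["KMP_BLOCKTIME"] = "2"     # block time that worked best here

    import tensorflow as tf

    # Thread-pool sizes from steps 2 and 3.
    config = tf.ConfigProto(
        inter_op_parallelism_threads=8,   # empirically better than sockets=2
        intra_op_parallelism_threads=32)  # number of physical cores

    with tf.Session(config=config) as sess:
        pass  # build and train the model here

    # Step 6: launch with interleaved NUMA memory allocation, for example:
    #   numactl --interleave=all python train.py

Exporting OMP_NUM_THREADS and KMP_BLOCKTIME in the shell before launching the script is equivalent to setting them in os.environ, as long as they are in place before TensorFlow* initializes its thread pools.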

Results

Figure 1. Training Time (Intel® Xeon® E5-2698 v3 with GPU vs. Intel® Xeon® Platinum)

Figure 2. Training Time (Intel® Xeon® E5-2686 v4 with GPU vs. Intel® Xeon® Platinum)

Conclusion

The adjacency matrix of real-world graphs is typically sparse. CPU systems without discrete hardware accelerators provide competitive training times for graph convolutions.

References

  1. Using Intel® Distribution for Python with Anaconda
  2. Optimizing TensorFlow* for CPU
  3. Convolutional Neural Networks on Graphs
For more complete information about compiler optimizations, see our Optimization Notice.