Profiling Tensorflow* workloads with Intel® VTune™ Amplifier

Machine learning applications are compute intensive by nature, so performance optimization matters a great deal for them. One of the most popular libraries, Tensorflow*, already has an embedded timeline feature that helps you understand which parts of the computational graph cause bottlenecks, but it lacks advanced features such as architectural analysis. This short tutorial shows how to combine the data provided by Tensorflow.timeline with the options available in Intel® VTune™ Amplifier, one of the most powerful performance profilers for Intel Architecture.
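For context, the .json traces used in this tutorial have to be written by the Tensorflow application itself. The following is a minimal sketch of one way to do that, assuming the Tensorflow 1.x session API (tf.RunOptions, tf.RunMetadata, and the tensorflow.python.client.timeline module). The helper names trace_path and profile_step and the file naming scheme are hypothetical and are not part of the original setup.

```python
import os


def trace_path(json_dir, step):
    # Hypothetical naming scheme: one .json file per profiled step.
    return os.path.join(json_dir, "timeline_step_%d.json" % step)


def profile_step(sess, fetches, json_dir, step, feed_dict=None):
    """Run one traced session step and save its timeline as Trace Event JSON."""
    # Imported here so that this module loads even without Tensorflow installed.
    # Assumes Tensorflow 1.x; under 2.x the equivalents live in tf.compat.v1.
    import tensorflow as tf
    from tensorflow.python.client import timeline

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    result = sess.run(fetches, feed_dict=feed_dict,
                      options=run_options, run_metadata=run_metadata)

    # Convert the collected step statistics to Trace Event Format.
    trace = timeline.Timeline(run_metadata.step_stats)
    with open(trace_path(json_dir, step), "w") as f:
        f.write(trace.generate_chrome_trace_format())
    return result
```

Calling profile_step for a few steps of the training loop while the collection is running leaves the .json files in a directory that can then be handed to the custom collector.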

Tensorflow.timeline generates data in the Trace Event Format, which VTune Amplifier cannot consume directly but which can be converted to a .csv format that it supports. We will do this conversion at the end of the collection with the help of a custom collector script, listed below:

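#! /bin/sh
# Custom collector: converts Tensorflow timeline .json traces to .csv
# when the VTune Amplifier collection stops.

if [ "$#" -ne 1 ]; then
    echo "Usage: collect.sh json_dir"
    exit 1
fi

JSON_DIR=$1

case "$AMPLXE_COLLECT_CMD" in
"start")
    # Remove stale traces left over from a previous run.
    rm -rf "$JSON_DIR"/*.json
    ;;

"stop")
    for f in "$JSON_DIR"/*.json
    do
        # Skip the unexpanded pattern when the directory has no .json files.
        [ -e "$f" ] || continue
        python "$(dirname "$0")/convert.py" "$f" "$AMPLXE_HOSTNAME" "$AMPLXE_DATA_DIR"
    done
    ;;

"pause")
    ;;
"resume")
    ;;

*)
    echo "unexpected value of AMPLXE_COLLECT_CMD"
    ;;
esac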
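Before wiring the script into a project, it can be handy to exercise it by hand. The sketch below mimics a single call into the collector by setting the three environment variables that collect.sh reads (AMPLXE_COLLECT_CMD, AMPLXE_HOSTNAME, AMPLXE_DATA_DIR). It assumes these are the only variables that matter for the script; VTune Amplifier may set others. The function names collector_invocation and dry_run_stop are hypothetical.

```python
import os
import subprocess


def collector_invocation(script, json_dir, command, hostname, data_dir):
    """Return (argv, env) imitating one call into the custom collector."""
    env = dict(os.environ)
    env.update({
        "AMPLXE_COLLECT_CMD": command,   # "start", "stop", "pause" or "resume"
        "AMPLXE_HOSTNAME": hostname,     # ends up in the .csv file name
        "AMPLXE_DATA_DIR": data_dir,     # where the .csv files are written
    })
    return ["sh", script, json_dir], env


def dry_run_stop(script, json_dir, data_dir, hostname="localhost"):
    # Only "stop" is exercised: "start" would delete the .json files
    # that the conversion is supposed to pick up.
    argv, env = collector_invocation(script, json_dir, "stop", hostname, data_dir)
    subprocess.run(argv, env=env, check=True)
```

After dry_run_stop returns, the output directory should contain one .csv file per .json file found in json_dir.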

 

This script uses a helper Python* script, convert.py, shown here:

#!/usr/bin/env python
"""Convert a Trace Event Format .json file to a .csv file for VTune Amplifier."""

import sys
import json
import os
import datetime

EPOCH = datetime.datetime(1970, 1, 1)

def convertTime(t):
    # Timestamps are in microseconds and are treated as UTC time since the epoch.
    # Integer arithmetic avoids the rounding of a float-based conversion.
    return EPOCH + datetime.timedelta(microseconds=t)

if len(sys.argv) < 4:
    print("Usage: convert.py input_file.json host output_dir")
    sys.exit(1)

fnInp = sys.argv[1]
host = sys.argv[2]
outPath = sys.argv[3]

# Output name: <input base name>-hostname-<host>.csv in the output directory
fnOut = os.path.splitext(os.path.basename(fnInp))[0]
fnOut = os.path.join(outPath, fnOut + '-hostname-' + host + '.csv')

with open(fnInp, 'r') as fInp:
    trace = json.load(fInp)

with open(fnOut, 'w') as fOut:
    fOut.write('name,start_tsc.UTC,end_tsc,pid,tid\n')
    for event in trace['traceEvents']:
        # Only complete events ('X') carry both a start time and a duration.
        if event.get('ph') == 'X':
            t = int(event['ts'])
            tbUtc = convertTime(t)
            teUtc = convertTime(t + int(event['dur']))
            # The pid and tid columns are left empty.
            fOut.write('%s,%s,%s,,\n' % (event['name'], tbUtc, teUtc))
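To make the conversion concrete, here is a small in-memory sketch that applies the same logic to a made-up trace. The sample event (its name, timestamp, and duration) is invented for illustration, and the helper names to_utc and trace_to_csv_lines are hypothetical. Timestamps are converted with integer timedelta arithmetic, which is an assumption of this sketch.

```python
import datetime
import json

EPOCH = datetime.datetime(1970, 1, 1)


def to_utc(microseconds):
    # Trace Event timestamps are microseconds; interpret them as UTC since the epoch.
    return EPOCH + datetime.timedelta(microseconds=microseconds)


def trace_to_csv_lines(trace):
    lines = ["name,start_tsc.UTC,end_tsc,pid,tid"]
    for event in trace["traceEvents"]:
        if event.get("ph") != "X":
            continue  # metadata and other non-duration events are skipped
        start = int(event["ts"])
        end = start + int(event["dur"])
        # pid and tid are left empty, as in the converter script
        lines.append("%s,%s,%s,," % (event["name"], to_utc(start), to_utc(end)))
    return lines


# Invented sample: one metadata event and one complete ('X') event.
SAMPLE = json.loads("""
{"traceEvents": [
  {"ph": "M", "name": "process_name", "pid": 0, "args": {"name": "/cpu:0"}},
  {"ph": "X", "name": "MatMul", "pid": 0, "tid": 0,
   "ts": 1500000000000250, "dur": 2500}
]}
""")

for line in trace_to_csv_lines(SAMPLE):
    print(line)
# name,start_tsc.UTC,end_tsc,pid,tid
# MatMul,2017-07-14 02:40:00.000250,2017-07-14 02:40:00.002750,,
```

Only the 'X' event produces a row; the metadata event is dropped because it has no duration.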

When configuring a VTune Amplifier project, go to the Analysis Target window and, in the Custom collector field, specify the path to the collect.sh script followed by the path to the directory with the .json files generated by Tensorflow.timeline:

$ <path_to_collect.sh>/collect.sh <path_to_dir_with_json_files>

For example:

Custom collector configuration dialog

The script accepts one parameter: the path to the directory with the .json files generated by Tensorflow.timeline. At the start of the collection, the script removes any .json files already in that directory. At the end of the collection, it automatically picks up the new .json files, converts them to the .csv format, and places the converted files in the result directory next to the other traces collected by VTune Amplifier. When the collection is done, VTune Amplifier automatically loads all the data and shows everything on the same timeline, correlated:

VTune Amplifier timeline with data from Tensorflow*

and aggregated:

VTune Amplifier grid with data from Tensorflow*

The example above uses the Source Function / Function / Call Stack grouping instead of the default Function / Call Stack grouping because Tensorflow was built with Intel® Math Kernel Library for Deep Neural Networks (Intel MKL-DNN) support, and Intel MKL-DNN generates code just in time (JIT). As a result, Intel MKL-DNN in some cases generates multiple instances of the same function. With the default Function / Call Stack grouping, VTune Amplifier would show these instances as different functions, which could lead to a misinterpretation of the result: no single instance is hot by itself, but all of them taken together are the hotspot.
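The effect of the grouping can be illustrated with a little arithmetic. In the sketch below, all function names and CPU times are invented. Three JIT-generated instances of one source function are each cooler than an unrelated function, yet their sum is the real hotspot.

```python
from collections import defaultdict

# (function instance, source function, CPU time in seconds) - invented numbers
samples = [
    ("jit_kernel [instance 1]", "jit_kernel", 1.5),
    ("jit_kernel [instance 2]", "jit_kernel", 1.25),
    ("jit_kernel [instance 3]", "jit_kernel", 1.75),
    ("other_function", "other_function", 2.0),
]


def total_by(rows, key_index):
    """Sum CPU time per function instance (0) or per source function (1)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


by_instance = total_by(samples, 0)  # analogous to grouping by Function
by_source = total_by(samples, 1)    # analogous to grouping by Source Function

print(max(by_instance, key=by_instance.get))  # other_function (2.0 s)
print(max(by_source, key=by_source.get))      # jit_kernel (4.5 s in total)
```

Grouped per instance, other_function looks like the top hotspot. Grouped per source function, jit_kernel clearly dominates.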

The described technique lets you apply the full power of the analyses available in VTune Amplifier to Tensorflow-based applications. For instance, finding the operations responsible for the hotspots is just a matter of applying a proper Source Function / Frame Domain grouping. This grouping can be configured manually as a custom grouping:

Operations caused by the hotspots

NOTE

To discuss this article, visit the VTune Amplifier developer forum.
