Machine learning applications are compute intensive by nature, which makes performance optimization important for them. One of the most popular libraries, Tensorflow*, already has an embedded timeline feature that helps identify which parts of the computational graph cause bottlenecks, but it lacks advanced features such as architectural analysis. This short tutorial shows how to combine the data provided by Tensorflow.timeline with the options available in one of the most powerful performance profilers for Intel Architecture, Intel® VTune™ Amplifier.
Tensorflow.timeline generates its data in Trace Event Format, which VTune Amplifier cannot consume directly but which can be converted to the .csv format it supports. We do this conversion at the end of the collection with the help of a custom collector script, listed below:
#! /bin/sh

if [ "$#" -ne 1 ]; then
    echo "Usage: collect.sh json_dir"
    exit 1
fi

JSON_FILES=$1/*.json

case "$AMPLXE_COLLECT_CMD" in
    "start")
        rm -rf $JSON_FILES
        ;;
    "stop")
        for f in $JSON_FILES
        do
            python $(dirname "$0")/convert.py $f $AMPLXE_HOSTNAME $AMPLXE_DATA_DIR
        done
        ;;
    "pause")
        ;;
    "resume")
        ;;
    *)
        echo "unexpected value of AMPLXE_COLLECT_CMD"
        ;;
esac
This script uses a helper convert.py Python* script, shown here:
#!/usr/bin/env python

import sys
import json
import os
import datetime


def convertTime(t):
    # Trace timestamps are in microseconds; convert to a UTC datetime.
    return datetime.datetime.utcfromtimestamp(t / 1000000.0)


if len(sys.argv) < 4:
    print("Usage: convert.py input_file.json host output_dir")
    sys.exit(1)

fnInp = sys.argv[1]
host = sys.argv[2]
outPath = sys.argv[3]

# Output name: <input base name>-hostname-<host>.csv in the output directory.
fnOut = os.path.basename(fnInp)
fnOut = os.path.splitext(fnOut)[0]
fnOut = os.path.join(outPath, fnOut + '-hostname-' + host + '.csv')

with open(fnInp, 'r') as fInp:
    trace = json.load(fInp)

with open(fnOut, 'w') as fOut:
    fOut.write('name,start_tsc.UTC,end_tsc,pid,tid\n')
    for event in trace['traceEvents']:
        # Only complete ("X") events carry both a start time and a duration.
        if event['ph'] == 'X':
            t = int(event['ts'])
            tbUtc = convertTime(t)
            teUtc = convertTime(t + int(event['dur']))
            s = event['name'] + ','
            s += str(tbUtc) + ','
            s += str(teUtc) + ','
            s += ',\n'  # pid and tid columns are left empty
            fOut.write(s)
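The two key steps of the converter, building the output file name and turning `ts`/`dur` microsecond values into UTC start and end times, can be tried in isolation. The following is a minimal sketch, not part of the original scripts: the sample trace, the paths, the host name `node01`, and the helper names `csv_path_for` and `csv_rows` are all hypothetical. It also uses `timedelta` arithmetic instead of the floating-point division in convert.py, and assumes, as the converter does, that `ts` counts microseconds since the Unix epoch.

```python
import json
import posixpath  # keeps the printed path identical on any OS
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical trace: one metadata ("M") event and one complete ("X") event.
SAMPLE_TRACE = json.loads("""
{
  "traceEvents": [
    {"ph": "M", "name": "process_name", "pid": 0,
     "args": {"name": "example process"}},
    {"ph": "X", "name": "MatMul", "pid": 1, "tid": 0,
     "ts": 1500000000000000, "dur": 2500}
  ]
}
""")


def csv_path_for(json_path, host, out_dir):
    """Build <out_dir>/<json base name>-hostname-<host>.csv, as convert.py does."""
    base = posixpath.splitext(posixpath.basename(json_path))[0]
    return posixpath.join(out_dir, base + "-hostname-" + host + ".csv")


def csv_rows(trace):
    """Return (name, start, end) for every complete ("X") event in the trace."""
    rows = []
    for event in trace["traceEvents"]:
        if event.get("ph") != "X":
            continue  # metadata and other event types have no duration
        start = EPOCH + timedelta(microseconds=int(event["ts"]))
        end = start + timedelta(microseconds=int(event["dur"]))
        rows.append((event["name"], start, end))
    return rows


rows = csv_rows(SAMPLE_TRACE)
out_path = csv_path_for("/tmp/traces/timeline_step_3.json", "node01", "/tmp/r000/data.0")
print(out_path)
for name, start, end in rows:
    print(name, start.isoformat(), end.isoformat())
```

Only the `X` event produces a row; the metadata event is skipped, which matches the filter in the converter.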
When configuring a VTune Amplifier project, go to the Analysis Target window and, in the Custom collector field, specify the path to the collect.sh script followed by the path to the directory with the .json files generated by Tensorflow.timeline, as follows:
$ <path_to_collect.sh>/collect.sh <path_to_dir_with_json_files>
The script accepts one parameter: the path to the directory with the .json files generated by Tensorflow.timeline. At the end of collection, the script automatically picks up the .json files from that directory, converts them to the .csv format, and places the converted files in the result directory next to the other traces collected by VTune Amplifier. When collection is done, VTune Amplifier automatically loads all the data and shows everything correlated on the same timeline.
This example uses the Source Function / Function / Call Stack grouping instead of the default Function / Call Stack grouping, because Tensorflow was built with Intel® Math Kernel Library for Deep Neural Networks (Intel MKL-DNN) support, and Intel MKL-DNN generates code at run time (JIT). As a result, Intel MKL-DNN in some cases generates multiple instances of the same function. With the default Function / Call Stack grouping, VTune Amplifier would show these instances as different functions, which could lead to an incorrect interpretation of the result: no single instance is hot by itself, yet the accumulation of all of them is the hotspot.
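The effect of the grouping can be illustrated with a few lines of Python. This is a sketch on made-up numbers, not VTune Amplifier output: the function names, addresses, and CPU times are hypothetical and only show how several JIT-generated instances can hide a hotspot when each one is counted separately.

```python
from collections import defaultdict

# Hypothetical samples: (function instance, source function, CPU time in ms).
samples = [
    ("jit_kernel@0x7f10", "jit_kernel", 800),
    ("jit_kernel@0x7f20", "jit_kernel", 700),
    ("jit_kernel@0x7f30", "jit_kernel", 900),
    ("other_function", "other_function", 1500),
]


def total_by(rows, key_index):
    """Sum CPU time per value found in column key_index."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


by_instance = total_by(samples, 0)  # analogous to grouping by Function
by_source = total_by(samples, 1)    # analogous to grouping by Source Function

hottest_instance = max(by_instance, key=by_instance.get)
hottest_source = max(by_source, key=by_source.get)

print("per instance:", hottest_instance, by_instance[hottest_instance])
print("per source function:", hottest_source, by_source[hottest_source])
```

Grouped per instance, `other_function` looks like the top item at 1500 ms; grouped per source function, the three `jit_kernel` instances add up to 2400 ms and become the actual hotspot.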
The described technique makes it possible to apply the full power of the analyses available in VTune Amplifier to Tensorflow-based applications. For instance, finding the operations responsible for the hotspots is just a matter of applying a proper Source Function / Frame Domain grouping. This grouping can be configured manually as a custom grouping.
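Conceptually, relating hotspots to operations means pairing each CPU sample with the operation interval that was active at that moment. The sketch below illustrates that idea only; it says nothing about how VTune Amplifier implements it, and the intervals, samples, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical operation intervals (name, start_us, end_us), such as those
# found in a converted timeline trace.
op_intervals = [
    ("Conv2D", 0, 400),
    ("MatMul", 400, 1000),
    ("Softmax", 1000, 1100),
]

# Hypothetical CPU samples: (timestamp_us, source function).
cpu_samples = [
    (50, "jit_kernel"), (120, "jit_kernel"), (390, "memcpy"),
    (450, "sgemm"), (700, "sgemm"), (950, "sgemm"), (1050, "expf"),
]


def op_at(timestamp_us, intervals):
    """Return the operation whose [start, end) interval contains the timestamp."""
    for name, start, end in intervals:
        if start <= timestamp_us < end:
            return name
    return None  # the sample falls outside every traced operation


# Count samples per (source function, operation) pair.
counts = Counter((func, op_at(ts, op_intervals)) for ts, func in cpu_samples)
for (func, op), n in counts.most_common():
    print(func, op, n)
```

In this made-up data set, the `sgemm` samples all fall inside the `MatMul` interval, so that pair comes out on top.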
To discuss this article, visit the VTune Amplifier developer forum.