Learn what a bare-metal environment is and how to train and test the TensorFlow* framework on single-node and multinode Intel® Xeon® Scalable platforms in that environment.
Welcome back to the AI Practitioners Guide for Beginners. I'm Beenish Zia. In this episode, you will get a quick overview of what the guide covers for training and testing the TensorFlow* framework on single-node and multinode Intel® Xeon® Scalable platforms in a bare-metal environment.
First, let's talk about bare-metal deployment. One of the ways smaller businesses and curious developers are deploying artificial intelligence models or workloads on Intel Xeon Scalable processor-based platforms is via bare metal. What bare metal means here is that you acquire an Intel Xeon Scalable processor-based system with just the base hardware, assemble the hardware components, and then configure all the software pieces, including the OS and all AI packages, yourself.
When it comes to bare metal, you could either be deploying on a single node or on a cluster with multiple nodes. Let's first look at the steps for single-node deployment. Single node means you have one server system with the latest Intel Xeon Scalable processors installed and all necessary hardware configurations done.
The hardware configuration can include selecting the right processor SKU, installing the right memory DIMMs, applicable SSDs, and as necessary, an Ethernet or InfiniBand* connection. In addition, the BIOS has to be up to date before you can start. The guide gives you a sample configuration for the hardware and software stack if you need some inspiration.
The next step is to install an operating system. In our example, we are using CentOS*, and detailed steps for operating system installation are also included in the guide for your reference. Once you've installed the OS, you will need to configure YUM and install EPEL [Extra Packages for Enterprise Linux*], which provides high-quality add-on software packages for Enterprise Linux distributions. Lastly, install GCC* if it's not already part of your OS installation.
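Before moving on to the framework, it can help to confirm that the tools from this step are actually on the system. The following is a minimal sketch, not taken from the guide: the helper names `find_tools` and `missing_tools` are hypothetical, and the command names checked (`yum`, `gcc`) are assumed to match a CentOS-style setup.

```python
import shutil


def find_tools(names):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the command names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]
```

Calling `missing_tools(["yum", "gcc"])` returns an empty list when both commands are available, and otherwise lists the ones that still need to be installed.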
Once the OS installation with basic add-on software packages is done, next you will start installing TensorFlow. Installation of the framework can be done using various methods. In the guide, I've used a virtual environment for installation. This starts with installing the necessary dependencies. Once the dependencies are installed, install the virtual environment and activate it. Then make sure all the dependencies are up to date before installing the latest version of Intel® Optimization for TensorFlow.
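A quick way to confirm this step worked is to check, from inside Python, that the interpreter is running in the activated virtual environment and that TensorFlow can be found. This is a minimal sketch using only the standard library; the function names are hypothetical and the guide's own verification steps may differ.

```python
import importlib.util
import sys


def in_virtual_environment():
    """True when the interpreter runs inside a venv or virtualenv."""
    # venv (and recent virtualenv) point sys.prefix at the environment while
    # sys.base_prefix keeps the system location; older virtualenv releases
    # set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


def tensorflow_available():
    """True when a module named tensorflow can be located, without importing it."""
    return importlib.util.find_spec("tensorflow") is not None


def environment_report():
    """Collect the checks above into one dictionary."""
    return {
        "python": sys.version.split()[0],
        "in_virtual_environment": in_virtual_environment(),
        "tensorflow_available": tensorflow_available(),
    }
```

Printing `environment_report()` after activating the environment should show both flags as `True`. If `tensorflow_available` is `False`, the install most likely went into a different Python installation.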
Now that you've installed TensorFlow, you should test your environment. In the guide, I've done this using CIFAR-10 training data. Training and testing your trained model will require you to run a couple of Python* scripts that are already available as part of the TensorFlow package. The expected results, with details on each step, are covered in the guide.
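To give a feel for what training and testing on CIFAR-10 involves, here is a minimal sketch that uses the `tf.keras` API. It is an illustration only and is not the set of scripts the guide runs; the function name `train_and_test`, the network layout, and the batch size are assumptions chosen for brevity.

```python
# CIFAR-10 label names, in label-index order (0 to 9).
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]


def train_and_test(epochs=1):
    """Train a small CNN on CIFAR-10, then evaluate it on the held-out test set."""
    import tensorflow as tf  # imported lazily so this file loads without TensorFlow

    # Downloads the dataset on first use: 50,000 training and 10,000 test images.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(len(CIFAR10_CLASSES)),  # one logit per class
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

    # Training uses the training split; testing uses images the model has not seen.
    model.fit(x_train, y_train, epochs=epochs, batch_size=128)
    loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
    return loss, accuracy
```

Calling `train_and_test()` returns the loss and accuracy on the test split. A single epoch is enough to confirm that the installation works end to end, although the accuracy will be modest.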
Now that we have covered single-node deployment, let's cover the steps for multiple-node deployment. When it comes to multiple nodes, you have a cluster of systems. For example, you have two or more of the latest Intel Xeon Scalable processor-based server systems connected and managed by a host node.
The initial steps for multiple-node deployment are very similar to single node. Where the major change happens is after you've done your compiler installation. You will need to install OpenMPI followed by Python 3.6 installation, and then install Horovod. Horovod helps run TensorFlow in a distributed fashion.
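As a rough picture of what Horovod adds to a TensorFlow training script, here is a minimal sketch using the `horovod.tensorflow.keras` module. It is not the guide's script: the helper names, the tiny model, and the choice to scale the learning rate by the number of workers (a common Horovod convention) are assumptions for illustration.

```python
def scaled_learning_rate(base_lr, num_workers):
    """Scale the learning rate with the worker count, since the effective
    batch size grows with the number of workers."""
    return base_lr * num_workers


def build_distributed_model(base_lr=0.001):
    """Return a compiled Keras model, Horovod callbacks, and this worker's rank."""
    import tensorflow as tf                   # lazy imports: only needed on the
    import horovod.tensorflow.keras as hvd    # cluster nodes that run training

    hvd.init()  # start Horovod; each launched process becomes one worker

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])

    optimizer = tf.keras.optimizers.Adam(scaled_learning_rate(base_lr, hvd.size()))
    # Wrap the optimizer so gradients are averaged across all workers.
    optimizer = hvd.DistributedOptimizer(optimizer)

    model.compile(
        optimizer=optimizer,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

    # Broadcast rank 0's initial weights so every worker starts identically.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    return model, callbacks, hvd.rank()
```

A script built on this would pass `callbacks` to `model.fit` and be started once per worker, for example with `horovodrun -np 2 -H node1:1,node2:1 python train.py`, where the host names and script name are placeholders.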
Once that is done, the rest of the steps for training and testing your environment are similar to single node, with complete details in the guide. The choice between single node and multiple nodes depends highly on the user and the requirements of their specific applications.
Please check out the written AI Practitioners Guide for Beginners for complete details on deploying TensorFlow in a bare-metal environment. Thanks for watching, and keep developing.