Training hangs

I have the Intel DL-Training tool set up in a CentOS VirtualBox VM; the VirtualBox host is a Windows system. I am trying to create an image classification CNN model using GoogLeNet. I have successfully uploaded the images and created the datasets, but when I try to create a model from a dataset, training hangs at the following step:

I0926 13:27:53.117688   624 net.cpp:388] This network produces output accuracy
I0926 13:27:53.117692   624 net.cpp:388] This network produces output loss
I0926 13:27:53.117697   624 net.cpp:388] This network produces output loss1/loss
I0926 13:27:53.117709   624 net.cpp:388] This network produces output loss2/loss
I0926 13:27:53.352994   624 net.cpp:424] Network initialization done.
I0926 13:27:55.554013   624 solver.cpp:119] Solver scaffolding done.
I0926 13:27:57.511489   624 caffe.cpp:329] Starting Optimization
I0926 13:27:57.511575   624 solver.cpp:491] Solving OneMoreTry
I0926 13:27:57.511592   624 solver.cpp:492] Learning Rate Policy: step

The default LR policy, "poly", hangs in the same manner; I also tried "step", as shown above. I also tried executing the caffe command from the Jupyter shell/terminal. In all cases it hangs at this same point.
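
A typical invocation from the terminal looks roughly like this (the solver path and log file name are placeholders, not my actual paths):

# Run training directly against the solver definition and capture the full log
# (replace the solver path with the one generated for the dataset)
caffe train --solver=/path/to/solver.prototxt 2>&1 | tee train.log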

No .caffemodel file is created.

What could be the issue?

What is the default value of "snapshot_format" in Intel Caffe? It is missing from my solver.prototxt, and adding it did not help: still no .caffemodel file was created. I am looking at solver.cpp to see why the file is not being written, but I don't know what the default value of this parameter is.
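
One way to check the default without guessing is to read the field declaration in the Intel Caffe proto definition. Upstream BVLC Caffe declares snapshot_format with default = BINARYPROTO, but Intel Caffe is a fork, so it is safer to confirm against its own caffe.proto (the path below assumes the usual Caffe source layout and may differ in your install):

# Show the snapshot_format declaration and its default in the proto definition
# (adjust the path to wherever the Intel Caffe sources live in your setup)
grep -n -B2 -A2 "snapshot_format" src/caffe/proto/caffe.proto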

TRIAGED!

The caffe process is being killed by the OOM killer, so I need to configure more RAM for the VirtualBox VM.
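
For anyone hitting the same issue, a rough way to confirm the OOM kill from inside the guest, and to give the VM more memory from the Windows host, is the following (the VM name and memory size are examples; power the VM off before running modifyvm):

# Inside the CentOS guest: confirm the kernel's OOM killer terminated caffe
dmesg | grep -iE "out of memory|killed process"
grep -i "oom" /var/log/messages

# On the Windows host, with the VM powered off: raise the VM's RAM, e.g. to 8 GB
VBoxManage modifyvm "CentOS-DL-Training" --memory 8192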
