Overheating Xeon Phi 7110P

Hi,

We have built a workstation with two Xeon Phi 7110P cards on an Intel W2600CR2 motherboard. Our accelerators are passively cooled. We have noticed that right after the mpss service is started, micsmc shows a temperature of around 100 °C and rising. At around 140 °C (which takes a few seconds), micctrl shows "node lost" and we can do nothing except power the host off and on. A reboot doesn't help: the Xeon Phis were not visible in lspci until the host was completely powered off and turned on again manually.

We have tested Scientific Linux 6.4 with mpss_gold_update_3-2.1.6720-21 and mpss-3.1.2, as well as Windows 7 Ultimate. The behavior was the same on every system.

We have flashed the MICs with newer firmware; nothing changed.

Motherboard info:
BIOSVersion=SE5C600.86B.01.08.0003.022620131521
FWBootVersion=01.17      
FWOpcodeVersion=1.17.4151 

What can cause this kind of temperature rise right after the MIC cards boot up?

Best regards,
Krzysztof


Have you checked your fans?  Read the specs for airflow requirements.

I built my own system with dual 5110Ps. This required fabricating fan ducts and choosing fans with sufficient capacity to provide the required airflow. Attached are pictures of my handiwork. I used the provided PCIe extension mounting brackets as a support for the fan ductwork. The "design", if you can call it that, permits dual-card use. A different design would be required for quad-card use (this is why you pay the system integrators to do this work).

Jim Dempsey

Attachments: 100_4154.JPG (1 MB), Fans.jpg (100.22 KB)
www.quickthreadprogramming.com

Hi, thanks for your quick response.

I must admit that I didn't read the cooling requirements carefully. I think your solution is fine and would suit our case as well. Is there any cooling system (fans) designed specifically for the Xeon Phi?

I do not think any specific fans were designed. Consult the Intel(R) Xeon Phi(TM) Coprocessor Datasheet.

5110P requires 20 cuft/min airflow
7120P requires 33 cuft/min airflow

I've seen some packaging where the coprocessors are placed together and one larger fan supplies airflow to both from outside the chassis. You also have the option of using the 1.5" sq fans, but these have ~15,000 RPM motors and produce substantial noise. The fans I chose are rated at ~24 cuft/min and the noise is tolerable for a workstation in an office environment. The ductwork was made from 1/8" art board (1/8" foam core with paper laminate on both sides), cut with an X-Acto (razor) knife and taped together. If you have access to a 3D printer, that would be the way to go.
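To make the sizing concrete, here is a minimal sketch that compares a fan's delivered airflow against the per-card figures quoted above. The table holds only the two numbers from this thread; the function names are made up for illustration, and the single-fan case assumes the shared duct splits the flow evenly, which a real duct may not do.

```python
# Airflow requirements in cuft/min, as quoted in this thread.
# Other SKUs are deliberately left out; check the datasheet for them.
REQUIRED_CFM = {"5110P": 20.0, "7120P": 33.0}


def total_required_cfm(skus):
    """Sum the airflow needed when one fan feeds several cards."""
    return sum(REQUIRED_CFM[sku] for sku in skus)


def fan_is_sufficient(delivered_cfm, skus):
    """True if the airflow actually delivered covers all listed cards.

    Assumption: a shared duct divides the flow evenly between cards.
    """
    return delivered_cfm >= total_required_cfm(skus)


if __name__ == "__main__":
    # One ~24 cuft/min fan per 5110P, as in the setup described above.
    print(fan_is_sufficient(24.0, ["5110P"]))
    # The same fan on a single 7120P.
    print(fan_is_sufficient(24.0, ["7120P"]))
    # One larger fan feeding two 5110Ps needs at least this much.
    print(total_required_cfm(["5110P", "5110P"]))
```

The value passed in should be the flow measured through the card, not the open-air rating printed on the fan.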

I tried out three different fan designs before I settled on the one in the pictures. One of the options I tried was placing the coprocessors adjacent to one another and mounting one of the metal extension brackets upside down (requiring a little file work on the bracket to get the screw holes to line up). This worked but required mounting the fan after card installation. A 3D printer would have made it possible to fabricate snap-on ductwork that clips onto the dual coprocessors without the extension brackets (spring-loaded pins that pop into the screw holes).

I haven't added fan speed control yet; both fans are set at max speed for now.

Jim Dempsey

The W2600CR2 is only compatible with the actively-cooled 3120A (see picture) coprocessor (the "A" stands for active, referring to the built-in fan).  The 7110P (the "P" stands for passively cooled) requires a server-type system that blows air through the chassis.

See motherboard compatible products (click on "compatible products" then "processors"): 
http://ark.intel.com/products/56338/Intel-Workstation-Board-W2600CR2

If you change your 7110P for 3120A, then your Xeon Phi should no longer run a fever.  If you opt to stick with the 7110P, then jimdempseyatthecove provided an example of a custom DIY fan setup that is possible.  Here are the basic system requirements to support the Xeon Phi coprocessor:

  •   Double-wide x16 PCI Express slot for each coprocessor in the system
  •   BIOS support for memory mapped I/O address ranges above 4GB
  •   Up to 300W power delivery and sufficient cooling (varies by SKU, see pages 22 & 43 of the datasheet for more info)

If you're looking to get out of the system design business altogether, then for the best experience we recommend a supported system from one of our OEM partners - many of whom also offer fully-configured starter kits.

Happy New Year!

@jimdempseyatthecove

Can you tell me what the results of your solution were? What temperature level are you able to maintain? Did you use any stress tests to check the temperature under heavy load on the MIC card?

I have tried to build something similar. Right after booting, the temperature holds at around 60 °C. But under heavy load it rises to 100 °C and the calculations have to be stopped.

@jimdempseyatthecove

Hi,

Do you have much use for the two-card setup? I am thinking of purchasing the passively cooled 31 series and was wondering whether the second card provides a noticeable boost in compute time in your case. Thank you.

On my system, both MICs under heavy load stay well under 100 °C (~80 °C, depending on room temperature). My office temperature varies widely and the MIC temperatures follow it.

As for one or two MICs, this depends entirely on your application.

Add to Eric's requirement list:

A BIOS that not only supports memory-mapped I/O above 4GB, but also provides for device windows larger than 4GB.
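On a Linux host, one rough way to see what the BIOS actually assigned is to read the card's `resource` file under `/sys/bus/pci/devices/`, where each line holds a region's start address, end address, and flags in hex. The sketch below parses that text and reports whether any region sits above 4 GiB and whether any single window is larger than 4 GiB. The sample addresses are invented for illustration, the function names are hypothetical, and the flags column is read but not interpreted.

```python
FOUR_GIB = 1 << 32

# Hypothetical contents of /sys/bus/pci/devices/<address>/resource.
# These addresses are made up for illustration only.
SAMPLE = """\
0x0000380000000000 0x00003801ffffffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000fbd00000 0x00000000fbd1ffff 0x0000000000140204
"""


def parse_resource(text):
    """Return (start, size) for every region that is actually assigned."""
    regions = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        start, end, _flags = (int(p, 16) for p in parts)
        if start == 0 and end == 0:
            continue  # unused slot
        regions.append((start, end - start + 1))
    return regions


def summarize(text):
    """Report the two BIOS properties discussed above."""
    regions = parse_resource(text)
    return {
        "mapped_above_4g": any(start >= FOUR_GIB for start, _ in regions),
        "window_larger_than_4g": any(size > FOUR_GIB for _, size in regions),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To try it on a real card, read the file for the PCI address that lspci reports and pass its contents to `summarize`. If the card does not show up in lspci at all, this check cannot run and the problem lies elsewhere.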

When you have a "P" series MIC, check the specifications (Xeon-Phi-Coprocessor Datasheet) as to the cooling requirements.

3.3.2.1 System Airflow for 5110P SKUs

In order to ensure adequate cooling of the 5110P SKUs with a 45 °C inlet temperature, the system must be able to provide 20 ft³/min of airflow to the card with 4.3 ft³/min on the secondary side and the remainder on the primary side. The total pressure drop (assuming a multi-card installation conforming to the PCI Express* mechanical specification) is 0.21 inch H2O at this flow rate.

Note: For systems with reversed airflow, the corresponding airflow requirement is expected to be within +/-5% tolerance of the values shown in the following tables.

If the system is able to provide a temperature lower than 45 °C at the card inlet, then the total airflow can be reduced according to the graph and table in Figure 3-6.

If the 5110P SKU is powered by a 2x4 and a 2x3 connector, the card can support an additional 20W of power for maximum TDP of 245W (see Section 2.1.5 for more details). In this case, the corresponding airflow requirement for cooling the part as a 245W card is shown in Figure 3-8.

3.3.2.2 Airflow Requirement for SE10P/7120P/3120P Passive Cooling Solution

In order to ensure adequate cooling of the SE10P/7120P/3120P 300W SKUs with a 45 °C inlet temperature, the system must be able to provide 33 ft³/min of airflow to the card with 7.2 ft³/min on the secondary side and the remainder on the primary side. The total pressure drop (assuming a multi-card installation conforming to the PCI Express* mechanical specification) is 0.54 inch H2O at this flow rate.

If the system is able to provide a temperature lower than 45 °C at the card inlet, then the total airflow can be reduced according to the graph and table in Figure 3-7.
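Fan datasheets often quote flow in m³/h and static pressure in Pa, so it can help to restate the figures above in those units and to work out the primary-side share. The sketch below does that arithmetic. The conversion constants are standard (1 ft = 0.3048 m, and the conventional inch of water), while the names and the dictionary layout are just for illustration.

```python
# 1 ft = 0.3048 m exactly, so ft^3/min -> m^3/h is 0.3048**3 * 60.
CFM_TO_M3_PER_H = 0.3048 ** 3 * 60
# Conventional inch of water expressed in pascals.
IN_H2O_TO_PA = 249.0889

# Figures quoted from the datasheet sections above (45 °C inlet).
SPECS = {
    "5110P": {"total_cfm": 20.0, "secondary_cfm": 4.3, "drop_in_h2o": 0.21},
    "SE10P/7120P/3120P": {"total_cfm": 33.0, "secondary_cfm": 7.2, "drop_in_h2o": 0.54},
}


def primary_cfm(spec):
    """The 'remainder on the primary side' from the datasheet wording."""
    return spec["total_cfm"] - spec["secondary_cfm"]


def to_metric(spec):
    """Total flow in m^3/h and pressure drop in Pa."""
    return (spec["total_cfm"] * CFM_TO_M3_PER_H,
            spec["drop_in_h2o"] * IN_H2O_TO_PA)


if __name__ == "__main__":
    for name, spec in SPECS.items():
        flow, drop = to_metric(spec)
        print(f"{name}: primary {primary_cfm(spec):.1f} cuft/min, "
              f"total {flow:.1f} m^3/h at {drop:.0f} Pa")
```

The pressure figure matters as much as the flow: a fan has to deliver the required flow against that drop, which is why open-air ratings overstate what reaches the card.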

I have dual 5110Ps; each has its own fan and ductwork that provides ~24 cuft/min, or 20% above the minimum.

*** Note: the airflow rate has to be measured through the MIC. This is not the same as a fan's rating in open air. IOW, you may have to overspec the fan rating to attain the minimum required flow.
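The margin and the overspec point can be put into numbers with a small sketch. The margin function reproduces the 20% figure above. The second function is a rough planning aid only: `delivery_fraction`, the share of a fan's open-air rating that actually makes it through the duct and card, is a hypothetical parameter that has to come from measurement, and the 0.8 used below is an arbitrary example value, not a figure from the datasheet.

```python
def airflow_margin(delivered_cfm, required_cfm):
    """Fractional headroom of delivered airflow over the requirement."""
    return delivered_cfm / required_cfm - 1.0


def min_open_air_rating(required_cfm, delivery_fraction):
    """Open-air fan rating needed to deliver required_cfm through the card.

    delivery_fraction is an assumed, measured value in (0, 1].
    """
    if not 0.0 < delivery_fraction <= 1.0:
        raise ValueError("delivery_fraction must be in (0, 1]")
    return required_cfm / delivery_fraction


if __name__ == "__main__":
    # ~24 cuft/min delivered against the 20 cuft/min 5110P requirement.
    print(f"margin: {airflow_margin(24.0, 20.0):.0%}")
    # Example only: if just 80% of the rated flow reached the card.
    print(f"rating needed: {min_open_air_rating(20.0, 0.8):.1f} cuft/min")
```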

Jim Dempsey

Thanks Jim
