ibscif/Infiniband problems

Hello,

I am having problems with the current MPSS 3.2 release in combination with CentOS 5.3 and the Intel 7.2.2.0.8 OFED stack.

The kernel version is 2.6.32-279-11.1, so I had to recompile the kernel modules.

As soon as I start the ofed-mic service, the InfiniBand connections on the host stop working (I tested this with ibv_rc_pingpong). Communication with the card over the scif0 interface is possible, but only with poor bandwidth (73.34 Mbit/sec). dmesg shows the following suspicious messages:

IB Proxy Server v0.1 Build 6720-23
Copyright (c) 2011 Intel Corporation
ibscif: OpenFabrics IBSCIF Driver v0.1 Build 6720-23 built Mar 28 2014 12:12:04
ibscif: max_pinned=50, window_size=40, blocking_send=0, blocking_recv=1, fast_rdma=1, host_proxy=0, rma_threshold=1024, scif_loopback=1, new_ib_type=1
ibscif_add_one: my node_id is 0
RDMA CMA: cma_listen_on_dev, error -38, listening on device scif0
fmr_pool: Device scif0 does not support FMRs
Error creating fmr pool
fmr_pool: Device scif0 does not support FMRs
ibscif_get_pollep_list: ep=ffff88086621a400 (0:listen)
ibscif_get_pollep_list: count=1
ibscif_do_connect: 0-->64625
ibscif_get_conn: ERROR: cannot get connection (0-->64625) after waiting, state=-1
ibscif_mr_get_mreg: conn==NULL
ibscif: ibscif_xmit_wr: fail to set up RMA addresses for the work request.
ibscif: ibscif_send_disconnect: ERROR: qp->conn == NULL

Does anybody have the same problems or know what I did wrong?

 

kind regards,

Christian


Just a quick note - the MPSS does not support any version of RHEL before 6.0 and in fact, earlier versions of RHEL have incompatibilities that will cause problems with the MPSS. Although CentOS is not explicitly supported in the MPSS, the RHEL support implies that CentOS 5.3 will have more problems than just OFED issues. The solution is to upgrade to a later version (6.0 or beyond) of CentOS.

He meant to write CentOS 6.3, not 5.3. The kernel he is using (2.6.32-279-11.1) indicates that.

Hello,

I meant CentOS 6.3, but it is not relevant any more, as we have a currently unsupported configuration with 2 Xeon Phi cards and only one InfiniBand adapter.

 

kind regards,

Christian

Christian,

How is your configuration unsupported? A configuration not being supported doesn't mean MPSS will not work, only that the MPSS team doesn't validate against it.

We do our best to help everyone who is using the coprocessor.

Please elaborate.

Regards
--
Taylor
 

Hello Taylor,

the main problem was that the overlay filesystems located under

/var/mpss/mic[01]/

were working for only one of the two cards. The other one was then running with only the base image. Which of the two cards was running with only the base image was completely arbitrary.

As we had another host image with a working older MPSS version, I decided to wait for another MPSS release.

kind regards,

Christian

Christian,

Just so we can make sure the problem isn't there in the next release, the problem showed up when you had:

two coprocessor cards (it's probably not important but do you know if they are C0 or B1 cards?)  running MPSS 3.2

one IB adapter (TrueScale as opposed to a Mellanox?) running Intel 7.2.2.0.8 OFED stack

with CentOS 6.3 and kernel version is 2.6.32-279-11.1

Is that correct?

And you tracked the problem down to one of the cards still running the initial boot RAM disk? (The way the card boots, it first creates a RAM disk containing just the base file system, then later creates a second RAM disk containing both the base and overlays and switches to that disk. So not having the overlay files implies that the boot terminated while it was still using the first RAM disk image.)

Can you also tell me what MPSS version you are running now and if you are still using the same OFED and CentOS? And when you were running MPSS 3.2, did you try booting without OFED and then bringing that up later?

Frances

Do you see the following error on the serial console of the card that fails to boot correctly?

[    2.634597] Initramfs unpacking failed: junk in compressed archive

This happens before the ofed-mic service is loaded, and it occurs randomly. MPSS 3.2 and 3.2.1 are affected.

Hello Tommi,

Yes, I see exactly that junk message; only the timestamp differs slightly:

[    2.974297] Initramfs unpacking failed: junk in compressed archive

Is there a solution for this, except waiting for a new mpss release?

@Frances

The OFED stack is OFED-3.5-1-MIC-beta1

The MPSS is

mpss-3.2-rhel-6.3.tar

 

kind regards,

Christian

I'm confused

 - which InfiniBand card?

 - which OFED stack? You mention both Intel 7.2.2.0.8 OFED and OFED-3.5-1-MIC-beta1

My setup:

Mellanox Technologies MT27500 Family [ConnectX-3]

2x Xeon Phi coprocessor SE10/7120

MPSS-3.2 + OFED 1.5.4 + el6.3 kernel seems to work

MPSS-3.2.1 + OFED-3.5-1-MIC-beta1 / OFED-3.5-2-MIC-beta2 + el6.5 kernel has this  "Initramfs unpacking failed"-problem

Today I tried mpss-3.2.1/ofed 1.5.4/el6.5 kernel combo and it still failed.

Btw, the issues in this thread are too mixed; it's clear that the author of this thread had some conflicting packages installed:

"SCIF Driver v0.1 Build 6720-23" vs. "I am having problems with the actual mpss 3.2 relase"

MPSS 3.x libscif build id should be 3.x, not 6720-23 which is from MPSS 2.x era.
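
To check for this kind of mismatch, the build IDs can be pulled out of the dmesg output with a few lines of Python. This is only a minimal sketch: the function names `extract_builds` and `looks_like_mpss3` are made up for illustration, the pattern is modeled solely on the two "Build" lines quoted at the top of this thread, and the "starts with 3." test simply encodes the statement above rather than any documented Intel versioning rule.

```python
import re

# Matches lines such as:
#   ibscif: OpenFabrics IBSCIF Driver v0.1 Build 6720-23 built Mar 28 2014 12:12:04
# The pattern is an assumption derived from the log excerpt in this thread.
BUILD_RE = re.compile(
    r"^(?P<component>.+?) v(?P<version>\S+) Build (?P<build>\S+)",
    re.MULTILINE,
)


def extract_builds(dmesg_text):
    """Return (component, build_id) pairs for every 'vX Build Y' line."""
    return [
        (m.group("component").strip(), m.group("build"))
        for m in BUILD_RE.finditer(dmesg_text)
    ]


def looks_like_mpss3(build_id):
    """Heuristic from this thread: MPSS 3.x build IDs are expected to start with '3.'."""
    return build_id.startswith("3.")


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python check_builds.py
    for component, build in extract_builds(sys.stdin.read()):
        status = "ok" if looks_like_mpss3(build) else "possibly from an older MPSS"
        print(f"{component}: build {build} ({status})")
```

If dmesg prints timestamps, they will end up inside the component name; that does not affect the build ID that is extracted.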

Yes, the topic and the issues in this thread are mixed. To clarify: with all versions and combinations (I have tried several) I get the ibscif error.

Only with the newer release (3.2) do I also get the "junk" error, which leaves only one of the cards usable. That error drove me nuts, so I decided to wait for a new MPSS release, which will hopefully fix it.

Just to try to clean up here. The issue with the "Initramfs unpacking failed: junk in compressed archive" message, where only one card would come all the way up and the other would get stuck, was a corrupted image file. As Tommi found in the MPSS 3.2.3 release notes, there is a known problem where a corrupted mic0.img.gz file can be generated "if /var/mpss/mic0/... contains softlinks to files not existing in that file system tree", for example, if you are using OFED-3.5-1-MIC-beta1 / OFED-3.5-2-MIC-beta2. The recommendation is to use OFED 1.5.4.1. Also, everyone is being urged to move to the MPSS 3.2.3 release because it contains some important patches in it.

Will this "fix" the original problem with the InfiniBand connections on the host being lost? Probably, although the solution is not ideal for people who would rather be running a more recent version of OFED. 
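
For anyone who wants to check their own overlay tree for the soft links described in the release note, here is a minimal Python sketch. It is not an Intel tool: the function name `find_unresolved_symlinks` is made up for illustration, and treating absolute link targets as relative to the overlay root is my own assumption about how "not existing in that file system tree" should be read.

```python
import os
import sys


def find_unresolved_symlinks(tree_root):
    """Return (link_path, target) pairs for symlinks under tree_root whose
    targets do not exist inside that same tree."""
    tree_root = os.path.abspath(tree_root)
    unresolved = []
    # followlinks=False: symlinked directories are listed but not descended into.
    for dirpath, dirnames, filenames in os.walk(tree_root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.readlink(path)
            if os.path.isabs(target):
                # Assumption: an absolute target is resolved against the
                # overlay root, since that becomes "/" on the card.
                candidate = os.path.join(tree_root, target.lstrip("/"))
            else:
                candidate = os.path.join(dirpath, target)
            candidate = os.path.normpath(candidate)
            inside = candidate == tree_root or candidate.startswith(tree_root + os.sep)
            if not inside or not os.path.lexists(candidate):
                unresolved.append((path, target))
    return sorted(unresolved)


if __name__ == "__main__":
    # Default path taken from the release note quoted above.
    root = sys.argv[1] if len(sys.argv) > 1 else "/var/mpss/mic0"
    for link, target in find_unresolved_symlinks(root):
        print(f"{link} -> {target}")
```

Any link it prints is a candidate for the corruption described above; whether a particular link actually triggers the bug is something only the MPSS release notes or Intel can confirm.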

 

 
