Bug on elision lock


Dear Intel Forum

We have a server with an Intel Xeon E5-2660 v4. The server runs fine overall; however, some applications on Debian 8.5 fail with a "general protection" error. In this case the application was Torque 6.0.2:

gdb:

#0  __lll_unlock_elision (lock=0x5100480, private=0) at ../nptl/sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1  0x0000000000495a0d in unlock_ji_mutex (pjob=0x5115fd0, id=0x510160 <svr_enquejob(job*, int, char const*, bool, bool)::__func__> "svr_enquejob",
    msg=0x50eb85 "1", logging=0) at svr_jobfunc.c:4011
#2  0x000000000048f31e in svr_enquejob (pjob=0x5115fd0, has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false, being_recovered=true)
    at svr_jobfunc.c:421
#3  0x000000000045340a in pbsd_init_reque (pjob=0x5115fd0, change_state=0) at pbsd_init.c:2785
#4  0x0000000000452de1 in pbsd_init_job (pjob=0x5115fd0, type=1) at pbsd_init.c:2623
#5  0x00000000004513f9 in handle_job_recovery (type=1) at pbsd_init.c:1764
#6  0x0000000000451f10 in handle_job_and_array_recovery (type=1) at pbsd_init.c:2061
#7  0x00000000004525be in pbsd_init (type=1) at pbsd_init.c:2277
#8  0x00000000004591ff in main (argc=2, argv=0x7fffffffdec8) at pbsd_main.c:1883

 

dmesg:

 traps: pbs_server[22249] general protection ip:7f9c08a7a2c8 sp:7ffe520b5238 error:0 in libpthread-2.19.so[7f9c08a69000+18000]

valgrind:

==22381== Memcheck, a memory error detector
==22381== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==22381== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==22381== Command: pbs_server
==22381==
==22381==
==22381== HEAP SUMMARY:
==22381==     in use at exit: 18,051 bytes in 53 blocks
==22381==   total heap usage: 169 allocs, 116 frees, 42,410 bytes allocated
==22381==
==22382==
==22382== HEAP SUMMARY:
==22382==     in use at exit: 19,755 bytes in 56 blocks
==22382==   total heap usage: 172 allocs, 116 frees, 44,114 bytes allocated
==22382==
==22381== LEAK SUMMARY:
==22381==    definitely lost: 0 bytes in 0 blocks
==22381==    indirectly lost: 0 bytes in 0 blocks
==22381==      possibly lost: 0 bytes in 0 blocks
==22381==    still reachable: 18,051 bytes in 53 blocks
==22381==         suppressed: 0 bytes in 0 blocks
==22381== Rerun with --leak-check=full to see details of leaked memory
==22381==
==22381== For counts of detected and suppressed errors, rerun with: -v
==22381== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==22383==
==22383== Process terminating with default action of signal 11 (SIGSEGV)
==22383==  General Protection Fault
==22383==    at 0x72192CB: __lll_unlock_elision (elision-unlock.c:33)
==22383==    by 0x4E7E1A: unlock_node(pbsnode*, char const*, char const*, int) (u_lock_ctl.c:268)
==22383==    by 0x4B7A66: mom_hierarchy_handler::make_default_hierarchy() (mom_hierarchy_handler.cpp:164)
==22383==    by 0x4B898C: mom_hierarchy_handler::loadHierarchy() (mom_hierarchy_handler.cpp:433)
==22383==    by 0x4B8AE7: mom_hierarchy_handler::initialLoadHierarchy() (mom_hierarchy_handler.cpp:472)
==22383==    by 0x452629: pbsd_init(int) (pbsd_init.c:2299)
==22383==    by 0x4591FE: main (pbsd_main.c:1883)
==22382== LEAK SUMMARY:
==22382==    definitely lost: 0 bytes in 0 blocks
==22382==    indirectly lost: 0 bytes in 0 blocks
==22382==      possibly lost: 0 bytes in 0 blocks
==22382==    still reachable: 19,755 bytes in 56 blocks
==22382==         suppressed: 0 bytes in 0 blocks
==22382== Rerun with --leak-check=full to see details of leaked memory
==22382==
==22382== For counts of detected and suppressed errors, rerun with: -v
==22382== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==22383==
==22383== HEAP SUMMARY:
==22383==     in use at exit: 325,348 bytes in 186 blocks
==22383==   total heap usage: 297 allocs, 111 frees, 442,971 bytes allocated
==22383==
==22383== LEAK SUMMARY:
==22383==    definitely lost: 134 bytes in 6 blocks
==22383==    indirectly lost: 28 bytes in 3 blocks
==22383==      possibly lost: 524 bytes in 17 blocks
==22383==    still reachable: 324,662 bytes in 160 blocks
==22383==         suppressed: 0 bytes in 0 blocks
==22383== Rerun with --leak-check=full to see details of leaked memory
==22383==
==22383== For counts of detected and suppressed errors, rerun with: -v
==22383== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

After some searching, I suspect there is a bug in lock elision on this Xeon model. Could that be the case?

  Best regards.


I had to recompile glibc with --enable-lock-elision=no to work around this. Could this be a bug in the Xeon?

Best Reply

Hi,

It is a bug in the Torque software, which unlocks a pthread mutex that is already unlocked. Such usage has undefined behavior. Citing the pthread manual page: "If a thread attempts to unlock a mutex that it has not locked or a mutex which is unlocked, undefined behavior results."

With lock elision, unlocking a free lock is no longer tolerated: "pthread_mutex_unlock() detects whether the current lock is executed transactionally by checking if the lock is free. If it is free it commits the transaction, otherwise the lock is unlocked normally. This implies that if a broken program unlocks a free lock, it may attempt to commit outside a transaction, an error which causes a fault in RTM." This is consistent with both backtraces above, which fault in __lll_unlock_elision.

I would suggest filing a bug against Torque: unlocking a free lock is suspicious in general (it points to a potential problem with the correctness of the lock protection scopes), and the code relies on how one specific implementation happens to handle undefined behavior.
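To help locate such an unlock in application code, one option is an error-checking mutex, which reports the misuse as a return code instead of leaving it undefined. The following is only a minimal sketch and is not taken from Torque: the helper name unlock_unheld_mutex is hypothetical, and it assumes a POSIX system where PTHREAD_MUTEX_ERRORCHECK is available.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose PTHREAD_MUTEX_ERRORCHECK */
#endif

#include <errno.h>
#include <pthread.h>

/* Hypothetical helper (not part of Torque or glibc): lock and unlock an
 * error-checking mutex once, then unlock it a second time and return
 * the result of that second, invalid unlock. */
int unlock_unheld_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc;

    pthread_mutexattr_init(&attr);
    /* Error-checking type: misuse is reported through the return value. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&mutex);
    pthread_mutex_unlock(&mutex);       /* valid unlock */
    rc = pthread_mutex_unlock(&mutex);  /* mutex is already free here */

    pthread_mutex_destroy(&mutex);
    return rc;
}
```

With an error-checking mutex, POSIX specifies that the second unlock fails with EPERM rather than invoking undefined behavior. Checking the return value of pthread_mutex_unlock() in wrappers such as unlock_ji_mutex or unlock_node (the callers shown in the backtraces) could then point to the offending call site.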

Thanks,

Roman
