Very Slow getc() in Concurrency Analysis

Huan Z.

Recently I was using VTune to profile some parallel benchmarks, and some programs took much longer to execute under VTune than when run directly. After some tests, I found that the system function "getc" is very slow when I use VTune, so loading big input files takes a very long time. To make this scenario easy to reproduce, I compiled a simple md5sum program (ftp://quatramaran.ens.fr/pub/madore/misc/md5sum.c), which calls getc, and then ran a concurrency analysis on it with VTune.

$ gcc -o md5sum -g -O2 md5sum.c
$ time ./md5sum webdocs_250k.dat
f44f49ac4fb609005ba3bd2fb511df54 webdocs_250k.dat
real 0m2.997s
user 0m2.959s
sys 0m0.036s

$ amplxe-cl -collect concurrency -knob enable-user-tasks=true -- ./md5sum webdocs_250k.dat
f44f49ac4fb609005ba3bd2fb511df54 webdocs_250k.dat
Using result path `/home/zhang/parsec-3.0/pkgs/apps/freqmine/run/r005cc'
Executing actions 34 % Resolving information for `libc-2.3.4.so'
Warning: Cannot locate symbols for file `/opt/intel/vtune_amplifier_xe_2013/lib64/pinruntime/glibc/libc-2.3.4.so'.
Executing actions 36 % Resolving information for `libc-2.12.so'
Warning: Cannot locate symbols for file `/lib64/libc-2.12.so'.
Executing actions 36 % Resolving information for `libtpsstool.so'
Warning: Cannot locate symbols for file `/opt/intel/vtune_amplifier_xe_2013/lib64/libtpsstool.so'.
Executing actions 50 % Generating a report
Summary
-------

Average Concurrency: 1.000
Elapsed Time: 135.801
CPU Time: 135.798
Wait Time: 0.007
CPU Usage: 1.000
Executing actions 100 % done

As you can see, when I run md5sum directly, it takes only about 3s. But if I use VTune (2013 Update 2), it takes about 130s! The analysis result shown in VTune GUI (see the picture attached) clearly indicates that the "getc" function call takes 127s! I wonder whether this is a known issue in VTune, whether I need different command-line arguments to get the expected getc speed, or whether it is related to my system configuration (kernel, glibc, etc.).

Here is my system information:

$ cat /etc/issue
CentOS release 6.3 (Final)
Kernel \r on an \m

$ uname -a
Linux yang 2.6.32-279.9.1.el6.x86_64 #1 SMP Tue Sep 25 21:43:11 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ amplxe-cl --version
Intel(R) VTune(TM) Amplifier XE 2013 Update 2 (build 253325) Command Line Tool
Copyright (C) 2009-2012 Intel Corporation. All rights reserved.

Any suggestions are appreciated. Thanks!

Attachment      Size
vtune.png       120.2 KB
md5sum.c        7 KB

Sergey Kostrov

>>...I found that the system function "getc" is very slow when I use VTune...

Could you try "getc_nolock" ("getc_unlocked" on POSIX systems), which also reads a character from a stream but without locking the stream?

I never use "getc" because it is indeed a very slow function: it reads only one character per call.
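
As a rough illustration of the alternative to per-character reads, the sketch below pulls data in with fread in large blocks and walks the buffer in user code. The helper name sum_bytes_block and the 64 KB buffer size are assumptions made for this example, not anything taken from md5sum.c. Note that stdio already buffers getc internally, so the saving here is in per-call overhead (the function call and, in glibc, a stream lock), not in system calls.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: sums every byte of a stream using block reads.
   Stores the number of bytes read in *nbytes when nbytes is not NULL. */
unsigned long sum_bytes_block(FILE *fp, size_t *nbytes)
{
    unsigned char buf[65536];   /* buffer size is an arbitrary choice */
    unsigned long sum = 0;
    size_t total = 0;
    size_t n, i;

    /* One library call per block instead of one per character. */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (i = 0; i < n; i++)
            sum += buf[i];
        total += n;
    }
    if (nbytes != NULL)
        *nbytes = total;
    return sum;
}
```

A real checksum routine would feed each block to its update function where this sketch adds up bytes.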

Sergey Kostrov

>>...it takes about 130s! The analysis result shown in VTune GUI (see the picture attached) clearly indicates that the "getc" function
>>call takes 127s!

Interesting. My question is: how is that possible?

Huan Z.

Thank you so much for your fast response!

I tried using "getc_unlocked" in the md5sum test program and it works fine. With it, VTune produces a reasonable result.
However, the problem I am facing is that I have a large amount of source code that uses the slow getc/putc/fgetc/fputc/fgetwc/fputwc..., and some of it is in shared libraries (libjpeg, mesa, etc.), so it would be quite tedious to modify all of it and guarantee correctness.
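
For reference, the kind of source-level change described above might look like the following minimal sketch. It assumes a POSIX system; the function name sum_bytes_unlocked is hypothetical. The stream is locked once with flockfile, read with getc_unlocked inside the loop, and released with funlockfile, so the loop stays safe even if other threads share the FILE.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* request flockfile/getc_unlocked declarations */
#endif
#include <stdio.h>

/* Hypothetical helper: sums every byte of a stream, taking the stream
   lock once for the whole loop instead of once per character. */
unsigned long sum_bytes_unlocked(FILE *fp)
{
    unsigned long sum = 0;
    int c;

    flockfile(fp);                          /* acquire the stream lock once */
    while ((c = getc_unlocked(fp)) != EOF)  /* no per-call locking here */
        sum += (unsigned char)c;
    funlockfile(fp);                        /* release it when done */
    return sum;
}
```

In a single-threaded program the flockfile/funlockfile pair could be dropped, but keeping it costs only one lock per loop.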

I am trying to write a wrapper that redirects getc/putc and so on to their unlocked versions, and to use the LD_PRELOAD trick to override their default behavior. However, the problem is that getc and putc are optimized as macros and there are actually no library calls. Are there any easier ways to do this? Thanks in advance.
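
A minimal sketch of such a preload shim is shown below, with several assumptions stated up front. The file name unlocked_shim.c is made up. The sketch assumes that on older glibc the getc/putc macros expand to calls to _IO_getc/_IO_putc, so it defines those symbols as well as the plain names. It has not been verified under VTune, and it covers only the narrow-character functions. It also removes per-call locking for every stream in the process, which is only safe if no two threads use the same FILE at the same time.

```c
/* unlocked_shim.c: hypothetical LD_PRELOAD shim (file name is illustrative). */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* request getc_unlocked/putc_unlocked */
#endif
#include <stdio.h>

/* Drop any macro definitions so the names below define real functions. */
#undef getc
#undef putc
#undef fgetc
#undef fputc

/* Each override forwards to the POSIX unlocked variant. */
int fgetc(FILE *fp)        { return getc_unlocked(fp); }
int getc(FILE *fp)         { return getc_unlocked(fp); }
int fputc(int c, FILE *fp) { return putc_unlocked(c, fp); }
int putc(int c, FILE *fp)  { return putc_unlocked(c, fp); }

/* Assumption: older glibc headers turn getc()/putc() into calls to these
   internal entry points, so they are interposed too. */
int _IO_getc(FILE *fp)        { return getc_unlocked(fp); }
int _IO_putc(int c, FILE *fp) { return putc_unlocked(c, fp); }
```

Such a file would typically be built with something like "gcc -shared -fPIC -O2 -o libunlocked_shim.so unlocked_shim.c" and loaded with "LD_PRELOAD=./libunlocked_shim.so ./md5sum webdocs_250k.dat". Code that was compiled against inlined unlocked variants, or that is statically linked, would not be affected by the shim.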

Sergey Kostrov

>>...I have a large amount of source code that uses the slow getc/putc/fgetc/fputc/fgetwc/fputwc...

Even if you have hundreds of places in the source files with calls to these functions, it is only a matter of time to change all these calls.

>>...However, the problem is that getc and putc are optimized as macros and there are actually no library calls. Are there any
>>easier ways to do this?

I resolved almost the same problem with CRT function wrappers. This means that "unwrapped" calls to CRT functions are NOT allowed. In the project I currently work on, it looks like:
...
#define CrtPrintf printf
#define CrtPutc putc
...
CrtPrintf( ... );
CrtPutc( ... );
...
and so on. However, the project was designed and implemented with such constraints from the beginning.
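
To show how such a wrapper layer could switch implementations in one place, here is a small sketch that assumes a POSIX toolchain. CrtPrintf and CrtPutc come from the snippet above. CrtGetc, the CRT_USE_UNLOCKED build flag, and the copy_stream helper are hypothetical names added for illustration only.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* needed only for the unlocked variants */
#endif
#include <stdio.h>

/* CRT_USE_UNLOCKED is a hypothetical build flag (e.g. -DCRT_USE_UNLOCKED). */
#ifdef CRT_USE_UNLOCKED
#define CrtGetc getc_unlocked
#define CrtPutc putc_unlocked
#else
#define CrtGetc getc
#define CrtPutc putc
#endif
#define CrtPrintf printf

/* Hypothetical helper that uses only wrapped calls: copies one stream to
   another and returns the number of bytes copied, or -1 on write error. */
long copy_stream(FILE *in, FILE *out)
{
    long n = 0;
    int c;

    while ((c = CrtGetc(in)) != EOF) {
        if (CrtPutc(c, out) == EOF)
            return -1;
        n++;
    }
    return n;
}
```

Because every call site goes through the Crt names, moving to the unlocked variants becomes a build-flag change. That only helps code written against the wrappers from the start, which matches the constraint mentioned above.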
