New user of Intel fortran on Win XP-64

Hi,

I am a new user to the Intel Fortran Forum.
I have come here to gain access to a 64-bit compiler and also to manage multi-processor computation.
I have a lot of experience in the 32-bit Microsoft Windows environment, using Lahey and Salford compilers, and am hoping to get a new level of performance from this new environment.

To date I have managed to compile, link, and run my 32-bit program using the Intel 64-bit compiler, with only 2 minor problems:
1) Is there a routine of the form Get_IOSTAT_Message (IOSTAT, Message_return)? I couldn't find it.
2) I notice RECL= is a 4-byte word count rather than the Fortran standard byte count.
This program uses about 1.7gb of memory.

I have also run a simple test program which allocates large arrays (8gb in size), which appears to confirm I am getting access to more than 2gb of addressable memory. So far so good.

My next problem is to try and introduce some processor efficiency.
I tried "ifort /?" and was swamped with optimisation and hardware options. Should I know what these mean, or do I have to go through a learning curve to use these options?!
I have a dual core Xeon pc, but was lost as to what to select to:
- get better performance suited to this processor,
- be able to target multi-thread calculations.

My software is basically Finite Element calculations, where for large problems, there are two "3-line do loop" routines that do most of the work (dot_product: c = vec_a . vec_b, and vector subtraction: vec_a = vec_a - c . vec_b). Do I compile just these routines with the special options, or also the main procedure, or all the code? Also, what are the options?

Hopefully I can post this message on the forum and start to learn this new compiler environment which appears to have so much potential. I thank you for your assistance.

John


John,

You could try using the IOMSG keyword in the statements generating the IOSTAT value - this should return the error message.
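For example (a minimal sketch; the unit number, file name, and message length are only illustrative):

      character(len=256) :: msg
      integer            :: ios
!
      open (unit=10, file='data.dat', status='old', iostat=ios, iomsg=msg)
      if (ios /= 0) write (*,*) 'OPEN failed: ', trim(msg)   ! msg holds the runtime error text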

david

The Fortran standard does not specify the units of RECL= in OPEN. It wasn't until F2003 that it was "suggested" that the unit be 1 byte. Before that, it was just "storage unit" and many compilers took that as "numeric storage unit", including the DEC compilers of the 1970s which are ancestors of today's Intel Fortran. /assume:byterecl sets the unit to 1 byte.
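For example (a sketch only; the file name and record layout are illustrative), a direct-access record of 100 real*8 values would then be opened as:

      ! With /assume:byterecl, RECL= counts bytes; without it, ifort counts 4-byte units for unformatted files.
      open (unit=20, file='matrix.dat', access='direct', form='unformatted', recl=100*8)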

For processor targeting options, I suggest you use /QxHost, which looks at the processor you compile on and chooses the appropriate setting for that processor. You can also try /Qparallel and look at the new "Guided Auto Parallelization" feature to help you get the most out of it.
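For example, a compile line using both options might look like this (the source file name is only illustrative):

ifort /c /QxHost /Qparallel solver.f90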

Steve - Intel Developer Support

Thank you for your prompt replies. I shall investigate these options.

Can I include /QxHost on all files, but also /Qparallel on only the one file that contains the two routines ? I have about 60 *.f95 files in 4 directories to compile.

As these routines can have vector sizes from 1 to 10,000, should I provide a covering routine to select a single thread routine for vectors of size less than (say) 6 elements ?

John

I included the option /QxHost on all files. I observed a 2% reduction in run times.

The following code contains the two key routines, which I have now written in Fortran array syntax. I previously used a DO loop and F77 style syntax.
I tried /Qparallel for this file. None of the code options I have tried has achieved multi-processor performance. I fear I am misunderstanding what can be achieved. Any recommendations on where I am going wrong?

Also, excuse my ignorance, but where would I find the new "Guided Auto Parallelization" feature ?

      SUBROUTINE VEC_SUB (A, B, FAC, N)
!
!   Performs the vector operation  [A] = [A] - [B] * FAC
!
      integer*4,                 intent (in)    :: n
      real*8,    dimension(n),   intent (inout) :: a
      real*8,    dimension(n),   intent (in)    :: b
      real*8,                    intent (in)    :: fac
!
      integer*4 i
      real*8    c
!
      c = -fac
!  do i = 1,n
!    a(i) = a(i) + b(i) * c
!  end do
      a = a + b * c
      return
!
      END

      REAL*8 FUNCTION VEC_SUM (A, B, N)
!
!   Performs a vector dot product  VEC_SUM =  [A] . [B]
!     account is NOT taken of the zero terms in the vectors
!
      integer*4,                 intent (in)    :: n
      real*8,    dimension(n),   intent (in)    :: a
      real*8,    dimension(n),   intent (in)    :: b
!
      real*8    c
!
!  c = 0
!  do i = 1,n
!    c = c + a(i) * b(i)
!  end do
!
      c = dot_product ( a, b )
      vec_sum = c
      return
!
      END

Assuming you have checked that you get vectorization of these functions, parallelization isn't likely to show additional speedup unless the loop counts are several thousand. You don't have any local information which would tell the compiler if this is the case, so it will probably optimize for loop count (100). Alternatively, you could use /Qpar-threshold to make the parallelization more aggressive, recognizing that many of the additional parallelizations will be unproductive. Even if the loop counts are long enough for combined vectorization and parallelization, parallelization will be more effective if it can be done outside these function calls.
The current compilers are unpredictable about whether the f77 or dot_product syntax is more likely to vectorize. If the f77 loop doesn't vectorize (e.g. supposing you set /fp:source), you should be able to get vectorization by directive
!dir$ simd reduction(+: sum)
do i=1,n
....
If you find useful opportunities for parallelism, you may need OpenMP directives to specify in detail which parallelizations you want.
The guided auto-parallel should make suggestions about vectorization and parallelization. It should help to set e.g.
!dir$ loop count(3000)
(substitute appropriate value) on all loops in order to get better suggestions.
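Applied to the dot-product loop above, the directives might look like this (a sketch only; the loop count value is just a hint and should reflect your typical N):

!dir$ loop count (3000)
!dir$ simd reduction(+: c)
do i = 1,n
c = c + a(i)*b(i)
end do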
Documentation on GAP is here. Needless to say, the writer of the doc forgot about the existence of Fortran.

One of the differences between /Qxhost (for core i7) and default /arch:sse2 for double precision is the compiler's choice between scalar and parallel loads for data of unknown alignment. While the Core i7 and later CPUs are more likely to benefit from the parallel loads (almost certainly so if the data are actually aligned, although the compiler doesn't know it), it isn't always the best choice. If you do know that the data are always aligned, you can use
!dir$ vector aligned
to make these loops more efficient. This will cause run-time failure if your assertion is incorrect.
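For instance, applied to the VEC_SUB loop it would look like this (a sketch; use it only if the caller guarantees the alignment of A and B):

!dir$ vector aligned
do i = 1,n
a(i) = a(i) + b(i) * c
end do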

Tim,

Thank you very much for your comments. Unfortunately I do not yet fully understand what you have stated, as I'm a bit new to the Intel terminology.

You have discussed vectorisation and parallelization; I need to better understand what they mean.
My initial expectation was to be able to compile the dot_product so as to use both processors to perform the calculation, when the loop size was big enough, with the compiler managing all the multi-thread handshaking. I thought this would be parallelization. Is this a possibility, for the explicit DO loop I provided ? If so I will be asking how later !
Do you have a simple description of vectorization ?

I have had some success today.
1) I have managed to get my 64-bit program version running and have solved 2 problems, one with a 2.4gb matrix size and one with a 5.8gb matrix size, using a single allocated array. They both worked, maxing out at 50% CPU on my 2-processor machine, giving comparable results to the out-of-core 32-bit application. Unfortunately, I only have 6gb of physical memory so the elapsed time was not as good as could be hoped. More hardware required.
2) I also calculated a histogram of the loop size. For the smaller problem, there were 271,679,207 Dot_Product calls, 18% were in the range 404-665, 31% in the range 666-1096, 17% 1097-1808 and 10% > 1808. Only 8.5% were < 150. (could not insert the excel chart) It would appear that a substantial number of calls could benefit from multi threading ?

I'll try to learn more tomorrow.

John

John,

>>My software is basically Finite Element calculations, where for large problems, there are two "3-line do loop" routines that do most of the work. (dot_product : c = vec_a . vec_b and vector subtraction : vec_a = vec_a - c . vec_b)

FE systems are often composed in one of two ways:

a) A single large mesh
b) Multiple smaller mesh objects

In the case of the single large mesh you can add OpenMP parallelization at the inner loop level:

c = -fac
!$omp parallel do
do i = 1,n
a(i) = a(i) + b(i) * c
end do
!$omp end parallel do

In the case of multiple smaller mesh objects you move the parallelization outwards in your program:

!$omp parallel do
do iObj = 1, nObjs
call VEC_SUB(Objects(iObj)%A, Objects(iObj)%B, aFAC, Objects(iObj)%nVerts)
end do
!$omp end parallel do
...

The above case assumes a sufficiently large number of objects with a similar number of vertices.
The above case partitions the iteration space 1:nObjs into a number of partitions equal to the number of available threads (typically the logical CPU count). When the objects have approximately the same number of vertices then this partitioning works ok. However, when the number of vertices varies, equal partitioning may not yield an equal amount of work. In this situation you might want to modify the above loop's !$omp directive

!$omp parallel do schedule(static,1)

The above causes each thread in turn to pick the next object. There are other scheduling clauses for varying circumstances.
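For example, dynamic scheduling hands each thread the next unprocessed object as it becomes free (a sketch following the loop above):

!$omp parallel do schedule(dynamic,1)
do iObj = 1, nObjs
call VEC_SUB(Objects(iObj)%A, Objects(iObj)%B, aFAC, Objects(iObj)%nVerts)
end do
!$omp end parallel do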

The compiler's auto parallelization knows little about your program and would parallelize only the innermost statements.

A general rule to follow:

Parallelize outer
Vectorize inner

Parallelizing the inner loop of the DOT product requires a little care

c = 0.0
!$omp parallel do reduction(+:c)
do i = 1, n
c = c + a(i) * b(i)
end do
!$omp end parallel do

With the reduction(+:c) clause, each thread starts with a different variable for c, initialized to 0, and produces a partial sum for its slice of the iteration space 1:n. Upon exit from the parallel region, the thread private partial sums are atomically summed to produce the correct result.

Without the reduction clause, the end summation has a high probability of being incorrect (due to floating point addition not being multi-thread-safe atomic by nature).

Jim Dempsey

www.quickthreadprogramming.com

Here are some good resources (in addition to the documentation for the compiler installed on your system) for learning how to optimize with the Intel compiler and more about vectorization:

http://software.intel.com/en-us/articles/step-by-step-application-perfor...

http://software.intel.com/en-us/articles/being-successful-with-the-intel...

http://software.intel.com/en-us/articles/vectorization-with-the-intel-co...

http://www.intel.com/intelpress/sum_vmmx.htm

Hopefully some of these will be helpful.

------

Wendy


Jim,

Thanks for your comments. Again I will try to understand what you have said.

My FE calculations are based on a Skyline direct solver, which was popular in the 80's. Finite Element systems are large collections of discrete elements, which when converted to a system of linear equations become a symmetric matrix with significant variation in bandwidth (profile). In a direct solution approach, the forward reduction of the stiffness matrix and the forward reduction of the load vectors are based on dot_product, while the backwards substitution of load vectors is based on Vec_Sub.
My understanding to date of vectorisation is that it is based on parallel calculation of small vectors (probably size 3 to 6), as carried out in a graphics transformation within a single supporting CPU, while parallelization is multi-threading across multiple CPUs.
Based on this, as a good place to start and an initial learning experience, I am trying to achieve some vectorisation or parallelisation of my basic dot_product routine. After reading some documentation, I have progressed to the following code idea, but was not able to achieve any change to run time performance or an increase above 50% CPU in task manager. I am hoping someone could review the code and suggest an appropriate ifort compile option that might work.

The latest code "vec_new.f95" I tried was:

      REAL*8 FUNCTION VEC_SUM (A, B, N)
!
!   Performs a vector dot product  VEC_SUM = [A] . [B]
!     account is NOT taken of the zero terms in the vectors
!     c = dot_product ( a, b )
!
      integer*4,               intent (in) :: n
      real*8,    dimension(n), intent (in) :: a
      real*8,    dimension(n), intent (in) :: b
!
      real*8    c
      integer*4 i
!
      c = 0.0
      if ( n > 100 ) then
!DEC$ PARALLEL
         do i = 1,n
            c = c + a(i)*b(i)
         end do
      else
!DEC$ NOPARALLEL
         do i = 1,n
            c = c + a(i)*b(i)
         end do
      end if
!
      vec_sum = c
      RETURN
!
      END

The compile command I used was:
ifort /c /source:vec_new.f95 /free /fast /QxHost /Qparallel

I have been using: Intel Visual Fortran Intel 64 Compiler Professional for applications running on Intel 64, Version 11.1.065. I hope this supports what I am trying to achieve.
My processor is Intel Xeon CPU W3505 @ 2.53GHz
I am assuming this is a dual-core processor. I have no idea if it supports vectorisation.

I am hoping this is a good place to start in identifying some improved performance. I see this as a fairly concise problem to address.

John

I must admit I am becoming confused ! I have been searching through a number of articles on Vectorizing and Parallelizing of Fortran Dot_Product. While the discussions appear to be very complex and rich in terminology, I find a lot of confusing discussions which do not provide a clear and concise answer. It may be the clear answer does not exist !!

However I will try !

My goal is to modify the following two code examples to produce improved performance, via either Vectorizing or Parallelizing. The code as best I can summarise is:

      SUBROUTINE VEC_SUB (A, B, FAC, N)
!
!     Performs the vector operation  [A] = [A] - [B] * FAC
!
      integer*4,                 intent (in)    :: n
      real*8,    dimension(n),   intent (inout) :: a
      real*8,    dimension(n),   intent (in)    :: b
      real*8,                    intent (in)    :: fac
!
      integer*4 i
      real*8    c
!
      c = -fac
      do i = 1,n
         a(i) = a(i) + b(i) * c
      end do
      return
!
      END

      REAL*8 PURE FUNCTION VEC_SUM (A, B, N)
!
!     Performs a vector dot product  VEC_SUM =  [A] . [B]
!     account is NOT taken of the zero terms in the vectors
!
      integer*4,                 intent (in)    :: n
      real*8,    dimension(n),   intent (in)    :: a
      real*8,    dimension(n),   intent (in)    :: b
!
      integer*4 i
      real*8    c
!
      c = 0.0
      do i = 1,n
         c = c + a(i)*b(i)
      end do
      vec_sum = c
      RETURN
!
      END

Can someone provide a suggestion of a combination of code modification and compatible compiler options that improve the performance ?

From my previous posts, the value of n varies considerably between 1 and 10,000, with less than 10% of n < 150. Based on my understanding of vectorisation, that it is related to instruction sets that suit graphics vector calculation, I'd expect this would be the better option, if the CPU supports vector calculations for real*8 in a 64-bit environment.

If a concise vectorized solution to Dot_Product does exist, where the inputs are rank-one vectors, why doesn't ifort use this if /fast is selected ?

If this is not the case, then why not ??

I'm not trying to be overly critical as a new user of this forum, but as a long time fortran programmer, my best advice has always been KISS, keep it simple s... Our goal should be to provide a simple solution, as I'm sure I'm not the first one to travel down this path.

John

>>My FE calculations are based on a Skyline direct solver, which was popular in the 80's. Finite Element systems are large collections of discrete elements, when converted to a system of linear equations becomes a symmetric matrix of significant variation in bandwidth(profile).

Are the dot_products and Vec_Subs performed:

a) On each discrete element individually with additional code handling any element-to-element interaction.

or

b) On a single large matrix built from the concatenation/assemblage of the discrete elements

If a) then perform the parallelization on the element-by-element loop.

If b), then perform the parallelization inside the dot_product and vec_sub loops

>>My understanding to date of vectorisation is this is based on parallel calculation of small vectors ( probably size 3 to 6) that is carried out

The vector size depends on your system architecture. SSE vector instructions handle small vectors of 2 REAL(8)s or 4 REAL(4)s; AVX handles 4 REAL(8)s or 8 REAL(4)s.
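(That is, SSE registers are 128 bits wide, so 128/64 = 2 REAL(8) or 128/32 = 4 REAL(4) elements per register; AVX registers are 256 bits wide, giving 4 REAL(8) or 8 REAL(4).)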

When the data is organized favorably, and when the code can make use of small vectors, vectorization can significantly improve performance. On a multi-core system, parallelization also improves performance. So the ideal is to use both vectorization and parallelization.

>>Based on this a good place to start, as an initial learning experience, I am trying to achieve some vectorisation or parallelisation of my basic dot_product routine.

Good for you to take this initiative.

>>>After reading some documentation, I have progressed to the following code idea, but was not able to achieve any change to run time performance or an increase above 50% CPU in task manager.

This suggests either the code was not parallelized, or it is not entering the parallel region, or the parallel loops are too small to warrant parallelization.

I suggest you turn off auto-parallelization and use OpenMP for parallelization. OpenMP gives you better control over when and where you perform the parallelization.

!DEC$ PARALLEL
do i = 1,n
c = c + a(i)*b(i)
end do

becomes (depending on the version of OpenMP, you can add a threshold clause):

!$OMP PARALLEL DO REDUCTION(+:c) IF(n > 10000)
do i = 1,n
c = c + a(i)*b(i)
end do
!$OMP END PARALLEL DO

Starting up and shutting down parallel regions has some overhead. The loop code times the number of iterations has to be sufficiently large to absorb the overhead. For a simple DOT product, 100 is too low; 10,000 could be more than necessary to warrant parallelization.

>>I'm not trying to be overly critical as a new user of this forum, but as a long time fortran programmer, my best advice has always been KISS, keep it simple s... Our goal should be to provide a simple solution, as I'm sure I'm not the first one to travel down this path

This forum is the best place to seek advice.

Jim

www.quickthreadprogramming.com

1) Maybe you should be interested in IOMSG=
2) -assume:byterecl switches the RECL units to bytes

You haven't responded to a fraction of the answers you've already received to your questions. As we've already pointed out, you should first check whether your important loops have been auto-vectorized, and get some idea what their typical length is.
Profiling might soon become important (if you haven't already been doing it), and Windows severely limits the options there (Intel VTune, AMD CodeAnalyst).

Forgive my butting in, but this appears a good point to ask Jim/Steve about one aspect of the sort of code that the poster is trying to maximally optimise for speed.

Given the code

do i = 1,n

c = c + a(i)*b(i)

end do

where the variable c is local to this particular patch of code, are all modern compilers intelligent enough to know that it is enough to increment 'c' in a local register and to neither send each incremented value for 'c' on the left of the '=' sign to the address assigned to 'c' nor to access from the address of 'c' the value required on the right of the '=' sign each time an increment is to be applied to it? Thus one memory set and one memory fetch is eliminated from the whole sum represented by the code inside the loop? Clearly on exiting the loop, the 'final' value of 'c' must be set at the address of variable 'c' eventually.

(My Assembler knowledge was only used many years ago on CDC machines, but even then, there were available a handful of floating-point registers useful for storing such incremented values without requiring memory calls and also a handful of integer registers useful for computing 2-D and 3-D array element locations by utilising increments by stride lengths to integer register values without having to get/fetch values from the addresses of array index variables. Thus, a standard compiled version of the Gauss-Jordan inversion of a symmetric real matrix was spectacularly improved by switching to assembler code making maximum use of local registers)

Quoting anthonyrichards
Forgive my butting in, but this appears a good point to ask Jim/Steve about one aspect of the sort of code that the poster is trying to maximally optimise for speed.

Given the code

do i = 1,n

c = c + a(i)*b(i)

end do

where the variable c is local to this particular patch of code, are all modern compilers intelligent enough to know that it is enough to increment 'c' in a local register and to neither send each incremented value for 'c' on the left of the '=' sign to the address assigned to 'c' nor to access from the address of 'c' the value required on the right of the '=' sign each time an increment is to be applied to it? Thus one memory set and one memory fetch is eliminated from the whole sum represented by the code inside the loop? Clearly on exiting the loop, the 'final' value of 'c' must be set at the address of variable 'c' eventually.

(My Assembler knowledge was only used many years ago on CDC machines, but even then, there were available a handful of floating-point registers useful for storing such incremented values without requiring memory calls and also a handful of integer registers useful for computing 2-D and 3-D array element locations by utilising increments by stride lengths to integer register values without having to get/fetch values from the addresses of array index variables)

This is one of the more crucial points where it's important to check the optional compiler optimization reports. If -Qvec-report tells you this loop has been auto-vectorized, that includes parallel register accumulation of the result, and final addition of the individual batched sums.
If you set an option which inhibits such optimization, such as /fp:source, you can still over-ride that for the individual loop by !dir$ simd reduction(+: c) in the current ifort.

You are right to some extent (up to the last part)

do i = 1,n

c = c + a(i)*b(i)

end do

Assume for example a and b are REAL(4) arrays and assume you compile using optimizations for P4 and later processors. Without parallelization the above loop is vectorizable. Ignore for the moment that n could be very small (1, 2, 3,...) and that the compiler code will generate hidden tests for minimal size of n and alignments for a(1) and b(1).

The pseudo code for the above loop might look like

temp_c[1:4] = {0.0, 0.0, 0.0, 0.0} ! one instruction xorps xmm0,xmm0
loop:
temp_a[1:4] = a(i:i+3) ! one instruction "movaps xmm1,[reg_a+reg_i*4]"
temp_a[1:4] *= b(i:i+3) ! one instruction "mulps xmm1,[reg_b+reg_i*4]"
temp_c[1:4] += temp_a[1:4] ! one instruction "addps xmm0,xmm1"
reg_i += 4 ! add reg_i, 4
if(reg_i <= n) goto loop ! two instructions "cmp reg_i,reg_n; jle loop"
temp_c[1] = temp_c[1] + temp_c[2] + temp_c[3] + temp_c[4]
! above in two instructions "haddps xmm0,xmm0; haddps xmm0,xmm0"
! finally
c += temp_c[1] ! *** following three instructions
movss xmm2,[reg_c] ! xmm2 = c
addps xmm2,xmm0 ! xmm2 += temp_c[1] (and 2, 3, 4)
movss [reg_c],xmm2 ! c = xmm2
! *** or two instructions
addss xmm0,[reg_c] ! temp_c[1] += c
movss [reg_c],xmm0 ! c = xmm0

For this example we also assume the loop count n is a multiple of 4 (the compiler adds code to test and take additional code paths for residual array elements).

In the above sketch you can observe that 4 elements of each array are processed at each step (REAL(8) would process two at a time).

At the exit of the loop you find two approximate representations of how the eventual result sum is added to c.

Note that you are unable to perform the final summation to c in a single floating point instruction.
Even for integer summation direct to memory (add [c],temp_c), the single instruction operation would be performed as a Read/Modify/Write (Read / Add / Write).

In a multi-threaded environment both floating point summation and integer summation to memory have a time interval between the Read and the Write. In your case of SSE floating point, the instruction latencies up to the write could be anywhere between 25 and 250 clock ticks (approximate). It is probable that two threads might do the following

C = 0.0
enter parallel
Thread A | Thread B
Read old C (0.0) to register in A
| Read old C (0.0) to register in B
Add A's partial sum to register in A (0.0 + Asum)
| Add B's partial sum to register in B (0.0 + Bsum)
Write new C (0.0 + Asum)
| Write new C (0.0 + Bsum)
exit parallel

C now == 0.0 + Bsum, not 0.0 + Asum + Bsum

Note that you may find the correct result 99.99% of the time (the two threads not finishing up at the same time). The probability is real that you will at times end up with only one of the threads' sums.

Jim Dempsey

www.quickthreadprogramming.com

There are ways to handle the race condition

c = 0.0 ! return complete DOT
!$omp parallel do reduction(+:c)
do i = 1,n
c = c + a(i)*b(i)
end do
!$omp end parallel do

----------------

c_temp = 0.0 ! accumulate DOT
!$omp parallel do reduction(+:c_temp)
do i = 1,n
c_temp = c_temp + a(i)*b(i)
end do
!$omp end parallel do
c = c + c_temp ! accumulate to caller

-----------------

!$omp parallel private(c_temp) shared(c)
c_temp = 0.0 ! thread local accumulate DOT
!$omp do
do i = 1,n
c_temp = c_temp + a(i)*b(i)
end do
!$omp end do
!$omp critical ! or in this case !$omp atomic
c = c + c_temp ! accumulate to caller
!$omp end critical ! or omit for !$omp atomic
!$omp end parallel
----

You would choose the last method if you have more code to perform in the critical section.

Jim Dempsey

www.quickthreadprogramming.com

Tim,

>> You haven't responded to a fraction of the answers you've already received to your questions <<
Thanks for your advice on using this forum, I will try to answer questions in each post and try to progress.

Your points 1) and 2) have been understood and are no longer an issue.

>> As we've already pointed out, you should first check whether your important loops have been auto-vectorized, <<
I have used the following command:
ifort /c /source:vec_new.f95 /free /O3 /QxHost /Qparallel /Qvec-report /Qpar-report /Qdiag-file:vec_new.lis
(I could not find the option to include these reports in the code listing, i.e. a /list:file option.)

I received the following report:

E:\Intel_Sap_64\test\vec_new.f95(18): (col. 7) remark: LOOP WAS VECTORIZED.

E:\Intel_Sap_64\test\vec_new.f95(40): (col. 7) remark: LOOP WAS VECTORIZED.

This appears to show that vectorisation was implemented, but no auto parallelism. When I run this version, I do not see the benefits of the vectorisation in my run times. ( Xeon 3505 not supporting, or some other problem ? )
Given that the code is vectorised, why do I not see any run time improvement ?
Is it a problem with the compile command for the main program or other routines that call Vec_Sum, or a problem with the link command ? Are you able to comment on this observation.
My general compile command for other procedures is : ifort /c /O2 /source:%1.f95 /free /QxHost
My link command is basically: ifort *.obj ..\lib\*.obj /Feprog.exe /map:prog.map

To reply to Jim Dempsey's #11 comments:
>> Are the dot_products and Vec_Subs performed: ... <<
Basically b). In a skyline solver, the equation solution acts on the full equation matrix as a set of linear equations, with element grouping of equations basically removed. "N" is the equation bandwidth. This differs from "Frontal" solvers which retain some of the element grouping of equations. A skyline solver has a number of distinct stages of solution, being:
1) minimisation and determination of equation profile (bandwidth minimisers)
2) assembly of complete "global" stiffness matrix in a profile form, from all element stiffness matrices, as a set of equations.
3) forward reduction of the "global" stiffness matrix. This used Vec_Sum (dot_Product) and can be up to 99% of total run time.
4) assembly of the load vectors
5) forward reduction of the load vectors.
6) backward substitution of the load vectors.

In a large system of equations, 3) is by far the greatest computation phase.
In a system where the global matrix is stored on disk, 3) does a read and a write, while 2) does a write and 5) and 6) each do a read of the matrix. In an eigenvector subspace iteration of multiple load sets, repeating steps 5) and 6), these reads can become the significant run time. Hence the approach of a 64-bit in-memory solution.

Step 5) is basically solving a large set of simultaneous linear equations, using a direct solution approach. This is what I am trying to get to run faster. I have introduced 64-bit to store the equations in memory and now I want the run time improved, both of which I thought I could achieve when I read the Intel Fortran glossy.

>> Starting up and shutting down parallel regions has some overhead. The loop code x number of iterations has to be sufficiently large enough to absorb the overhead. For a simple DOT product, 100 is too low, 10,000 could be more than necessary to warrant parallization. <<
I find Jim's comments about the loop size a bit of a worry. Is 10,000 really what we are looking for before multiple threads become effective? I did a histogram of the value of N for Vec_Sum for my medium size problem of 154,000 equations, using a log scale for the N range. I was hoping for an N value of say 50 to 100.

Histogram of Vec_Sum N value

N Range       Count            %
1                155,375
2                165,126
3-4              310,705
5-7              463,258
8-12             912,718
13-20          1,311,197
21-33          2,157,660
34-54          3,605,227
55-90          5,579,890       2%
91-148         8,846,732       3%
149-244       14,766,515       5%
245-403       25,187,559       9%
404-665       49,307,473      18%
666-1096      84,614,467      31%
1097-1808     46,562,808      17%
1809-2980     23,469,101       9%
2981-4914      2,380,589       1%
4915-8103      1,722,149
8104-13359       160,658
I previously reported there were very few values below 150 (8.5%), but as the table shows, there are also very few > 8100 (.06%).

I will try to take the OpenMP approach and see what can be done, although the threshold of 10000 is not very promising. Any advice on ifort command ?

My focus of attention has now come down to first showing some improvement in the dot_product calculation, and seeing what I can learn from that. Is someone able to provide a code revision and ifort command that may demonstrate this capability?

John

>>>In a skyline solver, the equation solution acts on the full equation matrix as a set of linear equations, with element grouping of equations basically removed. "N" is the equation bandwidth.

3) forward reduction of the "global" stiffness matrix. This used Vec_Sum (dot_Product) and can be up to 99% of total run time.

I did a histogram of the value of N for Vec_Sum for my medium size problem of 154,000 equations, using a log scale for N range. I was hoping for an N value of say 50 to 100.

Histogram of Vec_Sum N value
N Range       Count
1                155,375
8104-13359       160,658

<<<

Are we talking the same N (loop iteration count for vec_sum and vec_product)?

What this tells me (correct me if I am wrong)

You are using the Skyline solver to build sets (plural) of equation solutions, each set is passed through steps 1-6 with step 3) containing vec_sum and dot_product. Your Histogram indicates this is so. IOW if you only had one set, you would only have had one entry in your histogram with one count.

If my assumption is correct, then consider

parallel do loop for each set of equation solution

1) minimization and determination of equation profile (bandwidth minimizes)

2) assembly of complete "global" stiffness matrix in a profile form, from all element stiffness matrices, as a set of equations.

3) forward reduction of the "global" stiffness matrix. This used Vec_Sum (dot_Product) and can be up to 99% of total run time.

4) assembly of the load vectors

5) forward reduction of the load vectors.

6) backward substitution of the load vectors.

end parallel do loop

>>>In a system where the global matrix is stored on disk, 3) does a read and a write.

???

With 99.04% of vector size <= 8100, how can there be any disk I/O?

So I have to assume we are talking apples and oranges.

If the sets (plural) of equations are on disk (millions of sets), then you will need to notch up your parallel programming skill set and learn about parallel pipelines. I would suggest you begin without parallel pipelines, because the learning curve is less, and in the process you will assure your steps 1-6 are thread safe (and usable in the parallel pipeline).

Start with smaller set sizes such that all sets can be read into memory:

serial section-1
  read all sets into memory as RAW data, do no other work
serial section-2
  identify set demarcations in RAW data
parallel do i=1,nSetsFound
  using set demarcation i
    1) minimization
    2) assembly
    6) backward substitution
end parallel do

Once this is working, the above can be easily reworked into a parallel pipeline. Using parallel pipeline techniques you can make this a continuous operation without regard to the size of the RAW data.

Jim Dempsey

www.quickthreadprogramming.com

According to the data you presented, the vectorization you achieved in the inner loops should be effective. Did you compare the performance with vectorization removed (e.g. -O1)? If it doesn't speed up by a factor of 2 or more with vectorization, we would suspect that your performance is limited by cache ineffectiveness. Your histogram shows insufficient use of loops long enough to benefit from combined vectorization and parallelization in the inner loops, according to consistent advice given by more than one of us.
You should get significant advantage from parallelizing the outer loop of the factorization by OpenMP. Again, as several of us said, you can't expect the auto-parallelizer to do the job, even if you take advantage of the options. In general, in a skyline solver, it will be difficult to balance the work among the threads without losing cache locality, but it will be interesting to try. As you no doubt saw with your search engine, several organizations considered this worthy of a research project.

TimP,

>> In general, in a skyline solver, it will be difficult to balance the work among the threads without losing cache locality

However, under specific situations, a parallel pipeline can maintain cache locality...

Read, read, read, read, read, read,
minimization and determination of equation profile, minimization...
assembly of complete "global" stiffness matrix (for set), assembly...
forward reduction of the "global" stiffness matrix, forward ...
assembly of the load vectors, assembly ...
forward reduction of the load vectors, forward...
backward substitution of the load vectors, backward ...
Write, write, write

The above illustrating an 8-stage parallel pipeline
Stage 4 being John's 3) with vec_sum and dot_product

A Core i7 4-core with HT could do a reasonable job of running this pipeline
A Core i7 6-core with HT would do better (pipeline has 6 computational pipes and 2 I/O pipes)
An 8-core or with or without HT may be slightly better than 6-core w/HT
A 2 x 4 core might be slightly worse if single memory subsystem
A 2 x 4 core could be better if multiple memory subsystems (e.g. Nehalem)

Depending on computational loads, some of the pipes may be broken into multiple pipeline stages (more HW threads)

Jim Dempsey

www.quickthreadprogramming.com

Jim,

To clarify some of your assumptions.

>>> Are we talking the same N (loop iteration count for vec_sum and vec_product)?

What this tells me (correct me if I am wrong)

You are using the Skyline solver to build sets (plural) of equation solutions, each set is passed through steps 1-6 with step 3) containing vec_sum and dot_product.
<<<
There is only one build set of equations.
N is the array size in dot_product calls.

For the problem I am taking statistics from, we are talking of only one set of equations, with one pass through steps 1-6.
The simultaneous equation set is 154,000 equations, with only half the coefficients stored, as the matrix is symmetric. As real*8, the half matrix holds about 280 million non-zero values between the skyline profile and the diagonal. This takes about 2.2gb of memory (280 million x 8 bytes). This is a good example, as for the larger in-memory problem there are significant paging delays.
To do the forward reduction of this matrix (step 3 in post #17), there are 271,679,207 calls to a dot_product routine. The statistics I have quoted in post #17 are a histogram, giving a count of N: the size of vectors in the dot_product call. (I have used a log scale for the bucket sizes, to better identify the frequency of small values of N.) Step 5 uses Vec_Sum calls while step 6 uses Vec_Sub. The profile of the value of N shows a higher average value, but still few > 10,000.
This reduction (step 3) now takes about 4 minutes on my PC, while steps 5-6 take 5 seconds.
By contrast the Lahey and Salford 32-bit versions take about 6 minutes, but these do (buffered) disk I/O for the solution.
There are other solution strategies (non-linear or eigenvector calculations) which have multiple cycles of steps 4-6 (say 100 times) within multiple cycles of steps 2-3+ (say 20 times) for these modified sets of 150,000 simultaneous equations (in steps 2-3).
I've chosen a single pass solution to learn on, as multiple cycle solutions change their iteration flow, depending on the relative elapsed time of step 3 compared to 5 & 6.
{Back in the days when I studied computer science, there was a strong connection to numerical methods/calculations. Most of us were civil or mechanical engineers. Times have changed.}

Back to what I am trying to achieve: based on this definition of N: the array size in the Vec_Sum (dot_product) call, is a size of 10,000 necessary before parallel can be effectively implemented ?

My next step is to take TimP's recommendation and test a /O1 option and see the difference. It may be that the Xeon only supports 2 x real*8 computations in the vectorized approach and I should see if there is a processor that supports 4 x real*8 computations.
Before I started this investigation, I was hoping to gain significantly from a quad-core PC, which implies parallelization. I anticipated this would have been achieved by re-writing the vec_sum code to drive multiple processors, when N > say 50. I was not then aware of the vector instruction set in each CPU.
At present my hope is diminishing.

John

I have now carried out a number of tests, comparing Salford and Lahey 32-bit compilers with the Intel 32-bit and 64-bit compilers, for the 2.2gb matrix size problem.
The 32-bit solutions partition the matrix into 4 blocks of max size 630mb, while the 64-bit solution is a single block. All runs were done on an XP-64os with Xeon processor and 6gb of memory.

The performance times were:

Compile      St3_cpu  St3_IO  St56_cpu  St56_IO  Description
Lahey_32      359.75    1.02     11.16     0.03  Lahey 32-bit compiler
Salford_32    364.03    0.74      9.17     0.04  Salford 32-bit compiler
Intel_32      231.77    0.64      5.86     0.03  Intel 32-bit /O2 /QxHost
Intel_O1      358.06    0.62      7.30     0.00  Intel 64-bit /O1 /QxHost
Intel_O2      233.17    0.10      5.02     0.00  Intel 64-bit /O2 /QxHost
Intel_O3      234.66    0.45      5.06     0.09  Intel 64-bit /O3 /QxHost

The 4 times quoted for each test are:
1 CPU time for the step 3 matrix reduction, using Vec_Sum
2 Other time (elapsed - CPU) for step 3
3 CPU time for the steps 5 & 6 : Load case solution, using Vec_Sum then Vec_Sub
4 Other time for steps 5,6
CPU time was taken from "call Cpu_Time", while elapsed time was taken from "call System_Clock".
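(For reference, a timing wrapper along these lines gives both values; this is a sketch only, with illustrative variable names:)

      real*8    t0, t1
      integer*8 c0, c1, crate
!
      call cpu_time (t0)
      call system_clock (c0, crate)
!     ... step 3 matrix reduction ...
      call cpu_time (t1)
      call system_clock (c1)
      write (*,*) 'CPU time     =', t1 - t0
      write (*,*) 'Elapsed time =', dble(c1-c0) / dble(crate)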

These times show there is a significant improvement from /O1 to /O2, which I think indicates I am getting a vectorizing improvement. Are there differences between processors in the size of the vector for vectorised calculations? Would different processors (newer Xeon or Core i5/i7) produce a significantly better solution time?
Also of note is the lack of elapsed time penalty for the out-of-memory storage in the 32-bit cases. This is due to the significant disk buffering in these cases from the 6gb of physical memory. This is especially the case in steps 5-6 where there are 3.6gb of virtual disk reads.
I also did a problem with a matrix size of 5.5gb, with unfortunately only 6gb of physical memory. This gave very interesting results, and all cases suffered from either less disk buffering efficiency or significant paging delays for the in-memory 64-bit solution. The in-memory results were better and my next test will use more memory.
These results show the performance of 32-bit programs that use moderate disk I/O appear to perform favourably in comparison to 64-bit versions, but also there is not a significant penalty in going to 64-bit addressing of larger arrays.

The compiler options I have used for the Vec_Sum (Dot_Product) to attempt both vectorisation and parallelisation have not produced the automatic result I had hoped for.

ifort /c /source:vec_new.f95 /free /O3 /QxHost /Qparallel /Qvec-report /Qpar-report /Qdiag-file:vec_new.lis

Is someone able to provide an example or assist with a compatible change to the Vec_Sum code and compiler options (which include reporting of the parallelization achieved), so I can test different loop sizes for a parallel implementation? Dot_Product looks like a relatively simple example of what I am trying to achieve; however, achieving it is not as simple.

John

As I have not been able to implement a parallelized version of my program, I have tried the MKL BLAS routine "ddot", which claims to support multi-threading.

The new Vec_Sum code is now:

      REAL*8 FUNCTION VEC_SUM (A, B, N)
!
!   Performs a vector dot product  VEC_SUM = [A] . [B]
!     account is NOT taken of the zero terms in the vectors
!     c = dot_product ( a, b )
!
      integer*4,               intent (in) :: n
      real*8,    dimension(n), intent (in) :: a
      real*8,    dimension(n), intent (in) :: b
!
      real*8    c
      real*8    ddot
      external  ddot
!
      c = ddot(n, a, 1, b, 1)
      vec_sum = c
      RETURN
!
      END

I linked with :
ifort *.obj ..\lib\*.obj mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib /Feprog.exe /map:prog.map

Then ran the new prog.exe
Windows task manager now shows the .exe running with up to 100% cpu, which is very promising.

My run time reporting has a few problems, as for the Step 3 matrix reduction I get

CPU = 380.78 sec and I/O = -138.74, or a total of 242.04, compared to 233.3 for the vectorised version (/O2) or 358 sec for the non-vectorised version (/O1).

Total elapsed time for the run is now 277.4 sec compared with 267.4 seconds for /O2.

While it was good to see the cpu % climb above 50%, the results are not promising.
If anyone has some advice as to a better approach, possibly which better tunes ddot, based on the value of N, I'd appreciate your suggestions.

John

>> N is the array size in dot_product calls.
>>To do the forward reduction of this matrix (step 3 in post #17), there are 271,679,207 calls to a dot_product routine

This indicates that you should concentrate on moving your parallelization efforts to the caller of the dot_product and not focus on parallelization inside the dot_product. IOW:

!$OMP PARALLEL DO
DO I=1, 271679207
   CALL SHELL_FOR_DOT_PRODUCT(I)
END DO
!$OMP END PARALLEL DO

SUBROUTINE SHELL_FOR_DOT_PRODUCT(I)
   INTEGER :: I
   
   ! DETERMINE VECTORS AND RESULT LOCATION
   
   ! THEN CALL __serial__ version of DOT_PRODUCT
   CALL DOT_PRODUCT(VECTOR_A, VECTOR_B, N, RESULT_C) 
   
END SUBROUTINE SHELL_FOR_DOT_PRODUCT

SUBROUTINE DOT_PRODUCT(VECTOR_A, VECTOR_B,N, RESULT_C)
  INTEGER :: N
  REAL :: VECTOR_A(N), VECTOR_B(N), RESULT_C
  INTEGER :: I
  RESULT_C = 0.0
  DO I=1,N
     RESULT_C = RESULT_C + VECTOR_A(I) * VECTOR_B(I)
  END DO
END SUBROUTINE DOT_PRODUCT

as opposed to

DO I=1, 271679207
   CALL SHELL_FOR_DOT_PRODUCT(I)
END DO

SUBROUTINE SHELL_FOR_DOT_PRODUCT(I)
   INTEGER :: I
   
   ! DETERMINE VECTORS AND RESULT LOCATION
   
   ! THEN CALL DOT_PRODUCT
   CALL DOT_PRODUCT(VECTOR_A, VECTOR_B, N, RESULT_C)
   
END SUBROUTINE SHELL_FOR_DOT_PRODUCT
...
SUBROUTINE DOT_PRODUCT(VECTOR_A, VECTOR_B,N, RESULT_C)
  INTEGER :: N
  REAL :: VECTOR_A(N), VECTOR_B(N), RESULT_C
  INTEGER :: I
  RESULT_C = 0.0
  !$OMP PARALLEL DO REDUCTION(+:RESULT_C)
  DO I=1,N
     RESULT_C = RESULT_C + VECTOR_A(I) * VECTOR_B(I)
  END DO
  !$OMP END PARALLEL DO
END SUBROUTINE DOT_PRODUCT

In the former case (recommended) you enter and exit the parallel region only once
In the latter case (not recommended) you enter and exit the parallel region 271679207 times

Note, if it is more convenient, the parallel loop partitioning the 271,679,207 individual dot_products could be replaced with a parallel loop over the 154,000 equation sets.

*** relying on MKL to parallelize the inner dot_products induces a 271,679,207 fold increase in the number of times you enter and exit a parallel region.

Jim Dempsey

www.quickthreadprogramming.com

Jim,

Thanks for your comments. To address your point of going back one loop, to reduce the thread initiation from 270 million elements of the matrix down to 150 thousand equations/columns, I will present an abbreviated description of the forward reduction of a set of linear equations, using the Crout Skyline approach:

In its simplest form, the Crout Skyline approach is
do i = 1,neq
  ! reduce column i by column j
  do j = 1,i-1
    a(i,j+1) = a(i,j+1) - dot_product ( a(i,1:j), a(j,1:j) )
  end do
  ! reduce column i by itself
  ...
end do

This basically shows that each element of the matrix is subject to a dot_product modification. dot_product is a PURE function.

To step back to the J loop, the column reduction: each element of the column a(i,:) is sequentially modified by the previous columns. Through the loop the columns 1:i-1 do not vary, but the values of column i do. This is not a PURE loop. As such, can it still be parallelized?

To include the column storage of the matrix, the main J loop becomes:
locate column_i
identify ti : which is the start of column i
do j = ti, i-1
locate column_j
identify t : which is the lower start of both columns i:ti and j:tj
column_i(j+1) = column_i(j+1) - dot_product ( column_i(t:j), column_j(t:j) )
end do

My limited understanding of parallel calculations is that non-PURE loops are not suitable.

I have also done some tests, varying the minimum dot_product loop size "N" for use of "ddot".
For my PC, the best performance I got was limiting the size of N to greater than 1600, with a total elapsed time of 259.2 seconds, compared to 261.8 with only vectorisation ( N > 99999). Not an effective solution.
Using ddot does not show me how to write parallelized code.

John

John,

Set aside parallel programming for the moment and let's look at your serial programming.

>>
In its simplest form, the Crout Skyline approach is
do i = 1,neq
  ! reduce column i by column j
  do j = 1,i-1
    a(i,j+1) = a(i,j+1) - dot_product ( a(i,1:j), a(j,1:j) )
  end do
  ! reduce column i by itself
  ...
end do

This basically shows that each element of the matrix is subject to a dot_product modification. dot_product is a PURE function.

To step back to the J loop, the column reduction: each element of the column a(i,:) is sequentially modified by the previous columns. Through the loop the columns 1:i-1 do not vary but the values of column i do
<<

Then why are you doing all these redundant dot_products?

Try something along the line of

do i = 1,neq
  prior_dot_product(i) = 0.0
end do
do i = 1,neq
  ! reduce column i by column j
  do j = 1,i-1
    current_dot_product = dot_product ( a(i,j:j), a(j,j:j) ) ! or a(i,j:j) * a(j,j:j)
    a(i,j+1) = a(i,j+1) - current_dot_product - prior_dot_product(j)
    prior_dot_product(j) = current_dot_product + prior_dot_product(j) ! j not i
  end do
  ! reduce column i by itself
  ...
end do

*** test the code, the above is untested***

The basic concept is: eliminate redundant work first in your serial code, then look at parallelization.

dot products are accumulative

dot(all sections) == dot(first part) + dot(remaining part)

When dot(first part) does not change (other than for the next step containing the incremental accumulation of the prior step), then the dot(first part) for the next step is redundant work (as you already know it from the prior step).

========================

A second problem with your current code is it is formulated something like this

A(iOuterLoop,jInnerLoop+1) = DOT(A(iOuterLoop,jRangeInnerLoop), A(jInnerLoop,jRangeInnerLoop))

IOW the data "stream" is propagating from the rightmost index.

In FORTRAN, adjacent data is held in the leftmost index (progressing in larger strides to the rightmost index).

Therefore, if you were to take a little up front time to convert A to Atranspose

do i=1,neq
do j=1,neq
Atranspose(i,j) = A(j,i)
end do
end do

Then perform your original code above, substituting Atranspose for A and transposing the indexes; you would then have better cache utilization.

*** remember to re-transpose Atranspose back to A afterwards ***

Transposing A will improve vectorization as well as cache utilization.
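As a sketch of the idea (using the index names from the earlier pseudocode), the inner dot product then runs down contiguous memory in the leftmost index:

Atranspose(j+1,i) = Atranspose(j+1,i) - dot_product ( Atranspose(1:j,i), Atranspose(1:j,j) )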

(sorry I did not comment on your second loop)

Jim Dempsey

www.quickthreadprogramming.com

Jim,

Thanks very much for your comments.
You stated "Then why are you doing all these redundant dot_products?"

I need to think about this, as my initial response is No, they are not redundant.

Loop J-1 is dot_product (column_i, column_j-1)
loop J is dot_product (column_i, column_j)
I don't see any redundant calculations as these 2 dot_products are significantly different.

However, except for the last value a(i,j-1) of loop J changing, both dot_products could be calculated at the same time, as a dual calculation.

It is important to note for the skyline solver, the dot_product size is dependent on the profile height of the 2 columns j and j-1 so there is some variability in the dot_product sizes.

I'll review your comments some more and reply again.

John

John, in reexamining the suggestion for saving prior dot products,
I no longer think that there are appropriate products to be saved.
Disregard the prior post. (sorry for the red herring)
Combining the loops may have benefits.



 
Jim Dempsey
www.quickthreadprogramming.com
