# Using /Qparallel with no effects

Hi!

I've set the compiler option Parallelization to "/Qparallel", but it has no effect. I've tried it on this code, which is a DLL called by a .NET application:

```
subroutine LRZERLEGUNG(AMatrix, DimensionN, AbsolutB, ResultX)
!DEC$ ATTRIBUTES DLLEXPORT::LRZERLEGUNG

implicit none
! Variables
real :: T1, T2

integer, intent(in) :: DimensionN
double precision, dimension(DimensionN,DimensionN) :: AMatrix
double precision, dimension(DimensionN) :: AbsolutB
double precision, intent(out), dimension(DimensionN) :: ResultX
double precision, dimension(DimensionN) :: y

integer :: i,j,k

! Body of lrZerlegung
CALL CPU_TIME(T1)
! Compute L (Ax = LRx = b)
!dir$ parallel
do i = 1,DimensionN-1
  do j = i+1,DimensionN
    ! Determine the i-th column of L
    AMatrix(j,i) = AMatrix(j,i)/AMatrix(i,i)
    do k = i+1,DimensionN
      ! Update the j-th row
      AMatrix(j,k) = AMatrix(j,k) - AMatrix(j,i)*AMatrix(i,k)
    end do
  end do
end do

! Forward substitution
!dir$ parallel
!dir$ loop count min(4)
do j = 1,DimensionN
  y(j) = AbsolutB(j)
  do k = 1,j-1
    y(j) = y(j) - AMatrix(j,k)*y(k)
  end do
end do

! Back substitution
!dir$ parallel
!dir$ loop count min(4)
do j = DimensionN,1,-1
  ResultX(j) = y(j)
  do k = j+1,DimensionN
    ResultX(j) = ResultX(j) - AMatrix(j,k)*ResultX(k)
  end do
  ResultX(j) = ResultX(j)/AMatrix(j,j)
end do
CALL CPU_TIME(T2)
write( *, * ) T2-T1
end subroutine LRZERLEGUNG
```

Best regards, Marc

Hello mfangmeyer,

In the example you gave us, the loop count is not known, since we don't know the value of "DimensionN". If this value is too small, say fewer than 100 iterations (just as an example), you won't see any speed-up, since there is always some threading overhead. There is a compiler switch, /Qpar-threshold[:n], that controls when the compiler considers a loop worth parallelizing. So, if you are seeing no benefit, perhaps you need to increase the number of loop iterations.
This advice is unrelated to your question, but looking at your code I noticed that you use "double precision" to declare some variables. It should be avoided since it is compiler dependent. You should instead use real(KIND=8) or just real(8), assuming that your compiler treats single precision as real(4) (which is almost certain). I hope this helps.

Pedro

Hello Pedro,

The loop count is at least 1000. Task Manager shows that CPU usage for the process peaks at 50%. I have a dual-core CPU.

OK, in future I will use real(8) instead of double precision declarations.

Marc

mfangmeyer,

I have looked more carefully at your code and found places where there are data dependencies in your loops, which prevent auto-parallelization. The compiler can detect these situations automatically. In Visual Studio, open your project properties, then go to Fortran > Diagnostics > Optimization Diagnostics > Optimization Diagnostics Level and select medium (which should be enough). Recompile your code and you will see the diagnostic messages. Hope this helps.

EDIT: I forgot to say that you could also use Guided Auto-Parallelism, which emits advice. To use it, right-click your source file, then choose Intel Visual Fortran Composer XE 2011 > Guided-Auto-Parallelism > Run Analysis. Give it a try. This is a very useful feature.

Pedro

Quoting psantos
...
This advice is unrelated to your question, but looking at your code I noticed that you use "double precision" to declare some variables. It should be avoided since it is compiler dependent. You should instead use real(KIND=8) or just real(8), assuming that your compiler treats single precision as real(4) (which is almost certain).

It is really the other way round: using a number (a literal integer constant) for a kind is more compiler dependent than using DOUBLE PRECISION. There's no guarantee that a kind of 8 even exists on other compilers, let alone that 8 means double precision (though for a number of compilers, perhaps the majority, that is exactly what a real kind of 8 means). But all Fortran compilers must support DOUBLE PRECISION, and it must be more precise than default REAL (as of F2008 there are further requirements on it, which mean that it will be at least as good as what everyone typically regards as "double precision").

If you wanted to save some typing, you could define an integer parameter that represents the kind of real that you typically want to use - something like:

```
MODULE MyKinds
  ...
  INTEGER, PARAMETER, PUBLIC :: rk = KIND(1.0D0)
  ...
END MODULE MyKinds

USE MyKinds
...
REAL(rk), dimension(DimensionN,DimensionN) :: AMatrix
REAL(rk), dimension(DimensionN) :: AbsolutB
REAL(rk), intent(out), dimension(DimensionN) :: ResultX
REAL(rk), dimension(DimensionN) :: y
```

This is portable across compilers and, if at some stage in the future you want to change the kind in use, you only need to edit the expression in the parameter declaration in one place.

(Edit to change the parameter name from dp to rk to avoid confusion if the expression that defines the parameter did get changed in future).

Hello IanH,

I agree with you about putting the kind value in a separate module.
When we use terms such as single or double precision, we don't really know what precision is being used, since there are compiler switches that change the behaviour of these keywords. So perhaps the best approach is to use the intrinsic SELECTED_REAL_KIND(), which returns the kind number needed to achieve a specified precision and range. That way the compiler selects the kind automatically based on the programmer's request, which ensures portability. So I would change your code to:

```
MODULE MyKinds
  ...
  INTEGER, PARAMETER, PUBLIC :: rk = SELECTED_REAL_KIND(p=13)
  ...
END MODULE MyKinds

USE MyKinds
...
REAL(rk), dimension(DimensionN,DimensionN) :: AMatrix
REAL(rk), dimension(DimensionN) :: AbsolutB
REAL(rk), intent(out), dimension(DimensionN) :: ResultX
REAL(rk), dimension(DimensionN) :: y
```

Note that I have chosen 13 decimal digits for precision, which will return a KIND=8 under Intel Fortran.

Pedro

"I have looked more carefully to your code, and I have detected some situation where there is data dependency on your loops, which prevents auto-parallelization to occur. The compiler can automatically detect these situations. Under Visual Studio open your project properties, then FORTRAN>Diagnostics>Optimization Diagnostics>Optimization Diagnostics Level. Then select medium (should be enough). Recompile your code and you will see the diagnostics messages. Hope this helps you."

Indeed, there is not enough algorithmic independence in my code. By the way, it is an LU decomposition for solving linear equations.
So I tried a matrix multiplication with matmul(). Same behaviour as before: CPU usage peaks at 50%. Yet matrix multiplication is ideally suited to parallelization!

Additionally, I've set the compiler option "Use Intel Math Kernel Library" to Parallel (/Qmkl:parallel).

Can anybody give me a suitable code example?

Hello mfangmeyer,

I didn't understand what you meant by "Additionally, I've set the compiler option "Use Intel Math Kernel Library" to Parallel (/Qmkl:parallel)." If you are not calling MKL, why include it? It will not make any difference.

If this is an LU decomposition, perhaps you should consider using "getrf" from MKL. See the MKL manual for more details. MKL is highly optimized and will certainly give you better performance.
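As a rough illustration of the "getrf" suggestion, here is a minimal sketch that factors and solves a small system with the standard LAPACK routines `dgetrf` and `dgetrs` (which MKL also provides). The 3x3 test matrix, the program name and the tolerance are made up for the example. It assumes the program is linked against a LAPACK implementation, for instance with /Qmkl on Intel Fortran. Note that `dgetrf` uses partial pivoting, unlike the hand-written loops in the original post.

```fortran
program lu_with_lapack
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 3
  real(dp) :: a(n,n), b(n), x_expected(n)
  integer  :: ipiv(n), info
  external :: dgetrf, dgetrs

  ! Hypothetical test system with a known solution x = (1, 2, 3).
  ! The matrix is symmetric, so the fill order of reshape does not matter.
  a = reshape([4.0_dp, 2.0_dp, 1.0_dp, &
               2.0_dp, 5.0_dp, 3.0_dp, &
               1.0_dp, 3.0_dp, 6.0_dp], [n, n])
  x_expected = [1.0_dp, 2.0_dp, 3.0_dp]
  b = matmul(a, x_expected)

  ! Factor A = P*L*U in place; the pivot indices are returned in ipiv.
  call dgetrf(n, n, a, n, ipiv, info)
  if (info /= 0) stop 'dgetrf failed'

  ! Solve A*x = b using the factors; b is overwritten with the solution.
  call dgetrs('N', n, 1, a, n, ipiv, b, n, info)
  if (info /= 0) stop 'dgetrs failed'

  print '(a, 3f10.6)', 'x = ', b
  if (maxval(abs(b - x_expected)) > 1.0e-10_dp) stop 'unexpected solution'
end program lu_with_lapack
```

Whether the library call actually runs multi-threaded depends on which LAPACK build is linked, so this is only a starting point for experiments.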

Pedro

OK, I just want to test (auto) parallelization. Can anybody give me a code example to learn more about it? My examples seem to be unsuitable.

I'm with you on this. I tried to get similar help 2 weeks ago and was not able to get a simple example of using /Qparallel. I was hoping to learn from this and better understand how to use this feature.

What we need is a simple case study of code and compiler options that demonstrates parallelization working. A few worked examples of different approaches to the same code could help us understand how it works best.

The code example I tried to use was the dot product loop, which is the inner loop of an LU Crout decomposition.

Steve Lionel suggested in an earlier post: You can also try /Qparallel and look at the new "Guided Auto Parallelization" feature to help you get the most out of it.
Unfortunately I could not find this feature (?)

I should add that I was given a lot of very good advice, but for a new user of Intel Visual Fortran and /Qparallel, a simple introductory case study would be more helpful to get us started.

John

Of course, REAL (KIND = 8) is also compiler-dependent. DOUBLE PRECISION is easier to work out the intent.

You really want REAL (SELECTED_REAL_KIND(15, 307)), or, better, INTEGER, PARAMETER:: dp = SELECTED_REAL_KIND(15, 307); REAL (dp):: MyValue,

remembering that this will fail should the requirements not be possible.

Hi John Campbell! Good to know you "on my side"... :-)

Hello bendel boy,

when I used SELECTED_REAL_KIND I only specified the number of decimal digits I want. It is perfectly valid, since both arguments are optional (but you have to specify at least one). So, I really want what I have written and nothing more. Just for reference, in the F2008 standard, a new argument was introduced: the radix.

Pedro

If you want an example of what the compiler should do when left to itself: Running on a Core 2-duo workstation, this simple Dot-product routine, compiled using IVF 11.1.067 without /Qparallel and with no OpenMP directives, runs as a console program in Release configuration at approx 60% on both cores, according to Task Manager, and takes about 8.3 seconds to do 1,000,000 iterations of it.

With /Qparallel selected, it runs at 100% on both cores and takes on average about 5 seconds. 100*(5/8.3) = 60%, so it is consistent.

Note that to use the more accurate timer function OMP_GET_WTIME(), you must have /Qopenmp selected (even though there are no OpenMP directives) in order for the library containing the function to be linked in.

```
program timedotproduct
implicit none
INTEGER, PARAMETER::N=10000
REAL(8) A(N), B(N),Y
REAL(8) T1, T2, T3,OMP_GET_WTIME
REAL(4) TDOTPROD,TGENERATELOOP
INTEGER(4) I,J, JMAX,K1
TDOTPROD=0.0D0
T1=OMP_GET_WTIME()
do i=1,N
  A(I)=dble(I)
  B(I)=2.0D0*dble(I)
enddo
T2=OMP_GET_WTIME()
TGENERATELOOP=T2-T1
PRINT *,"TGENERATELOOP =",TGENERATELOOP
!DOTPROD returns the dot product of arrays A and B up
! to the Kth element. The dot product is returned in Y
!The total should be 2N(N+1)(2N+1)/6
K1=N
JMAX=1000000
CALL CPU_TIME(T3)
DO J=1,JMAX
  CALL DOTPROD(A,B,N,K1,Y)
end do
T3=OMP_GET_WTIME()
TDOTPROD=T3-T2
PRINT *,"JMAX= ",JMAX,", dotprod = ",Y,", TDOTPROD =",TDOTPROD
PAUSE
end program timedotproduct

SUBROUTINE DOTPROD(A,B,N,K,SUM)
! Simplest dot-product code to compute the dot product up to the kth element
INTEGER(4) N,K
REAL(8) A(N),B(N), SUM
INTEGER(4) I
SUM=0.0D+0
DO I=1,K
  SUM=SUM+A(I)*B(I)
END DO
RETURN
END
```

Quoting John Campbell

Steve Lionel suggested in an earlier post: You can also try /Qparallel and look at the new "Guided Auto Parallelization" feature to help you get the most out of it.
Unfortunately I could not find this feature (?)

I think we've spent a lot of time here giving advice which has been ignored.

The GAP options are written up in the HTML docs. The command-line spelling was changed to "guide" some time after the "GAP" terminology became widespread, as there are both auto-parallel (-guide-par) and auto-vector (-guide-vec) options, in case you want the categories separated. They write suggestions, e.g. about directives, at compile time. It's worthwhile to put in loop count directives before generating GAP advice if loop counts are significantly different from the default assumptions (although GAP may advise you to do that if you haven't).
GAP is heavy on advice to use IVDEP directives even when there are superior alternatives.

OK, that's it! It works fine. But I must say it is a trivial example. Such loops are easy to parallelize: just give each thread N/c iterations, where c is the number of threads or cores.

What about nested loops? What happens when there are (minor) data dependencies? Does automatic parallelization do its job only in such simple cases?

Furthermore, I want to set the maximum number of threads with "export OMP_NUM_THREADS=value". I don't know where to set it. In my code?

What is the difference between /Qopenmp and /Qparallel?

Many questions... Thanks for help!

/Qparallel is the "auto-parallel" option. The compiler looks at loops and decides for itself whether it can parallelize a loop. /Qopenmp says that you will be using OpenMP to parallelize your program - this requires you to add OpenMP directives naming specific loops to parallelize and providing information about variables. You can get better results with OpenMP, but it is more work on your part.

/Qparallel with the new guided-auto-parallelism (GAP) feature helps you get better results out of auto-parallelism without requiring the more extensive changes of OpenMP.
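To make the OpenMP side of that comparison concrete, here is a minimal sketch of the dot-product loop from earlier in the thread with an explicit OpenMP directive. The `reduction(+:total)` clause is the kind of "information about variables" that OpenMP asks of you. The program and variable names are made up for the example, and it assumes compilation with OpenMP enabled (/Qopenmp with Intel Fortran). The thread count can be limited from code with the standard `omp_set_num_threads` routine, as shown, or by setting the OMP_NUM_THREADS environment variable in the shell before starting the program (on Windows, `set OMP_NUM_THREADS=2` rather than `export`).

```fortran
program omp_dot
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 10000
  real(dp) :: a(n), b(n), total, expected
  integer :: i

  do i = 1, n
    a(i) = real(i, dp)
    b(i) = 2.0_dp * real(i, dp)
  end do

  ! Optional: limit the thread count from code instead of OMP_NUM_THREADS.
  call omp_set_num_threads(2)

  total = 0.0_dp
  ! Each thread accumulates a private partial sum; OpenMP adds them up at the end.
  !$omp parallel do reduction(+:total)
  do i = 1, n
    total = total + a(i) * b(i)
  end do
  !$omp end parallel do

  ! Closed form used for checking: 2*sum(i**2) = 2*n*(n+1)*(2n+1)/6
  expected = 2.0_dp * real(n, dp) * real(n + 1, dp) * real(2*n + 1, dp) / 6.0_dp

  print *, 'max threads =', omp_get_max_threads()
  print *, 'dot product =', total, ' expected =', expected
  if (abs(total - expected) > 1.0e-6_dp * expected) stop 'mismatch'
end program omp_dot
```

With /Qparallel alone, the directive is ignored and the compiler decides for itself whether the loop is worth threading.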

Retired 12/31/2016

To respond to timintel's comment: I have not ignored the advice, but as a new user of the Intel compiler, I have found some of the advice difficult to understand.
What are the html docs? A file name would help.

Thank you anthonyrichards for your example.
I have utilised this to generate a modified program which does achieve parallel performance.
I have grouped the main loop in a routine "test_loop" and provided reporting of performance in "report_time"
I have replaced the elapsed time routine with System_Clock and also introduced a processor time via cpu_time. These avoid the use of /Qopenmp, which could cause confusion as to which kind of parallelization is being used.

My compilation command is : ifort test_dot_yes.f90 /Qparallel /Qpar-report

There are 3 calls to Test_Loop, all of which report "LOOP WAS AUTO-PARALLELIZED."
Is this report on a subroutine call and not on the do loop?
If I put the test_loop call into a do loop, then test_loop is no longer auto-parallelized. It's a fickle option!
Importantly, why has it been stopped? I anticipated this would not be a big change to the program structure.
What are the criteria for Test_Loop to be auto-parallelized?
I'm a bit worried by this, as the call to DOTPROD returns the same value JMAX times. I'm not sure what /Qparallel is achieving.
In the case of my LU decomposition, where the J loop was changed to give a different value on each pass and the values depend on the previous J iteration, would we still have achieved auto-parallelization? Perhaps accumulating the Y error in the JMAX loop may be more effective.

If anyone wants to repeat these tests, I have attached 3 files.

test_dot_yes.f90 which does perform parallelization
test_dot_no.f90 which does not perform parallelization
test_dot.log which records the run time of the two alternatives.

It is my aim to better understand what can be achieved with /Qparallel, before contemplating /Qopenmp.

With regard to use of real(8), could I point out that real*8 is more portable and not excluded by the 95/03 standard.

John

Marc,

Reviewing your original post, you should note the difference between CPU time (via call cpu_time) and elapsed time (via call system_clock). With parallelization, the CPU time actually increases, due to the thread initiation overhead, while the elapsed time hopefully decreases.
You also need to weigh the advantage of increased processor utilisation against the increased contention with other background processes that are running. My PC also runs multiple svchost.exe instances and a virus scanner, and at times it appears as if that is all my PC does.
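The CPU-time versus elapsed-time distinction can be checked with a small test like the following sketch, which times the same loop with both `cpu_time` and `system_clock`. The loop body, array size and names are arbitrary choices for the example. In a serial build the two numbers should be close; if the loop gets threaded, one would expect the CPU time (summed over threads) to stay the same or grow while the elapsed time shrinks.

```fortran
program timing_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 2000000
  real(dp), allocatable :: x(:)
  real(dp) :: s
  real :: cpu0, cpu1
  integer(i8) :: count0, count1, count_rate
  integer :: i

  allocate(x(n))

  call cpu_time(cpu0)                     ! processor time used by the program
  call system_clock(count0, count_rate)   ! wall-clock ticks and ticks per second

  s = 0.0_dp
  do i = 1, n
    x(i) = sqrt(real(i, dp))
    s = s + x(i)
  end do

  call system_clock(count1)
  call cpu_time(cpu1)

  ! Printing the checksum keeps the compiler from discarding the loop.
  print *, 'checksum     =', s
  print *, 'CPU time (s) =', cpu1 - cpu0
  print *, 'Elapsed (s)  =', real(count1 - count0, dp) / real(count_rate, dp)
end program timing_demo
```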

John

Quoting John Campbell
With regard to use of real(8), could I point out that real*8 is more portable and not excluded by the 95/03 standard.

Is that a typo? The 8 might be processor specific, but specifying it via "REAL*8" is definitely not standard Fortran. Again, a number of compilers (probably even the majority) support it as an extension, but it's a bit of a stretch to call any sort of extension "more portable" than the standard syntax. From F2003:

R501: type-declaration-stmt is declaration-type-spec [ [, attr-spec ] ... :: ] entity-decl-list

R502: declaration-type-spec is intrinsic-type-spec
...

R403: intrinsic-type-spec is ...
REAL [ kind-selector ]

R404: kind-selector is ( [ KIND= ] scalar-int-initialization-expr )

So having a * after REAL (or INTEGER or LOGICAL) in the context of a type declaration is excluded by the syntax rules. If you apply the appropriate standards checking switches to ifort it will give you an almost appropriate whack around the ears for it too (the compiler calls it a length specification, which is not quite right...).

Not that this has the slightest thing to do with parallelisation...

Back on that topic - when I compile test_dot_no.f90 with "Fortran > Diagnostics > Guided Auto Parallelism Analysis" set to Extreme (/Qguide:4), it tells me (amongst some other things) that I should "Insert a "!dir$ loop count min(16)" statement right before the loop at line 96 to parallelize the loop. [VERIFY] Make sure that the loop has a minimum of 16 iterations". When I do that (and then remember to turn the GAP thing off...) I get pretty similar runtimes. I presume this means that the additional loop in the main program has sufficiently obfuscated the range of loop counts that will be used for the loop in test_loop.

(Note that apart from the odd !$OMP thing in bleedingly obvious places I don't play on the parallel swings very often - and the dependency tracking/constant folding/loop unrolling available in my head compiler is well and truly exceeded here...).

IanH,

The use of REAL*8 is neither a deleted nor an obsolete feature in the 1990, 1995 or 2003 Fortran standard. As such it is standard Fortran.

I know of no 95 or 03 compiler that does not support this syntax, and as such it is more portable than REAL(8). It is a concise and clearly understood definition of precision.

Having worked with code from many pre-90 compilers, I know that when you encountered the declaration "REAL A", you had little idea what precision was required. Since the introduction of KIND, it is still much better to read REAL*8 than to have to look for a KIND parameter value, which is often hidden in another file that is difficult to find or not supplied. What would you expect the declaration "REAL(rf) A" to mean? The intrinsic SELECTED_REAL_KIND implies a flexibility of precision that is not available, with typically only 2 or 3 possible successful outcomes.

Back on topic: is Guided Auto Parallelism Analysis available in Version 11?

The do loops presented in the example above are very simple. I certainly need to understand what complexity can be accommodated by /Qparallel.

John

Quoting John Campbell

The use of REAL*8 is neither a deleted nor an obsolete feature in the 1990, 1995 or 2003 Fortran standard. As such it is standard Fortran.

That's an interesting viewpoint; apparently, the fact that REAL*8 was never in the standard, and so was never explicitly obsoleted, makes it standard, even though it was used for many years to hinder porting applications to machines other than those they were written on. I guess I'll have to agree that nearly all f95 and later compilers were tied to architectures which supported an 8 byte floating point data type.

GAP is a highly advertised new feature of the "12.0" xe 2011 compilers. The loop count directive is an idea resurrected from compiler versions prior to 11.0.

Why does this example not run parallel?

```
program matrix
implicit none
real(8),allocatable:: a(:,:), b(:,:),c(:,:)
integer:: N = 200, i
real:: t0, t1

allocate( a(N,N), b(N,N), c(N,N) )
call random_seed()
call random_number(a)
call random_number(b)

call cpu_time(t0)

!dir$ parallel always
do i = 1, 500
  c = matmul(a,b)
end do
call cpu_time(t1)
write(*,'(a, f10.3)') ' Time for matmul        = ', (t1 - t0) / 5
pause
end program matrix
```

The "12.0" compilers have a specific option for parallelized matmul (by library call):
-Qopt-matmul
Did you set that option?
If you're trying to find out how efficiently the threads are used, you will want to measure elapsed time as well as the total CPU time of all threads added up.

The documentation says that /Qopt-matmul is enabled by default when /Qparallel is specified. Diagnostic information also shows matmul being parallelized, and I see that both cores are used 100%. Without /Qparallel, both cores' usage is about 60%.

To get the correct comparison I used the openmp timer. CPU_Time does not reflect correct values.

Most likely my test below (which is little bit different in that I use random numbers in each loop) does not have large enough matrix sizes but I don't see any significant performance improvement when using /Qopt-matmul.

Abhi

---

```
Program Test_Qmatmul

Implicit None

Real(8), Allocatable:: a(:,:), b(:,:), c(:,:)
Integer :: N = 200, M = 500
Integer :: i
Real(8) :: ts, te, OMP_GET_WTIME

Allocate( a(N,N), b(N,N), c(N,N) )
Call random_seed()

ts = OMP_GET_WTIME()
!Call CPU_Time(ts)
do i = 1, M
  Call random_number(a)
  Call random_number(b)
  c = matmul(a,b)
end do
te = OMP_GET_WTIME()
!Call CPU_Time(te)

Write(*,'(A, F0.3)') 'Time for matmul = ', (te - ts)

End Program Test_Qmatmul
```

TimP,

From the 2003 Standard document linked in this forum, "Section 1.6.3 FORTRAN 77 compatibility" implies that REAL*8 is in the standard. I have converted many programs between compilers, including between 8-byte and non 8-byte architectures. Much public domain numerical software was developed on Control Data in 70's and 80's. With a program of unknown origin, it can be very difficult to know the required precision. KIND did not solve the problem, as often the KIND module listing is not included with the code listing. Have a look through some of the code examples in this forum.

John

The Intel docs do, unfortunately, sometimes refer to f77 or f90 compatibility, meaning compatibility with extensions present in a predecessor compiler, not in Fortran standard. Once or twice I asked for corrections and got turned down.
Posts on this forum often refer to legacy code containing many extensions as "f77" perhaps because the f77 standard didn't require a way of diagnosing extensions, as f90 and later standards do.
I've been around long enough to have used f77 compilers which didn't support the REAL*8 extension, including those for Honeywell 36-bit and Harris 24-bit platforms, as well as CDC compilers which certainly didn't accept REAL*8. You might argue that when we old-timers argue about such widely accepted extensions, we're pining in vain for the return of some of the old architectures.

The *n syntax for REAL, INTEGER, LOGICAL and COMPLEX, has never ever been part of the Fortran standard. It has always been an extension. I see nothing in section 1.6.3 of F2003 that relates to this - which text are you referring to?

Retired 12/31/2016

You may be right!

Unfortunately I can't find my hard copy of the Fortran 77 standard.
On reflection, recalling the influence of CDC and IBM on that standard, it probably wasn't.
Was there never a reference to the REAL*8 syntax as an extension?

My apologies for the assumption of REAL*8 being included in the standard.

As a long time Fortran user, I will maintain *byte is the most concise and informative definition of required numerical precision for the numerical calculation being performed.
I recall the frustration when converting programs to run on Prime/Vax or PC, of not knowing the required precision for the calculations.
I have always been disappointed with the KIND structure and the illusion of dialing up a precision in SELECTED_REAL_KIND. What do you do when someone selects SELECTED_REAL_KIND (7,38) or SELECTED_REAL_KIND (3,3), which is the example in an old Lahey manual?
My computer science background has been more based on numerical methods and providing the ability to use SELECTED_REAL_KIND (3,3) is not a significant improvement in the use of Fortran. What would you think the original programmer was hoping to achieve ?
If I see REAL*10 or REAL*16 in new code, at least I know I have a problem, while REAL*8 says the conversion is easy. The statement USE PRECISION is still an unknown. (Actually the use of REAL or REAL*4 was always the biggest worry)

John

I have a hardcopy of F77 (and F66) and can guarantee you that REAL*8 was never there.

The problem with using the byte size is that this tells you nothing about the precision or range of the datatype. You were obviously never a VAX user, where there were two different representations of REAL*8, one of them with the same range as REAL*4, or VMS on Alpha where there were THREE REAL*8 representations (and two REAL*4), each with different precision and range. This is why the SELECTED_REAL_KIND intrinsic is so useful as it frees you from such issues and lets you pick the needed precision and range and lets the compiler decide which type best fits.

SELECTED_REAL_KIND(7,38) would get you REAL*4 on VAX. SELECTED_REAL_KIND(3,3) is easily satisfied by single precision - remember that these are minimums. The compiler must pick the smallest decimal precision kind that meets the criteria for both range and precision. If there is more than one, the smallest kind value is chosen. If the implementation doesn't have a kind that meets the requirements, you'll get -1 for a kind, which will almost certainly result in a compile-time error.
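A quick way to see what a given compiler actually hands back for these requests is to print the kind together with the `precision` and `range` of a variable declared with it. The sketch below does that for the requests mentioned in this thread; the parameter and variable names are made up for the example, and p=1000 is simply a request that no current implementation is expected to satisfy, so the returned value should be negative. The kind numbers themselves are compiler specific, so none are asserted here.

```fortran
program kind_demo
  implicit none
  ! Requests are minimum decimal precision and minimum decimal exponent range.
  integer, parameter :: k_small = selected_real_kind(3, 3)
  integer, parameter :: k_738   = selected_real_kind(7, 38)
  integer, parameter :: k_p13   = selected_real_kind(p=13)
  integer, parameter :: k_none  = selected_real_kind(p=1000)
  real(k_small) :: xs
  real(k_738)   :: xm
  real(k_p13)   :: xd

  print *, '(3,3):  kind', k_small, ' precision', precision(xs), ' range', range(xs)
  print *, '(7,38): kind', k_738,   ' precision', precision(xm), ' range', range(xm)
  print *, 'p=13:   kind', k_p13,   ' precision', precision(xd), ' range', range(xd)

  ! An unsatisfiable request yields a negative value instead of a usable kind.
  print *, 'p=1000: kind', k_none
  if (k_none >= 0) stop 'unexpectedly found a kind with 1000 digits'

  ! The selected kinds must meet or exceed what was asked for.
  if (precision(xm) < 7 .or. range(xm) < 38) stop 'request (7,38) not met'
  if (precision(xd) < 13) stop 'request p=13 not met'
end program kind_demo
```

Comparing the printed precision and range for the (7,38) request with those of default REAL on the same compiler shows directly which type was chosen.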

Consider also traditional Cray systems, where single precision is 60 (!) bits. There are also 16-bit real types used in some graphics processors.

Of course there are times when you do need to know the byte size, such as C interoperability, which is why there are constants for C_FLOAT, etc.

Retired 12/31/2016

Steve,
Thanks for your comments. I do now recall the longer real*8 precision. Was there a F90 VAX compiler that allowed the different real*4 and real*8 formats to be used in the same program? Our Vax went in the late 80's.
While the Lahey documented precision of REAL*4 is about 7-8 digits and 10^38 exponent, SELECTED_REAL_KIND(7,38) will get you REAL*8 on ifort and most win-32 compilers, which illustrates my problem: what did the original programmer require for his calculation and what did he get, i.e. what accuracy did the calculation require?
The importance of byte size has not gone away, especially when using direct access files, where it is still possible to legally mix integer, real and character, not to mention the illegal mixing via subroutine arguments that some of us still find convenient.
I never wrote code for a Cray; however, CDC, IBM and ICL were a challenge. I think ICL had 48-bit reals. 48 bits gave a new level of uncertainty to REAL for code from the UK.
REAL*8 is a lot more certain than these past offerings. Now, listing code without the defining module takes me back to those annoying times.

KIND isn't likely to be changed. I do use it for portability when selecting higher precision than real*8

John

VAX single precision had the same precision (but slightly different range) as the later IEEE standard single precision, so I don't think selected_real_kind(7,38) should have produced it. We never saw f90 on the VAX. Our last VAX compiler (I didn't work in the computer industry then) did adopt the f77 subset of DATE_AND_TIME (which we then put into g77) to facilitate Y2K work.
I've never heard the CDC 60-bit format called "traditional Cray," although Seymour did have a hand in that. The Cray branded systems started out with a 64-bit format (but not with IEEE precision).

VAX never had a F90 compiler but Alpha did. DEC Fortran for Alpha VMS systems allowed you to select among the two VAX double formats or the IEEE double format, though you were permitted to use only one at a time. This was an arbitrary restriction made for simplicity and not based on the hardware's requirements.

F-float (VAX single precision) had a range from .29E-38 to 1.7E38 and 23 fraction bits, which qualifies for (7,38). IEEE single precision has effectively one less power of 2 of range due to the encodings for denormals, etc. and the range is more asymmetric than for VAX F-float (1.2E-38 to 3.4E38).

I don't think Lahey ever supported VAX floating, so their example is a bit strange.

There is also accommodation in Fortran 2008 for decimal floating point with the addition of an optional RADIX argument, so you might have more than one kind of a given byte size.

Retired 12/31/2016

This code runs in parallel when compiled with "/Qopenmp":

```
program matrix
!use omp_lib
Implicit None

Real(8), Allocatable:: a(:,:), b(:,:), c(:,:)
Integer :: N = 200, M = 500
Integer :: i
Real(8) :: ts, te, OMP_GET_WTIME
Allocate( a(N,N), b(N,N), c(N,N) )
Call random_seed()

ts = OMP_GET_WTIME()
!Call CPU_Time(ts)
!$OMP PARALLEL DO
do i = 1, M
  Call random_number(a)
  Call random_number(b)
  c = matmul(a,b)
end do
te = OMP_GET_WTIME()
!Call CPU_Time(te)

Write(*,'(A, F0.3)') 'Time for matmul = ', (te - ts)
pause
end program matrix
```

But /Qparallel has no effect! I thought /Qparallel would parallelize the code without using any explicit directives?!

OpenMP allows you to force parallelism with minimal checking for validity. As you didn't set up private copies of the arrays, it doesn't look like your OpenMP could give well defined results. Besides, if there were a parallel random number library, it could not give results consistent with serial execution, and /Qopenmp would not perform an automatic switch to parallel random_number. It looks like the only opportunity for auto-parallel is via the opt-matmul transformation.
If it were not for a requirement to trace a history in random_number, the compiler might start out with the shortcut of replacing do i=1,M by do i=M,M when you don't set /Qopenmp. I don't think the compiler would perform the analysis needed to decide that your call to random_seed eliminates the requirement to follow a full random_number sequence.
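The point about private copies can be illustrated with a variation on the matmul loop. In the sketch below the inputs `a` and `b` are filled once, outside the parallel loop, so random_number is never called concurrently; the work array `c` is declared `private` so that each thread writes to its own copy, and the one value carried out of the loop goes through a `reduction`. The scaling by `i`, the array sizes and the names are made up purely so that the serial and parallel versions can be compared; this is not the original poster's computation.

```fortran
program private_copies
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 64, m = 200
  real(dp) :: a(n,n), b(n,n), c(n,n)
  real(dp) :: serial_sum, parallel_sum
  integer :: i

  ! Fill the inputs once, before any threads are started.
  call random_number(a)
  call random_number(b)

  ! Serial reference result.
  serial_sum = 0.0_dp
  do i = 1, m
    c = real(i, dp) * matmul(a, b)
    serial_sum = serial_sum + c(n, n)
  end do

  ! Parallel version: c is private, so threads do not overwrite each other's
  ! results, and the accumulated value is combined by the reduction clause.
  parallel_sum = 0.0_dp
  !$omp parallel do private(c) reduction(+:parallel_sum)
  do i = 1, m
    c = real(i, dp) * matmul(a, b)
    parallel_sum = parallel_sum + c(n, n)
  end do
  !$omp end parallel do

  print *, 'serial   =', serial_sum
  print *, 'parallel =', parallel_sum
  ! Summation order differs between threads, so compare with a tolerance.
  if (abs(parallel_sum - serial_sum) > 1.0e-9_dp * abs(serial_sum)) stop 'results differ'
end program private_copies
```

Without OpenMP enabled, the directives are treated as comments and the second loop simply repeats the serial computation.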

OK, but why does the analyzer not auto-parallelize the matrix multiplication? That is the point I am surprised about.

The code in the original post would auto-parallelize if the loop count directive said 8 instead of 4.

When I build the code with matmul and the version 12 compiler, it tells me it auto-parallelized the matmul:

c:\Projects\t.f90(16): (col. 13) remark: LOOP WAS AUTO-PARALLELIZED.
c:\Projects\t.f90(17): (col. 13) remark: LOOP WAS AUTO-PARALLELIZED.
c:\Projects\t.f90(18): (col. 12) remark: LOOP WAS AUTO-PARALLELIZED.
c:\Projects\t.f90(18): (col. 12) remark: PERMUTED LOOP WAS AUTO-PARALLELIZED.

If I add /Qguide I see this:

c:\Projects\t.f90(16): remark #30525: (PAR) Insert a "!dir$ loop count min(64)" statement right before the loop at line 16 to parallelize the loop. [VERIFY] Make sure that the loop has a minimum of 64 iterations.
c:\Projects\t.f90(17): remark #30525: (PAR) Insert a "!dir$ loop count min(64)" statement right before the loop at line 17 to parallelize the loop. [VERIFY] Make sure that the loop has a minimum of 64 iterations.
c:\Projects\t.f90(18): remark #30525: (PAR) Insert a "!dir$ loop count min(64)" statement right before the loop at line 18 to parallelize the loop. [VERIFY] Make sure that the loop has a minimum of 64 iterations.

16 and 17 are the calls to RANDOM_NUMBER, so I would ignore that. I tried adding a directive before the MATMUL and it didn't change the output, though as I note the compiler thinks it did parallelize it.

Retired 12/31/2016

Steve, many thanks to you! I will pause testing for now and resume parallelizing my algorithms later.