Problem with LNBLNK and len_trim

Problem with LNBLNK and len_trim

I have a small subroutine transferring text from keyboard entries, etc from lower to uppercase letters. I used this routine for CVF and the version shown below I also apply for IVF Win32 applications. What happens is that the program crashes with the error message:Unhandled exception at 0x7790DF7C (ntdll.dll) in CaFeMS1-IVF.exe: 0xC0000005: Access violation writing location 0x002A0FFC... before the line ILENGTH = len_trim(STRING) - 1 where a breakpoint is set.

The subroutine is linked from  a library file.

Can anyone give me a tip?

Program listing follows here:

! IF(I.GT.IEND)GOTO 2000 ! 25/5-2000

! ICOM is character*2

READ(IDEDIT,202)ICOM

CALL TRANS_TO_UPPERCASE(ICOM,2)

SUBROUTINE TRANS_TO_UPPERCASE(STRING,LENGTH)

use IFPORT

INTEGER*4 KAR,ILENGTH,LENGTH

CHARACTER*(LENGTH)STRING

c CHARACTER*(40)STRING

! ILENGTH = LNBLNK(STRING)

ILENGTH = len_trim(STRING) - 1

C....Transfer STRING to uppercase letters....................

DO 1000 I=1,ILENGTH

KAR=0

KAR=ICHAR(STRING(I:I))

IF(KAR.GE.97)STRING(I:I)=CHAR(KAR-32)

1000 CONTINUE

RETURN

END

 

27 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.
character(len=*) :: string 

Strings length has been defined in the caller. You can get in in the sub with len(string). What is the length passed in mean to be the decalred length or a sub string length?

I would use ascii >=65 to 90<= BTW

As you probably know by now, fragments of code with some bug are nearly impossible to diagnose in a reasonable amount of time, and the number of ways in which the code could have failed can be more than you want to read about. You did not show the code where the subroutine is called, so we do not know whether the actual arguments match the dummy arguments.

The subroutine that you showed has the following potential bugs:

1. The second argument is not used except to declare the desired length of the first argument. What happens if the caller passes a value in LENGTH that is not equal to the length of STRING? What happens when STRING contains a single non-blank character followed by blanks?

2. Consider what happens with several edge cases:

   (i) The trimmed string length is 1

   (ii) One or more characters in the part of STRING that are to be uppercased are in the set ['{','}','|','~']

3. What is IFPORT needed for? You commented out the line with a reference to LNBLNK.

Optimal methods of changing character case have been around for a long time.  Try this:

SUBROUTINE updncase (string, updn)
	IMPLICIT NONE
	CHARACTER(LEN=*), INTENT(INOUT)	:: string
	INTEGER, INTENT(IN)				:: updn
	INTEGER							:: j, nc
	CHARACTER(LEN=26), PARAMETER	:: lower = 'abcdefghijklmnopqrstuvwxyz'
	CHARACTER(LEN=26), PARAMETER	:: upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
	
	SELECT CASE (updn)
	CASE (1)		
		DO j = 1, LEN(string)
			nc = INDEX(lower, string(j:j))
			IF (nc > 0) string(j:j) = upper(nc:nc)
		END DO
	CASE (2)		
		DO j = 1, LEN(string)
			nc = INDEX(upper, string(j:j))
			IF (nc > 0) string(j:j) = lower(nc:nc)
		END DO
	END SELECT	
END SUBROUTINE updncase	 

 

Thank you all for comments. It turned out that the root cause was due to something that happened during migration from CVF to IVF. 

Paul: If I understand your code, the routine updncase change the case either from upper lo lower or vice versa.. My routine takes a mixed string and transfer all characters to uppercase.

Quote:

reidar wrote:
Paul: If I understand your code, the routine updncase change the case either from upper lo lower or vice versa. My routine takes a mixed string and transfer all characters to uppercase.

In that case, you need the following half (case updn = 1) of Paul's code. On X86 and X64, the intrinsic INDEX can use the fast string search (SCASB) instruction with a repetition count of 26. If you are going to pass strings with trailing blanks to this subroutine, you may wish to use LEN_TRIM instead of LEN, or, if your strings will have no embedded blanks, exit the DO loop when a blank character is encountered.

SUBROUTINE upcase (string)
   IMPLICIT NONE
   CHARACTER(LEN=*), INTENT(INOUT)	:: string
   INTEGER				:: j, nc
   CHARACTER(LEN=26), PARAMETER	:: lower = 'abcdefghijklmnopqrstuvwxyz'
   CHARACTER(LEN=26), PARAMETER	:: upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
	
   DO j = 1, LEN(string)
      nc = INDEX(lower, string(j:j))
      IF (nc > 0) string(j:j) = upper(nc:nc)
   END DO
END SUBROUTINE upcase	 

 

Quote:

mecej4 wrote:
.....the intrinsic INDEX can use the fast string search (SCASB) instruction with a repetition count of 26. 

Very Interesting. I have long realised the INDEX is a wonderful (and quick) tool (which I use frequently) but was unaware of  SCASB which I have just read up on.  In this instance however I would have thought checking a character against 26 characters 'quickly' within a loop would not be significantly different in time to checking if an ascii integer was between upper and lower limit values within a loop. I am not going to test this hypothesis as even if there was a 'big' difference (x5) it would not make a worthwhile benefit as I am not doing gazillions up upp/low casing. Anyhow thank you for the education!

Andrew, if you found the SCAS instruction interesting, you may find the much less used XLAT instruction ( http://www.felixcloutier.com/x86/XLAT:XLATB.html )equally interesting.

One could set up one 256-byte array, say UPR,  set to Z'00',...,Z'FF', except that UPR(97:122) should be set to Z'45',...,Z'5A'. This sets up the "translation table" which, combined with string LODS, XLATB and STOS in a tight loop would be very fast.

On the other hand, the C library the IFort uses already contains the standard functions strupr(), etc., which could be used as a more portable, safer, alternative.

I was trying to find an expression that determined whether a character was a letter or not that wasn't too expensive to evaluate...

module M
   implicit none
   contains
      subroutine upcase(string)
         character(*), intent(INOUT) :: string
         call sub(string,len(string))
         contains
            subroutine sub(string,len)
               integer, intent(in) :: len
               character, intent(INOUT) :: string(len)
               string = merge(achar(ibclr(iachar(string),5)),string, &
                  ibits(dim(ibclr(iachar(string),5),64)-1,0,8)<26)
            end subroutine sub
      end subroutine upcase
end module M

program P
   use M
   implicit none
   integer i
   character(40) string
   string = 'yOU WiLl `FInD'' iT @ThE [13{ oaK tRee.'
   write(*,'(a)') trim(string)
   call upcase(string)
   write(*,'(a)') trim(string)
end program P

Output with ifort:

yOU WiLl `FInD' iT @ThE [13{ oaK tRee.
YOU WILL `FIND' IT @THE [13{ OAK TREE.

 

RO, you're slipping! I would have expected at least five ICEs here....

Steve (aka "Doctor Fortran") - Retired from Intel

The implicit point of updncase() is that a precomputed or preset lookup table (LUT) is usually (always?) much more efficient than on-the-spot computation, regardless of how fascinatingly compactly (and incomprehensibly) the computation can be written [merge(achar(ibclr(iachar(   ... really?].

Many years ago I wrote an image-processing code for a fluorescence system which computed the ratio image of two video frames by using the numerator and denominator pixel values as indices into a LUT whose values were the ratio for each combination of n/d; was very fast, worked at frame rates, where an attempt to actually compute the ratio image would have killed the project. 

LUTs touch memory or blow out cache. Computation is fast on 21st century CPUs. Three filse are needed for this test: UpCase.asm

; UpCase.asm
; fasm UpCase.asm
format ms64 coff
section '.code' readable executable
public UpCase
UpCase:
   push rsi
   push rdi
   sub rsp, 40
   test rdx, rdx
   js epilog
   mov r8, rcx
   mov rsi, rcx
   neg ecx
   and ecx, 31
   vbroadcastss ymm1, [bias]
   vbroadcastss ymm2, [lower]
   vbroadcastss ymm3, [bit]
   jz main_part
   mov rdi, rsp
   sub rdx, rcx
   cmovl ecx, edx
   mov eax, ecx
   REP movsb
   vmovdqu ymm0, [rsp]
   vpaddb ymm4, ymm0, ymm1
   vpcmpgtb ymm4, ymm4, ymm2
   vpand ymm4, ymm4, ymm3
   vpandn ymm0, ymm4, ymm0
   vmovdqu [rsp], ymm0
   mov rsi, rsp
   mov rdi, r8
   mov ecx, eax
   REP movsb
   mov rsi, rdi
   jle epilog
main_part:
   cmp rdx, 31
   jle tail
   vmovdqa ymm0, [rsi]
   vpaddb ymm4, ymm0, ymm1
   vpcmpgtb ymm4, ymm4, ymm2
   vpand ymm4, ymm4, ymm3
   vpandn ymm0, ymm4, ymm0
   vmovdqa [rsi], ymm0
   add rsi, 32
   sub rdx, 32
   jmp main_part
tail:
   test rdx, rdx
   jz epilog
   mov r8, rsi
   mov rdi, rsp
   mov ecx, edx
   REP movsb
   vmovdqu ymm0, [rsp]
   vpaddb ymm4, ymm0, ymm1
   vpcmpgtb ymm4, ymm4, ymm2
   vpand ymm4, ymm4, ymm3
   vpandn ymm0, ymm4, ymm0
   vmovdqu [rsp], ymm0
   mov rsi, rsp
   mov rdi, r8
   mov ecx, edx
   REP movsb
epilog:
   add rsp, 40
   pop rdi
   pop rsi
ret
public RDTSC1
RDTSC1:
   RDTSC
   shl rdx, 32
   or rax, rdx
   ret
section '.data' readable writeable
temp rb 32
bias dd 05050505h
lower dd 65656565h
bit dd 20202020h

And test.f90:

! test.f90
! ifort test.f90 UpCase.obj
module M
   use ISO_C_BINDING
   implicit none
   private
   public UpCase, RDTSC, updncase
   interface
      subroutine UpCase(string, length) bind(C,name='UpCase')
         import
         implicit none
         character(KIND=C_CHAR), intent(INOUT) :: string(*)
         integer(C_SIZE_T), value :: length
      end subroutine UpCase
   end interface
   interface
      function RDTSC() bind(C,name='RDTSC1')
         import
         implicit none
         integer(C_INT64_T) RDTSC
      end function RDTSC
   end interface
   contains
      SUBROUTINE updncase (string, updn)
         IMPLICIT NONE
         CHARACTER(LEN=*), INTENT(INOUT) :: string
         INTEGER, INTENT(IN)    :: updn
         INTEGER       :: j, nc
         CHARACTER(LEN=26), PARAMETER :: lower = 'abcdefghijklmnopqrstuvwxyz'
         CHARACTER(LEN=26), PARAMETER :: upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
 
         SELECT CASE (updn)
         CASE (1)  
            DO j = 1, LEN(string)
               nc = INDEX(lower, string(j:j))
               IF (nc > 0) string(j:j) = upper(nc:nc)
            END DO
         CASE (2)  
            DO j = 1, LEN(string)
               nc = INDEX(upper, string(j:j))
               IF (nc > 0) string(j:j) = lower(nc:nc)
            END DO
         END SELECT 
      END SUBROUTINE updncase  
end module M

program P
   use M
   use ISO_C_BINDING
   implicit none
   character(LEN=:,KIND=C_CHAR), allocatable :: Bible
   integer(C_INT64_T) t0, t1
   integer size
! http://ebible.org/kjv/kjvtxt.zip
   open(10,file='kjv.txt',status='old',access='stream')
   inquire(unit=10,size=size)
   write(*,'(*(g0))') 'File size = ',size
   allocate(character(LEN=size,KIND=C_CHAR) :: Bible)
   read(10) Bible
   close(10)
   t0 = RDTSC()
   call UpCase(Bible,len(Bible,C_SIZE_T))
   t1 = RDTSC()
   open(10,file='UpCase.txt',status='replace',access='stream')
   write(10) Bible
   close(10)
   write(*,'(*(g0))') 'Clock cycles for UpCase = ', t1-t0

   open(10,file='kjv.txt',status='old',access='stream')
   read(10) Bible
   close(10)
   t0 = RDTSC()
   call updncase(Bible,1)
   t1 = RDTSC()
   open(10,file='updncase.txt',status='replace',access='stream')
   write(10) Bible
   close(10)
   write(*,'(*(g0))') 'Clock cycles for updncase = ', t1-t0
end program P

Also it requires kjv.txt. To get it, go to http://ebible.org/kjv/kjvtxt.zip unzip it and run combine.bat. I didn't post it here because it's 4 MB. Then you can assemble UpCase.asm with FASM and compile test.f90 with ifort. My output:

File size = 4523438
Clock cycles for UpCase = 897107
Clock cycles for updncase = 375973272

The first time is for UpCase  and the second for updncase. Windiff.exe found the output files UpCase.txt and updncase.txt to be identical. subroutine UpCase was about 400X faster than subroutine updncase because it could process 32 bytes in the 11 instructions of its main_part: loop. That's the kind of thing I was trying to achieve with my earlier Fortran code, but there's a bit of a problem that Fortran doesn't define overflows as nicely as it could. Still I could have done better in my original Fortran code. Couldn't get it to ICE even when I incorporated it into an initialization expression -- must be a compiler bug :)

EDIT: The program had the file size hardwired incorrectly, so I fixed that by using INQUIRE to get the correct size.

First of all, THANKS to RO for his interesting posts.

Out of curiosity, I added RO's Fortran Upcase from #9 to his test program of #12, and removed the UPDN=2 code from UPDNCASE. On a laptop with an I5-4200U CPU, I get

Clock cycles for UpCase    =      1104393 (  1.1E6)
Clock cycles for ROUpcase  =     22040992 ( 22E6)
Clock cycles for PCUpcase  =    328797223 (329E6)

Some comments:

  • Repeating the runs gave slightly different clock counts. Not more than the first three digits are repeatable.
  • The run (time for reading the bible three times, converting and writing the uppercase version three times) took 0.2 s. Of this time, 0.05 s was spent on file I/O.
  • The timing will vary with the CPU generation and model
  • I opened kjv.txt in a text editor, Select+All, Uppercase Selection; this took a fraction of a second. Undo (to revert) took about a second.
  • RO's upcase.asm will require a CPU with AVX instructions (Sandy Bridge and newer).
  • Based on marketing literature, I tended to think of AVX as more for floating point and image processing. RO has taught us that AVX instructions can be used for text processing.
  • RO can probably backport upcase.asm so that it will work on CPUs with SSE2, and thereby improve his standing as a repeat offender.

 

I really enjoy following this type of thread but back in the real world the conclusion is that all the methods are 'fast' for most practical purposes.

It gets more fun when you need to support character sets that don't have a 1:1 mapping between lower and upper case!

Steve (aka "Doctor Fortran") - Retired from Intel

Quote:

andrew_4619 wrote:
I really enjoy following this type of thread but back in the real world the conclusion is that all the methods are 'fast' for most practical purposes.

Agreed, but it is fun to escape from the real world for a break now and then.

Also, when someone shows us things that we had never imagined were possible, sometimes they can become very useful.

See this Gfortran bug report for something of this sort: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78549 . One of the other posters here and on C.L.F. had come across the problem earlier, but his complaints were being sidelined. Zeroing in on the problem (a tenfold slow-down in formatting real numbers) helped getting the it to be recognized and fixed.

Quote:

Steve Lionel (Ret.) wrote:
 It gets more fun when you need to support character sets that don't have a 1:1 mapping between lower and upper case!

And no fun at all when the character set has no concept of case, such as Indian languages ?

I tried a couple of translations of my assembly code into Fortran:

      subroutine trial1(string,length) bind(C)
         use ISO_C_BINDING
         implicit none
         character(KIND=C_CHAR), intent(INOUT), target :: string(*)
         integer(C_SIZE_T), value :: length
         type(C_PTR) address
         integer(C_INT8_T), pointer :: data(:)
         integer(C_INT8_T), parameter :: &
            bias = int(Z'05',C_INT8_T), &
            lower = int(Z'65',C_INT8_T), &
            bit = int(Z'20',C_INT8_T)
         address = C_LOC(string)
         call C_F_POINTER(address,data,[length])
         where(data+bias > lower) data = iand(not(bit),data)
      end subroutine trial1
      subroutine trial2(string,length) bind(C)
         use ISO_C_BINDING
         implicit none
         character(KIND=C_CHAR), intent(INOUT), target :: string(*)
         integer(C_SIZE_T), value :: length
         type(C_PTR) address
         integer(C_INT8_T), pointer :: data(:)
         integer(C_INT8_T), parameter :: &
            bias = int(Z'05',C_INT8_T), &
            lower = int(Z'65',C_INT8_T), &
            bit = int(Z'20',C_INT8_T), &
            truemask = int(Z'FF',C_INT8_T), &
            falsemask = int(Z'00',C_INT8_T)
         address = C_LOC(string)
         call C_F_POINTER(address,data,[length])
         data = iand(not(iand(merge(truemask,falsemask, &
            data+bias > lower),bit)),data)
      end subroutine trial2

It seemed that /arch:core-avx2 helped the WHERE version out a lot, but not the more literal translation:

D:\>ifort /nologo /arch:host test2.f90 UpCase.obj

D:\>test2
File size = 4523438
Clock cycles for UpCase = 875272
Clock cycles for updncase = 393948858
Clock cycles for trial1 = 35775283
Clock cycles for trial2 = 12334499

D:\>ifort /nologo /arch:core-avx2 test2.f90 UpCase.obj

D:\>test2
File size = 4523438
Clock cycles for UpCase = 847216
Clock cycles for updncase = 379932665
Clock cycles for trial1 = 9246359
Clock cycles for trial2 = 9139662

Windiff considered UpCase.txt, updncase.txt, trial1.txt, and trial2.txt to be identical.

 

Quote:

Repeat Offender wrote:

.. Couldn't get it to ICE even when I incorporated it into an initialization expression -- must be a compiler bug :) ..

Mega kudos to Intel Fortran team on the lack of ICEs thus far!

Is the posted assembler derived from the non-standard C function strupr?  I can't use FASM at the moment and the posted assembler code won't work with Microsoft ML64.  Perhaps someone will post the tick counts with the strupr C function, here's the code snippet toward it!  

module m

   use, intrinsic :: iso_c_binding, only : c_char, c_ptr, c_loc, c_size_t

   implicit none

   private

   interface

      function strupr( sptr ) bind(C, name="strupr") result(ups)
      ! non-standard C function; #include <string.h>
      ! char *strupr(char *string);
         import :: c_ptr
         implicit none
         ! Argument list
         type(c_ptr), intent(in), value :: sptr
         ! function result
         type(c_ptr) :: ups
      end function

   end interface

   public :: UpperCase

contains

   subroutine UpperCase( string )

      ! Argument list
      character(kind=c_char, len=*), intent(inout), target :: string

      ! Local variables
      type(c_ptr) :: add_s

      add_s = strupr( c_loc(string) )

      return

   end subroutine UpperCase

end module m

program p

   use iso_c_binding, only : c_char, c_size_t
   use iso_fortran_env, only : output_unit
   use m, only : UpperCase

   implicit none

   ! Local variables
   character(len=*), parameter :: fmtg = "(*(g0))"
   character(kind=c_char,len=:), allocatable :: Bible
   integer(c_size_t) :: fsize
   integer :: lun

   ! http://ebible.org/kjv/kjvtxt.zip
   open( newunit=lun, file='kjv.txt', status='old', access='stream' )

   inquire( unit=lun, size=fsize )
   write( output_unit, fmt=fmtg ) 'File size = ', fsize
   allocate (character(len=fsize, kind=c_char) :: Bible )

   read(unit=lun) Bible
   close(unit=lun)

   call UpperCase( Bible )

   write( output_unit, fmt=fmtg ) Bible(:42)

   stop

end program p

Upon execution I get:

C:\Temp>ifort /standard-semantics /warn:all p.f90
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R
) 64, Version 18.0.1.156 Build 20171018
Copyright (C) 1985-2017 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.12.25835.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\Temp>p.exe
File size = 4523438

THE FIRST BOOK OF MOSES, CALLED GENESIS

C:\Temp>

 

I tried strupr and _strupr_s

   interface
      function strupr_s(str, numberOfElements) bind(C,name='_strupr_s')
         use ISO_C_BINDING
         implicit none
         integer(C_INT) strupr_s
         character(KIND=C_CHAR), intent(INOUT) :: str(*)
         integer(C_SIZE_T), value :: numberOfElements
      end function strupr_s
   end interface
   interface
      function strupr(str) bind(C,name='strupr')
         use ISO_C_BINDING
         implicit none
         integer(C_INT) strupr
         character(KIND=C_CHAR), intent(INOUT) :: str(*)
      end function strupr
   end interface

Results with \arch:core-avx2 were

File size = 4523438
Clock cycles for UpCase = 882187
Clock cycles for updncase = 392973204
Clock cycles for trial1 = 9601161
Clock cycles for trial2 = 9834643
Clock cycles for strupr = 35929257

No output for _strupr_s, which always crashed. BTW, you can probably persuade ml64 to assemble at least RDTSC1; if not I posted a pure Fortran version of RDTSC somewhere.

Quote:

Repeat Offender wrote:

.. Results with \arch:core-avx2 were

File size = 4523438
Clock cycles for UpCase = 882187

Holy XXX!  Remarkable indeed, hats off!

Presumably C, C++ based implementations all do similar pointer-based traversals and end up with similar tick counts, here's a result with C++ STL and Microsoft toolset:

#include <iostream>
#include <fstream>
#include <sstream>
using namespace std;

extern "C" int64_t RDTSC1();

int main( )
{
   std::ifstream ifs ("kjv.txt", std::ifstream::in);

   stringstream strStream;
   strStream << ifs.rdbuf();
   string Bible = strStream.str();

   ifs.close();

   int64_t t0 = RDTSC1();
   transform(Bible.begin(), Bible.end(), Bible.begin(), ::toupper);
   int64_t t1 = RDTSC1();

   std::cout << "Clock cycles for C++ STL Transform: " << (t1-t0) << "\n";
   std::cout << Bible.substr(0,41) << "\n";

   return 0;
}
C:\Temp>cl /c /EHsc ucase.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

ucase.cpp

C:\Temp>link ucase.obj UpCase.obj /subsystem:console /out:p.exe
Microsoft (R) Incremental Linker Version 14.00.24215.1
Copyright (C) Microsoft Corporation.  All rights reserved.


C:\Temp>p.exe
Clock cycles for C++ STL Transform: 85559157

THE FIRST BOOK OF MOSES, CALLED GENESIS


C:\Temp>

 

Mmmm... I restructured my UpCase subroutine and found it had a bug that could bite at small lengths, so I post the restructured subroutine which I have tested for different cases of alignments and lengths.

; UpCase2.asm
; fasm UpCase2.asm
format ms64 coff
section '.code' readable executable
public UpCase
UpCase:
   mov r9, rsi
   mov r10, rdi
   test rdx, rdx
   jle epilog
   mov r8, rcx
   mov rsi, rcx
   neg ecx
   and ecx, 31
   mov eax, 00056520h
   vmovd xmm1, eax
   vpbroadcastb ymm3, xmm1
   vpsrld ymm1, ymm1, 8
   vpbroadcastb ymm2, xmm1
   vpsrld ymm1, ymm1, 8
   vpbroadcastb ymm1,xmm1
   jz main_part
   lea rdi, [rsp+8]
   mov eax, edx
   sub rdx, rcx
   cmovl ecx, eax
   mov eax, ecx
   REP movsb
   vmovdqu ymm0, [rsp+8]
   vpaddb ymm4, ymm0, ymm1
   vpcmpgtb ymm4, ymm4, ymm2
   vpand ymm4, ymm4, ymm3
   vpandn ymm0, ymm4, ymm0
   vmovdqu [rsp+8], ymm0
   lea rsi, [rsp+8]
   mov rdi, r8
   mov ecx, eax
   REP movsb
   mov rsi, rdi
   jle epilog
main_part:
   mov ecx, edx
   and rdx, -32
   jz tail
   add rsi, rdx
   neg rdx
main_loop:
   vmovdqa ymm0, [rsi+rdx]
   vpaddb ymm4, ymm0, ymm1
   vpcmpgtb ymm4, ymm4, ymm2
   vpand ymm4, ymm4, ymm3
   vpandn ymm0, ymm4, ymm0
   vmovdqa [rsi+rdx], ymm0
   add rdx, 32
   js main_loop
tail:
   and ecx, 31
   jz epilog
   mov r8, rsi
   mov eax, ecx
   lea rdi, [rsp+8]
   REP movsb
   vmovdqu ymm0, [rsp+8]
   vpaddb ymm4, ymm0, ymm1
   vpcmpgtb ymm4, ymm4, ymm2
   vpand ymm4, ymm4, ymm3
   vpandn ymm0, ymm4, ymm0
   vmovdqu [rsp+8], ymm0
   lea rsi, [rsp+8]
   mov rdi, r8
   mov ecx, eax
   REP movsb
epilog:
   mov rsi, r9
   mov rdi, r10
ret
public RDTSC1
RDTSC1:
   RDTSC
   shl rdx, 32
   or rax, rdx
   ret

It had similar performance characteristics to the original.

 

Code that just used XLAT to perform the conversion was faster than the libc stuff:

XLAT1:
   mov r8, rbx
   mov rbx, table
   test rdx, rdx
   jle xlat_done
   mov r9, rcx
   mov r10, rdx
   neg ecx
   and ecx, 7
   jz main_xlat
   sub rdx, rcx
   cmovl ecx, r10d
   add r9, rcx
   neg rcx
xlat_prolog:
   mov al, [r9+rcx]
   xlatb
   mov [r9+rcx], al
   add rcx, 1
   js xlat_prolog
main_xlat:
   mov rcx, rdx
   and rdx, -8
   jl xlat_done
   je xlat_epilog
   add r9, rdx
   neg rdx
xlat_loop:
   mov rax, [r9+rdx]
   repeat 8
      xlatb
      ror rax, 8
   end repeat
   mov [r9+rdx], rax
   add rdx, 8
   js xlat_loop
xlat_epilog:
   and ecx, 7
   jz xlat_done
epilog_loop:
   mov al, [r9+rcx-1]
      xlatb
   mov [r9+rcx-1], al
   sub rcx, 1
   jnz epilog_loop
xlat_done:
   mov rbx, r8
   ret

Output:

File size = 4523438
Clock cycles for UpCase = 922020
Clock cycles for updncase = 395661851
Clock cycles for XLAT = 11453485

 

Quote:

Repeat Offender wrote:
I tried strupr and _strupr_s

... No output for _strupr_s, which always crashed.

If you had linked against the static version of the UCRT library, you would probably have seen an assertion-failure pop-up complaining about the buffer not containing a null-terminated string. Something like the following would work:

   char bible[BUFSIZE+1];
   ...
   nin=fread(bible,1,BUFSIZE,fili);
   bible[nin]='\0';                // everyone must null-terminate strings
   err = _strupr_s(txt,nin+1);

But then, a lot of CPU cycles would be wasted in _strupr_s() making sure that the null is present, so why use _strupr_s?

Quote:

Repeat Offender wrote:

Code that just used XLAT to perform the conversion was faster than the libc stuff: ..

Fortran users may prefer to use code they can relate to like the one below that is based on Fortran intrinsics with the ASCII character set and get similar (or even faster) conversion than that based on the non-standard strupr function in C while being functional for their needs:

   subroutine UpperCaseAscii( string )

      ! Argument list
      character(len=*), intent(inout) :: string

      ! Local variables
      integer, parameter :: CASE_OFFSET = iachar('A') - iachar('a')
      integer :: I
      integer :: K

      do I = 1, len_trim(string)
         K = iachar( string(I:I) )
         if ( K >= iachar('a') .and. K <= iachar('z') ) then
            string(I:I) = achar(K + CASE_OFFSET)
         end if
      end do

      return

   end subroutine UpperCaseAscii

The following is a reasonable average of results I see on a couple of different machines:

C:\Temp>p.exe
File size = 4523438
Clock cycles for RO UpCase AVX ASM code = 867051
Clock cycles with C strupr function = 36714453
Clock cycles for UpperCaseAscii (Fortran IACHAR) = 34864104
Clock cycles for PC updncase (CHAR LUT) = 273949856

Ratio of clock cycles: strupr/UpperCaseAscii = 1.05

C:\Temp>

 

Of, so numberOfElements has to be the size of the whole buffer, counting the terminating NUL. Got that to work.

File size = 4523438
Clock cycles for updncase = 376145820
Clock cycles for ROupcase = 21672422
Clock cycles for UpCase = 805513
Clock cycles for trial1 = 9281737
Clock cycles for trial2 = 9176733
Clock cycles for strupr = 36265847
Clock cycles for strupr_s = 42524356
Clock cycles for XLAT = 11115903

(with /arch:core-avx2)

 

Quote:

Repeat Offender wrote:
Oh, so numberOfElements has to be the size of the whole buffer, counting the terminating NUL

Rather, the portion of the buffer that you pass to _strupr_s must contain a NUL somewhere. The conversion stops at the first NUL.

#include <string.h>
#include <stdio.h>

int main( void ){
   char string[25];
   errno_t err;
   strcpy(string,"AbraCobrA"); strcpy(string+10,"CobraAbra");
   err = _strupr_s( string, 20);
   printf( "Upper: %s %s\n", string, string+10 );
}

Output:

Upper: ABRACOBRA CobraAbra

 

Leave a Comment

Please sign in to add a comment. Not a member? Join today