News:

Building a 3D Ray Tracer  By stevmjon

Main Menu

Simple code profiling and high resolution timers

Started by AdeN, September 24, 2007, 06:01:49 PM

Previous topic - Next topic

AdeN

This is a bit of a long post - but hopefully worthwhile - I was a bit unsure whether I should have put it in the Source Code message board but decided as it is a bit lengthy and regarding it's content that it was more of a tutorial. What I have done is to split it up into the following sections:

1. Background

2. What is a high resolution timer?

3. What resolution timer does your PC provide?

4. What is the overhead of using the high resolution timer?

5. A very simple example: What is better "j = j + 1" or "inc j"?

6. End bracket


1. Background

One thing I always do when learning a new language is to get an idea of what sort of optimisations I can apply at a programming level that will help the compiler create faster code.

Now I don't know what goes on under the hood with PlayBasic other than the compiler converts the source code into a byte code representation.

Kevin: Out of interest are there any optimizations applied when creating the byte code? (I'm thinking of the usual sort of things, like removing loop invariants from the bodies of loops, removing redundant variable assignments, replacing psub/function calls with inline code for small psubs/functions etc).

Is there much of a difference between the byte code being generated when running from the IDE (I'm using IDE 1.1.5f and compiler 1.63h) to producing a fully deployed .EXE?

Sorry to be so nosey - but I'm just curious and trying to get a handle (pun intended ;-) on things.


(Note:  - Yes, the compiler has many redundancy operations in it.  When we build and run from the IDE though, all opcode have debug hooks in them.  When we build an EXE in release mode, all that fat is removed.  


2. What is a high resolution timer?

Anyway, onto the possibly interesting bit and something which may be of use to PlayBasic programmers trying to get a few more frames per second out of their games.

Most of you will have come across the Timer() command which returns the elapsed time in milliseconds - but how about a timer that is accurate to less than one millionth of a second?

Well it's easy, and it's already provided by Windows, although it uses 64 bit ("long long") values so we have to jump throgh a few hoops to use the QueryPerformanceFrequency and QueryPerformanceCounter APIs provided in kernel32.dll.

Here is some code that wraps the high resolution timer (some of it isn't elegant so apologies for that - but I've left the comments in so you can double-check my intentions - hopefully I haven't made any mistakes!), please feel free to skip over this to the examples below:

PlayBASIC Code: [Select]
Global Kernel32DLLid = 0
Global memTimerBankId = 0
Global memTimeBankPtr = 0
Global memTimerFreqLo = 0
Global memTimerFreqHi = 0

PSub HighResTimerInit(maxTimers)
; load the standard Windows Kernel32.DLL
Kernel32DLLid = GetFreeDll()
LoadDll "Kernel32", Kernel32DLLid
DLLCallConv Kernel32DLLid, 0

; each timer requires two 64 bit values (2 x 8 bytes)
; one for the start time and one for the current time
memTimerBankId = GetFreeBank()
createbank memTimerBankId, (16 * maxTimers)
memTimeBankPtr = GetBankPTR(memTimerBankId)
result = CallDll(Kernel32DLLid, "QueryPerformanceFrequency", memTimeBankPtr)
if result <> 0
memTimerFreqLo = PeekInt(memTimeBankPtr)
memTimerFreqHi = PeekInt(memTimeBankPtr + 4)
; if the frequency is too large the integer goes negative
; we don't support such large values - which should not be a problem as
; the max positive signed integer value is 2,147,483,647 ie 2000 Mhz!
if (memTimerFreqLo < 0) then result = 0
if (memTimerFreqHi > 0) then result = 0
endif
EndPsub result

PSub HighResTimerStart(whichTimer)
ptr = memTimeBankPtr + ((whichTimer - 1) * 16)
CallDll(Kernel32DLLid, "QueryPerformanceCounter", ptr)
EndPSub

PSub HighResTimerElapsed(whichTimer)
ptr = memTimeBankPtr + 8 + ((whichTimer - 1) * 16)
CallDll(Kernel32DLLid, "QueryPerformanceCounter", ptr)
loStart = PeekInt(ptr - 8 )
hiStart = PeekInt(ptr - 4 )
loNow = PeekInt(ptr)
hiNow = PeekInt(ptr + 4)

if hiNow - hiStart > 0
; timer has gone over the lower dword boundary
if loStart < 0
; timer started when lower dword had top bit set
; loNow cannot be negative because that would indicate that the hi dword
; is still the same and so we don't get into this branch unless loNow >= 0
; now lower dword does not have top bit set
elapsed = (2147483647 - loStart) + loNow
else
; timer started when lower dword did not have top bit set
if loNow < 0
; now lower dword has top bit set
elapsed = (2147483647 + loNow) + (2147483647 + loStart)
else
; now lower dword does not have top bit set
elapsed = loNow - loStart
endif
endif
else
; hi dwords are the same - so only need to consider the low dwords
if loStart < 0
; timer started when lower dword had top bit set
; loNow cannot be positive because that would indicate that the hi dword
; incremented and so we don't get into this branch unless loNow < 0
; now lower dword has top bit set
elapsed = (2147483647 + loNow) - (2147483647 + loStart)
else
; timer started when lower dword did not have top bit set
if loNow < 0
; now lower dword has top bit set
elapsed = (2147483647 - loStart) + (2147483647 + loNow)
else
; now lower dword does not have top bit set
elapsed = loNow - loStart
endif
endif
endif
EndPSub elapsed

PSub HighResTimerCheckpoint(whichTimer)
ptr = memTimeBankPtr + 8 + ((whichTimer - 1) * 16)
CallDll(Kernel32DLLid, "QueryPerformanceCounter", ptr)
loStart = PeekInt(ptr - 8)
hiStart = PeekInt(ptr - 4)
loNow = PeekInt(ptr)
hiNow = PeekInt(ptr + 4)

; reset stat timer time to present time
PokeInt ptr - 8, loNow
PokeInt ptr - 4, hiNow

if hiNow - hiStart > 0
; timer has gone over the lower dword boundary
if loStart < 0
; timer started when lower dword had top bit set
; loNow cannot be negative because that would indicate that the hi dword
; is still the same and so we don't get into this branch unless loNow >= 0
; now lower dword does not have top bit set
elapsed = (2147483647 - loStart) + loNow
else
; timer started when lower dword did not have top bit set
if loNow < 0
; now lower dword has top bit set
elapsed = (2147483647 + loNow) + (2147483647 + loStart)
else
; now lower dword does not have top bit set
elapsed = loNow - loStart
endif
endif
else
; hi dwords are the same - so only need to consider the low dwords
if loStart < 0
; timer started when lower dword had top bit set
; loNow cannot be positive because that would indicate that the hi dword
; incremented and so we don't get into this branch unless loNow < 0
; now lower dword has top bit set
elapsed = (2147483647 + loNow) - (2147483647 + loStart)
else
; timer started when lower dword did not have top bit set
if loNow < 0
; now lower dword has top bit set
elapsed = (2147483647 - loStart) + (2147483647 + loNow)
else
; now lower dword does not have top bit set
elapsed = loNow - loStart
endif
endif
endif
EndPSub elapsed




3. What resolution timer does your PC provide?

Using the code above we can find out how accurate your PC is going to be when trying to code profile. Try the followiing out and see what you get:

PlayBASIC Code: [Select]
; call the init function with the max number of concurrent timers you want to use
; I usually just use 10 - there's just 16 bytes memory overhead per timer
if HighResTimerInit(10) = 0
print "PC does not support high resolution timer"
else
print "High resolution timer frequency: " + Str$(memTimerFreqLo) + " Hz (" + Str$(float(memTimerFreqLo) / 1048576.0) + " Mhz)"
endif



The results for my laptop - Note: CPU identifies itself as Genuine Intel(R) CPU T2400 @ 1.83GHz, 1.00 Gb RAM

High resolution timer frequency: 3579545 Hz (3.41372 Mhz)

Which is another way of saying that my PC is accurate to 0.279 millionths of a second (ie 1 / 3579545)


4. What is the overhead of using the high resolution timer?

One thing to bear in mind is that using the high resolution code above doesn't come free - it costs time to make the API calls and calculate the elapsed time. But exactly how much?

Well we can use the high resolution timer to timer itself!

First of all we time a loop without calling the high resolution timer (having first started one - as we are interested not in the setup time, which is very quick but in the ongoing measurement parts which will be potentially called many times), and then we repeat the test but calling the measurement routine and work out what the difference is.

PlayBASIC Code: [Select]
testIterations = 1000
dim times(testIterations)

print "Baseline loop test"
for test = 1 to 9
HighResTimerStart(1)

for i = 1 to testIterations
times(i) = i
next i

totalTimeEmpty = HighResTimerCheckpoint(1)
print "Test " + str$(test) + " Total: " + str$(totalTimeEmpty) + " Average: " + str$(float(totalTimeEmpty) / float(testIterations))
next test

print ""
print "High res timer loop test"
for test = 1 to 9
HighResTimerStart(1)
HighResTimerStart(2)

for i = 1 to testIterations
times(i) = HighResTimerCheckpoint(2)
next i

totalTime = HighResTimerCheckpoint(1)
min = times(1)
max = times(1)
for i = 1 to testIterations
if times(i) < min then min = times(i)
if times(i) > max then max = times(i)
next

print "Test " + str$(test) + " Total: " + str$(totalTime) + " Average: " + str$(float(totalTime) / float(testIterations)) + " Min: " + str$(Min) + " Max: " + str$(max)
next test

overhead# = float(totalTime - totalTimeEmpty) / float(testIterations)
print ""
print "Overhead of calling high res timer = " + str$(overhead#) + " " + Str$(overhead# * 1000.0 / float(memTimerFreqLo)) + " ms"
print "Which is equivalent of " + Str$(overhead# * 1000.0 * 100.0 / float(memTimerFreqLo)) + " hundredths of a millisecond"
print "or " + Str$(overhead# * 1000000.0 / float(memTimerFreqLo)) + " millionths of a second"



On my laptop running under the IDE the overhead is about 0.032 milliseconds per high res timer call - ie 32 millionths of a second. So we just need to bear this in mind if you use it in any intensive loops - normally you'll measure outside of loops when comparing algorithms so this should not be a big deal.


5. A very simple example: What is better "j = j + 1" or "inc j"?

Here is a simple example that determines if "j = j + 1" or "inc j" is faster - now I was expecting that "inc j" would be faster (as this is very basic operation at the processor level and requires less data to perform), but I seem to be wrong - or my testing is!

Kevin: Is this what you would have expected or is my testing flawed? See below

PlayBASIC Code: [Select]
testIterations = 1000000
dim testArray(testIterations)

print "baseline increment test"
totalTime = 0
for test = 1 to 9
HighResTimerStart(1)

j = 0
for i = 1 to testIterations
j = j + 1
next i
times(test) = HighResTimerCheckpoint(1)
totalTime = totalTime + times(test)

print "Test " + str$(test) + " Total: " + str$(times(test))
next test
print "Average: " + str$(totalTime / 9)

print "inc increment test"
totalTime = 0
for test = 1 to 9
HighResTimerStart(1)

j = 0
for i = 1 to testIterations
inc j
next i
times(test) = HighResTimerCheckpoint(1)
totalTime = totalTime + times(test)

print "Test " + str$(test) + " Total: " + str$(times(test))
next test
print "Average: " + str$(totalTime / 9)



Note: That I am doing one million loops in each test, on my laptop I get the following results running under the IDE in a window (having run the complete test 6 times to help avoid fluke runs skewing the results):

j = j + 1     average 216329, 213076, 187889, 191631, 202625, 193624 = 1205174
inc j         average 222143, 237167, 222062, 242018, 251182, 235329 = 1409901
difference              5814,  24091,  34173,  50387,  48557,  41705 =  204727


So from my quick analysis it would appear that "inc j" is a bit slower - don't forget that for each result above involved running the loop one millions and then repeating it nine times as well.

In other words it did 9,000,000 variable increments and the difference in that time was about 16.9% (ie 204727 / 1205174)


6. End bracket

Glad you managed to stick with me all the way to the end!

Please bear in mind that the example above is a highly construed one, in real world tests you will not get such a high difference as most of the time will be spent drawing the graphics and blittiing etc.

However, having optimized code always helps, especially if you're implementing game rules and AI. But bear in mind that assumptions can change between compiler releases so you always need to regression test your optimization choices when you get a new build of the compiler.

I am quite interested to hear what your PC high resolution timer runs at and whether my tests above are repeatable and produce the same broad results.

I will probably do some more tests to check out some other assumptions too - let me know if you want me to post these and also let me know if you come up with any good tests and results too.

Best regards,
Adrian Nelson






kevin

#1
 Adrian,

 Thanks for the tutorial !

QuoteKevin: Out of interest are there any optimizations applied when creating the byte code? (I'm thinking of the usual sort of things, like removing loop invariants from the bodies of loops, removing redundant variable assignments, replacing psub/function calls with inline code for small psubs/functions etc).

    The expressions are opt'd to a point,  like constant expressions can be resolved and many core built in functions are inlined also.  Ie. A line of code like  R#=Cos(45)*10 simply produces to  R#=10.      It doesn't unroll loops or remove redundant (unused) variables though.   The latter is on the to do list.      


QuoteIs there much of a difference between the byte code being generated when running from the IDE (I'm using IDE 1.1.5f and compiler 1.63h) to producing a fully deployed .EXE?

    The release runtimes have no safe code.  That is to say, pretty much  all the nanny error code generation stuff is gone.    

    Also, you really should be using PB1.63m or PB1.63n (login required).  From memory, you'd have to install them in order though.


QuoteKevin: Is this what you would have expected or is my testing flawed? See below

     Yep that's expected, the expression optimizer will often short cut general math operations such as A=A+B  from around version V1.50 and higher.    The speed difference occurs as Addition has a slightly higher precedence than INC opcode. Therefore, it's faster.




 


AdeN

Thanks for the feedback Kevin - I'll be sure to keep up-to-date with patches.

With regard to "A = A + 1" vs "Inc A", I am wondering if in your compiler and runtime that "Inc A" is actually redundant, as you could treat it a bit like a macro and internally (during compilation) when you spot "Inc <var>" you expand it is "<var> = <var> + 1". This reduces yor opcode table and also does not penalise anyone who prefers to use the more friendly "Inc" command (the same applied to the "Dec" command too).

I am working on some more profiling tests and will post some more tests and results - however these will have to wait for another week as I've now got to concentrate on getting my own software release finished off! So I must stop playing around with your excellent PlayBasic!

Please feel free to include this tutorial (or a modified form of it - let me know if you would like me to update/re-write it) in the PlayBasic documentation if you think it would be useful.

monkeybot

as i have mentioned before it would be nice to have a code profiler built into PB,next tutorial though...