News:

PlayBASIC2DLL V0.99 Revision I Commercial Edition released! - Convert PlayBASIC programs to super fast Machine Code. 

Main Menu

PlayBASIC V1.64N2 / V164N3 (Work In Progress) Gallery

Started by kevin, April 22, 2012, 12:07:12 AM

Previous topic - Next topic

kevin





PlayBASIC V1.64N2 & V1.64N3  (Work In Progress) Gallery


    This thread will document the changes being made for the PlayBASIC 1.64 revision N2 and revision N3 (Last half of thread).    Ideally this update will address a few issues in the V1.64N release, with probably some more tweaks thrown in for good measure.

    For Newer upgrade work in progress see,

     See  PlayBASIC V1.64P (login required)   -     Convert PlayBASIC To Machine Code Dll's

    See  PlayBASIC V1.64O


    For older upgrade work in progress see,

    See  PlayBASIC V1.64N

    See  PlayBASIC V1.64M

    See  PlayBASIC V1.64L Learning Edition History (Building learning Edition Thread)

    See  PlayBASIC V1.64L





PlayBASIC V1.64N2  Beta #1 - Sprites Collision and evils branch prediction

    While roaming the hill sides of the PB engine a few weeks back,  noticed something of an oddity in regards to sprite collision, in particular when used in scenes of say 500->1000 sprites.   Yeah, I guess that's a fairly pie in the sky sprite count given your average PB has has less than 25 sprites on screen, but none the less,  if we can make a dent on the bottle neck then any improvement will flow through to more real world situations.

    In the test code, I'm creating a 1000 sprites within a 800 * 600 range of each other and comparing each sprite to all 1000 of them,making a 1,000,000 brute force potential impacts.   What's interesting is that in this situation, it's actually quicker to not use collision classes at all,  by about 3 times in fact.    

    When sprite hit runs through the set, it's just AND'ing the your desired class with each sprites class , if no bits are shared, we skip it.

    Which might look a bit like this in PB code



   MyMASK = %11110000

   For ThisSprite =1 to  NumberOfSprites
        if GetSpriteCollisionClass(ThisSprite)   and MyMASK

             ; Here we'd continue to do the actually impact
             
       endif
   next



     So what we're saying is that when the IF / ENDIF statement is always true, then the loop runs much quicker (3 times faster) than when the we selective target impacts, when say only 10% of the cases were true.   Doesn't make sense ?, yeah... had me stumped for a while, but this has all the hall marks of the branch predictor failing us.     Which I'm not too sure we can avoid at this point.  I've been trying to rearrange the logic to avoid potential stalls, but the moment there's a comparison in the loop it dies in such circumstances.
         
     Now given the situation is pretty pie in the sky,  it's not really something i'm worried about, but it does bother me :)

 
     Attached is pic showing the scene i'm testing.. a bit over the top :)


ATLUS

one thousand sprite, good work, and how many fps?

kevin


kevin

#3
 PlayBASIC V1.64N2  Beta #2 - Compare Operators

     A by product of the previous test has lead to some improvements to the compare operator opcodes in the PB VM.    Compares aren't a heavily used operator throughout a program, but they generally appear in high repetition loops, so we can win back some free performance by customizing the opcodes to better handle the most general situation, which is INTEGER compares.   It'll still handles mixed INT/FLOAT types, but the reshuffle means it's slightly biased towards integer operations.

     In the shots bellow we're running a bunch of for/next loops each  performing 10,000 iterations.   Each test is comparing the various comparison types, so we've a loop for EQUALS, NOT EQUAL, LESS THAN, GREATER THAN,  LESS THAN EQUALS and GREATER THAN EQUALS.   Some loops are just doing one compare, others are comparing all 4 combination.. ie. INT=INT, INT=FLT, FLT=INT, FLT-FLT
           
    The compiler can also detect some assignment redundancy in expressions, which can remove the odd bogus move operator that could appear in the outputed code.    The combination makes the bench mark about twice as fast.   May have also found a few legacy opcodes no longer in use.  It's hard to tell really, since the output opcode is often calculated (merged) during code generation, rather than  a nice an obvious write function.  So if you get 'messages during BETA testing' then I wanna know about it.  



PlayBASIC V1.64N2  Beta #3 - FOR / NEXT Statements

    While looking through some user code recently, noticed a few usages of   STEP 1 in FOR/NEXT loops.  The run time actually has two looping methods built in, one for the default loop iteration of one, and another for a user defined step value.  This is necessary as the user defined step requires another variable read from memory every iteration.   Fetching from memory is unavoidable when the step is some custom value, but when it's one, it's just overhead for the sake of overhead.  So the parser now automatically detects this situation and defaults back to the standard +1 looping form.          
    So this,


     For lp =0 to Max Step 1
          Count++
     next

 
     is actually compiled as this now.

     For lp =0 to Max
          Count++
     next





kevin

#4
 PlayBASIC V1.64N2  Beta #4 - Squeeeeezzzzzing out some extra Loop performance

     At this stage,  optimizing any of the core operations in the PBV1.64 VM,  is how you'd imagine those formula one engineers must feel squeezing that extra millisecond out of their latest F1 car.   There was a point early on, where making the runtime faster was often pretty easy, but  as time drags, the big impact gains are few and far between.     More often than not now, new gains are made not through changing the runtime, but tweaking the compiler to detect and gracefully replace particular 'dodgy' bits of logic or functions.   Another way gains can be made, is via tweaking the layout of the opcodes being produced based upon the most common circumstances users apply it, which is how the performance gains in the comparison opcodes are made above.

      Hopefully by now, you'll appreciate that Looping and decisions go hand in hand, so any performance we win back has a positive impact upon end user programs.   The comparison gains alone gave us some extra zip to our Repeat/Until  While/EndWhile loops, in one test making a While loop 4 milliseconds faster.    But the actual operation wasn't any different than in V1.64N and bellow.   But last night, i've been looking over the looping opcodes (again) for alternative methods of structuring the opcode data, trying to avoid fetching anything that may or may not be required.   It turns out, that  by pre-processing some the opcodes data, some work can be done at compile time, rather than an runtime.   With some interesting results..  

      Bellow we're looking yet another thrilling standard bench mark test between V1.64N and V1.64N2 Beta 4, the test is just timing a bunch of looping constructs from FOR/NEXT, REPEATS / WHILE and DECLOOP...same old same old..    So the V1.64N release can run the test and maintain a 48fps refresh, which is quickest of all of the classic versions of the PlayBASIC,  well... until now..    So running the same test through V1.64N2 beta 4 and we're able to maintain a 60fps rate.         Now we should point out that the real benefits of this test are recent changes are the conditional repeat/while statements.  There's perhaps a tiny gain in FOR/NEXT opcode, but nothing to write home about.    

      Comparing today's runtime performance using the standard test, against older editions we get a slight speed advantage,  where the V1.64N2 is running the FULL standard test in 3.7->3.8 seconds, about point 1->2 of a second quicker than V1.64N, which was one of the first editions to cross under the 4 second barrier.    For those who might not know, the standard test is a program that measures the performance of the all the common low level operations in the PlayBASIC runtime.  It was originally written way... back in 2004 and i've a collection of programs built from various Pb revisions across that time period.     It's interesting looking back on how long it takes those older versions, like the V1.089 revision takes almost 9 seconds to run same code.   It's kind of like looking back on fossils where you see large jumps in the performance to some structural change that was made.        

       Anyway,  we'll keep chipping away.. I think there might be a way to cache some structures accesses across expressions and perhaps groups of lines.. But we'll see.        


kevin


PlayBASIC V1.64N2  Beta #5 - Select / Case

     Some days you win some performance back, some days you lose some.  Yesterday was that latter,  was trying some rearrangements of the opcodes that handle  IF / THEN / IF ELSE ENDIF / SELECT statements.  Prior to making the changes,  based on the previous sessions results, I would happily bet that they'll be quicker also.. But.. nope, slower..   Spent most of the afternoon trying to get the win, but ended up reverting to the previous version.   Which happens.. an awful lot..     

     Not a complete loss though as the implementation of Select / Case statement has a number of fatty bits wrapped around the edges.  Surprisingly when brute forcing the Select/Case against IF/THEN blocks,  a raw IF/THEN was quicker over 25,000 calls of 10 options

     This type of decision block thing



     If A =0 Then DoSomething0
     If A =1 Then DoSomething1
     If A =2 Then DoSomething2
     If A =3 Then DoSomething3
     If A =4 Then DoSomething4
     If A =5 Then DoSomething5
     If A =6 Then DoSomething6
     If A =7 Then DoSomething7
     If A =8 Then DoSomething8
     If A =9 Then DoSomething9


vs


     Select A

           case 0 : DoSomething0
           case 1 : DoSomething1
           case 2 : DoSomething2
           case 3 : DoSomething3
           case 4 : DoSomething4
           case 5 : DoSomething5
           case 6 : DoSomething6
           case 7 : DoSomething7
           case 8 : DoSomething8
           case 9 : DoSomething9

    endselect


 
        In PB1.64N the test was showing a 2 millisecond bias towards the IF/THEN version.   Which can attributed to that opcodes precedence in the runtime.   In other words just calling one is faster than  the other, let alone what they actually do.   When looking through the select opode though there did seem to be a few places where the opcode could return earlier, so have plugged some leaks.  Which resulted in the two sides performing almost evenly.   Clearly the IF/THEN statement block above gives a FIXED execution time.  Where SELECT can exit the structure when a case is met.  The sooner that case is to the top the quicker. 

        So in the above example if you set A to 9, then they'd preform about the same, if A is 5, then the select block is approaching twice a quick and so on.   Now you can shuffle the  IF/THEN logic to  make the an almost identical logical structure, but it's a just a little extra work like bellow.



     If A =0 Then DoSomething0 : Goto Done
     If A =1 Then DoSomething1 : Goto Done
     If A =2 Then DoSomething2 : Goto Done
     If A =3 Then DoSomething3 : Goto Done
     If A =4 Then DoSomething4 : Goto Done
     If A =5 Then DoSomething5 : Goto Done
     If A =6 Then DoSomething6 : Goto Done
     If A =7 Then DoSomething7 : Goto Done
     If A =8 Then DoSomething8 : Goto Done
     If A =9 Then DoSomething9
Done:



     You can take type of tweak even further by moving to JUMP Table are per ON VARIABLE GOTO, but they don't suit every thing.    Ideally it'd be best if the compiler could recognized the most optimal implementation that best fits the case data.   But, unless your falling through this type of structure 1000's of times each frame, then it's not worth worrying about today.     

   

Type Caching

      Because ARRAYS are dynamic in BASIC, then every time you pull or write a data to an array,  the runtime has a bunch of overhead to chew through to just work out the location of said data in memory.    So the fetch cost is really a pretty significant drain upon performance, where the more accesses we have the more overhead we introduce.     So obviously caching values in Variables where possible is a better approach.   Variables are comparable to CPU registers in the VM.   Everything internally is biased toward them, given they're the currency of the instruction set.   

      One idea i've been mulling over for a while now would approximate some long term redundancy style  opt's you see in low level languages like C/C++,  which tend to solve accesses to a pointer and carry that pointer forward until it has to be recalculated.    Ideally if we pre-processed the source code, we could make a program reorder the user code into the more efficient sequences of instructions. 
   
      Ie.



  '      If we had some code like this

      Dim Table(100)

      For lp =0 to 100
             Table(lp)=Table(lp)+Table(lp)
      next

  ' could be this..

      Dim Table(100)

      For lp =0 to 100
             Temp=Table(lp)
             Table(lp)=Temp+Temp
      next




     Now you'll notice there's multiple accesses of the same cell in the array in the first snippet.   So an easy opt would be pull the value once in an variable , double it there and then write it back.     A pre/post processor pass could actually do this for us as it'd take a much broader view of the code, the compiler can't as it's not aware of how these accesses might stack up over time.  So any time there's a query, it drops the code to get/set the value from the array. 

    But,  what I think might possible would be introducing an alternative type of Get/SET process for subsequent accesses.  It wouldn't be as optimal as rearranging the expression tends to be, but I suspect it'd make a happy middle ground reducing the overhead of heavy structure accesses, effectly giving back some free speed for what minimal effect all round.



kevin

#7
 PlayBASIC V1.64N2  Beta #6 - Type Caching In math Short cuts

        Since yesterday I've been looking over how types are accessed,  this is a very old part of the vm instruction set and one where's not much room to move, so I've had to add some secondary lower priority instructions to help out with the caching.     Firstly, caching can only be applied to ordered accesses of the same typed structure.  So it can't pre-compute part the cell addresses of a number types  and hold them that for long periods of time, it's just the most recent access.    

        Starting simple, i've been testing the logic upon typed variable access inside the math short cuts operators rather than even attempting to let this run while on anything bigger.   Depending upon the operation and the target, the short cuts generally output much the same code as the unrolled expression, not always, but generally they do.   When used on arrays/types the code generator spits out  code that basically reads the array, does the math operations and drops the result back into the array.     Since the source and destination cells are known to be the same, we can short cut this by allowing the read as normal, then spiting out a cached write rather than a flow blown write.  

        Bellow we've some test code and results showing the proof of concept works.    The code just runs batches of additions on 3 fields in a type, 25,000 times.          


type Vector3D
x#,y#,z#
EndType

type Stuff

Name$
X
y
z#
Table(100)
V3 as Vector3D

EndType

Dim Me as stuff
Dim test(10) as stuff
Dim cool as stuff list

me = new stuff

Test(1) = new stuff


cool = new stuff

me.v3 =0  

max=25000

Do


cls
frames++

tests=0
t=timer()
for lp =0 to max
me.X+=1
me.y+=2
me.z+=3
next
tt1#+=timer()-t
print tt1#/frames
print Me.x
print Me.y
print Me.z
Me.x=0
Me.y=0
Me.z=0
tests++


t=timer()
for lp =0 to max
me.X=me.X+1
me.y=me.y+2
me.z=me.z+3
next
tt2#+=timer()-t
print tt2#/frames
print Me.x
print Me.y
print Me.z
Me.x=0
Me.y=0
Me.z=0
tests++


print "BandWidth:"+Str$(max*24*Tests)
print "FPS:"+Str$(fps())


Sync
loop




       
       Ok.. so result wise, running the first loop in V1.64N2 beta6 gives us something like a 20-25% boot in performance compared to V1.64N.    But there's a BUT..  Something that I'd not foreseen, is that when you cache subsequent accesses like this we can no long rely upon the PB's automatic on write type creation.  Simply because the write function has no idea where the cache data type comes from.

       What i mean is that in PB, when you write to types that don't exist, PB will allocate the type for you..


       Eg.
      Type Stuff
            Xpos,Ypos
      EndType

      ; declares the type, doesn't allocate the types structure, only the container
       Dim Me as Stuff

      ; So this is actually writing to a type that doesn't exist
       Me. Xpos = 45       ; writing to the type



      Since PlayBASIC knows the type of array/type your writing to, it allocs the structure for you.  Normally you'd do this with the NEW operator.  If you use lists, then that's the only way.    The thing is, if the code has to READ from the structure and then caches the read pointer,  in an expression like Me.Xpos++,  we're reading from a type that doesn't exist and caching this none existent structure, so when we run the cached write, BOOM it crashes.    

      So what this means is that if we enable caching universally across our programs we'll be a slightly changing the behavior of the runtime, even though it's fairly subtle difference, this could be the difference between some programs running and other programs not running without significant modification.    In particular, if you rely upon the automatic allocation.  If you don't, then there's no problem.    
       
      Been testing a bunch of examples, and at this point have only found one or two where there's any issue,  even so,  I'm not too comfortable setting caching as the default behavior as yet without a lot of 3rd party testing.   I suspect the best way forward will be like when the first editions of the expression optimizer appeared, where originally you had to enable it, then eventually  it became the default mode.  


ATLUS

Kevin, do you planning add new operators in new version?

kevin

#9

Nope, only things planned are what's listed here (login required)


kevin

#10
PlayBASIC V1.64N2 BETA #6 (Retail Compiler Only Beta) - (Avail for Registered Users ONLY)

     PlayBASIC V1.64 N2  Beta #6,  this revision contains all the latest runtime optimizations and Language files..  

     The V1.64N revision 2 is meant to address a few issues found in the V1.64N upgrade, so far it's been more of an optimization based upgrade with various improvements being made to the PB VM.    The main changes found in this beta would be the new Compare operators, some conditional looping changes (repeat / Until  - While / EndWhile),  improved select case and last but not least Type Caching in the Math Operators.    

     Type caching is a method where the runtime can preserve info about type accesses across an expression.  So far, it's implementation is limited to only being available within short cut math operators such as ++, --, +=, -=, *=, /*  and logic operations.    

     Ie.

     Type Stuff
           X,Y,Z
     EndType
     DIm Me as Stuff
     Me = New Stuff
 
    ; the runtime can now remove some of the work from this type of operation
     Me.X++

    ;  It can't currently help if you do this
     Me.x = Me.x +1


      Internal tests show that caching gives us about around 25% gain in speed over the long hand version.   Be aware though, if you try and do this on types that don't exist you'll get a runtime error.   To create a type within the container, we use the NEW operator as shown above.  

      I hope, caching can be rolled out language wide, but my gut feeling is that we'll struggle to get enough people testing it across their code, which will mean it's implementation will be abandoned.  



Download


     Old Link removed



ATLUS


OldNESJunkie

To quote Kevin: "I hope, caching can be rolled out language wide, but my gut feeling is that we'll struggle to get enough people testing it across their code, which will mean it's implementation will be abandoned."

Personally I say make it that way anyways especially if the performance gains are that large. People can adjust their projects if need be. d/l now to test myself.

ATLUS

i think work well, i tested my codes and some(~50) standard "Examples codes".

kevin

#14
QuotePersonally I say make it that way anyways especially if the performance gains are that large. People can adjust their projects if need be. d/l now to test myself

  Have already been testing ways to expand this project wide, but unfortunately it's not turning out to be as simple as i'd hoped.  It can work nicely on serialized sequences of code, but issues frequently appear in blocks where there's a mixture of data structures being accessed.    The issue isn't that people would have to reshuffle their code to take best advantage (that's a given), the issue is that their code just won't work in some situations.   Finding those circumstances takes time and lots and lots and lots of testing.