Fast Dot

kevin · January 31, 2006, 08:50:50 AM

Spent most of today trying to remove some of the bottleneck from 'DOT' rendering. Previously it was OK for a few points here and there, but it would simply choke when trying to draw a full screen of dots. This is not acceptable!

It turns out that a hell of a lot of time was lost in clipping overhead, so after a little tinkering, hey presto, it's now able to draw 800*600 dots in about 400ms. Still not amazing, but that's about 10 times faster than it was previously for that number of pixels.

One of the issues is that DOT is basically a safe render. That means it supports clipping, and handles the destination buffer and the various draw modes for you.

So obviously we can gain a little more back by removing its safeness and letting the user lock/unlock the buffers and handle clipping. Which gives us FastDot.

FastDot on my machine is about 2.5 times faster than DotC. You can render a full screen of pixels (800*600) in 150 milliseconds on my Duron 800MHz. That's about 6/7fps (1000/150) for a pixel by pixel screen fill (code below). Still not staggering, but a very healthy improvement.
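Since FastDot leaves clipping to the caller, here is a minimal sketch of what caller-side clipping might look like. It assumes FastDot takes x, y and a colour exactly as in the benchmark listing in this post, and that anything plotted outside the screen must be rejected by the user's own code. The diagonal test line and the variable names are purely illustrative.

```playbasic
w=getscreenwidth()
h=getscreenheight()
rendertoscreen

c=rgb(255,255,255)

lockbuffer
   ; walk a diagonal that deliberately starts and ends off screen
   For lp=-50 To w+50
      x=lp
      y=lp
      ; caller-side clip: only plot points that are inside the screen
      If x>=0 And x<w And y>=0 And y<h
         fastdot x,y,c
      EndIf
   Next
unlockbuffer

sync
waitkey
```

In a real inner loop it would usually be cheaper to clip the loop ranges once, up front, rather than test every pixel.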

Code for PlayBASIC V1.13

PlayBASIC Code:

w=getscreenwidth()
h=getscreenheight()
rendertoscreen 


; Render the full Screen Using DOTC and record how long it takes
; (loop to w-1 / h-1 so it plots exactly the same pixels as the FASTDOT loop)
   tim1=timer()
   lockbuffer
  For ypoint=0 To h-1
     For xpoint=0 To w-1
     dotc xpoint,ypoint,rgb(255,0,0)
     Next
  Next
   unlockbuffer
   tim1=timer()-tim1

   c=rndrgb()   
   sync
   waitkey


   basec=rgb(255,0,255)
Do
   cls 0

   tim2=timer()
   c=basec
   lockbuffer
  For ypoint=0 To h-1
     For xpoint=0 To w-1
     fastdot xpoint,ypoint,c
     Next
     c=c+xpoint+1
  Next
   unlockbuffer
   tim2=timer()-tim2

; show the time of DOTC and FASTDOT
   print tim1
   print tim2
   sync
loop

UPDATE NOTES: (14th Nov 2022)

-Read PlayBASIC Help Files about FAST DOT 2
-Read PlayBASIC Help Files about FAST DOT 3
-Read PlayBASIC Help Files about FAST DOT 4

Draco9898 · January 31, 2006, 03:47:55 PM

Very nice, getting these full scale pixel based rendering functions working fast seems like a pain in the tailhole :)

kevin · February 01, 2006, 12:58:16 AM

Well, the main drama is keeping things generic. Generic and fast really don't go together. It doesn't matter how much fat I trim away from the edges, it's still generic. A more viable approach would be to implement a pointer data type, so the user can write their own custom dot filler. Although, this is really a situation where a concept like PB-Asm would shine.

Pointer example

PlayBASIC Code:

 ; sketch of the proposed pointer data type
  Dim  Address as pointer
  Dim  FrameBufferAddress as pointer
  FrameBufferAddress=GetSurfacePtr(0)
  FrameBufferModulo=GetSurfaceModulo(0)
 ; assume a 32bit filler
 ; rows run 0 to height-1 so the first row starts at the buffer address
  For Ylp=0 to height-1
     Address=FrameBufferAddress+(Ylp*FrameBufferModulo)
     For Xlp=1 to width
         *Address = Rgb(255,0,255)
         inc Address,4
     next
  next

That would certainly be quicker in the long term, but the drawback is the user has to support all video formats manually.

Conceptually, if we go ahead with PB-Asm, this would probably be the fastest way to generate time critical code, without it being totally platform dependent.

Code:


 Dim  Address as pointer
 Dim  FrameBufferAddress as pointer
 FrameBufferAddress=GetSurfacePtr(0)
 FrameBufferModulo=GetSurfaceModulo(0)
 ; assume a 32bit filler
 ; rows run 0 to height-1 so the first row starts at the buffer address
  For Ylp=0 to height-1
    Address=FrameBufferAddress+(Ylp*FrameBufferModulo)
    FillColour=Rgb(255,0,255)
   Asm
    ; Seed registers  (R0 through R3   32bit)
        Mov.l R0, Width
        Mov.l R1, FillColour
        Mov.l R2, Address
    ; Fill loop: write the colour, step to the next pixel, count down
Loop:
        Mov.l (R2),  R1
        Add.l  R2,4
        DecBne R0,Loop
   EndAsm
  next

The main appeal of implementing something like PB-Asm would be that it's a way to bypass the variables/pointers and manipulate memory directly. The Asm segments could be jitted to the host platform's native machine code. Given the simplicity of the potential instruction set, most, if not all, operations would translate 1 to 1. In cycle terms that's about 4/5 cycles per pixel for that fill loop, compared to the 100's of cycles it takes now per pixel.

But anyway, I digress..
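As a rough sanity check on those cycle figures, here is a back-of-envelope sketch. It assumes the 800MHz test machine and the 800*600 fill in 150ms quoted earlier in the thread, and it ignores memory bandwidth and everything else the machine is doing, so treat the results as ballpark only. By hand: 800,000 cycles per millisecond * 150ms / 480,000 pixels is about 250 cycles per pixel today, while 480,000 pixels * 5 cycles / 800,000 is about 3ms for the whole screen.

```playbasic
; figures quoted in this thread
mhz=800
w=800
h=600
ms=150

; an 800MHz CPU has 800*1000 cycles available per millisecond
cyclespermilli#=mhz*1000.0

; cycles currently spent on each pixel
print "Cycles per pixel now:"+str$((cyclespermilli#*ms)/Float(w*h))

; time a full screen would take at 5 cycles per pixel
print "MS at 5 cycles per pixel:"+str$((Float(w*h)*5.0)/cyclespermilli#)

sync
waitkey
```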

kevin · February 11, 2006, 07:24:12 PM

Fast Dot Revisited

I've been quietly optimizing some of the old VM baggage away from PB1.17. This is often necessary, as over time things get bloated and can often be streamlined. While I do have a set of standard benchmarks/results I use when testing for speed, these are mainly math and loop oriented. So I figured I'd use the raw DOT screen filler as the gfx one.

Results:

Code:


  In the screen shot above, PB1.13 is filling a screen full of (800*600*32bit) pixels in 150ms.
  PB1.17 now performs this task in 132ms.

Test Machine: Duron 800MHz, GF2 video, WinXP Pro

In frame rate terms that's about another full frame per second faster (as it saves close to 20 milliseconds per frame). Which doesn't sound impressive, but effectively that means the brute loop crunching power of PB is about 12% better in this situation.

If you calculate the fill rate per pixel, you can get an idea of just how many pixels my test machine can fill at a reasonable rate. ( Fill rate = (Fill Width * Fill Height) / Milliseconds )

My machine (Duron 800MHz) fills about 3200 pixels per millisecond. So it's fast enough to do this in 320*240*32bit at 38/40fps. Which is pretty staggering (to me at least), as it sure wasn't able to come close to that just a few weeks ago.

Code:
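For anyone who wants to check the frame rate and percentage figures, here is a small sketch using only the two timings quoted above (150ms and 132ms). The variable names are illustrative. By hand: 1000/150 is about 6.7fps, 1000/132 is about 7.6fps, the saving is 18ms per frame, and 18/150 is 12%.

```playbasic
; timings quoted above for the 800*600 fill
oldms=150
newms=132

; frames per second = 1000 milliseconds / milliseconds per frame
print "Old FPS:"+str$(1000.0/oldms)
print "New FPS:"+str$(1000.0/newms)

; saving per frame, in milliseconds and as a percentage of the old time
print "MS saved per frame:"+str$(oldms-newms)
print "Percent faster:"+str$((Float(oldms-newms)/oldms)*100.0)

sync
waitkey
```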
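The fill rate sums can be sketched like this, assuming the PB1.13 figure of 800*600 pixels in 150ms. By hand: 480,000 / 150 = 3200 pixels per millisecond, so a 320*240 screen (76,800 pixels) needs about 24ms, which is a ceiling of roughly 41fps before any Cls, Print or Sync overhead is counted.

```playbasic
; figures quoted in this thread (PB1.13 run)
w=800
h=600
ms=150

; Fill rate = (Fill Width * Fill Height) / Milliseconds
rate#=Float(w*h)/ms
print "Fill Rate:"+str$(rate#)

; time to fill a 320*240 screen at that rate, and the frame rate it implies
ms2#=Float(320*240)/rate#
print "MS for 320*240:"+str$(ms2#)
print "FPS:"+str$(1000.0/ms2#)

sync
waitkey
```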



;makebitmapfont 1,$ffffff
w=getscreenwidth()
h=getscreenheight()

w=320
h=240

openscreen w,h,32,2
rendertoscreen 
;ScreenVsync on

; Render the full Screen Using DOTC and record how long it takes
; (loop to w-1 / h-1 so it plots exactly the same pixels as the FASTDOT loop)
	tim1=timer()
	lockbuffer
  For ypoint=0 To h-1
  	For xpoint=0 To w-1
     dotc xpoint,ypoint,rgb(255,0,0)
  	Next
  Next
	unlockbuffer
	tim1=timer()-tim1


	basec=rgb(255,0,255)
Do
	cls 0

	rendertoscreen 
	dot 0,ypoint

	tim2=timer()
	c=basec
  lockbuffer
  For ypoint=0 To h-1
  	For xpoint=0 To w-1
    fastdot xpoint,ypoint,c
  	Next
  	c=c+xpoint+1
  Next
  unlockbuffer
	tim2=timer()-tim2
	basec=basec+w

; show the time of DOTC and FASTDOT
	print fps()
	print "MS"+str$(tim2)
	print "Fill Rate:"+str$(Float(w*h)/tim2)
	sync
loop

Draco9898 · February 11, 2006, 08:32:34 PM

Are you going to throw PB-ASM in? Looks extremely useful...

kevin · February 11, 2006, 08:40:11 PM

Probably, although it's not like I can just throw it in. Effectively it's like producing a mini compiler, within a compiler.

thaaks · February 12, 2006, 06:56:56 AM

To me PB-Asm sounds like a way to circumvent engine/interpreter problems.
Personally I would avoid something like PB-Asm - it will result in a second language to be supported/improved/bugfixed...

Maybe it makes more sense to enhance the interpreter. The issue you're trying to solve looks pretty much like "HotSpot" from Java.
With JIT the SUN people were able to transform method code into machine code but the call stack (the sequence of methods to be called) was still interpreted in Java. So SUN worked on HotSpot which means they transform whole call stack regions into native code.
This works pretty well for big loops for example. Maybe that gives you some more ideas, Kevin...

But that's just my 2 cents ;)

Cheers,
Tommy

Digital Awakening · February 12, 2006, 07:09:37 AM

Who would actually have use for PB-Asm? As Thaaks says, it's like a 2nd language with all the problems involved with it. Also, PB is meant to be an easy way to program games. Personally I would like to see PB FX first, and perhaps other things that are more directly usable for game creation. PB FX would allow us to do great looking modern 2D games. When that's taken care of, there's nothing stopping more complicated features from being included.

kevin · February 12, 2006, 07:52:51 AM

I knew this would be misinterpreted. Implementing something like PB-Asm is a low priority; the concept is as old as PB itself. However, there is certainly a need for a way to streamline time critical loops without the compiler generated overheads getting in the road.

The plan has always been to compile the source down to one generic byte code instruction set (which is what it already does), then translate the byte code to native machine code where possible. The translation can occur either in the platform VM, or externally (e.g. as a module). Anyway, the issue (one of them) is that no matter how clever the code generator is, it's highly unlikely to be able to reach the speed of a manually set out asm loop. But it'll certainly be a lot quicker either way :)

hartnell · May 07, 2008, 06:18:31 PM

The idea for PB-ASM opens up all kinds of new possibilities for PB. It would certainly attract wanna-be ASM coders and the computer science crowd. Imagine :

* Learn the fundamentals of ASM -- using PlayBasic!
* Learn the fundamentals of making your own operating system -- using PlayBasic!
* Learn the fundamentals of making your own programming language -- using PlayBasic!

I began a 6502 emulator project for this exact reason; sadly, I lack the time to program it myself.

It will be a while before I'm able to get into computer science again, but I can definitely say that there is a market for this kind of thing.

The two requirements for attracting this audience would be

* Include PB-ASM in PB Source -- for people looking to write their own ASM routines.
* An option to develop using only the VM2.

If you ever want to continue with it, please post a brainstorming thread. :)

-- Shawn

kevin · May 07, 2008, 11:03:33 PM

See Kyruss
