Main Menu

Compression (run length)

Started by kevin, February 03, 2008, 09:44:49 PM

Previous topic - Next topic

kevin

 
Compression  (run length)


  The simplest forms of compression are run length encoders.    Put simply, you scan the original data looking for runs of bytes (for example) that are the same value.   If a run is found, this sequence is then represented in the output stream in short hand form.  Usually by a token,  followed by the byte to be repeated  then number of iterations. (length of the run)


  I.e.  Original data sequence in ASCII .

   "ABCCCCDDEEEEEFABCCCC"


   Compressed form after run lengths have been removed. (assuming depack code is unique)

    @= depack code chr,   next char after this is the packed run chr followed by the length character. Length character is number chr from 0 to 9

     "AB@C4DD@E5FAB@C4"

      to depack the compression string we run through the sequence looking for the depack code,  if we find one, we expand this sequence back into it's original state and add it to the output stream.  



PlayBASIC Code: [Select]
   PackedString$= "AB@C4DD@E5FAB@C4"
Print PackedString$
print DepackString(PackedString$)

Sync
waitkey




Function DepackString(PackedSequence$)

for lp=1 to Len(PackedSequence$)

ThisChr=mid(PackedSequence$,lp)

; Check is CHR the depack symbol
if ThisChr=asc("@")

; The next Chr is the packed run chr
PackChr$=mid$(PackedSequence$,lp+1,1)

; the next chr after that is the run length chr
RunLength=mid(PackedSequence$,lp+2)-asc("0")

; make a string out of this and add it to the output string
DepackedString$=DepackedString$+make$(PackChr$,RunLength)

; bump the loop counter past the depack codes in the input string
lp=lp+2

else
DepackedString$=DepackedString$+chr$(Thischr)
endif


next
EndFunction DepackedString$





   PlayBasic example code


     The above is neither a good algorithm or implementation,  but rather as a learning example of the run length compression process.    Effectively all you're trying to do is find a way to represent that original data in a 'short hand' format.  

    None the less,  Variations of this algorithm can include various pack types to remove runs of ascending /descending bytes/nibbles,  common pairs/triplets etc from the input stream
     

   Note: With data compression,  the type of short hand representation used (the compression method) will differ depending upon the make up of the data.    



    Other forms of compression revolve around finding repeated sequences of data  within the original stream.  It can be difficult to visualize,  but one way to imagine this might be that you're attempting to brake down the data in a dictionary of unique words.  Which might look a bit like this.


    Ie. original form

    "the quick brown fox jumped over the dog"

     Compressed Dictionary form

     Dictionary$(0)="the"
     Dictionary$(1)="quick"
     Dictionary$(2)="brown"
     Dictionary$(3)="fox"
     Dictionary$(4)="jumped"
     Dictionary$(5)="over"
     Dictionary$(6)="dog"

     DepressTokens$="01234507"

   While this example doesn't show a huge saving on simple sentence, the return would be better packing longer paragraphs of text.  Generally much better than run length encoders could achieve.







AdeN

#1
When compressing text an alternative to the dictionary approach is to make use of unused characters in the character set - however this really only works for English because you can use ASCII codes 128 to 255 (giving you 128 entries) as tokens that expand out to common sequences of letters and punctuation - in fact you could even have whole words too - it all depends on your requirement.

Normally though you would use common digrams (two letter combinations) and trigrams (three letter combinations). - here is an interesting article on letter frequency, digrams and trigrams (login required)

The best way would be to analyse all of your application's text and produce an optimal 128 entry digram/trigram table but in practice you could take some common ones, for example

128 ". " - full stop and a following space
129 ", " - comma and a following space
130 "the"
131 "ing"
132 "and"
133 "th"
134 "in"
135 "he"
136 "er"
etc - you get the idea

There is another benefit of compressing your text using the above approach - it's a very simple (and not very good) form of encryption.