Tokenize String - Simple String Lexer

kevin · April 12, 2023, 09:31:46 AM


In summary, the code does the following:

- The code defines a custom data type named tLexToken that has three fields: TOKEN$, TOKEN_TYPE, and INDEX.

- The code dimensions an array named TOKENS to hold instances of tLexToken, declared as Dim TOKENS(256).

- The code builds a string variable named Message$ that contains the text to be tokenized: a sentence of words, two numbers joined by a multiplication sign, a quoted string literal, and an arithmetic expression.

- The code calls a function named TOKENIZE_STRING that takes two arguments: a string to be tokenized and an array of tLexToken instances to hold the tokens.

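- The TOKENIZE_STRING function builds three lookup arrays indexed by ASCII code (0 to 255): ASCII_MAP assigns each character a token type, while ASCII_WORD and ASCII_NUMBER mark the characters that may continue a word or a number. This setup is guarded by a static flag named _INIT_LOCALS, so it is meant to run only on the first call. The function also initializes a variable named TOKENS_COUNT to zero.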

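- The TOKENIZE_STRING function loops over each character in the input string and looks up its token type in ASCII_MAP. Punctuation, operators, spaces, and tabs map to their own ASCII code. Types above 255 mark the start of a run: 1000 for a word, 1001 for a number, and 1002 for a string literal. For words and numbers, the function scans ahead while the following characters still belong to the run, then groups the run into a single token and adds it to the TOKENS array.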

- The TOKENIZE_STRING function returns the number of tokens that it has created.

- The main code then loops over the TOKENS array and prints each token's index, type, and value. Finally, the code waits for the user to press a key before terminating.
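The steps above can be sketched in a few lines. This is a Python rendering for illustration only, not the author's PlayBASIC source. The table contents and type codes (1000, 1001, 1002) come from the listing; the names LexToken, build_tables, and _scan_run are hypothetical. How string literals, single characters, and unmapped characters are handled is not stated in the summary, so those parts are assumptions and are marked as such in the comments.

```python
from dataclasses import dataclass

# Run types taken from the listing; single characters use their own ASCII code.
WORD, NUMBER, STRING = 1000, 1001, 1002

LETTERS = "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS = "0123456789"


@dataclass
class LexToken:
    token: str       # token text
    token_type: int  # WORD, NUMBER, STRING, or an ASCII code
    index: int       # 0-based start position in the input (assumed meaning)


def build_tables():
    """Build the three 256-entry lookup tables described in the summary."""
    ascii_map = [0] * 256
    ascii_map[9] = 9                      # TAB
    for ch in " .()+-/*=":                # single-character tokens
        ascii_map[ord(ch)] = ord(ch)
    for ch in LETTERS:
        ascii_map[ord(ch)] = WORD
    for ch in DIGITS:
        ascii_map[ord(ch)] = NUMBER
    ascii_map[34] = STRING                # a double quote opens a literal

    ascii_word = [False] * 256            # characters that may continue a word
    for ch in DIGITS + LETTERS:
        ascii_word[ord(ch)] = True
    ascii_number = [False] * 256          # characters that may continue a number
    for ch in DIGITS + ".":
        ascii_number[ord(ch)] = True
    return ascii_map, ascii_word, ascii_number


ASCII_MAP, ASCII_WORD, ASCII_NUMBER = build_tables()


def _scan_run(text, start, table):
    """Return the index of the last character of the run that begins at start."""
    end = start
    while (end + 1 < len(text)
           and ord(text[end + 1]) < 256
           and table[ord(text[end + 1])]):
        end += 1
    return end


def tokenize_string(text):
    tokens = []
    pos = 0
    while pos < len(text):
        code = ord(text[pos])
        chr_type = ASCII_MAP[code] if code < 256 else 0
        end = pos
        if chr_type == WORD:
            end = _scan_run(text, pos, ASCII_WORD)
        elif chr_type == NUMBER:
            end = _scan_run(text, pos, ASCII_NUMBER)
        elif chr_type == STRING:
            close = text.find('"', pos + 1)
            # Assumption: an unterminated literal runs to the end of the input.
            end = close if close != -1 else len(text) - 1
        if chr_type != 0:                 # assumption: unmapped characters are skipped
            tokens.append(LexToken(text[pos:end + 1], chr_type, pos))
        pos = end + 1
    return tokens
```

In this sketch, tokenize_string('x1 = 10.5+"a b"') yields seven tokens: a word, a space, an equals sign, a space, a number, a plus sign, and a string literal that keeps its quotes. Because the decimal point is in the number table, an input such as 1.2.3 would also be grouped into a single number token.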

PlayBASIC Code:

   Type tLexToken
         TOKEN$
         TOKEN_TYPE
         INDEX
   EndType

   Dim TOKENS(256) as tLexToken

   Message$=" This is the string I want to tokenize.  1000 * 2000"
   Message$+= chr$(34)+"String Literal"+chr$(34)
   Message$+= "12+4+5+6"
   print Message$

   Count=TOKENIZE_STRING(Message$, TOKENS())
   For lp =0 to Count-1
      s$ ="("+str$(Tokens(lp).TOKEN_TYPE)+")"
      s$+="="+Tokens(lp).TOKEN$
      print "#"+str$(lp)+" "+s$
   next

   Sync
   waitkey
   




   // --------------------------------------------------------------------------------------------------
   // --------------------------------------------------------------------------------------------------
   // -------------------------------------->> TOKENIZE STRING <<---------------------------------------
   // --------------------------------------------------------------------------------------------------
   // --------------------------------------------------------------------------------------------------

   
function TOKENIZE_STRING(InputTEXT$, TOKENS().tLexToken)

      Size        =len(InputTEXT$)

      Static _INIT_LOCALS
      if _INIT_LOCALS=false

         _INIT_LOCALS=true

         dim ASCII_MAP(255)
         dim ASCII_WORD(255)
         dim ASCII_NUMBER(255)
   
         // ----------------------------------------------------------
         // Map common ASCII characters
         // ----------------------------------------------------------
         ASCII_MAP(9)   = 9      //  TAB character

         // ----------------------------------------------------------
         //  Map single-character tokens (space, punctuation, operators)
         // ----------------------------------------------------------
         WORD$ =" .()+-/*="
         for lp=1 to len(WORD$)
            ThisCHR = mid(WORD$,lp)
            ASCII_MAP(ThisCHR)= ThisCHR
         next
      

      // ----------------------------------------------------------
      //  WORDS  UPPER / LOWER CASE with UNDER SCORES
      // ----------------------------------------------------------
         WORD$ ="_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
         for lp=1 to len(WORD$)
            ThisCHR = mid(WORD$,lp)
            ASCII_MAP(ThisCHR)= 1000
         next
   


      // ----------------------------------------------------------
      //  NUMBERS
      // ----------------------------------------------------------
         WORD$ ="0123456789"
         for lp=1 to len(WORD$)
            ThisCHR = mid(WORD$,lp)
            ASCII_MAP(ThisCHR)= 1001
         next

         // ----------------------------------------------------------
         //  STRING LITERAL 
         // ----------------------------------------------------------
         ASCII_MAP(34)   = 1002      //  STRING LITERAL block
      



      // ----------------------------------------------------------
      //  ALPHA NUMERIC WORDS 
      // ----------------------------------------------------------
      
         WORD$ ="0123456789_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
         for lp=1 to len(WORD$)
            ThisCHR = mid(WORD$,lp)
            ASCII_WORD( ThisCHR)   = ThisCHR
         next
         
      // ----------------------------------------------------------
      //  NUMERIC CHARACTERS (digits and decimal point)
      // ----------------------------------------------------------
   
         WORD$ ="0123456789."
         for lp=1 to len(WORD$)
            ThisCHR = mid(WORD$,lp)
            ASCII_NUMBER( ThisCHR)   = ThisCHR
         next

      endif

      TOKENS_COUNT=0

      for lp=1 to size
         
            ThisCHR = mid(InputTEXT$,lp)
            
            CHR_TYPE = ASCII_MAP(ThisCHR)
            
            //  DECODE RUN OF SAME TYPE OF CHARACTERS      
            if (CHR_TYPE>255)

                  EndPOS = lp   

                  // -------------------------------------------------
                  //  DECODE A WORD   
                  // -------------------------------------------------
                  if (CHR_TYPE=1000)

                     for ScanLP=lp to Size
                        SearchCHR    = mid(InputTEXT$,ScanLP)
                        if (ASCII_WORD(SearchCHR)=0) then exit
                        EndPOS = ScanLP
                     next
                     
                  endif

                  // -------------------------------------------------
                  //  DECODE A NUMBER   
                  // -------------------------------------------------
                  if (CHR_TYPE=1001)

                     for ScanLP=lp to Size
                        SearchCHR    = mid(InputTEXT$,ScanLP)
                        if (ASCII_NUMBER(SearchCHR)=0) then exit

Login required to view complete source code
