Tokenize String - Simple String Lexer

Started by kevin, April 12, 2023, 09:31:46 AM


kevin



In summary, this code does the following:

 - The code defines a user-defined type named tLexToken that has three fields: TOKEN$, TOKEN_TYPE, and INDEX.

 - The code dimensions an array named TOKENS, declared as Dim TOKENS(256), to hold instances of tLexToken.

 - The code creates a string variable named Message$ that contains the text to be tokenized. The text has some words, some numbers, a quoted string literal, and an arithmetic expression.

 - The code calls a function named TOKENIZE_STRING that takes two arguments: the string to be tokenized and the array of tLexToken instances that receives the tokens.

 - On its first call, guarded by the static flag _INIT_LOCALS, TOKENIZE_STRING builds three lookup arrays indexed by ASCII code: ASCII_MAP gives the token type that a character starts, ASCII_WORD marks the characters allowed inside a word, and ASCII_NUMBER marks the characters allowed inside a number. On every call the function also resets a variable named TOKENS_COUNT to zero.

 - The TOKENIZE_STRING function loops over each character in the input string and looks up its ASCII value in ASCII_MAP to determine its token type. For words and numbers it then scans ahead through ASCII_WORD or ASCII_NUMBER, groups the consecutive characters of that type into a single token, and adds it to the TOKENS array.

 - The TOKENIZE_STRING function returns the number of tokens that it has created.

 - The main code then loops over the returned tokens and prints each token's index, type, and text. Finally, the code waits for the user to press a key before terminating.
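
For readers who do not have PlayBASIC to hand, the table-driven idea summarized above can be sketched in Python. This is only an illustration under stated assumptions, not the code from this post: the names build_tables, lookup and tokenize are made up for the sketch, the class codes 1000/1001/1002 simply mirror the values used in the listing, and the handling of single characters and of string literals is a guess, since that part of the original source is not shown.

```python
# Hypothetical Python sketch of a table-driven lexer. All names are illustrative.
WORD, NUMBER, QUOTE = 1000, 1001, 1002  # class codes mirroring the listing

LETTERS = "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS = "0123456789"


def build_tables():
    """Build three 256-entry lookup tables indexed by character code."""
    ascii_map = [0] * 256      # class of the token a character starts
    ascii_word = [0] * 256     # non-zero if the character may appear in a word
    ascii_number = [0] * 256   # non-zero if the character may appear in a number
    ascii_map[9] = 9           # TAB maps to its own code
    for ch in " .()+-/*=":     # single characters map to their own code
        ascii_map[ord(ch)] = ord(ch)
    for ch in LETTERS:
        ascii_map[ord(ch)] = WORD
    for ch in DIGITS:
        ascii_map[ord(ch)] = NUMBER
    ascii_map[34] = QUOTE      # double quote opens a string literal
    for ch in DIGITS + LETTERS:
        ascii_word[ord(ch)] = ord(ch)
    for ch in DIGITS + ".":
        ascii_number[ord(ch)] = ord(ch)
    return ascii_map, ascii_word, ascii_number


ASCII_MAP, ASCII_WORD, ASCII_NUMBER = build_tables()


def lookup(table, ch):
    """Return the table entry for ch, or 0 for characters outside 0..255."""
    code = ord(ch)
    return table[code] if code < 256 else 0


def tokenize(text):
    """Return a list of (token_type, token_text, start_index) tuples."""
    tokens = []
    pos = 0
    while pos < len(text):
        kind = lookup(ASCII_MAP, text[pos])
        end = pos + 1
        if kind == WORD:
            # Extend the run while characters are valid inside a word.
            while end < len(text) and lookup(ASCII_WORD, text[end]):
                end += 1
        elif kind == NUMBER:
            # Extend the run while characters are digits or a decimal point.
            while end < len(text) and lookup(ASCII_NUMBER, text[end]):
                end += 1
        elif kind == QUOTE:
            # Assumption: a literal runs to the next quote, or to the end of text.
            close = text.find('"', pos + 1)
            end = len(text) if close < 0 else close + 1
        # Assumption: every other character becomes a one-character token.
        tokens.append((kind, text[pos:end], pos))
        pos = end
    return tokens


if __name__ == "__main__":
    for index, (kind, value, start) in enumerate(tokenize('ab1 12.5+"hi"')):
        print(f"#{index} ({kind})={value}")
```

The point of the lookup tables is that classifying a character is a single array read rather than a chain of comparisons, and the tables only need to be built once. Note that a number table containing "." will accept a run such as 1.2.3 as one token, so a real lexer would need a further check.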


PlayBASIC Code:
Type tLexToken
    TOKEN$
    TOKEN_TYPE
    INDEX
EndType

Dim TOKENS(256) as tLexToken

Message$=" This is the string I want to tokenize. 1000 * 2000"
Message$+= chr$(34)+"String Literal"+chr$(34)
Message$+= "12+4+5+6"
print Message$

Count=TOKENIZE_STRING(Message$, TOKENS())
For lp =0 to Count-1
    s$ ="("+str$(Tokens(lp).TOKEN_TYPE)+")"
    s$+="="+Tokens(lp).TOKEN
    print "#"+str$(lp)+" "+s$
next

Sync
waitkey





// --------------------------------------------------------------------------------------------------
// --------------------------------------------------------------------------------------------------
// -------------------------------------->> TOKENIZE STRING <<---------------------------------------
// --------------------------------------------------------------------------------------------------
// --------------------------------------------------------------------------------------------------


function TOKENIZE_STRING(InputTEXT$, TOKENS().tLexToken)

    Size =len(InputTEXT$)

    // Build the lookup tables on the first call only
    Static _INIT_LOCALS
    if _INIT_LOCALS=false

        _INIT_LOCALS=true

        dim ASCII_MAP(255)
        dim ASCII_WORD(255)
        dim ASCII_NUMBER(255)

        // ----------------------------------------------------------
        // Map common ASCII characters
        // ----------------------------------------------------------
        ASCII_MAP(9) = 9 // TAB character

        // ----------------------------------------------------------
        // MAP CHARACTERS (each maps to its own ASCII code)
        // ----------------------------------------------------------
        WORD$ =" .()+-/*="
        for lp=1 to len(WORD$)
            ThisCHR = mid(WORD$,lp)
            ASCII_MAP(ThisCHR)= ThisCHR
        next


        // ----------------------------------------------------------
        // WORDS UPPER / LOWER CASE with UNDERSCORES
        // ----------------------------------------------------------
        WORD$ ="_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
        for lp=1 to len(WORD$)
            ThisCHR = mid(WORD$,lp)
            ASCII_MAP(ThisCHR)= 1000
        next


        // ----------------------------------------------------------
        // NUMBERS
        // ----------------------------------------------------------
        WORD$ ="0123456789"
        for lp=1 to len(WORD$)
            ThisCHR = mid(WORD$,lp)
            ASCII_MAP(ThisCHR)= 1001
        next

        // ----------------------------------------------------------
        // STRING LITERAL
        // ----------------------------------------------------------
        ASCII_MAP(34) = 1002 // STRING LITERAL block (double quote)


        // ----------------------------------------------------------
        // ALPHA NUMERIC WORDS (characters allowed inside a word)
        // ----------------------------------------------------------

        WORD$ ="0123456789_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
        for lp=1 to len(WORD$)
            ThisCHR = mid(WORD$,lp)
            ASCII_WORD( ThisCHR) = ThisCHR
        next

        // ----------------------------------------------------------
        // NUMERIC CHARACTERS (digits and decimal point)
        // ----------------------------------------------------------

        WORD$ ="0123456789."
        for lp=1 to len(WORD$)
            ThisCHR = mid(WORD$,lp)
            ASCII_NUMBER( ThisCHR) = ThisCHR
        next

    endif

    TOKENS_COUNT=0

    for lp=1 to size

        ThisCHR = mid(InputTEXT$,lp)

        CHR_TYPE = ASCII_MAP(ThisCHR)

        // DECODE RUN OF SAME TYPE OF CHARACTERS
        if (CHR_TYPE>255)

            EndPOS = lp

            // -------------------------------------------------
            // DECODE A WORD
            // -------------------------------------------------
            if (CHR_TYPE=1000)

                for ScanLP=lp to Size
                    SearchCHR = mid(InputTEXT$,ScanLP)
                    if (ASCII_WORD(SearchCHR)=0) then exit
                    EndPOS = ScanLP
                next

            endif

            // -------------------------------------------------
            // DECODE A NUMBER
            // -------------------------------------------------
            if (CHR_TYPE=1001)

                for ScanLP=lp to Size
                    SearchCHR = mid(InputTEXT$,ScanLP)
                    if (ASCII_NUMBER(SearchCHR)=0) then exit