Simple Tokenizer

LemonWizard · August 29, 2016, 06:40:42 PM

OK, so there are two things I'm going to talk about.
First off, it looks like the extra zeros my old math parser generated at the end of an equation, and the wrong positions, were actually the result of giving the for loop 0 as the starting position. Starting at zero somehow makes it evaluate a value twice. This can be demonstrated by setting the start position in the first call to get_token to 0.
Instead of getting 3 as a result, it returns 33.
This is a weird bug I didn't know existed, and it was probably also part of the reason some of my other math parsing tests were acting haywire.

Secondly, here's a treat for everyone, because I've been racking my brain over this forever, thinking as hard as I could about all the documents I've read on how to make one of these.
The rule set I needed was as follows:
Group tokens into four arithmetic types (add, sub, mult, div), plus one other type, numerical.
At the start of the sequence, if I do not find an arithmetic token, keep adding characters to the token string until:
the end of the input string is reached, or a token that is not a number is found.
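The rule set above can be sketched in a few lines. This is a minimal illustration in Python rather than PlayBASIC, and the names `OPERATORS` and `tokenize` are made up for the example; it assumes, like the listing below, that anything that is not one of the four operators belongs to a number.

```python
# Hypothetical names; a sketch of the rule set, not the PlayBASIC program itself.
OPERATORS = {"+": "add", "-": "sub", "*": "mult", "/": "div"}

def tokenize(text):
    """Split text into operator names and runs of non-operator characters."""
    tokens = []
    number = ""
    for ch in text:
        if ch in OPERATORS:
            # An operator ends the number collected so far, if there is one.
            if number:
                tokens.append(number)
                number = ""
            tokens.append(OPERATORS[ch])
        else:
            # Assumption: every non-operator character is part of a number.
            number += ch
    if number:
        # End of input also ends the current number.
        tokens.append(number)
    return tokens

print(tokenize("3+50"))  # ['3', 'add', '50']
```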
I was doing this all wrong. Re-using old numerical positions inside a string and trying to keep track of it all, as Kevin advised me, is just too tedious.
I have been reading a lot, on and off, about old-style asm parsers and such.
Using what I learned from documentation and trial and error, I give you an OVERLY simplified tokenizer program.


; PROJECT : Tokenizer
; AUTHOR  : LemonWizard
; CREATED : 8/29/2016
; EDITED  : 8/29/2016
; ---------------------------------------------------------------------

mine$ = "1234testhello"
mine2$ ="1234testhello"

result = match_sequence(mine$, mine2$)

type token
value$
endtype


dim tokenlist$(5)
currenttoken=0

input$ = "3+50"

tokenlist$(1) = get_token(input$, 1)
print tokenlist$(1)
tokenlist$(2) = get_token(input$, 2)
print tokenlist$(2)
tokenlist$(3) = get_token(input$, 3)
print tokenlist$(3)


waitkey
waitnokey



//Basic maths
function add(value_a, value_b)

result = value_a + value_b	
	
endfunction result

function subtract(value_a, value_b)
	
result = value_a - value_b
	
endfunction result

function divide(value_a, value_b)

result = value_a / value_b
	
endfunction result

function multiply(value_a, value_b)

result = value_a * value_b
	
endfunction result

//end of basic maths


//Single character comparison
function match(character_s$, character_m$)
result = false

if character_s$ = character_m$ then result=true	
endfunction result


function getchar(inputstring$, pos)
if len(inputstring$) <1 
	exitfunction "null"
endif
	
character$ = mid$(inputstring$, pos, 1)
	
endfunction character$


//The step I was missing in my character testing was something similar to regex, where I can test each character case by case
	//and make sure that, while stepping through the source string, it perfectly matches the target string in EVERY way
		//In other words I have to re-write the mid$ function because it's not trustworthy.. -.-
function match_sequence(input_string$, match_string$)

result = false //Always assume false till proven true

input_length = len(input_string$)
match_length = len(match_string$)
//No need to compare character by character if the two lengths differ
if input_length <> match_length 
	result = false
	exitfunction result
endif

if input_length = match_length

misses = 0
for t=1 to input_length

match_chr$ = getchar(input_string$, t)
test_chr$ = getchar(match_string$, t)
if match_chr$ <> test_chr$ then inc misses
	
next t

//Any mismatch at all means the strings differ
if misses=0 then result=true	
endif
	
endfunction result

function get_token(input$, position)

token_$=""
steps = 0
for t=position to len(input$)
char$ = getchar(input$, t)

if steps=0
	ex=false
if char$ = "+"
	token_$="add"
	ex=true
endif
	
if char$ = "-" 
	token_$="sub"
	ex=true
endif
	
if char$ = "*" 
	token_$="mult"
	ex=true
endif
	
if char$ = "/" 
	token_$="div"
	ex=true
endif
	
	if ex=true
lastpos = t
exitfor
endif

endif

//Uh oh we found something that's another token so exit
	//now instead of searching for numbers
		//We just add whatever the character is
			//To the current token.
				//No more dealing with strings directly I hope
if steps > 0
	ex=false
if char$ = "+" then ex=true
if char$ = "-" then ex=true
if char$ = "*" then ex=true
if char$ = "/" then ex=true
lastpos = t
if ex=true then exitfor
endif

//Increment the steps taken here.
steps = steps + 1
//Add a character to our token after double-checking if
	//It equals anything already
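//Check all four names; assigning result four times in a row would only keep the last test
result = false
if match_sequence(token_$, "add") = true then result = true
if match_sequence(token_$, "sub") = true then result = true
if match_sequence(token_$, "mult") = true then result = true
if match_sequence(token_$, "div") = true then result = true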


if result=0
	//Now we only add the character to our token IF it's not add sub mult or div
		tokstring$ = token_$ + char$ //Local string char$ should have the last grabbed character which if it's not + - / * it's a number.
			lastpos = t //Save this
			token_$ = tokstring$
			char$=""
	
	endif

next t


//The function will only exit on a few conditions
	//Condition 1, We found a symbol
		//Condition 2, The length of the string ran out
			//Condition 3, the input string is empty
endfunction token_$ //return the token.

You know, it really is a treat to be able to present this in its elegant form.
Please enjoy and do comment. I'm going to be using this as the core of the math parser I am designing.

LemonWizard · August 29, 2016, 07:44:48 PM

A small update: I've added some code that actually properly extracts each token and cuts the string up.
I would never want to use this method; I'd rather keep the full source string intact.. but ehh, it's the only method that works.
I tried forever to get it to remember the last position it was at, so it could continue from there when capturing the next token. For some reason, using T from the for loop as the last position (since T starts in the for loop at the current position) does not work beyond the first two steps. The math is probably wrong or something, but using the last position from the for loop just results in the get_token function always receiving 3 as the last position. So it just gets stuck there.
Welp, there's the working token extraction.
And again.. I really, really, really hate having to chop the string up and always start at the first position just to simplify the functionality.
I feel like between function calls there SHOULD be a way to use the iterator from the for loop to preserve the last position and then re-use it in the next function call after the function returns. I will never understand why that doesn't work, but it would have been a wonderfully simple solution if it did. Oh well, this is still quite bare-bones and simple. All that matters is it does what it's supposed to, right? er.. ::) Enjoy
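One possible reading of the "stuck at 3" behaviour, judging only from the listing: when get_token returns an operator, lastpos is the operator's own position, so a call that starts again at lastpos reads the same operator forever. For "10+100+1000-500" the "+" sits at position 3. A minimal sketch of the position-preserving idea, written in Python rather than PlayBASIC and using a hypothetical `get_token` that returns the position *after* the token, might look like this:

```python
# Hypothetical sketch; names and the (token, next_pos) return shape are assumptions.
OPERATORS = {"+": "add", "-": "sub", "*": "mult", "/": "div"}

def get_token(text, pos):
    """Return (token, next_pos) for the token starting at 1-based pos."""
    ch = text[pos - 1]
    if ch in OPERATORS:
        # Step past the operator so the next call does not read it again.
        return OPERATORS[ch], pos + 1
    end = pos
    # Advance until an operator or the end of the string is reached.
    while end <= len(text) and text[end - 1] not in OPERATORS:
        end += 1
    return text[pos - 1:end - 1], end

problem = "10+100+1000-500"
pos = 1
tokens = []
while pos <= len(problem):
    tok, pos = get_token(problem, pos)
    tokens.append(tok)
print(tokens)
```

The source string stays intact; only the integer position moves between calls.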


; PROJECT : Tokenizer
; AUTHOR  : LemonWizard
; CREATED : 8/29/2016
; EDITED  : 8/29/2016
; ---------------------------------------------------------------------
//Our test problem

problem$ ="10+100+1000-500"

//Get the tokens

	//there was a problem with retaining the last position in the string
		//The simplest solution is to always start at a position of 1
			//and chop the string apart >.>
				//It's rude but it is the only solution I can get to work
repeat
sym=false
tok$, pos = get_token(problem$, 1)
if tok$="add" then sym=true
if tok$="sub" then sym=true
if tok$="mult" then sym=true
if tok$="div" then sym=true
if sym=true then lastlength=1
if sym=false then lastlength = len(tok$)

problem$ = cutleft$(problem$, lastlength)

print tok$

until problem$=""
sync
waitkey
waitnokey





//Basic maths
function add(value_a, value_b)

result = value_a + value_b	
	
endfunction result

function subtract(value_a, value_b)
	
result = value_a - value_b
	
endfunction result

function divide(value_a, value_b)

result = value_a / value_b
	
endfunction result

function multiply(value_a, value_b)

result = value_a * value_b
	
endfunction result

//end of basic maths


//Single character comparison
function match(character_s$, character_m$)
result = false

if character_s$ = character_m$ then result=true	
endfunction result


function getchar(inputstring$, pos)
if len(inputstring$) <1 
	exitfunction "null"
endif
	
character$ = mid$(inputstring$, pos, 1)
	
endfunction character$


//The step I was missing in my character testing was something similar to regex, where I can test each character case by case
	//and make sure that, while stepping through the source string, it perfectly matches the target string in EVERY way
		//In other words I have to re-write the mid$ function because it's not trustworthy.. -.-
function match_sequence(input_string$, match_string$)

result = false //Always assume false till proven true

input_length = len(input_string$)
match_length = len(match_string$)
//No need to compare character by character if the two lengths differ
if input_length <> match_length 
	result = false
	exitfunction result
endif

if input_length = match_length

misses = 0
for t=1 to input_length

match_chr$ = getchar(input_string$, t)
test_chr$ = getchar(match_string$, t)
if match_chr$ <> test_chr$ then inc misses
	
next t

//Any mismatch at all means the strings differ
if misses=0 then result=true	
endif
	
endfunction result

function get_token(input$, position)

token_$=""
steps = 0
for t=position to len(input$)
char$ = getchar(input$, t)

if steps=0
	ex=false
if char$ = "+"
	token_$="add"
	ex=true
endif
	
if char$ = "-" 
	token_$="sub"
	ex=true
endif
	
if char$ = "*" 
	token_$="mult"
	ex=true
endif
	
if char$ = "/" 
	token_$="div"
	ex=true
endif
	
	if ex=true
lastpos = t
exitfor
endif

endif

//Uh oh we found something that's another token so exit
	//now instead of searching for numbers
		//We just add whatever the character is
			//To the current token.
				//No more dealing with strings directly I hope
if steps > 0
	ex=false
if char$ = "+" then ex=true
if char$ = "-" then ex=true
if char$ = "*" then ex=true
if char$ = "/" then ex=true
lastpos = t
if ex=true then exitfor
endif

//Increment the steps taken here.
steps = steps + 1
//Add a character to our token after double-checking if
	//It equals anything already
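//Check all four names; assigning result four times in a row would only keep the last test
result = false
if match_sequence(token_$, "add") = true then result = true
if match_sequence(token_$, "sub") = true then result = true
if match_sequence(token_$, "mult") = true then result = true
if match_sequence(token_$, "div") = true then result = true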


if result=0
	//Now we only add the character to our token IF it's not add sub mult or div
		tokstring$ = token_$ + char$ //Local string char$ should have the last grabbed character which if it's not + - / * it's a number.
			lastpos = t//Save this
			token_$ = tokstring$
			char$=""
	
endif
next t


//The function will only exit on a few conditions
	//Condition 1, We found a symbol
		//Condition 2, The length of the string ran out
			//Condition 3, the input string is empty

endfunction token_$, lastpos //return the token.

kevin · September 13, 2016, 09:58:20 PM

In my parser there's a set of global variables that are contextual. So there's basically a structure / type that holds the 'current' pointers of where it's searching, what mode it's in, and so on.

There's a collection of wrapper functions which have nothing but optional parameters, for things like finding the next token (which fills in the token's type / location etc. within the source code), other similar functions that test what's next, and wrapper functions that read a run of tokens until a 'closing' condition is met.
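A rough sketch of that idea, in Python rather than PlayBASIC: one structure holds the current position and the details of the last token, and a function advances it. Everything here (`ScanContext`, `get_next_token`, the field names) is invented for illustration, and the context is passed explicitly instead of living in globals; it is not the actual parser's code.

```python
from dataclasses import dataclass

# Hypothetical operator table, reusing the token names from earlier in the thread.
OPERATORS = {"+": "add", "-": "sub", "*": "mult", "/": "div"}

@dataclass
class ScanContext:
    source: str
    pos: int = 0          # index of the next unread character
    token_type: str = ""  # type of the most recent token
    token_text: str = ""  # text of the most recent token
    token_start: int = 0  # where the most recent token began

def get_next_token(ctx):
    """Advance ctx past one token; return its type, or 'end' at end of input."""
    if ctx.pos >= len(ctx.source):
        ctx.token_type, ctx.token_text = "end", ""
        return ctx.token_type
    ctx.token_start = ctx.pos
    ch = ctx.source[ctx.pos]
    if ch in OPERATORS:
        ctx.pos += 1
        ctx.token_type = OPERATORS[ch]
    else:
        # Assumption: any run of non-operator characters is a number.
        while ctx.pos < len(ctx.source) and ctx.source[ctx.pos] not in OPERATORS:
            ctx.pos += 1
        ctx.token_type = "number"
    ctx.token_text = ctx.source[ctx.token_start:ctx.pos]
    return ctx.token_type
```

Because the position lives in the context, callers never pass or receive it; they just keep asking for the next token.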

So the parser is basically

ThisToken= GetNextToken()

Select ThisToken

case TOKEN_FOR

Parse_FOR_STATEMENT()

case TOKEN_NEXT

Parse_NEXT_STATEMENT()

endSelect

etc..

When decoding an expression, we just grab a run of tokens from the current position to the next line end or colon. There are times when there are other stop tokens as well, but that's generally the rule.
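Grabbing a run of tokens up to a stop token can be sketched like this. It is a Python illustration under simple assumptions: tokens are plain strings in a list, line ends are represented by "\n", and `read_run` is a made-up name.

```python
def read_run(tokens, start, stop_tokens=(":", "\n")):
    """Return (run, next_index): tokens from start up to, but not including,
    the first stop token (or the end of the list)."""
    end = start
    while end < len(tokens) and tokens[end] not in stop_tokens:
        end += 1
    return tokens[start:end], end

# Hypothetical token stream for the line:  a = 3 + 50 : print a
tokens = ["a", "=", "3", "+", "50", ":", "print", "a", "\n"]
run, nxt = read_run(tokens, 2)
print(run)  # ['3', '+', '50']
```

Passing a different `stop_tokens` tuple covers the cases where other stop tokens apply.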

Mock up lexical scanner routine
