Solved

a little Java lexer regex question

Posted on 2014-03-07
Last Modified: 2014-03-08
Hi guys,

I'm writing a little lexer for a simple calculator. Nothing fancy. I'm trying to learn, so I want to write it myself.

I have this pattern:

(?<SIN>(?!sin\\()([-]?[0-9.]+)(?=\\)))


That will match the number inside the parentheses of sin(x), where x is any number, so "sin(2.3)" will give me this token: ["2.3"]

That would be great, except that my matcher also catches parentheses with these expressions:

(?<LEFTPARENS>\\()|(?<RIGHTPARENS>\\))


So I end up with these tokens: ["(", "2.3", ")"] but I only want ["2.3"]

Is there a way to tell the matcher to skip the part of the string that is matched by another group?
Question by:Kyle Hamilton
16 Comments
 
LVL 31

Expert Comment

by:farzanj
ID: 39914187
I don't have your code, so I don't know exactly what you are trying to do.

But if you want to capture only what is inside the parentheses of sin, this works for me:

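    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // The class name is arbitrary; it only makes the snippet compile on its own.
    public class SinExample
    {
        public static void main(String args[])
        {
            // Group 1 captures the number between "sin(" and ")".
            Pattern p = Pattern.compile("sin\\((-?\\d+(?:\\.\\d+)?)\\)");
            String  s = "sin(2.3)";

            Matcher m = p.matcher(s);

            if (m.find())
            {
                System.out.println(m.group(1));
            }
        }
    }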

0
 
LVL 86

Expert Comment

by:CEHJ
ID: 39914348
It must be said that regex is not the right tool for the job. For instance, the last code doesn't match
 'sin(2.3 )'
 'sin(.23)'
and those are just very simple expressions. If your objective is to learn regex, then this is not really a good context in which to do it.
0
 
LVL 31

Expert Comment

by:farzanj
ID: 39914496
This would catch the leading-dot case, 'sin(.23)':
sin\\((-?\\d*(?:\\.\\d+)?)\\)
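
To see what each of the two patterns posted so far accepts, a small check like the following may help. It is only a sketch: the class and method names (`SinPatternCheck`, `argument`) are made up for illustration, and the two patterns are copied unchanged from the comments above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SinPatternCheck {
    // The two patterns from the comments above, copied unchanged.
    static final Pattern FIRST = Pattern.compile("sin\\((-?\\d+(?:\\.\\d+)?)\\)");
    static final Pattern SECOND = Pattern.compile("sin\\((-?\\d*(?:\\.\\d+)?)\\)");

    // Returns the captured argument, or null when the pattern finds nothing.
    static String argument(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] inputs = {"sin(2.3)", "sin(2.3 )", "sin(.23)", "sin()"};
        for (String s : inputs) {
            System.out.println(s + " -> first: " + argument(FIRST, s)
                    + ", second: " + argument(SECOND, s));
        }
    }
}
```

The first pattern finds only 'sin(2.3)'. The second also finds 'sin(.23)', but because every part of the number is now optional it accepts 'sin()' as well, with an empty capture. Neither pattern finds 'sin(2.3 )', since nothing in them allows whitespace.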
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 39914508
Yes, but beyond simple expressions, the approach really doesn't scale
0
 
LVL 25

Author Comment

by:Kyle Hamilton
ID: 39914533
The objective is to write a lexer/tokenizer for simple math expressions. AFAIK, regular expressions are one way to do that. I don't know of other ways, but what I don't want to do is write a pseudo state machine that reads the input letter by letter.

The expression for extracting the number from the sin function is not the issue.

The issue is that besides the sin expression, I have a parenthesis expression for picking up parentheses. My question is how to pick up parentheses, but not the ones already picked up by other expressions.


given this input string:

(1+2)*sin(2.3)

i want to end up with these tokens:

(, 1, 2, +, ), *, 2.3

I will post my whole pattern in a bit. I'm not at my computer.

thanks
0
 
LVL 31

Assisted Solution

by:farzanj
farzanj earned 250 total points
ID: 39914542
Hi CEHJ,  Just a question.  What is YACC and what is it based on?
0
 
LVL 25

Author Comment

by:Kyle Hamilton
ID: 39914546
or more precisely, these tokens:

LEFTPARENS: (
NUMBER: 1
OPERATOR: +
NUMBER: 2
RIGHTPARENS: )
OPERATOR: *
SIN: 2.3


(I had the order wrong in my previous post, and I don't want to confuse things: the plus sign should have come before the 2. Sorry.)
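
One way to get this token list with a single pattern is to let the SIN alternative consume the whole `sin(...)` call and report only an inner group. The sketch below assumes that approach; the class name `CalcLexerSketch`, the extra group name `SINARG`, and the exact number and operator sub-patterns are illustrative choices, not the pattern from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CalcLexerSketch {
    // SIN is listed first so it consumes "sin(", the number and ")" in one match.
    // Only the inner group SINARG (a name invented for this sketch) is reported.
    static final Pattern TOKENS = Pattern.compile(
            "(?<SIN>sin\\((?<SINARG>-?[0-9]*\\.?[0-9]+)\\))"
          + "|(?<NUMBER>[0-9]*\\.?[0-9]+)"
          + "|(?<OPERATOR>[-+*/])"
          + "|(?<LEFTPARENS>\\()"
          + "|(?<RIGHTPARENS>\\))");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKENS.matcher(input);
        // find() resumes after the end of the previous match, so characters
        // consumed by SIN are never offered to the parenthesis groups.
        while (m.find()) {
            if (m.group("SIN") != null) {
                tokens.add("SIN: " + m.group("SINARG"));
            } else if (m.group("NUMBER") != null) {
                tokens.add("NUMBER: " + m.group("NUMBER"));
            } else if (m.group("OPERATOR") != null) {
                tokens.add("OPERATOR: " + m.group("OPERATOR"));
            } else if (m.group("LEFTPARENS") != null) {
                tokens.add("LEFTPARENS: " + m.group("LEFTPARENS"));
            } else {
                tokens.add("RIGHTPARENS: " + m.group("RIGHTPARENS"));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("(1+2)*sin(2.3)")) {
            System.out.println(token);
        }
    }
}
```

For the input `(1+2)*sin(2.3)` this produces the seven tokens listed above, in the same order. Two caveats: `find()` silently skips any character that no alternative matches, so a real lexer would want to detect gaps between matches, and this only handles a plain number inside `sin(...)`, not a nested expression.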
0
 
LVL 31

Expert Comment

by:farzanj
ID: 39914558
Something like:
(\\()?(\\d+([-+*\\/]\\d+)*)(\\))?[*\\/]sin\\((-?\\d*(?:\\.\\d+)?)\\)

0
 
LVL 25

Author Comment

by:Kyle Hamilton
ID: 39914584
Maybe I'm gonna need that FSM after all. Looks like I was skipping the "scanner" phase of the tokenization process and going straight to the "evaluator" phase.

tokenization section:
http://en.m.wikipedia.org/wiki/Lexical_analyzer
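
For reference, the "scanner" phase for a single token type can be a very small state machine. The following is a rough sketch for number lexemes only; the class name, the state names and the `scanNumber` method are hypothetical and not taken from the Wikipedia article or from any posted code.

```java
class NumberScannerSketch {
    // Hypothetical state names for this sketch.
    enum State { START, INT_PART, DOT, FRACTION }

    // Returns the index just past the longest number lexeme starting at 'from',
    // or 'from' itself when no number starts there.
    static int scanNumber(String input, int from) {
        State state = State.START;
        int lastAccept = from; // end of the longest valid lexeme seen so far
        for (int i = from; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean digit = c >= '0' && c <= '9';
            switch (state) {
                case START:
                    if (digit) state = State.INT_PART;
                    else if (c == '.') state = State.DOT;
                    else return lastAccept;
                    break;
                case INT_PART:
                    if (digit) { /* stay in INT_PART */ }
                    else if (c == '.') state = State.DOT;
                    else return lastAccept;
                    break;
                case DOT:
                case FRACTION:
                    if (digit) state = State.FRACTION;
                    else return lastAccept;
                    break;
            }
            // INT_PART and FRACTION are the accepting states.
            if (state == State.INT_PART || state == State.FRACTION) {
                lastAccept = i + 1;
            }
        }
        return lastAccept;
    }

    public static void main(String[] args) {
        String input = "sin(2.3)";
        int end = scanNumber(input, 4);
        System.out.println(input.substring(4, end));
    }
}
```

The scanner only decides where a lexeme ends. Turning the substring into a value, for example with `Double.parseDouble`, would be the job of the "evaluator" phase.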
0
 
LVL 31

Expert Comment

by:farzanj
ID: 39914598
I don't know what you are trying to do. I answered whatever you asked. Regular expressions are implemented as FSMs. This is how compiler front ends are written. YACC is a tool used to write compilers: it generates parsers for programming languages from a BNF-like grammar. It is usually paired with Lex, which builds the scanner from regular expressions.
0
 
LVL 25

Author Comment

by:Kyle Hamilton
ID: 39914610
hi farzanj,

I appreciate all the help. I'm sorry if my question is not clear - I should have given it a different title. My question doesn't have to do with regex, it has to do with lexical analysis and tokenization.

I'm aware that regular expressions are implemented with FSMs. When I mentioned the FSM before, I was not referring to a regex engine implementation. I meant the "scanner" phase of the tokenization process, which employs its own FSM.

For now, I decided not to do everything in one step: I catch the entire sin(x) call as one token, then process it again later to extract the number. To do this whole project "properly", I would rewrite it according to the Wikipedia page I posted earlier.

My code is on github, if that helps:
https://github.com/kyleiwaniec/cos210/blob/master/Spring2014/Calculator/InfixToPostfix.java

with this sample input:

(2+3)*sin(2.3)

I now get:

OPERATOR : (
NUMBER : 2
OPERATOR : +
NUMBER : 3
OPERATOR : )
OPERATOR : *
SIN : sin(2.3)  // process again to extract number


(I am not trying to write a full-fledged lexer/parser. Just something small for a very simple calculator. At the moment all the code lives in one file, but that's just for convenience.)
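
The second pass over the `SIN` token described above could look roughly like this. It is a sketch under assumptions, not the code from the GitHub file: the class name `SinSecondPass`, the method `sinArgument` and the number sub-pattern are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SinSecondPass {
    // Matches a complete "sin(<number>)" lexeme and captures the number.
    static final Pattern SIN_CALL =
            Pattern.compile("sin\\((-?[0-9]*\\.?[0-9]+)\\)");

    // Extracts the numeric argument from a lexeme such as "sin(2.3)".
    static double sinArgument(String lexeme) {
        Matcher m = SIN_CALL.matcher(lexeme);
        // matches() requires the whole lexeme to fit, unlike find().
        if (!m.matches()) {
            throw new IllegalArgumentException("not a sin(...) lexeme: " + lexeme);
        }
        return Double.parseDouble(m.group(1));
    }

    public static void main(String[] args) {
        double x = sinArgument("sin(2.3)");
        System.out.println(Math.sin(x));
    }
}
```

Using `matches()` here rather than `find()` means a malformed lexeme is rejected instead of being partially accepted, which seems the safer default when the first pass has already isolated the token.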
0
 
LVL 86

Accepted Solution

by:
CEHJ earned 250 total points
ID: 39914856
You might like to look at https://javacc.java.net/, though I haven't used it myself.
0
 
LVL 25

Author Comment

by:Kyle Hamilton
ID: 39914902
Thanks CEHJ.

That's much more than what I was looking for. I wanted to write a lexer from scratch - a very basic one.

I think I better close this question. I didn't phrase it properly, and it's probably too broad a question anyway.
0
 
LVL 25

Author Closing Comment

by:Kyle Hamilton
ID: 39915001
I'm assigning points this way because it led me to try to clarify my own question in my own mind. Thanks for the help.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 39915026
OK. Maybe you can give me some lessons on it once you're au fait ;)
0
 
LVL 25

Author Comment

by:Kyle Hamilton
ID: 39915032
lol - don't hold your breath!
:))
0
