Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people, just like you, are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
Solved

Help building a "lineup-parser"

Posted on 2011-02-16
22
545 Views
Last Modified: 2013-11-18
Hi,

I'm currently working on a Flash/Flex based website in which users can post Events.

An Event has a description and a list of (dance-)floors (each floor has a name). On each of these floors a list of artist can be added that perform on that floor.

In a first approach I created a text-field for the description and a list of lists for the floors and lineup with thousands of add-/delete-buttons. Unfortunately this was really anoying to fill with information and looked really crappy, so I had the Idea of simply adding a big TextArea for the description and adding BBCode like tags for entering the floors and lineups.

Here an example:
Some description text ... blah blah blah ... some more text ... 
blah blah blah 

blah

[Lineup]
[Floor: Uplifting Floor]
Talla 2XLC
Bluefire
LXD
[Floor: Classics Floor]
Tube & Miller
Charly the Diggerman 
[/Lineup]

blah ... blah blah blah ...

Open in new window


The options I was thinking about were an Antlr grammar for this simple language and having a parser generated for this (The generation part is easy, just the grammar is hard for me). The other alternative would be a set of regular expressions and a manually created parser.

Unfortunately my brain seems to be incompatible with formal language definitions and regular expressions.

Does anyone have a hint to a really simple antlr-grammar, that I could use to get me started (Grammar that parses the experts-exchange "code" blocks, for example) or could someone please provide me some sample regular expressions that I could get started with (Looking for something that has a result ["text before the lineup", "content of the lineup", "text after the lineup closing tag"] or ["Floor-Name", "Lineup content"])

I'm not just lazy ... I spent a lot of hours on my antlr and regexp path ... as I mentioned ... My brain seems to be incompatible with this formal thinking :-( (If some one could lend me a brain that can do this, I'd spend an extra 2000 Points ;-) )
0
Comment
Question by:ChristoferDutz
  • 11
  • 7
  • 2
  • +2
22 Comments
 
LVL 27

Expert Comment

by:BigRat
ID: 34914937
[Lineup]
[Floor: Uplifting Floor]
Talla 2XLC
Bluefire
LXD
[Floor: Classics Floor]
Tube & Miller
Charly the Diggerman
[/Lineup]


<?xml version="1.0"?>
<Event>
   <Description>This is the major event of the year,/description>
   <LineUp>
      <Floor name="Uplifting Floor">
            <artist>Talla 2XLC</artist>
            <artist>Bluefire</artist>
            <artist>LXD</artist>
    </Floor>
    <Floor name="Classics Floor">
            <artist>Tube & Miller
            <artist>Charly the Diggerman
    </Floor>
   </LineUp>
</Event>

Doesn't Event need a time and a date? Doesn't the artist need a position? (Rather than rely on the lexical order)

There are SO MANY tools around to process XML it is hardly worthwhile maing your own data format.
0
 
LVL 27

Expert Comment

by:BigRat
ID: 34914947
Ooops, in the second Floor element I have forgotten to add the end artist tag </artist>. Sorry!
0
 
LVL 20

Author Comment

by:ChristoferDutz
ID: 34914966
Well processing XML would be easy. The text posted above should only deal with the Lineup and the Description. All other fields are dealt with seperately (with nice date-selectors and automatically initialized fields).

The thing is that I don't want my users to write complicated XML (I know xml is not hard, but users tend to run away if you present it to them or ask them to write it). I just want them to paste an event description they found somewhere and to easily modify the text so the system can process it.
0
Announcing the Most Valuable Experts of 2016

MVEs are more concerned with the satisfaction of those they help than with the considerable points they can earn. They are the types of people you feel privileged to call colleagues. Join us in honoring this amazing group of Experts.

 
LVL 27

Expert Comment

by:BigRat
ID: 34915009
Your users will use an HTML form to paste their events. In this form you'll have fields for the event name/description, time and place, and an artist list. Now the artist list can be simply a text box where you ask that the put one artist to a line. Now whether you need to sort out the exact list is another matter. More importantly the event name, time and place ought to be captured in separate boxes.

When the data arrives at the server, one simply checks in an existing XML file, which serves as a database, whether the event is duplicated and if not you add it. You could store the artist list, out of the textbox, in a CDATA section under Floor so as to preserve the line structure.

But parsing free text entered by humans is fraught with difficulties, error handling and such. One needs to identify those significant stuctural pieces of information, establish integrity conditions and store them appropiately.
0
 
LVL 20

Author Comment

by:ChristoferDutz
ID: 34915177

In a first approach I created a text-field for the description and a list of lists for the floors and lineup with thousands of add-/delete-buttons. Unfortunately this was really anoying to fill with information and looked really crappy, so I had the Idea of simply adding a big TextArea for the description and adding BBCode like tags for entering the floors and lineups.

First off ... I'm not using HTML, but Flex/Flash. I allready created a soulution that built my Event-Floor-Linup model in the client. The thing is that there are events with 30 Artists on 5 Floors. Using the manual approach woud make it nessecary to click thousands of times, do thousands of copy+pastes. I don't want that. I want the users to paste the Description from the offial press releases and do a little minimal update of the text and the system interprets it.

No matter if this approach is good or bad ... could we get back to my question? Please?

0
 
LVL 27

Expert Comment

by:BigRat
ID: 34915292
>>Please?

I did actually read your Flash/Flex section and the buttons and so on, which I though was very complicated. I wouldn't use such a technology because I think it is too complicated. But since you said please, I'll ask how far you've gotten with the grammar. Perhaps I can be of help.
0
 
LVL 20

Author Comment

by:ChristoferDutz
ID: 34915375
Thanks :-)

Well at first I started looking for existing grammars for Wikipedia-Style Notation. I actually found some ... one was called Creole. Unfortunately this grammar was far too complex for my purposes and the initial version could not be processed by antlr because of problems with unambiguous paths when parsing certain inputs.

I have really tried getting my head around this grammar stuff and really like is (My antlr-generated mathematical expression evaluator works great in Java and Flex). But it seems free text is much more complicated to write such a grammar for.

What I'm currently looking for, is a grammar that would parse my sampe to the following tree:

Root
- Text ("Some description text ... blah blah blah ... some more text ..." ....)
- Lineup
  - Floor
    - Name ("Uplifting Floor")
    - Text ("Talla 2XLC")
    - Text ("Bluefire")
    - Text ("LXD")
  - Floor 
    - Name ("Classics Floor")
    - Text ("Tube & Miller")
    - Text ("Charly the Diggerman") 

Open in new window


PS: This technology is actually far less complex than most people think and I can do things the HTML guys wont be able to ... it's really easy as soon as you start thinking of Web-Applications differently. I admit writing Flash sucks ... but Flex rocks ;-)
0
 
LVL 20

Author Comment

by:ChristoferDutz
ID: 34915385
I would be glad to have the following as a starting point:
Root
- Text ("Some description text ... blah blah blah ... some more text ..." ....)
- Lineup
  - Text ("[Floor: Uplifting Floor]\nTalla 2XLC\nBluefire\nLXD\n[Floor: Classics Floor]\nTube & Miller\nCharly the Diggerman
- Text ("blah ... blah blah blah ...") 

Open in new window


Unfortunately I forgott the "blah" ... line in my last example.
0
 
LVL 27

Assisted Solution

by:BigRat
BigRat earned 250 total points
ID: 34915657
An Antlr grammar is going to be very difficult because of the free text that appears between some of the keywords. The Antlr system is two level - namely it first tokenizes the input text (that is breaks them into little chunks) and then parses the tokens into a tree.

You'd be much better off reading the data line by line and having a couple of routines like "startswith" and "contains"  and so on to do the lexical work and using a simple state machine to make the decisions. Thus the thing would look like :-

state = DESCRIPTION
for each line l in inputfile do
   if l = '[/LineUp]' then
       -- end of lineup
       if state != FLOOR then
           -- error incorrectly placed end tag
      else
           -- save description and lineup nodes
           state = DESCRIPTION
       end if
   else if l='[Lineup] then
       state = LINEUP
      -- create and append a new lineup node
  else if l = '[Floor:' then
        if state!= LINEUP then
               -- error misplaced FLOOR start tag
        else
             -- create a new floor node onto the end of the current list
             -- save the name part from the line l after the colon before the ]
             state = FLOOR
        end if
   else if l startswith '[' then
        -- error case keyword probably misspelt
   else
        -- line is text
        case state of
        DESCRIPTION
            -- append l onto the current description node
        LINEUP
            -- error - there should be NO text between lineup and floor
       FLOOR
           -- append a new text node to the last floor node for this artist
       end case
   end all ifs
end for
-- here the state must be DESCRIPTION and the text of it must be empty, otherwise something is missing

The capitialized words are in fact just constants which say in which state the machine is. The lines starting with -- are comments, the rest I hope is obvious.
0
 
LVL 20

Author Comment

by:ChristoferDutz
ID: 34915787
Yeah ... well that would have been my ultimate fallback. But actually I am sort of trying to dig deeper into formal languages.
I am expecting my format to become more and more complex over time and maintaining such a manual parser does not seem to the the path that I want to go. Here again my question in a simpler way:

Could someone please post a antlr-grammar to parse this input:
Text1
Text2
[Lineup]
Text3
Text4
[/Lineup]
Text5

Open in new window


To the following tree:
root
- Text (Text1)
- Text (Text2)
- Lineup
  - Text (Text3)
  - Text (Text4)
- Text (Text5)

Open in new window


Or alternatively use a regular expression to split it up into thiese groups:
{Text1\nText2}{Text3\nText4}{Text5}

Open in new window

Meaning: {"Stuff before Lineup", "Stuff inside the Lienup Tags", "Stuff after the closing Lineup Tag"}

And I am really trying to solve this using an Antlr grammar and regular expressions and I don't want to do it differently. I actually want to try both approaches as an excercise for me in order to dig into these topics.
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34916187
@BigRat

>>  An Antlr grammar is going to be very difficult because of the free text that appears between some of the keywords.


For the benefit of the discussion, could you expand upon why it's difficult? I'm not immediately seeing what you mean  = )
0
 
LVL 20

Author Comment

by:ChristoferDutz
ID: 34916241
I know this must be really really easy, but as I mentioned, I have tried digging into writing antlr grammars a few times now. It seems to be difficult only for me :-(

Mabe I allways missed the basic concept of these grammars. But if you think googling will provide you with such an easy sample-grammar, you will propably fail (At least I did). All grammars I could find are all way too complex to even start understanding them (if you are a beginner with this).
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34916316
I've never used Antlr before, but I have had to write a recursive descent parser for class, which I thought we were told that's one of the things Antlr does. In the event that it is truly a pain to develop and Antlr grammar, I would think writing your own recursive descent parser wouldn't be that difficult.
0
 
LVL 14

Expert Comment

by:tomaugerdotcom
ID: 34916389
Hey Chris, I know this is going to feel like it's setting your question all the way back to the beginning, but why do you not want to go with the XML approach?

The only difference that I can see between the BBCode-style approach and XML 1.0 is using <> instead of [ ]. It seems like exactly the same amount of work to me. I'm probably missing something because I know you know all about XML and must have decided not to go with that for some reason, but you state simplicity for your end users.

I would argue that it's simpler to use an existing standard for which there is tons of documentation and references, that you can validate if you write a proper DTD, and that parses natively using E4X, rather than trying to come up with something that's syntactically identical but using a different escape token ("[" instead of "<").

Now, going off in a different direction here - one thing that you might look at is simply using linebreaks as your separator. It looks like your structure is pretty formal and immutable. You can use a special token to indicate the start of a record, and then just make it such that each "record" follows the same format.
Lineup
%FLOOR%
Uplifting Floor
Talla 2XLC
Bluefire
LXD
%FLOOR%
Classics Floor
Tube & Miller
Charly the Diggerman

Open in new window


Alternatively, you might even just use a double return to separate your records like this:
Uplifting Floor
Talla 2XLC
Bluefire
LXD

Classics Floor
Tube & Miller
Charly the Diggerman

Open in new window


Okay, maybe this is not where you wanted to go, so forgive me since I'm coming late to the party here,   but some food for thought.

(*leaves to look up "Antir Grammar" on wikipedia)
0
 
LVL 20

Author Comment

by:ChristoferDutz
ID: 34916399
Ok ... to explain my pain a little more ;-)

I am working on a private project, which consists of a Flex/Flash based client and a Java Server. Now I wanted to secure my application with a user/group/role security-model in which I can provide permissions as boolean expressions. These are used by some Aspects to secure things on the server-side. Now I wanted to avoid the client bombarding the server with "can the user do this and that" questions, so I needed the same evaluation-logic on the client too. As I didn't want to reimplement the same logic twice and maintain it (which would be the real pita), I used Antlr and let it generate a Java and an ActionScript evaluator. If I change anything in the syntax of my security-language when doing my maven build antlr will automatically write both evaluators for me and they will be in sync perfectly.

So much for the introduction.

My system should be able to parse these event texts in the client, but also a Server-Side tool should be able to parse events posted by email so again I would have to double implement these parsers.

This is why I wanted to stick with antlr and a grammar for my simplistic event-language.
0
 
LVL 20

Author Comment

by:ChristoferDutz
ID: 34916454
@tomaugerdotcom

The problem is the user ... I know they will use [ and ] within their description texts and I know the pasted texts will contain linebreaks and different types of formatting. a "stupid" replacement won't bring me far. Especially if a user does something like this:
Really cool event in the Club XYZ [Formerly Known as Club ABC]. We woud be glad to see you!

Open in new window

In this case the entire editor would fail and my users would become frustrated very soon. A parser as the one I am talking about would be way more robust ... and keep my last post in mind ... I don't want to implement this multiple times.
0
 
LVL 27

Assisted Solution

by:BigRat
BigRat earned 250 total points
ID: 34916647
It's really a problem of two identical pieces of text being different. In the description there are multiple lines separated by CR/LF (presumably) which make a whole. But in the artist list the same text must be split linewise. In other words the lexical parsing is context dependant.

What is problematic with the formal approach is the lack of real error processing. Basically a RegEx would split it up like this   (^\[)*\[Lineup]((\r\n)*\[Floor:( )*([a-z,A-Z])+\](\r\n)((^[\r,\n])+\r\n)+)\[\/Lineup\]\r\n which is just an attempt to tackle the syntax without using fancy features and ignoring upepr/lower case issues. Either this matches the input or it doesn't and when it doesn't there's no telling why.

In the very old days state machines were used instead of formal parsers which came into usage after round 1972 (I remember Franklin de Remer explaining the method at a conference in that year). I have written a full xml parser and xpath processor just using a simple state machine. On the other hand I have also written a full LAR parser generator (that is an LR(1) parser - that from YACC and Bison are in fact LALR(1) - with a regular expression resolver to resolve LR ambiguities. Error recovery - ie: tolerance of user input gets reduce to syntax error and if you are lucky at the place where it exactly is, although sometimes several characters further. The user is forced to change his input and try again.

With a state machine one can take appropiate action when the syntax is broken but in a way that recovery is possible - here a missing [/Lineup] tag. If nothing follows in the file it does not matter.
0
 
LVL 20

Author Comment

by:ChristoferDutz
ID: 34916716
My idea would be that the system takes the input and parses it. If it is able to, it interprets it, if there are errors the entire input is treated as a description.

Well I think the formal approach is not at all problematic regarding error processing. As an example the security-evaluator understands boolean expressions on the lowest level, but is does understand full mathematical expressions above that "(5 * 6 / [user.id]) > 100" (Only the result must be a boolean) and it indeed detects errors if I do something bad "5 * / 6" or "(5 * 6 / [user.id) > 100" (notice the missing "]") or "42" (result is not a boolean).

Seems I will have to ask the Antlr mailinglist. Just thought this would be really simple :-(
0
 
LVL 11

Accepted Solution

by:
petiex earned 250 total points
ID: 34916777
Here's a regexp implementation. I didn't bother with any exception checking, like when there is no [Floor] or [Lineup], but you can add that as the need emerges:
<?xml version="1.0" encoding="utf-8"?>
<mx:Application xmlns:fx="http://ns.adobe.com/mxml/2009"
                xmlns:s="library://ns.adobe.com/flex/spark"
                xmlns:mx="library://ns.adobe.com/flex/mx">
    <fx:Script>
     <![CDATA[

        private var wholeRegExp:RegExp = /([\s\S]*)(\[Lineup\])([\s\S]*)(\[\/Lineup\])([\s\S]*)/m;
        private var floorRegExp:RegExp = /\[Floor:([^\]]+)\]([\s\S]?)/g;

        private function translate():void {
              var result:Object = wholeRegExp.exec(testTextArea.text);
            resultTextArea.htmlText = result[1]+"<br/><b>LINEUP</b><br/>";
            resultTextArea.htmlText += String(result[3]).replace(floorRegExp, "Floor<br/>  Name: $1<br/>$2<br/>");
            resultTextArea.htmlText += "<br/>After lineup: "+result[5];
        }
     ]]>
</fx:Script>
    <mx:TextArea id="testTextArea" width="300" height="100">
        <mx:text><![CDATA[
Some description text ... blah blah blah ... some more text ...
blah blah blah

blah

[Lineup]
[Floor: Uplifting Floor]
Talla 2XLC
Bluefire
LXD
[Floor: Classics Floor]
Tube & Miller
Charly the Diggerman
[/Lineup]

blah ... blah blah blah ...
            ]]>
    </mx:text></mx:TextArea>
    <mx:Button click="translate()" label="Translate"/>
    <mx:TextArea width="300" height="100" id="resultTextArea"/>
</mx:Application>

Open in new window

0
 
LVL 27

Expert Comment

by:BigRat
ID: 34916877
>>Just thought this would be really simple :-(

It would be if you thought like that. I'd turn the problem around and make the input soooooo simple that it would be easy to use. The common case must be the easy case. People can always use the multiple easy case for the complex. The input should look like :-

Event:
   Meat Loaf comes to town.....................
His band includes ........................
LineUp:
Floor: UpperStory
Meatloaf
The artist formerly known as Rat
Floor: Basement
Dave Brubeck and the late Dudley Moore

You don't need any formal grammar. The header is always one word followed by a colon and CR/LF.
You don't need any end tags, which saves finding out if they are there and erroring when not. I don't even know why one must have LineUp. I would have thought that date, time and place were more important :-

Event:
   Meat Loaf comes to town.....................
His band includes ........................
Date:
   22.10.2011
Time:
   20:30
Place:
Rat's House
Rathouseflats
Darmstadt
Floor: UpperStory
Meatloaf
The artist formerly known as Rat
Floor: Basement
Dave Brubeck and the late Dudley Moore

0
 
LVL 20

Author Comment

by:ChristoferDutz
ID: 34917278
And why don't all the Wikis and similar do it the way I want to do it? Have a look at the editor I am writing this text in right now:

Code blocks are wrapped in "[ C o d e]" and "[ / C o d e ]" or Quotes, Links ... Have a look at Wikipedia, most Forum systems, Jira's Wiki Notation, ... the list is endless. It's something the peope are used to.

I my case this format would allow far to many falsely parsed elemens and my Artist list would be filled with junk, that I have to remove manually.



.....


Thanks @petiex even if RegExps were my fallback path. I think I'll stick with that. Seems to be doing exactly what I wanted :-) Think I'll still try to build a grammar for this though ;-)
0
 
LVL 20

Author Closing Comment

by:ChristoferDutz
ID: 34917313
Thanks for your help ... for the antlr grammar I'll contact the Mailinglist :-)
0

Featured Post

Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Regular Expression- Range (telephone) 4 48
HTML 5 video and audio or Flash 1 60
Problem to open Excel file 15 178
Not seen Link button 5 55
Whatever be the reason, if you are working on web development side,  you will need day-today validation codes like email validation, date validation , IP address validation, phone validation on any of the edit page or say at the time of registration…
If you haven’t already, I encourage you to read the first article (http://www.experts-exchange.com/articles/18680/An-Introduction-to-R-Programming-and-R-Studio.html) in my series to gain a basic foundation of R and R Studio.  You will also find the …
The goal of the tutorial is to teach the user how to how to record live broadcast.
The goal of the tutorial is to teach the user how to select which audio input to use. Once you have an audio input plugged into the laptop or computer, you will go into the audio input settings and choose which audio input you want to use.

829 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question