Solved

Help building a "lineup-parser"

Posted on 2011-02-16
22
540 Views
Last Modified: 2013-11-18
Hi,

I'm currently working on a Flash/Flex based website in which users can post Events.

An Event has a description and a list of (dance-)floors (each floor has a name). On each of these floors a list of artist can be added that perform on that floor.

In a first approach I created a text-field for the description and a list of lists for the floors and lineup with thousands of add-/delete-buttons. Unfortunately this was really anoying to fill with information and looked really crappy, so I had the Idea of simply adding a big TextArea for the description and adding BBCode like tags for entering the floors and lineups.

Here an example:
Some description text ... blah blah blah ... some more text ... 
blah blah blah 

blah

[Lineup]
[Floor: Uplifting Floor]
Talla 2XLC
Bluefire
LXD
[Floor: Classics Floor]
Tube & Miller
Charly the Diggerman 
[/Lineup]

blah ... blah blah blah ...

Open in new window


The options I was thinking about were an Antlr grammar for this simple language and having a parser generated for this (The generation part is easy, just the grammar is hard for me). The other alternative would be a set of regular expressions and a manually created parser.

Unfortunately my brain seems to be incompatible with formal language definitions and regular expressions.

Does anyone have a hint to a really simple antlr-grammar, that I could use to get me started (Grammar that parses the experts-exchange "code" blocks, for example) or could someone please provide me some sample regular expressions that I could get started with (Looking for something that has a result ["text before the lineup", "content of the lineup", "text after the lineup closing tag"] or ["Floor-Name", "Lineup content"])

I'm not just lazy ... I spent a lot of hours on my antlr and regexp path ... as I mentioned ... My brain seems to be incompatible with this formal thinking :-( (If some one could lend me a brain that can do this, I'd spend an extra 2000 Points ;-) )
0
Comment
Question by:ChristoferDutz
  • 11
  • 7
  • 2
  • +2
22 Comments
 
LVL 27

Expert Comment

by:BigRat
Comment Utility
[Lineup]
[Floor: Uplifting Floor]
Talla 2XLC
Bluefire
LXD
[Floor: Classics Floor]
Tube & Miller
Charly the Diggerman
[/Lineup]


<?xml version="1.0"?>
<Event>
   <Description>This is the major event of the year,/description>
   <LineUp>
      <Floor name="Uplifting Floor">
            <artist>Talla 2XLC</artist>
            <artist>Bluefire</artist>
            <artist>LXD</artist>
    </Floor>
    <Floor name="Classics Floor">
            <artist>Tube & Miller
            <artist>Charly the Diggerman
    </Floor>
   </LineUp>
</Event>

Doesn't Event need a time and a date? Doesn't the artist need a position? (Rather than rely on the lexical order)

There are SO MANY tools around to process XML it is hardly worthwhile maing your own data format.
0
 
LVL 27

Expert Comment

by:BigRat
Comment Utility
Ooops, in the second Floor element I have forgotten to add the end artist tag </artist>. Sorry!
0
 
LVL 20

Author Comment

by:ChristoferDutz
Comment Utility
Well processing XML would be easy. The text posted above should only deal with the Lineup and the Description. All other fields are dealt with seperately (with nice date-selectors and automatically initialized fields).

The thing is that I don't want my users to write complicated XML (I know xml is not hard, but users tend to run away if you present it to them or ask them to write it). I just want them to paste an event description they found somewhere and to easily modify the text so the system can process it.
0
 
LVL 27

Expert Comment

by:BigRat
Comment Utility
Your users will use an HTML form to paste their events. In this form you'll have fields for the event name/description, time and place, and an artist list. Now the artist list can be simply a text box where you ask that the put one artist to a line. Now whether you need to sort out the exact list is another matter. More importantly the event name, time and place ought to be captured in separate boxes.

When the data arrives at the server, one simply checks in an existing XML file, which serves as a database, whether the event is duplicated and if not you add it. You could store the artist list, out of the textbox, in a CDATA section under Floor so as to preserve the line structure.

But parsing free text entered by humans is fraught with difficulties, error handling and such. One needs to identify those significant stuctural pieces of information, establish integrity conditions and store them appropiately.
0
 
LVL 20

Author Comment

by:ChristoferDutz
Comment Utility

In a first approach I created a text-field for the description and a list of lists for the floors and lineup with thousands of add-/delete-buttons. Unfortunately this was really anoying to fill with information and looked really crappy, so I had the Idea of simply adding a big TextArea for the description and adding BBCode like tags for entering the floors and lineups.

First off ... I'm not using HTML, but Flex/Flash. I allready created a soulution that built my Event-Floor-Linup model in the client. The thing is that there are events with 30 Artists on 5 Floors. Using the manual approach woud make it nessecary to click thousands of times, do thousands of copy+pastes. I don't want that. I want the users to paste the Description from the offial press releases and do a little minimal update of the text and the system interprets it.

No matter if this approach is good or bad ... could we get back to my question? Please?

0
 
LVL 27

Expert Comment

by:BigRat
Comment Utility
>>Please?

I did actually read your Flash/Flex section and the buttons and so on, which I though was very complicated. I wouldn't use such a technology because I think it is too complicated. But since you said please, I'll ask how far you've gotten with the grammar. Perhaps I can be of help.
0
 
LVL 20

Author Comment

by:ChristoferDutz
Comment Utility
Thanks :-)

Well at first I started looking for existing grammars for Wikipedia-Style Notation. I actually found some ... one was called Creole. Unfortunately this grammar was far too complex for my purposes and the initial version could not be processed by antlr because of problems with unambiguous paths when parsing certain inputs.

I have really tried getting my head around this grammar stuff and really like is (My antlr-generated mathematical expression evaluator works great in Java and Flex). But it seems free text is much more complicated to write such a grammar for.

What I'm currently looking for, is a grammar that would parse my sampe to the following tree:

Root
- Text ("Some description text ... blah blah blah ... some more text ..." ....)
- Lineup
  - Floor
    - Name ("Uplifting Floor")
    - Text ("Talla 2XLC")
    - Text ("Bluefire")
    - Text ("LXD")
  - Floor 
    - Name ("Classics Floor")
    - Text ("Tube & Miller")
    - Text ("Charly the Diggerman") 

Open in new window


PS: This technology is actually far less complex than most people think and I can do things the HTML guys wont be able to ... it's really easy as soon as you start thinking of Web-Applications differently. I admit writing Flash sucks ... but Flex rocks ;-)
0
 
LVL 20

Author Comment

by:ChristoferDutz
Comment Utility
I would be glad to have the following as a starting point:
Root
- Text ("Some description text ... blah blah blah ... some more text ..." ....)
- Lineup
  - Text ("[Floor: Uplifting Floor]\nTalla 2XLC\nBluefire\nLXD\n[Floor: Classics Floor]\nTube & Miller\nCharly the Diggerman
- Text ("blah ... blah blah blah ...") 

Open in new window


Unfortunately I forgott the "blah" ... line in my last example.
0
 
LVL 27

Assisted Solution

by:BigRat
BigRat earned 250 total points
Comment Utility
An Antlr grammar is going to be very difficult because of the free text that appears between some of the keywords. The Antlr system is two level - namely it first tokenizes the input text (that is breaks them into little chunks) and then parses the tokens into a tree.

You'd be much better off reading the data line by line and having a couple of routines like "startswith" and "contains"  and so on to do the lexical work and using a simple state machine to make the decisions. Thus the thing would look like :-

state = DESCRIPTION
for each line l in inputfile do
   if l = '[/LineUp]' then
       -- end of lineup
       if state != FLOOR then
           -- error incorrectly placed end tag
      else
           -- save description and lineup nodes
           state = DESCRIPTION
       end if
   else if l='[Lineup] then
       state = LINEUP
      -- create and append a new lineup node
  else if l = '[Floor:' then
        if state!= LINEUP then
               -- error misplaced FLOOR start tag
        else
             -- create a new floor node onto the end of the current list
             -- save the name part from the line l after the colon before the ]
             state = FLOOR
        end if
   else if l startswith '[' then
        -- error case keyword probably misspelt
   else
        -- line is text
        case state of
        DESCRIPTION
            -- append l onto the current description node
        LINEUP
            -- error - there should be NO text between lineup and floor
       FLOOR
           -- append a new text node to the last floor node for this artist
       end case
   end all ifs
end for
-- here the state must be DESCRIPTION and the text of it must be empty, otherwise something is missing

The capitialized words are in fact just constants which say in which state the machine is. The lines starting with -- are comments, the rest I hope is obvious.
0
 
LVL 20

Author Comment

by:ChristoferDutz
Comment Utility
Yeah ... well that would have been my ultimate fallback. But actually I am sort of trying to dig deeper into formal languages.
I am expecting my format to become more and more complex over time and maintaining such a manual parser does not seem to the the path that I want to go. Here again my question in a simpler way:

Could someone please post a antlr-grammar to parse this input:
Text1
Text2
[Lineup]
Text3
Text4
[/Lineup]
Text5

Open in new window


To the following tree:
root
- Text (Text1)
- Text (Text2)
- Lineup
  - Text (Text3)
  - Text (Text4)
- Text (Text5)

Open in new window


Or alternatively use a regular expression to split it up into thiese groups:
{Text1\nText2}{Text3\nText4}{Text5}

Open in new window

Meaning: {"Stuff before Lineup", "Stuff inside the Lienup Tags", "Stuff after the closing Lineup Tag"}

And I am really trying to solve this using an Antlr grammar and regular expressions and I don't want to do it differently. I actually want to try both approaches as an excercise for me in order to dig into these topics.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
Comment Utility
@BigRat

>>  An Antlr grammar is going to be very difficult because of the free text that appears between some of the keywords.


For the benefit of the discussion, could you expand upon why it's difficult? I'm not immediately seeing what you mean  = )
0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 
LVL 20

Author Comment

by:ChristoferDutz
Comment Utility
I know this must be really really easy, but as I mentioned, I have tried digging into writing antlr grammars a few times now. It seems to be difficult only for me :-(

Mabe I allways missed the basic concept of these grammars. But if you think googling will provide you with such an easy sample-grammar, you will propably fail (At least I did). All grammars I could find are all way too complex to even start understanding them (if you are a beginner with this).
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
Comment Utility
I've never used Antlr before, but I have had to write a recursive descent parser for class, which I thought we were told that's one of the things Antlr does. In the event that it is truly a pain to develop and Antlr grammar, I would think writing your own recursive descent parser wouldn't be that difficult.
0
 
LVL 14

Expert Comment

by:tomaugerdotcom
Comment Utility
Hey Chris, I know this is going to feel like it's setting your question all the way back to the beginning, but why do you not want to go with the XML approach?

The only difference that I can see between the BBCode-style approach and XML 1.0 is using <> instead of [ ]. It seems like exactly the same amount of work to me. I'm probably missing something because I know you know all about XML and must have decided not to go with that for some reason, but you state simplicity for your end users.

I would argue that it's simpler to use an existing standard for which there is tons of documentation and references, that you can validate if you write a proper DTD, and that parses natively using E4X, rather than trying to come up with something that's syntactically identical but using a different escape token ("[" instead of "<").

Now, going off in a different direction here - one thing that you might look at is simply using linebreaks as your separator. It looks like your structure is pretty formal and immutable. You can use a special token to indicate the start of a record, and then just make it such that each "record" follows the same format.
Lineup
%FLOOR%
Uplifting Floor
Talla 2XLC
Bluefire
LXD
%FLOOR%
Classics Floor
Tube & Miller
Charly the Diggerman

Open in new window


Alternatively, you might even just use a double return to separate your records like this:
Uplifting Floor
Talla 2XLC
Bluefire
LXD

Classics Floor
Tube & Miller
Charly the Diggerman

Open in new window


Okay, maybe this is not where you wanted to go, so forgive me since I'm coming late to the party here,   but some food for thought.

(*leaves to look up "Antir Grammar" on wikipedia)
0
 
LVL 20

Author Comment

by:ChristoferDutz
Comment Utility
Ok ... to explain my pain a little more ;-)

I am working on a private project, which consists of a Flex/Flash based client and a Java Server. Now I wanted to secure my application with a user/group/role security-model in which I can provide permissions as boolean expressions. These are used by some Aspects to secure things on the server-side. Now I wanted to avoid the client bombarding the server with "can the user do this and that" questions, so I needed the same evaluation-logic on the client too. As I didn't want to reimplement the same logic twice and maintain it (which would be the real pita), I used Antlr and let it generate a Java and an ActionScript evaluator. If I change anything in the syntax of my security-language when doing my maven build antlr will automatically write both evaluators for me and they will be in sync perfectly.

So much for the introduction.

My system should be able to parse these event texts in the client, but also a Server-Side tool should be able to parse events posted by email so again I would have to double implement these parsers.

This is why I wanted to stick with antlr and a grammar for my simplistic event-language.
0
 
LVL 20

Author Comment

by:ChristoferDutz
Comment Utility
@tomaugerdotcom

The problem is the user ... I know they will use [ and ] within their description texts and I know the pasted texts will contain linebreaks and different types of formatting. a "stupid" replacement won't bring me far. Especially if a user does something like this:
Really cool event in the Club XYZ [Formerly Known as Club ABC]. We woud be glad to see you!

Open in new window

In this case the entire editor would fail and my users would become frustrated very soon. A parser as the one I am talking about would be way more robust ... and keep my last post in mind ... I don't want to implement this multiple times.
0
 
LVL 27

Assisted Solution

by:BigRat
BigRat earned 250 total points
Comment Utility
It's really a problem of two identical pieces of text being different. In the description there are multiple lines separated by CR/LF (presumably) which make a whole. But in the artist list the same text must be split linewise. In other words the lexical parsing is context dependant.

What is problematic with the formal approach is the lack of real error processing. Basically a RegEx would split it up like this   (^\[)*\[Lineup]((\r\n)*\[Floor:( )*([a-z,A-Z])+\](\r\n)((^[\r,\n])+\r\n)+)\[\/Lineup\]\r\n which is just an attempt to tackle the syntax without using fancy features and ignoring upepr/lower case issues. Either this matches the input or it doesn't and when it doesn't there's no telling why.

In the very old days state machines were used instead of formal parsers which came into usage after round 1972 (I remember Franklin de Remer explaining the method at a conference in that year). I have written a full xml parser and xpath processor just using a simple state machine. On the other hand I have also written a full LAR parser generator (that is an LR(1) parser - that from YACC and Bison are in fact LALR(1) - with a regular expression resolver to resolve LR ambiguities. Error recovery - ie: tolerance of user input gets reduce to syntax error and if you are lucky at the place where it exactly is, although sometimes several characters further. The user is forced to change his input and try again.

With a state machine one can take appropiate action when the syntax is broken but in a way that recovery is possible - here a missing [/Lineup] tag. If nothing follows in the file it does not matter.
0
 
LVL 20

Author Comment

by:ChristoferDutz
Comment Utility
My idea would be that the system takes the input and parses it. If it is able to, it interprets it, if there are errors the entire input is treated as a description.

Well I think the formal approach is not at all problematic regarding error processing. As an example the security-evaluator understands boolean expressions on the lowest level, but is does understand full mathematical expressions above that "(5 * 6 / [user.id]) > 100" (Only the result must be a boolean) and it indeed detects errors if I do something bad "5 * / 6" or "(5 * 6 / [user.id) > 100" (notice the missing "]") or "42" (result is not a boolean).

Seems I will have to ask the Antlr mailinglist. Just thought this would be really simple :-(
0
 
LVL 11

Accepted Solution

by:
petiex earned 250 total points
Comment Utility
Here's a regexp implementation. I didn't bother with any exception checking, like when there is no [Floor] or [Lineup], but you can add that as the need emerges:
<?xml version="1.0" encoding="utf-8"?>
<mx:Application xmlns:fx="http://ns.adobe.com/mxml/2009"
                xmlns:s="library://ns.adobe.com/flex/spark"
                xmlns:mx="library://ns.adobe.com/flex/mx">
    <fx:Script>
     <![CDATA[

        private var wholeRegExp:RegExp = /([\s\S]*)(\[Lineup\])([\s\S]*)(\[\/Lineup\])([\s\S]*)/m;
        private var floorRegExp:RegExp = /\[Floor:([^\]]+)\]([\s\S]?)/g;

        private function translate():void {
              var result:Object = wholeRegExp.exec(testTextArea.text);
            resultTextArea.htmlText = result[1]+"<br/><b>LINEUP</b><br/>";
            resultTextArea.htmlText += String(result[3]).replace(floorRegExp, "Floor<br/>  Name: $1<br/>$2<br/>");
            resultTextArea.htmlText += "<br/>After lineup: "+result[5];
        }
     ]]>
</fx:Script>
    <mx:TextArea id="testTextArea" width="300" height="100">
        <mx:text><![CDATA[
Some description text ... blah blah blah ... some more text ...
blah blah blah

blah

[Lineup]
[Floor: Uplifting Floor]
Talla 2XLC
Bluefire
LXD
[Floor: Classics Floor]
Tube & Miller
Charly the Diggerman
[/Lineup]

blah ... blah blah blah ...
            ]]>
    </mx:text></mx:TextArea>
    <mx:Button click="translate()" label="Translate"/>
    <mx:TextArea width="300" height="100" id="resultTextArea"/>
</mx:Application>

Open in new window

0
 
LVL 27

Expert Comment

by:BigRat
Comment Utility
>>Just thought this would be really simple :-(

It would be if you thought like that. I'd turn the problem around and make the input soooooo simple that it would be easy to use. The common case must be the easy case. People can always use the multiple easy case for the complex. The input should look like :-

Event:
   Meat Loaf comes to town.....................
His band includes ........................
LineUp:
Floor: UpperStory
Meatloaf
The artist formerly known as Rat
Floor: Basement
Dave Brubeck and the late Dudley Moore

You don't need any formal grammar. The header is always one word followed by a colon and CR/LF.
You don't need any end tags, which saves finding out if they are there and erroring when not. I don't even know why one must have LineUp. I would have thought that date, time and place were more important :-

Event:
   Meat Loaf comes to town.....................
His band includes ........................
Date:
   22.10.2011
Time:
   20:30
Place:
Rat's House
Rathouseflats
Darmstadt
Floor: UpperStory
Meatloaf
The artist formerly known as Rat
Floor: Basement
Dave Brubeck and the late Dudley Moore

0
 
LVL 20

Author Comment

by:ChristoferDutz
Comment Utility
And why don't all the Wikis and similar do it the way I want to do it? Have a look at the editor I am writing this text in right now:

Code blocks are wrapped in "[ C o d e]" and "[ / C o d e ]" or Quotes, Links ... Have a look at Wikipedia, most Forum systems, Jira's Wiki Notation, ... the list is endless. It's something the peope are used to.

I my case this format would allow far to many falsely parsed elemens and my Artist list would be filled with junk, that I have to remove manually.



.....


Thanks @petiex even if RegExps were my fallback path. I think I'll stick with that. Seems to be doing exactly what I wanted :-) Think I'll still try to build a grammar for this though ;-)
0
 
LVL 20

Author Closing Comment

by:ChristoferDutz
Comment Utility
Thanks for your help ... for the antlr grammar I'll contact the Mailinglist :-)
0

Featured Post

Better Security Awareness With Threat Intelligence

See how one of the leading financial services organizations uses Recorded Future as part of a holistic threat intelligence program to promote security awareness and proactively and efficiently identify threats.

Join & Write a Comment

As most anyone who uses or has come across them can attest to, regular expressions (regex) are a complicated bit of magic. Packed so succinctly within their cryptic syntax lies a great deal of power. It's not the "take over the world" kind of power,…
The last time I worked with Flash and Socket connections was in AS1. A recent project required flash connecting to a Socket, and sending receiving information - we figured it would be easy enough - we all know about the socket policy documents and c…
The goal of the tutorial is to teach the user how to use functions in C++. The video will cover how to define functions, how to call functions and how to create functions prototypes. Microsoft Visual C++ 2010 Express will be used as a text editor an…
The viewer will learn additional member functions of the vector class. Specifically, the capacity and swap member functions will be introduced.

744 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

9 Experts available now in Live!

Get 1:1 Help Now