XML without ending element tag?

I was sent a file that I need to parse. It was saved without an extension, and what I first took to be XML looks malformed.

What method would you use to parse it? When I tried an XML reader, it failed.

<AWARD>
<DATE>0917
<YEAR>13
<AGENCY>test
<OFFICE>Air
<LOCATION>23 CONS
<ZIP>31699-1794
<CLASSCOD>54
<NAICS>332311
<OFFADD>4380B Arkansas Rd Dalton GA 31699-1794
<SUBJECT>AircraftMaintenance
<SOLNBR>hA4830-13-a-0007
<NTYPE>COMBINE  
<CONTACT>test P. DeBasio, Contract Administrator, Phone 333-237-4316, Email test.test@test - Jennifer M. testr, Contract Officer, Phone 333-237-4316, Email two.harper.2@test.mil
<AWDNBR>AA4830-13-Z-S004
<AWDAMT>$83,320.00
<AWDDATE>091713
<AWARDEE>xxx-State Restoration, Inc. , 748 Gold Ave, Ste 2, Test City, SC 86442 US
<DESC>
<LINK>
<URL>
<DESC>Link To Document
</AWARD>
RealityGroup asked:

Julian Hansen commented:
This looks like XML but it is not.

The elements don't have closing tags, as in

<DATE>0917</DATE>

which is a requirement for a well-formed XML document.

In this case, assuming there is one field per line (i.e. the <CONTACT> field is one long line in the data file and does not wrap to the next line as it does here), you would need to read the file line by line and use something like a regular expression to split each line into a key/value pair.
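The line-by-line approach could be sketched roughly as follows. This is a minimal Python illustration, not the asker's code: the pattern `FIELD` and the function `parse_fields` are hypothetical names, and it assumes every field sits on a single line that starts with an upper-case tag.

```python
import re

# Assumption: each field is one line of the form <TAG>value, with an upper-case tag.
FIELD = re.compile(r"^<([A-Z][A-Z0-9]*)>(.*)$")

def parse_fields(lines):
    """Return (tag, value) pairs; a list keeps repeated tags such as DESC."""
    pairs = []
    for line in lines:
        match = FIELD.match(line.rstrip("\r\n"))
        if match:  # closing tags such as </AWARD> and blank lines do not match
            pairs.append((match.group(1), match.group(2).strip()))
    return pairs

sample = ["<AWARD>", "<DATE>0917", "<YEAR>13", "<NTYPE>COMBINE  ", "<DESC>", "</AWARD>"]
pairs = parse_fields(sample)
```

A list of pairs is used instead of a dictionary because the sample repeats the <DESC> tag, and a dictionary would silently keep only the last value.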
Geert Bormans (Information Architect) commented:
This is something that looks like XML but is not XML at all.

What you could try is running TagSoup over it. That is a tool that cleans HTML files into well-formed XHTML, but it does not throw away unknown elements. I use it for normalizing XML too, and it tends to work well. TagSoup has ports in different languages (http://ccil.org/~cowan/XML/tagsoup/).

Using some regular expressions, a scripting language, and a limited state machine could help too, but there is some risk involved, mainly with encodings and less common characters. This one file looks pretty clean, but make sure you are not building a crippled XML parser yourself.

You could look into MicroXML tools (http://www.ibm.com/developerworks/library/x-microxml1/). Existing parsers are a lot more flexible, so you could catch some events and end up with well-formed XML (there is some work you need to do, but the groundwork can be reused).

By the way, it seems that this file is valid SGML (the ancestor of XML), though it lacks a DTD.
So that is an opportunity. If you need to process a lot of those, you could develop an SGML DTD for it and use SGML tools for parsing:
- nsgmls (http://www.jclark.com/sp/sx.htm), maybe OmniMark or Balise...

It all depends on how many of those you need to process and how often
Geert Bormans (Information Architect) commented:
If it is only this one document...

<?xml version="1.0" encoding="UTF-8"?>
<AWARD>
    <DATE>0917</DATE>
    <YEAR>13</YEAR>
    <AGENCY>test</AGENCY>
    <OFFICE>Air</OFFICE>
    <LOCATION>23 CONS</LOCATION>
    <ZIP>31699-1794</ZIP>
    <CLASSCOD>54</CLASSCOD>
    <NAICS>332311</NAICS>
    <OFFADD>4380B Arkansas Rd Dalton GA 31699-1794</OFFADD>
    <SUBJECT>AircraftMaintenance</SUBJECT>
    <SOLNBR>hA4830-13-a-0007</SOLNBR>
    <NTYPE>COMBINE</NTYPE>
    <CONTACT>test P. DeBasio, Contract Administrator, Phone 333-237-4316, Email test.test@test -
        Jennifer M. testr, Contract Officer, Phone 333-237-4316, Email
        two.harper.2@test.mil</CONTACT>
    <AWDNBR>AA4830-13-Z-S004</AWDNBR>
    <AWDAMT>$83,320.00</AWDAMT>
    <AWDDATE>091713</AWDDATE>
    <AWARDEE>xxx-State Restoration, Inc. , 748 Gold Ave, Ste 2, Test City, SC 86442 US</AWARDEE>
    <DESC/>
    <LINK/>
    <URL/>
    <DESC>Link To Document</DESC>
</AWARD>
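
For more than one document, the same flat conversion could be scripted. The following is a rough Python sketch under stated assumptions: the function name `to_xml` is hypothetical, the first tagged line is taken as the root element, every later tagged line becomes a flat child, and the standard `xml.etree.ElementTree` module handles escaping of characters such as `&`.

```python
import xml.etree.ElementTree as ET

def to_xml(lines):
    """Turn '<TAG>value' lines into a flat, well-formed XML string."""
    root = None
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, closing tags, and anything that is not a tagged line.
        if not line.startswith("<") or line.startswith("</"):
            continue
        tag, _, value = line[1:].partition(">")
        if root is None:
            root = ET.Element(tag)  # assumption: first tag is the record root
        else:
            ET.SubElement(root, tag).text = value.strip() or None
    if root is None:
        raise ValueError("no tagged lines found")
    return ET.tostring(root, encoding="unicode")

xml_text = to_xml(["<AWARD>", "<DATE>0917", "<AGENCY>A & B", "<DESC>", "</AWARD>"])
```

Like the hand-converted document above, this keeps every element flat and does not attempt to nest anything.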


frankhelk commented:
Addendum: There is a variant for tags that need no separate closing tag when they have no enclosed content, either because
- no content is intended (only attributes are needed), or
- the tag needs to be present by definition but is empty.
The correct syntax:

<sometag attribute1="something"></sometag>

might be abbreviated to
<sometag attribute1="something" />

which is also covered by the XML standard. The author's example doesn't follow that form either, though.

I presume that this is a proprietary dataset format that looks like XML but isn't XML.

It could be parsed easily because its format is simple (no nested content), but it would throw exceptions if fed to any ordinary XML parser. If you code a parser for it, keep in mind that the delimiter characters < and > might appear in the data, so you should treat only their first occurrence on each line as the tag (presuming the <CONTACT> field was merely line-wrapped by the EE site).
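The point about honoring only the first delimiter could look something like this in Python. It is a sketch only: `split_field` is a hypothetical helper, and the sample value containing extra `<` and `>` characters is invented for illustration rather than taken from the feed.

```python
def split_field(line):
    """Split '<TAG>value' at the first '>' only, so later '<' or '>' stay in the value."""
    line = line.rstrip("\r\n")
    if not line.startswith("<") or line.startswith("</"):
        return None  # not an opening tag (blank line, closing tag, stray text)
    tag, sep, value = line[1:].partition(">")
    if not sep or not tag:
        return None  # no closing '>' or an empty tag name
    return tag, value.strip()

# Hypothetical value with delimiter characters inside the data.
field = split_field("<DESC>Amount > $50,000 <see attachment>")
```

Here `str.partition` stops at the first `>`, so everything after it is kept verbatim as the value.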
RealityGroup (author) commented:
I have resigned myself that I'll have to just read the file in with StreamReader and parse it.  

And even though it may not appear that way at first, there is nested content.

<DESC>
<LINK>
<URL>
<DESC>Link To Document

The <URL> and second <DESC> are "sub elements" of <LINK>.

Thanks for all the responses.
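
One way such nesting might be handled after the lines have been split into tag/value pairs is sketched below in Python. The function `build_record` is a hypothetical name, and the sketch assumes that everything following <LINK> up to the end of the record belongs to it, since the data has no closing tag to say otherwise.

```python
def build_record(pairs, container="LINK"):
    """Fold every field that follows `container` into a nested dict under that key."""
    record = {}
    target = record  # fields go to the top level until the container tag is seen
    for tag, value in pairs:
        if tag == container:
            record[container] = {}
            target = record[container]
        else:
            target[tag] = value
    return record

pairs = [("DATE", "0917"), ("DESC", ""), ("LINK", ""), ("URL", ""), ("DESC", "Link To Document")]
record = build_record(pairs)
```

With this grouping the two <DESC> fields no longer collide, because the second one lives inside the nested LINK dictionary.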
frankhelk commented:
OK - I've seen that there are opening/closing pairs, but w/o proper syntax.

Remarkably, while the <DESC> tag appears simply twice with the opening syntax, there's a correct closing tag on the <AWARD> tag. Anyhow, all other tags don't follow the spec - they use the separate open/close syntax for the opening tag, but the closing tags are missing.

It would be nasty to code stable around that mess of a structure. I presume you're not in the position for that, but I would schedule the programmer who committed that IT crime to write 500 times "I will code my output to be well formed XML" on the blackboard - with screeching chalk. In XML. Well formed :)
RealityGroup (author) commented:
Correct, I have no control or recourse to get this changed. It is from a government feed.
There are more types other than <AWARD>.  But they are all formatted the same way.
Fortunately it is pretty consistent in what I'm seeing.  I just wanted to avoid any cumbersome parsing routines.  The code to parse isn't hard to come up with just ugly and like you said...nasty to code stable around that mess of a structure.

Thanks for your input.
Geert Bormans (Information Architect) commented:
@frankhelk. I have 20 years of experience with SGML and XML. Please note that no one ever said this is XML or is supposed to be XML. It looks like XML but it is not, not at all. BUT it is also not an IT crime, so there is no reason to condemn anyone to writing 500 lines on the blackboard.
I have gigabytes of similar constructs on tape here, and given the proper context, I consider them perfectly valid data. Calling it an IT crime is ignorance.

SGML allows end tag omission, and given a proper SGML DTD this piece of data would parse beautifully, even with the nested structure at the end. I am not saying this is intended to be SGML. I do have a strong suspicion, however, that it was created with SGML in mind.

@RealityGroup
TagSoup would not spot the nesting requirement at the end, so that suggestion no longer holds.

But my suggestion of using James Clark's toolset (nsgmls has an SGML-to-XML transformer, albeit in Java) together with an SGML DTD for the data still stands. You could use that as a first step in your processing workflow.
For a customer that is a legal publisher with gigabytes of SGML in a Content Management System, I am still converting SGML to XML and the other way round on a regular basis. Just because the tools are old does not mean they don't work properly.

If you are going to develop your own stream reader, try to build it on top of a SAX parser. A SAX parser reads the SGML-like input from start to end, and you might be able to handle the error messages gracefully in the callbacks.

Or have a look at this one, which seems to be a C# SGML reader class:
http://stackoverflow.com/questions/1148083/sgml-parser-net-recommendations
All you need is an SGML DTD for this thing. I can help you with that.

But you haven't given any more detail on the assignment. Where does it come from, and how many of those do you need to process? After all, if this is SGML, your supplier must have a DTD ("well-formed" is a concept unknown to SGML, so a DTD must exist). Maybe they can normalize it for you.

Julian Hansen commented:
I don't see any way around a custom parse routine.
Geert Bormans (Information Architect) commented:
I was writing when you sent your last comment... If the party you get this from is governmental, chances are really high that this is SGML. SGML was widely adopted 20 years ago in law publishing, technical documentation and... government.
SGML parsers are indeed far more complex than XML parsers, especially because they need to deal with end tag omission. One more reason not to develop the parser entirely yourself, but to have a look at the existing tools first.
Geert Bormans (Information Architect) commented:
I don't see any way around a custom parse routine.

Well, in a way... if you consider an SGML DTD a custom parse routine :-)
RealityGroup (author) commented:
I will be processing one file a day with about 2000 records in it.  There are about 10 other types besides the <AWARD>.

It took me about 5 minutes to write a custom parser for the <AWARD> type.  I was just hoping for a more "elegant" way to do this.
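
For a daily file holding many records of several types, the records could first be separated before any type-specific parsing. The Python sketch below is an illustration only: `iter_records` is a hypothetical name, `PRESOL` stands in for an unknown second record type, and it assumes every record opens with a bare `<TYPE>` line and closes with the matching `</TYPE>` line, as <AWARD> does in the sample.

```python
import re

# A record opener is a line holding nothing but an upper-case tag.
OPEN = re.compile(r"^<([A-Z][A-Z0-9]*)>\s*$")

def iter_records(lines):
    """Yield (record_type, body_lines) for each <TYPE> ... </TYPE> block."""
    rtype, body = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if rtype is None:
            match = OPEN.match(line)
            if match:  # start of a new record
                rtype, body = match.group(1), []
        elif line.strip() == f"</{rtype}>":
            yield rtype, body  # matching closing tag ends the record
            rtype = None
        else:
            body.append(line)  # field line, including bare tags such as <DESC>

feed = ["<AWARD>", "<DATE>0917", "<DESC>", "</AWARD>", "<PRESOL>", "<DATE>0918", "</PRESOL>"]
records = list(iter_records(feed))
```

Each yielded body can then be handed to whatever parser is written for that record type.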

Thanks again for all the replies.

@Gertone I will take a look at your recommendations and see if they will work for me.  Thank you.

ps.  They won't supply any  DTD or documentation for the files.
Geert Bormans (Information Architect) commented:
It took me about 5 minutes to write a custom parser for the <AWARD> type.  I was just hoping for a more "elegant" way to do this.

Until you hit a foreign character, or a missing element, or...
Custom parsers could bite you.
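
If a custom parser is used anyway, these two risks could at least be checked explicitly. The following Python sketch makes assumptions that the thread does not confirm: the feed's encoding is unknown, so UTF-8 is tried first with Latin-1 as a fallback, and the set `REQUIRED_TAGS` is a hypothetical choice of mandatory fields.

```python
def decode_feed(data: bytes) -> str:
    """Try UTF-8 first; fall back to Latin-1, which maps every byte to a character."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")

# Hypothetical list of tags every AWARD record is expected to carry.
REQUIRED_TAGS = {"DATE", "AWDNBR", "AWDAMT"}

def missing_tags(pairs):
    """Return the required tags absent from a list of (tag, value) pairs."""
    seen = {tag for tag, _ in pairs}
    return sorted(REQUIRED_TAGS - seen)

text = decode_feed(b"<AWARDEE>Caf\xe9 Restoration, Inc.")
missing = missing_tags([("DATE", "0917"), ("AWDNBR", "AA4830-13-Z-S004")])
```

Reporting missing tags instead of failing silently makes it easier to notice when the feed drifts from the layout the parser was written for.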

If you can use the SGML tools, then you would have one clean workflow. All you need then is to develop 10 SGML DTDs. Given the complexity shown above, that should not be too hard.

Good luck