Solved

Parsing NewsML and storing to Database

Posted on 2008-10-13
Last Modified: 2013-12-17
Hi,

Could anyone please help me with parsing logic that reads NewsML files and inserts their contents into a database? NewsML files are standard XML files with metadata. My objective is to parse the NewsML file, extract the feed data, and insert it into a SQL Server database. The database will have fields corresponding to the NewsML feed tags. The parser should fetch all the entries from the XML file and insert each one into the appropriate column in the database. Later I will use this information in the database to populate a website. You can use any .NET language, such as C# or VB.
I have attached a sample NewsML file.
If any part of my question is unclear, please let me know.
Thanks
<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE NewsML SYSTEM "http://idsdat06.reuters.com/newsml/NewsMLv1.0.dtd" [

  <!ENTITY % xhtml SYSTEM "http://idsdat06.reuters.com/newsml/xhtml1-strict.dtd">

  %xhtml;

]>

<!--src: rtr2newsml0.990-->

<NewsML Duid="MTFH09658_2002-07-03_01-40-47_2640656_NEWSML">

    <Catalog Href="http://www.reuters.com/newsml/mastercatalog.xml"/>

    <NewsEnvelope>

        <DateAndTime>20020703T014047+0000</DateAndTime>

        <NewsService FormalName="RTR_TNS"/>

        <NewsProduct FormalName="TXT"/>

        <Priority FormalName="3"/>

    </NewsEnvelope>

    <NewsItem Duid="MTFH09658_2002-07-03_01-40-47_2640656_NEWSITEM">

        <Identification>

            <NewsIdentifier>

                <ProviderId>reuters.com</ProviderId>

                <DateId>20020703</DateId>

                <NewsItemId>MTFH09658_2002-07-03_01-40-47_2640656</NewsItemId>

                <RevisionId Update="N" PreviousRevision="0">1</RevisionId>

                <PublicIdentifier>urn:newsml:reuters.com:20020703:MTFH09658_2002-07-03_01-40-47_2640656:1</PublicIdentifier>

            </NewsIdentifier>

            <DateLabel>2002-07-03 01:40:47 GMT (Reuters)</DateLabel>

        </Identification>

        <NewsManagement>

            <NewsItemType FormalName="News"/>

            <FirstCreated>20020703T014047+0000</FirstCreated>

            <ThisRevisionCreated>20020703T014047+0000</ThisRevisionCreated>

            <Status FormalName="Usable"/>

            <Urgency FormalName="3"/>

        </NewsManagement>

        <NewsComponent EquivalentsList="no" Essential="no" Duid="MTFH09658_2002-07-03_01-40-47_2640656_MAIN_NC" xml:lang="en">

            <TopicSet FormalName="HighImportance">  <Topic Duid="ts_1">  <TopicType FormalName="Geography"/>  <FormalName Scheme="N2000">AF</FormalName>  <Description xml:lang="en">Afghanistan</Description>  <Property FormalName="CountryGrouping" Value="EMRG"/>  <Property FormalName="WhyPresent" Value="Classifier"/>  </Topic>  <Topic Duid="ts_2">  <TopicType FormalName="Geography"/>  <FormalName Scheme="N2000">ASIA</FormalName>  <Description xml:lang="en">Asia</Description>  <Property FormalName="WhyPresent" Value="Ancestor"/>  </Topic>  <Topic Duid="ts_3">  <TopicType FormalName="Country Grouping"/>  <FormalName Scheme="N2000">EMRG</FormalName>  <Description xml:lang="en">Emerging countries</Description>  <Property FormalName="WhyPresent" Value="Classifier"/>  </Topic>  <Topic Duid="ts_4">  <TopicType FormalName="Geography"/>  <FormalName Scheme="N2000">US</FormalName>  <Description xml:lang="en">United States of America</Description>  <Property FormalName="WhyPresent" Value="Classifier"/>  </Topic>  <Topic Duid="ts_5">  <TopicType FormalName="Geography"/>  <FormalName Scheme="N2000">PK</FormalName>  <Description xml:lang="en">Pakistan</Description>  <Property FormalName="WhyPresent" Value="Classifier"/>  </Topic>  <Topic Duid="ts_6">  <TopicType FormalName="Government and private aid"/>  <FormalName Scheme="N2000">AID</FormalName>  <FormalName Scheme="IPTCSubjectCodes">04008007</FormalName>  <Description xml:lang="en">Private/Government Aid</Description>  <Property FormalName="WhyPresent" Value="Classifier"/>  </Topic>  <Topic Duid="ts_7">  <TopicType FormalName="INTERNATIONAL RELATIONS"/>  <FormalName Scheme="N2000">DIP</FormalName>  <FormalName Scheme="IPTCSubjectCodes">11002000</FormalName>  <Description xml:lang="en">Diplomacy; International Relations</Description>  <Property FormalName="WhyPresent" Value="Ancestor"/>  </Topic>  <Topic Duid="ts_8">  <TopicType FormalName="Reuters Legacy Code"/>  <FormalName Scheme="N2000">NEWS</FormalName>  <Description 
xml:lang="en">General News</Description>  <Property FormalName="WhyPresent" Value="Classifier"/>  </Topic>  <Topic Duid="ts_9">  <TopicType FormalName="Reuters Legacy Code"/>  <FormalName Scheme="N2000">FEA</FormalName>  <Description xml:lang="en">Features (German)</Description>  <Property FormalName="WhyPresent" Value="Classifier"/>  </Topic>  <Topic Duid="ts_10">  <TopicType FormalName="HEALTH AND MEDICINE"/>  <FormalName Scheme="N2000">HEA</FormalName>  <FormalName Scheme="IPTCSubjectCodes">07000000</FormalName>  <Description xml:lang="en">Health and medicine</Description>  <Property FormalName="WhyPresent" Value="Classifier"/>  </Topic>  <Topic Duid="ts_11">  <TopicType FormalName="Lifestyle"/>  <FormalName Scheme="N2000">LIF</FormalName>  <FormalName Scheme="IPTCSubjectCodes">10000000</FormalName>  <Description xml:lang="en">Living and lifestyle</Description>  <Property FormalName="WhyPresent" Value="Classifier"/>  </Topic>  <Topic Duid="ts_12">  <TopicType FormalName="POLITICS"/>  <FormalName Scheme="N2000">POL</FormalName>  <FormalName Scheme="IPTCSubjectCodes">11000000</FormalName>  <Description xml:lang="en">Domestic Politics</Description>  <Property FormalName="WhyPresent" Value="Classifier"/>  </Topic>  <Topic Duid="ts_13">  <TopicType FormalName="Religion"/>  <FormalName Scheme="N2000">REL</FormalName>  <FormalName Scheme="IPTCSubjectCodes">12000000</FormalName>  <Description xml:lang="en">Religion and belief (extend definition)</Description>  <Property FormalName="WhyPresent" Value="Classifier"/>  </Topic>  <Topic Duid="ts_14">  <TopicType FormalName="CategoryCode"/>  <FormalName Scheme="MediaCategory">OVR</FormalName>  <Description xml:lang="en">General news stories</Description>  <Property FormalName="WhyPresent" Value="Classifier"/>  </Topic>  </TopicSet>

            <Role FormalName="Main"/>

            <AdministrativeMetadata>

                <FileName>2002-07-03T014047Z_01_2640656_RTRIDST_0_AFGHAN-RETURNEES-GENERAL-FEATURE.XML</FileName>

                <Provider>

                    <Party FormalName="Reuters"/>

                </Provider>

                <Source>

                    <Party FormalName="Reuters"/>

                </Source>

                <Property FormalName="SourceFeed" Value="IDS"/>

                <Property FormalName="IDSPublisher" Value="http://www.reuters.com/ids"/>

            </AdministrativeMetadata>

            <!--Single "Main Text" inner NewsComponent-->

            <NewsComponent EquivalentsList="no" Essential="no" Duid="MTFH09658_2002-07-03_01-40-47_2640656_MAIN_TEXT_NC" xml:lang="en">

                <Role FormalName="Main Text"/>

                <NewsLines>

                    <HeadLine>FEATURE-Afghan villagers rebuild homes from the rubble</HeadLine>

                    <ByLine/>

                    <DateLine>July 3, 2002</DateLine>

                    <CreditLine>REUTERS</CreditLine>

                    <CopyrightLine>© Reuters 2002. All rights reserved. Republication or redistribution of Reuters content, including by caching, framing or similar means, is expressly prohibited without the prior written consent of Reuters. Reuters and the Reuters sphere logo are registered trademarks and trademarks of the Reuters group of companies around the world.</CopyrightLine>

                    <SlugLine>AFGHAN-RETURNEES (GENERAL FEATURE)</SlugLine>

                    <NewsLine>

                        <NewsLineType FormalName="Caption"/>

                        <NewsLineText>AFGHAN-RETURNEES (GENERAL FEATURE):FEATURE-Afghan villagers rebuild homes from the rubble</NewsLineText>

                    </NewsLine>

                </NewsLines>

                <DescriptiveMetadata>

                    <Language FormalName="en"/>

                    <OfInterestTo FormalName="G"/>

                    <OfInterestTo FormalName="RBN"/>

                    <OfInterestTo FormalName="AFA"/>

                    <OfInterestTo FormalName="CSA"/>

                    <OfInterestTo FormalName="LBY"/>

                    <OfInterestTo FormalName="RWSA"/>

                    <OfInterestTo FormalName="RWS"/>

                    <OfInterestTo FormalName="REULB"/>

                    <OfInterestTo FormalName="GNS"/>

                    <OfInterestTo FormalName="SNS"/>

                    <OfInterestTo FormalName="SNI"/>

                    <OfInterestTo FormalName="RNP"/>

                    <OfInterestTo FormalName="DNP"/>

                    <OfInterestTo FormalName="PGE"/>

                    <OfInterestTo FormalName="SXNA"/>

                    <TopicOccurrence Importance="High" Topic="#ts_1"/>

                    <TopicOccurrence Importance="High" Topic="#ts_2"/>

                    <TopicOccurrence Importance="High" Topic="#ts_3"/>

                    <TopicOccurrence Importance="High" Topic="#ts_4"/>

                    <TopicOccurrence Importance="High" Topic="#ts_5"/>

                    <TopicOccurrence Importance="High" Topic="#ts_6"/>

                    <TopicOccurrence Importance="High" Topic="#ts_7"/>

                    <TopicOccurrence Importance="High" Topic="#ts_8"/>

                    <TopicOccurrence Importance="High" Topic="#ts_9"/>

                    <TopicOccurrence Importance="High" Topic="#ts_10"/>

                    <TopicOccurrence Importance="High" Topic="#ts_11"/>

                    <TopicOccurrence Importance="High" Topic="#ts_12"/>

                    <TopicOccurrence Importance="High" Topic="#ts_13"/>

                    <TopicOccurrence Importance="High" Topic="#ts_14"/>

                </DescriptiveMetadata>

                <ContentItem Duid="MTFH09658_2002-07-03_01-40-47_2640656_MAIN1_TEXT_CI">

                    <MediaType FormalName="Text"/>

                    <Format FormalName="XHTML"/>

                    <Characteristics>

                        <Property FormalName="ContentID" Value="urn:newsml:reuters.com:20020703:MTFH09658_2002-07-03_01-40-47_2640656_TXT:1"/>

                        <Property FormalName="ContentCreationDateAndTime" Value="20020703T014047+0000"/>

                        <Property FormalName="USN" Value="2640656"/>

                        <Property FormalName="Creator" Value="RTR_JANUS 2.300"/>

                    </Characteristics>

                    <DataContent>

                        <html xmlns="http://www.w3.org/1999/xhtml">

                            <head>

                                <title/>

                            </head>

                            <body>

                                <p> By Denise Duclaux</p>

                                <p> QARBAGH BAZAAR, Afghanistan, July 3 (Reuters) - Hundreds of  Afghan families live in the dust of crumbled, baked mud homes  in Qarbagh Bazaar, struggling to eke out a living in their  devastated village after four years as refugees in neighbouring  Pakistan.</p>

                                <p> "Everything in our life was here, but we walked away," says  the village leader, Dormad, as he watches U.S. soldiers unload  an army truck packed with boxes of clothes and school supplies.</p>

                                <p> The villagers fled their homes last year, some on foot and  a lucky few by car, as fighting between the ruling Taliban and  the opposition Northern Alliance grew bloodier each day.</p>

                                <p> The men, women and children of Qarbagh Bazaar lived under  trees in Pakistan and barely ate for months until aid  organisations gave them tents and sacks of wheat.</p>

                                <p> "A lot of people left here, it was so crowded on the  roads," Dormad said, speaking through a translator. "We had  nothing."</p>

                                <p> The villagers began straggling back, family by family, in  December after the Northern Alliance ousted the Taliban with  American help.</p>

                                <p> Dormad says the fundamentalist Islamic Taliban imprisoned  and tortured many of the village men during its five-year rule.</p>

                                <p> "When the first villagers arrived (back) they were sad,  because there was nothing left," said Dormad, whose village  stood near the frontline between Taliban and Northern Alliance  fighters. "But then they are happy, because the Taliban were  gone and they are hopeful that everything will be okay."</p>

                                <p> Dormad and his villagers still hold fast to the hope of a  better life, but the path ahead is daunting. The villagers are  jobless as Afghanistan struggles to climb out of the shadow of  more than two decades of war, and their fields are barren after  irrigation systems collapsed under years of neglect.</p>

                                <p> DISTANT DREAMS</p>

                                <p> Running water, electricity and health care are distant  dreams. "We are jobless, without enough food and water," Dormad  said. "We need help."</p>

                                <p> Dormad is pinning his hopes on U.S.-led coalition forces  and non-governmental organisations (NGOs) to help rescue  Qarbagh Bazaar from the clutch of poverty.</p>

                                <p> Soldiers from Bagram Air Base, the staging post for the  coalition forces in Afghanistan, are visiting his village for  the fifth time since the Taliban fell late last year.</p>

                                <p> Soldiers and villagers lug boxes off the army truck,  plunking each down in a small cloud of dust. Winter clothes,  school supplies, sneakers and sheets of plywood form a growing  pile of hope for the villagers.</p>

                                <p> "The distribution is up to you. I just ask that you please  don't sell it," said Major Bryan Cole to Dormad, knowing the  lure of hard cash in this poverty-stricken country.</p>

                                <p> The soldiers and their gifts attract a crowd of men, boys  and brightly dressed girls. Women stay hidden behind the worn  walls of their homes, save for a few huddled yards away beneath  bright blue burqas.</p>

                                <p> The children show little interest in the unopened boxes,  but beg the visitors for pens with the scraps of English they  have learned over the months. "How are you, give me pen," they  chant as they tug on arms and trouser legs.</p>

                                <p> Drawing and writing are a treat for the children, some of  whom have ragged scars on their faces and badly swollen eyes.</p>

                                <p> U.S. bombs levelled the village school, which the American  military said the Taliban had been using as a hideout.</p>

                                <p> The new school is housed in two tents that hold seven  classes a day for different age groups. The teachers, who are  paid nothing, offer classes in maths, history, the Holy Koran  and local languages like Persian and Pashto.</p>

                                <p> DEMINING</p>

                                <p> The children must be careful where they play, for while  some patches of ground are safe, others are deadly. A  mysterious garble of letters and numbers -- HTNO10BAG -- is  spray-painted in white on the walls of village buildings,  ravaged by decades of fierce winds and relentless war.</p>

                                <p> Halo Trust's Unit No. 10 out of Bagram has demined the  area, and the NGO has left its all-clear mark for villagers  looking to resettle and soldiers keeping up their patrol.</p>

                                <p> The U.S. military has helped to dig a new well for Qarbagh  Bazaar. It also plans to build another school for the children  and hopes to help with supplies until next year's growing  season.</p>

                                <p> Worry has replaced the terror that plagued villagers during  the years of civil war, but Dormad said the people of Qarbagh  Bazaar are far from defeated.</p>

                                <p> "It is a hard, hard life," he said. "But it is better than  being in other places like Pakistan. This is our home."  </p>

                            </body>

                        </html>

                    </DataContent>

                </ContentItem>

            </NewsComponent>

        </NewsComponent>

    </NewsItem>

</NewsML>
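
For a file shaped like the sample above, the extraction step could look roughly like the following. This is only a sketch in Python, not .NET: the function names, the choice of fields, and the abridged `SAMPLE` string are illustrative assumptions, not part of the feed specification. The same element paths should carry over to the XPath support in .NET (for example `XmlDocument.SelectSingleNode`), where the XHTML namespace has to be registered as well.

```python
import xml.etree.ElementTree as ET

# The story body is XHTML, so its elements live in this namespace.
XHTML = {"x": "http://www.w3.org/1999/xhtml"}

# Abridged copy of the sample (DOCTYPE and most elements left out).
SAMPLE = """<NewsML>
  <NewsItem>
    <Identification>
      <NewsIdentifier>
        <ProviderId>reuters.com</ProviderId>
        <DateId>20020703</DateId>
        <NewsItemId>MTFH09658_2002-07-03_01-40-47_2640656</NewsItemId>
        <RevisionId Update="N" PreviousRevision="0">1</RevisionId>
      </NewsIdentifier>
    </Identification>
    <NewsComponent>
      <NewsComponent>
        <NewsLines>
          <HeadLine>FEATURE-Afghan villagers rebuild homes from the rubble</HeadLine>
        </NewsLines>
        <ContentItem>
          <DataContent>
            <html xmlns="http://www.w3.org/1999/xhtml"><body>
              <p> By Denise Duclaux</p>
              <p> DISTANT  DREAMS</p>
            </body></html>
          </DataContent>
        </ContentItem>
      </NewsComponent>
    </NewsComponent>
  </NewsItem>
</NewsML>"""


def text_of(parent, path):
    """Return the stripped text at `path`, or None if absent or empty."""
    node = parent.find(path)
    if node is None or node.text is None:
        return None
    return node.text.strip() or None


def parse_news_item(xml_text):
    """Pull a few identifiers, the headline and the body out of one document."""
    root = ET.fromstring(xml_text)
    item = root.find("NewsItem")
    ident = "Identification/NewsIdentifier/"
    revision = text_of(item, ident + "RevisionId")
    # Collapse the runs of whitespace the feed leaves inside each paragraph.
    paragraphs = [
        " ".join("".join(p.itertext()).split())
        for p in item.iterfind(".//x:p", XHTML)
    ]
    return {
        "provider_id": text_of(item, ident + "ProviderId"),
        "date_id": text_of(item, ident + "DateId"),
        "news_item_id": text_of(item, ident + "NewsItemId"),
        "revision_id": int(revision) if revision else None,
        "headline": text_of(item, ".//NewsLines/HeadLine"),
        "body": "\n".join(paragraphs),
    }


record = parse_news_item(SAMPLE)
print(record["news_item_id"], record["headline"])
```

Repeating elements such as `Topic` and `OfInterestTo` are left out here; they would probably map to child tables rather than to columns on the main row.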

Question by:Sreejith22
Accepted Solution

by: Mark Wills (earned 500 total points)
ID: 22730221
Frustrating, I know... There have been a couple of views, and I think more than one person is monitoring.

If you want to process the file via SQL, then I might be able to help. Can you show the database table that will receive the data? Do all tags have to be processed, or just a few identifiers plus the body? Can the server see the file, or can you call a stored procedure and pass the document content as a string (i.e. the stored procedure receives it as an XML datatype parameter, which has a 2 GB limit)?
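
To make the "few identifiers plus the body" option concrete, here is a minimal sketch of the storage step, assuming the fields have already been extracted into a dictionary. The `news_item` table and its columns are hypothetical, and Python's built-in `sqlite3` stands in for SQL Server only so the example runs anywhere; against SQL Server the same parameterized insert would go through whichever driver or stored procedure is in use.

```python
import sqlite3

# Hypothetical table: one row per (item, revision), a few identifiers plus the body.
DDL = """
CREATE TABLE IF NOT EXISTS news_item (
    news_item_id TEXT    NOT NULL,
    revision_id  INTEGER NOT NULL,
    provider_id  TEXT,
    date_id      TEXT,
    headline     TEXT,
    body         TEXT,
    PRIMARY KEY (news_item_id, revision_id)
)
"""

# INSERT OR REPLACE is SQLite syntax; SQL Server would need MERGE or an
# explicit existence check to get the same "re-import is harmless" behaviour.
INSERT = """
INSERT OR REPLACE INTO news_item
    (news_item_id, revision_id, provider_id, date_id, headline, body)
VALUES
    (:news_item_id, :revision_id, :provider_id, :date_id, :headline, :body)
"""


def store(conn, rec):
    """Create the table if needed and write one extracted record."""
    with conn:  # commits on success, rolls back on error
        conn.execute(DDL)
        # Named parameters keep feed text out of the SQL string itself.
        conn.execute(INSERT, rec)


# Values copied from the sample document in the question.
rec = {
    "news_item_id": "MTFH09658_2002-07-03_01-40-47_2640656",
    "revision_id": 1,
    "provider_id": "reuters.com",
    "date_id": "20020703",
    "headline": "FEATURE-Afghan villagers rebuild homes from the rubble",
    "body": "By Denise Duclaux",
}

conn = sqlite3.connect(":memory:")
store(conn, rec)
store(conn, rec)  # importing the same revision twice leaves a single row
row_count = conn.execute("SELECT COUNT(*) FROM news_item").fetchone()[0]
print(row_count)
```

Keying on the item identifier plus the revision number is a design choice made for this sketch, so that a later revision of the same story adds a row instead of overwriting the earlier one.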

Author Comment

by:Sreejith22
ID: 22856061
Thanks Mark,
DONE!