LetsLearn
asked on
Large XML file extraction
I have some very large XML files which are provided from an external source. I need to search through the files to isolate certain lines based on search criteria, eg ABNStatusFromDate="2018010 1", <State>ACT</State> and <EntityTypeText>Australian Private Company</EntityTypeText>.
I have tried opening the files in Excel and the Windows XML Notepad reader. They crash due to the size. I can open them in the FirstObject reader very quickly and I can search one by one by each date and then double check to see if the state and entity type matches the two above criteria. But this will obviously take forever.
Is there a way I can do this quickly and extract by date and matching criteria into a smaller XML file or Excel?
I won't attach the file because of the size. Below are a couple of lines from the XML file.
Many thanks for any assistance which can be given.
<ABR recordLastUpdatedDate="201 71121" replaced="N"><ABN status="ACT" ABNStatusFromDate="2017110 9">1462274 5297</ABN> <EntityTyp e><EntityT ypeInd>PRV </EntityTy peInd><Ent ityTypeTex t>Australi an Private Company</EntityTypeText></ EntityType ><MainEnti ty><NonInd ividualNam e type="MN"><NonIndividualNa meText>FUM AO PROPERTY PTY LTD</NonIndividualNameText ></NonIndi vidualName ><Business Address><A ddressDeta ils><State >VIC</Stat e><Postcod e>3149</Po stcode></A ddressDeta ils></Busi nessAddres s></MainEn tity><ASIC Number ASICNumberType="undetermin ed">622745 297</ASICN umber><GST status="ACT" GSTStatusFromDate="2017110 9" /></ABR>
<ABR recordLastUpdatedDate="200 00701" replaced="N"><ABN status="CAN" ABNStatusFromDate="2000022 8">1462274 6156</ABN> <EntityTyp e><EntityT ypeInd>CUT </EntityTy peInd><Ent ityTypeTex t>Corporat e Unit Trust</EntityTypeText></En tityType>< MainEntity ><NonIndiv idualName type="MN"><NonIndividualNa meText>THE JOKA UNIT TRUST</NonIndividualNameTe xt></NonIn dividualNa me><Busine ssAddress> <AddressDe tails><Sta te>NSW</St ate><Postc ode>2320</ Postcode>< /AddressDe tails></Bu sinessAddr ess></Main Entity></A BR>
<ABR recordLastUpdatedDate="201 50202" replaced="N"><ABN status="CAN" ABNStatusFromDate="2015020 2">1462274 6237</ABN> <EntityTyp e><EntityT ypeInd>IND </EntityTy peInd><Ent ityTypeTex t>Individu al/Sole Trader</EntityTypeText></E ntityType> <LegalEnti ty><Indivi dualName type="LGL"><GivenName>BOUA SRI</Given Name><Fami lyName>VIL LAROSA</Fa milyName>< /Individua lName><Bus inessAddre ss><Addres sDetails>< State>NSW< /State><Po stcode>256 0</Postcod e></Addres sDetails>< /BusinessA ddress></L egalEntity ><OtherEnt ity><NonIn dividualNa me type="TRD"><NonIndividualN ameText>Vi llarosa Photography</NonIndividual NameText>< /NonIndivi dualName>< /OtherEnti ty></ABR>
<ABR recordLastUpdatedDate="201 71109" replaced="N"><ABN status="ACT" ABNStatusFromDate="2017110 9">1462274 9795</ABN> <EntityTyp e><EntityT ypeInd>PRV </EntityTy peInd><Ent ityTypeTex t>Australi an Private Company</EntityTypeText></ EntityType ><MainEnti ty><NonInd ividualNam e type="MN"><NonIndividualNa meText>3 DU YOGA PTY LTD</NonIndividualNameText ></NonIndi vidualName ><Business Address><A ddressDeta ils><State >NSW</Stat e><Postcod e>2150</Po stcode></A ddressDeta ils></Busi nessAddres s></MainEn tity><ASIC Number ASICNumberType="undetermin ed">622749 795</ASICN umber></AB R>
<ABR recordLastUpdatedDate="200 30601" replaced="N"><ABN status="CAN" ABNStatusFromDate="2003060 1">1462275 0886</ABN> <EntityTyp e><EntityT ypeInd>IND </EntityTy peInd><Ent ityTypeTex t>Individu al/Sole Trader</EntityTypeText></E ntityType> <LegalEnti ty><Indivi dualName type="LGL"><NameTitle>MS</ NameTitle> <GivenName >JANETTE</ GivenName> <GivenName >ROSEMARY< /GivenName ><FamilyNa me>CAMUGLI A</FamilyN ame></Indi vidualName ><Business Address><A ddressDeta ils><State >QLD</Stat e><Postcod e>4059</Po stcode></A ddressDeta ils></Busi nessAddres s></LegalE ntity></AB R>
I have tried opening the files in Excel and the Windows XML Notepad reader. They crash due to the size. I can open them in the FirstObject reader very quickly and I can search one by one by each date and then double check to see if the state and entity type matches the two above criteria. But this will obviously take forever.
Is there a way I can do this quickly and extract by date and matching criteria into a smaller XML file or Excel?
I won't attach the file because of the size. Below are a couple of lines from the XML file.
Many thanks for any assistance which can be given.
<ABR recordLastUpdatedDate="201
<ABR recordLastUpdatedDate="200
<ABR recordLastUpdatedDate="201
<ABR recordLastUpdatedDate="201
<ABR recordLastUpdatedDate="200
ASKER
I didn't think of doing that Ryan. Yes that would be great.
something like this:
declare @Output VARCHAR(MAX)
set @Output = '
<ABR recordLastUpdatedDate="20171121" replaced="N">
<ABN status="ACT" ABNStatusFromDate="20171109">14622745297</ABN>
<EntityType>
<EntityTypeInd>PRV</EntityTypeInd>
<EntityTypeText>Australian Private Company</EntityTypeText>
</EntityType>
<MainEntity>
<NonIndividualName type="MN">
<NonIndividualNameText>FUMAO PROPERTY PTY LTD</NonIndividualNameText>
</NonIndividualName>
<BusinessAddress>
<AddressDetails>
<State>VIC</State>
<Postcode>3149</Postcode>
</AddressDetails>
</BusinessAddress>
</MainEntity>
<ASICNumber ASICNumberType="undetermined">622745297</ASICNumber>
<GST status="ACT" GSTStatusFromDate="20171109" />
</ABR>
<ABR recordLastUpdatedDate="20000701" replaced="N">
<ABN status="CAN" ABNStatusFromDate="20000228">14622746156</ABN>
<EntityType>
<EntityTypeInd>CUT</EntityTypeInd>
<EntityTypeText>Corporate Unit Trust</EntityTypeText>
</EntityType>
<MainEntity>
<NonIndividualName type="MN">
<NonIndividualNameText>THE JOKA UNIT TRUST</NonIndividualNameText>
</NonIndividualName>
<BusinessAddress>
<AddressDetails>
<State>NSW</State>
<Postcode>2320</Postcode>
</AddressDetails>
</BusinessAddress>
</MainEntity>
</ABR>
<ABR recordLastUpdatedDate="20150202" replaced="N">
<ABN status="CAN" ABNStatusFromDate="20150202">14622746237</ABN>
<EntityType>
<EntityTypeInd>IND</EntityTypeInd>
<EntityTypeText>Individual/Sole Trader</EntityTypeText>
</EntityType>
<LegalEntity>
<IndividualName type="LGL">
<GivenName>BOUASRI</GivenName>
<FamilyName>VILLAROSA</FamilyName>
</IndividualName>
<BusinessAddress>
<AddressDetails>
<State>NSW</State>
<Postcode>2560</Postcode>
</AddressDetails>
</BusinessAddress>
</LegalEntity>
<OtherEntity>
<NonIndividualName type="TRD">
<NonIndividualNameText>Villarosa Photography</NonIndividualNameText>
</NonIndividualName>
</OtherEntity>
</ABR>
<ABR recordLastUpdatedDate="20171109" replaced="N">
<ABN status="ACT" ABNStatusFromDate="20171109">14622749795</ABN>
<EntityType>
<EntityTypeInd>PRV</EntityTypeInd>
<EntityTypeText>Australian Private Company</EntityTypeText>
</EntityType>
<MainEntity>
<NonIndividualName type="MN">
<NonIndividualNameText>3 DU YOGA PTY LTD</NonIndividualNameText>
</NonIndividualName>
<BusinessAddress>
<AddressDetails>
<State>NSW</State>
<Postcode>2150</Postcode>
</AddressDetails>
</BusinessAddress>
</MainEntity>
<ASICNumber ASICNumberType="undetermined">622749795</ASICNumber>
</ABR>
<ABR recordLastUpdatedDate="20030601" replaced="N">
<ABN status="CAN" ABNStatusFromDate="20030601">14622750886</ABN>
<EntityType>
<EntityTypeInd>IND</EntityTypeInd>
<EntityTypeText>Individual/Sole Trader</EntityTypeText>
</EntityType>
<LegalEntity>
<IndividualName type="LGL">
<NameTitle>MS</NameTitle>
<GivenName>JANETTE</GivenName>
<GivenName>ROSEMARY</GivenName>
<FamilyName>CAMUGLIA</FamilyName>
</IndividualName>
<BusinessAddress>
<AddressDetails>
<State>QLD</State>
<Postcode>4059</Postcode>
</AddressDetails>
</BusinessAddress>
</LegalEntity>
</ABR>
'
DECLARE @XML XML
SET @XML = CONVERT (XML, @Output)
SELECT
c.value('(./@recordLastUpdatedDate)', 'VARCHAR(MAX)') recordLastUpdatedDate,
c.value('(./@replaced)', 'VARCHAR(MAX)') replaced,
c.value('ABN[1]', 'nvarchar(max)') ABN_Text,
c.value('(ABN/@status)[1]', 'VARCHAR(MAX)') ABN_status,
c.value('(ABN/@ABNStatusFromDate)[1]', 'VARCHAR(MAX)') ABNStatusFromDate,
c.value('(EntityType/EntityTypeInd)[1]', 'nvarchar(max)') EntityTypeInd,
c.value('(EntityType/EntityTypeText)[1]', 'nvarchar(max)') EntityTypeText
-- other fields ...
FROM @XML.nodes('//ABR') t(c)
ASKER
Many thanks for doing this Ryan, I'll test it and let you know how I get on.
ASKER
Hi Ryan,
I was thinking of using this in MS Access as the files open reasonably quickly. Could I trouble you for one extra piece of help, how would I run this in Access? It's been a while since I have done anything with modules, classes and queries in Access.
Thanks in advance,
Cathryn
I was thinking of using this in MS Access as the files open reasonably quickly. Could I trouble you for one extra piece of help, how would I run this in Access? It's been a while since I have done anything with modules, classes and queries in Access.
Thanks in advance,
Cathryn
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Thanks Ryan, I'm closing this as the best solution.
I might ask a similar question in the Microsoft Access area as well as it is a program I am more familiar with.
Thank you for your help.
I might ask a similar question in the Microsoft Access area as well as it is a program I am more familiar with.
Thank you for your help.
hope that make sense?