• Status: Solved
  • Priority: Medium
  • Security: Private
  • Views: 43
  • Last Modified:

Large XML file extraction

I have some very large XML files which are provided from an external source. I need to search through the files to isolate certain lines based on search criteria, eg ABNStatusFromDate="20180101", <State>ACT</State> and <EntityTypeText>Australian Private Company</EntityTypeText>.

I have tried opening the files in Excel and the Windows XML Notepad reader. They crash due to the size. I can open them in the FirstObject reader very quickly and I can search one by one by each date and then double check to see if the state and entity type matches the two above criteria. But this will obviously take forever.

Is there a way I can do this quickly and extract by date and matching criteria into a smaller XML file or Excel?

I won't attach the file because of the size. Below are a couple of lines from the XML file.

Many thanks for any assistance which can be given.

<ABR recordLastUpdatedDate="20171121" replaced="N"><ABN status="ACT" ABNStatusFromDate="20171109">14622745297</ABN><EntityType><EntityTypeInd>PRV</EntityTypeInd><EntityTypeText>Australian Private Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>FUMAO PROPERTY PTY LTD</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>VIC</State><Postcode>3149</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">622745297</ASICNumber><GST status="ACT" GSTStatusFromDate="20171109" /></ABR>
<ABR recordLastUpdatedDate="20000701" replaced="N"><ABN status="CAN" ABNStatusFromDate="20000228">14622746156</ABN><EntityType><EntityTypeInd>CUT</EntityTypeInd><EntityTypeText>Corporate Unit Trust</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>THE JOKA UNIT TRUST</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2320</Postcode></AddressDetails></BusinessAddress></MainEntity></ABR>
<ABR recordLastUpdatedDate="20150202" replaced="N"><ABN status="CAN" ABNStatusFromDate="20150202">14622746237</ABN><EntityType><EntityTypeInd>IND</EntityTypeInd><EntityTypeText>Individual/Sole Trader</EntityTypeText></EntityType><LegalEntity><IndividualName type="LGL"><GivenName>BOUASRI</GivenName><FamilyName>VILLAROSA</FamilyName></IndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2560</Postcode></AddressDetails></BusinessAddress></LegalEntity><OtherEntity><NonIndividualName type="TRD"><NonIndividualNameText>Villarosa Photography</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>
<ABR recordLastUpdatedDate="20171109" replaced="N"><ABN status="ACT" ABNStatusFromDate="20171109">14622749795</ABN><EntityType><EntityTypeInd>PRV</EntityTypeInd><EntityTypeText>Australian Private Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>3 DU YOGA PTY LTD</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2150</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">622749795</ASICNumber></ABR>
<ABR recordLastUpdatedDate="20030601" replaced="N"><ABN status="CAN" ABNStatusFromDate="20030601">14622750886</ABN><EntityType><EntityTypeInd>IND</EntityTypeInd><EntityTypeText>Individual/Sole Trader</EntityTypeText></EntityType><LegalEntity><IndividualName type="LGL"><NameTitle>MS</NameTitle><GivenName>JANETTE</GivenName><GivenName>ROSEMARY</GivenName><FamilyName>CAMUGLIA</FamilyName></IndividualName><BusinessAddress><AddressDetails><State>QLD</State><Postcode>4059</Postcode></AddressDetails></BusinessAddress></LegalEntity></ABR>
0
LetsLearn
Asked:
LetsLearn
  • 4
  • 3
1 Solution
 
Ryan ChongCommented:
Is there a way I can do this quickly and extract by date and matching criteria into a smaller XML file or Excel?
not sure about loading XML to Excel, but certainly it can be loaded into MS SQL, from there we probably can do a query thing to filter what you want or what you don't want.

hope that make sense?
0
 
LetsLearnAuthor Commented:
I didn't think of doing that Ryan. Yes that would be great.
0
 
Ryan ChongCommented:
something like this:

declare @Output VARCHAR(MAX)
set @Output = '
<ABR recordLastUpdatedDate="20171121" replaced="N">
	<ABN status="ACT" ABNStatusFromDate="20171109">14622745297</ABN>
	<EntityType>
		<EntityTypeInd>PRV</EntityTypeInd>
		<EntityTypeText>Australian Private Company</EntityTypeText>
	</EntityType>
	<MainEntity>
		<NonIndividualName type="MN">
			<NonIndividualNameText>FUMAO PROPERTY PTY LTD</NonIndividualNameText>
		</NonIndividualName>
		<BusinessAddress>
			<AddressDetails>
				<State>VIC</State>
				<Postcode>3149</Postcode>
			</AddressDetails>
		</BusinessAddress>
	</MainEntity>
	<ASICNumber ASICNumberType="undetermined">622745297</ASICNumber>
	<GST status="ACT" GSTStatusFromDate="20171109" />
</ABR>
<ABR recordLastUpdatedDate="20000701" replaced="N">
	<ABN status="CAN" ABNStatusFromDate="20000228">14622746156</ABN>
	<EntityType>
		<EntityTypeInd>CUT</EntityTypeInd>
		<EntityTypeText>Corporate Unit Trust</EntityTypeText>
	</EntityType>
	<MainEntity>
		<NonIndividualName type="MN">
			<NonIndividualNameText>THE JOKA UNIT TRUST</NonIndividualNameText>
		</NonIndividualName>
		<BusinessAddress>
			<AddressDetails>
				<State>NSW</State>
				<Postcode>2320</Postcode>
			</AddressDetails>
		</BusinessAddress>
	</MainEntity>
</ABR>
<ABR recordLastUpdatedDate="20150202" replaced="N">
	<ABN status="CAN" ABNStatusFromDate="20150202">14622746237</ABN>
	<EntityType>
		<EntityTypeInd>IND</EntityTypeInd>
		<EntityTypeText>Individual/Sole Trader</EntityTypeText>
	</EntityType>
	<LegalEntity>
		<IndividualName type="LGL">
			<GivenName>BOUASRI</GivenName>
			<FamilyName>VILLAROSA</FamilyName>
		</IndividualName>
		<BusinessAddress>
			<AddressDetails>
				<State>NSW</State>
				<Postcode>2560</Postcode>
			</AddressDetails>
		</BusinessAddress>
	</LegalEntity>
	<OtherEntity>
		<NonIndividualName type="TRD">
			<NonIndividualNameText>Villarosa Photography</NonIndividualNameText>
		</NonIndividualName>
	</OtherEntity>
</ABR>
<ABR recordLastUpdatedDate="20171109" replaced="N">
	<ABN status="ACT" ABNStatusFromDate="20171109">14622749795</ABN>
	<EntityType>
		<EntityTypeInd>PRV</EntityTypeInd>
		<EntityTypeText>Australian Private Company</EntityTypeText>
	</EntityType>
	<MainEntity>
		<NonIndividualName type="MN">
			<NonIndividualNameText>3 DU YOGA PTY LTD</NonIndividualNameText>
		</NonIndividualName>
		<BusinessAddress>
			<AddressDetails>
				<State>NSW</State>
				<Postcode>2150</Postcode>
			</AddressDetails>
		</BusinessAddress>
	</MainEntity>
	<ASICNumber ASICNumberType="undetermined">622749795</ASICNumber>
</ABR>
<ABR recordLastUpdatedDate="20030601" replaced="N">
	<ABN status="CAN" ABNStatusFromDate="20030601">14622750886</ABN>
	<EntityType>
		<EntityTypeInd>IND</EntityTypeInd>
		<EntityTypeText>Individual/Sole Trader</EntityTypeText>
	</EntityType>
	<LegalEntity>
		<IndividualName type="LGL">
			<NameTitle>MS</NameTitle>
			<GivenName>JANETTE</GivenName>
			<GivenName>ROSEMARY</GivenName>
			<FamilyName>CAMUGLIA</FamilyName>
		</IndividualName>
		<BusinessAddress>
			<AddressDetails>
				<State>QLD</State>
				<Postcode>4059</Postcode>
			</AddressDetails>
		</BusinessAddress>
	</LegalEntity>
</ABR>
'

DECLARE @XML XML

SET @XML = CONVERT (XML, @Output)
	
SELECT 
	c.value('(./@recordLastUpdatedDate)', 'VARCHAR(MAX)') recordLastUpdatedDate,
	c.value('(./@replaced)', 'VARCHAR(MAX)') replaced,
	c.value('ABN[1]', 'nvarchar(max)') ABN_Text,
	c.value('(ABN/@status)[1]', 'VARCHAR(MAX)') ABN_status,
	c.value('(ABN/@ABNStatusFromDate)[1]', 'VARCHAR(MAX)') ABNStatusFromDate,
	c.value('(EntityType/EntityTypeInd)[1]', 'nvarchar(max)') EntityTypeInd,
	c.value('(EntityType/EntityTypeText)[1]', 'nvarchar(max)') EntityTypeText
	-- other fields ...
FROM @XML.nodes('//ABR') t(c)

Open in new window


SnapShot.png
0
Free Tool: SSL Checker

Scans your site and returns information about your SSL implementation and certificate. Helpful for debugging and validating your SSL configuration.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

 
LetsLearnAuthor Commented:
Many thanks for doing this Ryan, I'll test it and let you know how I get on.
0
 
LetsLearnAuthor Commented:
Hi Ryan,

I was thinking of using this in MS Access as the files open reasonably quickly. Could I trouble you for one extra piece of help, how would I run this in Access? It's been a while since I have done anything with modules, classes and queries in Access.

Thanks in advance,
Cathryn
0
 
Ryan ChongCommented:
you probably can try the existing Access wizard to import the XML file:

Import XML File In Access 2010
https://www.addictivetips.com/microsoft-office/import-xml-file-in-access-2010/

but personally seldom using this feature before.

see if you can achieve that using the Wizard mentioned in article above?

anyway, still think MS SQL should be pretty fast as well but just that you need to familiar yourself with the SQL scripts I have posted above, which could be the more difficult part.
0
 
LetsLearnAuthor Commented:
Thanks Ryan, I'm closing this as the best solution.

I might ask a similar question in the Microsoft Access area as well as it is a program I am more familiar with.

Thank you for your help.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Featured Post

Cloud Class® Course: Python 3 Fundamentals

This course will teach participants about installing and configuring Python, syntax, importing, statements, types, strings, booleans, files, lists, tuples, comprehensions, functions, and classes.

  • 4
  • 3
Tackle projects and never again get stuck behind a technical roadblock.
Join Now