Stanley Lai asked:

Web scraping using lxml under Python to extract data from xml

Hi,

I have been given a task that is completely new to me.  My boss asked me to grab data from an xml file which is sent to us on a daily basis.  I'm completely new to xml.  The web scraping tools I found on the internet seem relevant to the task, and I also found that lxml with Python may help in my case.

Are there any short and concise notes or materials (not as detailed as an encyclopedia) that come with sample Python code (at least a code skeleton) demonstrating how to scrape xml data using the lxml library under Python?

Or, if other libraries can do the job better than lxml under Python, those are also welcome.  But currently, due to licensing issues, I can only use Python or Excel VBA as the programming language.

Kindly please help.

Cheers!
Stanley
Norie

Stanley

Where are these files stored?

What do you need to 'scrape' from them?

Could you upload a sample file?
You'll use curl or wget to download the file, then attach it to an email and send the email to a notification list of people.

No scraping involved, just a download.
Stanley Lai (ASKER)

Hi Norie,

The xml I mentioned is sent to us via an internal private network, not via the internet or a specific web site.

I have attached a sample data file from my boss.  This sample xml is missing the key and authentication elements at the footer of the xml, but the main point is to extract the data elements within it.  Of course, the simplest way to extract the data is to use MS Excel, but my boss prefers to use a program to extract it.  Any programming language will do, but the point is that he does not want to pay a single penny for the programming tools.  Therefore, MS Excel VBA (already on every PC except the server), Python, R, etc. are the ideal choices.

Actually, I have no idea how to do it, but I am in no position to refuse.  lxml with Python is just a solution I found from an internet search; it does not have to be the final solution.

Cheers
Stanley
Test-FPS.xml
Hi David,

The xml I mentioned is sent to us via an internal private network, not via the internet or a specific web site.  I have no problem getting the file at all.  As far as I know, this file reaches a network folder in my company at a certain time every business day.  Therefore, I have no worries about how to get it.

My point is just to extract data from the xml file and present the extracted data in a table format.  The result can be placed in an Excel spreadsheet or in a csv text file.

My boss prefers that I do not use Excel to import the xml file and has asked me to use a formal programming language to extract the data.  Currently, I have no idea which tools I should use, and the point is that the tools should be free of charge.

Attached is a sample xml that I got from my boss.

Cheers
Stanley
Test-FPS.xml
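
For the extract-to-table step described in the post above, a minimal sketch using only Python's standard library (so nothing to pay for) might look like the following. The structure of Test-FPS.xml is not shown in this thread, so the element names (`batch`, `record`, `id`, `name`, `amount`) and the sample data are purely hypothetical and would need to be replaced with the real ones.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real Test-FPS.xml layout is not shown in the thread.
SAMPLE_XML = """<?xml version="1.0"?>
<batch>
  <record><id>1</id><name>Alpha</name><amount>10.50</amount></record>
  <record><id>2</id><name>Beta</name><amount>7.25</amount></record>
</batch>"""

FIELDS = ["id", "name", "amount"]  # assumed child element names


def xml_to_rows(xml_text, record_tag="record", fields=FIELDS):
    """Return one dict per record element; missing children become empty strings."""
    root = ET.fromstring(xml_text)
    rows = []
    for rec in root.iter(record_tag):
        rows.append({f: (rec.findtext(f) or "").strip() for f in fields})
    return rows


def rows_to_csv(rows, fields=FIELDS):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = xml_to_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

For a file on a network folder, `ET.parse(path).getroot()` can replace `ET.fromstring(...)`, and the CSV text can be written with `open(out_path, "w", newline="")`. The resulting csv file opens directly in Excel.
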
Can anyone help with the captioned issue?  I need to extract data from an existing xml file.  Any free tools can do, not fixed to use Python or the mentioned library.

Stanley
Any free tools can do, not fixed to use Python or the mentioned library.
I have not really done this in Python yet, but I have done XML extraction using XPath in MSSQL.

Would it be worth a try?

In general:

Importing and Processing data from XML files into SQL Server tables
https://www.mssqltips.com/sqlservertip/2899/importing-and-processing-data-from-xml-files-into-sql-server-tables/

XQUERY,XPATH,XMLSCHEMA,XML INDEX
https://www.allaboutmssql.com/2012/09/xqueryxpathxmlschemaxml-index_6.html
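
The XPath idea also carries over to Python. Below is a minimal sketch, assuming hypothetical `payment` elements with a `status` attribute and `ref` / `amount` children (the real element names in the asker's file are not shown in this thread). It imports lxml when it is installed and otherwise falls back to the standard library's ElementTree, and it only uses `findall` / `findtext`, which both libraries provide.

```python
try:
    from lxml import etree  # third-party library, used when installed
except ImportError:  # fall back to the standard library if lxml is absent
    import xml.etree.ElementTree as etree

# Hypothetical sample data, for illustration only.
SAMPLE = b"""<batch>
  <payment status="ok"><ref>A1</ref><amount>100.00</amount></payment>
  <payment status="failed"><ref>B2</ref><amount>55.10</amount></payment>
  <payment status="ok"><ref>C3</ref><amount>20.00</amount></payment>
</batch>"""

root = etree.fromstring(SAMPLE)

# Path expression with an attribute predicate: only payments marked "ok".
ok_payments = root.findall(".//payment[@status='ok']")

# Pull the text of child elements out of each matching payment.
refs = [p.findtext("ref") for p in ok_payments]
total = sum(float(p.findtext("amount")) for p in ok_payments)

print(refs, total)
```

With lxml installed, `root.xpath(...)` is also available and accepts full XPath 1.0 expressions; the standard library supports only the limited subset used here.
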
Hi Ryan,

Thanks a lot.

Let me have a look at the material first, as I'm really new to xml.  XPath is also new to me, but I'm fine with using SQL Server.

May take some time to review.  Once done, will update this case.  Thanks ^^

Cheers
Stanley
ASKER CERTIFIED SOLUTION
Ryan Chong
This solution is only available to members.
Thanks all.  Seems I need some more time to digest all the materials provided before I can start to work on this project.  Thanks a lot ^^

Stanley