
html to csv

Trying to parse nested HTML via regex in Python and convert it to a CSV file.

Unless the html is very regularly formatted and is guaranteed to always follow that, don't use regex. You are much better off using a library/module to parse HTML.

If it is regular, guaranteed, and you still want to use regexes, please provide your exact requirements and a sample of the html. Your question is currently too vague to be answered.

pepr

For the CSV part, use the Python standard csv module (https://docs.python.org/3/library/csv.html).
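To illustrate the CSV part, here is a minimal sketch using only the standard library. The rows and the helper name rows_to_csv are made up for the example; in practice the rows would come from whatever you extract from the HTML.

```python
import csv
import io

# Hypothetical rows, as they might come out of an HTML table.
sample_rows = [
    ["name", "comment"],
    ["Alice", 'said "hi", then left'],
    ["Bob", "plain text"],
]


def rows_to_csv(rows):
    """Serialize a list of rows to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    # csv.writer takes care of quoting commas and double quotes.
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

To write to a real file instead, open it with open("out.csv", "w", newline="") and pass the file object to csv.writer, as the csv documentation recommends.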

For parsing HTML, it depends on the quality of the HTML source. Beautiful Soup has long been an excellent tool for extracting info even from mangled HTML. I know the implementation has changed since then, but the project is still alive (https://www.crummy.com/software/BeautifulSoup/bs4/doc/, https://pypi.org/project/beautifulsoup4/).

If the HTML is well formed, it should be possible to use the third-party lxml module (https://lxml.de/parsing.html#parsing-html), whose API largely follows that of the standard library's xml.etree.ElementTree module (https://docs.python.org/3/library/xml.etree.elementtree.html).

There is also the standard module html.parser (https://docs.python.org/3/library/html.parser.html#module-html.parser). It may look too complicated at first, but it is worth learning. :)
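As a rough idea of what html.parser looks like in use, here is a sketch that collects table cells into rows and writes them out as CSV. The class name TableRowParser and the sample HTML are invented for the example, and it assumes a single, non-nested table whose cells are explicitly closed; real pages will likely need more handling.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of <td>/<th> cells, one list per <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside inline tags such as <b> still arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample input.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>city</th></tr>
  <tr><td>Alice</td><td><b>Paris</b>, France</td></tr>
  <tr><td>Bob</td><td>Oslo</td></tr>
</table>
"""

table_parser = TableRowParser()
table_parser.feed(SAMPLE_HTML)
table_parser.close()

out = io.StringIO()
csv.writer(out).writerows(table_parser.rows)
table_csv = out.getvalue()
print(table_csv)
```

The parser is event driven: you override the handle_* methods and keep your own state, which is why it feels complicated at first.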

David Favor

Better to just install html2text, then parse its output into your .csv file.

Nearly every Linux distro packages a recent version of html2text, and most package the latest version.

If you're using Windows, you'll have to port html2text or find an existing port.

slightwv (Netminder)

Yes, please provide more details on exactly what you are wanting to do.

I'm not a Python expert, but HTML should be valid XML, and it looks like XML to CSV is pretty straightforward in Python:
http://blog.appliedinformaticsinc.com/how-to-parse-and-convert-xml-to-csv-using-python/

wilcoxon

Actually only XHTML is XML-compliant. I'm 95% sure no other version of HTML is XML-compliant.

slightwv (Netminder)

It may not be 100% XML compliant, but depending on exactly what they want to extract to CSV, it may be close enough to be parsed as XML.
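A quick way to check whether a given page is "close enough" is to hand it to the standard xml.etree.ElementTree parser and see if it raises. The sketch below uses made-up markup and a hypothetical helper xml_table_to_rows; it assumes no XML namespace on the elements (real XHTML usually declares one, and the tag names would then need the namespace prefix).

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample without a namespace.
XHTML_SAMPLE = """<table>
  <tr><td>Alice</td><td>Paris</td></tr>
  <tr><td>Bob</td><td>Oslo</td></tr>
</table>"""


def xml_table_to_rows(markup):
    """Parse markup as XML and return the cell text per <tr> (hypothetical helper)."""
    root = ET.fromstring(markup)
    return [
        ["".join(cell.itertext()).strip() for cell in tr.findall("td")]
        for tr in root.iter("tr")
    ]


xml_rows = xml_table_to_rows(XHTML_SAMPLE)
print(xml_rows)

# Ordinary HTML with an unclosed <br> is not well-formed XML,
# so the XML parser rejects it instead of guessing.
try:
    xml_table_to_rows("<table><tr><td>a<br>b</td></tr></table>")
    parsed_as_xml = True
except ET.ParseError:
    parsed_as_xml = False
print(parsed_as_xml)
```

If the second case is what your input looks like, an HTML parser is the safer route.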

Louis LIETAER

Do you still need help on this topic?

skullnobrains

Provide a data sample and we will be able to help. I usually do that with sed.
