Diksha Bansal

asked on

html to csv

Trying to parse nested HTML via regex in Python and convert it to a CSV file.
wilcoxon

Unless the HTML is very regularly formatted and is guaranteed to stay that way, don't use regex.  You are much better off using a library/module to parse HTML.

If it is regular, guaranteed, and you still want to use regexes, please provide your exact requirements and a sample of the HTML.  Your question is currently too vague to be answered.
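To illustrate why regexes struggle with nested markup, here is a minimal sketch. The sample string is made up for the demonstration and is not the asker's data.

```python
import re

# Hypothetical sample: one <div> nested inside another.
html = "<div>outer <div>inner</div> tail</div>"

# A non-greedy pattern stops at the first closing tag it finds,
# so the match ends in the middle of the outer element.
match = re.search(r"<div>(.*?)</div>", html)
captured = match.group(1)
print(captured)
```

The captured text is cut off at the inner closing tag, and a greedy pattern has the opposite problem when several sibling elements follow each other. A real parser tracks the nesting for you.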
pepr

For the CSV part, use the Python standard csv module (https://docs.python.org/3/library/csv.html).
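As a rough sketch of the CSV side, assuming the rows have already been extracted into lists of strings (the sample rows below are invented):

```python
import csv
import io

# Hypothetical rows; in practice they would come from the HTML parser.
rows = [
    ["name", "note"],
    ["Alice", 'likes commas, and "quotes"'],
    ["Bob", "plain"],
]

# An in-memory buffer keeps the example self-contained. For a real file,
# use open("out.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)  # quoting and escaping are handled by the module

csv_text = buffer.getvalue()
print(csv_text)
```

The point of using the module rather than joining strings with commas is that fields containing commas, quotes, or newlines are quoted correctly.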

For parsing HTML, it depends on the quality of the HTML source. Beautiful Soup has long been an excellent tool for extracting information even from mangled HTML. The implementation has changed since I last used it, but the project is still alive (https://www.crummy.com/software/BeautifulSoup/bs4/doc/, https://pypi.org/project/beautifulsoup4/).

If the HTML is well formed (that is, valid XML), it should be possible to use the standard xml.etree.ElementTree module (https://docs.python.org/3/library/xml.etree.elementtree.html). The third-party lxml module (https://lxml.de/parsing.html#parsing-html) offers a compatible API and also has an HTML parser that tolerates broken markup.
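A minimal sketch of the well-formed case, using only the standard library's ElementTree. The table below is an invented sample and must be valid XML for this to work; real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample with a nested inline tag in one cell.
xhtml = """<table>
  <tr><th>name</th><th>qty</th></tr>
  <tr><td><b>apple</b></td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>"""

root = ET.fromstring(xhtml)

# itertext() walks into nested elements, so <b>apple</b> still yields "apple".
rows = [
    ["".join(cell.itertext()).strip() for cell in tr]
    for tr in root.iter("tr")
]
print(rows)
```

If the input has unclosed tags or undeclared entities such as `&nbsp;`, `ET.fromstring` raises a `ParseError`, which is the signal to switch to an HTML-aware parser.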

There is also the standard module html.parser (https://docs.python.org/3/library/html.parser.html#module-html.parser). It may look too complicated at first, but it is worth learning. :)
Better to just install html2text, then parse its output into your .csv file.
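Here is one way html.parser could be combined with the csv module. It is only a sketch: the `TableToRows` class and the sample markup are made up, and it assumes a single table with properly closed cells and no tables nested inside cells.

```python
import csv
import io
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested inline tags (<b>, <a>, ...) also lands here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample input.
html = """
<table>
  <tr><th>name</th><th>qty</th></tr>
  <tr><td><b>apple</b>, red</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
"""

parser = TableToRows()
parser.feed(html)
parser.close()

buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The parser only reports events (start tag, data, end tag), so the class has to remember where it is in the document. That is the part that looks complicated at first, but it gives full control over which nested elements contribute to each CSV field.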

Most Linux distributions package a recent version of html2text, and many package the latest version.

If you're using Windows, you'll have to port html2text or find an existing port.
Yes, please provide more details on exactly what you are wanting to do.

I'm not a Python expert, but HTML should be valid XML, and it looks like XML to CSV is pretty straightforward in Python:
http://blog.appliedinformaticsinc.com/how-to-parse-and-convert-xml-to-csv-using-python/
Actually only XHTML is XML-compliant.  I'm 95% sure no other version of HTML is XML-compliant.
It may not be 100% XML-compliant, but depending on exactly what they want to extract to CSV, it may be close enough to be parsed as XML.
Do you still need help on this topic?
Provide a data sample and we will be able to help. I usually do that with sed.