Diksha Bansal
asked on
html to csv
Trying to parse nested HTML via regex in Python and convert it to a CSV file.
For the CSV part, use the Python standard csv module (https://docs.python.org/3/library/csv.html).
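As a minimal sketch of the CSV side only, the snippet below writes a header and a few rows with csv.writer. The function name rows_to_csv and the sample data are made up for illustration; the point is that the module takes care of quoting (for example a comma inside a field), so there is no need to build the lines by hand.

```python
import csv
import io

def rows_to_csv(rows, header):
    """Serialize a header and a list of rows to CSV text."""
    # newline="" lets the csv module control the line endings itself.
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample data; the second name contains a comma on purpose.
csv_text = rows_to_csv([["Alice", "30"], ["Bob, Jr.", "25"]], ["name", "age"])
print(csv_text)
```

To write to a real file instead, open it with open(path, "w", newline="") and pass the file object to csv.writer in the same way.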
For parsing HTML, it depends on the quality of the HTML source. In the past, Beautiful Soup was an excellent tool for extracting information even from mangled HTML. I know the implementation has changed since then, but the project is still alive (https://www.crummy.com/software/BeautifulSoup/bs4/doc/, https://pypi.org/project/beautifulsoup4/).
If the HTML is well-formed, it should be possible to use the lxml module (https://lxml.de/parsing.html#parsing-html). lxml is a third-party package, but its API is largely compatible with the standard xml.etree.ElementTree module (https://docs.python.org/3/library/xml.etree.elementtree.html).
There is also the standard module html.parser (https://docs.python.org/3/library/html.parser.html#module-html.parser). It may look too complicated at first, but it is worth learning. :)
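To give an idea of what html.parser looks like in practice, here is a rough sketch that assumes the data sits in table rows (the question does not say so, this is only a guess). The class name TableRowParser and the sample markup are hypothetical. It collects the text of each td/th cell, including text inside nested tags, grouped by tr.

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text fragments of the open cell, None outside a cell

    def _close_cell(self):
        # Store the open cell (if any) in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td> before the next cell
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (<b>, <a>, ...) also ends up here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None

def extract_rows(html):
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows

# Hypothetical sample markup with a nested tag inside a cell.
sample = ("<table><tr><th>name</th><th>age</th></tr>"
          "<tr><td><b>Alice</b></td><td>30</td></tr></table>")
rows = extract_rows(sample)
print(rows)
```

The resulting list of rows can be passed straight to csv.writer. Tables nested inside cells would need extra bookkeeping (for example a depth counter), which this sketch leaves out.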
Better to just install html2text, then parse its output into your .csv file.
Every Linux distro packages a recent version of html2text. Most distros package the latest version.
If you're using Windows, you'll have to port html2text or find an existing port.
Yes, please provide more details on exactly what you want to do.
I'm not a Python expert, but HTML should be valid XML, and it looks like XML to CSV is pretty straightforward in Python:
http://blog.appliedinformaticsinc.com/how-to-parse-and-convert-xml-to-csv-using-python/
Actually only XHTML is XML-compliant. I'm 95% sure no other version of HTML is XML-compliant.
It may not be 100% XML-compliant, but depending on exactly what they want to extract to CSV, it may be close enough to be parsed as XML.
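Whether a given page is "close enough" can be checked quickly with the standard xml.etree.ElementTree module. The sketch below is only an illustration under assumptions: the function names and the sample markup are made up, and the sample omits the XHTML namespace (with a real xmlns declaration the tag names would carry a namespace prefix). It shows a well-formed table converted to CSV, and a typical HTML habit (an unclosed br) being rejected by the XML parser.

```python
import csv
import io
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if the markup can be parsed as XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

def xml_table_to_csv(markup):
    """Convert every <tr> in well-formed markup to one CSV line."""
    root = ET.fromstring(markup)
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for tr in root.iter("tr"):
        # itertext() also picks up text inside nested elements.
        cells = ["".join(cell.itertext()).strip()
                 for cell in tr if cell.tag in ("td", "th")]
        writer.writerow(cells)
    return buffer.getvalue()

# Hypothetical samples: the first is well-formed, the second is not.
good = ("<table><tr><th>name</th><th>age</th></tr>"
        "<tr><td>Bob <i>Jr.</i></td><td>25</td></tr></table>")
bad = "<p>line one<br>line two</p>"

print(is_well_formed(good), is_well_formed(bad))
print(xml_table_to_csv(good))
```

If the check fails on the real data, a tolerant HTML parser is the safer route.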
Do you still need help with this topic?
Provide a data sample and we will be able to help. I usually do that with sed.
If the HTML is guaranteed to be regular and you still want to use regexes, please provide your exact requirements and a sample of the HTML. Your question is currently too vague to be answered.