Solved

Sort HTML File

Posted on 2014-01-05
Last Modified: 2014-01-05
I want to edit an HTML file so that the data in it is easier to scrape. Two things should happen:

1.) Every <img ... tag is replaced with a line break followed by <img ...

2.) Every </img> tag is replaced with </img> followed by a line break.

So, a file that has:

blah<img alt="" src="http://test.com/test.jpg"></img><img alt="" src="http://test.com/test2.jpg"></img>

becomes:

blah
<img alt="" src="http://test.com/test.jpg"></img>

<img alt="" src="http://test.com/test2.jpg"></img>
Question by:stakor
1 Comment
 
LVL 84

Accepted Solution

by: ozo (earned 500 total points)
ID: 39758393
perl -pe 's/(?=<img)/\n/g;s{(?<=</img>)}{\n}g' <<END
blah<img alt="" src="http://test.com/test.jpg"></img><img alt="" src="http://test.com/test2.jpg"></img>
END
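
For readers who would rather script this outside Perl, the same two substitutions can be sketched in Python. This is not part of the accepted answer, only an equivalent under the assumption that the tags appear literally as `<img` and `</img>`, as in the question's sample. The function name `split_img_tags` is hypothetical. As in the one-liner, a zero-width lookahead inserts a line break before each `<img`, and a zero-width lookbehind inserts one after each `</img>`, so the tags themselves are never consumed or rewritten.

```python
import re


def split_img_tags(html: str) -> str:
    """Put each <img ...></img> element on its own line.

    Hypothetical helper mirroring the two Perl substitutions above.
    """
    # Zero-width lookahead: insert a newline just before every "<img".
    html = re.sub(r"(?=<img)", "\n", html)
    # Zero-width lookbehind: insert a newline just after every "</img>".
    html = re.sub(r"(?<=</img>)", "\n", html)
    return html


if __name__ == "__main__":
    sample = (
        'blah<img alt="" src="http://test.com/test.jpg"></img>'
        '<img alt="" src="http://test.com/test2.jpg"></img>'
    )
    print(split_img_tags(sample))
```

Because a line break is added both after `</img>` and before the next `<img`, two adjacent images end up separated by a blank line, which matches the output requested in the question. To process a real file, read its contents into a string, pass it through the function, and write the result back out.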