Solved

Best way to remove HTML tags from a String in Java

Posted on 2011-09-20
Last Modified: 2012-05-12
I have an HTML file whose contents are held in a String. I want to strip the HTML tags and extract the text into a String, and I would like to do this in Java. I have seen several ways of accomplishing this, and I want to know which is best.

TIA
Question by:bent27

Accepted Solution

by:
for_yan earned 500 total points
        String htString = "sdfsd <html> dsfsdfjds  <jkj>  sdfsd<sdfsdf/>sdfsdfs<fsfs> ";


        String res = htString.replaceAll("<[^>]+>","");

        System.out.println(res);



Output:

sdfsd  dsfsdfjds    sdfsdsdfsdfs 
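For reuse, the same expression can be wrapped in a small helper. This is only a sketch: the class name TagStripper and the method stripTags are made up for illustration, and the regex is the one from the answer above. It treats anything of the form <...> as a tag, so it is a rough text-level filter rather than an HTML parser, and a '>' inside an attribute value will end a match early.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names here are not from the thread.
final class TagStripper {
    // Same expression as in the answer above, compiled once for reuse.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    private TagStripper() {}

    /** Removes every substring that looks like <...>; returns "" for null input. */
    static String stripTags(String html) {
        if (html == null) {
            return "";
        }
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // Hello world
        // Limitation: the first '>' closes the match, so part of the attribute survives.
        System.out.println(stripTags("<a title=\"a > b\">link</a>"));
    }
}
```

If the input can contain comments, scripts, or attribute values with '>', a real HTML parser is likely the safer choice.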


Author Comment

by:bent27
What about something like:


String sample = "&lt;head&gt; bla bla</head>";

Expert Comment

by:for_yan

But what sits between &lt; and &gt; is not a tag, so do we want to keep "head" in between?

        String htString = "&lt;head&gt; bla bla</head>;";

        String res = htString.replaceAll("(<[^>]+>)|(&[^;]+;)","");

        System.out.println(res);



Output:

head bla bla;


Assisted Solution

by:for_yan
for_yan earned 500 total points
You probably want it like this, because that is why people use &lt; and &gt;: to represent literal "<" and ">" (say, in inequalities) and distinguish them from tags.



        String htString = "&lt;head&gt; bla bla</head>;";


        String res = htString.replaceAll("<[^>]+>","").replace("&lt;","<").replace("&gt;",">");


       // String res1 = res.replaceAll("&[^;]+;","");

        System.out.println(res);




Output:

<head> bla bla;
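The order of the steps matters here, and a short sketch may make that clearer. The names EntityOrder and toText are hypothetical, and only a handful of entities are handled; this is not a complete decoder. Tags are stripped before entities are decoded, because decoding first would turn &lt;head&gt; into <head> and the tag regex would then delete it. &amp; is decoded last so that text such as &amp;lt; ends up as the literal &lt; instead of being decoded twice.

```java
// Hypothetical helper; names and the choice of entities are assumptions.
final class EntityOrder {
    private EntityOrder() {}

    /** Strips <...> tags first, then decodes a few common entities. */
    static String toText(String html) {
        String noTags = html.replaceAll("<[^>]+>", "");
        return noTags
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;", not "<"
    }

    public static void main(String[] args) {
        System.out.println(toText("&lt;head&gt; bla bla</head>;")); // <head> bla bla;
        System.out.println(toText("1 &amp;lt; 2"));                 // 1 &lt; 2
    }
}
```

Numeric references such as &#60; and the many other named entities are not covered by this sketch.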


Author Comment

by:bent27
Sample input:

String htString = "&lt;head&gt; bla bla</head>;";


Intended output:

bla bla

Also, have you used jsoup? What is your take on it?

Assisted Solution

by:for_yan
for_yan earned 500 total points
We can do it this way if you want, but I don't think it is what you want.
If they use &lt; and &gt; in the HTML code, I think they do it because they want literal "<" and ">"
rather than tags. That's why I don't think you need to remove the text between them. Is this a
real snippet:
"&lt;head&gt; bla bla</head>;" ?

Would a browser really understand this as the opening <head> tag?
That's why I think &lt; should be replaced by "<".

Expert Comment

by:for_yan
The code below does what you asked for, but think about what I posted above and check it. I don't know if I'm right, but that was my understanding
of why people write &lt;, since otherwise it is easier to type "<".

        String htString = "&lt;head&gt; bla bla</head>;";

        String res = htString.replaceAll("(<[^>]+>)|(&lt;.+?&gt;)","");

        System.out.println(res);


Output:

 bla bla;
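The output above still has a leading space, and the trailing ";" comes from the sample string itself, not from the markup. To get closer to the intended "bla bla", one option is to collapse whitespace and trim after stripping. The sketch below assumes that escaped tags (&lt;...&gt;) should be removed along with real ones, as in the regex above; PlainText and toPlainText are illustrative names only.

```java
// Hypothetical helper; names are assumptions, the first regex is from the answer above.
final class PlainText {
    private PlainText() {}

    /** Removes real and escaped tags, then collapses runs of whitespace and trims. */
    static String toPlainText(String html) {
        return html
                .replaceAll("(<[^>]+>)|(&lt;.+?&gt;)", "") // real tags and escaped tags
                .replaceAll("\\s+", " ")                    // collapse whitespace runs
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("&lt;head&gt; bla bla</head>"));  // bla bla
        System.out.println(toPlainText("&lt;head&gt; bla bla</head>;")); // bla bla;
    }
}
```

Note that this also deletes genuine escaped text between &lt; and &gt; on the same line, for example in "a &lt; b and c &gt; d", which is the concern raised earlier in the thread.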
