Solved

best way to remove html tags from a String in java

Posted on 2011-09-20
Last Modified: 2012-05-12
I have an HTML file loaded into a String. I want to strip the HTML tags and extract the text into a String, and I would like to do this in Java. I have seen several ways of accomplishing this, and I want to know which is best.

TIA
Question by:bent27
LVL 47

Accepted Solution

by:
for_yan earned 500 total points
ID: 36567457
        String htString = "sdfsd <html> dsfsdfjds  <jkj>  sdfsd<sdfsdf/>sdfsdfs<fsfs> ";


        String res = htString.replaceAll("<[^>]+>","");

        System.out.println(res);



Output:

sdfsd  dsfsdfjds    sdfsdsdfsdfs 
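
For reference, the same one-liner can be wrapped in a small self-contained class. This is only a sketch: the class name `TagStripper`, the method `stripTags`, and the whitespace-collapsing step are additions for illustration and are not part of the answer above. Note that the pattern treats anything between "<" and ">" as a tag, so plain text such as "a < b and c > d" would lose its middle part, and the contents of script or style elements are left in place.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; names are not from the original answer.
final class TagStripper {
    // '<', then one or more characters that are not '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // Any run of whitespace (spaces, tabs, newlines).
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        // Same regex as the replaceAll call above, precompiled for reuse.
        String noTags = TAG.matcher(html).replaceAll("");
        // Extra step (an assumption about what is wanted): collapse the gaps
        // left behind by removed tags and trim the ends.
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String htString = "sdfsd <html> dsfsdfjds  <jkj>  sdfsd<sdfsdf/>sdfsdfs<fsfs> ";
        System.out.println(stripTags(htString)); // prints: sdfsd dsfsdfjds sdfsdsdfsdfs
    }
}
```

Precompiling the patterns only matters if the method is called many times; for a single call, `String.replaceAll` does the same work.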

 

Author Comment

by:bent27
ID: 36567471
What about something like this:


String sample = "&lt;head&gt; bla bla</head>";
 
LVL 47

Expert Comment

by:for_yan
ID: 36567520

But what sits between &lt; and &gt; is not a tag, so do we want to keep "head" in between?

        String htString = "&lt;head&gt; bla bla</head>;";


        String res = htString.replaceAll("(<[^>]+>)|(&[^;]+;)","");


  

        System.out.println(res);



Output:

head bla bla;

LVL 47

Assisted Solution

by:for_yan
for_yan earned 500 total points
ID: 36567582
You probably want it like this, because that's why they use &lt; and &gt;: to represent "<" and ">", say in inequalities, and to distinguish them from tags.



        String htString = "&lt;head&gt; bla bla</head>;";


        String res = htString.replaceAll("<[^>]+>","").replace("&lt;","<").replace("&gt;",">");


       // String res1 = res.replaceAll("&[^;]+;","");

        System.out.println(res);




Output:

<head> bla bla;
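
If decoding is the goal, the same idea extends to a few more common entities. The sketch below is an illustration under assumptions: the class name `EntityDecoder`, the method name, and the choice of `&quot;` and `&amp;` as extra entities are not from the answer above, and it covers only these four entities rather than the full HTML set. The one design point worth noting is that `&amp;` is decoded last, so that text such as "&amp;lt;" becomes the literal "&lt;" instead of being decoded twice into "<".

```java
// Hypothetical helper for illustration; handles only four common entities.
final class EntityDecoder {
    static String stripTagsThenDecode(String html) {
        // Remove real tags first, while escaped brackets are still escaped.
        String text = html.replaceAll("<[^>]+>", "");
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&"); // last, to avoid double decoding
    }

    public static void main(String[] args) {
        String htString = "&lt;head&gt; bla bla</head>;";
        System.out.println(stripTagsThenDecode(htString)); // prints: <head> bla bla;
        System.out.println(stripTagsThenDecode("<p>x &lt; y &amp; z</p>")); // prints: x < y & z
    }
}
```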

 

Author Comment

by:bent27
ID: 36567606
sample input :

String htString = "&lt;head&gt; bla bla</head>;";


intended output :

bla bla

Or have you used jsoup? What is your take on it?
 
LVL 47

Assisted Solution

by:for_yan
for_yan earned 500 total points
ID: 36567682
We can do it that way if you want, but I don't think it is what you want.
If they use &lt; and &gt; in the HTML code, I think they do it because they want literal ">" and "<" characters
instead of tags, which is why I don't think you need to remove the text between them. Is this a
real snippet:
"&lt;head&gt; bla bla</head>;" ?

Would a browser really understand this as the opening <head> tag?
That's why I think &lt; should be replaced by "<".
 
LVL 47

Expert Comment

by:for_yan
ID: 36567840
This is how you want it, below. But think about what I posted above and check it. I don't know if I'm right, but that was my understanding
of why people write &lt;. Otherwise it is easier to type "<".

        String htString = "&lt;head&gt; bla bla</head>;";

        String res = htString.replaceAll("(<[^>]+>)|(&lt;.+?&gt;)","");

        System.out.println(res);


Output:

 bla bla;
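
As a sketch of the same approach in reusable form: the class name `EscapedTagStripper`, the method name, the `DOTALL` flag, and the final `trim()` are assumptions added for illustration. With the sample from the earlier comment (without the trailing ";" inside the string literal) it produces exactly the intended "bla bla". The second call shows the concern raised above: when &lt; and &gt; stand for inequality signs, everything between them is deleted as well.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; names are not from the thread.
final class EscapedTagStripper {
    // Either a real tag, or the shortest run from "&lt;" to the next "&gt;".
    // DOTALL lets '.' cross line breaks inside an escaped tag.
    private static final Pattern REAL_OR_ESCAPED_TAG =
            Pattern.compile("<[^>]+>|&lt;.+?&gt;", Pattern.DOTALL);

    static String stripRealAndEscapedTags(String html) {
        return REAL_OR_ESCAPED_TAG.matcher(html).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String sample = "&lt;head&gt; bla bla</head>";
        System.out.println(stripRealAndEscapedTags(sample)); // prints: bla bla

        // Caveat: escaped inequality signs are treated like an escaped tag.
        String inequality = "1 &lt; 2 and 3 &gt; 2";
        System.out.println(stripRealAndEscapedTags(inequality)); // prints: 1  2
    }
}
```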
