Best way to remove HTML tags from a String in Java

I have an HTML file in the form of a String. I want to strip the HTML tags and extract the text into a String, and I would like to do this in Java. I have seen several ways of accomplishing this and want to know which one is best.

TIA
Asked by bent27
 
for_yan commented:
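        // Sample input containing several tag-like fragments.
        String htString = "sdfsd <html> dsfsdfjds  <jkj>  sdfsd<sdfsdf/>sdfsdfs<fsfs> ";

        // Remove every "<...>" sequence: '<', one or more non-'>' characters, then '>'.
        String res = htString.replaceAll("<[^>]+>","");

        System.out.println(res);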

Output:

sdfsd  dsfsdfjds    sdfsdsdfsdfs 
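
If this is needed in more than one place, the same regex can be wrapped in a small helper with a precompiled pattern. This is only a sketch: the class name TagStripper and the method name stripTags are made up for illustration, and a regex like this is not a real HTML parser, so it will not cope with comments, script blocks, or a ">" inside an attribute value.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Matches anything that looks like a tag: '<', one or more non-'>' characters, '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Hypothetical helper: returns the input with every tag-like sequence removed.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello, <b>world</b>!</p>"));
        // prints: Hello, world!
    }
}
```

Compiling the pattern once avoids recompiling the regex on every call, which is what String.replaceAll does internally.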

 
bent27 (author) commented:
What about something like this?


String sample = "&lt;head&gt; bla bla</head>";
 
for_yan commented:

But what sits between &lt; and &gt; is not a tag, so do we want to keep the "head" in between?

        String htString = "&lt;head&gt; bla bla</head>;";

        // Remove real tags ("<...>") and also any character entity of the form "&...;".
        String res = htString.replaceAll("(<[^>]+>)|(&[^;]+;)","");

        System.out.println(res);


Output:

head bla bla;

 
for_yan commented:
You probably want it like this, because that is why people use &lt; and &gt;: to represent a literal "<" and ">" (in inequalities, say) and to distinguish them from tags.



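        String htString = "&lt;head&gt; bla bla</head>;";

        // Strip real tags first, then turn the escaped brackets back into literal '<' and '>'.
        String res = htString.replaceAll("<[^>]+>","").replace("&lt;","<").replace("&gt;",">");

        // Alternative (disabled): drop any remaining "&...;" entities instead of decoding them.
        // String res1 = res.replaceAll("&[^;]+;","");

        System.out.println(res);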



Output:

<head> bla bla;

 
bent27 (author) commented:
Sample input:

String htString = "&lt;head&gt; bla bla</head>;";


Intended output:

bla bla

Or have you used jsoup? What is your take on it?
 
for_yan commented:
We can do it that way if you want, but I don't think it is what you want.
If they use &lt; and &gt; in the HTML code, I think they do it because they want a literal "<" and ">"
instead of tags. That's why I don't think you need to remove the text between them. Is
"&lt;head&gt; bla bla</head>;" a real snippet?

Would a browser really understand this as the opening <head> tag?
That's why I think &lt; should be replaced by "<".
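
To make that reading concrete, here is a rough sketch that strips the real tags first and only then decodes a handful of common entities back into literal characters. The class and method names (EntityDecoder, stripThenDecode) are invented for this example, and the entity list is deliberately tiny; it is an assumption that these four are the ones that matter for the input.

```java
class EntityDecoder {
    // Hypothetical helper: remove real tags, then decode a few common entities.
    static String stripThenDecode(String html) {
        // Remove real tags while escaped brackets are still "&lt;" / "&gt;",
        // so they cannot be mistaken for tags.
        String text = html.replaceAll("<[^>]+>", "");
        // "&amp;" is decoded last so that "&amp;lt;" ends up as the
        // literal text "&lt;" rather than being decoded twice into "<".
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(stripThenDecode("<p>if a &lt; b &amp;&amp; b &gt; c</p>"));
        // prints: if a < b && b > c
    }
}
```

The order matters: decoding before stripping would turn the escaped brackets into something the tag regex then deletes.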
 
for_yan commented:
Below is the version you asked for, but think about what I posted above and check it. I don't know if I'm right, but that was my understanding
of why people write &lt;; otherwise it is easier to just type "<".

        String htString = "&lt;head&gt; bla bla</head>;";

        // Remove real tags ("<...>") and also escaped tags ("&lt;...&gt;", matched non-greedily).
        String res = htString.replaceAll("(<[^>]+>)|(&lt;.+?&gt;)","");

        System.out.println(res);

Output:

 bla bla;
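
Another way to get the asker's intended output, offered only as a sketch, is to decode the escaped brackets first and then run the plain tag regex, trimming the result. The names EscapedTagStripper and stripIncludingEscaped are hypothetical. A parser library such as jsoup, which the asker mentions, would likely be more robust; this version stays with the standard library.

```java
class EscapedTagStripper {
    // Hypothetical helper: treat "&lt;...&gt;" as tags and remove them too.
    static String stripIncludingEscaped(String html) {
        // Step 1: turn escaped angle brackets back into real ones.
        String decoded = html.replace("&lt;", "<").replace("&gt;", ">");
        // Step 2: remove everything that now looks like a tag,
        // then trim leading and trailing whitespace.
        return decoded.replaceAll("<[^>]+>", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripIncludingEscaped("&lt;head&gt; bla bla</head>"));
        // prints: bla bla
    }
}
```

The drawback is the one raised earlier in the thread: text such as "1 &lt; 2 and 3 &gt; 2" loses everything between the two brackets, because after decoding it looks like one long tag.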

Question has a verified solution.
