Extract content between two tags in HTML document


I have an HTML document, contained in a String (using this code: http://www.javaalmanac.com/egs/javax.swing.text.html/GetText.html).

Now, I would like to extract the String between the <title> tag, and the </title> tag.

(The title tags will *always* be in this document).

Is there any example code for this? Should I just use StringTokenizer with the delimiter set to "<title>", then run StringTokenizer again on the second token with the delimiter set to "</title>" and take the first token?

That's the best idea that I can come up with.

Thanks in advance,
>> IM
1 Solution
Here's a working example:

      public class ExtractTitle {
         public static void main(String[] args) {
            // Sample document; in practice this is the HTML you already hold in a String.
            String someHtml =
                    "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n" +
                    "<html>\n" +
                    "<head>\n" +
                    "<title>this is my title</title>\n" +
                    "\n" +
                    "<style type=\"text/css\">\n" +
                    "</style>\n" +
                    "\n" +
                    "<script type=\"text/javascript\">\n" +
                    "</script>\n" +
                    "\n" +
                    "\n" +
                    "</head>\n" +
                    "\n" +
                    "<body>\n" +
                    "   <div>bleh</div>\n" +
                    "</body>\n" +
                    "\n" +
                    "</html>\n";

            String titleStartTag = "<title>";
            String titleEndTag = "</title>";

            // Locate the opening and closing tags (case-sensitive match).
            int start = someHtml.indexOf(titleStartTag);
            int end = someHtml.indexOf(titleEndTag);

            // Only extract when both tags were found.
            if (start != -1 && end != -1) {
               // Skip past the opening tag so only the inner text is returned.
               String titleText = someHtml.substring(start + titleStartTag.length(), end);

               System.out.println("title inner text is [" + titleText + "]");
            }
         }
      }


Output when run is:

    title inner text is [this is my title]
BTW - the indexOf calls above are case-sensitive, but the title tag could appear as <title> or <TITLE>. Calling .toLowerCase() on the whole HTML string before searching handles both cases, but it also makes the inner text lower case. To get around this, find the <title> start and end in the lowercase string, then go back to the original HTML string when doing the .substring.
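
For what it's worth, that workaround could look roughly like the sketch below. The class name TitleExtractor and the method extractTitle are made up for illustration and are not part of the original answer. It assumes the opening tag is written without attributes (exactly <title> in any casing) and that lowercasing does not change the length of the string, which holds for ASCII markup but not for every Unicode character.

```java
import java.util.Locale;

public class TitleExtractor {

    /**
     * Returns the text between the first <title> tag and the next </title> tag,
     * matching the tags case-insensitively. Returns null if either tag is missing.
     * (Hypothetical helper, written only to illustrate the workaround.)
     */
    public static String extractTitle(String html) {
        final String startTag = "<title>";
        final String endTag = "</title>";

        // Search a lowercase copy so <TITLE> and <Title> are found as well.
        // Assumption: lowercasing keeps every character at the same index.
        String lowerHtml = html.toLowerCase(Locale.ROOT);

        int start = lowerHtml.indexOf(startTag);
        if (start == -1) {
            return null;
        }
        int contentStart = start + startTag.length();

        // Look for the closing tag only after the opening tag.
        int end = lowerHtml.indexOf(endTag, contentStart);
        if (end == -1) {
            return null;
        }

        // Cut from the ORIGINAL string so the title keeps its casing.
        return html.substring(contentStart, end);
    }

    public static void main(String[] args) {
        String html = "<HTML><HEAD><TITLE>This Is My Title</TITLE></HEAD><BODY></BODY></HTML>";
        // prints: title inner text is [This Is My Title]
        System.out.println("title inner text is [" + extractTitle(html) + "]");
    }
}
```

Returning null when a tag is missing is just one option. Since the title tags are always present in your documents, that branch should never trigger, but it avoids a StringIndexOutOfBoundsException if it ever does.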

InteractiveMind (Author) commented:
Fantastic! Thank you very much.  :)
>> IM
