• Status: Solved

Removing HTML from a string so only plain text is left

Hi everyone,

I have solved some of the problems with my program with the help of RomanRega,

but I am still having problems with some of my methods.

One of them is the RemoveHTML() method, which at the moment doesn't remove the HTML.

Here is the first version of the method:

StringBuffer returnMessage = new StringBuffer(Output);
int startPosition = Output.indexOf("<"); // encountered the first opening bracket
int endPosition = Output.indexOf(">");
if (startPosition >= 0) endPosition = Output.indexOf(">", startPosition); // encountered the first closing bracket
while (startPosition != -1)
{
    returnMessage.delete(startPosition, endPosition + 1); // remove the tag
    startPosition = (returnMessage.toString()).indexOf("<"); // look for the next opening bracket
    //endPosition = (returnMessage.toString()).indexOf(">", startPosition);
    if (startPosition >= 0) endPosition = Output.indexOf(">", startPosition); // encountered the first closing bracket
}
Output = returnMessage.toString();
System.out.println(Output);

But this only removes some of the < or > characters. I need it to remove everything in between as well.
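
Judging only from the code as posted, one likely cause is that the next "<" is looked up in returnMessage (which shrinks after every delete) while the matching ">" is still looked up in the original Output, so the two positions drift apart. A minimal sketch that searches the buffer for both; the class name RemoveHtmlLoop and the parameter name output are illustrative, not part of the original program:

```java
class RemoveHtmlLoop {
    // Deletes every <...> span, always searching the buffer being modified.
    static String removeHtml(String output) {
        StringBuffer returnMessage = new StringBuffer(output);
        int startPosition = returnMessage.indexOf("<");
        while (startPosition != -1) {
            int endPosition = returnMessage.indexOf(">", startPosition);
            if (endPosition == -1) {
                break; // unclosed tag: leave the remaining text untouched
            }
            returnMessage.delete(startPosition, endPosition + 1); // remove the tag
            // continue from the same index, since the text after the tag shifted left
            startPosition = returnMessage.indexOf("<", startPosition);
        }
        return returnMessage.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeHtml("hello<b> xxx</b> world")); // hello xxx world
    }
}
```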

I have another version I have been working on, but it seems to do the same thing:

char currentChar;
int startI = 0;
int endI;
int tokNo;
StringTokenizer temp;
boolean flag = false;

for (endI = 0; endI < Output.length(); endI++)
{
    currentChar = Output.charAt(endI);
    if (currentChar == '<')
    {
        flag = true;
        temp = new StringTokenizer(Output.substring(startI, endI));
        tokNo = temp.countTokens();
        for (int words = 0; words < tokNo; words++)
        {
            extract.addElement(temp.nextToken(" \n\t"));
        }
    }
    while (flag && endI < Output.length())
    {
        currentChar = Output.charAt(endI);
        if (currentChar == '>')
        {
            startI = endI + 1;
            flag = false;
        }
        else
        {
            endI++;
        }
    }
}


Can anyone help? I have to get the program finished by the end of the week!
Asked: alexlindley
 
TimYates commented:
htmlString.replaceAll("\\<.*?\\>","");

TimYates commented:
should do it...

TimYates commented:
So long as you are on JDK 1.4+
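
A minimal sketch of that one-liner in use. Java strings are immutable, so replaceAll returns a new string and the result has to be assigned or returned; the class name StripTagsRegex and the sample input are only illustrative:

```java
class StripTagsRegex {
    // Removes every <...> span using the pattern from the comment above.
    static String stripTags(String htmlString) {
        // replaceAll does not modify htmlString; it returns the stripped copy
        return htmlString.replaceAll("\\<.*?\\>", "");
    }

    public static void main(String[] args) {
        String htmlString = "<p>Hello <b>world</b></p>";
        htmlString = stripTags(htmlString);
        System.out.println(htmlString); // Hello world
    }
}
```

One thing to keep in mind: by default "." does not match line terminators, so a tag that is split across two lines would be left in place by this pattern.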
aozarov commented:
>> htmlString.replaceAll("\\<.*?\\>","");
That should work, though I don't think the \\ is needed in this case [htmlString.replaceAll("<[^>]*>",""); will work as well], but this will remove just the HTML tags themselves.
e.g.: hello<b> xxx</b> world -> hello xxx world
If you also want to remove the content of the tag:
e.g.: hello<b>xxx</b> world -> hello world
then you will need to do something like this (assuming the input conforms to XHTML [which requires a valid close tag]):
htmlString.replaceAll("<([A-Za-z][A-Za-z0-9]*)[^>]*>.*?</\\1>","");
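
A small sketch of the two behaviours side by side, using the examples from this comment. The second pattern uses a backreference (\\1) to pair a start tag with its close tag and (?s) so the content may span lines; it is a rough illustration that assumes well-formed input with no nested tags of the same name, and the class and method names are made up for the example:

```java
class TagsVersusContent {
    // Removes only the tags, keeping the text between them.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Removes each element together with its content.
    // Assumes every start tag has a matching close tag and no same-name nesting.
    static String stripElements(String html) {
        return html.replaceAll("(?s)<([A-Za-z][A-Za-z0-9]*)[^>]*>.*?</\\1>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("hello<b> xxx</b> world"));    // hello xxx world
        System.out.println(stripElements("hello<b>xxx</b> world")); // hello world
    }
}
```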
 
TimYates commented:
Good point...  still works though ;-)
 
aozarov commented:
>> Good point...  still works though ;-)
Right [regarding "though I don't think there is a need for the \\ in this case"] :-)
But not for "also to remove the content of the tag", if that's what alexlindley needs...
 
TimYates commented:
True...  I wasn't sure of that either...  but I got the impression he just wanted to strip the tags, not the content of the tags...

I guess time will tell ;-)

Tim
 
RomanRega commented:
As I said in the other thread:

static String removeHTML(String html)
{
   if (html == null || html.length() == 0) return null;
   StringBuffer returnMessage = new StringBuffer(html.length());
   int startPosition = 0;
   int endPosition;
   do {
      // copy the text between the previous tag and the next '<'
      endPosition = html.indexOf('<', startPosition);
      if (endPosition < 0) endPosition = html.length();
      if (endPosition > startPosition) {
         returnMessage.append(html, startPosition, endPosition);
         returnMessage.append(' ');
      }
      if (endPosition >= html.length()) break;
      // skip past the closing '>'; this yields 0 when no '>' is left
      startPosition = html.indexOf(">", endPosition + 1) + 1;
   } while (startPosition > 0); // stop on an unclosed tag instead of looping forever

   return returnMessage.toString();
}
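
A usage sketch for this method. The copy below uses startPosition > 0 as the loop condition so that an unclosed "<" ends the loop. Because the method appends a space after every text run, the hypothetical helper plainText (not part of the original post) collapses whitespace and trims the result:

```java
class RemoveHtmlUsage {
    static String removeHTML(String html) {
        if (html == null || html.length() == 0) return null;
        StringBuffer returnMessage = new StringBuffer(html.length());
        int startPosition = 0;
        int endPosition;
        do {
            endPosition = html.indexOf('<', startPosition);
            if (endPosition < 0) endPosition = html.length();
            if (endPosition > startPosition) {
                returnMessage.append(html, startPosition, endPosition);
                returnMessage.append(' ');
            }
            if (endPosition >= html.length()) break;
            startPosition = html.indexOf(">", endPosition + 1) + 1;
        } while (startPosition > 0);
        return returnMessage.toString();
    }

    // Hypothetical helper: tidy up the extra spaces removeHTML leaves behind.
    static String plainText(String html) {
        String text = removeHTML(html);
        return text == null ? "" : text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(plainText("<p>Hi</p><p>there</p>")); // Hi there
    }
}
```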
