Solved

extract only <p>..</p> from web page in java

Posted on 2009-07-08
11
252 Views
Last Modified: 2012-05-07
I got web page contents using java.net.url in java.
And I got all the tags and contents. But I only want to get the text in <p> tag.

Can I use regular expression for that? Please let me know if there's any example.


Thanks!!
0
Comment
Question by:Juuno
  • 3
  • 3
  • 2
  • +2
11 Comments
 
LVL 86

Assisted Solution

by:CEHJ
CEHJ earned 75 total points
Comment Utility
You'd be better off using an html parser. See

http://exampledepot.com/egs/javax.swing.text.html/GetLinks.html?l=rel

and use HTML.Tag.P instead or use a high level API like HttpUnit
0
 
LVL 15

Assisted Solution

by:fsze88
fsze88 earned 75 total points
Comment Utility
try this?

        String beTestString = "<p>abcxyz</p>";
        Pattern p = Pattern.compile("<p>(.*)</p>");
        Matcher m = p.matcher(beTestString);
//        boolean b = m.matches();
        System.out.println("m.group(1)  : " + m.group(1));
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
Well a simple multiline would break that wouldn't it? Not to mention nesting...
0
 
LVL 15

Expert Comment

by:fsze88
Comment Utility
I have not try on multline, hum.... I think not a problem
so we can use  m.groupCount()  to get number of group there and using for loop take all of text of <p> tag....
make sense?
0
 
LVL 27

Assisted Solution

by:ddrudik
ddrudik earned 75 total points
Comment Utility
Here's starter code:
import java.util.regex.Pattern;

import java.util.regex.Matcher;

class Module1{

  public static void main(String[] asd){

  String sourcestring = "source string to match with pattern";

  Pattern re = Pattern.compile("<p[^>]*>(.*?)</p>",Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  Matcher m = re.matcher(sourcestring);

  int mIdx = 0;

    while (m.find()){

      for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){

        System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));

      }

      mIdx++;

    }

  }

}

Open in new window

0
Maximize Your Threat Intelligence Reporting

Reporting is one of the most important and least talked about aspects of a world-class threat intelligence program. Here’s how to do it right.

 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
There's no need to reinvent the wheel that's an html parser
0
 
LVL 92

Accepted Solution

by:
objects earned 275 total points
Comment Utility
heres what you need

http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

you just need to tweak it to track when you are inside a

Let me know if you need any help
0
 

Author Comment

by:Juuno
Comment Utility
@ objects

> heres what you need
> http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

I got an error like: TestCallBack cannot be resolved to a type though I have that class.

0
 
LVL 92

Assisted Solution

by:objects
objects earned 275 total points
Comment Utility
sorry that's a type should be lowercase B
0
 

Author Comment

by:Juuno
Comment Utility
I got an exception: javax.swing.text.ChangedCharSetException at this line: editorKit.read(reader, htmlText, 0);

Thanks!!
0
 
LVL 92

Assisted Solution

by:objects
objects earned 275 total points
Comment Utility
0

Featured Post

Do You Know the 4 Main Threat Actor Types?

Do you know the main threat actor types? Most attackers fall into one of four categories, each with their own favored tactics, techniques, and procedures.

Join & Write a Comment

By the end of 1980s, object oriented programming using languages like C++, Simula69 and ObjectPascal gained momentum. It looked like programmers finally found the perfect language. C++ successfully combined the object oriented principles of Simula w…
Go is an acronym of golang, is a programming language developed Google in 2007. Go is a new language that is mostly in the C family, with significant input from Pascal/Modula/Oberon family. Hence Go arisen as low-level language with fast compilation…
Viewers will learn about the different types of variables in Java and how to declare them. Decide the type of variable desired: Put the keyword corresponding to the type of variable in front of the variable name: Use the equal sign to assign a v…
This tutorial covers a practical example of lazy loading technique and early loading technique in a Singleton Design Pattern.

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now