Solved

Extract only <p>..</p> from a web page in Java

Posted on 2009-07-08
11
255 Views
Last Modified: 2012-05-07
I fetched a web page's contents using java.net.URL in Java, and I got all the tags and content. But I only want the text inside the <p> tags.

Can I use a regular expression for that? Please let me know if there's an example.


Thanks!!
Question by:Juuno
 
LVL 86

Assisted Solution

by:CEHJ
CEHJ earned 75 total points
ID: 24806199
You'd be better off using an html parser. See

http://exampledepot.com/egs/javax.swing.text.html/GetLinks.html?l=rel

and use HTML.Tag.P instead of the tag used there, or use a high-level API like HttpUnit.
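
As a rough illustration of the parser approach (not the code from the linked page), the sketch below uses the JDK's own Swing HTML parser with a callback that collects text only while inside a <p> element. The class name ParagraphExtractor and the method extractParagraphs are hypothetical names chosen for this example, and nested or malformed markup may need extra handling.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of any library.
class ParagraphExtractor {

    // Returns the trimmed text of each <p> element, in document order.
    static List<String> extractParagraphs(String html) {
        final List<String> paragraphs = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Non-null only while the parser is inside a <p> element.
            private StringBuilder current;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (HTML.Tag.P.equals(tag)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Text outside a paragraph is ignored.
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (HTML.Tag.P.equals(tag) && current != null) {
                    paragraphs.add(current.toString().trim());
                    current = null;
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument (true) tells the parser to ignore charset directives.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>First</p><div>skip</div><p>Second</p></body></html>";
        System.out.println(extractParagraphs(html));
    }
}
```

For a real page, the StringReader would be replaced by a reader over the URL's input stream.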
 
LVL 15

Assisted Solution

by:fsze88
fsze88 earned 75 total points
ID: 24806413
try this?

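        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        String beTestString = "<p>abcxyz</p>";
        Pattern p = Pattern.compile("<p>(.*)</p>");
        Matcher m = p.matcher(beTestString);
        // A match must be attempted first; calling group() before find() or matches() throws IllegalStateException.
        if (m.find()) {
            System.out.println("m.group(1)  : " + m.group(1));
        }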
 
LVL 86

Expert Comment

by:CEHJ
ID: 24806430
Well, a simple multiline paragraph would break that, wouldn't it? Not to mention nesting...
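
To make the objection concrete, here is a small sketch that runs the pattern <p>(.*)</p> against a multiline paragraph and against two adjacent paragraphs. The class RegexPitfalls and its findAll helper are hypothetical names used only for this demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; shows where the simple pattern goes wrong.
class RegexPitfalls {

    // Collects capture group 1 of every match of the pattern in the input.
    static List<String> findAll(Pattern pattern, String html) {
        List<String> results = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            results.add(m.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        Pattern naive = Pattern.compile("<p>(.*)</p>");

        // By default '.' does not match a line break, so nothing is found.
        System.out.println(findAll(naive, "<p>one\ntwo</p>"));   // prints []

        // The greedy .* runs to the last </p>, merging both paragraphs.
        System.out.println(findAll(naive, "<p>a</p><p>b</p>"));  // prints [a</p><p>b]
    }
}
```

Nested or unclosed tags are harder still, since a regular expression has no notion of tag depth.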
 
LVL 15

Expert Comment

by:fsze88
ID: 24806497
I haven't tried it on multiline input, hmm... I don't think it's a problem.
So we can use m.groupCount() to get the number of groups and use a for loop to take the text of every <p> tag...
Make sense?
 
LVL 27

Assisted Solution

by:ddrudik
ddrudik earned 75 total points
ID: 24807575
Here's starter code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
  public static void main(String[] args) {
    String sourcestring = "source string to match with pattern";
    // \b keeps tags such as <pre> or <param> from matching; DOTALL lets '.' span
    // line breaks, and the reluctant .*? stops at the first closing </p>.
    Pattern re = Pattern.compile("<p\\b[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()) {
      // Group 0 is the whole match; group 1 is the content between the tags.
      for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

 
LVL 86

Expert Comment

by:CEHJ
ID: 24808386
There's no need to reinvent the wheel that is an HTML parser.
 
LVL 92

Accepted Solution

by:objects
objects earned 275 total points
ID: 24809496
Here's what you need:

http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

you just need to tweak it to track when you are inside a

Let me know if you need any help
 

Author Comment

by:Juuno
ID: 24810396
@ objects

> heres what you need
http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

I got an error like: TestCallBack cannot be resolved to a type though I have that class.

 
LVL 92

Assisted Solution

by:objects
objects earned 275 total points
ID: 24810428
Sorry, that's a typo; it should be a lowercase b.
 

Author Comment

by:Juuno
ID: 24810657
I got an exception: javax.swing.text.ChangedCharSetException at this line: editorKit.read(reader, htmlText, 0);

Thanks!!
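
A commonly cited cause of this exception is a <meta> tag in the page that declares a character set: by default, HTMLEditorKit.read reports it by throwing ChangedCharSetException. The sketch below assumes that is the situation here and sets the document's "IgnoreCharsetDirective" property before reading. The class name CharsetTolerantReader, the method readText, and the variable htmlDoc are hypothetical names for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper for illustration; assumes the exception comes from a charset <meta> tag.
class CharsetTolerantReader {

    // Parses the HTML into an HTMLDocument and returns its plain text.
    static String readText(String html) {
        HTMLEditorKit editorKit = new HTMLEditorKit();
        HTMLDocument htmlDoc = (HTMLDocument) editorKit.createDefaultDocument();

        // Ask the parser to skip charset directives instead of throwing ChangedCharSetException.
        htmlDoc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        try (Reader reader = new StringReader(html)) {
            editorKit.read(reader, htmlDoc, 0);
            return htmlDoc.getText(0, htmlDoc.getLength());
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not read HTML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\"></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(readText(html));
    }
}
```

With the directive ignored, the reader has to be created with the correct encoding up front, because the parser will no longer signal a mismatch.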
 
LVL 92

Assisted Solution

by:objects
objects earned 275 total points
ID: 24810686