Solved

extract only <p>..</p> from web page in java

Posted on 2009-07-08
Last Modified: 2012-05-07
I fetched the contents of a web page using java.net.URL in Java.
That gives me all the tags and their contents, but I only want the text inside the <p> tags.

Can I use a regular expression for that? Please let me know if there is an example.


Thanks!!
Question by:Juuno
11 Comments
 
LVL 86

Assisted Solution

by:CEHJ
CEHJ earned 75 total points
ID: 24806199
You'd be better off using an HTML parser. See

http://exampledepot.com/egs/javax.swing.text.html/GetLinks.html?l=rel

and use HTML.Tag.P instead, or use a high-level API like HttpUnit.
 
LVL 15

Assisted Solution

by:fsze88
fsze88 earned 75 total points
ID: 24806413
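Try this:

        // Requires java.util.regex.Pattern and java.util.regex.Matcher.
        String beTestString = "<p>abcxyz</p>";
        Pattern p = Pattern.compile("<p>(.*)</p>");
        Matcher m = p.matcher(beTestString);
        // A match must be attempted before group() is called;
        // otherwise m.group(1) throws IllegalStateException.
        if (m.matches()) {
            System.out.println("m.group(1)  : " + m.group(1));
        }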
 
LVL 86

Expert Comment

by:CEHJ
ID: 24806430
Well, a simple multiline input would break that, wouldn't it? Not to mention nesting...
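
To make the objection concrete, here is a minimal sketch comparing the greedy pattern from the previous post with a lazy, DOTALL variant. The class and method names (GreedyParagraphDemo, firstParagraph) are made up for this illustration, and neither pattern copes with nested or malformed markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyParagraphDemo {
    // Returns group 1 of the first match, or null when nothing matches.
    static String firstParagraph(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy, and '.' does not match line terminators by default.
        Pattern greedy = Pattern.compile("<p>(.*)</p>");
        // Lazy quantifier plus DOTALL so '.' also matches newlines.
        Pattern lazy = Pattern.compile("<p\\b[^>]*>(.*?)</p>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        String sameLine = "<p>one</p><p>two</p>";
        String multiline = "<p>line1\nline2</p>";

        System.out.println(firstParagraph(greedy, sameLine));  // one</p><p>two
        System.out.println(firstParagraph(greedy, multiline)); // null
        System.out.println(firstParagraph(lazy, sameLine));    // one
        System.out.println(firstParagraph(lazy, multiline));   // both lines, newline included
    }
}
```

The greedy version swallows everything up to the last </p> on the line and finds nothing at all once the paragraph contains a newline.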
 
LVL 15

Expert Comment

by:fsze88
ID: 24806497
I haven't tried it on multiline input, hmm... I don't think it's a problem.
So we can use m.groupCount() to get the number of groups and use a for loop to take all the text of the <p> tags.
Make sense?
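
One caveat on that idea: Matcher.groupCount() reports how many capturing groups the pattern defines, not how many matches were found, so it stays the same however many <p> tags the input holds. A small sketch of the difference, with class and method names (GroupCountDemo, allParagraphs) invented for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    private static final Pattern PARAGRAPH =
            Pattern.compile("<p>(.*?)</p>", Pattern.DOTALL);

    // Collects group 1 of every match by calling find() repeatedly.
    static List<String> allParagraphs(String html) {
        Matcher m = PARAGRAPH.matcher(html);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>one</p>\n<p>two</p>\n<p>three</p>";
        // The pattern has exactly one capturing group, whatever the input.
        System.out.println(PARAGRAPH.matcher(html).groupCount()); // 1
        System.out.println(allParagraphs(html));                  // [one, two, three]
    }
}
```

Walking the matches therefore needs a find() loop rather than a loop over groupCount().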
 
LVL 27

Assisted Solution

by:ddrudik
ddrudik earned 75 total points
ID: 24807575
Here's starter code:
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Module1 {
  public static void main(String[] asd) {
    String sourcestring = "source string to match with pattern";
    // \b keeps <p ...> from also matching tags such as <pre> or <param>;
    // (.*?) is lazy and DOTALL lets it span line breaks.
    Pattern re = Pattern.compile("<p\\b[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    // One iteration per <p>...</p> match; group 0 is the whole match, group 1 the inner text.
    while (m.find()) {
      for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

 
LVL 86

Expert Comment

by:CEHJ
ID: 24808386
There's no need to reinvent the wheel; that's what an HTML parser is for.
 
LVL 92

Accepted Solution

by:objects
objects earned 275 total points
ID: 24809496
Here's what you need:

http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

You just need to tweak it to track when you are inside a <p> tag.

Let me know if you need any help
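
A rough sketch of that tweak, assuming the linked example is built on the Swing HTMLEditorKit.ParserCallback API (the thread does not show its code). The class name ParagraphCallback and the helper extract are hypothetical, not taken from the linked article.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ParagraphCallback extends HTMLEditorKit.ParserCallback {
    private final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <p>

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.P) {
            current = new StringBuilder();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (current != null) {
            current.append(data); // text of nested inline tags ends up here too
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.P && current != null) {
            paragraphs.add(current.toString().trim());
            current = null;
        }
    }

    // Hypothetical helper: parses the HTML and returns the text of each <p>.
    static List<String> extract(String html) {
        ParagraphCallback callback = new ParagraphCallback();
        try {
            // true = ignore any charset declared in a <meta> tag
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.paragraphs;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>First paragraph.</p>"
                + "<div>Not a paragraph.</div><p>Second paragraph.</p></body></html>";
        System.out.println(extract(html));
    }
}
```

Text that arrives outside a <p> is simply dropped, because the buffer only exists between the start and end callbacks for HTML.Tag.P.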
 

Author Comment

by:Juuno
ID: 24810396
@ objects

> heres what you need
http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

I got an error like "TestCallBack cannot be resolved to a type", though I have that class.

 
LVL 92

Assisted Solution

by:objects
objects earned 275 total points
ID: 24810428
Sorry, that's a typo. It should be a lowercase b (TestCallback).
 

Author Comment

by:Juuno
ID: 24810657
I got an exception: javax.swing.text.ChangedCharSetException at this line: editorKit.read(reader, htmlText, 0);

Thanks!!
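
For context, ChangedCharSetException is what the Swing parser throws when the page declares a charset in a <meta> tag and the reader was not told to ignore it. One commonly used workaround is to set the "IgnoreCharsetDirective" property on the document before calling read. The sketch below assumes htmlText is an HTMLDocument created by the editor kit; the class name CharsetSafeParagraphs and the element walk are illustrative only, not the code from the linked article.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class CharsetSafeParagraphs {
    static List<String> paragraphs(String html) {
        HTMLEditorKit editorKit = new HTMLEditorKit();
        HTMLDocument htmlText = (HTMLDocument) editorKit.createDefaultDocument();
        // Ignore <meta ... charset=...>; the Reader has already decoded the bytes.
        htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        List<String> result = new ArrayList<>();
        try (Reader reader = new StringReader(html)) {
            editorKit.read(reader, htmlText, 0);
            // Walk the element tree and keep the text of every <p> element.
            ElementIterator it = new ElementIterator(htmlText);
            for (Element e = it.next(); e != null; e = it.next()) {
                Object name = e.getAttributes().getAttribute(StyleConstants.NameAttribute);
                if (name == HTML.Tag.P) {
                    int start = e.getStartOffset();
                    int end = Math.min(e.getEndOffset(), htmlText.getLength());
                    String text = htmlText.getText(start, end - start).trim();
                    if (!text.isEmpty()) { // skip empty placeholder paragraphs
                        result.add(text);
                    }
                }
            }
        } catch (IOException | BadLocationException ex) {
            throw new IllegalStateException("Could not parse HTML", ex);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\"></head>"
                + "<body><p>Hello</p><div>skip</div><p>World</p></body></html>";
        System.out.println(paragraphs(html));
    }
}
```

If the page bytes are decoded with the wrong charset in the first place, ignoring the directive hides the exception but not the garbled text, so the Reader should be opened with the encoding the server reports.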
 
LVL 92

Assisted Solution

by:objects
objects earned 275 total points
ID: 24810686