Solved

extract only <p>..</p> from web page in java

Posted on 2009-07-08
Last Modified: 2012-05-07
I fetched a web page's contents using java.net.URL in Java.
I got all the tags and contents, but I only want the text inside the <p> tags.

Can I use a regular expression for that? Please let me know if there's an example.


Thanks!!
Question by:Juuno
11 Comments
 
LVL 86

Assisted Solution

by:CEHJ
CEHJ earned 300 total points
ID: 24806199
You'd be better off using an HTML parser. See

http://exampledepot.com/egs/javax.swing.text.html/GetLinks.html?l=rel

and use HTML.Tag.P instead, or use a high-level API like HttpUnit.
 
LVL 15

Assisted Solution

by:fsze88
fsze88 earned 300 total points
ID: 24806413
Try this:

        String beTestString = "<p>abcxyz</p>";
        Pattern p = Pattern.compile("<p>(.*)</p>");
        Matcher m = p.matcher(beTestString);
        // group(1) is only available after a successful match attempt
        if (m.find()) {
            System.out.println("m.group(1)  : " + m.group(1));
        }
 
LVL 86

Expert Comment

by:CEHJ
ID: 24806430
Well, a simple multi-line paragraph would break that, wouldn't it? Not to mention nesting...

 
LVL 15

Expert Comment

by:fsze88
ID: 24806497
I have not tried it on multi-line input, hmm... I don't think it's a problem.
So we can use m.groupCount() to get the number of groups and use a for loop to take the text of every <p> tag...
Make sense?
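
For anyone comparing the regex suggestions in this thread, here is a small sketch of how they behave. The class name ParagraphRegexDemo, the helper findAll, and the sample strings are made up for illustration. Note that Matcher.groupCount() returns the number of capturing groups in the pattern (1 here), not the number of matches, so walking every <p> needs repeated find() calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not taken from any post in this thread.
class ParagraphRegexDemo {
    // Greedy pattern without DOTALL, as in the first regex suggestion.
    static final Pattern GREEDY = Pattern.compile("<p>(.*)</p>");
    // Reluctant pattern with DOTALL, which also lets '.' match line breaks.
    static final Pattern RELUCTANT = Pattern.compile(
            "<p[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collects group 1 of every match by calling find() until it fails.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> results = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            results.add(m.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        String oneLine = "<p>a</p><p>b</p>";
        String multiLine = "<p>first\nline</p><p>second</p>";
        System.out.println(findAll(GREEDY, oneLine));      // one match spanning both paragraphs
        System.out.println(findAll(GREEDY, multiLine));    // the multi-line paragraph is missed
        System.out.println(findAll(RELUCTANT, oneLine));   // one entry per paragraph
        System.out.println(findAll(RELUCTANT, multiLine)); // one entry per paragraph
        // Number of capturing groups in the pattern, not the number of matches.
        System.out.println(RELUCTANT.matcher(oneLine).groupCount());
    }
}
```

Even the reluctant pattern is only an approximation: <p[^>]*> also matches tags such as <pre>, and nested or unclosed tags are not handled, which is the argument for a real parser.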
 
LVL 27

Assisted Solution

by:ddrudik
ddrudik earned 300 total points
ID: 24807575
Here's starter code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
  public static void main(String[] args) {
    String sourcestring = "source string to match with pattern";
    // Reluctant (.*?) stops at the first </p>; DOTALL lets . match line breaks
    Pattern re = Pattern.compile("<p[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    // Each find() advances to the next <p>...</p> match
    while (m.find()) {
      // Group 0 is the whole match; group 1 is the text between the tags
      for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

 
LVL 86

Expert Comment

by:CEHJ
ID: 24808386
There's no need to reinvent the wheel that is an HTML parser.
 
LVL 92

Accepted Solution

by:objects
objects earned 1100 total points
ID: 24809496
Here's what you need:

http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

You just need to tweak it to track when you are inside a <p> tag.

Let me know if you need any help.
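
The linked code is not reproduced in this thread, so what follows is only a rough sketch of the "track when you are inside a <p>" idea using the JDK's Swing HTML parser, not necessarily what the article does. ParagraphExtractor and extractParagraphs are hypothetical names; the callback keeps a buffer that exists only between a <p> start tag and its end tag.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper; the names are illustrative only.
class ParagraphExtractor {
    // Returns the text found inside each <p> element of the given HTML.
    static List<String> extractParagraphs(String html) {
        final List<String> paragraphs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Non-null only while the parser is inside a <p> element.
            private StringBuilder current;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.P) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.P && current != null) {
                    paragraphs.add(current.toString());
                    current = null;
                }
            }
        };
        try {
            // The third argument (true) tells the parser to ignore charset
            // directives in the page instead of throwing on them.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>";
        for (String paragraph : extractParagraphs(html)) {
            System.out.println(paragraph);
        }
    }
}
```

Text inside inline tags such as <b> within a paragraph is appended to the same buffer, because only P start and end tags open or close it.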
 

Author Comment

by:Juuno
ID: 24810396
@ objects

> heres what you need
http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

I got an error: "TestCallBack cannot be resolved to a type", even though I have that class.

 
LVL 92

Assisted Solution

by:objects
objects earned 1100 total points
ID: 24810428
Sorry, that's a typo; it should be a lowercase "b".
 

Author Comment

by:Juuno
ID: 24810657
I got an exception: javax.swing.text.ChangedCharSetException at this line: editorKit.read(reader, htmlText, 0);

Thanks!!
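
The reply to this exception did not survive in the thread. A workaround commonly used for ChangedCharSetException, which is raised when the page's meta tag declares a charset while the kit is reading from a Reader, is to set the "IgnoreCharsetDirective" property on the document before calling read. The sketch below assumes the variable htmlText above is an HTMLDocument; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper; the names are illustrative only.
class CharsetSafeHtmlReader {
    // Parses the HTML and returns the document's plain text.
    static String readPlainText(String html) {
        HTMLEditorKit editorKit = new HTMLEditorKit();
        HTMLDocument htmlText = (HTMLDocument) editorKit.createDefaultDocument();
        // Ask the kit to skip charset directives such as
        // <meta http-equiv="Content-Type" content="text/html; charset=...">.
        htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            editorKit.read(new StringReader(html), htmlText, 0);
            return htmlText.getText(0, htmlText.getLength());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\"></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(readPlainText(html).trim());
    }
}
```

Ignoring the directive means the Reader's own encoding is used as-is, so the stream should be opened with the correct charset when the page is fetched.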
 
LVL 92

Assisted Solution

by:objects
objects earned 1100 total points
ID: 24810686