  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 346

regular expression multiple lines until

OK, so I came across another regular expression issue.

I'm trying to extract information from the source code of a web page.

This is the first part of my expression:
<h1.*>(.*)</h1>

Now I need to ignore everything until I reach the following expression:
<td.*>\s*Siège social\s*</td>\s*<td>\s*<a.*>(.*)<br/>(.*)</a>

What do I write between them? Also, is this well written?

Asked by: Mutsop

Mutsop (Author) commented:
I just tested the second part of my expression and for some reason it doesn't work.

The HTML code is:
<tr align="left" valign="top">
    <td style="color:#777777">
        Siège social
    </td>
    <td>
        <a href="http://maps.google.fr/maps?f=q&amp;hl=fr&amp;geocode=&amp;q=%20Lieu-dit%20Platey%20AMANCE" target="_blank" class="lienFleche"> Lieu-dit Platey <br>10140 AMANCE</a>
    </td>
    <td valign="middle" align="left"><a href="http://maps.google.fr/maps?f=q&amp;hl=fr&amp;geocode=&amp;q=%20Lieu-dit%20Platey%20AMANCE" target="_blank"><img border="0" src="http://img1.societe.com/imgz/locator.png" alt="google map" title="plan"></a></td>
</tr>

So what I need is "Lieu-dit Platey" and "10140 AMANCE".

What is wrong with the expression?

CodeCruiser commented:
If you are trying to parse the HTML, HTMLAgilityPack may be useful:

http://htmlagilitypack.codeplex.com/

Mutsop (Author) commented:
The problem is that, for now, I need to use regular expressions.
I'll look into HTMLAgilityPack afterwards for the next project :)
 
käµfm³d 👽 commented:
"What do I write between them?"

You can try a non-greedy (lazy) dot-star:

.*?

"Also is this well written?"
No, because you are using a greedy dot-star, which tries to match as much as possible before declaring success. That is fine if the target string occurs only once in your source string. If it occurs more than once, you will only ever get one match back, because the dot-star consumes everything in between, valid match or not, up to the last piece of the string that still lets the overall match succeed.
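
To make the greedy/lazy difference concrete, here is a minimal sketch. It is written in Python rather than VB.NET purely for illustration (the `.*` and `.*?` quantifiers behave the same way in both engines), and the sample string is hypothetical, not taken from the page in question.

```python
import re

# Hypothetical sample: two <h1> elements on a single line.
html = "<h1>First</h1><p>text</p><h1>Second</h1>"

# Greedy: .* runs to the end of the line, then backtracks to the LAST </h1>.
greedy = re.findall(r"<h1>(.*)</h1>", html)

# Lazy: .*? grows one character at a time and stops at the FIRST </h1>.
lazy = re.findall(r"<h1>(.*?)</h1>", html)

print(greedy)  # ['First</h1><p>text</p><h1>Second']
print(lazy)    # ['First', 'Second']
```

The greedy version returns a single oversized match, while the lazy version returns each heading separately.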
 
käµfm³d 👽 commented:
Try this modification to your pattern:
<h1[^>]*>.*?<td[^>]*>\s*Siège social\s*</td>\s*<td>\s*<a[^>]*>(.*?)<br/>(.*?)</a>

Mutsop (Author) commented:
Sorry for the late response, but it doesn't seem to work...
<h1[^>]*>(.*?)</h1>.*?<td[^>]*>\s*Siège social\s*</td>\s*<td>\s*<a[^>]*>(.*?)<br/>(.*?)</a>

njonestx commented:
Don't know if you still need an answer, but this should at least help in future expressions that you build.

It looks like there were two problems in your regex. One was that the greedy matches inside the tags let the expression match too much: "<a.*>" can match well past the tag's actual closing ">". I changed these tag matches to lazy matches (".*?") that stop at the first ">". (Note: this isn't 100% foolproof for HTML.)

The other problem was that your input has "<br>" before "10140 AMANCE" but your original expression called for "<br/>". I changed this to "<br/?>" so that the slash would be optional.

Here's an expression that will properly handle these elements.
<td.*?>\s*Siège social\s*</td>\s*<td>\s*<a.*?>(.*)<br/?>(.*)</a>
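
As a quick check of the two fixes described above, here is a sketch that runs both the original and the corrected expression against a trimmed copy of the row posted earlier. It is in Python rather than VB.NET for illustration only, the href is shortened for readability (an assumption of this sketch), and no regex options are set.

```python
import re

# Trimmed copy of the posted row; the href is shortened for this sketch.
html = (
    '<td style="color:#777777">\n'
    '\tSiège social\n'
    '</td>\n'
    '<td>\n'
    '\t<a href="http://maps.google.fr/maps?q=AMANCE" target="_blank" '
    'class="lienFleche"> Lieu-dit Platey <br>10140 AMANCE</a>\n'
    '</td>\n'
)

original = r"<td.*>\s*Siège social\s*</td>\s*<td>\s*<a.*>(.*)<br/>(.*)</a>"
fixed = r"<td.*?>\s*Siège social\s*</td>\s*<td>\s*<a.*?>(.*)<br/?>(.*)</a>"

# The original pattern requires "<br/>", but the page contains "<br>".
print(re.search(original, html))  # None

m = re.search(fixed, html)
street, city = (g.strip() for g in m.groups())
print(street)  # Lieu-dit Platey
print(city)    # 10140 AMANCE
```

The captured groups keep their surrounding spaces, so the sketch trims them with `strip()`.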

Mutsop (Author) commented:
Hey, thanks for the reply.

I have another website I could use, but it has less information,
so I would love to get this one working.

I tried your solution,
but it doesn't seem to work.

This is for example one of the companies:
http://www.societe.com/societe/-brouard-combustibles-s-a-s-391942661.html

Dim RegexObj As New Regex("<td.*?>\s*Siège social\s*</td>\s*<td>\s*<a.*?>(.*)<br/?>(.*)</a>", RegexOptions.IgnoreCase Or RegexOptions.Multiline)
Dim MatchResults As Match = RegexObj.Match(page)
While MatchResults.Success
    extract = MatchResults.Groups(1).ToString + Environment.NewLine
    extract += "Adres: " + MatchResults.Groups(2).ToString + " " + MatchResults.Groups(3).ToString + Environment.NewLine
    MatchResults = MatchResults.NextMatch()
End While

The MatchResults.Success is false.

fyi: the company name works when I try
<h1[^>]*>(.*?)</h1>

njonestx commented:
I just tried it with the code (in VB.NET 2008) that you provided, and it was successful for me. It encountered 1 match after which the "extract" variable had the following contents:

 Zone Industrielle "les Cophas" 
Adres: 28120 ILLIERS-COMBRAY 

What you may want to check is how you are obtaining the source page. If you read it with the wrong encoding, the "è" may be stripped out or converted to "e" or some other character, and the literal "Siège social" in the pattern will then never match. The meta tag at the top of the page declares the encoding as "iso-8859-1", so that is what should be used to read it. Here's an MS article on this tag.

' Needs these at the top of the file:
Imports System.IO
Imports System.Net
Imports System.Text.RegularExpressions

Dim url As String = "http://www.societe.com/societe/-brouard-combustibles-s-a-s-391942661.html"
Dim client As WebClient = New WebClient()
Dim data As Stream = client.OpenRead(url)
' Decode with the charset declared in the page's meta tag so that "è" survives.
Dim reader As StreamReader = New StreamReader(data, System.Text.Encoding.GetEncoding("iso-8859-1"))
Dim page As String = ""
Do While Not reader.EndOfStream
    page += reader.ReadLine + vbCrLf
Loop
reader.Close()

Dim extract As String = ""
Dim RegexObj As New Regex("<td.*?>\s*Siège social\s*</td>\s*<td>\s*<a.*?>(.*)<br/?>(.*)</a>", RegexOptions.IgnoreCase Or RegexOptions.Multiline)
Dim MatchResults As Match = RegexObj.Match(page)
While MatchResults.Success
    ' Group 1 is the text before <br>, group 2 the text after it.
    ' The pattern has no third group, so Groups(3) is always empty.
    extract = MatchResults.Groups(1).ToString + Environment.NewLine
    extract += "Adres: " + MatchResults.Groups(2).ToString + " " + MatchResults.Groups(3).ToString + Environment.NewLine
    MatchResults = MatchResults.NextMatch()
End While
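
To see why the encoding matters for the match, here is a small sketch in Python (used only for illustration; the same byte-level behavior applies in .NET). It assumes the page was being decoded as UTF-8 in the failing case, which is a guess, since the original download code was not posted.

```python
# Bytes as an iso-8859-1 server would send them: "è" is the single byte 0xE8.
raw = "Siège social".encode("iso-8859-1")

# Assumed failing case: 0xE8 followed by "g" is not a valid UTF-8 sequence,
# so the decoder substitutes the replacement character U+FFFD.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the charset the page declares keeps the "è" intact.
right = raw.decode("iso-8859-1")

print("Siège social" in wrong)  # False
print("Siège social" in right)  # True
```

Once the "è" is lost, the literal text in the pattern can no longer be found, so the regex reports no match even though the pattern itself is fine.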

Mutsop (Author) commented:
Thanks, works amazing :D

käµfm³d 👽 commented:
For future readers, the use of RegexOptions.Multiline is incorrect in the accepted solution. That option should instead be RegexOptions.Singleline.
 
njonestx commented:
The Multiline option is not required, but neither will it affect the result of the regex, since ^ and $ are not used.

käµfm³d 👽 commented:
"The Multiline option is not required but neither will it affect the result of the regex since ^ or $ are not used."
True, but not using the Singleline option will affect the result of the match, hence the comment = )

njonestx commented:
Good point. Because he used \s to match the line feeds between the tags, everything turned out okay, but there will be a problem if there's a line feed inside one of the tags themselves, which is still valid HTML. Nice job kaufmed!
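
The point about line feeds inside a tag can be sketched as follows. This is Python, not VB.NET: `re.MULTILINE` corresponds to RegexOptions.Multiline and `re.DOTALL` to RegexOptions.Singleline. The sample row is a hypothetical variant in which the attributes of the `<a>` tag wrap onto a second line, and the capture groups are made lazy because a greedy `.*` that can cross lines would overrun.

```python
import re

# Hypothetical variant of the row: the <a> tag wraps onto a second line.
html = (
    '<td>Siège social</td>\n'
    '<td>\n'
    '<a href="http://maps.google.fr/maps?q=AMANCE"\n'
    '   target="_blank"> Lieu-dit Platey <br>10140 AMANCE</a>\n'
    '</td>\n'
)

pattern = r"<td.*?>\s*Siège social\s*</td>\s*<td>\s*<a.*?>(.*?)<br/?>(.*?)</a>"

# Default and MULTILINE: "." never matches a line feed, so "<a.*?>" cannot
# reach the ">" on the next line. MULTILINE only changes ^ and $.
no_flag = re.search(pattern, html)
multiline = re.search(pattern, html, re.MULTILINE)

# DOTALL (Singleline in .NET): "." also matches line feeds.
dotall = re.search(pattern, html, re.DOTALL)

print(no_flag, multiline)        # None None
print(dotall.group(1).strip())   # Lieu-dit Platey
print(dotall.group(2))           # 10140 AMANCE
```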