Solved

Charset Regular Expression

Posted on 2008-10-26
Last Modified: 2012-06-27
Hello,
I wrote a crawler, and for each page I need to extract the charset.
This is how I do it now, but the regular expression is not ideal: sometimes I get the charset as charset=UTF-8" /> or charset=iso-8859-1">, and I want just the value itself: utf-8, iso-8859-1 or whatever it is.
My code (the regular expression) is this:
preg_match("/charset ?.* \"/ix", $codigo, $charset);

Is there a better regular expression for this?
Question by:Pedro Chagas
11 Comments
 
LVL 13

Expert Comment

by:Xyptilon2
Try this: charset\s*=\s*([^"' >]*)

It matches everything after the = (plus any surrounding spaces) and stops at the first single quote, double quote, space or closing angle bracket. The first back-reference, \1, will contain the charset.

For help with regular expressions and testing them, have a look at the software RegexBuddy from http://www.regexbuddy.com
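
A minimal sketch of how that pattern behaves in PHP. It is written as a single-quoted string so the double quote inside the character class needs no escaping, and the two sample tags are made up for illustration rather than taken from a real page.

```php
<?php
// Pattern from the comment above; the i modifier makes "charset" case-insensitive.
$pattern = '/charset\s*=\s*([^"\' >]*)/i';

// Hypothetical sample tags, not taken from a real page.
$samples = array(
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />',
    "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>",
);

$found = array();
foreach ($samples as $html) {
    if (preg_match($pattern, $html, $m)) {
        // $m[1] is the first capture group: the bare charset name.
        $found[] = strtolower($m[1]);
    }
}

print_r($found);
```

Lowercasing the captured value is an added convenience here, so that later comparisons do not depend on how the page spelled the charset.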
 
LVL 51

Expert Comment

by:ahoffmann
preg_match("/charset\s*=\s*['\"]?([^'\"\s>]+)/ix", $codigo, $charset);
 
LVL 3

Author Comment

by:Pedro Chagas
In Xyptilon2's solution there is a lone double quote, which PHP does not accept on its own; I have to either remove it or add a second double quote.

With ahoffmann's solution, the output looks like this: "iso-8859-1iso-8859" "utf-8utf-". I think this one is close to the solution!

Best regards,
JC




iso-8859-1iso-8859- utf-8utf-
 
LVL 13

Expert Comment

by:Xyptilon2
My mistake, you need to escape the double quote with a backslash, like this:

preg_match("/charset\s*=\s*([^\"' >]*)/ix",$codigo, $charset);

Good luck!
 
LVL 51

Expert Comment

by:ahoffmann
> the system show me like this:"iso-8859-1iso-8859" "utf-8utf-"
Please post the line of HTML that produces this match, and also the code you use.

I guess you simply need to remove the x modifier.
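
For context, a small sketch of what the x (extended) modifier changes: PCRE then ignores unescaped whitespace in the pattern, except inside a character class. The subject strings are made up for illustration.

```php
<?php
// Without x, the space in the pattern must match a literal space.
$plain = preg_match('/a b/', 'ab');      // no space in the subject, so no match

// With x, the unescaped space in the pattern is ignored.
$extended = preg_match('/a b/x', 'ab');  // matches

// Inside a character class the space still counts, even with x.
$inClass = preg_match('/a[ ]b/x', 'ab'); // no match

var_dump($plain, $extended, $inClass);
```

The only space in the patterns suggested so far sits inside a character class, so x should not change what they match.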

 
LVL 3

Author Comment

by:Pedro Chagas
(For ahoffmann:) I removed the x modifier and the problem is the same. Here is the usual HTML line I am trying to get this information from:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> // in this case I want the value utf-8
============================

(For Xyptilon2:) I tried your example with the escaped double quote, but the problem is the same; the system returns this kind of information: iso-8859-1iso-8859-   /  utf-8utf-

===============================
If you want to see more code, tell me...
///////////////////// FETCH THE CONTENT

$codigo = @file_get_contents("$urlactual->url"); //get all code

//preg_match("/charset ?.* \"/ix", $codigo, $charset); //the original

preg_match("/charset\s*=\s*([^\"' >]*)/ix",$codigo, $charset); // expert examples

 
LVL 13

Expert Comment

by:Xyptilon2
I just tried this and it works:

<?PHP

$codigo = "<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />";

preg_match("/charset\s*=\s*([^\"' >]*)/ix",$codigo, $charset); // expert examples

print_r($charset);
?>

Remember that $charset[0] contains the full text matched by the pattern, and $charset[1] contains the back-reference, i.e. what was captured between the (), which is the one you want.

So echo $charset[1] will print your charset using the regex above. The same is true for ahoffmann's regex. I'm guessing you're looking at $charset[0] instead of $charset[1]?
 
LVL 51

Expert Comment

by:ahoffmann
LOL, Xyptilon2 pointed out why real programmers use Perl instead of PHP: there the difference between scalars and arrays is obvious (besides reducing the total number of lines of code significantly ;-)
 
LVL 3

Author Comment

by:Pedro Chagas
Hello,
I have attached a file so you can see what I receive.
The goal of this regular expression is to get the charset for further processing. We all know the problems with character encoding: a page in one charset needs one treatment, and a page in another charset needs a different one. So I have to know the charset in order to use it in an "if" condition. I am posting more code so you can understand the problem better.
I need just the bare name of the charset, utf-8 or whatever it is.
This is important because I collect the information from an entire site, and to process it properly I have to know the charset before I run the function 'mb_detect_encoding'.
If I can't solve the charset problem, I will have to find another way to reach the same goal.
$codigo = @file_get_contents("$urlactual->url"); //fetch the whole page source

//preg_match("/charset ?.* \"/ix", $codigo, $charset); //the original

preg_match("/charset\s*=\s*([^\"' >]*)/ix",$codigo, $charset); // expert examples
 
 

 
 

//normalize the charset variable

$marcador = implode ("", $charset); //convert the array to a string

$marcador = strtr($marcador, '/>', "  ");

$marcador = trim ($marcador); //strip whitespace from both ends

$marcador = substr("$marcador", 8, -1); //cut the string down so that only the charset remains

$marcador = strtolower ($marcador); //lowercase the whole string

unset($charset); //reset the variable that held the charset
 

//$tags = get_meta_tags("$urlactual->url");

//echo $urlactual->urlhost; 

//print_r($tags['keywords']);

//$enc = mb_detect_encoding (file ("tags['keywords']"));

$enc = mb_detect_encoding("$urlactual->url", "auto");
 
 
 

//preg_match("@<body[^>]*>(.*)</body>@Usi", $codigo, $titulo);

//preg_match("@<h2[^>]*>(.*)</h2>@Usi", $codigo, $titulo);

//print_r ($titulo);
 
 
 

if ($marcador == "" && $enc == "ASCII") {

	mysql_query("delete from urls where id = '$urlactual->id'", $db);

 		} else if ($marcador == "iso-8859-1") {

			preg_match("/title> ?.* <\/title>/ix", $codigo, $titulo);

			$titulo = $titulo[0];

			$titulo = mb_convert_encoding($titulo, "UTF-8", "ISO-8859-1");

			preg_match("@<h1[^>]*>(.*)</h1>@Usi", $codigo, $hh1);

			$hh1 = $hh1[0];

			$hh1 = mb_convert_encoding($hh1, "UTF-8", "ISO-8859-1");

			preg_match("@<h2[^>]*>(.*)</h2>@Usi", $codigo, $hh2);

			$hh2 = $hh2[0];

			$hh2 = mb_convert_encoding($hh2, "UTF-8", "ISO-8859-1");

			preg_match("@<strong[^>]*>(.*)</strong>@Usi", $codigo, $bold);

			$bold = $bold[0];

			$bold = mb_convert_encoding($bold, "UTF-8", "ISO-8859-1");

			$tags = get_meta_tags("$urlactual->url");

			$descricao = $tags['description'];

			$descricao = mb_convert_encoding($descricao, "UTF-8", "ISO-8859-1");

			$chaves = $tags['keywords'];

			$chaves = mb_convert_encoding($chaves, "UTF-8", "ISO-8859-1");

			mysql_query("insert into url_conteudo set url_bruto = '$urlactual->url', palavra_chave = '$palavradb', url_titulo = '$titulo', url_h1 = '$hh1', url_h2 = '$hh2', url_description = '$descricao', url_keywords = '$chaves', url_bold = '$bold'", $db);

			mysql_query("delete from urls where id = '$urlactual->id'", $db);

				} else if ($marcador == "utf-8") {

					preg_match("/title> ?.* <\/title>/ix", $codigo, $titulo);

					$titulo = $titulo[0];

					preg_match("@<h1[^>]*>(.*)</h1>@Usi", $codigo, $hh1);

					$hh1 = $hh1[0];

					preg_match("@<h2[^>]*>(.*)</h2>@Usi", $codigo, $hh2);

charset.jpg
 
LVL 51

Accepted Solution

by:
ahoffmann earned 250 total points
> $marcador = implode ("", $charset); //convert the array to a string
That is your problem!

It returns the string you described.
The advice was to use $charset[1] and *not* the complete array.
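
A sketch of why the doubled string appears, assuming a page that declares iso-8859-1 (the sample tag is made up): implode glues the full match and the captured group together, and the later substr call then cuts that combined string.

```php
<?php
// Hypothetical sample tag, not taken from a real page.
$codigo = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">';
preg_match("/charset\s*=\s*([^\"' >]*)/i", $codigo, $charset);

// What the posted code does: join $charset[0] and $charset[1] into one string.
$joined = implode("", $charset);   // "charset=iso-8859-1iso-8859-1"
$broken = substr($joined, 8, -1);  // drops "charset=" and the last character

// What was advised: read the captured group only.
$marcador = strtolower($charset[1]);

echo $broken, "\n";   // iso-8859-1iso-8859-
echo $marcador, "\n"; // iso-8859-1
```

With $charset[1] the strtr, trim and substr clean-up steps are no longer needed, because the captured group already holds the bare charset name.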
 
LVL 13

Expert Comment

by:Xyptilon2
As I said, and as ahoffmann also pointed out, you have to use $charset[1] and not $charset. The reasons are explained here: http://cn.php.net/manual/en/function.preg-match.php

Well, at least you have it working now.