Solved

Charset Regular Expression

Posted on 2008-10-26
Last Modified: 2012-06-27
Hello,
I wrote a crawler, and for each page I need to extract the charset.
This is how I do it now, but the regular expression is not ideal: sometimes I get the charset as charset=UTF-8" /> or charset=iso-8859-1">, when I want just the value, such as utf-8 or iso-8859-1.
My code (the regular expression) is:
preg_match("/charset ?.* \"/ix", $codigo, $charset);

Is there a better regular expression for this?
Question by:Pedro Chagas
11 Comments
 
LVL 13

Expert Comment

by:Xyptilon2
ID: 22810524
Try this: charset\s*=\s*([^"' >]*)

It matches everything after the = (allowing optional whitespace) and stops at a single or double quote, a space, or a closing angle bracket. The first capture group (back-reference \1) will contain the charset.
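
As a rough illustration of how that pattern behaves in PHP, here is a minimal sketch. It assumes the pattern is written in a single-quoted PHP string, so the double quote inside the character class needs no escaping; the sample tags and the `$samples` and `$found` names are made up for the example.

```php
<?php
// Pattern from the comment above, in a single-quoted PHP string:
// only the single quote inside the character class needs a backslash for PHP.
$pattern = '/charset\s*=\s*([^"\' >]*)/i';

// Hypothetical sample inputs, not taken from a real page.
$samples = [
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />',
    "<meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'>",
];

$found = [];
foreach ($samples as $html) {
    if (preg_match($pattern, $html, $m)) {
        // $m[1] is the first capture group: the bare charset name.
        $found[] = strtolower($m[1]);
    }
}

print_r($found);
```

One caveat: the pattern does not skip an opening quote, so for a tag written as `<meta charset="utf-8">` the capture group would come back empty. A variant that allows an optional quote after the = sign covers that case.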

For help with regular expressions and testing them, have a look at the software RegexBuddy from http://www.regexbuddy.com
 
LVL 51

Expert Comment

by:ahoffmann
ID: 22823572
preg_match("/charset\s*=\s*['\"]?([^'\"\s>]+)/ix", $codigo, $charset);
 
LVL 3

Author Comment

by:Pedro Chagas
ID: 22824846
With Xyptilon2's solution, the pattern contains a single double quote, and the system does not accept a lone double quote: I either have to remove it or add another one.

With ahoffmann's solution, the system shows me this: "iso-8859-1iso-8859" "utf-8utf-". I think this one is close to the solution!

Best regards,
JC




iso-8859-1iso-8859- utf-8utf-
 
LVL 13

Expert Comment

by:Xyptilon2
ID: 22824928
My mistake, you need to escape the double quote with a backslash, like this:

preg_match("/charset\s*=\s*([^\"' >]*)/ix",$codigo, $charset);

Good luck!
 
LVL 51

Expert Comment

by:ahoffmann
ID: 22824962
> the system show me like this:"iso-8859-1iso-8859" "utf-8utf-"
Please post the corresponding line of HTML where you get this match.
Also post the code you use.

I guess you simply need to remove the x modifier.
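
For context on the x modifier, here is a small sketch of what it does in PCRE as used by PHP: unescaped whitespace in the pattern is ignored, except inside a character class. The sample strings and the variable names (`$ignored`, `$literal`, `$withX`, `$withoutX`) are illustrative assumptions.

```php
<?php
// With /x, unescaped whitespace in the pattern is ignored,
// so this pattern behaves like /charset/.
$ignored = preg_match('/char set/x', 'charset');

// Whitespace inside a character class is still literal under /x.
$literal = preg_match('/a[ ]b/x', 'a b');

// So the character-class pattern captures the same text with or without x.
$html = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'; // sample input
preg_match('/charset\s*=\s*([^"\' >]*)/ix', $html, $withX);
preg_match('/charset\s*=\s*([^"\' >]*)/i', $html, $withoutX);

var_dump($ignored, $literal, $withX[1] === $withoutX[1]);
```

If both captures agree, as they should here, the modifier is not what produces the doubled output.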
 
LVL 3

Author Comment

by:Pedro Chagas
ID: 22827322
(For ahoffmann:) I removed the x modifier and the problem is the same. This is the typical HTML I am trying to get the information from:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> // in this case I want to get utf-8
============================

(For Xyptilon2:) I tried your example with the escaped double quote, but the problem is the same; the system returns this kind of information: iso-8859-1iso-8859-   /  utf-8utf-

===============================
If you want to see more code, let me know.
/////////////////////FETCH THE CONTENT
$codigo = @file_get_contents("$urlactual->url"); //get all code
//preg_match("/charset ?.* \"/ix", $codigo, $charset); //the original
preg_match("/charset\s*=\s*([^\"' >]*)/ix",$codigo, $charset); // expert examples

 
LVL 13

Expert Comment

by:Xyptilon2
ID: 22827996
I just tried this and it works:

<?PHP

$codigo = "<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />";

preg_match("/charset\s*=\s*([^\"' >]*)/ix",$codigo, $charset); // expert examples

print_r($charset);
?>

Remember that $charset[0] contains the full text matched by the pattern, and $charset[1] contains the first capture group, i.e. what is between the parentheses, which is the one you want.

So echo $charset[1] will print your charset using the regex above. The same is true for ahoffmann's regex. I'm guessing you're looking at $charset[0] instead of $charset[1]?
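
To make the difference between the two array entries concrete, here is a short sketch that reuses the test string from this comment. The `$nome` variable and the `isset()` guard are additions for illustration, assuming some pages may have no charset declaration at all.

```php
<?php
$codigo = "<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />";

preg_match("/charset\s*=\s*([^\"' >]*)/ix", $codigo, $charset);

echo $charset[0], "\n"; // the whole match: charset=utf-8
echo $charset[1], "\n"; // the capture group only: utf-8

// Hypothetical guard: $charset stays empty when the page declares no charset.
$nome = isset($charset[1]) ? strtolower($charset[1]) : '';
echo $nome, "\n";
```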
 
LVL 51

Expert Comment

by:ahoffmann
ID: 22828880
LOL, Xyptilon2 pointed out why real programmers use Perl instead of PHP: there the difference between scalars and arrays is obvious (besides reducing the total number of lines of code significantly ;-)
 
LVL 3

Author Comment

by:Pedro Chagas
ID: 22835159
Hello,
I have attached a file so you can see what I receive.
The goal of this regular expression is to get the charset for further processing. We all know the problems with character encoding: if the page uses one charset it needs one kind of treatment, and if it uses another it needs a different one. So I have to know the charset in order to use it in an "if" condition. I have included more code so you can understand the problem better.
I need just the bare charset name: utf-8 or whatever other type it is.
This is important because I collect all the information from the entire site, and to process the collected information properly I have to know the charset so that I can run the function 'mb_detect_encoding'.
If I can't solve the charset problem, I will have to find another way to reach the same goal.
$codigo = @file_get_contents("$urlactual->url"); // fetch the whole page source
//preg_match("/charset ?.* \"/ix", $codigo, $charset); //the original
preg_match("/charset\s*=\s*([^\"' >]*)/ix",$codigo, $charset); // expert examples
 
 
 
 
// normalize the charset variable
$marcador = implode ("", $charset); // convert the array to a string
$marcador = strtr($marcador, '/>', "  ");
$marcador = trim ($marcador); // strip whitespace from both ends
$marcador = substr("$marcador", 8, -1); // cut the string down so only the charset is left
$marcador = strtolower ($marcador); // lowercase the whole string
unset($charset); // reset the variable that held the charset
 
//$tags = get_meta_tags("$urlactual->url");
//echo $urlactual->urlhost; 
//print_r($tags['keywords']);
//$enc = mb_detect_encoding (file ("tags['keywords']"));
$enc = mb_detect_encoding("$urlactual->url", "auto");
 
 
 
//preg_match("@<body[^>]*>(.*)</body>@Usi", $codigo, $titulo);
//preg_match("@<h2[^>]*>(.*)</h2>@Usi", $codigo, $titulo);
//print_r ($titulo);
 
 
 
if ($marcador == "" && $enc == "ASCII") {
	mysql_query("delete from urls where id = '$urlactual->id'", $db);
 		} else if ($marcador == "iso-8859-1") {
			preg_match("/title> ?.* <\/title>/ix", $codigo, $titulo);
			$titulo = $titulo[0];
			$titulo = mb_convert_encoding($titulo, "UTF-8", "ISO-8859-1");
			preg_match("@<h1[^>]*>(.*)</h1>@Usi", $codigo, $hh1);
			$hh1 = $hh1[0];
			$hh1 = mb_convert_encoding($hh1, "UTF-8", "ISO-8859-1");
			preg_match("@<h2[^>]*>(.*)</h2>@Usi", $codigo, $hh2);
			$hh2 = $hh2[0];
			$hh2 = mb_convert_encoding($hh2, "UTF-8", "ISO-8859-1");
			preg_match("@<strong[^>]*>(.*)</strong>@Usi", $codigo, $bold);
			$bold = $bold[0];
			$bold = mb_convert_encoding($bold, "UTF-8", "ISO-8859-1");
			$tags = get_meta_tags("$urlactual->url");
			$descricao = $tags['description'];
			$descricao = mb_convert_encoding($descricao, "UTF-8", "ISO-8859-1");
			$chaves = $tags['keywords'];
			$chaves = mb_convert_encoding($chaves, "UTF-8", "ISO-8859-1");
			mysql_query("insert into url_conteudo set url_bruto = '$urlactual->url', palavra_chave = '$palavradb', url_titulo = '$titulo', url_h1 = '$hh1', url_h2 = '$hh2', url_description = '$descricao', url_keywords = '$chaves', url_bold = '$bold'", $db);
			mysql_query("delete from urls where id = '$urlactual->id'", $db);
				} else if ($marcador == "utf-8") {
					preg_match("/title> ?.* <\/title>/ix", $codigo, $titulo);
					$titulo = $titulo[0];
					preg_match("@<h1[^>]*>(.*)</h1>@Usi", $codigo, $hh1);
					$hh1 = $hh1[0];
					preg_match("@<h2[^>]*>(.*)</h2>@Usi", $codigo, $hh2);

charset.jpg
 
LVL 51

Accepted Solution

by:
ahoffmann earned 1000 total points
ID: 22835374
> $marcador = implode ("", $charset); //passar a array para string
That is your problem!

It returns the string you described.
The advice was to use $charset[1] and *not* the complete array.
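
A sketch of why the implode() approach produces the doubled string, and what using only the capture group looks like. The `$charset` array is filled in by hand here to mirror what preg_match() returns for a utf-8 page, and `$corrigido` is a hypothetical name for the corrected value.

```php
<?php
// What preg_match() fills in for a utf-8 page: full match, then the capture group.
$charset = ['charset=utf-8', 'utf-8'];

// The normalization steps from the question, unchanged.
$marcador = implode("", $charset);        // "charset=utf-8utf-8"
$marcador = strtr($marcador, '/>', "  ");
$marcador = trim($marcador);
$marcador = substr($marcador, 8, -1);     // drops "charset=" and the last character
$marcador = strtolower($marcador);
echo $marcador, "\n";                     // utf-8utf-

// Using only the capture group gives the bare name.
$corrigido = isset($charset[1]) ? strtolower(trim($charset[1])) : '';
echo $corrigido, "\n";                    // utf-8
```

With the capture group used directly, the strtr() and substr() clean-up steps are no longer needed.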
 
LVL 13

Expert Comment

by:Xyptilon2
ID: 22837789
As I said, and as ahoffmann also pointed out, you have to use $charset[1] and not $charset. The reasons are explained here: http://cn.php.net/manual/en/function.preg-match.php

Well, at least you have it working now.