
PHP PREG RegEx for European and Cyrillic Languages

Hello:

I am trying to put a text file into an array.

$theRegEx = "#[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2}\t[0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZzÁáÀàAa¿¿¿¿¿¿¿¿Ââ¿¿¿¿¿¿¿¿AaÅå¿¿ÄäAaÃã¿¿¿¿AaAa¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿b¿¿¿¿¿¿CcCcCcCcÇç¿¿¿¿¿¿¿Dd¿¿¿¿¿¿¿¿¿¿ÐdÐð¿¿Ð¿¿¿¿¿¿¿ÉéÈèEeÊê¿¿¿¿¿¿¿¿EeËë¿¿Ee¿¿¿¿EeEe¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿ƒƒ¿¿GgGgGgGgGg¿¿Gg¿¿¿Hh¿¿¿¿¿¿¿¿¿¿¿¿H_¿Hh¿¿ÍíÌìIiÎîIiÏï¿¿IiIiIiIi¿¿¿¿¿¿¿¿¿¿IiI¿¿¿JjJ¿j¿¿¿¿¿¿¿KkKk¿¿¿¿¿¿¿¿¿LlLlLl¿¿¿¿¿¿¿¿LlL¿l¿¿¿¿l¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿Nn¿¿NnÑñ¿¿Nn¿¿¿¿¿¿¿¿¿¿¿¿¿¿N¨n¨ÓóÒòOoÔô¿¿¿¿¿¿¿¿OoÖö¿¿OoÕõ¿¿¿¿¿¿¿¿¿¿Øø¿¿OoOoOo¿¿¿¿¿¿¿¿¿¿Oo¿¿¿¿¿¿¿¿¿¿¿¿¿¿O¿¿¿¿¿¿¿¿¿¿P~p~¿¿¿Q°q°Q¸q¸RrRr¿¿Rr¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿Ss¿¿SsŠš¿¿¿¿¿Ss¿¿¿¿¿¿¿¿¿¿S¿s¿¿¿TtT¨¿¿¿Tt¿¿¿¿¿¿¿¿Tt¿¿¿t¿¿T¿¿ÚúÙùUuÛûUuUuÜüUuUuUuUuUuUu¿¿UuUu¿¿¿¿¿¿¿¿Uu¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿WwW°¿¿¿¿¿¿¿¿¿¿¿¿Ýý¿¿YyY°¿Ÿÿ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿Zz¿¿ŽžZz¿¿¿¿¿z¿¿¿¿¿¿¿¿¿¿¿¿¿ ]+\n#";

preg_match_all($theRegEx, $theData, $theData2, PREG_SET_ORDER);


My source looks something like this (French):

00:00:00:00 Et c'est une cruelle ironie du sort que de voir une explosion comme celle-là se produire
00:00:07:22 alors que les Africains célébraient
00:00:11:18 et suivaient la Coupe du monde qui se déroulait en Afrique du Sud.
00:00:16:11 D'un côté, on a la vision d'une Afrique qui avance,
00:00:21:00 d'une Afrique unifiée,
00:00:22:24 d'une Afrique qui se modernise et se crée des possibilités ;
00:00:26:17 de l'autre, on a
00:00:28:08 une vision d'Al-Qaïda et d'Al-Shebab qui n'est que de destruction et de mort.
00:00:34:19 Je crois que cela présente un contraste assez clair en ce qui concerne l'avenir
00:00:39:10 que la plupart des Africains souhaitent pour eux-mêmes et pour leurs enfants.
00:00:42:21 Nous devons nous assurer que nous faisons tout
00:00:46:11 notre possible afin de soutenir ceux qui veulent bâtir,
00:00:48:08 par opposition à ceux qui veulent démolir.
00:00:55:08 Ce qu'on a pu voir dans certaines des déclarations
00:00:58:19 faites par des organisations terroristes, c'est
00:01:00:17 qu'elles ne considèrent pas la vie des Africains comme précieuse en soi.
00:01:05:10 Elles la voient comme un terrain où il est possible de livrer des batailles idéologiques
00:01:12:20 qui tuent des innocents, au mépris des conséquences à long terme
00:01:16:07 pourvu qu'elles y trouvent des avantages tactiques à court terme.
00:01:18:19 C'est la raison pour laquelle il est si important, alors même que nous affrontons


Is my RegEx correct?

Right now it won't return anything.

tdterry

Since your input is newline-delimited, use the multi-line modifier (m) and $ to match the end of each line. Then just take everything after the timestamp rather than trying to enumerate every possible character for every language.

$theRegEx = "#[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2}\t(?:.*)$#m";

preg_match_all($theRegEx, $theData, $theData2, PREG_SET_ORDER);
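If you would still rather restrict which characters are allowed, a Unicode property class avoids the hand-typed list. This is only a minimal sketch, and it assumes the data is UTF-8 encoded, which the thread does not confirm; the sample string is made up for illustration:

```php
<?php
// Assumption: the input is UTF-8. The "u" modifier makes PCRE treat the
// pattern and subject as UTF-8, so \p{L} matches any letter (plain Latin,
// accented, Cyrillic, ...) without listing each one.
$theData = "00:00:21:00\td'une Afrique unifiée,\n"
         . "00:00:22:24\tПривет, мир!\n";

// \p{L} = letters, \p{N} = digits, \p{P} = punctuation, plus a plain space.
$theRegEx = '#[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2}\t[\p{L}\p{N}\p{P} ]+$#mu';

$count = preg_match_all($theRegEx, $theData, $theData2, PREG_SET_ORDER);

var_export($theData2);
```

The pattern is single-quoted here so that PHP passes \t and \p{...} through to the regex engine untouched.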


RowanCoder

ASKER

Hi tdterry:

The RegEx is very close. I need to separate each time code plus its text:

00:00:11:18 et suivaient la Coupe du monde qui se déroulait en Afrique du Sud.

00:00:16:11 D'un côté, on a la vision d'une Afrique qui avance,

00:00:21:00 d'une Afrique unifiée,

00:00:22:24 d'une Afrique qui se modernise et se crée des possibilités ;

Thanks much,
Karen

tdterry

Then you just need to add some grouping parens so you can get the data out. It depends on how you want the data back, but if you want the four numbers as a single value and the string as another single value do this:

$theRegEx = "#([0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2})\t(.*)$#m";

If you want each of the four numbers separately, try:

$theRegEx = "#([0-9]{2}):([0-9]{2}):([0-9]{2}):([0-9]{2})\t(.*)$#m";

Wherever you put (...), preg_match_all will return that portion of the match as a separate value for you to use.
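As a rough illustration of how those groups come back with PREG_SET_ORDER, here is a small sketch. The two sample lines are tab-separated to suit the \t in the pattern, and the $rows array with its 'time' and 'text' keys is just a hypothetical shape for the result:

```php
<?php
// Illustrative sample: a tab between the time code and the text.
$theData = "00:00:11:18\tet suivaient la Coupe du monde\n"
         . "00:00:16:11\tD'un côté, on a la vision\n";

$theRegEx = '#([0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2})\t(.*)$#m';

preg_match_all($theRegEx, $theData, $theData2, PREG_SET_ORDER);

// With PREG_SET_ORDER each element of $theData2 is one match:
// [0] = the whole match, [1] = first (...) group, [2] = second (...) group.
$rows = [];
foreach ($theData2 as $match) {
    $rows[] = ['time' => $match[1], 'text' => $match[2]];
}

var_export($rows);
```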

RowanCoder

ASKER

Hi tdterry:

Still not quite it. My guess is that it has something to do with newline (\n).

The file I am parsing is something like:

00:00:11:18 et suivaient la Coupe du monde qui se déroulait en Afrique du Sud.

00:00:16:11 D'un côté, on a la vision d'une Afrique qui avance,

00:00:21:00 d'une Afrique unifiée,

00:00:22:24 d'une Afrique qui se modernise et se crée des possibilités ;

Each timestamp and its text are on a new line. The new regex currently picks up the first timestamp in the file and then puts everything following it into the second key of the array, including the following timestamps.

My regex for English is:

#[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2}\t[a-zA-Z0-9\s;\\,\\.\\!\\?]+\n#

but this won't work for other languages.

Thanks much again,
Karen

Ray Paseur

Please post a link to the file you are parsing, or post the file itself. Thanks, ~Ray

tdterry

You need the 'm' after the last '#'. This tells the regex parser that the input is multiple lines and the $ should match at the end of EACH line rather than the end of the string. The code snippet below correctly matches your sample lines. Note, I changed the \t to \s+ because I used a space separator rather than a tab, but otherwise, the example is the same.

$a = <<< EOS
00:00:11:18 et suivaient la Coupe du monde qui se déroulait en Afrique du Sud.
00:00:16:11 D'un côté, on a la vision d'une Afrique qui avance,
00:00:21:00 d'une Afrique unifiée,
00:00:22:24 d'une Afrique qui se modernise et se crée des possibilités ;
EOS;

preg_match_all("#([0-9]{2}):([0-9]{2}):([0-9]{2}):([0-9]{2})\s+(.*)$#m", $a, $matches);

var_export($matches);
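One possible reason for a match that runs on across several lines is that the file uses carriage-return line endings: by default "." only stops at "\n". That is a guess about the file, not something confirmed here. A sketch that normalizes the line endings before matching (the variable names and sample data are illustrative):

```php
<?php
// Assumption: the file may use "\r" or "\r\n" line endings instead of "\n".
// With default PCRE settings "." stops only at "\n", so lines separated
// by a lone "\r" can end up inside a single match.
$raw = "00:00:21:00\td'une Afrique unifiée,\r"
     . "00:00:22:24\td'une Afrique qui se modernise\r";

// Convert "\r\n" and lone "\r" to "\n" before matching.
$normalized = preg_replace('/\r\n?/', "\n", $raw);

$pattern = '#([0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2})\s+(.*)$#m';

$before = preg_match_all($pattern, $raw, $m1, PREG_SET_ORDER);
$after  = preg_match_all($pattern, $normalized, $m2, PREG_SET_ORDER);

echo "matches before: $before, after: $after\n";
```

If the counts differ on the real file, the line endings are the likely culprit.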


ASKER CERTIFIED SOLUTION

Ray Paseur


RowanCoder

ASKER

Hi tdterry & Ray_Passeur:

This almost works:

#[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2}\t(.*)$#m

It separates some.

Attached is the text file. And yes there are newlines.

fr.txt