Replace letters on the whole page, but not within the tags

I've been searching for this for days, but with no result.

I want to replace Latin letters with Cyrillic ones using PHP, but I want to exclude words and letters inside a specific tag, <notranslate>.

So if I have:
<p><b>Ovo je neki tekst</b> i ovo sigurno <notranslate>nece preci u cirilicu</translate>, hvala !</p>


I want it to become:
<p><b>Ово је неки текст</b> и ово сигурно <notranslate>nece preci u cirilicu</translate>, хвала !</p>


How to do this, using regex ?
V4nP3rs13 Asked:
Marco Gasi (Freelancer) Commented:
First, note that you have mismatched opening and closing tags: you open with <notranslate> but close with </translate> instead of </notranslate>.

Second, do you want to do this locally, before publishing the site, or do you want it to happen in real time, when the page is online?
Ray Paseur Commented:
How to do this, using regex ?
Bwahahaha!  You don't do this using regex.  You can't do this using regex.  Regex is the wrong tool.  This is humorous, and it explains why.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?answertab=oldest#tab-top

Let's take a step back and build the application correctly.  The first thing we need is the translation table.  We need to see the input Latin letters and the corresponding Cyrillic letters.  Where there is a one-to-one correspondence, it's relatively easy to get this right.  When you need more complex or idiomatic structures, it may be outside of the capabilities of PHP, but we can talk about that after we see your translation table.

Next we need to understand where this data comes from.  Where does the Latin character string originate?  How do the <notranslate> tags get inserted?

Finally, we need some test data.  The quality and completeness of any solution we can generate will be directly related to the quality and completeness of the test data.

If you can give us those things we can help you get the project working correctly.  If you're interested in the general design patterns for multi-language web sites, this seems to work well.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_8910-A-Polyglot-Web-Site-in-PHP.html
Terry Woods (IT Guru) Commented:
It can be done with regex (even if it might be a bit ugly and hard to understand) in special cases, such as when the tags in question aren't nested. However, when the goal is mapping characters, regex is less appropriate anyway, and some simple parsing is probably better.

The question has some test data, but Marco's question should be addressed I think. And are you happy with having some simple parsing code instead of a regex? (The correct answer should be "yes" here!)
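
A minimal sketch of what that "simple parsing" could look like, assuming <notranslate> blocks are never nested and the markup contains no stray < or > characters. The function name translate_outside_tags() and the partial letter table are hypothetical, just enough for the test string from the question. A regex is used only to cut the string into protected pieces and plain text; the character mapping itself is done with strtr().

```php
<?php
// Hypothetical helper: transliterate text, but leave tags and
// <notranslate>...</notranslate> blocks untouched.
// Assumes <notranslate> elements are never nested.
function translate_outside_tags(string $html, array $map): string
{
    // Protected pieces: a whole <notranslate> block, or any other single tag.
    $pattern = '~(<notranslate>.*?</notranslate>|<[^>]*>)~is';

    // PREG_SPLIT_DELIM_CAPTURE keeps the protected pieces in the result.
    $parts = preg_split($pattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        // With a single capture group, even indexes hold plain text
        // and odd indexes hold the protected pieces.
        if ($i % 2 === 0) {
            $parts[$i] = strtr($part, $map);
        }
    }

    return implode('', $parts);
}

// Partial table (an assumption): only the letters used in the test string.
$map = [
    'O' => 'О', 'a' => 'а', 'e' => 'е', 'g' => 'г', 'h' => 'х', 'i' => 'и',
    'j' => 'ј', 'k' => 'к', 'l' => 'л', 'n' => 'н', 'o' => 'о', 'r' => 'р',
    's' => 'с', 't' => 'т', 'u' => 'у', 'v' => 'в',
];

$alpha = '<p><b>Ovo je neki tekst</b> i ovo sigurno <notranslate>nece preci u cirilicu</notranslate>, hvala !</p>';

echo translate_outside_tags($alpha, $map), PHP_EOL;
```

Tag names and attributes are never passed to strtr(), so they stay in Latin script. For arbitrary real-world HTML a proper parser is still the safer route.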
Ray Paseur Commented:
@Terry, I agree that it has "some" test data, but as Marco noted, the tags seem to be wrong.  And I rather expect that a more comprehensive translation table would be useful, just in case there are other letters that need to be translated.  I would probably write a state engine to do the translation, if the translation must be done on the marked-up string.   If the data originates in a program that generates the mark-up, it would seem easier to translate before the mark-up is inserted, or before the template is populated with the data.
V4nP3rs13 (Author) Commented:
My mistake, it should be </notranslate> instead of </translate>.

So, what would be the solution?
Marco Gasi (Freelancer) Commented:
Well, it looks like you got a lot of good and important suggestions here. The most important one is that regular expressions are not the right tool to parse HTML.
I'll repeat my question: are you processing local files to prepare them for publishing, or do you want a live translation system? If you're going to add a real-time translation system to your site, you have a lot of options. I use jquery.lang.js, a jQuery plugin I like a lot.
Take a look at it. In this case, it would be better to mark the translatable blocks with the lang attribute:

<p lang='ru'><b lang='ru'>Ovo je neki tekst</b>  <span lang='ru'>i ovo sigurno</span> nece preci u cirilicu, <span lang='ru'>hvala !</span></p>


The jQuery plugin replaces every string contained in a block that has the lang attribute set with the corresponding string from a file for that language. This is a JSON file structured as follows:

{
  "token": {
    "Ovo je neki tekst":
       "Ово је неки текст",
    "i ovo sigurno":
       "и ово сигурно",
    "hvala !":
       "хвала !"
  }
}


Once initialized, the plugin will translate your page, and the result will be:
<p><b>Ово је неки текст</b> и ово сигурно <notranslate>nece preci u cirilicu</notranslate>, хвала !</p>


If you need to process local files, you should look at DOM parsers. Ray helped me with them a few days ago, and you may find this one useful: simple_html_dom.php
Ray Paseur Commented:
This seems to test out OK: http://iconoun.com/demo/temp_v4n.php
<?php // demo/temp_v4n.php

/**
 * http://www.experts-exchange.com/questions/28694184/Replace-letters-on-the-whole-page-but-not-within-the-tags.html
 *
 * http://webdesign.about.com/od/localization/l/blhtmlcodes-ru.htm
 * https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
 */
error_reporting(E_ALL);

// TEST DATA FROM THE POST AT E-E
$alpha = '<p><b>Ovo je neki tekst</b> i ovo sigurno <notranslate>nece preci u cirilicu</notranslate>, hvala !</p>';

// MAKE THE TRANSLATION
$omega = translate($alpha);

// SHOW THE WORK PRODUCT
echo '<pre>';
echo PHP_EOL . $alpha;
echo PHP_EOL . $omega;


// A FUNCTION TO RETURN THE CYRILLIC CHARACTER ENTITY
function Cyrillic($char)
{
    // OUR REPLACEMENT ARRAY OF CHARACTERS (YOU MAY WANT SOME ADDITIONS OR CHANGES HERE)
    static
    $translation
    = array
    ( 'O' => '&#x41e;'

    , 'a' => '&#x430;'
    , 'e' => '&#x435;'
    , 'g' => '&#x433;'
    , 'h' => '&#x445;'
    , 'i' => '&#x438;'
    , 'k' => '&#x43A;'
    , 'l' => '&#x43B;'
    , 'n' => '&#x43D;'
    , 'o' => '&#x43E;'
    , 't' => '&#x442;'
    , 'r' => '&#x440;'
    , 's' => '&#x441;'
    , 'u' => '&#x443;'
    , 'v' => '&#x432;'
    )
    ;
    if (array_key_exists($char, $translation)) return $translation[$char];
    return $char;
}


// A STATE ENGINE TO TRANSLATE THE CHARACTERS
function Translate($str, $signal='notranslate')
{
    $opentag = '<'  . $signal . '>';
    $stoptag = '</' . $signal . '>';
    $lower   = strtolower($str);

    // PROCESS CHARACTER BY CHARACTER
    $arr = str_split($str);
    $tag = 0;
    $ntr = 0;
    foreach ($arr as $key => $char)
    {
        // IF THIS CHARACTER ENDS A NO-TRANSLATE STATE
        if (substr($lower, $key, strlen($stoptag)) == $stoptag) $ntr--;

        // IF THIS CHARACTER ENDS ANY TAG (GUARDED SO A '>' SEEN WHILE NO TAG IS OPEN CANNOT DRIVE THE COUNTER NEGATIVE)
        if ($char == '>' && $tag > 0) $tag--;

        // IF WE ARE INSIDE A TAG OR NO-TRANSLATE STATE
        if ( $ntr || $tag ) continue;

        // IF THIS CHARACTER STARTS A TAG
        if ($char == '<') $tag++;

        // IF THIS CHARACTER STARTS A NO-TRANSLATE STATE
        if (substr($lower, $key, strlen($opentag)) == $opentag) $ntr++;

        // IF THIS CHARACTER IS ELIGIBLE FOR TRANSLATION
        if ( !$tag && !$ntr ) $arr[$key] = cyrillic($char);
    }

    return implode('', $arr);
}

V4nP3rs13 (Author) Commented:
@Ray Paseur Perfect. Could you recode it to work with double-character replacements? For example, I need to replace Latin "nj" with Cyrillic "њ".
Ray Paseur Commented:
Is that the only double character you need?  If not, please tell me what all of the others are.
V4nP3rs13 (Author) Commented:
"nj" to "њ", "lj" to "љ" and "dž" to "џ"... and their uppercase letters. But I will add them to the list later.
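
One way to handle these digraphs without a second pass is PHP's strtr() with an array: it tries the longest matching key first and never rescans text it has already replaced, so "nj" is matched before "n" and "j". This is only a sketch; the table below is partial and hypothetical, covering the three digraphs plus the few single letters needed for the sample words.

```php
<?php
// Partial, hypothetical table: digraphs listed alongside single letters.
// strtr() with an array picks the longest matching key at each position,
// so "nj" wins over "n" followed by "j".
$map = [
    'nj' => 'њ', 'Nj' => 'Њ', 'NJ' => 'Њ',
    'lj' => 'љ', 'Lj' => 'Љ', 'LJ' => 'Љ',
    'dž' => 'џ', 'Dž' => 'Џ', 'DŽ' => 'Џ',
    'n' => 'н', 'l' => 'л', 'j' => 'ј', 'd' => 'д', 'ž' => 'ж',
    'a' => 'а', 'b' => 'б', 'e' => 'е', 'k' => 'к', 'o' => 'о',
    'p' => 'п', 'u' => 'у', 'v' => 'в',
];

echo strtr('konj, ljubav, džep', $map), PHP_EOL;
// коњ, љубав, џеп
```

A plain table cannot tell when "n" and "j" are two separate letters that happen to be adjacent (for example in "injekcija"), so such words would need an exception list.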
V4nP3rs13 (Author) Commented:
Thanks a lot!

I handled the double letters by wrapping the final implode() call in str_replace() :)

And this is my final version of your code, with a few small modifications:
$alpha = '<p><b>Ovo je neki tekst</b> i ovo sigurno <notranslate>nece preci u cirilicu</notranslate>, hvala !</p>';

function cyrillic($char) {
    $latin    = array("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "Z", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "z");
    $cyrillic = array("А", "Б", "Ц", "Д", "Е", "Ф", "Г", "Х", "И", "Ј", "К", "Л", "М", "Н", "О", "П", "Q", "Р", "С", "Т", "У", "В", "З", "а", "б", "ц", "д", "е", "ф", "г", "х", "и", "ј", "к", "л", "м", "н", "о", "п", "q", "р", "с", "т", "у", "в", "з");
    $convert = array_combine($latin, $cyrillic);

    if (array_key_exists($char, $convert)) {
    	return $convert[$char];
    } else {
    	return $char;
    }
}

// Walk the string byte by byte, tracking whether we are inside a tag or a no-translate block
$lower = strtolower($alpha);
$arr = str_split($alpha);
$tag = 0;
$ntr = 0;

foreach ($arr as $key => $char) {
    if (substr($lower, $key, strlen('</notranslate>')) == '</notranslate>') $ntr--;
    if ($char == '>' && $tag > 0) $tag--;
    if ( $ntr || $tag ) continue;
    if ($char == '<') $tag++;
    if (substr($lower, $key, strlen('<notranslate>')) == '<notranslate>') $ntr++;
    if ( !$tag && !$ntr ) $arr[$key] = cyrillic($char);
}


// Fix up diacritics and digraphs afterwards. Note: this pass runs over the whole string,
// so "š", "đ", etc. inside tags or <notranslate> blocks are converted too.
$fix_from = array("š", "đ", "č", "ć", "ž", "Š", "Đ", "Č", "Ć", "Ž", "лј", "нј", "дж", "ЛЈ", "НЈ", "ДЖ", "Лј", "Нј", "Дж");
$fix_to = array("ш", "ђ", "ч", "ћ", "ж", "Ш", "Ђ", "Ч", "Ћ", "Ж", "љ", "њ", "џ", "Љ", "Њ", "Џ", "Љ", "Њ", "Џ");
echo str_replace($fix_from, $fix_to, implode('', $arr));

Ray Paseur Commented:
Looks good.  Thanks for the points and thanks for using E-E, ~Ray