Solved

Substituting ascii for latin symbol using regex and a dictionary

Posted on 2006-11-15
Last Modified: 2008-02-01
I originally asked this in the wrong board. If a moderator sees this, please close my earlier question, as I think it will get better attention here.

In Perl I used this for replacing ascii text with latin-1 characters:

my %latin=(nbsp=>' ',iexcl=>'¡',cent=>'¢',pound=>'£',curren=>'¤',yen=>'¥',brvbar=>'¦',sect=>'§',uml=>'¨',copy=>'©',ordf=>'ª',laquo=>'«',
not=>'¬',shy=>'­',reg=>'®',macr=>'¯',deg=>'°',plusmn=>'±',sup2=>'²',sup3=>'³',acute=>'´',micro=>'µ',
para=>'¶',middot=>'·',cedil=>'¸',sup1=>'¹',ordm=>'º',raquo=>'»',frac14=>'¼',frac12=>'½',frac34=>'¾',
iquest=>'¿',Agrave=>'À',Aacute=>'Á',Acirc=>'Â',Atilde=>'Ã',Auml=>'Ä',Aring=>'Å',AElig=>'Æ',Ccedil=>'Ç',
Egrave=>'È',Eacute=>'É',Ecirc=>'Ê',Euml=>'Ë',Igrave=>'Ì',Iacute=>'Í',Icirc=>'Î',Iuml=>'Ï',ETH=>'Ð',Ntilde=>'Ñ',
Ograve=>'Ò',Oacute=>'Ó',Ocirc=>'Ô',Otilde=>'Õ',Ouml=>'Ö',times=>'×',Oslash=>'Ø',Ugrave=>'Ù',Uacute=>'Ú',
Ucirc=>'Û',Uuml=>'Ü',Yacute=>'Ý',THORN=>'Þ',szlig=>'ß',agrave=>'à',aacute=>'á',acirc=>'â',atilde=>'ã',auml=>'ä',
aring=>'å',aelig=>'æ',ccedil=>'ç',egrave=>'è',eacute=>'é',ecirc=>'ê',euml=>'ë',igrave=>'ì',iacute=>'í',icirc=>'î',
iuml=>'ï',eth=>'ð',ntilde=>'ñ',ograve=>'ò',oacute=>'ó',ocirc=>'ô',otilde=>'õ',ouml=>'ö',divide=>'÷',oslash=>'ø',
ugrave=>'ù',uacute=>'ú',ucirc=>'û',uuml=>'ü',yacute=>'ý',thorn=>'þ',yuml=>'ÿ');

And then the regex to substitute
$line =~ s/&(nbsp|iexcl|cent|pound|curren|yen|brvbar|sect|uml|copy|ordf|laquo|not|shy|reg|macr|deg|plusmn|
sup2|sup3|acute|micro|para|middot|cedil|sup1|ordm|raquo|frac14|frac12|frac34|iquest|Agrave|Aacute|Acirc|
Atilde|Auml|Aring|AElig|Ccedil|Egrave|Eacute|Ecirc|Euml|Igrave|Iacute|Icirc|Iuml|ETH|Ntilde|Ograve|Oacute|Ocirc
|Otilde|Ouml|times|Oslash|Ugrave|Uacute|Ucirc|Uuml|Yacute|THORN|szlig|agrave|aacute|acirc|atilde|auml|aring|
aelig|ccedil|egrave|eacute|ecirc|euml|igrave|iacute|icirc|iuml|eth|ntilde|ograve|oacute|ocirc|otilde|ouml|divide|
oslash|ugrave|uacute|ucirc|uuml|yacute|thorn|yuml);/$latin{$1}/gx;

The hash in Perl from above converted to a Python dictionary looks like this:
latin1={"nbsp":" ", "iexcl":"¡", "cent":"¢", "pound":"£", "curren":"¤", "yen":"¥", "brvbar":"¦", "sect":"§", "uml":"¨", "copy":"©", "ordf":"ª", "laquo":"«", "not":"¬", "shy":"­", "reg":"®", "macr":"¯", "deg":"°", "plusmn":"±", "sup2":"²", "sup3":"³", "acute":"´", "micro":"µ", "para":"¶", "middot":"·", "cedil":"¸", "sup1":"¹", "ordm":"º", "raquo":"»", "frac14":"¼", "frac12":"½", "frac34":"¾", "iquest":"¿", "Agrave":"À", "Aacute":"Á", "Acirc":"Â", "Atilde":"Ã", "Auml":"Ä", "Aring":"Å", "AElig":"Æ", "Ccedil":"Ç", "Egrave":"È", "Eacute":"É", "Ecirc":"Ê", "Euml":"Ë", "Igrave":"Ì", "Iacute":"Í", "Icirc":"Î", "Iuml":"Ï", "ETH":"Ð", "Ntilde":"Ñ", "Ograve":"Ò", "Oacute":"Ó", "Ocirc":"Ô", "Otilde":"Õ", "Ouml":"Ö", "times":"×", "Oslash":"Ø", "Ugrave":"Ù", "Uacute":"Ú", "Ucirc":"Û", "Uuml":"Ü", "Yacute":"Ý", "THORN":"Þ", "szlig":"ß", "agrave":"à", "aacute":"á", "acirc":"â", "atilde":"ã", "auml":"ä", "aring":"å", "aelig":"æ", "ccedil":"ç", "egrave":"è", "eacute":"é", "ecirc":"ê", "euml":"ë", "igrave":"ì", "iacute":"í", "icirc":"î", "iuml":"ï", "eth":"ð", "ntilde":"ñ", "ograve":"ò", "oacute":"ó", "ocirc":"ô", "otilde":"õ", "ouml":"ö", "divide":"÷", "oslash":"ø", "ugrave":"ù", "uacute":"ú", "ucirc":"û", "uuml":"ü", "yacute":"ý", "thorn":"þ", "yuml":"ÿ"}

The regular expression, on finding nbsp, iexcl, cent, etc. in $line, replaces each one with its value from the associative array %latin, so the character displays correctly. If possible, I would like to do the same thing in Python, but I'm not sure Python regular expressions have an equivalent of the $1 variable. Is there a way I can replicate this behaviour in Python?
Thanks!
Question by:Tabris42
2 Comments
 

Accepted Solution

by:
RichieHindle earned 500 total points
ID: 17947930
The re.sub() function is what you need - you can pass it a function which can do the lookup:

# coding: latin-1
import re

def substitute(text):
    def lookup(match):
        latin1={"nbsp":" ", "iexcl":"¡", "cent":"¢", "pound":"£", "curren":"¤", "yen":"¥", "brvbar":"¦", "sect":"§", "uml":"¨", "copy":"©", "ordf":"ª", "laquo":"«", "not":"¬", "shy":"­", "reg":"®", "macr":"¯", "deg":"°", "plusmn":"±", "sup2":"²", "sup3":"³", "acute":"´", "micro":"µ", "para":"¶", "middot":"·", "cedil":"¸", "sup1":"¹", "ordm":"º", "raquo":"»", "frac14":"¼", "frac12":"½", "frac34":"¾", "iquest":"¿", "Agrave":"À", "Aacute":"Á", "Acirc":"Â", "Atilde":"Ã", "Auml":"Ä", "Aring":"Å", "AElig":"Æ", "Ccedil":"Ç", "Egrave":"È", "Eacute":"É", "Ecirc":"Ê", "Euml":"Ë", "Igrave":"Ì", "Iacute":"Í", "Icirc":"Î", "Iuml":"Ï", "ETH":"Ð", "Ntilde":"Ñ", "Ograve":"Ò", "Oacute":"Ó", "Ocirc":"Ô", "Otilde":"Õ", "Ouml":"Ö", "times":"×", "Oslash":"Ø", "Ugrave":"Ù", "Uacute":"Ú", "Ucirc":"Û", "Uuml":"Ü", "Yacute":"Ý", "THORN":"Þ", "szlig":"ß", "agrave":"à", "aacute":"á", "acirc":"â", "atilde":"ã", "auml":"ä", "aring":"å", "aelig":"æ", "ccedil":"ç", "egrave":"è", "eacute":"é", "ecirc":"ê", "euml":"ë", "igrave":"ì", "iacute":"í", "icirc":"î", "iuml":"ï", "eth":"ð", "ntilde":"ñ", "ograve":"ò", "oacute":"ó", "ocirc":"ô", "otilde":"õ", "ouml":"ö", "divide":"÷", "oslash":"ø", "ugrave":"ù", "uacute":"ú", "ucirc":"û", "uuml":"ü", "yacute":"ý", "thorn":"þ", "yuml":"ÿ"}
        return latin1[match.group(1)]
    return re.sub(r'&(nbsp|iexcl|cent|pound|curren|yen|brvbar|sect|uml|copy|ordf|laquo|not|shy|reg|macr|deg|plusmn|sup2|sup3|acute|micro|para|middot|cedil|sup1|ordm|raquo|frac14|frac12|frac34|iquest|Agrave|Aacute|Acirc|Atilde|Auml|Aring|AElig|Ccedil|Egrave|Eacute|Ecirc|Euml|Igrave|Iacute|Icirc|Iuml|ETH|Ntilde|Ograve|Oacute|Ocirc|Otilde|Ouml|times|Oslash|Ugrave|Uacute|Ucirc|Uuml|Yacute|THORN|szlig|agrave|aacute|acirc|atilde|auml|aring|aelig|ccedil|egrave|eacute|ecirc|euml|igrave|iacute|icirc|iuml|eth|ntilde|ograve|oacute|ocirc|otilde|ouml|divide|oslash|ugrave|uacute|ucirc|uuml|yacute|thorn|yuml);', lookup, text)

print substitute("Hello W&oacute;rld!")   # Prints "Hello Wórld!"
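
For readers on Python 3, here is a minimal sketch of the same callback approach. It rests on a few assumptions: the table is trimmed to four entries for brevity, and the names LATIN1, ENTITY_RE and substitute_entities are illustrative, not from the code above. Building the alternation from the dictionary keys keeps the pattern and the table in sync, so a name cannot be mistyped in one place and not the other.

```python
import re

# Trimmed, illustrative subset of the full table (assumption: extend as needed).
LATIN1 = {"oacute": "ó", "iacute": "í", "para": "¶", "copy": "©"}

# Build the alternation from the dictionary keys instead of typing it by hand.
# Longest names first, so no name can be shadowed by a shorter prefix.
ENTITY_RE = re.compile(
    "&(" + "|".join(sorted(map(re.escape, LATIN1), key=len, reverse=True)) + ");"
)

def substitute_entities(text):
    # match.group(1) plays the role of Perl's $1.
    return ENTITY_RE.sub(lambda match: LATIN1[match.group(1)], text)

print(substitute_entities("Hello W&oacute;rld! &copy; &unknown;"))
```

Names that are not in the table never match the pattern, so they pass through unchanged.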


But having said that, Python already knows about all the HTML entities, so that program becomes:

# coding: latin-1
import re, htmlentitydefs

def substitute(text):
    def lookup(match):
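        codepoint = htmlentitydefs.name2codepoint.get(match.group(1))
        # chr() in Python 2 only accepts 0-255, so entities outside Latin-1
        # (e.g. &euro;) are left alone along with unknown ones.
        if codepoint is not None and codepoint < 256:
            return chr(codepoint)
        else:
            return '&%s;' % match.group(1)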
   
    return re.sub(r'&([a-zA-Z0-9]+)\;', lookup, text)

print substitute("Hello W&oacute;rld! &invalid;")   # Prints "Hello Wórld! &invalid;"


As you can see, that leaves unknown entities alone, just like yours.
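
In Python 3 the htmlentitydefs module was renamed html.entities, and chr() accepts any Unicode code point. The sketch below shows roughly how the same function might look there; the name substitute_py3 is illustrative. The standard library's html.unescape() is a ready-made alternative, though not an exact equivalent: it also decodes numeric references such as &#243;.

```python
import html
import re
from html.entities import name2codepoint

def substitute_py3(text):
    def lookup(match):
        codepoint = name2codepoint.get(match.group(1))
        if codepoint is None:
            return match.group(0)  # unknown entity: return the original text
        return chr(codepoint)      # Python 3 chr() covers all of Unicode

    return re.sub(r"&([a-zA-Z0-9]+);", lookup, text)

print(substitute_py3("Hello W&oacute;rld! &invalid;"))  # Hello Wórld! &invalid;
print(html.unescape("Hello W&oacute;rld! &#243;"))
```

Returning match.group(0) for unknown names gives the same result as rebuilding the '&%s;' string by hand.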

Author Comment

by:Tabris42
ID: 17947963
That is perfect! I did not know about htmlentitydefs, that is a HUGE help. Thanks so much!