Solved

Converting unicode punctuation to ASCII

Posted on 2008-06-19
1,262 Views
Last Modified: 2012-06-22
Hi,

I am reading in some text from a file. The text was saved (as tab-delimited text) from Excel, but may have originally been copied into Excel from Word. As a result (I assume), the text contains characters such as the 'left single quotation mark' (Unicode 8216, U+2018) rather than the ASCII single quote. Another example is Unicode 8230 (U+2026), the ellipsis (...) character, which Excel seems to insert for some reason. I would like to convert any such characters into their ASCII equivalents. I understand that not all Unicode characters have an ASCII equivalent, and those I will simply filter out, but does anyone know of a definitive list / table of which characters I can convert? Java code would be even better, of course. Thanks in advance.

Question by:emsttam
8 Comments
 
LVL 86

Expert Comment

by:CEHJ
ID: 21821452
>>Java code would be even better of course

.. and yet it's been posted in the C# TA? I *can* give you a Java answer if that's really what you want
 

Author Comment

by:emsttam
ID: 21821488
CEHJ,

The Java bit was optional. The question asked for a list / table.
 
LVL 86

Expert Comment

by:CEHJ
ID: 21821543
OK. I'm not sure if there's a list available - replacements would be subjective and differ among locales. Given access to the original Unicode string, you could do something like the following for the characters you mention:


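s = s.replace("\u2018", "'");  // left single quotation mark -> ASCII apostrophe
s = s.replace("\u2019", "'");  // right single quotation mark -> ASCII apostrophe
s = s.replace("\u2026", "");   // horizontal ellipsis -> removed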
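
For more than a handful of characters, a lookup table may be easier to maintain than a chain of replace calls. The following is a minimal sketch only: the class name `PunctuationToAscii`, the method `toAscii`, and the particular characters in the table are illustrative assumptions, not a definitive list. Anything outside ASCII that is not in the table is filtered out, as described in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PunctuationToAscii {
    // Hypothetical, hand-picked table of common "smart" punctuation.
    // The choices are subjective; adjust them to taste.
    private static final Map<Character, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put('\u2018', "'");    // left single quotation mark
        REPLACEMENTS.put('\u2019', "'");    // right single quotation mark
        REPLACEMENTS.put('\u201C', "\"");   // left double quotation mark
        REPLACEMENTS.put('\u201D', "\"");   // right double quotation mark
        REPLACEMENTS.put('\u2013', "-");    // en dash
        REPLACEMENTS.put('\u2014', "-");    // em dash
        REPLACEMENTS.put('\u2026', "...");  // horizontal ellipsis
        REPLACEMENTS.put('\u00A0', " ");    // no-break space
    }

    // Replaces mapped characters, keeps plain ASCII, drops everything else.
    public static String toAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else if (c < 128) {
                out.append(c);
            }
            // Unmapped non-ASCII characters are silently filtered out.
        }
        return out.toString();
    }
}
```

Note that this sketch maps the ellipsis to three dots rather than deleting it; either choice is defensible.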


 

Author Comment

by:emsttam
ID: 21821759
CEHJ:

Sorry, perhaps I wasn't clear. It's not the coding that's the difficulty, it's getting a definitive list of which unicode characters to look out for which can be converted into an ASCII equivalent (such as the examples in the question).
 
LVL 13

Accepted Solution

by:MicheleMarcon, earned 500 total points
ID: 21822035
 
LVL 86

Expert Comment

by:CEHJ
ID: 21822117
>>it's getting a definitive list of which unicode characters to look out for which can be converted into an ASCII equivalent

As I mentioned, such things are subjective: do you want to delete ellipses or replace them with three dots? Do you want to replace the left quote with a backtick or a normal single quote? Only you know.
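
For the cases where Unicode itself defines a mapping, `java.text.Normalizer` (available since Java 6) can apply compatibility decomposition (NFKD). A minimal sketch follows; the class name `CompatibilityFold` and the method `fold` are hypothetical. The ellipsis U+2026 has a compatibility decomposition to three full stops, so it comes out as "...", but the curly quotes U+2018 and U+2019 have no decomposition and pass through unchanged, so they still need a hand-made rule.

```java
import java.text.Normalizer;

public class CompatibilityFold {
    // Applies Unicode compatibility decomposition (NFKD).
    // Only characters that Unicode defines a decomposition for are changed;
    // curly quotes and dashes are NOT among them.
    public static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String ellipsis = fold("\u2026");
        String leftQuote = fold("\u2018");
        System.out.println(ellipsis.equals("..."));       // ellipsis becomes three dots
        System.out.println(leftQuote.equals("\u2018"));   // left quote is unchanged
    }
}
```

This only covers part of the problem, which is why the remaining replacements stay a matter of preference.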
 

Author Closing Comment

by:emsttam
ID: 31468735
Well that's certainly complete :)
I'll extract the most likely candidates from that page, the punctuation in particular. Thanks.
 
LVL 86

Expert Comment

by:CEHJ
ID: 21822213
I'm confused, emsttam. It would appear from your chosen answer that the question was really 'Can you show me a table of Unicode character codes?' If so, the definitive ones are here:

http://www.unicode.org/charts/