  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1334

Converting Unicode punctuation to ASCII

Hi,

I am reading in some text from a file. The text was saved (as tab-delimited text) from Excel, but may have originally been copied into Excel from Word. As a result of this (I assume), the text has characters such as the 'left single quotation mark' (Unicode 8216) rather than the ASCII single quote. Another example is Unicode 8230, the ellipsis (...) character, which Excel seems to insert for some reason. I would like to convert any such characters into their ASCII equivalents. I understand that not all Unicode characters have an ASCII equivalent, and those I will simply filter out, but does anyone know of a definitive list / table of which characters I can convert? Java code would be even better, of course. Thanks in advance.

Asked by: emsttam
1 Solution
 
CEHJ commented:
>>Java code would be even better of course

...and yet it's been posted in the C# TA? I *can* give you a Java answer if that's really what you want.
 
emsttam (Author) commented:
CEHJ,

The Java bit was optional. The question asked for a list / table.
 
CEHJ commented:
OK. I'm not sure if there's a list available - replacements would be subjective and differ among locales. Given access to the original Unicode string, you could do something like the following for the characters you mention:


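// These are literal replacements, so replace() is enough; replaceAll() would parse a regex
s = s.replace("\u2018", "'");  // left single quotation mark (8216)
s = s.replace("\u2019", "'");  // right single quotation mark (8217)
s = s.replace("\u2026", "");   // horizontal ellipsis (8230), deleted here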
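
To extend the same idea beyond three characters, a lookup table keeps all the choices in one place. The sketch below is only an illustration: the class name `AsciiPunctuation`, the method `toAscii`, and every entry in the table are assumptions rather than a definitive list, and it maps the ellipsis to three dots instead of deleting it. Anything non-ASCII that is not in the table is dropped, as the question suggests.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; the table is a starting point, not a definitive list. */
final class AsciiPunctuation {

    // Code point -> ASCII replacement. Every choice here is an assumption; adjust to taste.
    private static final Map<Integer, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put(0x2018, "'");   // left single quotation mark
        REPLACEMENTS.put(0x2019, "'");   // right single quotation mark
        REPLACEMENTS.put(0x201C, "\"");  // left double quotation mark
        REPLACEMENTS.put(0x201D, "\"");  // right double quotation mark
        REPLACEMENTS.put(0x2013, "-");   // en dash
        REPLACEMENTS.put(0x2014, "-");   // em dash
        REPLACEMENTS.put(0x2026, "..."); // horizontal ellipsis
        REPLACEMENTS.put(0x00A0, " ");   // no-break space
    }

    /** Keeps ASCII as-is, substitutes known characters, drops everything else. */
    static String toAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);   // already ASCII
            } else {
                String r = REPLACEMENTS.get(cp);
                if (r != null) {
                    out.append(r);         // known character: substitute
                }
                // unknown non-ASCII character: filtered out
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: 'quoted' text...
        System.out.println(toAscii("\u2018quoted\u2019 text\u2026"));
    }
}
```

Iterating by code point rather than by `char` means characters outside the Basic Multilingual Plane are filtered as a unit instead of leaving half a surrogate pair behind.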

 
emsttam (Author) commented:
CEHJ:

Sorry, perhaps I wasn't clear. It's not the coding that's the difficulty, it's getting a definitive list of which Unicode characters to look out for that can be converted into an ASCII equivalent (such as the examples in the question).
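
One way to narrow down which characters to look out for is to let the data answer: scan the text and report the non-ASCII code points that actually occur. A minimal sketch, assuming the file has already been decoded into a `String`; `NonAsciiInventory` and `scan` are hypothetical names, while `Character.getName` is the standard JDK call that returns a code point's Unicode name.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical diagnostic: report which non-ASCII characters a text contains. */
final class NonAsciiInventory {

    /** Returns code point -> occurrence count for every character outside ASCII. */
    static Map<Integer, Integer> scan(String text) {
        Map<Integer, Integer> counts = new TreeMap<>();
        text.codePoints()
            .filter(cp -> cp > 0x7F)                          // keep only non-ASCII
            .forEach(cp -> counts.merge(cp, 1, Integer::sum)); // tally occurrences
        return counts;
    }

    public static void main(String[] args) {
        // Sample text is made up for illustration.
        String sample = "It\u2019s here\u2026 \u2018really\u2019";
        // One line per distinct character: hex, decimal, Unicode name, count.
        scan(sample).forEach((cp, n) ->
            System.out.printf("U+%04X (%d) %s x%d%n", cp, cp, Character.getName(cp), n));
    }
}
```

Running this over a representative file gives a short, concrete list to decide on, instead of the whole of Unicode.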
 
MicheleMarcon commented:
 
CEHJ commented:
>>it's getting a definitive list of which unicode characters to look out for which can be converted into an ASCII equivalent

As I mentioned, such things are subjective: do you want to delete ellipses or replace them with three dots? Do you want to replace the left quote with a backtick or a normal single quote? Only you know.
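
Part of the mapping does come from the Unicode standard itself: compatibility decomposition (NFKD) turns the ellipsis into three full stops and the no-break space into a plain space, and it splits accented letters into a base letter plus a combining mark. Curly quotes and dashes have no decomposition, so those remain a judgement call. A rough sketch using `java.text.Normalizer`; `CompatFold` and `fold` are made-up names, and stripping everything that is still non-ASCII afterwards is an assumption about what you want.

```java
import java.text.Normalizer;

/** Hypothetical helper: fold text to ASCII using Unicode's own compatibility mappings. */
final class CompatFold {

    static String fold(String s) {
        // NFKD applies canonical and compatibility decompositions.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        // Drop whatever is still outside ASCII (combining marks, quotes, dashes, ...).
        return decomposed.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("caf\u00e9\u2026"));        // prints: cafe...
        System.out.println(fold("\u2018quoted\u2019"));     // prints: quoted
    }
}
```

The second call shows the gap: the quotes vanish rather than becoming `'`, so an explicit replacement for them would have to run before the final strip.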
 
emsttam (Author) commented:
Well that's certainly complete :)
I'll extract the most likely candidates from that page, the punctuation in particular. Thanks.
 
CEHJ commented:
I'm confused, emsttam. It would appear from your chosen answer that the question was really 'can you show me a table of Unicode character codes?' If so, the definitive ones are here:

http://www.unicode.org/charts/
Question has a verified solution.
