Solved

Remove Unicode Character 'ÿ' from Text files using a script

Posted on 2010-08-30
Last Modified: 2012-05-10
Hi!

I have a bunch of files that contain the Unicode character ÿ.

I would like to remove it (replace it with nothing) and rewrite the file.

I've looked at a few VBScript and Python scripts, but nothing really nails it.

Preferably it should handle all text files (*.txt) in a directory.

VBS/Python/Batch would help :)

Thanks!

Question by:m0tek
 
LVL 17

Expert Comment

by:gelonida
ID: 33557819
Do you want to remove all non-representable Unicode characters, or only the character ÿ?

Is your file encoded as UTF-8?
If not, please tell us the file encoding.
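
If you are not sure about the encoding, looking at the first few bytes of a file can help. The following is only a rough sketch in Python 3: the names `BOMS`, `sniff_bom` and `sniff_file` are made up for illustration, and a file without a byte order mark simply yields `None`, which tells you nothing about its actual encoding.

```python
import codecs

# Known byte order marks, longest first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding name whose BOM starts `data`, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_file(path):
    """Read only the first four bytes of `path` and check them for a BOM."""
    with open(path, "rb") as f:
        return sniff_bom(f.read(4))
```

For example, `sniff_file("C:\\file1.txt")` returning `"utf-16-le"` would mean the file starts with the bytes FF FE, and FF is exactly the byte that shows up as ÿ when the file is read as ANSI.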
0
 
LVL 17

Expert Comment

by:gelonida
ID: 33557822
To be 100% sure that the script handles the encoding of your .txt files correctly, could you
upload a small example .txt file?
 
LVL 17

Expert Comment

by:Tony Massa
ID: 33558131
Here's a simple script to replace the character and create a new copy of your file with the character removed:
Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting) objFile.WriteLine strNewText objFile.Close

 
LVL 17

Expert Comment

by:Tony Massa
ID: 33558146
The previous paste was bad...here's the correct script:

Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting) objFile.WriteLine strNewText
objFile.Close

 
LVL 17

Expert Comment

by:Tony Massa
ID: 33558157
It still did it!  Frustrating:

Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

' The third argument (True) creates file2.txt if it does not already exist.
Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting, True)
objFile.WriteLine strNewText
objFile.Close
 
LVL 9

Expert Comment

by:asawatzki
ID: 33560378
Try opening the file explicitly as either Unicode or ANSI.  If the code below doesn't work, change FormatUnicode to FormatANSI in both OpenTextFile calls.


Const ForReading = 1
Const ForWriting = 2
Const FormatUnicode = -1
Const FormatANSI = 0

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading, False, FormatUnicode)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

' The third argument (True) creates file2.txt if it does not already exist.
Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting, True, FormatUnicode)
objFile.Write strNewText
objFile.Close
 
LVL 3

Expert Comment

by:Mytix
ID: 34236602
I think you can do that in python like this:
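# Works on raw bytes, so the encoding of this script file does not matter.
input_filepath = "C:\\temp\\input.txt"
output_filepath = "C:\\temp\\output.txt"

# 0xFF is the byte that cp1252 (and Latin-1) display as 'ÿ'.
TARGET = b"\xff"

with open(input_filepath, "rb") as fip:
    data = fip.read()

# Write a copy of the data with every 0xFF byte removed.
with open(output_filepath, "wb") as fop:
    fop.write(data.replace(TARGET, b""))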

 
LVL 3

Expert Comment

by:Mytix
ID: 34236627
Or if you want to change all files that end with .txt in a folder, you can try something like this:
import os

foldername = "C:\\temp\\"

# 0xFF is the byte that cp1252 (and Latin-1) display as 'ÿ'.
TARGET = b"\xff"

for root, dirs, files in os.walk(foldername):
    for name in files:
        # Match the .txt extension case-insensitively.
        if name.lower().endswith(".txt"):
            filename = os.path.join(root, name)

            with open(filename, "rb") as fip:
                data = fip.read()

            # Overwrite the file in place with the byte removed.
            with open(filename, "wb") as fop:
                fop.write(data.replace(TARGET, b""))
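
For a single directory (no subfolders), the same idea can be written with an explicit text encoding in Python 3. This is only a sketch: `strip_char` is a made-up name, and the default `cp1252` encoding is an assumption about how the files were saved. If the files are really UTF-16, removing ÿ this way would damage them.

```python
from pathlib import Path


def strip_char(folder, char="\u00ff", encoding="cp1252"):
    """Remove `char` from every .txt file directly inside `folder`.

    Returns the names of the files that were actually rewritten.
    The encoding is an assumption; pass the real one if you know it.
    """
    changed = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() != ".txt":
            continue
        # Go through bytes so that line endings are left untouched.
        text = path.read_bytes().decode(encoding)
        if char in text:
            path.write_bytes(text.replace(char, "").encode(encoding))
            changed.append(path.name)
    return changed
```

Calling `strip_char("C:\\temp")` would rewrite only the files that contain the character and leave all others as they are.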

 
LVL 29

Accepted Solution

by:
pepr earned 500 total points
ID: 34294328
My guess is that it is the first or second character of the file.  My second guess is that your files are stored as UTF-16 with a BOM (little endian or big endian; it could even be UTF-32).  If I am right, you are interpreting the BOM bytes as characters in some other encoding (based on my own recent observation).  If so, you should skip the first two (or four) bytes and read the rest as UTF-16 (or UTF-32).  Try the following snippet with the attached files:
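import codecs

# Read the raw bytes first: the BOM shows up as '\xfe\xff' (big endian)
# or '\xff\xfe' (little endian) at the start of the data.
for name in ('utf16be.txt', 'utf16le.txt'):
    f = open(name, 'rb')
    s = f.read()
    f.close()
    print(repr(s))

# Decoding as UTF-16 consumes the BOM and yields the actual text.
for name in ('utf16be.txt', 'utf16le.txt'):
    f = codecs.open(name, encoding='UTF-16')
    s = f.read()
    f.close()
    print(s)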

utf16le.txt utf16be.txt
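
If you do not have the attached files at hand, the effect can be reproduced in memory. The sketch below (Python 3) is only an illustration under my assumption that the files are UTF-16 with a BOM. `decode_with_bom` is a made-up helper, `cp1252` as the fallback is a guess, and UTF-32 is deliberately not handled.

```python
import codecs


def decode_with_bom(data, fallback="cp1252"):
    """Decode bytes, honoring a UTF-16 BOM if one is present."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic "utf-16" codec reads the BOM to pick the byte
        # order and does not include it in the decoded text.
        return data.decode("utf-16")
    return data.decode(fallback)


sample = "hello"
raw_le = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
raw_be = codecs.BOM_UTF16_BE + sample.encode("utf-16-be")

# Misreading the little-endian bytes as cp1252 turns the BOM (FF FE)
# into the two characters U+00FF and U+00FE, that is 'ÿþ'.
misread = raw_le.decode("cp1252")
print(repr(misread[:2]))

# Decoding with the BOM taken into account gives the real text.
print(decode_with_bom(raw_le))
print(decode_with_bom(raw_be))
```

If this matches what you see, the ÿ is not part of the text at all, and the fix is to read the files with the right encoding instead of deleting the character.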
 
LVL 29

Expert Comment

by:pepr
ID: 34471206
m0tek: Each question should be closed.  If you know the right answer, put it here, and accept your own comment. If there is no correct answer, just ask for deletion of the question with points refund.  

Or you can attach here the sample file that shows the problem.  Then a solution could be found.  It is not clear now what the problem is, whether it persists, whether you died or what.