Solved

Remove Unicode Character 'ÿ' from Text files using a script

Posted on 2010-08-30
Last Modified: 2012-05-10
Hi!

I have a bunch of files that contain the Unicode character ÿ.

I would like to remove it (replace it with nothing) and rewrite the file.

I've looked at a few VBScripts and Python scripts, but nothing really nails it.

Preferably it should run over all text files (*.txt) in a directory.

VBS/Python/Batch would help :)

Thanks!

Question by:m0tek
11 Comments
 
LVL 17

Expert Comment

by:gelonida
ID: 33557819
Do you want to remove all non-representable Unicode characters, or only the character ÿ?

Is your file encoded as UTF-8?
If not, please tell us the file encoding.
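If you are not sure about the encoding, one way to find out is to look at the first few raw bytes of an affected file. The following is only a sketch, assuming Python 3; the helper names `head_hex` and `file_head_hex` are made up for this example.

```python
def head_hex(data, n=4):
    """Return the first n bytes of data as space-separated hex, e.g. 'ff fe 68 00'."""
    return " ".join("%02x" % b for b in bytearray(data[:n]))


def file_head_hex(path, n=4):
    """Read the first n bytes of the file at path (binary mode) and format them as hex."""
    with open(path, "rb") as f:
        return head_hex(f.read(n), n)
```

A result starting with `ff fe` or `fe ff` points to a UTF-16 byte order mark, and `ef bb bf` is the UTF-8 one. Anything else suggests there is no BOM, and the ÿ is probably a real character in an 8-bit encoding.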
 
LVL 17

Expert Comment

by:gelonida
ID: 33557822
To be 100% sure that the script works on correctly encoded .txt files, you could upload a small example .txt file.
 
LVL 17

Expert Comment

by:Tony Massa
ID: 33558131
Here's a simple script to replace the character and create a new copy of your file with the character removed:
Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting) objFile.WriteLine strNewText objFile.Close

 
LVL 17

Expert Comment

by:Tony Massa
ID: 33558146
The previous paste was bad...here's the correct script:

Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting) objFile.WriteLine strNewText
objFile.Close

 
LVL 17

Expert Comment

by:Tony Massa
ID: 33558157
It still did it!  Frustrating:

Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

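' True = create C:\file2.txt if it does not exist yet
Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting, True)
' Write (not WriteLine) avoids appending an extra line break
objFile.Write strNewText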
objFile.Close
 
LVL 9

Expert Comment

by:asawatzki
ID: 33560378
Try opening the file explicitly as either Unicode (which for FileSystemObject means UTF-16) or ANSI.  If the code below doesn't work, change FormatUnicode to FormatANSI on both OpenTextFile lines.


Const ForReading = 1
Const ForWriting = 2
Const FormatUnicode = -1
Const FormatANSI = 0

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading, False, FormatUnicode)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting, True, FormatUnicode) ' True = create the file if it is missing
objFile.Write strNewText
objFile.Close
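The same idea (state the encoding explicitly for both reading and writing) can be sketched in Python. This assumes Python 3 and that you already know the encoding name; `strip_char` and its parameters are hypothetical names for this example, not part of any library.

```python
import io


def strip_char(src, dst, encoding, char=u"\u00ff"):
    """Copy src to dst without any occurrence of char; both files use the given encoding."""
    # newline="" disables newline translation, so existing line endings are preserved
    with io.open(src, "r", encoding=encoding, newline="") as f:
        text = f.read()
    with io.open(dst, "w", encoding=encoding, newline="") as f:
        f.write(text.replace(char, u""))
```

For example, `strip_char("C:\\file1.txt", "C:\\file2.txt", "cp1252")` would treat the file as ANSI (Western European), while passing `"utf-16"` corresponds roughly to the FormatUnicode case above.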
 
LVL 3

Expert Comment

by:Mytix
ID: 34236602
I think you can do that in Python like this:
# -*- coding: cp1252 -*-
import re

input_filepath = "C:\\temp\\input.txt"
output_filepath = "C:\\temp\\output.txt"

fip = open(input_filepath,"rb")
lines = fip.readlines()
fip.close()

fop = open(output_filepath,"wb")
for line in lines:
    l = re.sub("ÿ","",line)
    fop.write(l)
fop.close()

 
LVL 3

Expert Comment

by:Mytix
ID: 34236627
Or if you want to change all files that end with .txt in a folder, you can try something like this:
# -*- coding: cp1252 -*-
import re, os
foldername = "C:\\temp\\"

for root, dirs, files in os.walk(foldername):
    for name in files:
        if re.search(r"\.txt$", name, re.IGNORECASE):  # raw string keeps the backslash literal
            filename = os.path.join(root, name)
            
            fip = open(filename,"rb")
            lines = fip.readlines()
            fip.close()

            fop = open(filename,"wb")
            for line in lines:
                l = re.sub("ÿ","",line)
                fop.write(l)
            fop.close()
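A byte-level variant that avoids depending on the encoding of the script file itself might look like the sketch below. It assumes Python 3 and that the files use a single-byte encoding such as cp1252, where ÿ is the byte 0xFF. The name `strip_byte_in_dir` is made up for this example. Unlike the `os.walk` version, it does not descend into subfolders.

```python
import glob
import os


def strip_byte_in_dir(folder, byte=b"\xff"):
    """Remove byte from every *.txt file directly inside folder; return the changed paths."""
    changed = []
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        with open(path, "rb") as f:
            data = f.read()
        cleaned = data.replace(byte, b"")
        if cleaned != data:
            # Rewrite the file in place only when something was actually removed
            with open(path, "wb") as f:
                f.write(cleaned)
            changed.append(path)
    return changed
```

Do not run this on UTF-16 files: there the byte 0xFF is half of the byte order mark (and can be part of other characters), so removing it would corrupt the file. Try it on a copy of the folder first.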

 
LVL 29

Accepted Solution

by:
pepr earned 2000 total points
ID: 34294328
My guess is that it is the first or second character of the file.  My second guess is that your files are stored as UTF-16 with a BOM (little endian or big endian; it could even be UTF-32).  The UTF-16 BOM is the byte pair FF FE (little endian) or FE FF (big endian), and the byte FF shows up as 'ÿ' when read as Latin-1 or cp1252.  If I am right, you are interpreting the BOM bytes as characters using some 8-bit encoding (based on my own recent observation).  If this is true, you should skip the first two (or four) bytes and read the rest as UTF-16 (or UTF-32); a BOM-aware decoder does that for you.  Try the following snippet with the attached files:
# Plain open(): the BOM bytes are read as ordinary characters
f = open('utf16be.txt')
s = f.read()
f.close()
print(s)

f = open('utf16le.txt')
s = f.read()
f.close()
print(s)

import codecs

# The UTF-16 codec consumes the BOM and picks the byte order from it
f = codecs.open('utf16be.txt', encoding='UTF-16')
s = f.read()
f.close()
print(s)

f = codecs.open('utf16le.txt', encoding='UTF-16')
s = f.read()
f.close()
print(s)


utf16le.txt utf16be.txt
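If the files turn out to be a mix of encodings, the BOM check can be done on the raw bytes before decoding. This is a minimal sketch, assuming Python 3; `decode_with_bom` and the `fallback` parameter are hypothetical names, and cp1252 as the fallback is only a guess at what the BOM-less files would use.

```python
import codecs


def decode_with_bom(data, fallback="cp1252"):
    """Decode bytes, honouring a UTF-8/16/32 BOM if present, else using the fallback encoding."""
    # UTF-32 must be tested before UTF-16: the UTF-32 LE BOM begins with the same FF FE
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return data.decode("utf-32")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig")
    return data.decode(fallback)
```

The returned text no longer contains the BOM, so no stray 'ÿ' appears at the start. It can then be written back in whatever single encoding you prefer, for example with `open(path, "w", encoding="utf-8")`.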
 
LVL 29

Expert Comment

by:pepr
ID: 34471206
m0tek: Each question should be closed.  If you know the right answer, put it here, and accept your own comment. If there is no correct answer, just ask for deletion of the question with points refund.  

Or you can attach the sample file that shows the problem here.  Then a solution could be found.  It is not clear now what the problem is, whether it persists, whether you died or what.