Solved

Remove Unicode Charecter 'ÿ' from Text files using a script

Posted on 2010-08-30
11
1,460 Views
Last Modified: 2012-05-10
Hi!

i have a bunch of files which include unicode string - ÿ

i would like to replace it with a null and re-write the file

ive looked for a few vbscripts along with python scripts - but nothing can really nail it

it should preferably be able to go on all text file (*.txt) in a directory

VBS/Python/Batch would help :)

Thanks!

0
Comment
Question by:m0tek
  • 3
  • 2
  • 2
  • +3
11 Comments
 
LVL 10

Expert Comment

by:Kechka
ID: 33557794
0
 
LVL 16

Expert Comment

by:gelonida
ID: 33557819
Do you want to replace all non representable unicode strings or only the unicode string with the
ÿ


Is your file encoded with UTF-8?
If not please tell us the file encoding
0
 
LVL 16

Expert Comment

by:gelonida
ID: 33557822
in order to be 100% sure, that the script works on the correctly encoded txt files you could perhaps
upload a small example .txt file
0
 
LVL 17

Expert Comment

by:Tony Massa
ID: 33558131
Here's a simple script to replace the character and create a new copy of your file with the character removed:
Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting) objFile.WriteLine strNewText objFile.Close

Open in new window

0
 
LVL 17

Expert Comment

by:Tony Massa
ID: 33558146
The previous paste was bad...here's the correct script:

Const ForReading = 1

Const ForWriting = 2



Set objFSO = CreateObject("Scripting.FileSystemObject")

Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading)



strText = objFile.ReadAll

objFile.Close



strNewText = Replace(strText, "ÿ", "")



Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting) objFile.WriteLine strNewText

objFile.Close

Open in new window

0
Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

 
LVL 17

Expert Comment

by:Tony Massa
ID: 33558157
It still did it!  Frustrating:

Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting)
objFile.WriteLine strNewText
objFile.Close
0
 
LVL 9

Expert Comment

by:asawatzki
ID: 33560378
Try specifying to open it in either Unicode or ANSI.  If the below code doesn't work, then try changing it from FormatUnicode to FormatANSI on both cases OpenTextFile lines.


Const ForReading = 1
Const ForWriting = 2
Const FormatUnicode = -1
Const FormatANSI = 0

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading, False, FormatUnicode)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting, False, FormatUnicode )
objFile.Write strNewText
objFile.Close
0
 
LVL 3

Expert Comment

by:Mytix
ID: 34236602
I think you can do that in python like this:
# -*- coding: cp1252 -*-
import re

input_filepath = "C:\\temp\\input.txt"
output_filepath = "C:\\temp\\output.txt"

fip = open(input_filepath,"rb")
lines = fip.readlines()
fip.close()

fop = open(output_filepath,"wb")
for line in lines:
    l = re.sub("ÿ","",line)
    fop.write(l)
fop.close()

Open in new window

0
 
LVL 3

Expert Comment

by:Mytix
ID: 34236627
Or if you want to change all files that end with .txt in a folder, you can try something like this:
# -*- coding: cp1252 -*-
import re, os
foldername = "C:\\temp\\"

for root, dirs, files in os.walk(foldername):
    for name in files:
        if re.search("(.*)\.txt$",name,re.IGNORECASE):
            filename = os.path.join(root, name)
            
            fip = open(filename,"rb")
            lines = fip.readlines()
            fip.close()

            fop = open(filename,"wb")
            for line in lines:
                l = re.sub("ÿ","",line)
                fop.write(l)
            fop.close()

Open in new window

0
 
LVL 28

Accepted Solution

by:
pepr earned 500 total points
ID: 34294328
My guess is that it is the first or second character of the file.  My second guess it is that your files are stored using utf-16 with BOM (little endian or big endian -- or it could be even utf-32).  If I am right you are interpreting the BOM bytes as characters using some encoding (based on my own recent observation).  If this is true, you should or skip the first two (four) bytes and read the rest as utf-16 encoded (or utf-32).  Try the following snippet with the attached files:
f = open('utf16be.txt')
s = f.read()
f.close()
print s

f = open('utf16Le.txt')
s = f.read()
f.close()
print s

import codecs

f = codecs.open('utf16be.txt', encoding='UTF-16')
s = f.read()
f.close()
print s

f = codecs.open('utf16le.txt', encoding='UTF-16')
s = f.read()
f.close()
print s

Open in new window

utf16le.txt utf16be.txt
0
 
LVL 28

Expert Comment

by:pepr
ID: 34471206
m0tek: Each question should be closed.  If you know the right answer, put it here, and accept your own comment. If there is no correct answer, just ask for deletion of the question with points refund.  

Or you can attach here the sampe file that shows the problem.  Then the solution could be found.  It is not clear now, what is the problem, whether it persists, whether you died or what.
0

Featured Post

Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
How do I send an email from raspberry pi at user login? 8 69
Python Regex Problem 24 125
Sending Attachment via CDO 3 59
read an xml file in perl 2 15
This article is the result of a quest to better understand Task Scheduler 2.0 and all the newer objects available in vbscript in this version over  the limited options we had scripting in Task Scheduler 1.0.  As I started my journey of knowledge I f…
When we want to run, execute or repeat a statement multiple times, a loop is necessary. This article covers the two types of loops in Python: the while loop and the for loop.
Learn the basics of strings in Python: declaration, operations, indices, and slicing. Strings are declared with quotations; for example: s = "string": Strings are immutable.: Strings may be concatenated or multiplied using the addition and multiplic…
Learn the basics of modules and packages in Python. Every Python file is a module, ending in the suffix: .py: Modules are a collection of functions and variables.: Packages are a collection of modules.: Module functions and variables are accessed us…

867 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

20 Experts available now in Live!

Get 1:1 Help Now