Solved

Remove Unicode Character 'ÿ' from Text files using a script

Posted on 2010-08-30
11
1,442 Views
Last Modified: 2012-05-10
Hi!

I have a bunch of files which include the Unicode character ÿ.

I would like to replace it with nothing (i.e. remove it) and rewrite the file.

I've looked at a few VBScripts along with Python scripts, but nothing can really nail it.

It should preferably be able to run on all text files (*.txt) in a directory.

VBS/Python/Batch would help :)

Thanks!

Question by:m0tek
11 Comments
 
LVL 10

Expert Comment

by:Kechka
ID: 33557794
0
 
LVL 16

Expert Comment

by:gelonida
ID: 33557819
Do you want to replace all non-representable Unicode characters, or only the ÿ character?

Is your file encoded with UTF-8?
If not, please tell us the file encoding.
0
 
LVL 16

Expert Comment

by:gelonida
ID: 33557822
To be 100% sure that the script works on the correctly encoded .txt files, you could perhaps
upload a small example .txt file.
0
 
LVL 17

Expert Comment

by:Tony Massa
ID: 33558131
Here's a simple script to replace the character and create a new copy of your file with the character removed:
Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting) objFile.WriteLine strNewText objFile.Close


0
 
LVL 17

Expert Comment

by:Tony Massa
ID: 33558146
The previous paste was bad...here's the correct script:

Const ForReading = 1

Const ForWriting = 2



Set objFSO = CreateObject("Scripting.FileSystemObject")

Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading)



strText = objFile.ReadAll

objFile.Close



strNewText = Replace(strText, "ÿ", "")



Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting) objFile.WriteLine strNewText

objFile.Close


0
 
LVL 17

Expert Comment

by:Tony Massa
ID: 33558157
It still did it!  Frustrating:

Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

' Third argument True creates C:\file2.txt if it does not exist yet
Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting, True)
objFile.WriteLine strNewText
objFile.Close
0
 
LVL 9

Expert Comment

by:asawatzki
ID: 33560378
Try explicitly opening the file as either Unicode or ANSI.  If the code below doesn't work, try changing FormatUnicode to FormatANSI on both OpenTextFile lines.


Const ForReading = 1
Const ForWriting = 2
Const FormatUnicode = -1
Const FormatANSI = 0

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\file1.txt", ForReading, False, FormatUnicode)

strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "ÿ", "")

' Third argument True creates C:\file2.txt if it does not exist yet
Set objFile = objFSO.OpenTextFile("C:\file2.txt", ForWriting, True, FormatUnicode)
objFile.Write strNewText
objFile.Close
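
Rather than trying both formats by hand, one option is to look at the first bytes of the file and pick the format from the byte order mark (BOM), if there is one. The following is only a Python 3 sketch of that idea: the function name `guess_encoding` and the `cp1252` fallback are assumptions made for illustration, not something taken from this thread.

```python
import codecs

# Checked in this order because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, fallback="cp1252"):
    """Return a codec name based on a leading BOM, or `fallback` if none is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return fallback
```

Typical use would be to read the first few bytes in binary mode, e.g. `guess_encoding(open(path, "rb").read(4))`, and pass the result as the `encoding` argument when reopening the file. A file without a BOM cannot be classified this way, which is why the fallback is only a guess.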
0
 
LVL 3

Expert Comment

by:Mytix
ID: 34236602
I think you can do that in Python like this:
# -*- coding: cp1252 -*-
import re

input_filepath = "C:\\temp\\input.txt"
output_filepath = "C:\\temp\\output.txt"

fip = open(input_filepath,"rb")
lines = fip.readlines()
fip.close()

fop = open(output_filepath,"wb")
for line in lines:
    l = re.sub("ÿ","",line)
    fop.write(l)
fop.close()


0
 
LVL 3

Expert Comment

by:Mytix
ID: 34236627
Or if you want to change all files that end with .txt in a folder, you can try something like this:
# -*- coding: cp1252 -*-
import re, os
foldername = "C:\\temp\\"

for root, dirs, files in os.walk(foldername):
    for name in files:
        if re.search("(.*)\.txt$",name,re.IGNORECASE):
            filename = os.path.join(root, name)
            
            fip = open(filename,"rb")
            lines = fip.readlines()
            fip.close()

            fop = open(filename,"wb")
            for line in lines:
                l = re.sub("ÿ","",line)
                fop.write(l)
            fop.close()
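
On Python 3 the script above raises a TypeError, because a str pattern cannot be applied to the bytes returned by a file opened in "rb" mode. A possible Python 3 sketch of the same idea follows. The function name `strip_char_from_txt_files` and the default `latin-1` encoding are assumptions: latin-1 is chosen here because it maps byte 0xFF to ÿ (as cp1252 does) and can decode and re-encode every byte value unchanged, but it is only right if the files really are single-byte encoded.

```python
import os


def strip_char_from_txt_files(folder, char="ÿ", encoding="latin-1"):
    """Remove `char` from every .txt file under `folder`; return the paths changed."""
    changed = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if not name.lower().endswith(".txt"):
                continue
            path = os.path.join(root, name)
            # newline="" keeps the original line endings untouched.
            with open(path, encoding=encoding, newline="") as f:
                text = f.read()
            cleaned = text.replace(char, "")
            # Only rewrite files that actually contained the character.
            if cleaned != text:
                with open(path, "w", encoding=encoding, newline="") as f:
                    f.write(cleaned)
                changed.append(path)
    return changed
```

Calling `strip_char_from_txt_files("C:\\temp\\")` would rewrite the files in place, so it is worth trying on a copy of the folder first.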


0
 
LVL 28

Accepted Solution

by:
pepr earned 500 total points
ID: 34294328
My guess is that it is the first or second character of the file.  My second guess is that your files are stored using UTF-16 with a BOM (little endian or big endian; it could even be UTF-32).  If I am right, you are interpreting the BOM bytes as characters using some encoding (based on my own recent observation).  If this is true, you should skip the first two (or four) bytes and read the rest as UTF-16 (or UTF-32).  Try the following snippet with the attached files:
f = open('utf16be.txt')
s = f.read()
f.close()
print s

f = open('utf16le.txt')
s = f.read()
f.close()
print s

import codecs

f = codecs.open('utf16be.txt', encoding='UTF-16')
s = f.read()
f.close()
print s

f = codecs.open('utf16le.txt', encoding='UTF-16')
s = f.read()
f.close()
print s


utf16le.txt utf16be.txt
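
For readers without the attached files, the same effect can be reproduced in memory. This is a small Python 3 sketch, assuming the misreading encoding is cp1252 (the thread never confirms which one was used); the sample text "Hi" and the variable names are made up for illustration.

```python
import codecs

text = "Hi"

# UTF-16 files usually start with a byte order mark (BOM).
le_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # FF FE 48 00 69 00
be_bytes = codecs.BOM_UTF16_BE + text.encode("utf-16-be")  # FE FF 00 48 00 69

# Misread as cp1252, the BOM bytes turn into visible characters.
le_misread = le_bytes.decode("cp1252")  # begins with 'ÿþ'
be_misread = be_bytes.decode("cp1252")  # begins with 'þÿ'

# Read as UTF-16, the BOM is consumed and only the text remains.
le_ok = le_bytes.decode("utf-16")
be_ok = be_bytes.decode("utf-16")

print(repr(le_misread), repr(be_misread), repr(le_ok), repr(be_ok))
```

Byte 0xFF is ÿ in cp1252, so it shows up as the first character for little endian and the second for big endian. If this is what is happening, deleting only ÿ would still leave þ and the interleaved NUL bytes behind, so decoding with the right encoding is the cleaner fix.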
0
 
LVL 28

Expert Comment

by:pepr
ID: 34471206
m0tek: Each question should be closed.  If you know the right answer, put it here, and accept your own comment.  If there is no correct answer, just ask for deletion of the question with a points refund.

Or you can attach the sample file that shows the problem here.  Then a solution could be found.  It is not clear now what the problem is, whether it persists, whether you died or what.
0
