
What is the Python code to remove characters?

I have this in a text file:

17336665552013070139      {(17336665552013070139,75),(17336665552013070139,35),(17336665552013070139,57)}
17336665592013070149      {(17336665592013070149,75),(17336665592013070149,57),(17336665592013070149,78)}
17336665792013070199      {(17336665792013070199,41)}
17349274502013070413      {(17349274502013070413,25),(17349274502013070413,54)}

I want to remove the first column and the repeated value of the first column, the braces { and }, and the parentheses ( and ).

Need it to look like this:
75,35,57
75,57,78
41
25,54

What is the Python code to do this and save as .csv file?

Thanks
Asked:
Ricky Ng
 
pepr Commented:
Try the following code for Python 2.7 (older versions do not accept a with statement that opens two files at once). For Python 3, it must be slightly modified.
#!python2

import csv

fname = 'data.txt'
fcsvname = 'data.csv'
with open(fname) as fin, open(fcsvname, 'wb') as fout:
    writer = csv.writer(fout)
    for line in fin:
        print '----------------------------------'
        print line,
        # Extract the second part of the line, replace the {} by [] and 
        # convert it to the list. It uses eval() that can be dangerous
        # if someone put a command inside the string. Do that only if you
        # are sure the lines have the structure that you think.
        line = line.split()[1]   # split by whitespace, get only the second part
        print line
        line = '[' + line.rstrip()[1:-1] + ']'  # convert to a list representation
        print line
        lst = eval(line)    # do this dangerous command only when you know your data
        print lst
        
        # The row will be formed only from second parts of the tuples in the list.
        row = [t[1] for t in lst]
        
        # Write the row to the CSV output file.
        writer.writerow(row)
        
        # Remove the debug prints above once the output looks right.


Change the names of the input and output files as needed, and ask if you want more details. If eval() should not be used in your case, another approach to parsing can be used.
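
One such alternative, sketched here as an assumption rather than as the approach pepr had in mind, is ast.literal_eval() from the standard library. It accepts only Python literals (numbers, tuples, lists and so on), so a command hidden in a line raises an error instead of being executed. The helper name second_items is hypothetical.

```python
import ast

def second_items(line):
    """Return the second member of every (id, value) tuple on one input line."""
    # Split by whitespace and keep only the second part, e.g. '{(1,75),(1,35)}'.
    braces = line.split()[1]
    # Swap the surrounding { } for [ ] so the text reads as a list of tuples.
    as_list = '[' + braces[1:-1] + ']'
    # literal_eval parses literals only; it never runs code found in the data.
    pairs = ast.literal_eval(as_list)
    return [pair[1] for pair in pairs]

sample = '17336665792013070199      {(17336665792013070199,41)}'
print(second_items(sample))
```

The returned list can be passed to writer.writerow() exactly like the row built with eval() above.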
 
pepr Commented:
It is not possible to remove a character from an existing string, because Python strings are immutable. However, you can create a new string without the unwanted characters. The above example uses splitting to remove the first number and slicing to remove the { }. If s is a string, then s[x:y] is the substring from the zero-based index x up to, but not including, index y. A negative index counts from the end. So s[1:-1] is the substring from the second character to the one before the last, which removes the { }.
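
A small illustration of the splitting and slicing described above, using one line of the sample data. The variable names are chosen only for this sketch.

```python
line = '17336665792013070199      {(17336665792013070199,41)}'

# split() with no arguments cuts the line at runs of whitespace.
parts = line.split()
second = parts[1]

# [1:-1] keeps everything from index 1 up to, but not including, the last character.
inner = second[1:-1]

# The original string is unchanged; slicing built a new one.
print(second)
print(inner)
```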
 
Ricky Ng (Author) Commented:
Hi pepr,

I am getting an error:

  File "dropchars.py", line 7
    with open(fname) as fin, open(fcsvname, 'wb') as fout:
                           ^
SyntaxError: invalid syntax



Thanks
 
pepr Commented:
Did you set the fname earlier?
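
Another possible explanation, offered as an assumption because the thread does not say which Python version was used: the caret points at the comma between the two open() calls, and a single with statement that manages several files is only accepted from Python 2.7 (and 3.1) onwards. On an older interpreter such as 2.6, the statements can be nested instead. The sketch below shows only that structure; copy_second_column is a hypothetical helper that writes the brace-delimited part of each line unchanged.

```python
def copy_second_column(fname, foutname):
    """Copy the second whitespace-separated field of each line to foutname."""
    # Nested with statements: one file per statement, so no comma is needed.
    with open(fname) as fin:
        with open(foutname, 'w') as fout:
            for line in fin:
                fields = line.split()
                if len(fields) > 1:       # skip blank or malformed lines
                    fout.write(fields[1] + '\n')
```

The same nesting can be applied to the csv version by replacing the body of the inner block.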
 
aikimark Commented:
You can also apply two regular expression patterns to do the parsing.
First apply the following pattern to all the text:
\{(.*)\}
Then iterate over the matches and apply the following pattern to each match:
,(\d\d)\)
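
A minimal sketch of how the two patterns might be combined with Python's re module. The sample text and variable names are assumptions made for the illustration.

```python
import re

text = (
    '17336665552013070139      {(17336665552013070139,75),(17336665552013070139,35)}\n'
    '17336665792013070199      {(17336665792013070199,41)}\n'
)

outer = re.compile(r'\{(.*)\}')     # everything between { and } on one line
inner = re.compile(r',(\d\d)\)')    # a comma, exactly two digits, then )

rows = []
for braces in outer.findall(text):
    # Each match of the first pattern is searched with the second pattern.
    rows.append(inner.findall(braces))

print(rows)
```

Note that \d\d matches exactly two digits, which fits the sample data; values of any other length would need \d+ instead.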
 
pepr Commented:
My +1 for aikimark's suggestion to use regular expressions in this case. It will actually be safer (avoiding eval()), and I would not be surprised if the solution was also faster. I would only use a different regular expression, one that captures a single tuple (in parentheses), and then use the findall method of the compiled regular expression to return the list of wanted elements. However, the result is a list of strings, which should be converted to integers before using the csv module:
#!python2

import csv
import re

fname = 'data.txt'
fcsvname = 'data.csv'
rexSecondItems = re.compile(r'\(\d+,(\d+)\)')

with open(fname) as fin, open(fcsvname, 'wb') as fout:
    writer = csv.writer(fout)
    for line in fin:
        lstS = rexSecondItems.findall(line)
        row = [int(s) for s in lstS]
        writer.writerow(row)


The r'...' means a raw string. That means that escape sequences (which start with a backslash) are not interpreted by Python. This is usual when working with regular expressions, because regular expressions use backslashes themselves and need to interpret them on their own. The \( means "one character equal to the left parenthesis". It is written with a backslash because parentheses without a backslash group a part of the regular expression, as the later part of this pattern shows. The \d means a decimal digit, and the + means one or more times. The findall method returns a list with the text captured by the group for every match.
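
To make the role of the grouping parentheses concrete, here is a short sketch that runs the same pattern on one line of the sample data, once with the group and once without it. The variable names are chosen only for this illustration.

```python
import re

line = '17349274502013070413      {(17349274502013070413,25),(17349274502013070413,54)}'

# With a group, findall returns only the text captured inside ( ... ).
rex = re.compile(r'\(\d+,(\d+)\)')
captured = rex.findall(line)
print(captured)

# Without a group, findall returns each whole match instead.
whole = re.findall(r'\(\d+,\d+\)', line)
print(whole)
```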

If you know that the data contain only numbers, you can even avoid the csv module and join the list on your own:
#!python2

import re

fname = 'data.txt'
fcsvname = 'data.csv'
rexSecondItems = re.compile(r'\(\d+,(\d+)\)')

with open(fname) as fin, open(fcsvname, 'w') as fout:
    for line in fin:
        lst = rexSecondItems.findall(line)
        fout.write(','.join(lst) + '\n')


In this case, the file should be opened in text mode (unlike the previous case, where the csv module in Python 2 requires binary mode).
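
For Python 3 the rule for the csv module is different: the output file is opened in text mode with newline='' rather than in binary mode. The following is a sketch of how the csv version might look there, not a tested port of the script above. The function name convert is hypothetical, and skipping lines without tuples is an added assumption.

```python
import csv
import re

REX_SECOND_ITEMS = re.compile(r'\(\d+,(\d+)\)')

def convert(fname, fcsvname):
    """Write the second item of every tuple in fname as CSV rows."""
    # Python 3: the csv module wants a text-mode file opened with newline=''.
    with open(fname) as fin, open(fcsvname, 'w', newline='') as fout:
        writer = csv.writer(fout)
        for line in fin:
            row = [int(s) for s in REX_SECOND_ITEMS.findall(line)]
            if row:                 # assumption: skip lines with no tuples
                writer.writerow(row)
```

A call such as convert('data.txt', 'data.csv') would then replace the top-level loop of the Python 2 script.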
 
aikimark Commented:
If you read the file line by line, you can skip the first pattern. The second pattern will parse out the numbers.