Solved

Python, regex, replace characters not between quotes

Posted on 2013-02-06
Last Modified: 2013-02-12
Hey,

I'm trying to write an (unusual?) regex in Python, using str.replace or the re module.

Simply put, I want to replace every letter or digit that is NOT between quotes in a string. I can do it with a "for i in" loop that checks every character, but (I am assuming) regex would be a million times faster :-)

I.e.

var = 'This is not in quotes "while this is" but this isnt'
var = <magic>
print var
cccc cc ccc cc cccccc while this is ccc cccc cccc

or

var = 'This num is 384 "but this num is 223" and neither is 444'
var = <magic>
print var
cccc ccc cc ### but this num is 223 ccc ccccccc cc ###


--- appended ---

I came up with this so far..

import re
pattern = r"'([^'\\]*(?:\\.[^'\\]*)*)'"
var = "word is word word 'This is a test' is crap"
m = re.search(pattern, var)
replaced_var = re.sub(r'[a-zA-Z]', "c", var)
var = replaced_var[:(m.start())] + var[(m.start()+1):(m.end()-1)] + replaced_var[(m.end()):]
print var

[OUTPUT]
cccc cc cccc cccc This is a test cc cccc


IS THERE A WAY to get this to work with multiple quoted areas in a string? I.e. maybe using regex groups such that...

var = replaced_var[:<group 0 m.start>] + var[<1st quoted text>] + replaced_var[<group 0 m.end>]

var = var + replaced_var[:<group 1 m.start>] + var[<2nd quoted text>] + replaced_var[<group 1 m.end>]

var = var + <etc>

Thanks!
0
Comment
Question by:Mike R.
12 Comments
 
LVL 9

Expert Comment

by:oheil
ID: 38858822
What is the expected result when you have an odd number of " ?
Like:

abc"def"ghi"jkl

What would be the desired result?

Oli
0
 
LVL 29

Expert Comment

by:pepr
ID: 38858835
There is at least one more problem. The quoted string may contain \" -- i.e. escaped quote.
0
 
LVL 9

Expert Comment

by:oheil
ID: 38858847
Yes, maybe, if escaping of double quotes should also be possible.

Are we talking only about double quotes <"> (like in the examples above), or should single quotes <'> be treated the same way?

I am thinking of a recursive solution, which is why the problem of an odd number of " came to mind.

Oli
0
 
LVL 16

Expert Comment

by:gelonida
ID: 38858876
I agree with pepr.
Escaped quotes, if they can occur in your context, would complicate the situation a bit.

Next question, as I don't know what you are parsing:

Can you have multi-line entries, where the quote pairs stretch over multiple lines?

Example:
 'This num is
384 "but this num is
223" and neither is 444'

Depending on your answers, it might be easiest to split your problem into two parts.
The first part would be an iterator yielding tuples of (unquoted_part, quoted_part); in the second part you apply a classical regexp to whichever parts you want to replace in.

This is more manual, but might be easier to understand, debug and maintain, and will probably not be that much slower.
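
A minimal sketch of that two-part idea, assuming double quotes only, no escaped quotes, and that the quote characters themselves are dropped from the result (as in the question's examples). The names `split_parts` and `mask` are made up for illustration:

```python
import re

# Assumption: quoted parts use double quotes and contain no escaped quotes.
QUOTED = re.compile(r'"([^"]*)"')

def split_parts(text):
    """Yield (unquoted_part, quoted_part) tuples.

    quoted_part is None for the trailing piece after the last quoted part.
    """
    pos = 0
    for m in QUOTED.finditer(text):
        yield text[pos:m.start()], m.group(1)
        pos = m.end()
    yield text[pos:], None

def mask(text):
    """Replace letters with 'c' and digits with '#' outside quotes only."""
    out = []
    for unquoted, quoted in split_parts(text):
        unquoted = re.sub(r"[A-Za-z]", "c", unquoted)
        unquoted = re.sub(r"[0-9]", "#", unquoted)
        out.append(unquoted)
        if quoted is not None:
            out.append(quoted)  # kept verbatim, without the quotes
    return "".join(out)

print(mask('This num is 384 "but this num is 223" and neither is 444'))
# cccc ccc cc ### but this num is 223 ccc ccccccc cc ###
```

Because `[^"]*` also matches newlines, quote pairs that stretch over several lines are handled the same way.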
0
 
LVL 31

Expert Comment

by:farzanj
ID: 38858884
A regular expression is not going to be faster than a loop at all. A lot of regexes will in fact be much slower. They are faster in terms of coding time: they shorten your development time, not the execution time.
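
For comparison, the plain-loop version the question mentions could look roughly like this. It is only a sketch under the same assumptions as the question's examples (one quote character, no escapes, quotes dropped from the output); `mask_with_loop` is a made-up name. Which variant is faster is best settled by timing both with `timeit` on representative input.

```python
def mask_with_loop(text, quote='"'):
    """Replace letters with 'c' and digits with '#' outside quoted parts."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == quote:
            in_quotes = not in_quotes  # toggle state, drop the quote itself
        elif in_quotes:
            out.append(ch)             # inside quotes: keep verbatim
        elif ch.isalpha():
            out.append("c")
        elif ch.isdigit():
            out.append("#")
        else:
            out.append(ch)             # spaces, punctuation, ...
    return "".join(out)

print(mask_with_loop('This is not in quotes "while this is" but this isnt'))
# cccc cc ccc cc cccccc while this is ccc cccc cccc
```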
0
 
LVL 3

Author Comment

by:Mike R.
ID: 38858912
Thanks for the input. So, this is actually part of a script to "user-friendly-ize" input for users who DON'T know regex.

SHORT ANSWERS:
oheil: It checks for matching quotes before processing. If unmatched quotes are found, it just halts.

pepr: Escaped chars won't be allowed ( \ should be ignored)

oheil: I may restrict it to tics actually ``. I haven't decided.
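
The "matching quotes" check mentioned above is not shown in the thread. A minimal sketch of one way to do it, assuming a single quote character and no escaping (`check_balanced` is a hypothetical name):

```python
def check_balanced(text, quote="`"):
    """Return text unchanged, or raise if the quote count is odd."""
    if text.count(quote) % 2 != 0:
        raise ValueError("unmatched %s in %r" % (quote, text))
    return text

check_balanced("`/dev/disk/by-`..`/scsi-`*")   # passes
# check_balanced('abc"def"ghi"jkl', '"')       # would raise ValueError
```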

LONG ANSWERS:
Actually, (because maybe you have a better solution altogether) I'm writing a function to check the format of user input without needing to write a bunch of matches into the calling script...i.e. ...

The function user can make a call defining the exact format of the input with wildcards. The things in quotes need to match the input EXACTLY. The other chars represent different variations of wildcards (I.e. one for char only, one for num only, one for anything...etc)

format_var = " '/dev/disk/by-'..'/scsi-'* "
inputf(format_var)

meaning the user must input a string ...
/dev/disk/by-<any two chars>/scsi-<any number of chars>

I.e.   /dev/disk/by-id/scsi-2133213213213213213213213213
...is good but...

/dev/dsk/by-id/scsi-2133213213213213213213213213
or
/dev/dIsk/by-id/scsi-2133213213213213213213213213
or
./by-id/scsi-2133213213213213213213213213

...will all fail as not matching the format_var string style.

Does this make sense? Is there a module that already does this? I didn't find one :-)
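
One way to get this behaviour is to translate the format string into a regular expression. The sketch below assumes only what the example above shows: single-quoted literals, `.` for any one character and `*` for any number of characters, with leading and trailing whitespace ignored. The other wildcard kinds (letters only, digits only) are left out, and the names `format_to_regex` and `matches` are invented for illustration:

```python
import re

# A token is a quoted literal, a '.' wildcard or a '*' wildcard.
TOKEN = re.compile(r"'([^']*)'|(\.)|(\*)")

def format_to_regex(format_var):
    """Translate a format string into a compiled, fully anchored regex."""
    spec = format_var.strip()
    parts = []
    pos = 0
    for m in TOKEN.finditer(spec):
        if m.start() != pos:
            raise ValueError("unexpected text in format: %r" % spec[pos:m.start()])
        if m.group(1) is not None:
            parts.append(re.escape(m.group(1)))  # literal must match exactly
        elif m.group(2):
            parts.append(".")                    # any one character
        else:
            parts.append(".*")                   # any number of characters
        pos = m.end()
    if pos != len(spec):
        raise ValueError("unexpected text in format: %r" % spec[pos:])
    return re.compile("".join(parts) + r"\Z")

def matches(format_var, user_input):
    return format_to_regex(format_var).match(user_input) is not None

fmt = " '/dev/disk/by-'..'/scsi-'* "
print(matches(fmt, "/dev/disk/by-id/scsi-2133213213213213213213213213"))  # True
print(matches(fmt, "/dev/dsk/by-id/scsi-2133213213213213213213213213"))   # False
```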
0
 
LVL 12

Expert Comment

by:Sharon Seth
ID: 38859016
Doesn't fnmatch do it already?
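
For the disk-path example, a shell-style pattern does cover it. A small sketch using `fnmatch.fnmatchcase` (the case-sensitive variant), where `?` stands for one character and `*` for any run of characters; the pattern and the helper name `is_valid` are written for this illustration:

```python
import fnmatch

# '?' = exactly one character, '*' = any number of characters.
PATTERN = "/dev/disk/by-??/scsi-*"

def is_valid(path):
    return fnmatch.fnmatchcase(path, PATTERN)

print(is_valid("/dev/disk/by-id/scsi-2133213213213213213213213213"))  # True
print(is_valid("/dev/dIsk/by-id/scsi-2133213213213213213213213213"))  # False
```

fnmatch only knows `*`, `?` and bracket classes such as `[0-9]`, so "letters only" or "digits only" wildcards would have to be spelled out with brackets.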
0
 
LVL 29

Expert Comment

by:pepr
ID: 38859022
@farzanj: > Regular expression is not going to be faster than a loop at all.

Regular expressions usually are faster. But I agree that a loop may be easier in this case.

@rightmirem: If using regular expressions in Python, I suggest preferring the compiled form almost always.
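
As an illustration of the compiled form, the original problem can be handled in one pass with a precompiled alternation and a replacement function. This is a sketch, assuming double quotes, no escaped quotes, and that the quotes are dropped from the output as in the question's examples:

```python
import re

# Compiled once, reused for every call. Alternatives are tried left to
# right, so a whole quoted run is consumed before its letters are seen.
TOKEN = re.compile(r'"([^"]*)"|([A-Za-z])|([0-9])')

def _replace(m):
    if m.group(1) is not None:
        return m.group(1)              # quoted text, kept without quotes
    return "c" if m.group(2) else "#"  # letter -> c, digit -> #

def mask(text):
    return TOKEN.sub(_replace, text)

print(mask('This is not in quotes "while this is" but this isnt'))
# cccc cc ccc cc cccccc while this is ccc cccc cccc
```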
0
 
LVL 9

Accepted Solution

by:
oheil earned 500 total points
ID: 38859084
This is what I would do:

Split the input string into a list of parts. Parts at even indexes are outside single quotes (') and parts at odd indexes are inside them. The parts can be empty. This is the code:

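import re

input = "abc''def'ghi'123'jkl'456"

# Groups: text before the first quoted part, the quoted part, the rest.
p = re.compile("(.*?)('.*?')(.*)")

# Slot 3 always holds the text that still has to be processed.
input_list = ["", "", "", input]

print("Input:")
print(input_list[3])

result_list = []
while len(input_list[0]) == 0 and len(input_list[3]) > 0:
   # On a match, split() returns ['', before, quoted, rest, ''].
   # Without a match it returns just [remaining_text].
   input_list = p.split(input_list[3])
   if len(input_list[0]) > 0:
      # No quoted part left: keep the tail and stop.
      result_list.append(input_list[0])
   else:
      result_list.append(input_list[1])
      result_list.append(input_list[2])

print("")
print("Result:")
for part in result_list:
   print(part)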
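
The same even/odd list can also be produced with a single `re.split` call, because a capturing group in the pattern makes `re.split` keep the separators. The sketch below is an alternative formulation, not the code from the answer; it assumes single quotes without escapes, and `split_quoted` is a made-up name. The last lines show how the list could then be used to replace letters only outside the quotes:

```python
import re

def split_quoted(text):
    """Even indexes: outside quotes. Odd indexes: quoted parts (with quotes)."""
    return re.split(r"('[^']*')", text)

parts = split_quoted("abc''def'ghi'123'jkl'456")
print(parts)
# ['abc', "''", 'def', "'ghi'", '123', "'jkl'", '456']

# Replace letters only in the parts outside quotes (even indexes).
masked = "".join(
    part if i % 2 else re.sub(r"[A-Za-z]", "c", part)
    for i, part in enumerate(parts)
)
print(masked)
# ccc''ccc'ghi'123'jkl'456
```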


Oli
0
 
LVL 9

Expert Comment

by:oheil
ID: 38859093
By definition, if the number of single quotes is odd, then the last part is not inside quotes but contains one quote.

You may check with the example string
input = "abc'def'ghi'123'jkl'456"


Oli
0
 
LVL 16

Expert Comment

by:gelonida
ID: 38859160
Who chooses the format of format_var?
 
Why don't you use a regexp for it?
0
 
LVL 3

Author Closing Comment

by:Mike R.
ID: 38879244
Worked beautifully actually. Thanks!
0

Question has a verified solution.
