Mike R.
asked on
Python, regex, replace characters not between quotes
Hey,
I'm trying to create an (unusual?) regex statement using Python's str.replace or re.
Simply put, I want to replace every letter or digit that is NOT between quotes in a string. I can do it with a "for i in ..." loop that checks every character, but (I am assuming) a regex would be a million times faster :-)
I.e.
var = 'This is not in quotes "while this is" but this isnt'
var = <magic>
print(var)
cccc cc ccc cc cccccc while this is ccc cccc cccc
or
var = 'This num is 384 "but this num is 223" and neither is 444'
var = <magic>
print(var)
cccc ccc cc ### but this num is 223 ccc ccccccc cc ###
--- appended ---
I came up with this so far..
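import re
# Match one single-quoted span, allowing backslash escapes inside it.
pattern = r"'([^'\\]*(?:\\.[^'\\]*)*)'"
var = "word is word word 'This is a test' is crap"
m = re.search(pattern, var)
# Mask every letter, then splice the original quoted text back in (quotes dropped).
replaced_var = re.sub(r'[a-zA-Z]', "c", var)
var = replaced_var[:m.start()] + var[m.start() + 1:m.end() - 1] + replaced_var[m.end():]
print(var)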
[OUTPUT]
cccc cc cccc cccc This is a test cc cccc
IS THERE A WAY to get this to work with multiple quoted areas in a string? I.e. maybe using regex groups such that...
var = replaced_var[:<group 0 m.start>] + var[<1st quoted text>] + replaced_var[<group 0 m.end>]
var = var + replaced_var[:<group 1 m.start>] + var[<2nd quoted text>] + replaced_var[<group 1 m.end>]
var = var + <etc>
Thanks!
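One way the splice idea above might extend to several quoted areas is to walk every match with re.finditer instead of a single re.search. This is only a sketch: the function name mask_outside_quotes is made up for illustration, it assumes single quotes with no escaping, and it masks digits as "#" as in the second example.

```python
import re

# One single-quoted span, no escape handling (an assumption of this sketch).
QUOTED = re.compile(r"'([^']*)'")

def mask_outside_quotes(text):
    """Mask letters as 'c' and digits as '#' everywhere except inside '...'."""
    # Masking is one character for one character, so indexes stay aligned.
    masked = re.sub(r"[0-9]", "#", re.sub(r"[a-zA-Z]", "c", text))
    pieces = []
    pos = 0
    for m in QUOTED.finditer(text):
        pieces.append(masked[pos:m.start()])  # masked text before this span
        pieces.append(m.group(1))             # original quoted text, quotes dropped
        pos = m.end()
    pieces.append(masked[pos:])               # masked tail after the last span
    return "".join(pieces)

print(mask_outside_quotes("word is 'kept 1' word 22 'kept 2' end"))
```

Because the masked copy has the same length as the original, the match positions from the original string can be used to slice both.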
There is at least one more problem. The quoted string may contain \" (i.e. an escaped quote).
Yes, maybe, if escaping of double quotes should also be possible.
Are we talking only about double quotes <"> (like in the examples above), or should single quotes <'> be equivalent?
I am thinking of a recursive solution, so the problem of an odd number of " came to mind.
Oli
I agree with pepr.
Escaped quotes, if they can exist in your context, would complicate the situation a bit.
Next question, as I don't know what you are parsing:
can you have multi-line entries, where the quote pairs stretch over multiple lines?
Example:
'This num is
384 "but this num is
223" and neither is 444'
Depending on your answers, it might be easiest to split your problem into two parts.
The first part would be an iterator yielding
a tuple (unquoted_part, quoted_part),
and then you can do the replacement only in the part you want, with a classical regexp.
This is more manual, but might be easier to understand, debug and maintain, and will probably not be that much slower.
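The two-part split described above could be sketched roughly like this. The names split_quoted and mask are hypothetical, and the sketch assumes double quotes, no escaped quotes, and that an unmatched quote should be treated as an error. Since it works on the whole string, quote pairs that stretch over several lines are handled too.

```python
import re

def split_quoted(text, quote='"'):
    """Yield (unquoted_part, quoted_part) pairs; quoted_part is None for the tail."""
    parts = text.split(quote)
    # n quotes give n + 1 parts, so an even number of parts means an odd quote count.
    if len(parts) % 2 == 0:
        raise ValueError("unmatched quote")
    for i in range(0, len(parts) - 1, 2):
        yield parts[i], parts[i + 1]
    yield parts[-1], None

def mask(text):
    """Mask letters as 'c' and digits as '#' in the unquoted parts only."""
    out = []
    for unquoted, quoted in split_quoted(text):
        out.append(re.sub(r"[0-9]", "#", re.sub(r"[A-Za-z]", "c", unquoted)))
        if quoted is not None:
            out.append(quoted)  # quoted text is passed through untouched
    return "".join(out)

print(mask('This num is\n384 "but this num is\n223" and neither is 444'))
```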
A regular expression is not going to be faster than a loop at all. Many regexes will in fact be much slower. They are faster in terms of coding time: they make development shorter, not execution faster.
ASKER
Thanks for the input. So, this is actually part of a script to "user-friendly-ize" input for users who DON'T know regex.
SHORT ANSWERS:
oheil: It checks for matching quotes before processing. If unmatched quotes are found, it just halts.
pepr: Escaped chars won't be allowed ( \ should be ignored)
oheil: I may actually restrict it to backticks (``). I haven't decided.
LONG ANSWERS:
Actually (because maybe you have a better solution altogether), I'm writing a function to check the format of user input without needing to write a bunch of matches into the calling script, i.e. ...
The function's user can make a call defining the exact format of the input with wildcards. The things in quotes need to match the input EXACTLY. The other chars represent different kinds of wildcards (i.e. one for char only, one for num only, one for anything, etc.)
format_var = " '/dev/disk/by-'..'/scsi-'* "
inputf(format_var)
meaning the user must input a string ...
/dev/disk/by-<any two chars>/scsi-<any number of chars>
I.e. /dev/disk/by-id/scsi-21332 1321321321 3213213213 213
...is good but...
/dev/dsk/by-id/scsi-21332132132132 1321321321 3213
or
/dev/dIsk/by-id/scsi-213321321321 3213213213 213213
or
./by-id/scsi-2133213213213 2132132132 13213
...will all fail as not matching the format_var string style.
Does this make sense? Is there a module that already does this? I didn't find one :-)
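For illustration, a format string like the one above could be translated into a regular expression along these lines. This is a rough sketch, not an existing module: format_to_regex and the WILDCARDS table are hypothetical, and only the two wildcards shown in the example ("." for any one character, "*" for any number of characters) are assumed. Quoted parts are passed through re.escape so they match literally.

```python
import re

# Hypothetical wildcard table; the post only shows '.' and '*'.
WILDCARDS = {".": ".", "*": ".*"}

# Either a single-quoted literal or one non-space wildcard character.
TOKEN = re.compile(r"'([^']*)'|(\S)")

def format_to_regex(format_var):
    """Build a compiled regex from a quoted-literal/wildcard format string."""
    parts = []
    for m in TOKEN.finditer(format_var):
        literal, wildcard = m.group(1), m.group(2)
        if literal is not None:
            parts.append(re.escape(literal))    # must match EXACTLY
        elif wildcard in WILDCARDS:
            parts.append(WILDCARDS[wildcard])
        else:
            raise ValueError("unknown wildcard: %r" % wildcard)
    return re.compile("".join(parts) + r"\Z")   # \Z: the whole input must match

checker = format_to_regex(" '/dev/disk/by-'..'/scsi-'* ")
print(bool(checker.match("/dev/disk/by-id/scsi-21332")))  # matches the format
print(bool(checker.match("/dev/dsk/by-id/scsi-21332")))   # 'dsk' breaks the literal
```

An unmatched quote falls through to the wildcard branch and is rejected as unknown, which lines up with halting on unmatched quotes.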
Doesn't fnmatch do it already?
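As a quick check of that idea, the /dev/disk example can be written as a shell-style pattern for the standard fnmatch module, where "?" matches any single character and "*" matches any run of characters. The pattern string below is my translation of the asker's format, not something given in the thread, and fnmatch has no letters-only or digits-only wildcard beyond [seq] character sets.

```python
import fnmatch

# Hypothetical fnmatch equivalent of " '/dev/disk/by-'..'/scsi-'* ".
PATTERN = "/dev/disk/by-??/scsi-*"

candidates = [
    "/dev/disk/by-id/scsi-21332",
    "/dev/dsk/by-id/scsi-21332132132132",
    "/dev/dIsk/by-id/scsi-213321321321",
    "./by-id/scsi-2133213213213",
]

for path in candidates:
    # fnmatchcase is always case-sensitive, regardless of the operating system.
    print(path, fnmatch.fnmatchcase(path, PATTERN))
```

Only the first candidate should match; the other three differ from the literal part of the pattern.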
@farzanj: > Regular expression is not going to be faster than a loop at all.
Regular expressions usually are faster. But I agree that a loop may be easier in this case.
@rightmirem: If using regular expressions in Python, I suggest preferring the compiled form almost always.
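To show what the compiled form looks like in practice, here is a sketch that compiles one pattern up front and does the whole replacement in a single sub() call with a callback. The names TOKEN, _mask and mask_outside_quotes are made up for the example; it assumes double quotes without escapes and drops the quote characters, as in the asker's expected output.

```python
import re

# Compiled once at import time and reused on every call.
# Alternatives: a quoted span, a single letter, or a single digit.
TOKEN = re.compile(r'"([^"]*)"|([A-Za-z])|([0-9])')

def _mask(m):
    """Return the replacement for one match."""
    if m.group(1) is not None:
        return m.group(1)              # quoted text is kept, quotes dropped
    return "c" if m.group(2) else "#"  # letter -> c, digit -> #

def mask_outside_quotes(text):
    return TOKEN.sub(_mask, text)

print(mask_outside_quotes('This is not in quotes "while this is" but this isnt'))
print(mask_outside_quotes('This num is 384 "but this num is 223" and neither is 444'))
```

Because the quoted alternative comes first, a quoted span is consumed as one match and its letters and digits never reach the other two alternatives.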
ASKER CERTIFIED SOLUTION
By definition, if the number of single quotes is odd, then the last part is not inside quotes but contains one quote.
You may check with the example string
input = "abc'def'ghi'123'jkl'456"
Oli
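The odd-quote case can be checked directly on that example string. A small sketch, assuming a simple non-greedy-free pattern for one quoted span and no escaping (the variable is renamed to text here only to avoid shadowing the built-in input):

```python
import re

text = "abc'def'ghi'123'jkl'456"

quote_count = text.count("'")
matches = list(re.finditer(r"'([^']*)'", text))
quoted = [m.group(1) for m in matches]
# Whatever follows the last complete pair is unquoted, stray quote included.
tail = text[matches[-1].end():] if matches else text

print(quote_count)  # 5
print(quoted)       # ['def', '123']
print(tail)         # jkl'456
```

Note that 'ghi' is not reported as quoted: the pairs are consumed left to right, so the third quote opens the span around 123.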
Who's choosing the format of format_var?
Why don't you use a regexp for it?
ASKER
Worked beautifully actually. Thanks!
Like:
abc"def"ghi"jkl
What would be the desired result?
Oli