perl: REGEX help

Hi, I need to test metadata keys for validity.
These formats are valid:

geo.0123.lowercaseword
AB.lowercaseword.lowercaseword
AB.lowercaseword.123
ABC.lowercaseword123
ABC.lowercaseword.lowercaseword
exception(s) or special cases where above rules don't apply

not valid
ab.UPPERcaseword123
AD.not132valid


so unless geo or exception
2 or 3 uppercase letters DOT lowercaseword DOT or numbers



for my $metaKey (sort keys %metaHash){
  # skip anything on the exceptions list
  next if ($metaKey =~ m/(exceptions|list)/);
  if ($metaKey =~ m/^([A-Z]{2,3}\.[a-z0-9|\.]|geo\.\d.*[a-z])$/) {
      print "$metaKey good\n";
  }
  else {
      print "$metaKey bad\n";
  }
}



I suspect I'll need to do this in two passes, as the exceptions list might be quite long. Either way, a $metaKey is bad if it is neither in that list nor in a valid format.

I'm assuming hash keys are case-sensitive,
so "HashKey" is not the same as "hashkey" and one doesn't overwrite the other?
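On the case-sensitivity point: Perl hash keys are compared as exact strings, so keys that differ only in case are separate entries. A minimal sketch to confirm it (the hash name %meta and its values are made up for illustration):

```perl
use strict;
use warnings;

# Hash keys are plain strings and are matched exactly, so case matters.
my %meta;
$meta{'HashKey'} = 'first';
$meta{'hashkey'} = 'second';   # a different key; 'HashKey' is left untouched

# Both entries survive side by side.
printf "%d keys\n", scalar keys %meta;
print "HashKey => $meta{'HashKey'}\n";
print "hashkey => $meta{'hashkey'}\n";
```

This is also why AB.docid and ab.docid can both exist in the same hash.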
trevor1940 Asked:
wilcoxon Commented:
I think this will do what you want.  I changed the geo portion slightly to do better checking based on the example - if other geo cases are valid, you can change it back - the only critical change to that portion was adding + after [a-z].  Most of your problem was due to anchoring the regex to the end of the string ($) but not giving full string matching in the regex (I prefer full validation so changed the match rather than remove the end-of-string anchor).
for my $metaKey (sort keys %metaHash){
  next if ($metaKey =~ m/(exceptions|list)/);
  if ($metaKey =~ m{^(?:[A-Z]{2,3}\.[a-z]+(?:[0-9]+|\.[a-z]+)|geo\.\d+\.[a-z]+)$}) {
      print "$metaKey good\n";
  }
  else{
     print "$metaKey bad\n";
  }
}


ozo Commented:
print "$_ ".(/geo|exception|^[A-Z]{2,3}\.[a-z]+[.\d]+$/?"good":"bad")."\n" for sort keys %metaHash;
Rgonzo1971 Commented:
Hi,

Please try:
^([A-Z]{2,3}\.[a-z]+\.?(\d+|[a-z]+)|geo\.\d.*[a-z])$


Regards
Rgonzo1971 Commented:
Or maybe this, to allow for the exceptions:
^([A-Z]{2,3}\.[a-z]+\.?(\d+|[a-z]+)|geo\.\d.*[a-z]|exception.*)$


ozo Commented:
print "$_ ".(/geo|exception|^[A-Z]{2,3}\.[a-z]+(\d+|\.(\d+|[a-z]+))$/?"good":"bad")."\n" for sort keys %metaHash;
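For anyone who finds the one-liner hard to read, here is a rough expansion of it as a small test harness. It uses the same pattern, unchanged; the sub name is_valid_key is made up for this sketch, and the sample keys are the ones listed in the question (the hash values are placeholders).

```perl
use strict;
use warnings;

# Same pattern as the one-liner above, wrapped in a sub for readability.
# is_valid_key is a hypothetical name, not something from the thread.
sub is_valid_key {
    my ($key) = @_;
    return $key =~ /geo|exception|^[A-Z]{2,3}\.[a-z]+(\d+|\.(\d+|[a-z]+))$/ ? 1 : 0;
}

# Sample keys copied from the question; the values are placeholders.
my %metaHash = map { $_ => 1 } qw(
    geo.0123.lowercaseword
    AB.lowercaseword.lowercaseword
    AB.lowercaseword.123
    ABC.lowercaseword123
    ABC.lowercaseword.lowercaseword
    ab.UPPERcaseword123
    AD.not132valid
);

for my $key (sort keys %metaHash) {
    print "$key ", (is_valid_key($key) ? "good" : "bad"), "\n";
}
```

Note that the geo and exception alternatives are not anchored, so any key containing either word anywhere is accepted.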
trevor1940 (Author) Commented:
@wilcoxon

I've now found a valid example of the form

geo.lowercaseword

Will this capture that?

if ($metaKey =~ m{^(?:[A-Z]{2,3}\.[a-z]+(?:[0-9]+|\.[a-z]+)|geo\.(?:\d+\.+)*[a-z]+)$})



As I suspected, the list of exceptions is growing, so having it all in one regex will work but makes it unreadable.
ozo Commented:
geo.lowercaseword  is already covered by "so unless geo or exception"
But if ABC.lowercaseword is valid, that could simplify it to /geo|exception|^[A-Z]{2,3}\.[a-z]+\.?(\d+|[a-z]+)$/
wilcoxon Commented:
If geo.lowercaseword is valid, I would then amend the line to be:
if ($metaKey =~ m{^(?:[A-Z]{2,3}\.[a-z]+(?:[0-9]+|\.[a-z]+)|geo(?:\.\d+)?\.[a-z]+)$})



Your change would work but would also allow things like geo.numbers.numbers.numbers.lowercaseword

Alternatively, as ozo points out, you could simply remove most validation from geo:
if ($metaKey =~ m{^[A-Z]{2,3}\.[a-z]+(?:[0-9]+|\.[a-z]+)$|^geo\.})

ozo Commented:
Is there anything containing "geo" or "exception" that is not valid?
trevor1940 (Author) Commented:
Is there anything containing "geo" or "exception" that is not valid?
"exception" is going to be a hard-coded list of valid metadata that doesn't fit the format in the opening remarks. Most are single words, so to answer your question: no!
Hence the reason to treat these separately and simply skip over them using next.

in theory every thing starting with lower case   "geo" should also be valid as long as every thing after the first dot is either [a-z] or [0-9] or \.

So for simplicity I may end up treating these separately.
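One way the "treat these separately" idea could look is a hash lookup for the exceptions followed by two small patterns, instead of one large regex. This is only a sketch under assumptions: the exception names are invented (the real list is not shown in this thread), classify_key is a made-up helper, the geo pattern is one reading of the rule stated above, and the non-geo pattern is borrowed from ozo's earlier comment.

```perl
use strict;
use warnings;

# Hypothetical exception names; the real list is not given in the thread.
my @exceptions   = qw(author summary keywords);
my %is_exception = map { $_ => 1 } @exceptions;

# classify_key is a made-up helper: 'skip' for exceptions, else good/bad.
sub classify_key {
    my ($key) = @_;

    # Exact, case-sensitive lookup; stays readable as the list grows.
    return 'skip' if $is_exception{$key};

    # Assumed reading of the geo rule: only [a-z], [0-9] or dots after "geo."
    return 'good' if $key =~ /^geo\.[a-z0-9.]+$/;

    # Non-geo keys, using the pattern from ozo's earlier comment.
    return 'good' if $key =~ /^[A-Z]{2,3}\.[a-z]+(?:\d+|\.(?:\d+|[a-z]+))$/;

    return 'bad';
}

print "$_ => ", classify_key($_), "\n"
    for qw(author geo.0001.position AB.lowercaseword.123 ab.docid);
```

Because the lookup is exact, a key such as "Author" would not be skipped and would fall through to the format checks.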
ozo Commented:
in theory every thing starting with lower case   "geo" should also be valid as long as every thing after the first dot is either [a-z] or [0-9] or \.
That's a little different from how I originally interpreted the problem.
Is "every thing after the first dot is either [a-z] or [0-9] or \." meant to be the same condition as "DOT lowercaseword DOT or numbers"
or are there things that are valid following ^geo but not after ^[A-Z]{2,3}?
If the latter, then you might use
/exception|^(geo[^.]*(\.[a-z\d.]*)?|[A-Z]{2,3}\.[a-z]+\.?(\d+|[a-z]+))$/
Unless "every thing after the first dot is either [a-z] or [0-9] or \." is meant to imply that it is required that a first dot is present in which case
/exception|^(geo[^.]*\.[a-z\d.]*|[A-Z]{2,3}\.[a-z]+\.?(\d+|[a-z]+))$/,
could suffice
trevor1940 (Author) Commented:
Yes, unless the key is in the exceptions list, the first DOT is mandatory,

and unless it is geo, what precedes the first DOT should be upper case: AB, KLM or XYZ.

Some valid examples of the geo metadata are

geo.datum                =   Transvers mercator
geo.projection          =   WGS 1984
geo.0001.position    =   Lat Long
geo.0001.shape        =   point

Some valid examples of other metadata are

AB.docid
KLM.pubdate
XYZ.title
XYZ.lcword.lcword.1234


The problem I have is that some keys exist in both forms:

AB.docid and ab.docid or ab.docID etc.


So I need to locate the second two and remove them.

Hoping this makes sense.
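A possible way to locate those case variants is to group the keys by their lowercased form and then flag the members of each group that break the prefix rule. This is a sketch, not a tested solution for the real data: the sample hash and its values are placeholders, and the pattern is a simplified assumption (upper-case prefix or geo, then a dot, then only lower-case letters, digits and dots).

```perl
use strict;
use warnings;

# Sample keys based on the examples above; the values are placeholders.
my %metaHash = (
    'AB.docid'    => 'a',
    'ab.docid'    => 'b',
    'ab.docID'    => 'c',
    'KLM.pubdate' => 'd',
);

# Group keys that are identical apart from letter case.
my %variants;
push @{ $variants{ lc $_ } }, $_ for sort keys %metaHash;

# Collect keys that have case variants and fail the (assumed) format rule.
my @suspects;
for my $group (sort keys %variants) {
    my @keys = @{ $variants{$group} };
    next if @keys < 2;    # no case variants for this key
    push @suspects, grep { !/^(?:[A-Z]{2,3}|geo)\.[a-z0-9.]+$/ } @keys;
}

print "suspect: $_ (value: $metaHash{$_})\n" for @suspects;
```

Printing the value next to each suspect key makes it easier to eyeball whether the data is worth keeping before anything is deleted.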
ozo Commented:
A few clarifications:

Is "geo" in the exceptions list?

Although you only stated restrictions on what follows the first dot, and did not state any restrictions on what can be between "geo" and the first dot, in your valid examples there is nothing between "geo" and the first dot. Should that be a requirement?

You say
geo.datum                =   Transvers mercator
is valid, but "                =   Transvers mercator" is after the first dot, and is not either [a-z] or [0-9] or \.
Should "                =   .*" be added to the list of every thing after the first dot?

When you have
AB.docid and ab.docid or ab.docID etc.
and you need to locate the second 2 and remove them, should the result be
AB.docid and  or  etc.
Or do you mean that whenever the beginning of a key looks valid except for some extra stuff following, everything after the valid first part should be removed?

Do you want to remove invalid keys and leave valid keys instead of just printing "good" after valid keys and "bad" after invalid keys?

If you change an invalid key into a valid key by removing invalid parts, what should happen if that key then collides with another valid key?  Should the data associated with both keys be combined somehow?
trevor1940 (Author) Commented:
Wow

"geo" is not in the list of exceptions.

There is nothing between "geo" and the first DOT, and the same goes for the other valid examples, so only AB | KLM | XYZ | geo should be before the first DOT.

"geo.datum                =   Transvers mercator" was an example of a valid key/value pair. I apologize for the confusion.

I am still trying to establish what to do with the invalid metadata.

The %metaHash is generated from an SQL query. My hunch is that once I've worked out what is 'good' and what is 'bad', I will run an update script to remove the 'bad' entries from the hash and the database.

I am also still trying to establish what to do when %metaHash keys clash. This is being discussed by people above my pay scale.

Combining the key values and updating the database is one option. My recommendation is going to be that if the key isn't valid, the value will probably be invalid too; however, this may not hold for every metadata key/value pair.

For now I'm just trying to separate the good from the bad.

I'm hoping for a list of 'bad' meta keys, so I can then do a visual check of the values to see whether they are right or wrong.

I can then report back up the pay scale.
wilcoxon Commented:
Based on the last few comments, this should get you closer:
$metaKey =~ m{^[A-Z]{2,3}(?:\.[a-z]+)+(?:\.?\d+)?$|^geo(?:\.\d+)?\.[a-z]+(?:$|\s)}



You may want to remove the last (?:$|\s) - it was an attempt to validate geo keys but exclude the \s+=.*
trevor1940 (Author) Commented:
Hi
Solved it: I found 138 invalid meta tags.

I was able to simplify the regex like so:

if ($metaKey =~ m{^([A-Z]{2,3}|geo)\.([a-z]+|[0-9]|\.)$.})

Not sure who wins, so it's shared between you.