Solved

Matching a random pattern with one common character

Posted on 2016-11-18
Last Modified: 2016-11-22
Hi, I have a file with a large number of character conversion errors: all non-ASCII characters were converted to question marks ("?"), so there are many instances of strings such as: Jos?, Company?s, ???????ahs-dhdh, The???hdh--dhd?, etc. The length of each string varies, as does the number of question marks in it.

Is there a regular expression (or expressions) I can use in a Perl script that will match any string with x number of characters and at least one question mark in it? Thanks
Question by:hadrons
2 Comments
 
LVL 20

Accepted Solution

by:
jmcg earned 500 total points
ID: 41893815
Perhaps this little snippet will get you started.

my @TestStrings = ("NoMatch", "Jos?", "Company?s", "???????ahs-dhdh", "The???hdh--dhd?");
for (@TestStrings) {
        # A literal question mark must be escaped in the pattern.
        printf "%s: %s\n", $_, ($_ =~ m/\?/ ? "matched" : "not matched");
}



The part of your question I'm not sure I understand is the "x number of characters". Using the above script, you can decide how to split your data into individual strings, then check each one for whether it contains a question mark. The results look like:
NoMatch: not matched
Jos?: matched
Company?s: matched
???????ahs-dhdh: matched
The???hdh--dhd?: matched
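
If "x number of characters" means a minimum length, one possible reading is to combine the question-mark test with a length check. This is only a sketch under that assumption; the has_damage helper and the threshold of 4 are hypothetical names and values chosen for illustration, not part of the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $str is at least $min_len characters
# long and contains at least one literal question mark, otherwise 0.
sub has_damage {
    my ($str, $min_len) = @_;
    return (length($str) >= $min_len && $str =~ /\?/) ? 1 : 0;
}

# The minimum length of 4 is an assumed value for illustration.
for my $s ("NoMatch", "Jos?", "Company?s", "a?") {
    printf "%s: %s\n", $s, (has_damage($s, 4) ? "matched" : "not matched");
}
```

With these inputs, "a?" contains a question mark but is reported as not matched because it is shorter than the assumed minimum. For an exact length, the >= comparison could be swapped for ==.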



Another approach might look something like the following:
my $TestData = "NoMatch Jos? Company?s ???????ahs-dhdh The???hdh--dhd?";
# In list context, m//g returns every captured substring that contains a "?".
for ($TestData =~ m/([\w\?\-]*\?[\w\?\-]*)/g) {
        printf "%s: %s\n", $_, "matched";
}


Here, you're pulling the strings of interest out of a large batch of data. The character class [\w\?\-] can be expanded if there are other characters you want considered part of your strings. The results leave out that first non-matched string and look like:
Jos?: matched
Company?s: matched
???????ahs-dhdh: matched
The???hdh--dhd?: matched
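
Since the damaged data lives in a file, the same pattern could be applied line by line so that each hit is reported with its line number. The following is a rough sketch, not a definitive implementation: the find_damaged helper and the $sample text are hypothetical, and an in-memory filehandle stands in for the real file so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the damaged file.
my $sample = "Jos? works at Company?s office\n"
           . "Nothing wrong here\n"
           . "The???hdh--dhd? again\n";

# Hypothetical helper: returns [line number, token] pairs for every
# token that contains at least one question mark.
sub find_damaged {
    my ($fh) = @_;
    my @hits;
    while (my $line = <$fh>) {
        # Same character class as above; inside [...] the "?" and a
        # trailing "-" are literal even without backslashes.
        push @hits, [$., $_] for $line =~ /([\w?-]*\?[\w?-]*)/g;
    }
    return @hits;
}

# Open the string as a file; replace \$sample with a real path to scan a file.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
printf "line %d: %s\n", @$_ for find_damaged($fh);
close $fh;
```

Keeping the line number alongside each token may make it easier to review the damaged strings in context later.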


I don't envy whoever has the task of trying to make sensible back-substitutions for the lost characters.
 

Author Comment

by:hadrons
ID: 41897954
Hi, a few days ago I thought I hit the best solution button, but it may not have gone through. The solution worked great; thanks
