Solved

Regular Expression help needed to find and replace specific numbers in a text file.

Posted on 2016-07-18
Last Modified: 2016-07-20
I have a very large text file, hundreds of thousands of rows, that repeats the same 10 types of rows for each new record, where each row contains different information about that record.  The file is a combination of nearly 200 other text files that I merged so our employees don't have to deal with hundreds of files.  Combining the files means I have to renumber specific lines so that they are sequential.  Each row I need to find begins with HL.

Here's an example of a few rows I have to find and renumber:
HL*65*2*22*0~
HL*104*2*22*0~
HL*8*22*2*0~
HL*1052*2*0~

I have to replace the numeric value found after the first *, so 65, 104, 8 and 1052 in the examples shown above.  The rest of each string has to be left alone.

Does anyone know how to do this?

Note - I'll be doing this find and replace in a C#.NET console application.
Question by:fcsIT
LVL 32

Expert Comment

by:it_saige
ID: 41717683
These are EDI markers used to signify the Hierarchical Level, which are normally file specific and not unique (in other words, multiple files will use the same HLs).  The EDI specification uses them to mark where in the EDI hierarchy the information that follows belongs.  As such, I would first check whether the EDI specification you are using has a file/loop separation marker.  If it does, I would use that marker when joining the files.

-saige-
 
LVL 62

Expert Comment

by:Fernando Soto
ID: 41717689
How many values will be replaced in the file at any given time? Are the lines / rows in the file in any particular order, or are they unordered?
 

Author Comment

by:fcsIT
ID: 41717694
You are correct, these are Hierarchical Level markers for EDI files, specifically HIPAA 270 files, however there's not a separation marker available in the 270 spec that I can find.
 

Author Comment

by:fcsIT
ID: 41717699
The rows are in a very specific order and can't be modified or the file will be rejected by the state EDI system.  The number of rows that will be changed will vary slightly each time this is run, but it's roughly 10% of the total lines in the file.

The file I'm working with right now has almost 425,000 lines, so roughly 42,500 will be modified by the needed regular expression.
 
LVL 1

Expert Comment

by:MotKohn
ID: 41718149
^HL\*(\d+).*$

In multiline mode this captures the number in group 1, which you can then replace with a different number.
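
As a rough illustration of that idea (a sketch, not the poster's exact code): because the pattern matches the whole line, a plain replacement string has to put the rest of the line back, otherwise the whole match is overwritten. Here the pattern is reshaped into three groups so that `${1}` and `${3}` can restore the text around the counter. The class name GroupDemo and the fixed value 99 are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Same idea as the pattern above, split into three groups so the text
    // around the counter can be put back with ${1} and ${3}.
    private static readonly Regex HlLine =
        new Regex(@"^(HL\*)(\d+)(.*)$", RegexOptions.Multiline);

    // Replaces every HL counter with one fixed value (not yet sequential).
    public static string SetCounter(string text, int value)
    {
        return HlLine.Replace(text, "${1}" + value + "${3}");
    }

    public static void Main()
    {
        string sample = "HL*65*2*22*0~\nTRN*1*\nHL*104*2*22*0~";
        // Both HL lines now carry the counter 99; the TRN line is untouched.
        Console.WriteLine(SetCounter(sample, 99));
    }
}
```

A fixed replacement string like this gives every HL line the same number; writing a different number into each line needs a callback (a MatchEvaluator) or a plain loop over the lines.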
 

Author Comment

by:fcsIT
ID: 41719154
MotKohn, thank you for helping!  Unfortunately, this regex didn't do what I needed it to.

It found all of the HL lines, but instead of removing the line counter portion of them, and writing a new sequential value to them, it actually deleted all of the HL lines from the file.

Here's my code.  It's using Multiline mode.

File.WriteAllText(
file,
Regex.Replace(File.ReadAllText(file), @"^HL\*(\d+).*$", "", RegexOptions.Multiline)
);

 
LVL 62

Expert Comment

by:Fernando Soto
ID: 41719186
Hi fcsIT;

I do not believe that using Regular Expressions will do what you want it to do. Regular Expressions will find parts of strings that match a particular pattern. Regular Expressions will also find a pattern and replace that pattern with another value but it will not replace each one of the patterns with a different value.

Do you have, beforehand, a list of the rows that need to be updated and the values they need to be updated to?
 

Author Comment

by:fcsIT
ID: 41719195
I don't have a specific list of just the HL lines, no.

These are EDI files that are dumped out of a system we have into a very specific file format used by the government.  The HL lines are just one type of line in the file, along with a couple dozen other line types.  As for the value they need to be updated to, all that matters there is that they're sequential from the beginning of the file to the end, so HL*1, HL*2, HL*3, HL*4 etc.  It doesn't matter what value an HL line was when it comes to changing it to a different value, they just have to be in order.
 
LVL 62

Expert Comment

by:Fernando Soto
ID: 41719230
In your file, are all the HL*?? lines in sequential order, or are they stored in random order?
 

Author Comment

by:fcsIT
ID: 41719235
Well, when they're dumped out of the system, they're sequential, but since I've combined 190 files dumped from the system, they're no longer sequential (hence my problem).

The numbering now resets 190 times in my file, and ends at different numbers each time, based on how big each of the original 190 files was.
 
LVL 62

Expert Comment

by:Fernando Soto
ID: 41719264
Well, from your last post, the only way to add a new row with the next sequence number is to first order the HL*?? rows in the file, then go to the last row in the ordered list, increment that HL*?? value, assign it to the new line and store it at the end of the file.
 

Author Comment

by:fcsIT
ID: 41719295
Sorry, I don't think I'm explaining the need very well.

Here's a sample of the layout of the file:

ISA*00*
GS*HS*
ST*270*
BHT*0022*
HL*1**20*1~
NM1*PR2*
HL*2*1*21*1~
MN1*1P*2*
HL*3*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*4*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*5*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*1**20*1~
NM1*PR2*
HL*2*1*21*1~
MN1*1P*2*
HL*3*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*4*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*5*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*6*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~

And on and on.  (I abbreviated all of the lines except the HL ones.)

I don't need something that looks at the current HL*1, HL*2, etc values.  I need something that overwrites them, regardless of what they are, beginning with HL*3.  (The HL*1 and HL*2 are special lines that cannot be changed.)  I also don't need new HL lines.  What's needed is to overwrite the counter in the existing HL lines with a sequential number beginning with 3.

These are HIPAA EDI file formats.  There's nothing simple about them.
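
A rule like this is easy to check mechanically. The sketch below is only an illustration, assuming one segment per line as in the sample above and assuming HL*1 and HL*2 are the only special lines; the class and method names (HlValidator, FirstOutOfSequence) are made up for the example.

```csharp
using System;
using System.Collections.Generic;

public static class HlValidator
{
    // Returns the 1-based line number of the first HL line whose counter is
    // out of sequence, or 0 if every HL counter is in order.
    // Assumption: HL*1 and HL*2 are the special lines and are skipped.
    public static int FirstOutOfSequence(IEnumerable<string> lines)
    {
        int expected = 3;
        int lineNo = 0;
        foreach (string line in lines)
        {
            lineNo++;
            if (!line.StartsWith("HL*", StringComparison.Ordinal)) continue;
            int second = line.IndexOf('*', 3);
            if (second < 0) return lineNo;              // malformed HL line
            string counter = line.Substring(3, second - 3);
            if (counter == "1" || counter == "2") continue;
            if (counter != expected.ToString()) return lineNo;
            expected++;
        }
        return 0;
    }

    public static void Main()
    {
        string[] sample = { "HL*1**20*1~", "HL*2*1*21*1~", "HL*3*2*22*0~", "HL*5*2*22*0~" };
        Console.WriteLine(FirstOutOfSequence(sample)); // prints 4 (HL*5 where HL*4 was expected)
    }
}
```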
 
LVL 62

Expert Comment

by:Fernando Soto
ID: 41719312
Please explain as in an algorithm the steps you need to accomplish to achieve your goals. For example,

1. Find All HL*4XXXXX
2. Replace the digit 4 in every HL*4XXXXX with 5, so that all HL*4XXXXX lines now look like HL*5XXXXX
3. Save the file.
 

Author Comment

by:fcsIT
ID: 41719319
1. Find all HL* lines beginning with HL*3.
2. Replace the value found between the first and second asterisks (*) in each HL line with a new auto-incrementing value beginning with 3.
3. Save the file.


Note: There are no leading zeros for these values, so it will range from a single digit number up to four digit numbers.
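
The steps above can be sketched without a regular expression at all. This is a minimal sketch, assuming one segment per line as in the sample and assuming HL*1 and HL*2 are the special lines to leave alone; the class name HlRenumberer and the command-line arguments are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class HlRenumberer
{
    // Lazily rewrites the counter of each HL line, starting at 3.
    // HL*1 and HL*2 lines, and all non-HL lines, pass through unchanged.
    public static IEnumerable<string> Renumber(IEnumerable<string> lines)
    {
        int next = 3;
        foreach (string line in lines)
        {
            // Position of the second asterisk, or -1 if this is not an HL line.
            int second = line.StartsWith("HL*", StringComparison.Ordinal)
                ? line.IndexOf('*', 3) : -1;
            if (second < 0) { yield return line; continue; }
            string counter = line.Substring(3, second - 3);
            if (counter == "1" || counter == "2") { yield return line; continue; }
            yield return "HL*" + next++ + line.Substring(second);
        }
    }

    public static void Main(string[] args)
    {
        // Hypothetical usage: HlRenumberer.exe <input> <output>
        if (args.Length != 2) { Console.Error.WriteLine("usage: <input> <output>"); return; }
        File.WriteAllLines(args[1], Renumber(File.ReadLines(args[0])));
    }
}
```

Because File.ReadLines reads lazily, the input and output paths must be different files; the upside is that the whole file never has to sit in memory at once.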
 
LVL 62

Expert Comment

by:Fernando Soto
ID: 41719340
So in step 2 the auto-incrementing number starts with 3 and increments by 1 each time until you have no more rows to renumber. Correct? So from your example it will end up looking as follows.
ISA*00*
GS*HS*
ST*270*
BHT*0022*
HL*1**20*1~
NM1*PR2*
HL*2*1*21*1~
MN1*1P*2*
HL*3*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*4*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*5*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*1**20*1~
NM1*PR2*
HL*2*1*21*1~
MN1*1P*2*
HL*6*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*7*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*8*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*9*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~

 

Author Comment

by:fcsIT
ID: 41719351
You are correct.
 
LVL 1

Expert Comment

by:MotKohn
ID: 41719661
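Warning: I did not debug this, but it should give you the idea:
        int x = 3; // next sequence number; HL*1 and HL*2 are left untouched
        public void doReplace()
        {
            // 'file' is assumed to hold the path of the combined EDI file
            string txt = File.ReadAllText(file);
            txt = Regex.Replace(txt, @"^HL\*(\d+).*$", new MatchEvaluator(this.m), RegexOptions.Multiline);
            File.WriteAllText(file, txt);
        }
        private string m(Match match)
        {
            // Leave the special HL*1 and HL*2 lines as they are
            string counter = match.Groups[1].Value;
            if (counter == "1" || counter == "2") return match.Value;
            // Group indexes are relative to the whole input, so offset them by match.Index
            int start = match.Groups[1].Index - match.Index;
            string s = match.Value.Remove(start, match.Groups[1].Length);
            return s.Insert(start, (x++).ToString());
        }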

 
LVL 62

Accepted Solution

by:
Fernando Soto earned 500 total points
ID: 41719740
Hi fcsIT;

The following code should do what you need.
private List<StringBuilder> HIPAA = new List<StringBuilder>();
private List<StringBuilder> HL_rows;

// Load the lines from your file into the List<StringBuilder> HIPAA each line as a StringBuilder object.
File.ReadLines( "C:/Working Directory/HIPAA-EDI.txt" ).ToList().ForEach(r => HIPAA.Add(new StringBuilder().Append(r)));
// Find all the lines that need to be modified and load them into HL_rows
HL_rows = HIPAA.Where( r =>
            r.ToString().StartsWith( "HL*" ) &&
            ( r.ToString( ).Substring( 2, 3 ) != "*1*" && r.ToString( ).Substring( 2, 3 ) != "*2*" ) 
        ).ToList( );

// keeps track of the next sequence number to use.
var seqNo = 3;
// Resequence all the found lines
for ( int row = 0; row < HL_rows.Count; row++ ) {
    int secondAsterisk = HL_rows[row].ToString().IndexOf( "*", 3 );
    HL_rows[row].Remove( 3, secondAsterisk - 3 ).Insert( 3, seqNo.ToString( ) );
    seqNo++;
}

// Write all the lines back to a new file; the using block flushes and closes the writer
using ( StreamWriter sw = new StreamWriter( "C:/Working Directory/HIPAA-EDI-New.txt" ) ) {
    HIPAA.ForEach( r => sw.WriteLine( r.ToString( ) ) );
}

 

Author Closing Comment

by:fcsIT
ID: 41719772
You NAILED it!  I'd buy you lunch if you were here.  Thank you so much!
 
LVL 62

Expert Comment

by:Fernando Soto
ID: 41719778
Not a problem fcsIT, glad I was able to help. Have a great day.
 
LVL 1

Expert Comment

by:MotKohn
ID: 41721692
This is a simpler version, if anybody is interested. Start x at 3; the (?![12]\*) lookahead leaves the special HL*1 and HL*2 lines alone:
txt = Regex.Replace(txt, @"^(HL\*)(?![12]\*)(\d+)(.*)$", new MatchEvaluator(match => match.Groups[1].Value + x++ + match.Groups[3].Value), RegexOptions.Multiline);

