fcsIT asked:

Regular Expression help needed to find and replace specific numbers in a text file.

I have a very large text file, hundreds of thousands of rows, that repeats the same 10 types of rows for each new record, where each row contains different information about that record.  The file is a combination of nearly 200 other text files that I merged so our employees would not have to deal with hundreds of files.  Combining the files means I have to renumber specific lines so that they are sequential.  Each row I need to find begins with HL.

Here's an example of a few rows I have to find and renumber:
HL*65*2*22*0~
HL*104*2*22*0~
HL*8*22*2*0~
HL*1052*2*0~

I have to replace the numeric value found after the first *, so 65, 104, 8 and 1052 in the examples shown above.  The rest of each string has to be left alone.

Does anyone know how to do this?

Note - I'll be doing this find and replace in a C#.NET console application.
it_saige

These are EDI markers used to signify the Hierarchical Level, which are normally file-specific and not unique (in other words, multiple files will use the same HLs).  The EDI specification uses them to mark where in the EDI hierarchy the information that follows can be found.  As such, I personally would first check whether the EDI specification you are using has a file/loop separation marker.  If it does, I would use that marker when joining the files.

-saige-
How many values will be replaced in the file at any given time? Are the lines/rows in the file in any particular order, or are they unordered?
fcsIT

ASKER

You are correct, these are Hierarchical Level markers for EDI files, specifically HIPAA 270 files; however, there's no separation marker available in the 270 spec that I can find.
fcsIT

ASKER

The rows are in a very specific order and can't be modified or the file will be rejected by the state EDI system.  The number of rows that will be changed will vary slightly each time this is run, but it's roughly 10% of the total lines in the file.

The file I'm working with right now has almost 425,000 lines, so roughly 42,500 will be modified by the needed regular expression.
MotKohn

^HL\*(\d+).*$


In multiline mode, this matches each HL line and captures the number in group 1, which you can then replace with a different number.
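
As a rough illustration (not part of the original reply), the sketch below shows what this pattern picks up on one of the sample lines. The class name `HlPatternDemo` and the method `Describe` are made up for the example. Note that the match itself spans the whole line, while group 1 holds only the counter.

```csharp
using System.Text.RegularExpressions;

public static class HlPatternDemo
{
    // Returns "<whole match>|<group 1>" for the first HL line found,
    // or null when the text contains no HL line.
    public static string Describe(string text)
    {
        Match m = Regex.Match(text, @"^HL\*(\d+).*$", RegexOptions.Multiline);
        return m.Success ? m.Value + "|" + m.Groups[1].Value : null;
    }
}
```

Because the whole line is part of the match, whatever is passed as the replacement takes the place of the entire line, not just the number.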
fcsIT

ASKER

MotKohn, thank you for helping!  Unfortunately, this regex didn't do what I needed it to.

It found all of the HL lines, but instead of removing the line counter portion of them, and writing a new sequential value to them, it actually deleted all of the HL lines from the file.

Here's my code.  It's using Multiline mode.

File.WriteAllText(
file,
Regex.Replace(File.ReadAllText(file), @"^HL\*(\d+).*$", "", RegexOptions.Multiline)
);
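
A likely explanation, sketched under assumptions: the pattern matches the entire HL line, so an empty replacement string wipes out all of that text rather than just the counter. The hypothetical helper below (the names `FixedReplaceDemo` and `ReplaceCounter` are illustrative only) captures the text on either side of the counter and puts it back.

```csharp
using System.Text.RegularExpressions;

public static class FixedReplaceDemo
{
    // Rewrites the counter on every HL line to the same fixed value,
    // keeping the "HL*" prefix (group 1) and the rest of the line (group 2).
    public static string ReplaceCounter(string text, int value)
    {
        return Regex.Replace(
            text,
            @"^(HL\*)\d+(.*)$",
            "${1}" + value + "${2}",   // braces keep "$1" from running into the digits
            RegexOptions.Multiline);
    }
}
```

A replacement string like this writes the same value into every HL line, so it does not produce sequential numbers by itself.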


Hi fcsIT;

I do not believe that using regular expressions will do what you want. Regular expressions find the parts of a string that match a particular pattern. They can also replace every match of a pattern with another value, but a plain replacement string will not replace each match with a different value.

Do you have beforehand a list of the rows that need to be updated and the values they need to be updated to?
fcsIT

ASKER

I don't have a specific list of just the HL lines, no.

These are EDI files that are dumped out of a system we have into a very specific file format used by the government.  The HL lines are just one type of line in the file, along with a couple dozen other line types.  As for the value they need to be updated to, all that matters there is that they're sequential from the beginning of the file to the end, so HL*1, HL*2, HL*3, HL*4 etc.  It doesn't matter what value an HL line was when it comes to changing it to a different value, they just have to be in order.
In your file, are all the HL*?? lines in sequential order, or are they stored in random order?
fcsIT

ASKER

Well, when they're dumped out of the system, they're sequential, but since I've combined 190 files dumped from the system, they're no longer sequential (hence my problem).

The numbering now resets 190 times in my file, and ends at different numbers each time, based on how big each of the original 190 files was.
Well, from your last post, the only way to add a new row with the next sequence number is to first order the HL*?? rows in the file, then go to the last row in the ordered list, increment that HL*?? value, assign it to the new line, and store it at the end of the file.
fcsIT

ASKER

Sorry, I'm not explaining the need very well I don't think.

Here's a sample of the layout of the file:

ISA*00*
GS*HS*
ST*270*
BHT*0022*
HL*1**20*1~
NM1*PR2*
HL*2*1*21*1~
MN1*1P*2*
HL*3*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*4*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*5*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*1**20*1~
NM1*PR2*
HL*2*1*21*1~
MN1*1P*2*
HL*3*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*4*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*5*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*6*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~

And on and on.  (I abbreviated all of the lines except the HL ones.)

I don't need something that looks at the current HL*1, HL*2, etc values.  I need something that overwrites them, regardless of what they are, beginning with HL*3.  (The HL*1 and HL*2 are special lines that cannot be changed.)  I also don't need new HL lines.  What's needed is to overwrite the counter in the existing HL lines with a sequential number beginning with 3.

These are HIPAA EDI file formats.  There's nothing simple about them.
Please explain, as an algorithm, the steps you need to carry out to achieve your goal. For example:

1. Find all HL*4XXXXX lines.
2. Replace the digit 4 in every HL*4XXXXX with 5, so that all HL*4XXXXX lines now look like HL*5XXXXX.
3. Save the file.
fcsIT

ASKER

1. Find all HL* lines beginning with HL*3.
2. Replace the value found between the first and second asterisks (*) in each HL line with a new auto-incrementing value beginning with 3.
3. Save the file.


Note: There are no leading zeros for these values, so it will range from a single digit number up to four digit numbers.
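
The three steps above could be sketched roughly as follows. This is an illustration under assumptions, not the accepted solution: the names `HlRenumber`, `Renumber` and `RenumberFile` are invented, the file is assumed to fit in memory, and HL lines numbered below 3 are assumed to be the special lines that must stay as they are.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class HlRenumber
{
    // "HL*" at the start of a line, then the counter, which must be
    // followed by the second asterisk. The rest of the line is not matched.
    private static readonly Regex HlCounter =
        new Regex(@"^(HL\*)([0-9]+)(?=\*)", RegexOptions.Multiline);

    // Steps 1 and 2: overwrite the counter of every HL line whose current
    // value is at least `start` with a running number beginning at `start`.
    public static string Renumber(string text, int start = 3)
    {
        int next = start;
        return HlCounter.Replace(text, m =>
        {
            // Assumption: HL*1 and HL*2 are the special lines left untouched.
            if (int.Parse(m.Groups[2].Value) < start) return m.Value;
            return m.Groups[1].Value + next++;
        });
    }

    // Step 3: read the file, renumber it, and save it back in place.
    public static void RenumberFile(string path)
    {
        File.WriteAllText(path, Renumber(File.ReadAllText(path)));
    }
}
```

Only the `HL*` prefix and the counter are part of the match, so everything after the second asterisk is left exactly as it was. The old counter is read only to decide whether a line is one of the protected ones; its value never affects the new number.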
So in step 2 the auto-incrementing number starts with 3 and increments by 1 each time until you have no more rows to renumber, correct? If so, your example will end up looking as follows.
ISA*00*
GS*HS*
ST*270*
BHT*0022*
HL*1**20*1~
NM1*PR2*
HL*2*1*21*1~
MN1*1P*2*
HL*3*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*4*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*5*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*1**20*1~
NM1*PR2*
HL*2*1*21*1~
MN1*1P*2*
HL*6*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*7*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*8*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~
HL*9*2*22*0~
TRN*1*
MN1*IL*
REF*SY*
N3*
N4*
DMG*D8*
DTP*291
EQ*30~
III*ZZ*53~


fcsIT

ASKER

You are correct.
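Warning: I did not debug this; it is just to give an idea:
        // Next number to write. Note: as written, every HL line is renumbered from 1,
        // including the HL*1 and HL*2 lines.
        int x = 1;
        public void doReplace()
        {
            string txt = File.ReadAllText(file);
            txt = Regex.Replace(txt, @"^HL\*(\d+).*$", new MatchEvaluator(this.m), RegexOptions.Multiline);
            File.WriteAllText(file, txt);
        }
        private string m(Match match)
        {
            // Group indexes are relative to the whole input text, so convert
            // to an offset within the matched line before editing it.
            int offset = match.Groups[1].Index - match.Index;
            string s = match.Value;
            s = s.Remove(offset, match.Groups[1].Length);
            s = s.Insert(offset, x++.ToString());
            return s;
        }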


ASKER CERTIFIED SOLUTION
Fernando Soto

This solution is only available to members.
fcsIT

ASKER

You NAILED it!  I'd buy you lunch if you were here.  Thank you so much!
Not a problem fcsIT, glad I was able to help. Have a great day.
This is a simpler version if anybody is interested:
txt = Regex.Replace(txt, @"^(HL\*)(\d+)(.*)$", new MatchEvaluator(match => match.Groups[1].Value + x++ + match.Groups[3].Value), RegexOptions.Multiline);
