doc_jay

format file with sed

Hey All - just need some help with sed please?

I've almost got my output file where I would like it to be.  Here is what I am starting with:

(0010,0010) PN [TESTERMAN^TESTERMAN^^^]                       #  16, 1 PatientName
(0010,0020) LO [10026487]                               #   8, 1 PatientID
(0010,0030) DA [18750101]                               #   8, 1 PatientBirthDate
(0008,0020) DA [20130305]                               #   8, 1 StudyDate
(0008,0050) SH [1074110]                                #   8, 1 AccessionNumber
(0010,0040) CS [M]                                      #   2, 1 PatientSex

(0010,0010) PN [TESTERMAN^TESTERMAN^^^]                       #  16, 1 PatientName
(0010,0020) LO [10026487]                               #   8, 1 PatientID
(0010,0030) DA [18750101]                               #   8, 1 PatientBirthDate
(0008,0020) DA [20130305]                               #   8, 1 StudyDate
(0008,0050) SH [1074110]                                #   8, 1 AccessionNumber
(0010,0040) CS [M]                                      #   2, 1 PatientSex

(0010,0010) PN [TESTERMAN^TESTERMAN^^^]                       #  16, 1 PatientName
(0010,0020) LO [10026487]                               #   8, 1 PatientID
(0010,0030) DA [18750101]                               #   8, 1 PatientBirthDate
(0008,0020) DA [20130305]                               #   8, 1 StudyDate
(0008,0050) SH [1074110]                                #   8, 1 AccessionNumber
(0010,0040) CS [M]                                      #   2, 1 PatientSex

(0010,0010) PN [TESTERMAN^TESTERMAN^^^]                       #  16, 1 PatientName
(0010,0020) LO [10026487]                               #   8, 1 PatientID
(0010,0030) DA [18750101]                               #   8, 1 PatientBirthDate
(0008,0020) DA [20130305]                               #   8, 1 StudyDate
(0008,0050) SH [1074110]                                #   8, 1 AccessionNumber
(0010,0040) CS [M]                                      #   2, 1 PatientSex

(0010,0010) PN [TESTERMAN^TESTERMAN^^^]                       #  16, 1 PatientName
(0010,0020) LO [10026487]                               #   8, 1 PatientID
(0010,0030) DA [18750101]                               #   8, 1 PatientBirthDate
(0008,0020) DA [20130305]                               #   8, 1 StudyDate
(0008,0050) SH [1074110]                                #   8, 1 AccessionNumber
(0010,0040) CS [M]                                      #   2, 1 PatientSex

(0010,0010) PN [TESTERMAN^TESTERMAN^^^]                       #  16, 1 PatientName
(0010,0020) LO [10026487]                               #   8, 1 PatientID
(0010,0030) DA [18750101]                               #   8, 1 PatientBirthDate
(0008,0020) DA [20130305]                               #   8, 1 StudyDate
(0008,0050) SH [1074110]                                #   8, 1 AccessionNumber
(0010,0040) CS [M]                                      #   2, 1 PatientSex

(0010,0010) PN [TESTERMAN^TESTERMAN^^^]                       #  16, 1 PatientName
(0010,0020) LO [10026487]                               #   8, 1 PatientID
(0010,0030) DA [18750101]                               #   8, 1 PatientBirthDate
(0008,0020) DA [20130305]                               #   8, 1 StudyDate
(0008,0050) SH [1074110]                                #   8, 1 AccessionNumber
(0010,0040) CS [M]                                      #   2, 1 PatientSex

(0010,0010) PN [TESTERMAN^TESTERMAN^^^]                       #  16, 1 PatientName
(0010,0020) LO [10026487]                               #   8, 1 PatientID
(0010,0030) DA [18750101]                               #   8, 1 PatientBirthDate
(0008,0020) DA [20130305]                               #   8, 1 StudyDate
(0008,0050) SH [1074110]                                #   8, 1 AccessionNumber
(0010,0040) CS [M]                                      #   2, 1 PatientSex
E: DcmElement: Unknown Tag & Data (3028,3130) larger (808463408) than remaining bytes in file
E: dcmdump: I/O suspension or premature end of stream: reading file: d:\import\output.txt


Notice that this just repeats; I only want one copy of each unique line.  It could repeat more times than shown here, depending on what I run my script against to generate this text.

This is what I would like for it to end up displaying as (but with two 'tabs' after the field description so that everything is spaced correctly):

StudyDate		20130305
AccessionNumber	1074110
PatientName	TESTERMAN^TESTERMAN^^^
PatientID		10026487
PatientBirthDate	18750101
PatientSex		M


...& here is the code I am currently using:

sort output.txt | uniq | sed /E:/d | sed -e "/\[/s/.*\[\(.*\)\]/\1/" -e "s/^\(.*(no value available)\)$//" > test.txt


this is what I am currently getting back when I run the above code:

20130305                               #   8, 1 StudyDate
1074110                                #   8, 1 AccessionNumber
TESTERMAN^TESTERMAN^^^                       #  16, 1 PatientName
10026487                               #   8, 1 PatientID
18750101                               #   8, 1 PatientBirthDate
F                                      #   2, 1 PatientSex


I am currently running the above code in Windows with GnuWin32 (CoreUtils for Windows).  I could also run this in Cygwin, but it will have to be executed on a PC that quite a few people will be using for the same task.
Gerwin Jansen

Seems like you want to replace a pattern with one or more spaces by 2 tab characters :)

Like adding this at the end:

| sed 's/[ ][ ]*/\t\t/'
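For reference, a small sketch of what this substitution does to one line. It assumes GNU sed (where \t in the replacement means a tab) and GNU cat for the -A option. Without the g flag only the first run of spaces on the line is replaced:

```shell
# One line of the partially processed text (value, padding, then the comment).
line='20130305                               #   8, 1 StudyDate'

# Without the g flag, only the first run of spaces becomes two tabs.
first_only=$(printf '%s\n' "$line" | sed 's/[ ][ ]*/\t\t/')

# With the g flag, every run of spaces becomes two tabs.
every_run=$(printf '%s\n' "$line" | sed 's/[ ][ ]*/\t\t/g')

# cat -A (GNU) prints each tab as ^I and marks the end of the line with $.
printf '%s\n' "$first_only" | cat -A
printf '%s\n' "$every_run"  | cat -A
```

Either way the value stays on the left and the field name on the right, so this step alone does not swap the two sides.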

doc_jay

ASKER

Thanks, but the above code only put two tabs in front of everything on the left.  I would like it all to be aligned to the left side.  Also, I want the right side of each line swapped with the left side so that it ends up looking like:

StudyDate		20130305
AccessionNumber	1074110
PatientName	TESTERMAN^TESTERMAN^^^
PatientID		10026487
PatientBirthDate	18750101
PatientSex		M

Gerwin Jansen

I see, try this:
sort -u output.txt  | grep -v "^[ ]*$" | grep -v "^E:" | sed 's/\[//g;s/\]//g' | awk '{ if (length($7) > 15) print $7 "\t" $1; else print $7 "\t\t" $1 }'


Or do you want a sed only version?
doc_jay

ASKER

I'm running this in a windows shell right now and it comes back with:

Input file specified two times.

awk: '{
awk: ^ invalid char ''' in expression

Do you mind posting a sed only version?
Gerwin Jansen

Can you try this first:

awk: "{ ... }"

(so a double quote instead of a single quote)
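If the quoting keeps fighting the Windows shell, one way around it is to keep the awk program in its own file and run it with awk -f, so that no quotes have to survive the command line. This is only a sketch, assuming a POSIX-style awk such as gawk from GnuWin32 or Cygwin. The file names swap.awk and sample.txt are made up for illustration, and the script prints field 3 (the value) rather than field 1, which holds the tag. It also drops duplicate lines itself, so sort is not needed:

```shell
# Work in a scratch directory so nothing is overwritten.
tmp=$(mktemp -d)

# The awk program lives in its own file (swap.awk is a hypothetical name).
cat > "$tmp/swap.awk" <<'EOF'
# Skip blank lines and the dcmdump error lines.
/^[ \t\r]*$/ || /^E:/ { next }
# Skip lines that were already seen (stands in for sort -u).
seen[$0]++ { next }
{
    gsub(/\[|\]/, "")    # drop the square brackets; awk re-splits the fields
    # $7 is the field name, $3 the value ($1 would be the tag).
    sep = (length($7) > 15) ? "\t" : "\t\t"
    print $7 sep $3
}
EOF

# Sample input in the same layout as output.txt.
printf '%s\n' \
  '(0010,0020) LO [10026487]                               #   8, 1 PatientID' \
  '(0010,0030) DA [18750101]                               #   8, 1 PatientBirthDate' \
  '' \
  '(0010,0020) LO [10026487]                               #   8, 1 PatientID' \
  'E: dcmdump: I/O suspension or premature end of stream' \
  > "$tmp/sample.txt"

result=$(awk -f "$tmp/swap.awk" "$tmp/sample.txt")
printf '%s\n' "$result"
```

On Windows the program file can be saved with any editor and run as awk -f swap.awk output.txt. Lines come out in first-seen order instead of sorted order.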
doc_jay

ASKER

That didn't seem to work either.  Here is the output:

D:\import>sort -u output.txt  | grep -v "^[ ]*$" | grep -v "^E:" | sed 's/\[//g;
s/\]//g' | awk "{ if (length($7) > 15) print $7 "\t" $1; else print $7 "\t\t" $1
 }"
Input file specified two times.

awk: { if (length($7) > 15) print $7 \t $1; else print $7 \t\t $1 }
awk:                                 ^ backslash not last character on line

Gerwin Jansen

I don't have Cygwin atm, so I can't really test that. I've got a sed suggestion for you:

sort -u output.txt  | sed "/^[ ]*$/d;/^E:/d;s/\[//g;s/\]//g" | sed "s/  */ /g" | sed "s/\(.* \).* .* .* .* .* \(.*\)/\2\t\1/"

Output:

StudyDate	(0008,0020) 
AccessionNumber	(0008,0050) 
PatientName	(0010,0010) 
PatientID	(0010,0020) 
PatientBirthDate	(0010,0030) 
PatientSex	(0010,0040) 


The first sed removes unwanted lines, the second one replaces runs of spaces with a single space, and the third one prints the last and the first field with a tab as a separator. Choosing one or two tabs depending on the length of the first field is not really practical in sed. If you can use expand, for example, you can align like this:

sort -u output.txt  | sed "/^[ ]*$/d;/^E:/d;s/\[//g;s/\]//g" | sed "s/  */ /g" | sed "s/\(.* \).* .* .* .* .* \(.*\)/\2\t\1/" | expand -t 20


Output:

StudyDate           (0008,0020) 
AccessionNumber     (0008,0050) 
PatientName         (0010,0010) 
PatientID           (0010,0020) 
PatientBirthDate    (0010,0030) 
PatientSex          (0010,0040) 
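To see which position each piece of a line ends up in once the brackets are gone and the spaces are squeezed, it can help to number the fields of one cleaned-up line. A small sketch, assuming GNU sed plus the standard tr and cat tools:

```shell
line='(0010,0020) LO [10026487]                               #   8, 1 PatientID'

# Stages 1 and 2 of the pipeline: strip the brackets, squeeze runs of spaces.
clean=$(printf '%s\n' "$line" | sed 's/\[//g;s/\]//g' | sed 's/  */ /g')
printf '%s\n' "$clean"

# One field per line, numbered, to show the position of each piece.
printf '%s\n' "$clean" | tr ' ' '\n' | cat -n
```

The tag sits in position 1, the value in position 3, and the field name in position 7, so the capture groups in the last sed have to be placed on the third and seventh pieces to pick up the value and the name.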

doc_jay

ASKER

thanks -

  I tried your first 'sed' suggestion from your last post and it is removing the wrong info and leaving info that I would like stripped away.

I would like this to be left in the output file:

StudyDate		20130305
AccessionNumber	1074110
PatientName	TESTERMAN^TESTERMAN^^^
PatientID		10026487
PatientBirthDate	18750101
PatientSex		M


instead I am left with:  

StudyDate           (0008,0020) 
AccessionNumber     (0008,0050) 
PatientName         (0010,0010) 
PatientID           (0010,0020) 
PatientBirthDate    (0010,0030) 
PatientSex          (0010,0040) 


--also I needed to remove the '-u' option from sort for it to work in cygwin.

As for your last example to try, I can't use the last command 'expand' with the '-t' option.  Is 'expand' a linux tool that I can get?
Gerwin Jansen

Ah, got the wrong field :)

About expand - it is a standard Linux tool: you should be able to add it to cygwin using the setup.exe of cygwin itself.

This is getting you the correct field:
sort -u sample.txt  | sed "/^[ ]*$/d;/^E:/d;s/\[//g;s/\]//g" | sed "s/  */ /g" | sed "s/.* .* \(.* \).* .* .* \(.*\)/\2\t\1/" | expand -t 20
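As a quick illustration of the last step: expand (from GNU coreutils) turns each tab into spaces, and -t 20 sets a tab stop every 20 columns, so the value always starts in the same column as long as the name is shorter than 20 characters. A minimal sketch:

```shell
# A name and a value separated by a single tab.
row=$(printf 'PatientID\t10026487')

# expand -t 20 pads with spaces up to the next multiple of 20 columns.
aligned=$(printf '%s\n' "$row" | expand -t 20)
printf '%s\n' "$aligned"
```

A name of 20 characters or more would push its value to the next stop at column 40.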


sort -u -> this is a unique sort, you can replace it with:

sort | uniq
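A tiny sketch showing that the two forms give the same result. Note that uniq only removes adjacent duplicates, which is why the sort has to come first:

```shell
# Three lines, two of them identical.
sample=$(printf '%s\n' 'b line' 'a line' 'b line')

with_u=$(printf '%s\n' "$sample" | sort -u)
with_uniq=$(printf '%s\n' "$sample" | sort | uniq)

printf '%s\n' "$with_u"
printf '%s\n' "$with_uniq"
```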
doc_jay

ASKER

thanks - its almost there, here is the output:

StudyDate
          20130305 
AccessionNumber
    1074110 
PatientName
        HYDE^JECKYL^^^ 
PatientID
          10026XXX 
PatientBirthDate
   19011231 
PatientSex
         M 


except in notepad (for windows) it is all displayed on one line.
Gerwin Jansen

It seems the second pattern matched includes the newline, can you try this:
sort -u sample.txt  | sed "/^[ ]*$/d;/^E:/d;s/\[//g;s/\]//g" | sed "s/  */ /g" | sed "s/.* .* \(.* \).* .* .* \(.*\)$/\2\t\1/" | expand -t 20
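One possible explanation, which is only a guess and not confirmed in the thread: if output.txt has Windows (CRLF) line endings, the carriage return stays at the end of each line and ends up inside the last capture group, in front of the tab. The sketch below reproduces that and removes the carriage returns with tr -d '\r' first. It assumes GNU sed and GNU cat, and swap_fields is a made-up helper name:

```shell
# One input line ending in a carriage return, as a Windows program might write it.
crlf_line=$(printf '(0008,0020) DA [20130305]      #   8, 1 StudyDate\r')

# The three sed stages from the thread, wrapped in a helper (hypothetical name).
swap_fields() {
    sed 's/\[//g;s/\]//g' | sed 's/  */ /g' |
    sed 's/.* .* \(.* \).* .* .* \(.*\)/\2\t\1/'
}

# Without stripping: the CR is glued to "StudyDate", ahead of the tab.
broken=$(printf '%s\n' "$crlf_line" | swap_fields)
printf '%s\n' "$broken" | cat -A

# With tr -d '\r' in front, name and value stay on one line.
# (The value keeps one trailing space from the first capture group.)
fixed=$(printf '%s\n' "$crlf_line" | tr -d '\r' | swap_fields)
printf '%s\n' "$fixed" | cat -A
```

In the cat -A output a carriage return shows up as ^M and a tab as ^I, which makes the difference between the two results easy to spot.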

doc_jay

ASKER

no luck, now there is a new line after each word or 'entry':

StudyDate
          20130305 
AccessionNumber
    10741103 
PatientName
        TESTERMAN^TESTERMAN^^^ 
PatientID
          100263487 
PatientBirthDate
   18750101 
PatientSex
         M

Gerwin Jansen

Pls post (attach file) your output.txt - I'll check later with Cygwin

<edit>

I checked with cygwin, output looks OK to me:

$ head output.txt
(0010,0010) PN [TESTERMAN^TESTERMAN^^^]                       #  16, 1 PatientName
(0010,0020) LO [10026487]                               #   8, 1 PatientID
(0010,0030) DA [18750101]                               #   8, 1 PatientBirthDate
(0008,0020) DA [20130305]                               #   8, 1 StudyDate
(0008,0050) SH [1074110]                                #   8, 1 AccessionNumber
(0010,0040) CS [M]                                      #   2, 1 PatientSex

(0010,0010) PN [TESTERMAN^TESTERMAN^^^]                       #  16, 1 PatientName
(0010,0020) LO [10026487]                               #   8, 1 PatientID
(0010,0030) DA [18750101]                               #   8, 1 PatientBirthDate

user@host ~
$ sort -u output.txt  | sed "/^[ ]*$/d;/^E:/d;s/\[//g;s/\]//g" | sed "s/  */ /g" | sed "s/.* .* \(.* \).* .* .* \(.*\)/\2\t\1/" | expand -t 20
StudyDate           20130305
AccessionNumber     1074110
PatientName         TESTERMAN^TESTERMAN^^^
PatientID           10026487
PatientBirthDate    18750101
PatientSex          M

user@host ~

So how are you running this in Cygwin and how do you get the output into Notepad?
doc_jay

ASKER

@gerwinjansen -

  sorry - it looked like I missed this sed
 sed "s/  */ /g"


--after the 'expand' command I am doing
 > output_test.txt


This is a SS of how I am viewing the final file with Notepad++

User generated image
Gerwin Jansen

Can you answer my 2 questions from above? I checked in Cygwin and my output file has the 2 fields on the same line. Note that your screenshot shows 2 tabs on the 'next' line, where my sed command inserts just one.
doc_jay

ASKER

This is what I am running in cygwin:

$ sort -u output.txt  | sed "/^[ ]*$/d;/^E:/d;s/\[//g;s/\]//g" | sed "s/  */ /g" | sed "s/  */ /g" | sed "s/.* .* \(.* \).* .* .* \(.*\)$/\2\t\1/" | expand -t 20 > test_output.txt


I get the output into notepad++ by using ">test_output.txt"

I have also attached the test_output.txt file that is generated.
test-output.txt
Gerwin Jansen

Thanks, please post your 'input' file as well. You called it 'output.txt'; this is what I asked for as well.
doc_jay

ASKER

Here it is for you.  It's called output.txt because this info is being generated from another source.

thanks
output.txt
ASKER CERTIFIED SOLUTION
Gerwin Jansen

[The text of this solution is not included here; it is only available to Experts Exchange members.]
doc_jay

ASKER

Great!  I'll give this a shot 1st thing in the morning and hopefully I can get the whole process to flow.  End result is to get this emailed to myself and a co-worker when a remote site rips a CD with dicom to send to us.
doc_jay

ASKER

So, this works great in cygwin.  I made a script to do this and I can run it and it creates the  new 'output' file within cygwin.  My ultimate goal is to run this from a command prompt in windows, which shouldn't be a problem if I just call bash from a command line to run the unix script.  

Here is where I am hitting a wall, when I run my .bat file:

echo
SET dcmtk=d:\apps\dcmtk\bin
%dcmtk%\dcmdump d:\import +sd +r -s +P  "0010,0010" +P "0010,0020" +P "0010,0030" +P "0008,0020" +P "0008,0050" +P "0010,0040" > d:\import\output.txt
c:\cygwin\bin\bash dcmdump_out_format_script.sh


my 'dcmdump_out_format_script.sh'

#!/bin/bash
sort -u /cygdrive/d/import/output.txt  | sed "/^[ ]*$/d;/^E:/d;s/\[//g;s/\]//g" | sed "s/  */ /g" | sed "s/  */ /g" | sed "s/.* .* \(.* \).* .* .* \(.*\)$/\2\t\1/;s/$/\r/" | expand -t 20 > /cygdrive/d/import/test_output.txt


it errors out with:  Invalid switch

and it also displays in my 'test_output.txt' file below

Microsoft (R) File Expansion Utility  Version 5.1.2600.0
Copyright (C) Microsoft Corp 1990-1999.  All rights reserved.

Unrecognized switch -t.


I don't know how to tell  it to run the 'expand' tool from cygwin instead of the microsoft tool!

any ideas?

thanks for all of your help by the way!!
Gerwin Jansen

Ah, it should be a path issue: the Windows directory containing 'expand' comes before the Cygwin directory that has expand in your PATH.  You could use the full path to your Cygwin expand, or try copying the Cygwin expand to expand1, for example, and use that name in the sed line above. Let me know if you need help with that.

<edit>

I checked, copying cygwin\bin\expand.exe to expand1.exe works the way I intended. You can try and verify.
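Another option, sketched here as an assumption rather than something tested in this thread, is to put the Unix-style tool directories at the front of PATH inside the script itself. In a default Cygwin install /usr/bin maps to c:\cygwin\bin, so sort and expand would then resolve to the Cygwin versions. command -v shows which file a name resolves to, and format_dump is a made-up function name:

```shell
#!/bin/bash
# Put the Unix-style tool directories first so that sort and expand
# resolve to the Cygwin versions rather than the Windows ones.
PATH="/usr/bin:/bin:$PATH"
export PATH

# Show which executables will actually run.
command -v sort
command -v expand

# The formatting step from the thread, reading stdin (hypothetical helper name).
format_dump() {
    sort -u |
    sed '/^[ ]*$/d;/^E:/d;s/\[//g;s/\]//g' | sed 's/  */ /g' |
    sed 's/.* .* \(.* \).* .* .* \(.*\)$/\2\t\1/' | expand -t 20
}

# Try it on one duplicated sample line.
sample='(0010,0040) CS [M]                                      #   2, 1 PatientSex'
formatted=$(printf '%s\n' "$sample" "$sample" | format_dump)
printf '%s\n' "$formatted"
```

In the real script the function would read the dump file, for example format_dump < /cygdrive/d/import/output.txt > /cygdrive/d/import/test_output.txt.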
doc_jay

ASKER

Thanks for your help on this - I got this running through a command prompt in windows with some different code.  I kept on getting a lot of 'invalid switch' when I would use bash from windows command prompt, even though it ran fine within a cygwin console.

I ended up following the suggestion here.

points on the way & thank you again!
doc_jay

ASKER

excellent work!
Gerwin Jansen

Thanks ;)