Solved

C# Concatenated text or tiff files

Posted on 2014-01-07
Last Modified: 2014-04-24
This is a unique project. We are trying to separate a concatenated TIFF or text file. When we read the file, we only get the first image even though more are present. This is not a multi-page TIFF file; it is a concatenated TIFF file. Is there a special character we must use to move on to the next file? When I look at the file in Notepad++, I can see all the binary data, with each file starting with "II*". All help is appreciated.
Question by:rkspence
16 Comments
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 39763131
I'm confused:  Where does the text file come into play?
 

Author Comment

by:rkspence
ID: 39763175
When you view the tiff file in binary format it acts just like a text file.
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 39763200
Ah, well that's a function of the text editor you are viewing the file with, not of the file itself.

According to Wikipedia, the "magic bytes" for TIFF files can come in two varieties (textually):  II*. and MM.*. If these files were truly concatenated together at a binary level, then you should be able to loop over the bytes until you find these markers. Once you find one, you are at the start of the next file.

When working with binary files (like images), you generally work with the bytes, and you do not treat them as characters. So the corresponding byte representations of the magic numbers I referenced above are (respectively): 49 49 2A 00 and 4D 4D 00 2A.
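
A minimal sketch of that byte-level search, assuming the whole file fits in memory as a byte array. The class and method names (`TiffHeaderScanner`, `FindHeaderOffsets`) are made up for illustration. It checks for both magic numbers at every position and returns the offsets where one starts. Keep in mind that the same four bytes could also occur by chance inside compressed image data, so the offsets are candidates rather than guaranteed file boundaries.

```csharp
using System;
using System.Collections.Generic;

public static class TiffHeaderScanner
{
    // Little-endian ("II") and big-endian ("MM") TIFF signatures.
    private static readonly byte[] LittleEndian = { 0x49, 0x49, 0x2A, 0x00 };
    private static readonly byte[] BigEndian = { 0x4D, 0x4D, 0x00, 0x2A };

    // Returns every offset in 'data' at which either signature starts.
    public static List<int> FindHeaderOffsets(byte[] data)
    {
        var offsets = new List<int>();
        for (int i = 0; i + 4 <= data.Length; i++)
        {
            if (MatchesAt(data, i, LittleEndian) || MatchesAt(data, i, BigEndian))
            {
                offsets.Add(i);
            }
        }
        return offsets;
    }

    // Compares the signature against the bytes starting at 'index'.
    private static bool MatchesAt(byte[] data, int index, byte[] signature)
    {
        for (int j = 0; j < signature.Length; j++)
        {
            if (data[index + j] != signature[j])
            {
                return false;
            }
        }
        return true;
    }
}
```

Typical use would be `TiffHeaderScanner.FindHeaderOffsets(System.IO.File.ReadAllBytes(path))`, where `path` points at the concatenated file.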
 

Author Comment

by:rkspence
ID: 39763234
Yes, you are absolutely correct, and this is what we are trying to do. Can you please give us an example in C# of how to loop for this configuration? We are only getting the first file as we stream the file, and we are not sure how to handle the binary format.
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 39763244
What is the end result? What are you trying to do with each image?
 

Author Comment

by:rkspence
ID: 39763281
These are check images from the bank and we are trying to split each front and back into separate TIFF files to import into a document imaging system. We used Notepad++ to see if multiple image files really were there. We were able to delete binary data but not copy and paste it in that program. By deleting all the binary data from the second instance of II back to the beginning of the file we were able to see the next image, so it really is a concatenated TIFF file and not a multi-page one (viewers only show the first image).

We thought the easiest way to break them apart again would be to look for the "II" sequence in the stream. The bank also provided a text data file with start positions and lengths which appear to be byte related. Here is some sample data for the first ten images:

##Start Position## ##Length#
000000000000000000 000011486 Check#1 Front
000000000000011486 000007010 Check#1 Back
000000000000018496 000012370 Check#2 Front
000000000000030866 000014120 Check#2 Back
000000000000044986 000011618 Check#3 Front
000000000000056604 000008887 Check#3 Back
000000000000065491 000011522 Check#4 Front
000000000000077013 000009119 Check#4 Back
000000000000086132 000017028 Check#5 Front
000000000000103160 000016444 Check#5 Back

Sorry I can't post the TIFF file but it contains all 356 live check images and I don't know how to mock one up as a fake sample. It would be neater if the start position and length data could be used in C# to process the TIFF files for each image. That file also contains check#, account#, etc which we would use to rename the files as we process them.
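
In the sample rows, every start position equals the previous start plus the previous length (0 + 11486 = 11486, 11486 + 7010 = 18496, and so on), which suggests zero-based byte offsets with no gaps between images. Here is a rough sketch of reading such lines into C# objects. It assumes each data line is "start, length, then free text" separated by whitespace; the real bank file has more columns (check#, account#, etc.) whose layout is not shown here, and the names `ImageIndexEntry` and `ImageIndexParser` are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class ImageIndexEntry
{
    public long Start { get; set; }    // zero-based byte offset into the image file
    public int Length { get; set; }    // number of bytes in this image
    public string Label { get; set; }  // whatever text follows the two numbers
}

public static class ImageIndexParser
{
    public static List<ImageIndexEntry> Parse(IEnumerable<string> lines)
    {
        var entries = new List<ImageIndexEntry>();
        foreach (string line in lines)
        {
            // Split into at most three parts: start, length, remaining label text.
            string[] parts = line.Split(new[] { ' ', '\t' }, 3, StringSplitOptions.RemoveEmptyEntries);
            long start;
            int length;
            if (parts.Length < 2
                || !long.TryParse(parts[0], NumberStyles.None, CultureInfo.InvariantCulture, out start)
                || !int.TryParse(parts[1], NumberStyles.None, CultureInfo.InvariantCulture, out length))
            {
                continue; // header line or anything that is not "digits digits ..."
            }
            entries.Add(new ImageIndexEntry
            {
                Start = start,
                Length = length,
                Label = parts.Length > 2 ? parts[2].Trim() : string.Empty
            });
        }
        return entries;
    }
}
```

The label could later be turned into a file name once the real column layout is known.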

Once they are split into individual TIFF files we would like to combine the front and back image pairs into single true multipage TIFFs; one for each check.
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 39764198
I've never seen a TIFF file concatenated like that and I have no idea why the bank would do it that way. If it were me, I'd press the bank for a more standard file format. That said, my first question is if the byte string beginning at [Start Position] and running for [Length] bytes is a well-formed TIFF file. If it is, then it should be easy to read the entire file into a string variable and then loop through the string, selecting each substring beginning at [Start Position] and for [Length] bytes, writing each such substring to a file. You don't have to "handle the binary format" in any way, since you know the starting byte number and the number of bytes for each file – just parse the string that way, ignoring any meaning to the binary format. The text data file also seems to provide a file name for each file, e.g., <check_1_front.tif>, <check_1_back.tif>, <check_2_front.tif>, <check_2_back.tif>, etc. I'm not a C# programmer, but could write such a program in a language that I know, and I'm sure C# has enough functionality as a programming language for this to be done. Of course, if each substring from [Start Position] for [Length] bytes is not a well-formed TIFF file, then the whole idea is for naught.

Once you have the individual TIFF files for front and back, there are various ways to combine each front-and-back set into a multi-page (in this case, two-page) TIFF. I'd probably use the "/multitif" option of IrfanView, as shown in this EE article. Regards, Joe
 

Author Comment

by:rkspence
ID: 39765387
Unfortunately the bank is not being helpful. Their handbook for using the files appears to have been written for mainframe programmers and actually says they will offer no support for parsing or splitting the data and that customers will need the expertise to do those functions.

We are able to split off the first image file using both VBA and C#. The current problem appears to be that BinaryReader.ReadBytes is not reading the entire file. Like the TIFF viewer programs, it only sees the first image. We know it is working because we have split off a viewable TIFF file of 15 KB out of the full 4 MB file, but we still can't move to the next image.

I'm guessing it is hitting some type of End of File marker. Is there some way to tell ReadBytes to ignore EOF and read the entire concatenated file?
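
For what it's worth, as far as I know .NET's FileStream and BinaryReader treat a file as raw bytes and do not react to any in-band end-of-file marker; ReadBytes(count) only returns fewer bytes than requested when the stream really ends. So a short read is more likely caused by the count being passed in. A small sketch to check this, with `WholeFileReader` being a made-up name:

```csharp
using System;
using System.IO;

public static class WholeFileReader
{
    // Reads every byte of the file and verifies that nothing was cut short.
    public static byte[] ReadAll(string path)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var reader = new BinaryReader(stream))
        {
            if (stream.Length > int.MaxValue)
            {
                throw new IOException("File is too large to read into a single array.");
            }

            // Ask for exactly as many bytes as the file system reports.
            byte[] data = reader.ReadBytes((int)stream.Length);
            if (data.Length != stream.Length)
            {
                throw new EndOfStreamException("Read fewer bytes than the file length.");
            }
            return data;
        }
    }
}
```

If the returned array is about 4 MB for your file, the data is all there and the remaining work is just slicing it at the right offsets.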

 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 39765986
Sorry...don't know ReadBytes. Perhaps a C# expert familiar with it will jump in. In the meantime, if you're willing to try another language, I recommend AutoHotkey (AHK), an excellent (free!) programming/scripting language. There have been several forks of the original language and my preferred one now is AutoHotkey_L. It comes with a Windows installer, as well as a compiler that turns the AHK source code (plain text) into a stand-alone/no-install executable (an EXE file).

If you're willing to give it a try, attached is an AHK program (the source code in a text file) that will read a binary file into a variable and then write the variable out to a binary file. I just ran it on a multi-page TIFF file and it worked perfectly – the output file is identical to the input file (determined by using a binary file comparison program). Change these two lines of code to point to your test files:

filein:="d:\0tempD\multipagein.tif"
fileout:="d:\0tempD\multipageout.tif"

It would be very interesting to know if this AHK program gets your entire check file or if it also stops at an EOF marker. I'd be happy to test it for you, but you said that you can't post the TIFF file. So to test it yourself, all you need to do is install AutoHotkey and it will own the AHK file type – just double-click the attached AHK source code after downloading it and AutoHotkey will run it (no need to compile it into an EXE until and unless you want a stand-alone executable). Since the source code is provided, you may see for yourself that the code is not malicious.

Note that the AHK function that reads the file into a variable (readfiletovar) takes as parameters a starting location and a number of bytes to read. So you could use the bank's text data file to drive the file-reading once we know that the basic concept is sound. But the attached AHK program simply reads (and writes) the whole file just to see if there's an EOF issue. Regards, Joe
read-and-write-checks.ahk
 

Author Comment

by:rkspence
ID: 39766387
Joe, Thanks for the suggestion of AutoHotKey. We need to end up with a .exe we can call from a script. I really prefer C# as we are trying to standardize on it. We were able to do a proof of concept in VBA using the Start Position and Byte Length provided in the Bank's text file:

Private Sub Command0_Click()
    Dim FileIn As Long
    Dim FileOut As Long
    Dim StartPos As Long
    Dim Length As Long
    Dim ImageData As String
       
    FileIn = FreeFile()
    FileOut = FreeFile()
   
    'I hard coded in sample data for testing
    StartPos = 2613076
    Length = 10786
    'This next line was apparently the trick.
    'I created an empty string the exact length needed for the image file.
    ImageData = String(Length, " ")
   
    Open "C:\input.tiff" For Binary Access Read As #FileIn
    'The +1 below was needed to correct start position (Get uses 1-based byte positions)
    Get #FileIn, StartPos + 1, ImageData
    Close #FileIn
   
    Open "C:\Output.tiff" For Binary Access Write As #FileOut
    Put #FileOut, , ImageData
    Close #FileOut
   
End Sub

This successfully writes a TIFF file with the single image specified.
Does anyone know what the equivalent of this would be in C#?
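
A rough C# counterpart of the VBA above might look like the following. This is a sketch only; `ImageExtractor` and `ExtractRange` are names invented here. Stream positions in .NET are zero-based, so the bank's start position is used as-is, without the +1 that VBA's Get needs.

```csharp
using System;
using System.IO;

public static class ImageExtractor
{
    // Copies 'length' bytes starting at zero-based 'startPos' into a new file.
    public static void ExtractRange(string inputPath, string outputPath, long startPos, int length)
    {
        using (var input = new FileStream(inputPath, FileMode.Open, FileAccess.Read))
        {
            input.Seek(startPos, SeekOrigin.Begin);

            var buffer = new byte[length];
            int total = 0;
            // Stream.Read may return fewer bytes than requested, so loop until full.
            while (total < length)
            {
                int read = input.Read(buffer, total, length - total);
                if (read == 0)
                {
                    throw new EndOfStreamException("Input ended before the requested range was read.");
                }
                total += read;
            }

            File.WriteAllBytes(outputPath, buffer);
        }
    }
}
```

With the hard-coded sample values from the VBA test this would be called as `ImageExtractor.ExtractRange(@"C:\input.tiff", @"C:\Output.tiff", 2613076, 10786);`.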
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 39766435
> We need to end up with a .exe we can call from a script.

That's exactly what the AutoHotkey compiler will create.

> I really prefer C# as we are trying to standardize on it.

Let's hope a C# expert jumps in...that's not me.

> We were able to do a proof of concept in VBA using the Start Position and Byte Length provided in the Bank's text file

Good to hear! I'm guessing you could complete the effort in VB (and I'm pretty sure I could do it in AHK), but if you're committed to doing it in C#, so be it. In any case, I was happy to try to help. Regards, Joe
 
LVL 74

Accepted Solution

by:käµfm³d 👽 (earned 500 total points)
ID: 39769783
Here's a quick-and-dirty approach to separating the files:

namespace _28332667
{
    class Program
    {
        static void Main(string[] args)
        {
            using (System.IO.FileStream inputStream = System.IO.File.OpenRead(@"C:\path\to\file.tiff"))
            {
                System.IO.FileStream outputStream = null;
                int fileCount = 0;

                if (inputStream.Length > 0)
                {
                    int datum = inputStream.ReadByte();

                    fileCount++;
                    outputStream = System.IO.File.Create(@"C:\path\to\file" + fileCount.ToString() + ".tiff");

                    while (datum >= 0) // ReadByte returns -1 at end of stream, so the final byte still gets written
                    {
                        outputStream.WriteByte((byte)datum);
                        datum = inputStream.ReadByte();

                        if (datum == (int)'I')
                        {
                            long resumeAt = inputStream.Position; // normal reading continues here after the look-ahead
                            int temp = inputStream.ReadByte();
                            if (temp == (int)'I')
                            {
                                temp = inputStream.ReadByte();

                                if (temp == (int)'*')
                                {
                                    temp = inputStream.ReadByte();

                                    if (temp == (int)'\0')
                                    {
                                        // Full magic number found: finish the current file and start the next one.
                                        outputStream.Close();
                                        fileCount++;
                                        outputStream = System.IO.File.Create(@"C:\path\to\file" + fileCount.ToString() + ".tiff");
                                    }
                                }
                            }

                            // Undo the look-ahead. Restoring a saved position is safe even when
                            // ReadByte hit the end of the file and did not advance.
                            inputStream.Position = resumeAt;
                        }
                    }
                }

                if (outputStream != null)
                {
                    outputStream.Close();
                }
            }
        }
    }
}


The idea is that you loop through all the bytes of the source file looking for the magic number. If you find it, then you close the current output file and start a new one. The FileStream class doesn't expose a Peek method, so we simulate one by reading ahead when we find the first magic number character, and then moving the read position back to where the look-ahead started so that no bytes are skipped.

The above is based on the TIF magic number of 49 49 2A 00. If the format you are provided used the other magic number instead (4D 4D 00 2A), then you simply need to adjust the character values in lines 24, 28, 32, and 36. It would be odd that the bank would mix the two magic numbers, so it should be consistent throughout the file.
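
Since the whole file is only a few megabytes, the same idea can also be sketched on an in-memory byte array, which avoids the look-ahead and rewind entirely. This is an alternative illustration, not the code above; `BufferSplitter` is a hypothetical name, and the signature is passed in so that either magic number can be used. Unlike the stream version, any bytes before the first signature are dropped, and a signature that happens to occur inside image data would still cause a false split.

```csharp
using System;
using System.Collections.Generic;

public static class BufferSplitter
{
    // Splits 'data' into segments, each beginning at an occurrence of 'signature'.
    public static List<byte[]> Split(byte[] data, byte[] signature)
    {
        if (signature == null || signature.Length == 0)
        {
            throw new ArgumentException("Signature must not be empty.", "signature");
        }

        // Pass 1: collect the offsets where the signature starts.
        var starts = new List<int>();
        for (int i = 0; i + signature.Length <= data.Length; i++)
        {
            bool match = true;
            for (int j = 0; j < signature.Length && match; j++)
            {
                match = data[i + j] == signature[j];
            }
            if (match)
            {
                starts.Add(i);
            }
        }

        // Pass 2: each segment runs from one start to the next (or to the end).
        var segments = new List<byte[]>();
        for (int k = 0; k < starts.Count; k++)
        {
            int begin = starts[k];
            int end = (k + 1 < starts.Count) ? starts[k + 1] : data.Length;
            var segment = new byte[end - begin];
            Array.Copy(data, begin, segment, 0, segment.Length);
            segments.Add(segment);
        }
        return segments;
    }
}
```

For the little-endian case the call would be `BufferSplitter.Split(bytes, new byte[] { 0x49, 0x49, 0x2A, 0x00 })`, and each returned array could be written out with `System.IO.File.WriteAllBytes`.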

P.S.

In case you're not aware, the character "\0" is one character (even though it looks like two). It is the null character, and it has a numerical value of zero.

P.P.S.

The above simply writes out each image to its own file. I am not familiar enough with the multipage TIFF file specification to show how to create outputs of that type.
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 39772874
> I am not familiar enough with the multipage TIFF file specification to show how to create outputs of that type.

This is where the "/multitif" option of IrfanView can help. I mentioned an EE article earlier, but here's just the command showing the IrfanView command line call:

i_view32.exe /multitif=(c:\path\front_back_in_2page_file.tif,c:\path\front.tif,c:\path\back.tif) /killmesoftly /silent /tifc=4

The "/multitif" syntax is such that the first file is the name of the multi-page output (combined/merged) TIFF file and all of the subsequent files are the input files. The "/killmesoftly" and "/silent" params are nice for calling it in a program. The "/tifc" param is for the TIFF file compression. Its values may be:

0=None
1=LZW
2=Packbits
3=ITU-T Group 3
4=ITU-T Group 4
5=Huffman
6=JPG
7=ZIP

I have experimented extensively with them and unless you have a reason for picking something else, I strongly recommend ITU-T (previously known as CCITT) Group 4. Btw, all of the IrfanView command line parameters are documented in the file <i_options.txt> that is created in the IrfanView install directory. I have attached it to this thread for the latest version of IrfanView (4.37). Regards, Joe
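
For anyone wanting to drive that command from C#, here is a rough sketch using System.Diagnostics.Process. The switches are exactly the ones in the command line above; the class name `IrfanViewMerge`, the location of i_view32.exe, and the file paths are assumptions to adjust. I have not checked how IrfanView handles paths containing spaces or commas inside the /multitif list, so treat that as untested.

```csharp
using System;
using System.Diagnostics;

public static class IrfanViewMerge
{
    // Builds the argument string: the first file is the multi-page output, the rest are inputs.
    public static string BuildArguments(string outputPath, string frontPath, string backPath)
    {
        return string.Format(
            "/multitif=({0},{1},{2}) /killmesoftly /silent /tifc=4",
            outputPath, frontPath, backPath);
    }

    // Runs IrfanView with the given arguments and returns its exit code.
    public static int Run(string irfanViewExe, string arguments)
    {
        var info = new ProcessStartInfo
        {
            FileName = irfanViewExe,
            Arguments = arguments,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(info))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

A call would look like `IrfanViewMerge.Run(pathToIView, IrfanViewMerge.BuildArguments(@"c:\path\front_back_in_2page_file.tif", @"c:\path\front.tif", @"c:\path\back.tif"));` where `pathToIView` is wherever i_view32.exe is installed.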
IrfanView-4.37-i-options.txt
 

Author Comment

by:rkspence
ID: 39777808
¿kaufmed?,

Thanks, the Q&D based on the magic numbers appears to work fine. We tried using the start position and length in VBA and got it to work pulling out individual images, but could not get it to loop through all the images.

Using the magic numbers to break them apart does not provide any easy way to rename them with the check number and front/back designation. We would have to read the txt data file separately and rename the checks in order. I think that will work OK, but we will need to test and make sure we have the number of images we are expecting. The bank recently reminded me that some images may be missing. I think they will still be in the txt file with 0 length but still showing check#, date, and amount.

Is there any option for using Start Position, and length in C# or will we run into the same problem when we try to run it in a loop?
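
One possible way to do this in C#, offered as a sketch under assumptions rather than tested against the bank's files: open the image file once and, for each row of the text file, set the stream position to the start value and read exactly that many bytes. Entries with zero length are skipped, which is a guess at how missing images appear. The `IndexedSplitter` name and the tuple shape are invented for the example, and the output file names are expected to be built by the caller from the check number and front/back fields.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class IndexedSplitter
{
    // Writes one file per entry and returns how many files were written.
    public static int SplitByIndex(
        string inputPath,
        string outputDir,
        IEnumerable<(long Start, int Length, string FileName)> entries)
    {
        int written = 0;
        using (var input = new FileStream(inputPath, FileMode.Open, FileAccess.Read))
        {
            foreach (var entry in entries)
            {
                if (entry.Length <= 0)
                {
                    continue; // assumed to mean the image is missing
                }
                if (entry.Start < 0 || entry.Start + entry.Length > input.Length)
                {
                    throw new InvalidDataException("Entry lies outside the file: " + entry.FileName);
                }

                input.Position = entry.Start; // zero-based, independent of earlier reads
                var buffer = new byte[entry.Length];
                int total = 0;
                while (total < buffer.Length)
                {
                    int read = input.Read(buffer, total, buffer.Length - total);
                    if (read == 0)
                    {
                        throw new EndOfStreamException("Unexpected end of file in " + entry.FileName);
                    }
                    total += read;
                }

                File.WriteAllBytes(Path.Combine(outputDir, entry.FileName), buffer);
                written++;
            }
        }
        return written;
    }
}
```

Because every read starts from an explicit position, one iteration cannot disturb the next, and comparing the returned count with the number of non-zero-length rows gives a quick sanity check on the image count.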
 

Expert Comment

by:smithmrk
ID: 40021774
Hey kaufmed,

I have a similar issue...except I want to do it in reverse!
I want to create that type of file vs reading from it.
I work for a Bank and need to create a file just like this!

Please see this question I submitted:
http://www.experts-exchange.com/Programming/Languages/.NET/Visual_Basic.NET/Q_28419769.html

Thanks,
Mark
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 40021860
Hi smithmrk,

I just saw your post. I'm actually off to bed now, but I'll take a look at your question tomorrow.