rschaaf

How to read binary file data directly into a structure in C#?


My C# app needs to process many terabytes of binary sensor data, stored in binary files as consecutive structures of various sizes.

I need to read the binary sensor data into structures (or classes if need be) but the only way I can see to do this is to 1) read one value at a time, or 2) read data into a byte array and then marshal the data over to the structure.  I can't find a way to read data directly into the structure.  I've tried unmanaged code and even calling "C" code in a DLL, but I still can't find a way to read data directly into a structure.

I know that I'm trying to violate one of the managed code design goals of C#, but this is a small part of a large program where performance is crucial.

I have a structure defined as follows:

    [StructLayout(LayoutKind.Sequential, Pack=1, CharSet=CharSet.Ansi)]
    public struct SENSORDATA1 {
         public byte FileFormat;
         byte SystemType;
         [MarshalAs(UnmanagedType.ByValTStr, SizeConst=8)] public string BinName;
         float DataBias;
         [MarshalAs(UnmanagedType.ByValTStr, SizeConst=8)] public string BinVersion;
         ...
    } // This struct is actually 1024 bytes.  Others have various sizes.


THIS CODE EXAMPLE WORKS BUT IS SLOW
-----------------------------------
    class FileReader {
         IntPtr handle;
         // ... functions to open, close, etc. go here
         public int Read(IntPtr buffer, int count) {
              int n = 0;
              if (Win32.Kernel.ReadFile(handle, buffer, count, ref n, 0) == 0) {
                   return 0;
              }
              return n;
         }
    }
    public bool ReadSensorDataSlow() {
         // FileReader fr = new FileReader(FileName) calls omitted from this example
         IntPtr BinBlock = Marshal.AllocHGlobal(1024);
         int amt = fr.Read(BinBlock, 1024); // this fn reads binary data into BinBlock
         // copy binary data into the structure - slow!
         SENSORDATA1 SensorData1 = (SENSORDATA1) Marshal.PtrToStructure(BinBlock, typeof(SENSORDATA1));
         Marshal.FreeHGlobal(BinBlock); // release the unmanaged buffer
         return true;
    }

THIS FASTER CODE EXAMPLE SEEMS IMPOSSIBLE IN C#
-----------------------------------------------
    public bool ReadSensorDataFast() {
         // FileReader fr = new FileReader(FileName) calls omitted from this example
         SENSORDATA1 SensorData1;
         IntPtr StructPtr = &SensorData1; // IMPOSSIBLE? CAN'T CAST ADDRESS OF STRUCTURE IN C#
         int amt = fr.Read(StructPtr, 1024);
         return true;
    }

So, how can I call fr.Read() and directly populate SensorData1 from the file?
psdavis

Since you already are using Interop to access ReadFile,

// if (Win32.Kernel.ReadFile(handle, buffer, count, ref n, 0) == 0)

You can rewrite the way that the DllImport handles the ReadFile function.  Instead of telling it that it needs a pointer, you can tell it that it needs a SENSORDATA1 instead.

See if it helps at all.  Let us know how it goes.
testn

try

   public unsafe bool ReadSensorDataSlow() {
        // FileReader fr = new FileReader(FileName) calls omitted from this example
        SENSORDATA1 SensorData1;
        void * temp = &SensorData1;
        IntPtr StructPtr = (IntPtr) temp; // IMPOSSIBLE? CAN'T CAST ADDRESS OF STRUCTURE IN C#
        int amt = fr.Read(StructPtr, 1024);
        return true;
   }

It would be easy to do if you didn't have those pesky strings in there. I think you have two options:

1) The one that psdavis gives is fairly good. You could define ReadFile to take a "ref struct" of the type that you want to get, and since you're calling through interop, it would handle the string marshalling.

2) If you want to stay in the managed world, you can define your structure as follows:

[StructLayout(LayoutKind.Explicit)]
 public struct SENSORDATA1 {
        [FieldOffset(0)]
        public byte FileFormat;
        [FieldOffset(1)]
        byte SystemType;
        [FieldOffset(2)]
        byte binName;      // first of 8 single-byte characters (a C# char would be 2 bytes)
        [FieldOffset(10)]
        float DataBias;
        [FieldOffset(14)]
        byte binVersion;   // first of 8 single-byte characters
        ...
   } // This struct is actually 1024 bytes.  Others have various sizes.

After setting all the offsets, you can deal with the strings by getting the address of the first byte of each one and then building your real string from the bytes at that address.


Here is a quick snippet I use for sending binary formatted data out very quickly



                  object myObj = "This is a string Object";
                  
                  using(Stream myOutStream = new FileStream(@"C:\temp\test_out.dat", FileMode.OpenOrCreate))
                  {                        
                        IFormatter myFmt = new BinaryFormatter();      
                        myFmt.Serialize(myOutStream, myObj);
                        myOutStream.Flush();
                  }



You can also use the IFormatter object to deserialize binary data into an object graph.  If your sensor data is already in a format consistent with your structs, then you should have no problem deserializing it into your structs.   However, if your sensor data is just a binary dump of data and you need to specify which data goes where (first 8 bytes is variable 1, next 14 bytes is variable 2, etc.) then you may have no choice but to deserialize each segment and do it step by step.
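
For reference, the step-by-step route could look roughly like the sketch below. It is only an illustration: `SensorHeader`, `FieldByFieldReader`, and `ReadFixedString` are made-up names, the record is cut down to the five fields shown in the question, and it assumes little-endian data with NUL-padded single-byte strings.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical, shortened record: only the first five fields from the question.
public struct SensorHeader
{
    public byte FileFormat;
    public byte SystemType;
    public string BinName;     // 8 single-byte characters on disk
    public float DataBias;
    public string BinVersion;  // 8 single-byte characters on disk
}

public static class FieldByFieldReader
{
    // Reads one record, one field at a time, in on-disk order.
    public static SensorHeader Read(BinaryReader reader)
    {
        var h = new SensorHeader();
        h.FileFormat = reader.ReadByte();
        h.SystemType = reader.ReadByte();
        h.BinName = ReadFixedString(reader, 8);
        h.DataBias = reader.ReadSingle();   // BinaryReader reads little-endian
        h.BinVersion = ReadFixedString(reader, 8);
        return h;
    }

    // Reads a fixed-width ASCII field and drops trailing NUL padding.
    private static string ReadFixedString(BinaryReader reader, int length)
    {
        byte[] raw = reader.ReadBytes(length);
        if (raw.Length != length) throw new EndOfStreamException();
        return Encoding.ASCII.GetString(raw).TrimEnd('\0');
    }
}
```

This is easy to get right but does one call per field, which is exactly the overhead the question is trying to avoid.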
I also forgot about  "Marshal.StructureToPtr" and "Marshal.PtrToStructure".  You may want to look into these two methods to see if it might help.  

> StructureToPtr is useful for swapping one structure with another in the same memory location.

Bah, NM... Like I ever read the question...  Ignore my last post.
Since I get to go home in a few, I'll continue on with my idea.

My guess is that your ReadFile looks like

[DllImport("kernel32")] public static extern int ReadFile(HANDLE hFile, IntPtr lpBuffer, int nNumberOfBytesToRead, ref int lpNumberOfBytesRead, ref OVERLAPPED lpOverlapped);

change it to look like

[DllImport("kernel32")] public static extern int ReadFile(HANDLE hFile, SENSORDATA1 lpBuffer, int nNumberOfBytesToRead, ref int lpNumberOfBytesRead, ref OVERLAPPED lpOverlapped);

public bool ReadSensorDataLightning()
{
   SENSORDATA1 pSensorData;

    if (Win32.Kernel.ReadFile(handle, pSensorData, count, ref n, 0) == 0)
        return false;

   return true;
}

I'm sure you'll hit some problems off the bat since I'm just typing without testing it.  You may have to add a 'ref' before the SENSORDATA1 in the function call.  Or you may have to initialize pSensorData = new SENSORDATA1() before you call ReadFile.  The point is, let Interop handle the conversion for ya.

Going home, good luck!
rschaaf

ASKER

Great comments!  I'm still testing these solutions, particularly the ideas from psdavis (the serialization idea has merit also).   The problem right now is that I'm passing a "ref SensorData" to fr.Read(), which then passes "ref SensorData" to ReadFile through Interop, at which point I get a System.ExecutionEngineException.  This is probably a dumb error on my part... still scratching my head.  I'll comment back here in a while.
SOLUTION
psdavis

This solution is only available to members.
rschaaf

ASKER

Hi psdavis,

Thanks for the example.  I tried this technique with structs and it crashes every time with "An unhandled exception of type 'System.ExecutionEngineException' occurred in whatever.exe"  Defining my storage object as a class (like you did) does work, but the performance is REALLY bad, about 50% slower than the Marshal.PtrToStructure() call.  I'm not sure why.

_TAD_'s suggestion of using serialization doesn't work because the files were created by "C" and not the serializer.  Thus BinaryFormatter gives a Version Incompatibility error while reading the file.  And setting the fields one-by-one is what I'm trying to avoid here.

I'm still playing with psdavis' suggestion on my end.  I sure wish I understood what InteropServices is doing during my call to Win32's ReadFile().


...eh, I figured as much.  

Although... now that I've thought about it a little bit, I think serializing/deserializing directly into the object is definitely the fastest way to go, however that obviously won't work in this case.

Unless...

What is the lifecycle of this file??

I assume some sensor array of some kind (apparently written in C) is collecting all kinds of data and writing it out to a binary file. This file is deposited somewhere and then picked up by your C# program where it is interpreted and used.

What if you were to create a filesystemwatcher service that monitored the directory where your binary file is deposited.  When the file is detected, the filesystemwatcher pulls in the file and then deciphers it (possibly over night, taking as much time as it needs) and then creates a NEW file that is a formatted and serialized version of your classes/structs.  You can then read/deserialize these files as you need them and all of the deciphering has already been done during down hours.  Therefore these files would be quickly and easily read ON DEMAND and virtually instantaneously!


...eh, just a thought anyway.

Then,
rschaaf

ASKER

Thanks for the suggestion, _TAD_, but I have many terabytes of data here to process.  It would be easier to go back to "C" and get the "instantaneous" response by reading the data directly into a struct pointer (which is what I'm trying to do in C#).

I'm pulling my hair out over this problem!  I'm right on the edge of abandoning C# entirely and going back to my normal mix of C/C++ in an unmanaged environment.


Unfortunately C and C# are different enough that I don't think you can read a file (created in C) directly into a program/struct created in C#...

But then again, I think you can go from Java to C#.... (??)

there has got to be a way to go from c to C# just as easily.

I'll see if I can find anything that might help.  If I can't find a solution in the next 24 hours it's probably not worth waiting for.
Cheeze way.

Allocate your structures in C#
Create a C++ DLL that accepts the structure and fills your structure with ReadFile.
When the function call is complete, your C# structure should be filled.
rschaaf

ASKER

psdavis - it looks like this won't work.  Interop will not let you blindly pass the address of a structure.  It "unboxes" the object on the [In] side and "boxes" the object on the [Out] side, and performance is HORRIBLE -- at least that's what appears to be going on.
rschaaf

ASKER

I am doubling the points to 650 for someone who can show me code that:

1. Creates/Allocates a class or structure. The class/structure needs to contain ints, floats, bytes and strings (or byte arrays, or some way to represent a sequence of single-byte characters).
2. Reads a block of binary data from a disk file.  The layout of the data matches the layout of the class/structure, AND
3. Puts that binary data DIRECTLY into the class/structure without marshaling, boxing/unboxing, etc.
rschaaf

ASKER

(looks like 500 is the max points - so I'm increasing to 500)
ASKER CERTIFIED SOLUTION
This solution is only available to members.
rschaaf

ASKER

I've been working on this for a while, and I finally figured out a solution on my own.  The key was a call to GCHandle.AddrOfPinnedObject().

The trick is to avoid Interop marshaling of the structure completely.  The only way I found to do this is to read into pinned memory through a raw pointer.  Here is my solution:

Note that I re-wrote the structures to remove any reference to Marshaling.  I used structs of bytes in place of MarshalAs/ByValTStr attributes.

  [StructLayout(LayoutKind.Sequential, Pack=1, CharSet=CharSet.Ansi)]
  public struct string8 {     byte b0,b1,b2,b3,b4,b5,b6,b7; }

  [StructLayout(LayoutKind.Sequential, Pack=1, CharSet=CharSet.Ansi)]
  public struct SENSORDATA1 {
       public byte FileFormat;
       byte SystemType;
       public string8 BinName;     // 8 raw bytes instead of a marshaled string
       float DataBias;
       public string8 BinVersion;  // 8 raw bytes instead of a marshaled string
       ...
  } // This struct is actually 1024 bytes.  Others have various sizes.

 
  public unsafe bool LoadDataHeader() {

    SENSORDATA1 SensHeaderImage = new SENSORDATA1(); // used as model object for Alloc below
    // Alloc boxes a copy of the struct and pins it so the GC cannot move it
    GCHandle gch = GCHandle.Alloc(SensHeaderImage, GCHandleType.Pinned);
    try {
      IntPtr xi = gch.AddrOfPinnedObject(); // address of the pinned copy
      SENSORDATA1 *SensHeaderPtr = (SENSORDATA1 *) xi; // also point a struct ptr to it

      ...
      int amt = 0;
      if (!ReadFile(fr.handle, xi.ToPointer(), 1024, ref amt, IntPtr.Zero)) return false;
      // At this point, SensHeaderPtr points to a SENSORDATA1 struct read from disk
      ...
      return true;
    }
    finally {
      gch.Free(); // always unpin, even if the read fails
    }
  }

    //--------
    [DllImport("kernel32", SetLastError=true)]
    static unsafe extern bool ReadFile(
      IntPtr hFile,                       // handle to file
      void *DataDest,                      // data buffer
      int NumberOfBytesToRead,            // number of bytes to read
      ref int pNumberOfBytesRead,         // number of bytes read
      IntPtr Overlapped                   // overlapped buffer (IntPtr.Zero for synchronous reads)
      );

I could probably use the FileStream BinaryReader to do my reads without having to use the Win32.Kernel32 ReadFile() call.
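
One way to act on that idea, offered as a sketch rather than a drop-in replacement: newer .NET versions (.NET Core 2.1 and later, which postdate this thread) let a `Stream` read into a `Span<byte>` that overlays the struct itself, so no `DllImport`, pinning, or `unsafe` code is needed. The names `String8`, `SensorHeader`, `DirectStructReader`, and `TryRead` are made up for illustration, the struct is cut down to the five fields shown in the question, and the file is assumed to use the machine's native byte order.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;

// Eight raw bytes standing in for a fixed-width char[8] field.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct String8
{
    public byte b0, b1, b2, b3, b4, b5, b6, b7;

    // Decodes the bytes as ASCII and drops trailing NUL padding.
    public override string ToString()
    {
        String8 copy = this;
        ReadOnlySpan<byte> raw =
            MemoryMarshal.AsBytes(MemoryMarshal.CreateReadOnlySpan(ref copy, 1));
        return Encoding.ASCII.GetString(raw).TrimEnd('\0');
    }
}

// Hypothetical, shortened header (22 bytes with Pack = 1).
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct SensorHeader
{
    public byte FileFormat;
    public byte SystemType;
    public String8 BinName;
    public float DataBias;
    public String8 BinVersion;
}

public static class DirectStructReader
{
    // Fills 'value' straight from the stream with no intermediate buffer.
    // Returns false if the stream ends before the struct is complete.
    public static bool TryRead<T>(Stream stream, ref T value) where T : unmanaged
    {
        // View the struct's own memory as a span of bytes.
        Span<byte> bytes = MemoryMarshal.AsBytes(MemoryMarshal.CreateSpan(ref value, 1));
        int total = 0;
        while (total < bytes.Length)
        {
            int n = stream.Read(bytes.Slice(total)); // may return fewer bytes than asked
            if (n == 0) return false;
            total += n;
        }
        return true;
    }
}
```

With a `FileStream` named `fs`, usage would be along the lines of `var h = new SensorHeader(); while (DirectStructReader.TryRead(fs, ref h)) { /* use h */ }`. The `unmanaged` constraint is what rules out `string` fields, which is the same reason the structs above use `string8`.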

On my 850MHz IBM Thinkpad A21p notebook computer, using this code, I can read binary structures from a 59MB file at 202MB per second, when the reading was done repeatedly in a tight loop so that the data was coming from the disk cache.  Some of the Marshaling/Interop solutions were between 10 and 30 times slower.

I'm going to split the points between psdavis and _TAD_ for giving me good suggestions.  They indirectly helped me find this solution faster.
Thanks for responding back with the solution rschaaf, you've taught me something today.


That is some groovy code there rschaaf.


You mentioned possibly using FileStream and BinaryReader instead of the Win32.Kernel32 ReadFile() method.  Here is a quick snippet that I use as a template when I want to read very large files.




            private void bufferStreamPlayAround()
            {

                  string myStr;
                  
                  Stream rdStream = new FileStream(@"C:\temp\test.txt", FileMode.Open);

                  BufferedStream rdBuff = new BufferedStream(rdStream);

                  StreamReader rdRead = new StreamReader(rdBuff, Encoding.ASCII, false, 1024);
                  
                  myStr = rdRead.ReadLine();

                  Console.WriteLine(myStr);

                  rdRead.Close();
                  rdBuff.Close();
                  rdStream.Close();


                  myStr = "Adding this Line to the end of the File";

                  Stream wtStream = new FileStream(@"C:\temp\test.txt", FileMode.Append);

                  BufferedStream wtBuff = new BufferedStream(wtStream);

                  StreamWriter wtWrite = new StreamWriter(wtBuff, Encoding.ASCII, 1024);

                  wtWrite.AutoFlush = true;
                  
                  wtWrite.WriteLine("\n");
                  wtWrite.WriteLine(myStr);
                  

                  wtWrite.Close();
                  wtBuff.Close();
                  wtStream.Close();


            }



In a nutshell this code reads 1024 bytes into a buffer, and then the buffer gets manipulated.  Then the same thing in reverse, 1024 bytes are written to the buffer and then the buffer gets written out to a file.  The reason I am posting this is to point out the power of multi-threading.  You said that you are going to be reading terabytes of data and processing it.  By using a reading thread with asynchronous callbacks, and a processing thread, you could be reading part of the file while your program processes another part.  Because you are loading a struct, it is going to be a little different than interpreting the data into ASCII code, but I think the mechanics are the same.
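
The read-while-processing idea can be sketched with a bounded queue between a reader task and a consumer. This is an illustration under assumptions, not tested against real sensor files: `PipelinedReader` and `Run` are made-up names, `BlockingCollection` and `Task` come from later .NET versions than this thread, and chunk boundaries are not aligned to record boundaries, so real code would pick a chunk size that is a multiple of the record size.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class PipelinedReader
{
    // Reads 'stream' in chunks on one task while 'process' runs on the caller's thread.
    // Returns the total number of bytes handed to 'process'.
    public static long Run(Stream stream, int chunkSize, Action<byte[], int> process)
    {
        // Bounded queue: the reader stalls if processing falls far behind.
        using var queue = new BlockingCollection<(byte[] Buffer, int Count)>(boundedCapacity: 4);

        Task reader = Task.Run(() =>
        {
            try
            {
                while (true)
                {
                    byte[] buffer = new byte[chunkSize];
                    int n = stream.Read(buffer, 0, buffer.Length);
                    if (n == 0) break;          // end of file
                    queue.Add((buffer, n));     // blocks while the queue is full
                }
            }
            finally
            {
                queue.CompleteAdding();         // lets the consumer loop finish
            }
        });

        long total = 0;
        foreach (var (buffer, count) in queue.GetConsumingEnumerable())
        {
            process(buffer, count);             // overlaps with the next read
            total += count;
        }

        reader.Wait();                          // surfaces any exception from the reader
        return total;
    }
}
```

Whether this helps depends on where the time goes: if processing a chunk is much cheaper than reading it, the pipeline mostly waits on the disk.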


Here are some links to a single conversation about reading very large text files  (it's in several parts/links, I don't know why)

part 1:
http://www.mail-archive.com/c_sharp@p2p.wrox.com/msg03066.html

part 2:  (with links to part 2a, and 2b on the bottom)
http://www.mail-archive.com/c_sharp@p2p.wrox.com/msg03067.html

part 3: (with links to parts 3a, 3b, 3c, 3d, etc.. on the bottom)
http://www.mail-archive.com/c_sharp@p2p.wrox.com/msg03072.html



Here's a quick snippet of how to read a file (using multiple threads) with an asynchronous callback


//class constructor
private void ReadLargeFile()
{
    ThreadStart readStartPoint = new ThreadStart(ReadStartPoint);
    Thread readThread = new Thread(readStartPoint);
    readThread.Name = "Async Read Thread";
    readThread.Priority = System.Threading.ThreadPriority.BelowNormal;
    readThread.Start();                
}

private void ReadStartPoint()
{
    for( ;; )
    {
        ThreadStart dataStartPoint = new ThreadStart(DataStartPoint);
        Thread dataThread = new Thread(dataStartPoint);
        dataThread.Name = "Async Data Thread";
                               
        dataThread.Start(); //will lock var in here

        AsyncReadFunction();
       
        if( FileComplete )
            break;
    }
}

private void DataStartPoint()
{
    lock(returnBuffer) // locking gDataStreamVar
    {
        rtb.Text += Encoding.ASCII.GetString( returnBuffer, 0, bytesRead);
    }

    // we are finished with our data thread; it ends when this method returns
    // (no need to call Thread.Abort on the current thread)
}

private void AsyncReadFunction()
{
    SetReadState = new AsyncCallback(this.OnCompletedRead);
   
    fileStream.BeginRead( returnBuffer, 0, 1048576, SetReadState, null );
}

private void OnCompletedRead( IAsyncResult asyncResult )
{
    bytesRead = fileStream.EndRead( asyncResult );

    if( bytesRead == 0 )
        FileComplete = true;
}