How to read binary file data directly into a structure in C#?

rschaaf

My C# app needs to process many terabytes of binary sensor data, stored in binary files as consecutive structures of various sizes.

I need to read the binary sensor data into structures (or classes if need be) but the only way I can see to do this is to 1) read one value at a time, or 2) read data into a byte array and then marshal the data over to the structure.  I can't find a way to read data directly into the structure.  I've tried unmanaged code and even calling "C" code in a DLL, but I still can't find a way to read data directly into a structure.

I know that I'm trying to violate one of the managed code design goals of C#, but this is a small part of a large program where performance is crucial.

I have a structure defined as follows:

    [StructLayout(LayoutKind.Sequential, Pack=1, CharSet=CharSet.Ansi)]
    public struct SENSORDATA1 {
         public byte FileFormat;
         byte SystemType;
         [MarshalAs(UnmanagedType.ByValTStr, SizeConst=8)] public string BinName;
         float DataBias;
         [MarshalAs(UnmanagedType.ByValTStr, SizeConst=8)] public string BinVersion;
         ...
    } // This struct is actually 1024 bytes.  Others have various sizes.


THIS CODE EXAMPLE WORKS BUT IS SLOW
-----------------------------------
    class FileReader {
         IntPtr handle;
         // ... functions to open, close, etc. go here
         public int Read(IntPtr buffer, int count) {
              int n = 0;
              if (Win32.Kernel.ReadFile(handle, buffer, count, ref n, 0) == 0) {
                   return 0;
              }
              return n;
         }
    }
    public bool ReadSensorDataSlow() {
         // FileReader fr = new FileReader(FileName) calls omitted from this example
         IntPtr BinBlock = Marshal.AllocHGlobal(1024);
         int amt = fr.Read(BinBlock, 1024); // this fn reads binary data into BinBlock
         // copy binary data into the structure - slow!
         SENSORDATA1 SensorData1 = (SENSORDATA1) Marshal.PtrToStructure(BinBlock, typeof(SENSORDATA1));
         Marshal.FreeHGlobal(BinBlock); // release the unmanaged buffer
         return true;
    }

THIS FASTER CODE EXAMPLE SEEMS IMPOSSIBLE IN C#
-----------------------------------------------
    public bool ReadSensorDataFast() {
         // FileReader fr = new FileReader(FileName) calls omitted from this example
         SENSORDATA1 SensorData1;
         IntPtr StructPtr = &SensorData1; // IMPOSSIBLE? CAN'T CAST ADDRESS OF STRUCTURE IN C#
         int amt = fr.Read(StructPtr, 1024);
         return true;
    }

So, how can I call fr.Read() and directly populate SensorData1 from the file?

Commented:
Since you already are using Interop to access ReadFile,

// if (Win32.Kernel.ReadFile(handle, buffer, count, ref n, 0) == 0)

You can rewrite the way that the DllImport handles the ReadFile function.  Instead of telling it that it needs a pointer, you can tell it that it needs a SENSORDATA1 instead.

See if it helps at all.  Let us know how it goes.

Commented:
try

   public unsafe bool ReadSensorDataSlow() {
        // FileReader fr = new FileReader(FileName) calls omitted from this example
        SENSORDATA1 SensorData1;
        void * temp = &SensorData1; // only compiles if the struct has no managed fields such as string
        IntPtr StructPtr = (IntPtr) temp; // cast the raw address to an IntPtr
        int amt = fr.Read(StructPtr, 1024);
        return true;
   }

Commented:
It would be easy to do if you didn't have those pesky strings in there. I think you have two options:

1) The one that psdavis gives is fairly good. You could define ReadFile to take a "ref struct" of the type that you want to get, and since you're calling through interop, it would handle the string marshalling.

2) If you want to stay in the managed world, you can define your structure as follows:

[StructLayout(LayoutKind.Explicit)]
 public struct SENSORDATA1 {
        [FieldOffset(0)]
        public byte FileFormat;
        [FieldOffset(1)]
        byte SystemType;
        [FieldOffset(2)]
        byte binName;      // first of 8 single-byte characters
        [FieldOffset(10)]
        float DataBias;
        [FieldOffset(14)]
        byte binVersion;   // first of 8 single-byte characters
        ...
   } // This struct is actually 1024 bytes.  Others have various sizes.

After setting all the offsets, you can deal with the strings by getting the address of the first byte of each one and then building your real string from that. The string fields are declared as byte rather than char because a C# char is two bytes wide, while the file stores single-byte characters.
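The string-building step can be sketched roughly as follows. This is only an illustration under assumptions: `SensorHeaderSketch`, `FixedStringSketch`, `FromFixed` and `Demo` are hypothetical names, the struct covers just the first 22 bytes of the real 1024-byte layout, and the text is assumed to be NUL-padded ASCII.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical 22-byte prefix of SENSORDATA1; the real struct is 1024 bytes.
[StructLayout(LayoutKind.Explicit, Size = 22)]
public struct SensorHeaderSketch
{
    [FieldOffset(0)]  public byte FileFormat;
    [FieldOffset(1)]  public byte SystemType;
    [FieldOffset(2)]  public byte BinName;      // first of 8 single-byte characters
    [FieldOffset(10)] public float DataBias;
    [FieldOffset(14)] public byte BinVersion;   // first of 8 single-byte characters
}

public static class FixedStringSketch
{
    // Builds a managed string from a fixed-width, NUL-padded run of single-byte characters.
    public static unsafe string FromFixed(byte* first, int width)
    {
        int len = 0;
        while (len < width && first[len] != 0) len++;
        return new string((sbyte*)first, 0, len, Encoding.ASCII);
    }

    // Fills a header with sample bytes, then reads the name back out.
    public static unsafe string Demo()
    {
        SensorHeaderSketch h = new SensorHeaderSketch();   // zero-initialized local
        byte* name = &h.BinName;                           // address of the first character
        byte[] sample = Encoding.ASCII.GetBytes("SENSOR");
        for (int i = 0; i < sample.Length; i++) name[i] = sample[i];
        return FromFixed(name, 8);
    }

    public static void Main()
    {
        Console.WriteLine(Demo());   // prints SENSOR
    }
}
```

The code must be compiled with unsafe code enabled. The pointer is taken from a local variable, which lives on the stack, so no pinning is needed in this particular sketch.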

Commented:


Here is a quick snippet I use for sending binary formatted data out very quickly



    object myObj = "This is a string Object";

    using (Stream myOutStream = new FileStream(@"C:\temp\test_out.dat", FileMode.OpenOrCreate))
    {
        IFormatter myFmt = new BinaryFormatter();
        myFmt.Serialize(myOutStream, myObj);
        myOutStream.Flush();
    }



You can also use the IFormatter object to deserialize binary data into an object graph.  If your sensor data is already in a format consistent with your structs, then you should have no problem deserializing them into your structs.   However, if your sensor data is just a binary dump of data and you need to specify which data goes where (first 8 bytes is variable 1, next 14 bytes is variable 2, etc.) then you may have no choice but to deserialize each segment and do it step by step.
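For reference, the step-by-step route might look something like the sketch below. It is not the fast path the question asks for. The `Header` type, `FieldByFieldSketch`, `ReadFixedString` and the 22-byte sample layout are assumptions made for illustration, and the text fields are assumed to be NUL-padded ASCII.

```csharp
using System;
using System.IO;
using System.Text;

public static class FieldByFieldSketch
{
    // Hypothetical managed view of the first few SENSORDATA1 fields.
    public struct Header
    {
        public byte FileFormat;
        public byte SystemType;
        public string BinName;
        public float DataBias;
        public string BinVersion;
    }

    // Reads a fixed-width, NUL-padded single-byte string.
    static string ReadFixedString(BinaryReader reader, int width)
    {
        byte[] raw = reader.ReadBytes(width);
        if (raw.Length < width) throw new EndOfStreamException();
        int len = Array.IndexOf(raw, (byte)0);
        if (len < 0) len = width;               // no terminator: use the full width
        return Encoding.ASCII.GetString(raw, 0, len);
    }

    // Reads the fields one at a time, in file order.
    public static Header ReadHeader(BinaryReader reader)
    {
        Header h = new Header();
        h.FileFormat = reader.ReadByte();
        h.SystemType = reader.ReadByte();
        h.BinName = ReadFixedString(reader, 8);
        h.DataBias = reader.ReadSingle();       // BinaryReader assumes little-endian data
        h.BinVersion = ReadFixedString(reader, 8);
        return h;
    }

    public static void Main()
    {
        // An in-memory image stands in for the sensor file.
        byte[] image = new byte[22];
        image[0] = 7;
        Encoding.ASCII.GetBytes("SENSOR").CopyTo(image, 2);
        using (BinaryReader reader = new BinaryReader(new MemoryStream(image)))
        {
            Header h = ReadHeader(reader);
            Console.WriteLine(h.BinName);       // prints SENSOR
        }
    }
}
```

Every field costs a separate call and every string a separate allocation, which is the overhead the question is trying to avoid.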

Commented:
I also forgot about  "Marshal.StructureToPtr" and "Marshal.PtrToStructure".  You may want to look into these two methods to see if it might help.  

> StructureToPtr is useful for swapping one structure with another in the same memory location.

Commented:
Bah, NM... Like I ever read the question...  Ignore my last post.

Commented:
Since I get to go home in a few, I'll continue on with my idea.

My guess is that your ReadFile looks like

[DllImport("kernel32")] public static extern int ReadFile(HANDLE hFile, IntPtr lpBuffer, int nNumberOfBytesToRead, ref int lpNumberOfBytesRead, ref OVERLAPPED lpOverlapped);

change it to look like

[DllImport("kernel32")] public static extern int ReadFile(HANDLE hFile, SENSORDATA1 lpBuffer, int nNumberOfBytesToRead, ref int lpNumberOfBytesRead, ref OVERLAPPED lpOverlapped);

public bool ReadSensorDataLightning()
{
   SENSORDATA1 pSensorData;

    if (Win32.Kernel.ReadFile(handle, pSensorData, count, ref n, 0) == 0)
        return false;

   return true;
}

I'm sure you'll hit some problems off the bat since I'm just typing without testing it.  You may have to add a 'ref' before the SENSORDATA1 in the function call.  Or you may have to initialize pSensorData = new SENSORDATA1() before you call ReadFile.  The point is, let Interop handle the conversion for ya.

Going home, good luck!

Author

Commented:
Great comments!  I'm still testing these solutions, particularly the ideas from psdavis (the serialization idea has merit also).   The problem right now is that I'm passing a "ref SensorData" to fr.Read(), which then passes "ref SensorData" to ReadFile through Interop, at which point I'm getting a System.ExecutionEngineException.  This is probably a dumb error on my part... still scratching my head.  I'll comment back here in a while.
Commented:
Let me show you a real-world example.  This is from a TWAIN library that I use.  You can see that the same entry point is declared over 10 times, each with a different type for the last parameter.

// ------ DSM entry point DAT_ variants to DS:
[DllImport("twain_32.dll", EntryPoint="#1")]
internal static extern TwRC DS_UserInterface( [In, Out] TwIdentity origin, [In, Out] TwIdentity dest, TwDG dg, TwDAT dat, TwMSG msg, TwUserInterface pUserInterface );

[DllImport("twain_32.dll", EntryPoint="#1")]
internal static extern TwRC DSevent( [In, Out] TwIdentity origin, [In, Out] TwIdentity dest, TwDG dg, TwDAT dat, TwMSG msg, ref TwEvent evt );

[DllImport("twain_32.dll", EntryPoint="#1")]
internal static extern TwRC DSstatus( [In, Out] TwIdentity origin, [In] TwIdentity dest, TwDG dg, TwDAT dat, TwMSG msg, [In, Out] TwStatus dsmstat );

[DllImport("twain_32.dll", EntryPoint="#1")]
internal static extern TwRC DS_ImageNativeXfer( [In, Out] TwIdentity origin, [In] TwIdentity dest, TwDG dg, TwDAT dat, TwMSG msg, ref IntPtr hbitmap );

You won't have to assign unique names to the functions since the different parameter will make it unique enough.

> Here's some miscellaneous calls that are used where the application calls them with structures

   [StructLayout(LayoutKind.Sequential, Pack=2)]
   internal class TwUserInterface
   {
      public short      ShowUI;           // bool is strictly 32 bit, so use short
      public short      ModalUI;
      public IntPtr     ParentHand;
   }

         TwUserInterface pUserInterface   = new TwUserInterface( );
         pUserInterface.ShowUI            = 1;
         pUserInterface.ModalUI           = 1;
         pUserInterface.ParentHand        = m_hWnd;

         rc = DS_UserInterface( m_AppID, m_SourceId, TwDG.Control, TwDAT.UserInterface, TwMSG.EnableDS, pUserInterface );

> I hope that helps!!

Author

Commented:
Hi psdavis,

Thanks for the example.  I tried this technique with structs and it crashes every time with "An unhandled exception of type 'System.ExecutionEngineException' occurred in whatever.exe"  Defining my storage object as a class (like you did) does work, but the performance is REALLY bad, about 50% slower than the Marshal.PtrToStructure() call.  I'm not sure why.

_TAD_'s suggestion of using serialization doesn't work because the files were created by "C" and not the serializer.  Thus BinaryFormatter gives a Version Incompatibility error while reading the file.  And setting the fields one-by-one is what I'm trying to avoid here.

I'm still playing with psdavis' suggestion on my end.  I sure wish I understood what InteropServices is doing during my call to Win32's ReadFile().

Commented:


...eh, I figured as much.  

Although... now that I've thought about it a little bit, I think serializing/deserializing directly into the object is definitely the fastest way to go, however that obviously won't work in this case.

Unless...

What is the lifecycle of this file??

I assume some sensor array of some kind (apparently written in C) is collecting all kinds of data and writing it out to a binary file. This file is deposited somewhere and then picked up by your C# program where it is interpreted and used.

What if you were to create a filesystemwatcher service that monitored the directory where your binary file is deposited.  When the file is detected, the filesystemwatcher pulls in the file and then deciphers it (possibly over night, taking as much time as it needs) and then creates a NEW file that is a formatted and serialized version of your classes/structs.  You can then read/deserialize these files as you need them and all of the deciphering has already been done during down hours.  Therefore these files would be quickly and easily read ON DEMAND and virtually instantaneously!


...eh, just a thought anyway.


Author

Commented:
Thanks for the suggestion, _TAD_, but I have many terabytes of data here to process.  It would be easier to go back to "C" and get the "instantaneous" response by reading the data directly into a struct pointer (which is what I'm trying to do in C#).

I'm pulling my hair out over this problem!  I'm right on the edge of abandoning C# entirely and going back to my normal mix of C/C++ in an unmanaged environment.

Commented:


Unfortunately C and C# are different enough that I don't think you can read a file (created in C) directly into a program/struct created in C#...

But then again, I think you can go from Java to C#.... (??)

There has got to be a way to go from C to C# just as easily.

I'll see if I can find anything that might help.  If I can't find a solution in the next 24 hours it's probably not worth waiting for.

Commented:
Cheeze way.

Allocate your structures in C#
Create a C++ DLL that accepts the structure and fills your structure with ReadFile.
When the function call is complete, your C# structure should be filled.

Author

Commented:
psdavis - it looks like this won't work.  Interop will not let you blindly pass the address of a structure.  It "unboxes" the object on the [In] side and "boxes" the object on the [Out] side, and performance is HORRIBLE -- at least that's what appears to be going on.

Author

Commented:
I am doubling the points to 650 for someone who can show me code that:

1. Creates/Allocates a class or structure. The class/structure needs to contain ints, floats, bytes and strings (or byte arrays, or some way to represent a sequence of single-byte characters).
2. Reads a block of binary data from a disk file.  The layout of the data matches the layout of the class/structure, AND
3. Puts that binary data DIRECTLY into the class/structure without marshaling, boxing/unboxing, etc.

Author

Commented:
(looks like 500 is the max points - so I'm increasing to 500)
Commented:


rschaaf...  Have you tried using unsafe code?


something like the following:  (This uses pointers to pull the data out, it should be faster... but it may not be fast enough)



public struct YourStruct
{   // only value-types possible here:
    public int First;
    public long Second;
    public double Third;
}

static unsafe byte[] YourStructToBytes( YourStruct s )
{
    byte[] arr = new byte[ sizeof(YourStruct) ];
    fixed( byte* parr = arr )
    { *((YourStruct*)parr) = s; }
    return arr;
}

static unsafe YourStruct BytesToYourStruct( byte[] arr )
{
    if( arr.Length < sizeof(YourStruct) )
        throw new ArgumentException();

    YourStruct s;
    fixed( byte* parr = arr )
    { s = *((YourStruct*)parr); }
    return s;
}

// usage:
YourStruct s0;
s0.First = 1;
s0.Second = 2;
s0.Third = 3.5;
byte[] ab = YourStructToBytes( s0 );
// reverse:
YourStruct s1 = BytesToYourStruct( ab );

//

Author

Commented:
I've been working on this for a while, and I finally figured out a solution on my own.  The key was a call to GCHandle.AddrOfPinnedObject().

Author

Commented:
[[[ previous post incomplete.  Here is the complete post]]]


The trick is to avoid using Interop and Marshaling completely.  The only way I found to do this is to pin the object in memory and read straight into its address through a raw pointer.  Here is my solution:

Note that I re-wrote the structures to remove any reference to Marshaling.  I used structs of bytes in place of MarshalAs/ByValTStr attributes.

  [StructLayout(LayoutKind.Sequential, Pack=1, CharSet=CharSet.Ansi)]
  public struct string8 {     byte b0,b1,b2,b3,b4,b5,b6,b7; }

   [StructLayout(LayoutKind.Sequential, Pack=1, CharSet=CharSet.Ansi)]
  public struct SENSORDATA1 {
       public byte FileFormat;
       byte SystemType;
       public string8 BinName;    // 8 single-byte characters, no marshaling attribute
       float DataBias;
       public string8 BinVersion; // 8 single-byte characters, no marshaling attribute
       ...
  } // This struct is actually 1024 bytes.  Others have various sizes.

 
  public unsafe bool LoadDataHeader() {

    SENSORDATA1 SensHeaderImage = new SENSORDATA1(); // used as model object for Alloc below
    // Alloc boxes a copy of the struct and pins that copy so the GC cannot move it
    GCHandle gch = GCHandle.Alloc(SensHeaderImage, GCHandleType.Pinned);
    try {
      IntPtr xi = gch.AddrOfPinnedObject(); // address of the pinned copy
      SENSORDATA1 *SensHeaderPtr = (SENSORDATA1 *) xi; // also point a struct ptr to it

      ...
      int amt = 0;
      if (ReadFile(fr.handle, xi.ToPointer(), 1024, ref amt, IntPtr.Zero) == false) return false;
      // At this point, SensHeaderPtr points to a SENSORDATA1 struct read from disk
      ...
      return true;
    } finally {
      gch.Free(); // always unpin, even on an early return
    }
  }

    //--------
    [DllImport("kernel32", SetLastError=true)]
    static unsafe extern bool ReadFile(
      IntPtr hFile,                       // handle to file
      void *DataDest,                      // data buffer
      int NumberOfBytesToRead,            // number of bytes to read
      ref int pNumberOfBytesRead,         // number of bytes read
      IntPtr Overlapped                   // overlapped buffer (pointer-sized; IntPtr.Zero for none)
      );

I could probably use FileStream or BinaryReader to do my reads without having to use the Win32.Kernel32 ReadFile() call.
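A stream-based variant of the same idea might look like the sketch below. It assumes a current .NET runtime with `Span<T>`, `MemoryMarshal` and `Stream.Read(Span<byte>)`, all of which postdate this thread. `SensorRecordSketch`, `DirectReadSketch` and `TryRead` are hypothetical names, and the struct is a 22-byte stand-in for the real 1024-byte layout with fixed-size buffers in place of `string8`.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical 22-byte record; the real SENSORDATA1 is 1024 bytes.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public unsafe struct SensorRecordSketch
{
    public byte FileFormat;
    public byte SystemType;
    public fixed byte BinName[8];      // 8 single-byte characters, NUL padded
    public float DataBias;
    public fixed byte BinVersion[8];
}

public static class DirectReadSketch
{
    // Fills 'record' straight from the stream: no intermediate byte[] and no marshaling.
    // Returns false if the stream ends before a whole record has been read.
    public static bool TryRead(Stream stream, ref SensorRecordSketch record)
    {
        // View the struct's own memory as a span of bytes.
        Span<byte> bytes = MemoryMarshal.AsBytes(MemoryMarshal.CreateSpan(ref record, 1));
        int total = 0;
        while (total < bytes.Length)
        {
            int n = stream.Read(bytes.Slice(total));
            if (n == 0) return false;
            total += n;
        }
        return true;
    }

    public static unsafe void Main()
    {
        // One record's worth of bytes in memory stands in for a sensor file.
        byte[] image = new byte[sizeof(SensorRecordSketch)];
        image[0] = 7;                                        // FileFormat
        image[1] = 3;                                        // SystemType
        Encoding.ASCII.GetBytes("SENSOR").CopyTo(image, 2);  // BinName
        BitConverter.GetBytes(1.5f).CopyTo(image, 10);       // DataBias

        using (Stream stream = new MemoryStream(image))
        {
            SensorRecordSketch rec = new SensorRecordSketch();
            bool ok = TryRead(stream, ref rec);
            string name = new string((sbyte*)rec.BinName, 0, 6, Encoding.ASCII);
            Console.WriteLine(ok);
            Console.WriteLine(rec.FileFormat);
            Console.WriteLine(name);
        }
    }
}
```

Replacing the `MemoryStream` with a `FileStream` opened on the sensor file should give the same behavior. Whether it matches the throughput of the raw `ReadFile` call is something to measure rather than assume.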

On my 850MHz IBM Thinkpad A21p notebook computer, using this code, I can read binary structures from a 59MB file at 202MB per second, when the reading was done repeatedly in a tight loop so that the data was coming from the disk cache.  Some of the Marshaling/Interop solutions were between 10 and 30 times slower.

I'm going to split the points between psdavis and _TAD_ for giving me good suggestions.  They indirectly helped me find this solution faster.

Commented:
Thanks for responding back with the solution rschaaf, you've taught me something today.

Commented:


That is some groovy code there rschaaf.


You mentioned possibly using FileStream and BinaryReader instead of the Win32.Kernel32 ReadFile() method.  Here is a quick snippet that I use as a template when I want to read very large files.




    private void bufferStreamPlayAround()
    {
        string myStr;

        Stream rdStream = new FileStream(@"C:\temp\test.txt", FileMode.Open);
        BufferedStream rdBuff = new BufferedStream(rdStream);
        StreamReader rdRead = new StreamReader(rdBuff, Encoding.ASCII, false, 1024);

        myStr = rdRead.ReadLine();
        Console.WriteLine(myStr);

        rdRead.Close();
        rdBuff.Close();
        rdStream.Close();

        myStr = "Adding this Line to the end of the File";

        Stream wtStream = new FileStream(@"C:\temp\test.txt", FileMode.Append);
        BufferedStream wtBuff = new BufferedStream(wtStream);
        StreamWriter wtWrite = new StreamWriter(wtBuff, Encoding.ASCII, 1024);

        wtWrite.AutoFlush = true;

        wtWrite.WriteLine("\n");
        wtWrite.WriteLine(myStr);

        wtWrite.Close();
        wtBuff.Close();
        wtStream.Close();
    }



In a nutshell this code reads 1024 bytes into a buffer, and then the buffer gets manipulated.  Then the same thing in reverse, 1024 bytes are written to the buffer and then the buffer gets written out to a file.  The reason I am posting this is to point out the power of multi-threading.  You said that you are going to be reading terabytes of data and processing it.  By using a reading thread with asynchronous callbacks, and a processing thread, you could be reading part of the file while your program processes another part.  Because you are loading a struct, it is going to be a little different than interpreting the data as ASCII text, but I think the mechanics are the same.
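One way to picture the read-while-processing idea is a small producer/consumer pipeline. The sketch below assumes a current .NET runtime with `Task` and `BlockingCollection<T>`, which are newer than this thread. `ReadProcessPipelineSketch`, `Run` and the queue capacity of 4 are choices made for illustration, not part of the original code.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class ReadProcessPipelineSketch
{
    // Reads blocks on one task while the calling thread processes them.
    // Returns the number of bytes handed to 'process'.
    public static long Run(Stream input, int blockSize, Action<byte[], int> process)
    {
        // Bounded so the reader cannot run far ahead of the processor.
        using (var queue = new BlockingCollection<(byte[] Buffer, int Count)>(boundedCapacity: 4))
        {
            Task reader = Task.Run(() =>
            {
                try
                {
                    while (true)
                    {
                        byte[] buffer = new byte[blockSize];
                        int n = input.Read(buffer, 0, buffer.Length);
                        if (n == 0) break;          // end of file
                        queue.Add((buffer, n));     // blocks while the queue is full
                    }
                }
                finally
                {
                    queue.CompleteAdding();         // always unblock the consumer
                }
            });

            long total = 0;
            foreach (var block in queue.GetConsumingEnumerable())
            {
                process(block.Buffer, block.Count);
                total += block.Count;
            }

            reader.Wait();   // surface any exception thrown by the reader
            return total;
        }
    }

    public static void Main()
    {
        // 10,000 zero bytes stand in for a sensor file.
        byte[] data = new byte[10000];
        long seen = Run(new MemoryStream(data), 1024, (buffer, count) => { /* decode records here */ });
        Console.WriteLine(seen);   // prints 10000
    }
}
```

Allocating a fresh buffer per block keeps the sketch simple. For terabytes of input a small pool of reusable buffers would probably be the better choice.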


Here are some links to a single conversation about reading very large text files  (it's in several parts/links, I don't know why)

part 1:
http://www.mail-archive.com/c_sharp@p2p.wrox.com/msg03066.html

part 2:  (with links to part 2a, and 2b on the bottom)
http://www.mail-archive.com/c_sharp@p2p.wrox.com/msg03067.html

part 3: (with links to parts 3a, 3b, 3c, 3d, etc.. on the bottom)
http://www.mail-archive.com/c_sharp@p2p.wrox.com/msg03072.html



Here's a quick snippet of how to read a file (using multiple threads) with an asynchronous callback


//class constructor
private void ReadLargeFile()
{
    ThreadStart readStartPoint = new ThreadStart(ReadStartPoint);
    Thread readThread = new Thread(readStartPoint);
    readThread.Name = "Async Read Thread";
    readThread.Priority = System.Threading.ThreadPriority.BelowNormal;
    readThread.Start();                
}

private void ReadStartPoint()
{
    for( ;; )
    {
        ThreadStart dataStartPoint = new ThreadStart(DataStartPoint);
        Thread dataThread = new Thread(dataStartPoint);
        dataThread.Name = "Async Data Thread";
                               
        dataThread.Start(); //will lock var in here

        AsyncReadFunction();
       
        if( FileComplete )
            break;
    }
}

private void DataStartPoint()
{
    lock(returnBuffer) // locking gDataStreamVar
    {
        rtb.Text += Encoding.ASCII.GetString( returnBuffer, 0, bytesRead);
    }

    // we are finished with our data thread; it ends when this method returns
}

private void AsyncReadFunction()
{
    SetReadState = new AsyncCallback(this.OnCompletedRead);
   
    fileStream.BeginRead( returnBuffer, 0, 1048576, SetReadState, null );
}

private void OnCompletedRead( IAsyncResult asyncResult )
{
    bytesRead = fileStream.EndRead( asyncResult );

    if( bytesRead == 0 )
        FileComplete = true;
}
