AlphaLolz asked:

Searching Binary data in VB

I have a need to "split" through a file that contains an XML header followed by a PDF document (which can be binary).

So, I need to be able to open and read this into memory (some sort of structure), but then find where the XML ends and the binary starts.

We need to do this in VB.Net (that's the preference) on Windows 2003.

I'm looking for code that can do this.

Göran Andersson answered:

[Answer hidden behind the site's sign-up prompt.]

AlphaLolz replied:


What we have here is essentially a file with a PDF in it and a bunch of attribute data (in XML format) in front of that.  We understand our XML structure quite well and what the end tag looks like, and we also know what the start of the PDF should look like (the signature).

What we don't understand is that the file is (overall) a binary file, with non-binary up front.  We don't understand how to work with the XML string functions on this if we've read it as a byte array.  They won't work (will they)?
Also, how would we write this back out once we have this position?

Göran Andersson replied:

If you have questions about the solution, you should not accept it and downgrade it. Just reply and wait for clarification. You can post a message in the Community Support zone to reopen the question.

If you know the name of the root element (and that it's not reused as a child element), you can just use the code from the comment "find end element" and on.

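If you know the start of the PDF data, you also have the alternative of looking for that. If the signature is, for example, 42, 13, 37, 0:

' Scan for the first occurrence of the 4-byte signature, without reading past the end of the array.
Dim pos3 As Integer = 0
While pos3 <= data.Length - 4 AndAlso Not (data(pos3) = 42 AndAlso data(pos3 + 1) = 13 AndAlso data(pos3 + 2) = 37 AndAlso data(pos3 + 3) = 0)
   pos3 += 1
End While
' If pos3 > data.Length - 4 at this point, the signature was not found.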
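
The scan above can be wrapped in a small reusable helper. This is only a sketch: `IndexOfBytes` is a hypothetical name, not a framework method, and it assumes the whole file has already been read into a byte array. PDF files normally begin with the ASCII characters %PDF (bytes 37, 80, 68, 70), so that is the pattern used in the usage note.

```vbnet
Imports System

Module ByteSearch
    ' Returns the index of the first occurrence of pattern in data, or -1 if absent.
    ' IndexOfBytes is a hypothetical helper name, not part of the .NET Framework.
    Public Function IndexOfBytes(ByVal data As Byte(), ByVal pattern As Byte()) As Integer
        If pattern.Length = 0 Then Return 0
        ' The upper bound keeps data(i + j) inside the array.
        For i As Integer = 0 To data.Length - pattern.Length
            Dim matched As Boolean = True
            For j As Integer = 0 To pattern.Length - 1
                If data(i + j) <> pattern(j) Then
                    matched = False
                    Exit For
                End If
            Next
            If matched Then Return i
        Next
        Return -1
    End Function
End Module
```

It could be called as `IndexOfBytes(data, New Byte() {37, 80, 68, 70})`. If the XML header could itself contain the text %PDF, it is probably safer to start from the XML end tag instead.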

> What we don't understand is that the file is (overall) a binary file, with non-binary up front.

All files are binary. The file system doesn't make any distinction between text and binary data. A text file is simply text encoded into binary data, stored in a file.

In your case you have one part of the file that needs decoding and not the other. You just have to find out where to split the data, then decode the xml part.

> We don't understand how to work with the XML string functions on this if we've read it as a byte array.

Once you have decoded the xml data into a string, as I have already shown you, you can use it as usual. For example load it into an XmlDocument using the LoadXml method.
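
As a rough sketch of that step, assuming the byte count of the XML part is already known (`xmlLength` and `LoadXmlPart` are hypothetical names) and assuming the XML was written as UTF-8:

```vbnet
Imports System.Text
Imports System.Xml

Module XmlPart
    ' Decodes the first xmlLength bytes of data and parses them as XML.
    ' Assumption: the XML is UTF-8; substitute the encoding it was actually written with.
    Public Function LoadXmlPart(ByVal data As Byte(), ByVal xmlLength As Integer) As XmlDocument
        Dim xmlText As String = Encoding.UTF8.GetString(data, 0, xmlLength)
        ' A UTF-8 byte order mark decodes to U+FEFF, which LoadXml rejects, so drop it.
        xmlText = xmlText.TrimStart(ChrW(&HFEFF))
        Dim doc As New XmlDocument()
        doc.LoadXml(xmlText)
        Return doc
    End Function
End Module
```

Only the XML bytes are decoded here; the PDF bytes after `xmlLength` are never turned into a string.
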
AlphaLolz replied:

Thanks so much.

Sorry, one last follow-on.

From what you've said, it seems we could:

1 - read into a byte array
2 - convert the entire array into a string (even though not all of it is text)
3 - do a substring on the new string for the end tag, knowing that it would never get to the binary data before it found the end tag
4 - get the position from there and split/save the front to an XML file and the last to a binary file

Is that right?
Sorry, I mean IndexOf.

Göran Andersson replied:

Yes, that is correct.

It's safe to decode the binary data to a string using the ASCII encoding, as long as you don't expect the part of the string that came from the binary data to be useful (i.e. possible to encode back to binary data).

You don't have to decode the entire array; you only have to decode enough of it to be sure that you get all the XML. PDF files can get rather large, so you should use an upper limit for how much you decode. If you are uncertain of the size of the XML, you can use a rather high value, perhaps 100 KB, to make sure that you get all the XML while still protecting yourself from decoding several MB of PDF data.

Note that the XML end tag might be followed by one or more line breaks that you should skip before reaching the binary data. See the code that I posted before.
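
The bounded decode, the end-tag search and the line-break skip might be combined roughly as follows. This is a sketch under assumptions: `FindBinaryStart`, `endTag` and `maxScan` are hypothetical names, and the XML is assumed to be in an ASCII-compatible encoding such as UTF-8 with an end tag that contains only ASCII characters.

```vbnet
Imports System
Imports System.Text

Module XmlBoundary
    ' Returns the byte offset where the binary data starts, or -1 if endTag
    ' is not found within the first maxScan bytes.
    ' endTag (for example "</root>") and maxScan are supplied by the caller.
    Public Function FindBinaryStart(ByVal data As Byte(), ByVal endTag As String, ByVal maxScan As Integer) As Integer
        Dim scanLength As Integer = Math.Min(data.Length, maxScan)
        ' ASCII decoding yields exactly one character per byte,
        ' so positions in the string equal offsets in the byte array.
        Dim text As String = Encoding.ASCII.GetString(data, 0, scanLength)
        Dim tagPos As Integer = text.IndexOf(endTag, StringComparison.Ordinal)
        If tagPos < 0 Then Return -1
        Dim pos As Integer = tagPos + endTag.Length
        ' Skip any CR (13) or LF (10) bytes between the end tag and the binary data.
        While pos < data.Length AndAlso (data(pos) = 13 OrElse data(pos) = 10)
            pos += 1
        End While
        Return pos
    End Function
End Module
```

A call such as `FindBinaryStart(data, "</root>", 100 * 1024)` would scan at most 100 KB; the tag name here is only a placeholder for the real root element.
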
To clarify: You should not get either the XML or the PDF data from the string that you decoded using Encoding.ASCII.GetString. You only use that string to determine the length of the xml data.

You need to decode the XML data with the encoding that was used to encode the XML in the first place (which is usually UTF-8).

The length of the XML data (the length in the ASCII string, not the properly decoded string) determines which part of the byte array contains the PDF data.
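
For writing the two parts back out, one possible sketch is to copy the raw bytes straight from the array, so that neither part goes through a string at all. `WriteParts`, `xmlLength`, `pdfStart` and the two paths are hypothetical names; the offsets are assumed to have been found as described above.

```vbnet
Imports System.IO

Module FileSplitter
    ' Writes bytes 0 .. xmlLength - 1 to xmlPath and bytes pdfStart .. end to pdfPath.
    ' Both parts are copied as raw bytes, so no encoding round trip can alter them.
    Public Sub WriteParts(ByVal data As Byte(), ByVal xmlLength As Integer, ByVal pdfStart As Integer, ByVal xmlPath As String, ByVal pdfPath As String)
        Using xmlOut As New FileStream(xmlPath, FileMode.Create, FileAccess.Write)
            xmlOut.Write(data, 0, xmlLength)
        End Using
        Using pdfOut As New FileStream(pdfPath, FileMode.Create, FileAccess.Write)
            pdfOut.Write(data, pdfStart, data.Length - pdfStart)
        End Using
    End Sub
End Module
```

The input array could come from `File.ReadAllBytes`. `xmlLength` and `pdfStart` differ only when line breaks sit between the end tag and the PDF data.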