PCM/WAV file format information, please.

Posted on 1998-12-11
Medium Priority
Last Modified: 2013-12-03

Hello!  I'm currently writing a program that will be reading in and working with PCM/WAV files.  I already know the header formats, gleaned from http://www.intersrv.com/~dcross/wavio.html but I need some help with regards to the meaning of the data portion of the file.  I.e. how do the numbers reflect the waveform of the sound being generated at that time (in that sample), or some such relationship.


Question by:ap9

Accepted Solution

trillo earned 800 total points
ID: 1417094
Ok... here it is:
All wave data is stored in 8-bit bytes. The bytes of multiple-byte values are stored with the low-order (ie, least significant) byte first. Data bits are as follows (ie, shown with bit numbers on top):

                   7  6  5  4  3  2  1  0
           char: | msb               lsb |

                      7  6  5  4  3  2  1  0 15 14 13 12 11 10  9  8
 short (2 bytes): | lsb     byte 0        |       byte 1      msb |

A WAVE file is a collection of a number of different types of chunks. There is a required Format ("fmt ") chunk which contains important parameters describing the waveform, such as its sample rate. The Data chunk, which contains the actual waveform data, is also required. All other chunks are optional. Among the other optional chunks are ones which define cue points, list instrument parameters, store application-specific information, etc.
All applications that use WAVE must be able to read the 2 required chunks and can choose to selectively ignore the optional chunks. A program that copies a WAVE should copy all of the chunks in the WAVE, even those it chooses not to interpret.
There are no restrictions upon the order of the chunks within a WAVE file, with the exception that the Format chunk must precede the Data chunk.
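To make the chunk layout concrete, here is a minimal sketch (Python used purely for illustration) of walking the chunks of a WAVE file. Each chunk is a 4-character ID followed by a 4-byte little-endian length and then the chunk data; odd-length chunks are padded to a word boundary per the RIFF convention:

```python
import struct

def iter_chunks(path):
    """Walk the chunks of a RIFF/WAVE file, yielding (chunk_id, data) pairs."""
    with open(path, "rb") as f:
        riff, _size, wave = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave != b"WAVE":
            raise ValueError("not a WAVE file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            chunk_id, chunk_size = struct.unpack("<4sI", header)
            data = f.read(chunk_size)
            if chunk_size % 2:      # chunks are word-aligned; skip the pad byte
                f.seek(1, 1)
            yield chunk_id, data
```

A reader built this way can pull out the "fmt " and "data" chunks it needs and simply ignore any optional chunks it does not understand.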

Very Important: Sample Points and Sample Frames
A large part of interpreting WAVE files revolves around the two concepts of sample points and sample frames.
A sample point is a value representing a sample of a sound at a given moment in time. For waveforms with greater than 8-bit resolution, each sample point is stored as a linear, 2's-complement value which may be from 9 to 32 bits wide (as determined by the wBitsPerSample field in the Format chunk, assuming uncompressed PCM format). For example, each sample point of a 16-bit waveform would be a 16-bit word (ie, two 8-bit bytes) where 32767 (0x7FFF) is the highest value and -32768 (0x8000) is the lowest value. For 8-bit (or less) waveforms, each sample point is a linear, unsigned byte where 255 is the highest value and 0 is the lowest value. Obviously, this signed/unsigned discrepancy between 8-bit and larger-resolution waveforms was one of those "oops" scenarios where someone decided to change the sign sometime after 8-bit wave files were common but before 16-bit wave files had appeared. Remember: 8-bit sound is unsigned and 16-bit sound is signed. This is important when building your buffers.
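The unsigned-8-bit versus signed-16-bit distinction is easy to get wrong, so here is a sketch (Python, for illustration) of normalizing either kind of sample point onto a common signed range:

```python
def sample_to_float(raw, bits_per_sample):
    """Map one PCM sample point onto the range [-1.0, 1.0)."""
    if bits_per_sample <= 8:
        # 8-bit WAVE samples are unsigned: silence sits at 128, not 0
        return (raw - 128) / 128.0
    else:
        # 16-bit (and wider) samples are signed two's complement
        return raw / float(1 << (bits_per_sample - 1))
```

Note that silence is the byte 128 in an 8-bit file but the value 0 in a 16-bit file; mixing the two conventions up produces loud clicks or a DC offset.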
Because most CPUs' read and write operations deal with 8-bit bytes, it was decided that a sample point should be rounded up to a size which is a multiple of 8 bits when stored in a WAVE. This makes the WAVE easier to read into memory. If your ADC produces a sample point from 1 to 8 bits wide, a sample point should be stored in a WAVE as an 8-bit byte (ie, unsigned char). If your ADC produces a sample point from 9 to 16 bits wide, a sample point should be stored in a WAVE as a 16-bit word (ie, signed short). If your ADC produces a sample point from 17 to 24 bits wide, a sample point should be stored in a WAVE as three bytes. If your ADC produces a sample point from 25 to 32 bits wide, a sample point should be stored in a WAVE as a 32-bit doubleword (ie, signed long). Etc.

Furthermore, the data bits should be left-justified, with any remaining (ie, pad) bits zeroed. For example, consider the case of a 12-bit sample point. It has 12 bits, so the sample point must be saved as a 16-bit word. Those 12 bits should be left-justified so that they become bits 4 to 15 inclusive, and bits 0 to 3 should be set to zero. Shown below is how a 12-bit sample point with a value of binary 1010 0001 0111 is formatted left-justified as a 16-bit word.

___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
| 1   0   1   0   0   0   0   1   0   1   1   1   0   0   0   0 |
 <---------------------------------------------> <------------->
        12 bit sample point is left justified          rightmost
                                                      4 bits are
                                                      zero padded

But note that, because the WAVE format uses Intel little-endian byte order, the least significant byte is stored first in the wave file, like so:

 ___ ___ ___ ___ ___ ___ ___ ___   ___ ___ ___ ___ ___ ___ ___ ___
|   |   |   |   |   |   |   |   | |   |   |   |   |   |   |   |   |
| 0   1   1   1   0   0   0   0 | | 1   0   1   0   0   0   0   1 |
|___|___|___|___|___|___|___|___| |___|___|___|___|___|___|___|___|
 <-------------> <------------->   <----------------------------->
   bits 4 to 7    pad bits 0 to 3           bits 8 to 15
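Putting the two rules together (left-justify, then write the low byte first), here is a sketch (Python, for illustration) of packing one 12-bit sample into its stored 16-bit form, using the same example value as in the diagrams above:

```python
def pack_12bit_sample(value):
    """Left-justify a 12-bit sample in a 16-bit word, emitted little-endian."""
    assert 0 <= value < (1 << 12)
    word = value << 4                 # bits 4..15 carry the sample, bits 0..3 are pad
    return bytes([word & 0xFF,        # low byte stored first (little-endian)
                  (word >> 8) & 0xFF])

pack_12bit_sample(0b101000010111)     # → b'\x70\xa1', as in the diagram
```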

For multichannel sounds (for example, a stereo waveform), single sample points from each channel are interleaved. For example, assume a stereo (ie, 2 channel) waveform. Instead of storing all of the sample points for the left channel first, and then storing all of the sample points for the right channel next, you "mix" the two channels' sample points together. You would store the first sample point of the left channel. Next, you would store the first sample point of the right channel. Next, you would store the second sample point of the left channel. Next, you would store the second sample point of the right channel, and so on, alternating between storing the next sample point of each channel. This is what is meant by interleaved data; you store the next sample point of each of the channels in turn, so that the sample points that are meant to be "played" (ie, sent to a DAC) simultaneously are stored contiguously.

The sample points that are meant to be "played" (ie, sent to a DAC) simultaneously are collectively called a sample frame. In the example of our stereo waveform, every two sample points makes up another sample frame. This is illustrated below for that stereo example.

      sample       sample              sample
      frame 0      frame 1             frame N
     _____ _____ _____ _____         _____ _____
    | ch1 | ch2 | ch1 | ch2 | . . . | ch1 | ch2 |
    |_____|_____|_____|_____|       |_____|_____|
    |     | = one sample point

For a monophonic waveform, a sample frame is merely a single sample point (ie, there's nothing to interleave). For multichannel waveforms, you should follow the conventions shown below for which order to store channels within the sample frame. (ie, Below, a single sample frame is displayed for each example of a multichannel waveform).

      channels       1         2
                 _________ _________
                | left    | right   |
      stereo    |         |         |

                     1         2         3
                 _________ _________ _________
                | left    | right   | center  |
      3 channel |         |         |         |

The sample points within a sample frame are packed together; there are no unused bytes between them. Likewise, the sample frames are packed together with no pad bytes.
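As a sketch of the interleaving rule above (Python used for illustration, assuming 16-bit samples), de-interleaving a run of packed sample frames back into per-channel lists looks like this:

```python
import struct

def split_channels(data, num_channels):
    """De-interleave 16-bit PCM sample frames into one tuple per channel."""
    samples = struct.unpack("<%dh" % (len(data) // 2), data)
    # channel k's sample points sit at positions k, k+n, k+2n, ...
    return [samples[ch::num_channels] for ch in range(num_channels)]
```

For a stereo file, `split_channels(data, 2)` returns the left channel's values followed by the right channel's, ready to graph or process separately.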

Voila, I hope I could help you.

Note: To see the diagrams properly, copy them into Notepad.

Author Comment

ID: 1417095
Whoa, ok, good information there, but my actual question is how do you interpret the SAMPLE frames?  Like, if I wanted to draw out the waveform, what would I need to do to extract that information from a sample frame (or frames).

Or maybe, by way of example, suppose I have a simple wave (i.e. a sine wave -- sin(x)) sampled at 44.1kHz mono, what happens when I encode it into WAV format (not worrying about the headers, just how it is represented in the data chunk).

I've increased the point value of the question to 200, as I do appreciate your help!


Author Comment

ID: 1417096
Oh, a small point -- for the data chunk, there is a string that consists of "data".  Now, is there a 4 byte value after this that represents the length of the data chunk, or not?  I've read conflicting reports about this.  Thanks.


Expert Comment

ID: 1417097
How do you interpret the sample frames?.. It depends on your meaning of "interpreting" a sample frame.
First of all you should base your code on the format header. In this piece of text we'll work with a simple example: a wave file with:
SampleRate = 11025
BitsPerSample = 16
Channels = 2 (stereo)
Data bytes = 1024

Before going on, you're right!... After each chunk ID there is a 4-byte value representing the length in bytes of the chunk (not including the chunk ID string or the length field itself).

In our example we can see that each sample is represented by a 16-bit value, so a "char" data type won't be enough to store the samples, so we choose 16-bit integers to store our values (remember such an integer needs 2 bytes in memory). We also see that our wave file is stereo, which means that we will have 2 integer values per sample frame. In conclusion, each sample frame is formed by 4 bytes = 2 integers = 2 sample points (the first int for the left channel, and the second for the right).
Our data chunk says that our wave file has 1024 bytes, which means that we have 1024/4 = 256 sample frames, ie, 256 values for the left channel and 256 values for the right channel.
I've chosen a stereo example here because it's a little more difficult (but not too much). In this case, if you want to graph the waveform, you should make two graphs, one for each channel (remember that the left speaker can play completely different music than the right speaker)... If you want to make only one graph, you can calculate the average of the left and right sample points.
Of course, for mono sound you avoid all this trouble.
It's not very difficult, as you can see... You just read values according to the wave format... In stereo you read: left1, right1, left2, right2, left3, right3, left4, right4, etc... In mono you read: value1, value2, value3, value4, etc... and finally you draw those values.
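And to answer your sine-wave example directly: encoding sin(x) at 44.1 kHz mono just means evaluating the sine at each sample instant and storing the scaled result as consecutive 16-bit signed words. A sketch (Python, for illustration; the frequency and amplitude are arbitrary choices) of building such a data chunk body:

```python
import math
import struct

def sine_data_chunk(freq_hz, duration_s, sample_rate=44100, amplitude=0.8):
    """Build the body of a 'data' chunk: 16-bit signed mono sine samples."""
    n = int(duration_s * sample_rate)
    peak = int(amplitude * 32767)     # scale [-1, 1] into the signed 16-bit range
    samples = [int(peak * math.sin(2 * math.pi * freq_hz * i / sample_rate))
               for i in range(n)]
    return struct.pack("<%dh" % n, *samples)
```

The first sample is sin(0) = 0, so the chunk starts with the bytes 00 00, and each later sample is just the next point on the curve; drawing the waveform is the reverse of this process.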


Author Comment

ID: 1417098
Ah, I see!  Yes, you're right, it isn't very hard -- I was thinking that there was more involved than just that.  Thank you very much!  I accept your answer.

