Tag: reading byte arrays

Dropping the BOM … Removing ‘Invalid’ characters at the beginning of a string!

Ever encountered some strange characters at the start of a string representing an XML document you’ve just read from a file? Chances are you forgot to take into account the encoding that was used to write the file to disk. It is bad practice to just assume UTF-8 was used!

The strange characters are in fact the Byte Order Mark (BOM).

The BOM, sometimes called the preamble, is a sequence of bytes at the beginning of your document; it takes two to four bytes depending on the encoding used (see list below). The BOM allows applications to correctly identify the Unicode encoding that was used for your document.

If none of this rings a bell, you should really read the primer by Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Possible BOMs:

UTF-8                  = EF BB BF
little-endian, UTF-16  = FF FE
big-endian, UTF-16     = FE FF
little-endian, UTF-32  = FF FE 00 00
big-endian, UTF-32     = 00 00 FE FF
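
The list above translates fairly directly into a check on the first bytes of a buffer. What follows is only a minimal sketch: the names BomSniffer, DetectBom and StartsWith are made up for illustration and are not part of the .NET framework. The one subtlety is the order of the tests: the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM, so the longer pattern has to be tried first.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class BomSniffer
{
    // Returns the encoding implied by a leading BOM, or null when no BOM is found.
    public static Encoding DetectBom(byte[] data)
    {
        // UTF-32 must be tested before UTF-16: FF FE 00 00 also starts with FF FE.
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return new UTF32Encoding(false, true);
        if (StartsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return new UTF32Encoding(true, true);
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return new UTF8Encoding(true);
        if (StartsWith(data, 0xFF, 0xFE)) return new UnicodeEncoding(false, true);
        if (StartsWith(data, 0xFE, 0xFF)) return new UnicodeEncoding(true, true);
        return null;
    }

    // True when data begins with exactly the given bytes.
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A null result only means that no BOM was found; the BOM is optional, so the bytes may still be in any of these encodings.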

Below you’ll find a code example you can use to identify and remove these bytes when converting a byte array, containing an XML document, to a string.

using System;
using System.IO;
using System.Text;
using System.Xml;

...

public string GetMessage(byte[] message)
{
    try
    {
        /* We have a bootstrap problem here: to know the encoding of the
           message we have to look inside the message and read its XML
           declaration. So we let an XmlDocument parse the message and
           read the declared encoding from it.    */

        // Load byte[] into XmlDocument (assumes valid XML!).
        MemoryStream inputStream = new MemoryStream(message);
        XmlDocument doc = new XmlDocument();
        doc.Load(inputStream);

        // Identify the encoding from the XML declaration (assumed to be the first node).
        XmlDeclaration dcl = (XmlDeclaration)doc.FirstChild;
        Encoding enc = Encoding.GetEncoding(dcl.Encoding);
        int preambleSize = enc.GetPreamble().Length;

        /* Drop the preamble bytes if they match the encoding's preamble
           definition.
           The preamble bytes (the BOM) are optional, so they might
           not be there! */
        byte[] preamble = enc.GetPreamble();
        bool preambleMatches = message.Length >= preambleSize;
        for (int i = 0; preambleMatches && i < preambleSize; i++)
        {
            preambleMatches = message[i] == preamble[i];
        }

        byte[] cleanedMessage;
        if (preambleMatches)
        {
            // Remove the preamble from the message.
            cleanedMessage = new byte[message.Length - preambleSize];
            Array.Copy(message, preambleSize, cleanedMessage, 
                       0, message.Length - preambleSize);
        }
        else
        {
            // The preamble was not added to the message so nothing to remove.
            cleanedMessage = message;
        }

        return enc.GetString(cleanedMessage);
    }
    catch (Exception)
    {
        // Encoding could not be detected, so we assume UTF-8 encoding.
        // Note: GetString does not strip a BOM, so one may remain in the result.
        return Encoding.UTF8.GetString(message);
    }
}
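
If the declared encoding in the XML declaration is not needed, a shorter route may be to let StreamReader do the detection. The sketch below is an alternative rather than a replacement for the method above; BomAwareReader and ReadText are hypothetical names. It relies on the StreamReader constructor that takes a detectEncodingFromByteOrderMarks flag, which makes the reader pick the encoding from a leading BOM and skip those bytes. It assumes UTF-8 is an acceptable fallback when no BOM is present, which will be wrong for BOM-less files in other encodings.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class BomAwareReader
{
    // Decodes a byte array, letting StreamReader detect and skip a leading BOM.
    public static string ReadText(byte[] message)
    {
        using (var stream = new MemoryStream(message))
        // Encoding.UTF8 is only the fallback used when no BOM is found.
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Unlike GetMessage, this never parses the XML, so it also works on input that is not well-formed, but it ignores whatever encoding the document itself declares.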

Dirk's Daily Digits

This blog groups all the information I like to share about my daily encounters with information technology related problems.
