Fake binary trees
... rants, ramblings and occasional good idea ...

Generating PDF documents on the fly with iTextSharp

Recently, I was bitten in the ass while trying to generate PDF documents on the fly inside an ASP.NET web application, so that they could be downloaded from a web site.

First, some background to describe the problem:

In short, the idea was that a user would click a link, and the application would generate the PDF in memory (without a backing file on disk) and write it to the Response stream, which results in the document being downloaded to the client machine. PDF generation is done using the excellent iTextSharp open-source library, which gives you all that you need to build PDF documents - at least if your requirements are as modest as mine.

Let's see what this method looks like, stripped of all unnecessary details.

...
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
...

void Download()
{
    Document document = new Document();    // instantiate an iTextSharp.text.Document
    MemoryStream mem = new MemoryStream(); // PDF data will be written here
    PdfWriter.GetInstance(document, mem);  // tie a PdfWriter instance to the stream

    document.Open();
    ...                 // adding content to iTextSharp Document instance
    ...                 // tutorial: http://itextsharp.sourceforge.net/tutorial/index.html
    document.Close();   // automatically closes the attached MemoryStream

    byte[] docData = mem.GetBuffer(); // get the generated PDF as raw data
       
    // write the document data to response stream and set appropriate headers:
    Response.AppendHeader("Content-Disposition", "attachment; filename=testdoc.pdf");
    Response.ContentType = "application/pdf";
    Response.BinaryWrite(docData);
    Response.End();
}

Initially, everything looked great, but as the documents grew more complex, some of them turned out to be corrupted when opened on the client machine. At the same time, if they were created through a backing FileStream instead of a MemoryStream, everything was fine.

It turned out (after a couple of hours of guessing and debugging) that the problem lies in how the MemoryStream and Document objects interact. When data is written to a MemoryStream instance, the underlying buffer grows in whole blocks rather than byte by byte (how large the blocks are is an implementation detail which might change, so don't rely on it). This means that the capacity of the underlying buffer is always at least equal to the real length of the written data, but typically it is larger. When the Document instance is closed, it also closes the stream. When you then ask for the buffer (mem.GetBuffer()), the whole buffer is retrieved, including the unused padding bytes, which are set to zero. When this buffer is transferred to the client machine, the excess padding bytes result in a corrupted document.
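To make the gap between written length and buffer size visible, here is a minimal sketch that needs nothing but System.IO. The class and method names (BufferPaddingDemo, Report) are made up for illustration, and the exact buffer size you see depends on the framework's growth strategy.

```csharp
using System;
using System.IO;

static class BufferPaddingDemo
{
    // Hypothetical helper: writes `count` bytes one at a time and reports
    // both the logical length and the size of the internal buffer.
    public static void Report(int count, out long length, out int bufferLength)
    {
        MemoryStream mem = new MemoryStream();
        for (int i = 0; i < count; i++)
            mem.WriteByte((byte) i);

        length = mem.Length;                   // bytes actually written
        bufferLength = mem.GetBuffer().Length; // whole internal buffer, padding included
    }

    static void Main()
    {
        long length;
        int bufferLength;
        Report(10, out length, out bufferLength);
        Console.WriteLine("written: {0}, buffer: {1}", length, bufferLength);
    }
}
```

Anything between the written length and the buffer length is padding, and that padding is what ends up at the tail of the downloaded file when the raw buffer is sent.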

The real length of the written data cannot be read before document.Close(), because the document has not been fully flushed to the stream yet. It cannot be read from the Length property after the Close() call either, because the stream has already been disposed by then.
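The behaviour after Close() can be checked in isolation with the sketch below. The names (ClosedStreamDemo, LengthThrowsAfterClose) are hypothetical. As an aside, it also calls MemoryStream.ToArray(), which is documented to keep working on a closed stream and returns a copy of only the written bytes. That might be a simpler way out, but I have not tried it against the PDF scenario described here.

```csharp
using System;
using System.IO;

static class ClosedStreamDemo
{
    // Hypothetical helper: returns true if reading Length on a closed
    // MemoryStream throws, and reports how many bytes ToArray() still returns.
    public static bool LengthThrowsAfterClose(out int copiedLength)
    {
        MemoryStream mem = new MemoryStream();
        mem.Write(new byte[] { 1, 2, 3 }, 0, 3);
        mem.Close();

        // ToArray() copies just the written bytes, even after Close()
        copiedLength = mem.ToArray().Length;

        try
        {
            long unused = mem.Length; // expected to throw on a closed stream
            return false;
        }
        catch (ObjectDisposedException)
        {
            return true;
        }
    }

    static void Main()
    {
        int copied;
        bool threw = LengthThrowsAfterClose(out copied);
        Console.WriteLine("Length threw: {0}, ToArray bytes: {1}", threw, copied);
    }
}
```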

I solved this by introducing a new stream class, which records the real data length while being disposed, and overrides the GetBuffer() method to return the buffer resized to the correct length:

class LengthFixingStream : MemoryStream // feel free to suggest better name than LengthFixingStream
{
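    private long m_length; // real data length, captured when the stream is disposed

    protected override void Dispose(bool disposing)
    {
        // Length throws once the stream is closed, so record it only while still open
        // (Close()/Dispose() may be called more than once)
        if (CanRead)
            m_length = Length;
        base.Dispose(disposing);
    }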

    public override byte[] GetBuffer()
    {
        byte[] buffer = base.GetBuffer();
        Array.Resize<byte>(ref buffer, (int) ContentLength); // dirty, but my documents are < 2GB long
        return buffer;
    }

    public long ContentLength
    {
        get { return m_length; }
    }
}

Now, the only change which has to be made is to replace the MemoryStream reference in the Download() method with LengthFixingStream:

void Download()
{
    Document document = new Document();
    LengthFixingStream mem = new LengthFixingStream();
    ... // yadda yadda yadda... the rest is the same
}
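For a quick sanity check that does not need iTextSharp at all, the sketch below carries its own copy of the stream class (with a guard against repeated disposal) and simulates what document.Close() does by closing the stream directly. LengthFixingStreamCheck and TrimmedLengthAfterClose are hypothetical names used only for this check.

```csharp
using System;
using System.IO;

class LengthFixingStream : MemoryStream
{
    private long m_length;

    protected override void Dispose(bool disposing)
    {
        if (CanRead)           // still open: remember the real data length
            m_length = Length;
        base.Dispose(disposing);
    }

    public override byte[] GetBuffer()
    {
        byte[] buffer = base.GetBuffer();
        // Array.Resize replaces the local reference with a trimmed copy
        Array.Resize<byte>(ref buffer, (int) m_length);
        return buffer;
    }
}

static class LengthFixingStreamCheck
{
    // Hypothetical helper: writes some bytes, closes the stream (standing in
    // for document.Close()) and returns the size of the buffer handed back.
    public static int TrimmedLengthAfterClose(int bytesToWrite)
    {
        LengthFixingStream mem = new LengthFixingStream();
        for (int i = 0; i < bytesToWrite; i++)
            mem.WriteByte((byte) i);
        mem.Close();
        return mem.GetBuffer().Length;
    }

    static void Main()
    {
        Console.WriteLine(TrimmedLengthAfterClose(10));
    }
}
```

Note that GetBuffer() on this class only returns useful data after the stream has been closed. Before that, the recorded length is still zero.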

Posted at 14:11 on July 21, 2009