Daily Archives: Wednesday, August 20, 2014

  • Extract Text from RTF in C#/.Net

    At work, I was tasked with writing a class that strips the RTF markup from RTF-formatted text, leaving only the plain text. Microsoft’s RichTextBox control can do this through its Text property, but it isn’t available in the context I’m working in.

    RTF formatting uses control words escaped with backslashes, along with nested curly braces for grouping. Unfortunately, the nesting means I can’t strip the markup with a single regex: the braces have to be tracked with a stack. On top of that, some control words shouldn’t simply be dropped but translated, such as the ones for newlines and tabs.

    Example:

    {\rtf1\ansi\deff0
    {\colortbl;\red0\green0\blue0;\red255\green0\blue0;}
    This line is the default color\line
    \cf2
    This line is red\line
    \cf1
    This line is the default color
    }
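
    To make the stack idea concrete, here is a minimal sketch in Python, the language of the original answer. It is not the code linked below, only an illustration under some assumptions: the names strip_rtf, TRANSLATE and SKIP_DESTINATIONS are hypothetical, the lists of translated control words and skipped destinations are small illustrative subsets, and hex escapes (\'hh) and Unicode escapes (\uN) are not handled.

```python
import re

# Control words that map to plain-text characters (illustrative subset).
TRANSLATE = {"par": "\n", "line": "\n", "tab": "\t"}
# Destinations whose contents are metadata, not document text (illustrative subset).
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info"}

# One token per match: control word, control symbol, brace, raw line break, or plain text.
TOKEN = re.compile(
    r"\\([a-z]+)(-?\d+)? ?"   # groups 1/2: control word, optional number, optional delimiter space
    r"|\\([^a-z])"            # group 3: control symbol such as \\ \{ \}
    r"|([{}])"                # group 4: group open/close
    r"|[\r\n]+"               # raw line breaks in the file are not text
    r"|([^\\{}\r\n]+)",       # group 5: a run of plain text
    re.IGNORECASE,
)


def strip_rtf(rtf: str) -> str:
    out = []
    stack = []         # one saved "skipping" flag per open group
    skipping = False   # True while inside a destination we want to drop
    for m in TOKEN.finditer(rtf):
        word, _param, symbol, brace, text = m.groups()
        if brace == "{":
            stack.append(skipping)                      # remember the outer state
        elif brace == "}":
            skipping = stack.pop() if stack else False  # restore it on exit
        elif word:
            w = word.lower()
            if w in SKIP_DESTINATIONS:
                skipping = True                         # drop the rest of this group
            elif not skipping and w in TRANSLATE:
                out.append(TRANSLATE[w])
        elif symbol:
            if not skipping and symbol in "\\{}":
                out.append(symbol)                      # escaped literal character
        elif text and not skipping:
            out.append(text)
    return "".join(out)


sample = r"""{\rtf1\ansi\deff0
{\colortbl;\red0\green0\blue0;\red255\green0\blue0;}
This line is the default color\line
\cf2
This line is red\line
\cf1
This line is the default color
}"""

print(strip_rtf(sample))
```

    Run on the example above, this prints the three text lines without the color table or the \cf switches. The stack is what lets the color table’s contents be dropped while text after its closing brace is kept.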

    Thankfully, Markus Jarderot provided a great answer over at StackOverflow. Unfortunately for me, it’s written in Python. I don’t know Python, but the code was very readable, so I translated it to C# to the best of my abilities.

    If this is useful to you, you can download the C# version, or view the original and translated code below.

    View Original Python Code

    View Translated C# Code