Extract Text from RTF in C#/.Net


At work, I was tasked with creating a class to strip RTF tags from RTF-formatted text, leaving only the plain text. Microsoft’s RichTextBox can do this with its Text property, but that control was unavailable in the environment I was working in.

RTF marks up text with control words escaped by backslashes, grouped inside nested curly braces. Because the groups nest, I can’t strip the markup with a single regex: the parser has to keep a stack of open groups. In addition, some control words must be translated rather than dropped, such as the ones for line breaks and tabs (\line and \tab, for example).

Example:

{\rtf1\ansi\deff0
{\colortbl;\red0\green0\blue0;\red255\green0\blue0;}
This line is the default color\line
\cf2
This line is red\line
\cf1
This line is the default color
}
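
To make the stack idea concrete, here is a minimal sketch of the approach. It is not the translated code from this post: the class name RtfSketch, the method StripRtf, and the short list of ignored destinations are all assumptions chosen for illustration. It also skips things a real converter needs, such as \'hh hex escapes and \u Unicode characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class RtfSketch
{
    // Assumption: a deliberately tiny list of destinations whose contents
    // are not body text. A real converter needs a much longer list.
    private static readonly HashSet<string> IgnoredDestinations =
        new HashSet<string> { "fonttbl", "colortbl", "stylesheet", "info" };

    // One token per match: a control word (optional numeric argument, one
    // optional trailing space), an escaped symbol, a brace, raw line breaks
    // (which carry no meaning in RTF), or a run of plain text.
    private static readonly Regex Token = new Regex(
        @"\\([a-z]+)(-?\d+)? ?|\\([^a-z])|([{}])|[\r\n]+|([^\\{}\r\n]+)");

    public static string StripRtf(string rtf)
    {
        var output = new StringBuilder();
        var stack = new Stack<bool>(); // saved "ignoring" flag, one per open group
        bool ignoring = false;

        foreach (Match m in Token.Matches(rtf))
        {
            string word = m.Groups[1].Value;
            string symbol = m.Groups[3].Value;
            string brace = m.Groups[4].Value;
            string text = m.Groups[5].Value;

            if (brace == "{")
            {
                stack.Push(ignoring);            // entering a group: remember state
            }
            else if (brace == "}")
            {
                if (stack.Count > 0) ignoring = stack.Pop(); // leaving: restore it
            }
            else if (word.Length > 0)
            {
                if (IgnoredDestinations.Contains(word)) ignoring = true;
                else if (ignoring) continue;
                else if (word == "line" || word == "par") output.Append('\n');
                else if (word == "tab") output.Append('\t');
                // every other control word is formatting only and is dropped
            }
            else if (symbol.Length > 0)
            {
                // \\, \{ and \} are escaped literals; other symbols are dropped
                if (!ignoring && (symbol == "\\" || symbol == "{" || symbol == "}"))
                    output.Append(symbol);
            }
            else if (text.Length > 0 && !ignoring)
            {
                output.Append(text);
            }
        }
        return output.ToString();
    }
}
```

Run against the example above, this sketch drops the color table group, turns each \line into a newline, and returns the three sentences on separate lines.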

Thankfully, Markus Jarderot provided a great answer over at Stack Overflow, but unfortunately for me, it’s written in Python. I don’t know Python, but the code was very readable, so I translated it to C# as best I could.

If this is useful to you, you can download the C# version, or view the original/new code below.

The code in this post is licensed Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), as is all code on Stack Overflow.

View Original Python Code

View Translated C# Code

Update: Johnny Lie pointed out some important performance improvements that I have incorporated. Instead of loading all the regex matches up front, the code now iterates through them one by one. This allows larger RTF documents to be processed successfully. Additionally, thanks to a comment by Spencer Schneidenbach, I have clarified that the code is licensed CC BY-SA 3.0, because the original code came from Stack Overflow.
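
As an illustration of what one-by-one iteration can look like in .NET (a sketch only, with hypothetical names MatchIteration and Lazily; it is not the code from this post), Regex.Match returns the first match and Match.NextMatch resumes the scan from where that match ended:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchIteration
{
    // Yields matches one at a time. Each NextMatch() call continues the scan
    // after the previous match, so a full list of matches is never built.
    // By contrast, reading MatchCollection.Count or copying the collection
    // to an array forces every match to be found first.
    public static IEnumerable<Match> Lazily(Regex pattern, string input)
    {
        for (Match m = pattern.Match(input); m.Success; m = m.NextMatch())
        {
            yield return m;
        }
    }
}
```

A caller can then write foreach (Match m in MatchIteration.Lazily(regex, rtf)) and handle each token as it is found.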

Author: Chris Benard

Chris Benard is a software developer in the Dallas area specializing in payments processing, medical claims processing, and Windows/Web services.
