Macteki Guide on Java Programming: Text Processing

Friday, August 5, 2011

Text Processing - html to text

Two days ago, I was asked to write a method htmlToText(). The method would take the name of an html file as its parameter and copy the content of that file to a text file, with all the html tags removed.

For example, consider the following html file :

With all the tags removed, the output would be :

Hello, there

I wrote this method with pen and paper. Yes, I was actually "WRITING" a program with a pen, not "TYPING" one in front of a computer. Here is the version I wrote :
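For readers who want something to run, here is a minimal sketch of a line-by-line version along those lines. It is not the handwritten original. Only the names htmlToText() and removeTag() come from this post; the class name, the character-scanning logic, the UTF-8 encoding and the ".txt" output name are all assumptions.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class LineByLineHtmlToText {

    // Remove every "<...>" span found within a single line.
    static String removeTag(String line) {
        StringBuilder out = new StringBuilder();
        boolean insideTag = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '<') {
                insideTag = true;            // a tag starts: stop copying
            } else if (c == '>' && insideTag) {
                insideTag = false;           // the tag ends: resume copying
            } else if (!insideTag) {
                out.append(c);               // ordinary text
            }
        }
        return out.toString();
    }

    // Copy the html file to a text file, stripping tags one line at a time.
    // Assumption: the output is written next to the input as "<name>.txt".
    static void htmlToText(String htmlFilename) throws IOException {
        Path in = Paths.get(htmlFilename);
        Path out = Paths.get(htmlFilename + ".txt");
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(removeTag(line));
                writer.newLine();
            }
        }
    }
}
```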

Drawback

The above program reads the input file line by line, removes the tags from each line, and then appends the line to the output file.

There is a major drawback to the above implementation. Since the removeTag() method is applied line by line, it doesn't work when a single tag is split across lines, that is, when its opening '<' and its closing '>' are on separate lines, such as :
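To see the failure concretely, here is a small sketch. The removeTag() below is a hypothetical stand-in (a simple character scanner that keeps no state between lines), and the two-line anchor tag is a made-up sample, not the one from the original post.

```java
class SplitTagDemo {

    // Hypothetical stand-in for removeTag(): drops everything between
    // '<' and '>' on ONE line and remembers nothing from the previous line.
    static String removeTag(String line) {
        StringBuilder out = new StringBuilder();
        boolean insideTag = false;
        for (char c : line.toCharArray()) {
            if (c == '<') {
                insideTag = true;
            } else if (c == '>' && insideTag) {
                insideTag = false;
            } else if (!insideTag) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample: one <a> tag whose '<' and '>' sit on different lines.
        String[] lines = {
            "<a href=\"index.html\"",
            "   title=\"home\">Hello, there</a>"
        };
        for (String line : lines) {
            // Brackets make leading spaces and empty results visible.
            System.out.println("[" + removeTag(line) + "]");
        }
    }
}
```

The first line comes out empty, and the second one still carries the tail of the tag (title="home">) in front of "Hello, there", because the scanner has forgotten that it was inside a tag when the previous line ended.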

Alternative : Reading the whole file into a content buffer

When I finally arrived home and got access to a computer, I rewrote the method.

This method consumes more memory, because the whole file is held in a buffer at once. However, it is completely feasible: most html files are not very big, and you will rarely find an html file of 100 megabytes.
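A rough sketch of the whole-buffer idea could look like the following. The class name, the helper removeTags(), the UTF-8 encoding and the ".txt" output name are assumptions made for illustration; the point is only that the scanner now sees the complete content, so a tag may span any number of lines.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class BufferedHtmlToText {

    // Strip tags from the whole document in one pass.
    // Line breaks inside a tag are skipped like any other tag character.
    static String removeTags(String content) {
        StringBuilder out = new StringBuilder(content.length());
        boolean insideTag = false;
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c == '<') {
                insideTag = true;
            } else if (c == '>' && insideTag) {
                insideTag = false;
            } else if (!insideTag) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Read the entire file into memory, strip the tags, write the result.
    // Assumption: the output is written next to the input as "<name>.txt".
    static void htmlToText(String htmlFilename) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(htmlFilename));
        String content = new String(bytes, StandardCharsets.UTF_8);
        String text = removeTags(content);
        Files.write(Paths.get(htmlFilename + ".txt"),
                    text.getBytes(StandardCharsets.UTF_8));
    }
}
```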

Regular Expression

Finally, I simplified the method further with the help of a regular expression.
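One way the regular expression version might look is sketched below. The pattern <[^>]*> is an assumption, not necessarily the expression used in the original post: it matches a '<', then any run of characters other than '>', then a '>'. Since a negated character class also matches line breaks, tags split across lines are removed as well.

```java
import java.util.regex.Pattern;

class RegexHtmlToText {

    // Assumed pattern: '<', any characters except '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Replace every match of the tag pattern with the empty string.
    static String removeTags(String content) {
        return TAG.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello, <b>there</b></p>";
        System.out.println(removeTags(html)); // prints "Hello, there"
    }
}
```

This is still a simple tag stripper, not an html parser. It leaves the text inside script and style elements in place, does not decode entities such as &amp;amp;, and can be confused by a '>' inside a comment or an attribute value.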

Lesson Learned

Nowadays, good programmers are not necessarily good at "writing" programs; they are good at "typing" programs in front of a computer, with Google as their friend.
