Monday, 8 February 2016

Remove HTML from String using Regex and String.replaceAll()


Using a simple regex with String.replaceAll() to remove HTML tags from a String:

We can use a simple Java regex to remove all HTML tags from a string.
The drawback is that selective removal is harder. For example, removing every HTML tag except <span> and <br> makes the regular expression noticeably more complex.


package net.codermag.example;

public class ConvertHTML {
    public static void main(String[] args) {

        String text = "<div><span><b style='color:blue;'>CoderMagnet:</b>The Developer playground.</span></div>";

        // "<[^>]*>" matches '<', then any run of characters other than '>', then '>'
        System.out.println(text.replaceAll("<[^>]*>", ""));
    }
}


Output:

CoderMagnet:The Developer playground.
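
The selective case mentioned earlier (keeping <span> and <br> while removing every other tag) can be sketched with a negative lookahead. This is only an illustration under assumptions: the class name SelectiveStrip, the helper stripExcept and the sample input are hypothetical, and the pattern only looks at the tag name, so it shares the limitations of the simple regex.

```java
class SelectiveStrip {

    // Hypothetical helper (not part of the original example): removes every
    // tag except those whose names appear in keep, e.g. "span|br".
    static String stripExcept(String html, String keep) {
        // (?i)               match tag names case-insensitively
        // (?!/?(?:keep)\b)   refuse to match when the tag name is one we keep
        String regex = "(?i)<(?!/?(?:" + keep + ")\\b)[^>]*>";
        return html.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String text = "<div><span><b style='color:blue;'>CoderMagnet:</b><br>The Developer playground.</span></div>";
        // prints: <span>CoderMagnet:<br>The Developer playground.</span>
        System.out.println(stripExcept(text, "span|br"));
    }
}
```

Even for two tags the pattern is already harder to read than the one-line version, which is the added complexity described above.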

Pitfall:
A limitation of this simple approach is that <br> and <p> tags are not converted to newlines, so the resulting text loses its line breaks and can be hard to read. To preserve them, we need to replace these tags with newlines explicitly before stripping the remaining tags, as shown below.

package net.codermag.example;

public class ConvertHTML {
    public static void main(String[] args) {

        String text = "<div><span><br>CoderMagnet:<br/>The <p>Developer</p> playground.</span></div>";

        // Replace the <br> variants and opening <p> tags with newlines first
        text = text.replaceAll("<br>", "\n").replaceAll("<br/>", "\n").replaceAll("</br>", "\n");
        text = text.replaceAll("<p>", "\n");

        // Then strip all remaining tags (including the closing </p>)
        text = text.replaceAll("<[^>]*>", "");

        System.out.println(text);
    }
}


Output:


CoderMagnet:
The 
Developer playground.
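
The literal replacements above only match the exact spellings <br>, <br/>, </br> and <p>. A slightly more tolerant variant might look like the sketch below, which also accepts upper-case tags, <br /> with a space, and <p> tags carrying attributes. The class name HtmlToText, the helper toPlainText and the sample input are assumptions made for illustration.

```java
class HtmlToText {

    // Hypothetical helper: the name and the patterns are illustrative assumptions.
    static String toPlainText(String html) {
        return html
                // <br>, <br/>, <br />, </br>, in any letter case
                .replaceAll("(?i)</?br\\s*/?>", "\n")
                // opening <p>, with or without attributes (does not match <pre>)
                .replaceAll("(?i)<p(\\s[^>]*)?>", "\n")
                // finally strip whatever tags remain
                .replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String text = "<DIV>Line one<BR />Line two<p class=\"x\">Line three</p></DIV>";
        // prints three lines: "Line one", "Line two", "Line three"
        System.out.println(toPlainText(text));
    }
}
```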


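Please note that this regex-based approach assumes well-formed markup. Malformed HTML, such as an unclosed tag or a stray < or > in the text, can leave tag fragments behind or remove real content, so test it against representative input before relying on it.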
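
To make the warning concrete, the sketch below runs the same kind of pattern on three awkward inputs. The class name MalformedDemo and the sample strings are hypothetical; they are only meant to show how the match can go wrong.

```java
class MalformedDemo {

    // Equivalent to the tag-stripping pattern used in the examples above.
    static String strip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Unclosed tag: there is no '>', so nothing is removed.
        System.out.println(strip("Hello <b world"));                 // Hello <b world
        // Stray '<' and '>' in plain text: everything between them is removed.
        System.out.println(strip("1 < 2 and 3 > 2"));                // 1  2
        // '>' inside an attribute value: the match ends too early.
        System.out.println(strip("<a title=\"a > b\">link</a>"));    //  b">link
    }
}
```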