Monday, 8 February 2016

Remove HTML from String using Regex and String.replaceAll()


Using a simple regex and String.replaceAll() to remove HTML from a String:

We can use a simple Java regex to remove all HTML tags from a string.
The drawback is that removing tags selectively is difficult. For example, if we want to remove all HTML tags except <span> and <br>, the regex becomes noticeably more complex.


package net.codermag.example;

public class ConvertHTML {
 public static void main(String[] args) {

  String text = "<div><span><b style='color:blue;'>CoderMagnet:</b>The Developer playground.</span></div>";
  // "<[^>]*>" matches '<', then any run of characters other than '>', then '>'
  System.out.println(text.replaceAll("<[^>]*>", ""));
 }
}


Output:

CoderMagnet:The Developer playground.
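
The selective case mentioned earlier (keep <span> and <br>, remove everything else) can be sketched with a negative lookahead. This is only one possible approach, not a definitive one: the class name SelectiveStrip and the method stripExceptSpanAndBr are hypothetical names chosen for this illustration, and the pattern assumes tag names are matched case-insensitively.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original example.
final class SelectiveStrip {

    // Matches any tag whose name is NOT span or br.
    // (?!/?(?:span|br)\b) rejects the match when the text right after '<'
    // is an optional '/' followed by the whole word "span" or "br".
    private static final Pattern UNWANTED_TAG =
            Pattern.compile("<(?!/?(?:span|br)\\b)[^>]*>", Pattern.CASE_INSENSITIVE);

    static String stripExceptSpanAndBr(String html) {
        return UNWANTED_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "<div><span><b style='color:blue;'>CoderMagnet:</b><br>The Developer playground.</span></div>";
        // prints: <span>CoderMagnet:<br>The Developer playground.</span>
        System.out.println(stripExceptSpanAndBr(text));
    }
}
```

Every extra tag to keep has to be added to the (?:span|br) alternation, which is where the added complexity comes from.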

Pitfall:
A problem with this kind of HTML tag removal is that it does not replace <br> and <p> tags with newlines. The resulting text is poorly formatted and sometimes hard to read. To avoid this, we need to handle newlines explicitly, as shown below.

package net.codermag.example;

public class ConvertHTML {
 public static void main(String[] args) {

  String text = "<div><span><br>CoderMagnet:<br/>The <p>Developer</p> playground.</span></div>";

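  // Replace the <br> variants and the opening <p> with newlines first.
  // (</br> is not valid HTML, but it does turn up in real-world markup.)
  text = text.replaceAll("<br>", "\n").replaceAll("<br/>", "\n").replaceAll("</br>", "\n");
  text = text.replaceAll("<p>", "\n");

  // Then strip every remaining tag, including the closing </p>
  text = text.replaceAll("<[^>]*>", "");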

  System.out.println(text);
 }
}


Output:


CoderMagnet:
The 
Developer playground.
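
The replacements above only match the exact strings <br>, <br/>, </br> and <p>. As a rough sketch of how the same idea could be made more tolerant, the version below also accepts upper-case tags, <br />, and <p> tags that carry attributes. The class name NewlineAwareStrip and the method toPlainText are hypothetical, and the patterns are an assumption about which variants are worth covering.

```java
// Hypothetical helper for illustration; not part of the original example.
final class NewlineAwareStrip {

    static String toPlainText(String html) {
        return html
                // <br>, <br/>, <br />, </br>, <BR> ... become a newline
                .replaceAll("(?i)</?br\\s*/?>", "\n")
                // an opening <p>, with or without attributes, becomes a newline
                // (the whitespace requirement keeps <pre> from matching)
                .replaceAll("(?i)<p(\\s[^>]*)?>", "\n")
                // every remaining tag is dropped
                .replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String text = "<div><BR>CoderMagnet:<br />The <p class='x'>Developer</p> playground.</div>";
        // Prints an empty line, then "CoderMagnet:", "The " and "Developer playground."
        System.out.println(toPlainText(text));
    }
}
```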


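Please note that this approach assumes reasonably well-formed HTML. A stray '<' or '>' in the text, or a '>' inside an attribute value, can make the regex remove too much or too little, so test it against the kind of HTML you actually expect to handle.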
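
To make the caveat concrete, here is a small sketch of two inputs that trip up the tag-stripping regex. The class name RegexStripPitfalls and the sample strings are made up for this illustration.

```java
// Hypothetical demo class for illustration; not part of the original example.
final class RegexStripPitfalls {

    static String strip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A '>' inside an attribute value ends the match too early,
        // so part of the attribute leaks into the output.
        // prints:  b">link   (with a leading space)
        System.out.println(strip("<a title=\"a > b\">link</a>"));

        // Unescaped '<' and '>' in plain text are treated as one tag,
        // so the text between them is lost.
        // prints: 1  2   (two spaces between the digits)
        System.out.println(strip("1 < 2 and 3 > 2"));
    }
}
```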

