Credit for this Ruby procedure goes to Dwight Shih & Brad Choate.
A routine like this is very useful. You pass it the specific tags (and, per tag, the attributes) you want to allow; it drops every other tag and closes any allowed tags left open, which makes it harder for submitted markup to be malformed or abused maliciously. Feel free to test it and share any suggestions.
def sanitize_html( html, okTags='a href, b, br, i, p, ul, li, span, ol, div, font, strong, style' )
  # no closing tag necessary for these
  soloTags = ["br"]

  # Build hash of allowed tags with allowed attributes
  tags = okTags.downcase().split(',').collect!{ |s| s.split(' ') }
  allowed = Hash.new
  tags.each do |s|
    key = s.shift
    allowed[key] = s
  end

  # Analyze all <> elements
  stack = Array.new
  result = html.gsub( /(<.*?>)/m ) do | element |
    if element =~ /\A<\/(\w+)/ then # </tag>
      tag = $1.downcase
      if allowed.include?(tag) && stack.include?(tag) then
        # If allowed and on the stack
        # Then pop down the stack
        top = stack.pop
        out = "</#{top}>"
        until top == tag do
          top = stack.pop
          out << "</#{top}>"
        end
        out
      end
    elsif element =~ /\A<(\w+)\s*\/>/ # <tag />
      tag = $1.downcase
      if allowed.include?(tag) then
        "<#{tag} />"
      end
    elsif element =~ /\A<(\w+)/ then # <tag ...>
      tag = $1.downcase
      if allowed.include?(tag) then
        if ! soloTags.include?(tag) then
          stack.push(tag)
        end
        if allowed[tag].length == 0 then
          # no allowed attributes
          "<#{tag}>"
        else
          # allowed attributes?
          out = "<#{tag}"
          while ( $' =~ /(\w+)=("[^"]+")/ )
            attr = $1.downcase
            valu = $2
            if allowed[tag].include?(attr) then
              out << " #{attr}=#{valu}"
            end
          end
          out << ">"
        end
      end
    end
  end

  # eat up unmatched leading >
  while result.sub!(/\A([^<]*)>/m) { $1 } do end

  # eat up unmatched trailing <
  while result.sub!(/<([^>]*)\Z/m) { $1 } do end

  # clean up the stack
  if stack.length > 0 then
    result << "</#{stack.reverse.join('></')}>"
  end

  result
end
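To make the okTags format easier to see, here is a minimal sketch of the allow-list step on its own. The helper name parse_allowed_tags is hypothetical and not part of the original procedure; it only mirrors how the string is turned into a hash of tag name to permitted attributes. Each comma-separated entry starts with the tag name, followed by any attribute names allowed on that tag.

```ruby
# Hypothetical helper (an illustration, not the authors' code): builds the
# same kind of lookup that sanitize_html builds from its okTags argument.
def parse_allowed_tags(ok_tags)
  allowed = {}
  ok_tags.downcase.split(',').each do |entry|
    # split(' ') ignores leading whitespace, so " b" yields ["b"]
    name, *attrs = entry.split(' ')
    next if name.nil?          # skip empty entries such as a trailing comma
    allowed[name] = attrs      # an empty list means "no attributes allowed"
  end
  allowed
end

allowed = parse_allowed_tags('a href, B, br')
p allowed['a']            # ["href"]
p allowed['b']            # []
p allowed.key?('script')  # false
```

Read this way, the default list permits href on a and no attributes on any other tag. Note that the procedure checks attribute names only, not their values, so callers that allow href may want to validate the URL separately.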