Skip to content

Instantly share code, notes, and snippets.

@1RedOne
Last active August 16, 2018 14:57
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save 1RedOne/ef797aad307878914b8b4c99988fb7b5 to your computer and use it in GitHub Desktop.
Save 1RedOne/ef797aad307878914b8b4c99988fb7b5 to your computer and use it in GitHub Desktop.
Parsing Malformed HTML via SQL

First off the source HTML has a lot of issues, as we can see here:

'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<STYLE type=text/css> P, UL, OL, DL, DIR, MENU, PRE { margin: 0 auto;}</STYLE>
<META name=GENERATOR content="MSHTML 11.00.10586.35"></HEAD>

<BODY leftMargin=1 topMargin=1 rightMargin=1>

<DIV><FONT size=2 face="Segoe UI">

<DIV>I asked for Shigeru Miyamoto and the call was heavily screened including wanting to know if he was expecting my call. I told he wasn''t,... </DIV></FONT></DIV></BODY></HTML>'

So we first begin by using the Replace operator to clean this text up

DECLARE @removeThis  nvarchar(max) = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">'
DECLARE @removeThis2 nvarchar(max) = '<STYLE type=text/css> P, UL, OL, DL, DIR, MENU, PRE { margin: 0 auto;}</STYLE>'
DECLARE @removeThis3 nvarchar(max) = '<META name=GENERATOR content="MSHTML 11.00.10586.35">'
-- calling replace a bunch


SET @notes = replace(@notes,@removeThis,'')
SET @notes = replace(@notes,@removeThis2,'')
SET @notes = replace(@notes,@removeThis3,'')
SET @notes = replace(@notes, '<BODY leftMargin=1 topMargin=1 rightMargin=1>' ,'<BODY>')
SET @notes = replace(@notes, '<FONT size=2 face="Segoe UI">' ,'<FONT>')

This leaves us with easier to parse HTML:

<HTML>
  <HEAD></HEAD>
  <BODY>
    <DIV>
      <FONT>
        <DIV>I asked for Shigeru Miyamoto and the call was heavily screened including wanting to know if he was expecting my call. I told he wasn't... </DIV>
      </FONT>
    </DIV>
  </BODY>
</HTML>

At this point, we can cast the HTML to XML, and then use SQL's built-in xPath support to search for tags within an XML document.

SET @XML = @notes
    
    --use an xPath Query to retrieve the deepest nested (DeepNest, spooky!) DIV element
    set @return = (SELECT top 1 Tab.col.value('.', 'NVARCHAR(MAX)')
		FROM @xml.nodes('//DIV') as Tab(Col))

giving us the following output

I asked for Shigeru Miyamoto and the call was heavily screened including wanting to know if he was expecting my call. I told he wasn't, she put me on hold for a bit then told me he was tied up right now and asked me to take you information. Jennifer took my information and said she would give it to him. Jennifer said she works reception there. 

Now, we simply place it within a Function, and run this once to save it in SQL (see the other .sql file).

Then call it like this:

select dbo.RetrieveNotesFromHTML(@notes)

You can embed this withina subselect to retrieve the .notes values from all records in GoldMine.

CREATE FUNCTION dbo.[RetrieveNotesFromHTML]
(
@HTML_STRING VARCHAR(MAX) -- Variable for string
)
RETURNS VARCHAR(MAX)
BEGIN
DECLARE @XML XML;
DECLARE @notes nvarchar(max) = @HTML_STRING;
DECLARE @return nvarchar(max);
DECLARE @removeThis nvarchar(max) = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">'
DECLARE @removeThis2 nvarchar(max) = '<STYLE type=text/css> P, UL, OL, DL, DIR, MENU, PRE { margin: 0 auto;}</STYLE>'
DECLARE @removeThis3 nvarchar(max) = '<META name=GENERATOR content="MSHTML 11.00.10586.35">'
SET @notes = replace(@notes,@removeThis,'')
SET @notes = replace(@notes,@removeThis2,'')
SET @notes = replace(@notes,@removeThis3,'')
SET @notes = replace(@notes, '<BODY leftMargin=1 topMargin=1 rightMargin=1>' ,'<BODY>')
SET @notes = replace(@notes, '<FONT size=2 face="Segoe UI">' ,'<FONT>')
--with this done, we can now cast the HTML to XML and retrieve values
SET @XML = @notes
--use an xPath Query to retrieve the deepest nested (DeepNest, spooky!) DIV element
set @return = (SELECT top 1 Tab.col.value('.', 'NVARCHAR(MAX)')
FROM @xml.nodes('//DIV') as Tab(Col))
return @return
END
GO
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment