HTML to XHTML Conversion with SGMLReader
By Peter A. Bromberg, Ph.D.

This is a web-based implementation of HTML to well-formed XHTML conversion, built on SgmlReader, the excellent parser written by Chris Lovett of Microsoft. Chris's code has a command-line interface; however, I needed an in-memory implementation for some experimental work that takes well-formed XHTML and converts it to RTF for display in a RichTextBox control. There are many other uses for XHTML-compliant HTML, not the least of which is that an XHTML page is a legitimate, well-formed XML document, which opens up a whole new range of possibilities for HTML processing.
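To make the "XHTML is XML" point concrete, here is a minimal sketch of what becomes possible once a page is well-formed: it can be loaded into an XmlDocument and queried with XPath. The class name XhtmlLinkExtractor and the method GetLinks are hypothetical names chosen for illustration, not part of SgmlReader or of the download. The sketch assumes .NET 2.0 or later (for the generic List) and input without a DOCTYPE or named entities such as &nbsp;, which would need a DTD to resolve.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical illustration: query well-formed XHTML as ordinary XML.
public static class XhtmlLinkExtractor
{
    // Returns the href value of every <a> element that has one.
    public static List<string> GetLinks(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        // LoadXml throws XmlException if the markup is not well-formed,
        // which is exactly why raw HTML usually cannot be loaded this way.
        doc.LoadXml(xhtml);

        List<string> links = new List<string>();
        // local-name() matches <a> whether or not the document declares
        // the XHTML default namespace.
        foreach (XmlNode node in doc.SelectNodes("//*[local-name()='a'][@href]"))
        {
            links.Add(node.Attributes["href"].Value);
        }
        return links;
    }
}
```

Calling XhtmlLinkExtractor.GetLinks on converted output returns the link targets in document order; the same call on unconverted HTML with unclosed tags would throw instead.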

To make this work as a class library, usable on the web or in memory within an application, I needed to write a small helper class. I also changed the way Lovett's SgmlReader class reports errors: the existing code writes errors to an optional log file through a TextWriter, whereas I needed to return the concatenated error string to the web page for display, so errors are now exposed through a string property. My helper class code appears below:

using System;
using Sgml;
using System.IO;
using System.Xml;
using System.Text;
using System.Web;

namespace SgmlReaderDll
{
    /// <summary>
    /// Helper class to allow string processing using SGMLReader/Parser
    /// </summary>
    public class SGMLReaderHelper
    {
        private string _errors;

        public string Errors
        {
            get { return _errors; }
            set { _errors = value; }
        }

        public SGMLReaderHelper()
        {
        }

        public string ProcessString(string strInputHtml)
        {
            // Parse the input as HTML, reading from an in-memory string.
            SgmlReader reader = new SgmlReader();
            reader.DocType = "HTML";
            StringReader sr = new StringReader(strInputHtml);
            reader.InputStream = sr;

            // Write the parsed nodes back out as XML into a string.
            StringWriter sw = new StringWriter();
            XmlTextWriter w = new XmlTextWriter(sw);
            reader.Read();
            while (!reader.EOF)
            {
                w.WriteNode(reader, true);
            }
            w.Flush();
            w.Close();

            // ErrorLog here is the string property of the modified SgmlReader.
            this.Errors = reader.ErrorLog;
            return sw.ToString();
        }
    }
}
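As an aside on the error-reporting change, a component that logs through a TextWriter can also be captured in memory without modifying it, by handing it a StringWriter. The sketch below shows only that general technique. ParseWithLog is a hypothetical stand-in for any such component, not SgmlReader itself, and the check it performs is invented purely so the example has something to log.

```csharp
using System;
using System.IO;

// Hypothetical illustration: collect TextWriter-based log output as a string.
public static class ErrorCaptureSketch
{
    // Stand-in for a parser that reports problems to a caller-supplied writer.
    public static void ParseWithLog(string input, TextWriter log)
    {
        if (input.IndexOf('<') < 0)
        {
            log.WriteLine("No markup found in input.");
        }
    }

    // Runs the stand-in with an in-memory writer and returns what it logged.
    public static string CaptureErrors(string input)
    {
        using (StringWriter errors = new StringWriter())
        {
            ParseWithLog(input, errors);
            return errors.ToString();
        }
    }
}
```

The returned string can be shown on a web page in the same way as the Errors property of the helper class. Whether this or a dedicated string property is the better fit depends on how much of the original library you are willing to change.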

There are many interesting uses for this type of utility. One I use again and again is to take an HTML web page that is not XHTML compliant, run it through this utility, and get back a well-formed XML document in which unquoted attribute values are quoted, HTML tags that need closing are closed, and script blocks are automatically wrapped in CDATA sections. The result can be saved with an .xsl extension, and you are on your way to creating an XSL stylesheet for an XML transformation that generates dynamic web pages!
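For readers who want to try the transformation side of this, here is a minimal sketch that applies an XSL stylesheet to an XHTML string entirely in memory. It assumes .NET 2.0 or later, where XslCompiledTransform is available, and input without a DOCTYPE. The class name XhtmlTransformSketch and its Transform method are hypothetical names for illustration and are not part of the download.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical illustration: run an XSLT stylesheet over well-formed XHTML.
public static class XhtmlTransformSketch
{
    public static string Transform(string xhtml, string stylesheet)
    {
        // Compile the stylesheet from a string.
        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(stylesheet)))
        {
            xslt.Load(styleReader);
        }

        // Transform the XHTML and collect the result as a string.
        using (XmlReader input = XmlReader.Create(new StringReader(xhtml)))
        using (StringWriter output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

A stylesheet that selects /html/head/title with text output, for example, would pull the page title out of a converted document. This only works because the converter has already produced well-formed markup; raw HTML would fail in XmlReader before the stylesheet ever ran.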

And now for the fun part. Click the link below to open the ASP.NET web page, where you can paste your HTML and receive back XHTML, along with an error report from Chris's parser:

Try the HTML to XHTML web page

As always, the full solution may be downloaded from the link below. Thanks to Chris Lovett for some really useful code.

Download the code that accompanies this article


 


Peter Bromberg is a C# MVP, MCP, and .NET consultant who has worked in the banking and financial industry for 20 years. He has architected and developed web-based corporate distributed application solutions since 1995, and focuses exclusively on the .NET Platform. Pete's samples at GotDotNet.com have been downloaded over 41,000 times. You can read Peter's UnBlog here.

Copyright (c) 2000 - 2009 eggheadcafe.com. All rights reserved.