Network Developer:
Tidying up your HTML code for XML
Seamus Phan, 1-Aug-2002
Is there an easy way to port static content so that it conforms to XHTML without breaking the bank, or our backs? The answer lies in an open source project known as Tidy (tidy.sourceforge.net).

What is Tidy?

Tidy was created by Dave Raggett, who has since passed the Tidy project to Sourceforge as an open source collaborative project. At Sourceforge, there are ports of Tidy for many different OSes, including Windows, UNIX, Linux, FreeBSD, Mac OS X/Darwin, and even Java. There is also a Web-based Tidy HTML validator, last found at http://www.thedumbterminal.co.uk/services/tidy.shtml.

To programmers, Tidy is simply an HTML syntax checker and "pretty printer", which means that Tidy is able to present code in such a way that it is easily understood. Tidy can also clean up HTML code that is generated by Microsoft Word, and for programmers who prefer to hand-code, it can clean up manual errors as well.

From HTML to XHTML

XHTML 1.0 is the transitional standard that bridges HTML "looseness" with XML's stricter code presentation. XHTML conforms to XML's data presentation, and makes use of Cascading Style Sheets (CSS) to denote how text is displayed. In HTML, text can have individual display code added to it, such as a <font> tag. With CSS, the markup can still describe the text, but rather than spelling out individual characteristics such as typeface, style and size, it merely points to a rule in the CSS.

CSS can also define a sweeping default style for basic HTML elements, so that Web designers do not need to style each block of text individually. The advantage of CSS is obvious. If you have an external CSS file (.css), you can change the CSS description once and your whole Web site will instantly reflect the new text presentation. Most commercial Web authoring tools are less efficient in this regard, and still attach individual font descriptions to individual blocks of text.

Tidy cleans up HTML code not only by converting it to conform to XHTML's stricter descriptions, but also by adding CSS to the code. You can then modify the code so that it calls an external CSS file, by cutting and pasting the CSS portion into an external .css text file.

Sample HTML 4.0 page:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/1998/REC-html40-19980424/loose.dtd">

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Sample Text here.

More text here.

Sample XHTML 1.0 page:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

Sample Text here.

More text here.

The most obvious difference in XHTML code, when compared to traditional HTML 4.x code, is that empty tags such as <meta> have a trailing "/" before the closing angle bracket (>). This makes the code conform to XML requirements.
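The snippet below is a minimal sketch in Python, not part of Tidy. The helper name is_well_formed and both fragments are assumptions made for illustration. It uses the XML parser in Python's standard library to show why an unclosed empty tag is rejected by XML tools, while the self-closed form passes.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# HTML 4 style: the empty <br> element is never closed.
html_style = "<p>Sample text here.<br>More text here.</p>"

# XHTML style: the empty element closes itself with a trailing "/".
xhtml_style = "<p>Sample text here.<br />More text here.</p>"

print(is_well_formed(html_style))   # False
print(is_well_formed(xhtml_style))  # True
```

A browser would display both fragments the same way; only an XML parser treats the first one as an error.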

Tidying up errors

Of course, Tidy wouldn't have found favour with programmers and Web gurus if it only converted HTML to XHTML. Tidy also cleans up human errors in hand-coding.

While most Web browsers such as Microsoft Internet Explorer and Netscape Communicator are fairly forgiving and can still display pages that contain errors, XHTML compliance means that not a single error is acceptable. We can rely on Tidy to clean up errors, and while it is not perfect, it can clear up more than 99% of HTML errors.

For example, Tidy can clean up missing or mismatched end tags. If you forget to type an end tag after the text, or if you mistype it, Tidy can correct that.

If you place end tags in the wrong order, that is, creating an overlapping close rather than an enveloped close, Tidy can guess the intended nesting too.

This description is bold here, bold italic here, and should be bold here.

becomes

This description is corrected to bold, bold italic, and rightfully bold.
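To make the idea of an overlapping close concrete, here is a small Python sketch. It is not how Tidy works internally; the class NestingChecker, the function find_overlaps and the two sample fragments are assumptions for illustration. It keeps a stack of open tags and reports any end tag that does not match the most recently opened one.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Records end tags that do not match the most recently opened tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tag names that are still open
        self.problems = []   # human-readable descriptions of mismatches

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if not self.open_tags:
            self.problems.append(f"found </{tag}> with no open tag")
            return
        if self.open_tags[-1] == tag:
            self.open_tags.pop()
            return
        self.problems.append(
            f"found </{tag}> but expected </{self.open_tags[-1]}>"
        )
        # Forget the mismatched tag so that later end tags can still match.
        for index in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[index] == tag:
                del self.open_tags[index]
                break


def find_overlaps(markup):
    """Return a list of nesting problems found in the markup."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


overlapping = "<p><b>bold <i>bold italic</b> still bold?</i></p>"
enveloped = "<p><b>bold <i>bold italic</i> still bold?</b></p>"

print(find_overlaps(overlapping))  # ['found </b> but expected </i>']
print(find_overlaps(enveloped))    # []
```

The sketch only detects the problem, and it does not know about empty tags such as <br>. Deciding how to repair the markup is the harder part that Tidy takes care of.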

Also, Tidy understands how strict XML is, and can repair formatting for tags which are enveloped in the wrong hierarchy. For example, heading tags should encapsulate italic or bold description tags, and not the other way around. Therefore,

illegal heading

is changed by Tidy to

correct heading
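The hierarchy rule can be sketched in Python as well. This is an illustration only: the tag sets, the class HeadingNestingChecker, the function find_misplaced_headings and the sample fragments are assumptions, not taken from Tidy. The checker flags any heading that starts while an inline tag such as <i> or <b> is still open.

```python
from html.parser import HTMLParser

# Assumed, simplified tag sets for this sketch.
INLINE_TAGS = {"b", "i", "em", "strong"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingNestingChecker(HTMLParser):
    """Flags headings that are opened inside an inline element."""

    def __init__(self):
        super().__init__()
        self.open_inline = []  # inline tags that are currently open
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS and self.open_inline:
            self.problems.append(
                f"<{tag}> appears inside <{self.open_inline[-1]}>"
            )
        if tag in INLINE_TAGS:
            self.open_inline.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_inline:
            # Remove the most recently opened occurrence of this inline tag.
            index = len(self.open_inline) - 1 - self.open_inline[::-1].index(tag)
            del self.open_inline[index]


def find_misplaced_headings(markup):
    """Return descriptions of headings wrapped in inline tags."""
    checker = HeadingNestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


print(find_misplaced_headings("<i><h1>heading</h1></i>"))  # ['<h1> appears inside <i>']
print(find_misplaced_headings("<h1><i>heading</i></h1>"))  # []
```

Both fragments are well-formed XML, so a plain XML parser would accept either one. Catching the first requires knowledge of which elements may contain which, and that knowledge is what a DTD supplies.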

For older Web browsers and Web page authoring software, some end tags are not inserted, nor do they seem to be important. For example, <li> tags for list items are often not closed with </li>, but they usually display just fine on modern Web browsers, which are usually forgiving.

<li>First list item
<li>Second list item

Tidy will clean them up to conform to XML's strict guidelines, yielding:

<li>First list item</li>
<li>Second list item</li>
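The repair itself can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not Tidy's algorithm: the class ListItemCloser and the function close_list_items are invented for illustration, and the sketch handles only flat lists, not lists nested inside list items.

```python
import html
from html.parser import HTMLParser


class ListItemCloser(HTMLParser):
    """Re-emits markup and adds the </li> end tags that were left out."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.li_open = False  # True while an <li> is waiting for its </li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self.li_open:
                # A new item starts before the previous one was closed.
                self.out.append("</li>")
            self.li_open = True
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br /> are passed through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.li_open:
            # The list ends while the last item is still open.
            self.out.append("</li>")
            self.li_open = False
        if tag == "li":
            self.li_open = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # The parser decodes character references, so escape the text again.
        self.out.append(html.escape(data, quote=False))


def close_list_items(markup):
    """Return the markup with missing </li> end tags inserted."""
    closer = ListItemCloser()
    closer.feed(markup)
    closer.close()
    return "".join(closer.out)


print(close_list_items("<ul><li>First list item<li>Second list item</ul>"))
# <ul><li>First list item</li><li>Second list item</li></ul>
```

Markup that already closes every list item passes through unchanged.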

Some authoring tools also do not insert quotes around attribute values. Tidy will insert quotes around all such attribute values, and will report the changes in a text log so that Web developers know what has been changed.
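As an illustration of quoting attribute values and logging each change, here is a hedged Python sketch. The class AttributeQuoter, the function quote_attributes and the sample link are assumptions for this example, and Tidy's actual log format is different. The sketch rebuilds every start tag with double-quoted values and records each tag that it rewrote.

```python
import html
from html.parser import HTMLParser


class AttributeQuoter(HTMLParser):
    """Re-emits markup with every attribute value in double quotes."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.changes = []  # one entry per start tag that was rewritten

    def _rebuild(self, tag, attrs, end):
        parts = [tag]
        for name, value in attrs:
            if value is None:
                # XHTML does not allow minimized attributes such as "checked".
                value = name
            parts.append(f'{name}="{html.escape(value, quote=True)}"')
        rebuilt = "<" + " ".join(parts) + end
        original = self.get_starttag_text()
        if rebuilt != original:
            self.changes.append(f"{original} -> {rebuilt}")
        self.out.append(rebuilt)

    def handle_starttag(self, tag, attrs):
        self._rebuild(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._rebuild(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # The parser decodes character references, so escape the text again.
        self.out.append(html.escape(data, quote=False))


def quote_attributes(markup):
    """Return (fixed_markup, list_of_changes)."""
    quoter = AttributeQuoter()
    quoter.feed(markup)
    quoter.close()
    return "".join(quoter.out), quoter.changes


fixed, log = quote_attributes("<a href=index.html>Home</a>")
print(fixed)  # <a href="index.html">Home</a>
print(log)    # ['<a href=index.html> -> <a href="index.html">']
```

Because the parser lowercases tag and attribute names, the log also picks up tags whose only change is letter case.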

One interesting feature of Tidy is "Burst into slides", where <h2> tags are treated as slide breaks. This feature allows Web authors to write simple Web-based presentations. If you do not need complex presentations, this automated feature within Tidy can help you create lightweight, completely compatible slide shows that can be embedded on CD-ROMs or used as e-learning content.

As a "pretty printer", Tidy can also indent code, so that it is more readable on a computer screen or when printed. If you use monospaced fonts for editing, display and printing, you can even have Tidy wrap text at a specified column.

And for those who prefer to rework or tweak code themselves, Tidy can be configured to report errors rather than correct them automatically. After all, Tidy is not 100% perfect, and a small margin of error remains.
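The options described above map onto Tidy's command line. The Python sketch below only assembles the argument list and does not run anything. The function name build_tidy_command and the file names are assumptions, and the flags shown (-asxhtml, -indent, -wrap, -errors, -f) are taken from current HTML Tidy builds, so check the output of "tidy -help" for the version you have installed, since older releases may spell some of them differently.

```python
def build_tidy_command(source, *, report_only=False, wrap_column=None, log_file=None):
    """Assemble a tidy command line as a list of arguments.

    report_only  -- only report errors and warnings, leave the markup alone
    wrap_column  -- column at which tidy should wrap text, if any
    log_file     -- file that receives the errors and warnings, if any
    """
    command = ["tidy"]
    if report_only:
        command.append("-errors")
    else:
        # Convert to XHTML and indent the output for readability.
        command += ["-asxhtml", "-indent"]
        if wrap_column is not None:
            command += ["-wrap", str(wrap_column)]
    if log_file is not None:
        command += ["-f", log_file]
    command.append(source)
    return command


# Convert a (hypothetical) page to indented XHTML, wrapped at column 72.
print(build_tidy_command("page.html", wrap_column=72, log_file="tidy.log"))
# ['tidy', '-asxhtml', '-indent', '-wrap', '72', '-f', 'tidy.log', 'page.html']

# Only report problems and leave the markup untouched.
print(build_tidy_command("page.html", report_only=True))
# ['tidy', '-errors', 'page.html']
```

On a machine where the tidy executable is installed, the resulting list can be handed to subprocess.run from Python's standard library.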

Is Tidy right for you?

Tidy is nifty and efficient, and does its job wonderfully in a simple and uncluttered way. However, as Dave Raggett mentioned in his original plans before turning Tidy over to the open source community at Sourceforge, he intended to include support for Big5 and ShiftJIS, which would have provided native support to understand Chinese and Japanese HTML input.

Raggett also intended to build link-checking into Tidy, but no such support is available yet. With link-checking and an active Internet connection, Tidy could eventually provide an easy way to test links within the code as well, making the tool much more powerful in helping Web authors produce clean, conforming, and functional code.

With the need to migrate to the XML environment sometime in the near future, Tidy is absolutely essential.

For reproduction and reprint of articles authorized by Seamus Phan directly, kindly note that this copyright notice MUST be included at the end:

Seamus Phan is a leading author, keynote speaker, trainer and technologist in the areas of total quality, service quality, Internet, biotech, holistic health, and business processes. Based in Singapore, Seamus consults for international companies, government agencies and emerging enterprises around the world. He is also a professor of media studies and sustainable development.

Seamus Phan
http://SeamusPhan.com
Copyright (c) 1990-2004 Seamus Phan. All rights reserved.
