Chapter 21

HTML Validation

by Eric Ladd



While you may sometimes hear people refer to "HTML programming," you should know that writing HTML code is very different from writing script or program code. Your HTML code is a set of instructions to a browser as to how it should display a document. Lines of code in a computer program tell the machine itself what to do. Many Web users will be quick to point out this difference. Indeed, any dyed-in-the-wool computer programmer would bristle at the suggestion that coding HTML is actually programming.

But there is one thing you can do with your HTML code that is exactly the same as what programmers do with theirs: You can check it for proper syntax. This process is referred to by several different names. You might see it called HTML checking, HTML verification, or HTML validation. Each of these names essentially refers to the same thing, although there are many different types of tests you can use when validating your HTML.

This chapter introduces you to some of the more common tools used to validate HTML and how to use these tools. Taking the time to validate your code is an important step in ensuring that your users all see the same thing, regardless of which browser they're using.

Why Validate Your HTML?

If you're wondering why you should make the effort to validate your HTML, the answer is this: to make your content available to the broadest audience possible. Each of the skyrocketing number of browsers that have come on the scene since the Web's inception in 1989 has a slightly different way of doing things, even something as simple as rendering an unordered list. This has caused the Web community to realize the need for standards: a set of agreed-upon guidelines for programming browsers and other related matters.

By validating your HTML, you check it against what the standard defines as proper syntax. If there are errors in how you've implemented an HTML construct, that construct will not appear properly on some browsers and some of your content will be lost on these users. Complete adherence to the standard means that everyone using a standards-compliant browser will be able to view your content, thereby maximizing your audience.

Writing documents that comply with established standards also means that they will be viewable on future standards-compliant tools. So no matter how many new browsers hit the market, a user will be able to use any one of them to view your pages.

Figures 21.1, 21.2, and 21.3 show you how a standards-compliant page looks on three different browsers. Clearly, there are rendering differences in each, but you ultimately see the same content on any of the three.

Figure 21.1 : Internet Explorer 3.0 displays the Yahoo main page in a format that most of us are used to seeing.

Figure 21.2 : NCSA Mosaic renders the Yahoo main page like this. Note the differences in bullet characters and spacing between images.

Figure 21.3 : Lynx, a text-only browser, displays the Yahoo main page like this. While the graphical banners are gone, you can still see all of the links that are available to you.


NOTE
Sometimes even HTML authoring programs will flag non-standard HTML. Older versions of SoftQuad's HoTMetaL would not even load documents that used a non-standard HTML tag! The current version of HoTMetaL is more "forgiving," though, and makes allowance for non-standard tags through menu options or through direct insertion by the author.

Browser-Specific Extensions and Standards Compliance
Web folks who are concerned about standards-compliant HTML usually lament the abundance of browser-specific HTML tags in Web documents. Their main concern is that of reusability. A document specifically designed for the Netscape Navigator browser is not necessarily reusable with Microsoft Internet Explorer, NCSA Mosaic, or any other browser. The only solution to this problem is to code separate versions of the same page, one for each type of browser you expect to hit the page. This makes for much more work than is truly necessary.
Use browser-specific HTML extensions with care. Most of the extensions introduced by Netscape over the past year or two have been adopted as part of HTML 3.2, but Netscape and Microsoft continue to inundate us with more. If you do create a document with a browser-specific extension, you need to at least consider providing a standards-compliant version of the document as well.

Some Bad Code

The remainder of this chapter is devoted to introducing you to some of the HTML validation services available. So that you can compare them fairly, we will submit the same chunk of code to each to see how it fares. The test HTML document is as follows:

<HTML>
<HEAD>
<META NAME="KEYWORDS" CONTENT=not standards-compliant">
</HEAD>
<BODY BGCOLOR="#50G3AD">
<H1>Here's a document with 5 misteaks!</H2>
<P>Can you find them?</P>
</BODY>
</HTML>

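The five errors in the document are: a missing opening quotation mark on the CONTENT value in the <META> tag; a missing <TITLE> element in the document head; an invalid color code in the BGCOLOR attribute (G is not a hexadecimal digit); an <H1> heading closed with an </H2> tag; and the misspelled word "misteaks" in the heading.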

As you review the reports that come back from the validation services, check to see if they've caught all of the errors noted above.

Online Validation Services

There are two types of validation services at your disposal. The first is the online service, which typically allows you to submit the URL of the document you wish to validate. Some online services even let you copy and paste the code you want to validate right into a multiline text window. Once it has your code, the service will parse and check it for conformance with the standard. It will then present its findings to you on your browser screen.

The other type is the service that runs on your computer. You can download a number of validation tools and install them on the hard drives of the machine you use to author HTML documents. This way, when you finish a document, you can quickly run the program to validate it.

When choosing which type of service to use, you'll want to consider factors such as the following:

This section discusses four of the more popular online validation services. The subsequent section introduces you to tools you can run on your own machine.

WebTechs HTML Validation Service

WebTechs offers an HTML validation service at http://www.webtechs.com/html-val-svc/ (see Figure 21.4). The first order of business in doing your validation is to set the options you want. You can set the level of conformance to whichever HTML standard you want (2.0, 3.0, or 3.2), or you can set it to check browser-specific extensions from Netscape (Mozilla) and Microsoft. Checking the Strict box means that the parser will check to see that only recommended HTML instructions are used.

Figure 21.4 : WebTechs HTML validation service starts you off with several options you can use to customize how you use the service.

The other option you need to set is the format of the report from the parser. You can configure the report to include the code you submitted, the parser's output, and formatted output.

TIP
Chances are that you know what it is you're submitting, so you can leave the Show Input box unchecked.

Next, you need to submit the code to be validated. You have two ways to do this. The first is to submit the URL of an existing document on the Web (see Figure 21.5). Typing the URL into the text window you see in the figure and clicking the "Submit URLs for validation" button instructs the parsing routine to begin its work.

Figure 21.5 : You can submit the URLs of the documents you want validated and the WebTechs validator will grab them right off the Web for you.


TIP
Take advantage of the fact that you can submit more than one URL at a time for validation. This way, you can batch your validation rather than running the script once for each document.

Your other option is to enter the code you want validated into the text window under the Check Bits and Pieces Interactively section. Figure 21.6 shows our test code copied and pasted into the text window.

Figure 21.6 : Copying and pasting your code into WebTechs' service is much easier than retyping it.

With the code in the text window, you just click the Submit HTML for Validation button to begin the test. The errors found in the validation are shown in Figure 21.7. The first four errors are due to the missing quote in the <META> tag, though the parser doesn't put it to you quite that way. This is a case of an error that is relatively simple but is not easy for the validator to explain. What's important at this point is the fact that the report tells you which line the error occurred on. You can use this information to locate the mistake through a visual search if the message from the validator doesn't make it totally clear.

Figure 21.7 : WebTechs' service returns a report with all parsing errors noted, including the line on which they occurred.

The next line in the error report points out the missing <TITLE> element. Then, in the subsequent line, the validator catches the erroneous use of the </H2> tag to close the <H1> tag. Finally, the validator reports that it interpreted the <P> tag as implying the end of the <H1> heading.
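The heading mismatch reported here can be illustrated with a short Python sketch. The function name mismatched_headings is hypothetical, and the regular-expression scan is a simplification for illustration, not a description of how an SGML parser works:

```python
import re

# Matches opening and closing heading tags H1 through H6, in either case.
_HEADING_TAG = re.compile(r"<(/?)H([1-6])\b[^>]*>", re.IGNORECASE)

def mismatched_headings(html):
    """Return (opened_level, closed_level) pairs for headings closed at the wrong level."""
    mismatches = []
    open_level = None
    for match in _HEADING_TAG.finditer(html):
        is_closing = match.group(1) == "/"
        level = int(match.group(2))
        if not is_closing:
            open_level = level          # remember the most recent opening heading
        elif open_level is not None:
            if level != open_level:
                mismatches.append((open_level, level))
            open_level = None           # the heading is now considered closed
    return mismatches

print(mismatched_headings("<H1>Here's a document with 5 misteaks!</H2>"))  # [(1, 2)]
```

A real validator works from the DTD rather than from pattern matching, which is why it can also report implied end tags such as the one triggered by the <P> tag.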

WebTechs' service did not detect two of the errors: the invalid RGB hexadecimal triplet and the spelling error. Since the service is essentially an SGML parser, it is not a surprise that it did not detect the spelling mistake. The fact that it didn't pick up the invalid color suggests that the syntax of the BGCOLOR attribute is not restricted to the hexadecimal characters 0-9 and A-F.
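The color check that an SGML parser skips is easy to express on its own. The following is a minimal Python sketch, assuming a hypothetical helper named is_valid_rgb_triplet. It covers only the "#" plus six hexadecimal digits form discussed here and ignores color names:

```python
import re

# Hypothetical helper; not part of any validator discussed in this chapter.
_RGB_TRIPLET = re.compile(r"#[0-9A-Fa-f]{6}")

def is_valid_rgb_triplet(value):
    """Return True if value is '#' followed by exactly six hexadecimal digits."""
    return _RGB_TRIPLET.fullmatch(value) is not None

print(is_valid_rgb_triplet("#50G3AD"))  # False: G is not a hexadecimal digit
print(is_valid_rgb_triplet("#50A3AD"))  # True
```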

WebTechs printed out two other parts of the report as well. These are shown in Figures 21.8 and 21.9. The first is the SGML parser output. Reviewing this report gives you an idea as to how a browser breaks down a document for presentation. Figure 21.9 shows the formatted output (how the code will look on the browser screen) and a list of the HTML tags found.

Figure 21.8 : Those with knowledge of SGML will appreciate WebTechs' parser output report.

Figure 21.9 : You can even "preview" the page by selecting the WebTechs' formatted output option.

Weblint

Weblint is technically a Perl script that checks HTML for proper syntax and some style issues. However, many people have constructed Web interfaces to Weblint so that using the script is as easy as filling out an HTML form. For a list of current gateways, direct your browser to http://www.khoros.unm.edu/staff/neilb/weblint/gateways.html.

Figure 21.10 shows Ed Kubaitis' Weblint gateway. You enter the URL of the document you want checked and specify below whether you want it checked for proper use of Microsoft, Netscape, or Java extensions. Next, you specify which warning level you want. The Gateway Default is the most lenient; Weblint Pedantic is the strictest. Once you have your parameters configured, click the Check HTML button to start the validation.

Figure 21.10 : Ed Kubaitis at UIUC constructed this Web interface to the Weblint syntax checker.


NOTE
If your browser supports HTTP file upload, you can have Weblint check the HTML in a file on your hard drive using this gateway.

Figure 21.11 shows you the Weblint error report with the warning level set to Weblint Pedantic. Weblint did a decent job of finding the mistakes in our example file, including the erroneous color code. Weblint also did a good job noting the mistake in the <META> tag, indicating that there was an odd number of quotes. Additionally, it noted that there is no mailto: element set up in a <LINK> tag in the document head. This is not a mandatory element, but since it is good style to do so, Weblint brings it to your attention if it's missing. The checked code is replicated below the error reports with messages right after the tags that are used erroneously.

Figure 21.11 : Weblint's validation output is easily understood and even uses colors to distinguish between HTML tags and error messages.

A Kinder, Gentler HTML Validator

A Kinder, Gentler HTML Validator (http://ugweb.cs.ualberta.ca/~gerald/validate/) is another Web interface, but it is much more configurable than most (see Figure 21.12). When setting up your validation session, you can choose to

Figure 21.12 : The kinder, gentler part of the Kinder, Gentler HTML Validator comes from the many validation options you can choose.

The results of our kinder, gentler HTML validation are shown in Figure 21.13. Because we are essentially using Weblint in Pedantic Mode again, we should expect the output to be virtually the same as the results in Figure 21.11. You can see in Figure 21.13 that this is precisely the case, except for an additional message that notes that BGCOLOR is an extended attribute and not standard HTML. This happened because the Kinder, Gentler HTML Validator uses the HTML 2.0 DTD in the absence of a <!DOCTYPE> tag that says a different HTML standard is in use. Our test file didn't have a <!DOCTYPE> tag, so it was parsed according to HTML 2.0 rules and the BGCOLOR attribute was flagged.
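The fallback described above, using a default DTD when no <!DOCTYPE> is present, can be sketched in a few lines of Python. This is an illustration only, not the validator's actual code. The function name declared_dtd is hypothetical, and the sketch looks only for the PUBLIC form of the declaration:

```python
import re

# Public identifier of the HTML 2.0 DTD, used here as the assumed default.
DEFAULT_DTD = "-//IETF//DTD HTML 2.0//EN"

_DOCTYPE = re.compile(r'<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)

def declared_dtd(html):
    """Return the public identifier from the DOCTYPE, or the default if none is declared."""
    match = _DOCTYPE.search(html)
    return match.group(1) if match else DEFAULT_DTD

with_doctype = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n<HTML></HTML>'
print(declared_dtd(with_doctype))     # -//W3C//DTD HTML 3.2 Final//EN
print(declared_dtd("<HTML></HTML>"))  # -//IETF//DTD HTML 2.0//EN
```

Adding a declaration like the one in with_doctype to the top of a document is what tells a validator of this kind to apply the HTML 3.2 rules instead.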

Figure 21.13 : The Kinder, Gentler HTML Validator picked up all of the same errors as Weblint, plus some others since it used the HTML 2.0 DTD.

You can also see the document outline in Figure 21.13. The Kinder, Gentler HTML Validator builds an outline based on the use of headings in a document. If the headings are used properly (that is, increasing heading levels indicate increasingly subordinate outline levels), then the outline should truly look like an outline.
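To make the idea of a heading-based outline concrete, here is a small Python sketch built on the standard library's html.parser module. The class and function names are hypothetical, and the indentation scheme (two spaces per heading level) is an assumption for illustration, not the validator's own output format:

```python
from html.parser import HTMLParser

class HeadingOutline(HTMLParser):
    """Collect (level, text) pairs for H1 through H6 headings."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None

def outline(html):
    """Return the headings as text, indented two spaces per level below H1."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    return "\n".join("  " * (level - 1) + text for level, text in parser.headings)

sample = "<H1>Pets</H1><H2>Cats</H2><H2>Dogs</H2><H3>Terriers</H3>"
print(outline(sample))
```

When heading levels are skipped or used out of order, the indentation in the result jumps around, which is the visual cue that the headings are not forming a proper outline.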

Doctor HTML

Doctor HTML maintains his office at http://www2.imagiware.com/RxHTML/. When you visit, you can submit your documents for a large battery of tests (see Figure 21.14), including

Figure 21.14 : Welcome to Dr. HTML's office. Please fill out this form....

Additionally, you can ask the doctor's report to show the command hierarchy (Show Commands) and to show what the page will look like (Show Page). You can run all of the tests by clicking the Do All Tests radio button. The report format can be Short or Long.

When we submit our test document to Dr. HTML, we get back the Summary Report shown in Figure 21.15. The diagnosis shows mixed results. Dr. HTML did pick up on three document structure errors. However, the doctor's spelling checker missed the word "misteaks" and reported that there were no spelling errors.

Figure 21.15 : Dr. HTML gives you a summary of the "patient's" condition first, followed by links to more detailed information.

The table of links at the bottom of the summary lets you read more information about the doctor's findings. Clicking the Document Structure link produces the page shown in Figure 21.16. Here, we see that the doctor flagged the missing <TITLE> and the mismatched heading style tags. There is no mention of the missing quotes in the <META> tag, though. This is somewhat discomforting because balanced container characters (quotes, parentheses, brackets, and so on) should be part of the syntax check for any piece of code in any computer language.
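A basic balance check of this kind takes very little code. The Python sketch below uses a hypothetical function name and assumes that attribute values never span lines and that no double quotes appear in ordinary text. It flags any line with an odd number of double quotes:

```python
def lines_with_odd_quotes(html):
    """Return the 1-based numbers of lines containing an odd count of double quotes."""
    return [
        number
        for number, line in enumerate(html.splitlines(), start=1)
        if line.count('"') % 2 == 1
    ]

# The chapter's test document, reproduced with its errors intact.
TEST_DOCUMENT = '''<HTML>
<HEAD>
<META NAME="KEYWORDS" CONTENT=not standards-compliant">
</HEAD>
<BODY BGCOLOR="#50G3AD">
<H1>Here's a document with 5 misteaks!</H2>
<P>Can you find them?</P>
</BODY>
</HTML>'''

print(lines_with_odd_quotes(TEST_DOCUMENT))  # [3]
```

Line 3 is the <META> tag, which contains three double quotes instead of four.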

Figure 21.16 : When pressed for details, Dr. HTML shows that it found two of the three document structure errors.


TIP
Choosing View, Document Source in Netscape Navigator is a great way to check for missing double quotes. The code will flash, starting at the point where the double quote is missing, if your double quotes are out of balance.

On the whole, Dr. HTML did well with the structural errors, but it looks like its spelling checker is not up to par. Spelling checkers are only as good as their dictionaries, however, so perhaps some additions to the Dr. HTML dictionary will improve this service.

Validation Programs that Run on Your Machine

If you'd rather have an HTML validator of your own running on your machine, there are many options to choose from. This section takes a look at four different tools that check HTML documents without having to submit them to a Web page over the Internet.

HTML PowerTools for Windows

Talicom makes its HTML PowerTools for Windows available from its Web site at http://www.tali.com/. You can download a 30-day evaluation copy of the entire PowerTools suite which includes

Once you've downloaded the evaluation version into a temporary directory, unZIP the archived file and run the Setup program to install PowerTools on your machine. Setup places an HTML PowerTools option on your Windows 95 Start menu. Clicking this option calls up the HTML PowerTools Launch Pad shown in Figure 21.17.

Figure 21.17 : You have access to all programs in the PowerTools suite from a single console.

To perform an HTML validation on a document, click the PowerAnalyzer button. The PowerAnalyzer panel enables you to set up a new project (really just a collection of files) to be validated. You can also set many of the program's options by clicking the Options button (see Figure 21.18). Of particular note is the control you get over the output format. The PowerAnalyzer's report can be printed in HTML or plain text format and you can configure whichever program you want to display the report.

Figure 21.18 : You can customize your PowerAnalyzer's performance by setting desired options in the Options dialog box.

When you have everything set up the way you want, click the Analyze HTML button and the PowerAnalyzer does its stuff. The dialog box you see during the analysis keeps you apprised of the program's progress. When the analysis is done, you can click the Launch Report Viewer button to see the results. You can see in Figure 21.19 that the PowerAnalyzer did fairly well on our test file. It found the unbalanced quotes (though it did not phrase it this way in the report), the mismatched heading style tags, and the missing <TITLE> tag. Additionally, it indicated the <!DOCTYPE> tag should be included to declare what HTML standard the document is written for. The PowerAnalyzer failed to detect the spelling error, though this is forgivable since the program doesn't purport to be a spelling checker. It most likely missed the improper color code because the syntax for RGB color triplets is not specified in the HTML DTD.

Figure 21.19 : The PowerAnalyzer can summarize its results in HTML format and load a browser for you to examine the results.

All in all, Talicom's HTML PowerTools are a handy group of utilities for those doing a lot of HTML coding. The price for the entire set of utilities is $59.95, or you can buy each utility separately. The PowerAnalyzer alone costs $24.95.

CSE 3310 HTML Validator

The CSE 3310 HTML Validator was created by Albert Wiersch and you can find it on the Web at http://www.flash.net/~wiersch/htmlvalidator.html. A link on this page takes you right to the download page where you can pull down a ZIPped archive file containing the program. Once you download the ZIPped file to a temporary directory, you just unZIP it and run the Setup program to install the validator.

NOTE
The download version of the CSE 3310 HTML Validator is an evaluation copy that is good for 30 days. The cost to register your copy of the Validator is $15.00.

When you run the Validator, you see the small window shown in Figure 21.20. The Validator is much more than its name suggests. By using options under the various menus, you can

Figure 21.20 : In spite of its innocuous main window, the CSE 3310 HTML Validator is a very full-featured HTML authoring utility.

Of course, the option we're concerned with is the validation feature. You can validate an HTML document in one of two ways:

When we submit our test document for validation, we get back the results shown in Figure 21.21. The Validator seemed to get a little hung up on the missing quote, indicating that the quotes were imbalanced all the way through the document. When it got to the end of the document, it noted that there was still a missing quotes character and that the <HTML> and <HEAD> tags had not been closed. Because it was focused on the imbalanced quotes, the Validator did not pick up on the other errors in the document.

Figure 21.21 : The CSE 3310 HTML Validator got bogged down over the missing quotes in our <META> tag.

HTML Validation Utilities for UNIX

The programs noted previously both run on 32-bit Windows machines. However, you may want to validate your HTML on your Web server and, more likely than not, the server is running on a UNIX machine. Fortunately, there are HTML validation programs for UNIX as well. Two of note are

Each of these can be freely downloaded from their Web sites. You may be able to use them in a Windows environment as long as you have a Windows Perl interpreter available, although this has not been tested.

NOTE
New validation tools are cropping up all the time. Check out http://www.yahoo.com/Computers_and_Internet/Software/Data_Formats/HTML/Validation_Checkers/ for the latest information.

HTML Editing Tools with Built-in Validators

Some of the editing tools you can use to author HTML files can also perform some kinds of validation on them. This chapter closes with a look at two such tools: SoftQuad's HoTMetaL and Microsoft FrontPage.

HoTMetaL

Figure 21.22 shows our test file loaded into HoTMetaL. To begin the HTML validation, place your cursor at the start of the file and choose Special, Validate Document.

Figure 21.22 : HoTMetaL's built-in SGML validator can be used to check your HTML syntax.

Figure 21.23 shows the first dialog box returned by HoTMetaL. It indicates that the parser was expecting an equal sign (=) at a certain point (line 4, character 69) but did not find one. The expectation of an equal sign is actually due to the lack of a quotation mark after CONTENT= in the <META> tag. If we add the quote and redo the validation, we get the screen shown in Figure 21.24.

Figure 21.23 : HoTMetaL flags errors as it finds them rather than doing a summary report after checking the whole document.

Figure 21.24 : Next HoTMetaL reminds us that our document is in need of a <TITLE>.

The dialog box in the figure tells you that there's an element missing from the document head. We know this to be the document's title. If we add a title and continue the validation, our next dialog box is the one shown in Figure 21.25.
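The missing-title check is one of the simplest structural tests to picture. Below is a minimal Python sketch using the standard library's html.parser module. The names TitleFinder and has_title are hypothetical, and the logic is a simplification for illustration rather than HoTMetaL's SGML-based method:

```python
from html.parser import HTMLParser

class TitleFinder(HTMLParser):
    """Record whether a TITLE element appears inside the document HEAD."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.has_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag == "head":
            self.in_head = True
        elif tag == "title" and self.in_head:
            self.has_title = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

def has_title(html):
    """Return True if the document head contains a TITLE element."""
    finder = TitleFinder()
    finder.feed(html)
    finder.close()
    return finder.has_title

print(has_title("<HTML><HEAD></HEAD><BODY></BODY></HTML>"))                     # False
print(has_title("<HTML><HEAD><TITLE>Test</TITLE></HEAD><BODY></BODY></HTML>"))  # True
```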

Figure 21.25 : Once a title is in place, HoTMetaL moves on to find the mismatched heading style tags.

HoTMetaL has detected the mismatched opening and closing tags for the heading used at the top of the document. When we correct this error and redo the validation, we find that the document checks out. HoTMetaL says that there are no more errors.

We, of course, know better. There are two mistakes that HoTMetaL did not flag. The first is the use of G as a digit in a hexadecimal triplet. Since the exact syntax of a hexadecimal triplet isn't coded into the HTML 3.2 DTD, it is not too great a surprise that HoTMetaL missed this error.

The second is the missed spelling error in the heading. This is also not very surprising because HoTMetaL has a separate spelling checker to check documents for spelling mistakes.

NOTE
HoTMetaL's spellcheck feature is available only in SoftQuad's HoTMetaL Pro release of the software.

Microsoft FrontPage

Microsoft's FrontPage Explorer has a facility that checks the validity of the links in your HTML document, though it does not check the code itself for validity. With a Web loaded into the Explorer, you check the links by choosing Tools, Verify Links. When the Explorer is done, it displays a dialog box such as the one you see in Figure 21.26. The box lists every link it has checked, each with a colored circle next to it. A green circle means the link was verified properly. A red circle means the link is broken.
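The general idea behind link verification can be sketched in Python. This is not how FrontPage is implemented. The names LinkCollector and verify_links are hypothetical, and a plain set of known page names stands in for the Web being checked so that the example runs without a network connection:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the HREF values of all anchor tags in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive in lowercase; values keep their case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def verify_links(html, existing_pages):
    """Map each link to True (verified) or False (broken)."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return {link: link in existing_pages for link in collector.links}

page = '<A HREF="index.html">Home</A> <A HREF="gone.html">Old page</A>'
print(verify_links(page, {"index.html", "about.html"}))
```

In this sketch, True plays the role of the green circle and False the role of the red one.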

Figure 21.26 : Although it doesn't check the syntax of your HTML, the FrontPage Explorer excels at identifying broken links.

The buttons on the right of the dialog box are helpful as you work to correct the broken links. If you fix a link and want to check to make sure it's okay, click the Verify button. If you want to edit the link that has the problem, you can click the Edit Link button. Similarly, clicking the Edit Page button opens the page with the broken link in the FrontPage Editor. If you want to delegate the repair of the broken link to someone else, click the Add Task button and make the assignment.