Chapter 3

Building Blocks of HTML

by Eric Ladd


Hypertext Markup Language (HTML) is often mistaken for a programming language. In fact, HTML is exactly what its name suggests: a markup language. It is a means of providing formatting instructions for presenting text-based content on the World Wide Web. These instructions are embedded right in the content, much as an editor's markup instructions are written into the text of a manuscript. Because HTML is so critical to governing how things appear on the Web, HTML instructions are considered to be the building blocks of all Web pages.

But HTML itself also has building blocks. This chapter discusses HTML's origins, its strengths and weaknesses, and its basic components.

HTML and Its Relationship to SGML

HTML and SGML have what is best described as a "parent-child" relationship. SGML, the "parent" language, is a document description language that gives content providers a set of very general instructions they can customize to a particular type of document. By creating new custom rules for applying SGML, you can generate all sorts of different "child" languages.

HTML is one such "child" language. It applies SGML instructions according to a particular set of rules appropriate to presenting content on the Web. Thus, while HTML is well-suited for Web documents, it lacks SGML's flexibility because the rules permit only one way to apply it.

Who Is Making These Rules Anyway?
You may be wondering just who it is that determines how SGML becomes HTML. The answer is the World Wide Web Consortium (W3C), a group of academic and industry partners that develops common standards for the World Wide Web. W3C is run by MIT's Laboratory for Computer Science in Cambridge, Massachusetts, and INRIA, a scientific institute in France dedicated to research in computer science and control theory.
Rules begin as proposals made to the W3C by member organizations or by members of the larger Internet community. For example, Microsoft might propose the incorporation of the <MARQUEE> tag, an HTML instruction that produces a scrolling text message on the Internet Explorer screen. W3C then considers the merits of the proposal and, if the Consortium accepts it, incorporates the <MARQUEE> tag into the prevailing HTML standard. An accepted proposal is also issued to the Web community for comments, but this is mainly for input on wording of the standard and other details, not for large changes in scope.
To learn more about the W3C, visit http://www.w3.org/. You can find W3C's position on future directions for HTML at http://www.w3.org/pub/WWW/MarkUp/Activity

HTML Is a DTD of SGML

Stated properly, HTML is a Document Type Definition (DTD) of SGML. A DTD refers to the set of rules that govern a specific application of SGML. The first few lines of the HTML 3.2 DTD are shown in Figure 3.1.

Figure 3.1 : The HTML DTD describes an application of SGML by specifying format and function instructions for each SGML element.

DTDs are written by SGML authors according to the specifications put forward in the set of rules in use. For example, the DTD in Figure 3.1 was written in accordance with the standards determined by the W3C.

Wow! That Looks Confusing...
Even seasoned HTML pros can recoil when they see an SGML DTD. SGML code is less rooted in plain language than HTML, making it slightly harder to understand. Consider the following excerpt from the HTML 3.2 DTD:
<!ELEMENT IMG    - O EMPTY --  Embedded image -->
<!ATTLIST IMG
     src     %URL     #REQUIRED  -- URL of image to embed --
     alt     CDATA    #IMPLIED   -- for display in place of image --
     align   %IAlign  #IMPLIED   -- vertical or horizontal alignment --
     height  %Pixels  #IMPLIED   -- suggested height in pixels --
     width   %Pixels  #IMPLIED   -- suggested width in pixels --
     border  %Pixels  #IMPLIED   -- suggested link border width --
     hspace  %Pixels  #IMPLIED   -- suggested horizontal gutter --
     vspace  %Pixels  #IMPLIED   -- suggested vertical gutter --
     usemap  %URL     #IMPLIED   -- use client-side image map --
     ismap   (ismap)  #IMPLIED   -- use server image map --
     >
The first line defines the HTML <IMG> tag (used to place an image on a page) as an empty or stand-alone tag. The rest of the code, starting with <!ATTLIST IMG, defines which attributes are permissible with the <IMG> tag. For example, the "src" attribute takes the form of a URL ("%URL"), is required in every <IMG> tag ("#REQUIRED"), and supplies the URL of the image to place on the page. The other lines provide the same information for each remaining attribute: what form it takes, whether it is required ("#REQUIRED") or optional ("#IMPLIED"), and what its purpose is.
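As an illustration of how the attribute list translates into markup, here is a sketch of an <IMG> tag that satisfies the DTD excerpt. The file name logo.gif and the ALT text are hypothetical; only SRC is required, and the other attributes shown are optional.

```html
<!-- SRC is #REQUIRED; ALT, WIDTH, and HEIGHT are #IMPLIED (optional) -->
<IMG SRC="logo.gif" ALT="Company logo" WIDTH=120 HEIGHT=60>
```

Leaving out SRC would make the tag invalid against this DTD, while dropping any of the other attributes would not.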
You can see then that the HTML DTD provides the details as to the syntax of proper HTML. Agreeing on a common syntax allows software companies to produce browsers that will be able to correctly parse and display any document that conforms to the syntax. Conversely, it allows Web authors to prepare documents with the confidence that anyone with a browser that complies with HTML syntax can view their work.
To learn more about SGML, consult Que's Special Edition Using SGML.

Advantages and Disadvantages

In an earlier version of his World Wide Web Research Notebook, Daniel Connolly put forward several advantages and disadvantages of deriving HTML constructs and standards from SGML.

Among the advantages of deriving HTML from SGML are

The drawbacks of using SGML to define HTML include:

Conformance to the Standard

The set of rules defining HTML that is in place at a given time is called the current HTML standard. As HTML has evolved from its founding in 1989 at CERN (the European Laboratory for Particle Physics, near Geneva, Switzerland), there have been a number of HTML standards:

HTML 3.0 was never adopted by W3C because an agreement could not be reached on every proposal. Instead, W3C jumped to what it called HTML 3.2, an expanded version of HTML 2.0 that included many, though not all, of the proposals put forward in HTML 3.0.

The Current Standard: HTML 3.2

The current HTML standard is HTML 3.2, announced by W3C in May 1996. Among HTML 3.2's new features were

Additionally, W3C announced that it was still working on proposals for including multimedia objects, client-side scripts, mathematical expressions, and style sheets (collections of font and block attributes that can be applied to a document or a portion of a document). In the cases of client-side scripting and style sheets, W3C reserved tags to support these items and suggested that details on their syntax would follow in later releases of the standard.

NOTE
You can get all of the details on HTML 3.2 (codename Wilbur) from W3C's Web site at http://www.w3.org/pub/WWW/MarkUp/Wilbur/. Also, be on the lookout for Cougar, the code name for W3C's next revision of the HTML standard.

Choosing a Level of Conformance

When you write an HTML document, you are free to choose the level of HTML (0, 1, 2, 3, or 3.2) you want to use. For example, even though HTML 3.2 is the current standard, you may opt to use HTML 2.0 to author your documents if you're concerned that most of your audience will not have a browser that is HTML 3.2-compatible.

Indeed, your audience should be uppermost in your mind when you decide which level to use. The best rule of thumb for helping you make your choice is: Choose the highest level you can (to get the maximum amount of markup flexibility) without making your document inaccessible to your audience.
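One way to record the level you have chosen is a document type declaration on the first line of the file. This chapter does not cover the declaration, so treat the following as a sketch rather than a requirement: the two public identifiers are the ones published for HTML 3.2 and HTML 2.0, and a given document would carry only one of them.

```html
<!-- For a document written to HTML 3.2 -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">

<!-- For a document written to HTML 2.0 instead -->
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
```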

NOTE
An important issue as the HTML standard continues to evolve is backwards compatibility. This refers to an older browser's ability to interpret tags in the new standard. For instance, the <IMG> tag in HTML 3.2 can take the USEMAP attribute to signal a client-side image map. A browser that can handle client-side image maps would recognize and process this attribute, but one that was not 3.2-compliant would simply ignore the attribute and still correctly process the <IMG> tag. This points to an important feature that most of today's popular browsers possess: if they don't understand all or part of an HTML instruction, they simply ignore it rather than sending an error message.
As W3C works to produce new standards, they will almost certainly keep backwards compatibility in mind so that HTML documents remain accessible to the broadest possible audience.
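The USEMAP case described in this note can be sketched as follows. The image file, map name, coordinates, and link targets are all hypothetical. A browser that supports client-side image maps uses the map; an older one should ignore what it does not recognize and still display the image.

```html
<!-- USEMAP points at the map named "navmap" defined below -->
<IMG SRC="navbar.gif" ALT="Navigation bar" WIDTH=400 HEIGHT=40 USEMAP="#navmap">

<!-- Each AREA defines one clickable rectangle inside the image -->
<MAP NAME="navmap">
<AREA SHAPE=RECT COORDS="0,0,199,39" HREF="home.html" ALT="Home">
<AREA SHAPE=RECT COORDS="200,0,399,39" HREF="products.html" ALT="Products">
</MAP>
```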

Testing Conformance

Once you've chosen an HTML level for your authoring, you can check your documents for proper syntax by using one of the many HTML validation services available over the Web. Some HTML validators can even be downloaded and run locally on your machine.

Figure 3.2 shows the WebTechs Validation Service Web page. Notice that the first option lets you choose which level of conformance you want to check. Your choices are

Figure 3.2 : The WebTechs Validation Service will check your HTML documents for any of seven different conformance levels.

Once you've chosen a level, you need only provide a URL (see Figure 3.3) or a chunk of code (see Figure 3.4) and the service takes care of the rest.

Figure 3.3 : You can feed the WebTechs Validation Service a URL...

Figure 3.4 : ...or a piece of HTML code for testing.

What you'll get back is a report (see Figure 3.5) of any errors in your document or, if there are no errors, an invitation to label your site as conforming to the level you tested against. WebTechs maintains several images that you can place in your documents to indicate the level of conformance you passed, much like the different colored stars you got on your assignments in grade school!

Figure 3.5 : An HTML validation service returns a report of any errors it finds. WebTechs also provides a list of all HTML tags it encounters.


TIP
Many validation services will also check things like spelling and the validity of your hyperlinks. Check out Chapter 21, "HTML Validation," for more details.

HTML Constructs

Now that you have some sense of where HTML comes from, you can begin to explore the language itself. There are two main kinds of constructs in HTML: elements (also called tags) and entities.

Tags

An HTML tag is a signal to a browser that it should do something other than just throw text up on the screen in the default font. Tags are instructions that are embedded directly into the informational text of your document. They are offset from the information text by less than (<) and greater than (>) signs. For example, in the line of text:

<I>Italics</I> are used to emphasize a word or phrase.

the <I> and the </I> are HTML tags. The "I" sandwiched between the less than and greater than signs signals the browser to turn on italic formatting. The "/I" between less than and greater than signs instructs the browser to turn italics off. Figure 3.6 shows an HTML source code listing in which you can see many different tags.

Figure 3.6 : HTML tags are placed directly into the same file as your informational text.

HTML tags come in two varieties: container tags and stand-alone tags.

Container Tags  A tag is said to be a container tag if it, along with a companion tag, flanks something (usually text). The <I> tag above is an example of a container tag. <I> and its companion tag </I> cause the text they contain to be rendered in italics. Similarly, the effects of other container tags are applied only to the text they contain.

NOTE
In a container tag pair, the first tag (like <I>) is often called the opening tag and the second tag (like </I>) is called the closing tag.

Most HTML tags are container tags in which the opening tag activates an effect and the closing tag turns the effect off.
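A short sketch of container tags at work, using the <B> and <I> tags (the sentence text is only an example):

```html
<!-- Only the words between each opening and closing tag are affected -->
This sentence has <B>bold text</B> and <I>italic text</I> in it.

<!-- Container tags can be nested; close the inner tag before the outer one -->
<B>Bold and <I>bold italic</I> text</B>
```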

Stand-Alone Tags  The second type of HTML tag is the empty or stand-alone tag. A stand-alone tag does not have a companion tag and does not contain anything (hence the name "empty"). An example of an empty tag that you've already encountered in this chapter is the <IMG> tag. <IMG> simply places an image on a Web page. It produces no effect that needs to be carried over any amount of text, so no </IMG> tag is required. You just put the <IMG> tag at a position in the document that corresponds to where you want the image to appear on-screen.
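For comparison, here is a sketch that uses three stand-alone tags. The address text and the image file name are hypothetical. None of these tags has a closing companion; each one takes effect at the point where it appears.

```html
<!-- BR breaks the line at this exact point -->
First line of an address<BR>
Second line of an address

<!-- HR draws a horizontal rule across the page -->
<HR>

<!-- IMG places an image where the tag occurs -->
<IMG SRC="photo.gif" ALT="Photo of the office">
```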

Tag Attributes

Every HTML tag has some keyword that indicates what the tag does. "I" for italics and IMG for image are such keywords. In the case of the <I> tag, the keyword is enough to tell a browser what it has to do: turn italics on.

The <IMG> tag is different. A browser that sees the keyword IMG will not have enough information to complete the task of placing an image on the page. At the very least, the browser needs to know where the image file resides so it can retrieve and display the image. Additionally, information on how big the image is, how much space to leave around it, whether or not it should have a border, and what to do if the image file can't be loaded, might also be helpful. This type of extra information is specified by means of tag attributes.

Attributes modify or expand on the effect of a tag by providing the browser with further instructions. They typically are set equal to some value, though some attributes stand on their own. For example, in the following expanded <IMG> tag:

<IMG SRC="images/header.gif" WIDTH=500 HEIGHT=120 HSPACE=5 VSPACE=3 
BORDER=2 ISMAP>

SRC, WIDTH, HEIGHT, HSPACE, VSPACE, BORDER, and ISMAP are all attributes of the <IMG> tag. Almost all of them are set equal to some quantity: SRC to the URL of the image file, WIDTH and HEIGHT to the number of pixels that represent the dimensions of the image, HSPACE and VSPACE to the number of pixels of empty space (also called "white space," though it is not necessarily white in color) to leave around the image, and BORDER to the width of the image's border in pixels. The ISMAP attribute indicates that the image is to be part of a server-side image map. Since the word ISMAP is sufficient to signal this to the browser, it is not necessary to set ISMAP equal to anything.

Many HTML tags, both container and stand-alone, have attributes that give document authors many more options in how they design pages. Indeed, many of the "extensions" that have been introduced into HTML come in the form of attributes to existing tags, rather than completely new tags.
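Attributes work the same way on container tags. The following sketch shows three HTML 3.2 container tags with attributes; the text and the link target contact.html are hypothetical.

```html
<!-- ALIGN modifies how the paragraph container is laid out -->
<P ALIGN=CENTER>This paragraph is centered.</P>

<!-- SIZE and COLOR modify the text inside the FONT container -->
<FONT SIZE="+1" COLOR="#FF0000">Slightly larger red text</FONT>

<!-- HREF tells the anchor container where the link leads -->
<A HREF="contact.html">Contact us</A>
```

Values that contain characters other than letters, digits, hyphens, and periods (such as +1 and #FF0000 here) are safest when enclosed in quotes.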

HTML Entities

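HTML entities are character sequences that reproduce special characters on a browser screen. Special characters come in two flavors: reserved characters, which HTML uses for its own syntax (such as the less than sign, the greater than sign, and the ampersand), and foreign language characters, which carry diacritical marks.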

An HTML entity always starts with an ampersand and ends with a semicolon. What's between them determines what gets rendered on the browser screen. For example, the entity

&gt;

produces a greater than sign on screen. The foreign language character entities are made up of the base character followed by the applicable diacritical mark. For example, the entity

&uuml;

produces a lowercase umlauted "u." To produce an uppercase umlauted "u," just change the first u in the entity to a U.
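A brief sketch of both kinds of entity in running text (the sentences themselves are just examples):

```html
<!-- Reserved characters: the browser shows the angle brackets and the ampersand -->
To break a line, type &lt;BR&gt;.
Research &amp; Development

<!-- Characters with diacritical marks: lowercase and uppercase umlauted u -->
M&uuml;nchen and &Uuml;bung
```

Without the entities, a browser would try to interpret the literal <BR> as a tag instead of displaying it.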

TIP
A full list of the HTML entities appears in the "An Overview of the HTML Elements" section.

One special HTML entity is the non-breaking space: &nbsp;. You can put a non-breaking space between two words that should not be separated by a line break.

In addition to the reserved characters, foreign language characters, and non-breaking space, you can represent any character in the ISO Latin-1 character set with a numeric entity. All you need to know is the character's decimal code. (Only codes up to 127 are true ASCII; the higher ones come from ISO Latin-1.) For example, if you needed a bullet point (·) and you knew that its code was 183, you could use the entity

&#183;

to place a bullet on your page.
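The same pattern works for other decimal codes. In this sketch, 183 is the bullet from the text, while 169 (the copyright sign) and 233 (e with an acute accent) are additional ISO Latin-1 codes chosen for illustration; the business name is hypothetical.

```html
<!-- &#183; is the dot used as a bullet -->
&#183; First point<BR>
&#183; Second point<BR>

<!-- 169 is the copyright sign; 233 is e with an acute accent -->
&#169; 1996 Caf&#233; Example
```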

TIP
Windows users can use the Character Map accessory program to quickly look up a character's decimal code.

Two Rules Browsers Follow When Processing HTML

While there will always be individual differences between browsers, there are two rules they follow consistently: HTML is not case-sensitive, and extra space is ignored.

HTML Is Not Case-Sensitive  When writing HTML tags, you are free to use any combination of uppercase and lowercase letters inside the tag. This means that each of the following tags will be interpreted the same way:

<IMG SRC="button.gif" WIDTH=50 HEIGHT=50 BORDER=0>
<Img Src="button.gif" Width=50 Height=50 Border=0>
<img src="button.gif" width=50 height=50 border=0>
<iMG SRc="button.gif" WiDtH=50 hEiGhT=50 BoRdEr=0>

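The only exception to this rule is any text contained inside quotes. Text in quotes is interpreted literally by a browser. Note, too, that the rule applies to tags and not to entities: entity names are case-sensitive, which is why &uuml; and &Uuml; produce different characters.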
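The quoted-text exception can be sketched with a hyperlink. The file name catalog.html is hypothetical, and whether a differently capitalized file name reaches the same file depends on the Web server, so the last line is an assumption about one common outcome rather than a rule.

```html
<!-- These two tags are equivalent: tag and attribute names ignore case -->
<A HREF="catalog.html">Catalog</A>
<a href="catalog.html">Catalog</a>

<!-- This tag may not be equivalent: the quoted value is passed along as written -->
<A HREF="Catalog.html">Catalog</A>
```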

NOTE
Most HTML authors choose either an all uppercase or all lowercase approach to writing HTML tags. This helps tags stand out better while editing. The tags in this book are all in uppercase to enhance readability.

Extra Space Is Ignored  Browsers will recognize the first space after a character, but any spaces after that are ignored. Other space characters (tabs and carriage returns) are treated the same way: a run of them displays as, at most, a single space.

This rule can be frustrating for new HTML authors who diligently place carriage returns in their documents, only to have the browser treat them like they're not there. It can also be frustrating for those who want to indent the first word of a paragraph several spaces. You can put the customary five spaces before the first word, but the browser will only acknowledge the first one.
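The following sketch shows the rule in action. In the first paragraph, the extra spaces and the carriage return in the source have no effect, so the browser runs the words together on one line (wrapping only where the window requires). The second paragraph uses the <BR> tag to state the line break explicitly.

```html
<!-- Extra spaces and the carriage return collapse to single spaces -->
<P>
     Roses     are red,
Violets are blue.
</P>

<!-- To force the line break, ask for it with BR -->
<P>
Roses are red,<BR>
Violets are blue.
</P>
```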

TIP
You can use non-breaking space to put in extra space where you need it. A browser ignores the last two spaces in a sequence of three space characters, but it does print three spaces if you use
&nbsp;&nbsp;&nbsp;.
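Applied to the paragraph-indent problem, the tip might look like this sketch (the paragraph text is only an example):

```html
<!-- Five non-breaking spaces indent the first line of the paragraph -->
<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;This paragraph begins with an indent
that ordinary space characters could not produce.</P>
```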

An Overview of the HTML Elements

Tables 3.1 through 3.6 provide an overview of all standard HTML 3.2 elements, both tags and entities. Tables describing tags indicate whether the tag is a container or a stand-alone tag and what the tag's purpose is. Proper tag syntax, including the use of attributes, is discussed over the next several chapters. The entity tables list characters and their associated entities.

Table 3.1  HTML Tags Allowable in the Document Head

Tag         Type         Purpose
<BASE>      Stand-alone  Defines document baseline information
<HEAD>      Container    Denotes the start of the document head
<ISINDEX>   Stand-alone  Indicates that the document is a searchable index
<LINK>      Stand-alone  Establishes linking relationships with other documents
<META>      Stand-alone  Supplies document meta-information
<SCRIPT>    Container    Contains code for a client-side script
<STYLE>     Container    Supplies style sheet information
<TITLE>     Container    Gives the document a descriptive title
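As a preview of how several of these tags fit together, here is a sketch of a small document head. The title, the example.com addresses, and the META content are hypothetical, and the enclosing <HTML> tag, which does not appear in the tables, is included so the fragment forms a complete document.

```html
<HTML>
<HEAD>
<!-- TITLE is a container; BASE, META, and LINK are stand-alone -->
<TITLE>Widget Catalog</TITLE>
<BASE HREF="http://www.example.com/catalog/">
<META NAME="description" CONTENT="A sample catalog page">
<LINK REV="made" HREF="mailto:webmaster@example.com">
</HEAD>
<BODY>
<P>Document content goes here.</P>
</BODY>
</HTML>
```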

Table 3.2  HTML Tags Allowable in the Document Body

Tag           Type         Purpose
<A>           Container    Establishes an anchor
<ADDRESS>     Container    Denotes an address (postal or e-mail)
<APPLET>      Container    Embeds a Java applet in a document
<AREA>        Stand-alone  Defines clickable regions in a client-side image map
<B>           Container    Produces boldface text
<BIG>         Container    Renders text in a larger font size
<BLOCKQUOTE>  Container    Denotes a quoted passage
<BODY>        Container    Denotes the start of the document body
<BR>          Stand-alone  Inserts a line break
<CENTER>      Container    Centers contained items on the page
<CITE>        Container    Indicates the name or title of a cited work
<CODE>        Container    Denotes computer code
<DD>          Container    Denotes a term definition
<DIR>         Container    Initiates a directory listing
<DIV>         Container    Denotes the start of a document division (chapter, appendix, etc.)
<DL>          Container    Initiates a definition list
<DT>          Container    Denotes a term to be defined
<EM>          Container    Signifies text to be emphasized
<FONT>        Container    Modifies font characteristics (size and color)
<H1>          Container    Denotes a level 1 heading
<H2>          Container    Denotes a level 2 heading
<H3>          Container    Denotes a level 3 heading
<H4>          Container    Denotes a level 4 heading
<H5>          Container    Denotes a level 5 heading
<H6>          Container    Denotes a level 6 heading
<HR>          Stand-alone  Places a horizontal line (rule) on a page
<I>           Container    Produces italicized text
<IMG>         Stand-alone  Places an image on a page
<KBD>         Container    Denotes keyboard input
<LI>          Stand-alone  Denotes the start of a list item
<MAP>         Container    Contains definitions of clickable regions for a client-side image map
<MENU>        Container    Initiates a menu list
<OL>          Container    Initiates an ordered (numbered) list
<P>           Container    Denotes the start of a new paragraph
<PRE>         Container    Signifies text to be treated as preformatted
<SAMP>        Container    Denotes sample or literal text
<SMALL>       Container    Renders text in a smaller font
<STRIKE>      Container    Produces strikethrough text
<STRONG>      Container    Denotes text to be strongly emphasized
<SUB>         Container    Renders text as a subscript
<SUP>         Container    Renders text as a superscript
<TT>          Container    Renders text in a fixed-width font (typewriter text)
<UL>          Container    Initiates an unordered (bulleted) list
<VAR>         Container    Denotes a variable name
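A handful of these body tags can be combined as in the following sketch. The content is invented for illustration, and <LI> is written without a closing tag, in keeping with its stand-alone listing in the table.

```html
<H1>Gardening Notes</H1>
<P>Tools you <EM>must</EM> have:</P>

<!-- An unordered list: each LI starts a new bulleted item -->
<UL>
<LI>Trowel
<LI>Watering can
</UL>

<!-- A definition list pairs a term (DT) with its definition (DD) -->
<DL>
<DT>Mulch</DT>
<DD>A protective layer spread over soil</DD>
</DL>

<ADDRESS>gardener@example.com</ADDRESS>
```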

Table 3.3  HTML Tags Allowable in a Form

Tag         Type         Purpose
<FORM>      Container    Denotes the start of a form
<INPUT>     Stand-alone  Specifies a user input field
<OPTION>    Stand-alone  Defines a form menu option
<SELECT>    Container    Contains options in a form menu
<TEXTAREA>  Container    Establishes a window for multiline text input
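The form tags might be combined as in this sketch. The ACTION address /cgi-bin/feedback and all field names are hypothetical; a real form needs a program on the server to receive the data.

```html
<FORM ACTION="/cgi-bin/feedback" METHOD=POST>
<!-- A single-line text field -->
<P>Name: <INPUT TYPE=TEXT NAME="visitor" SIZE=30></P>

<!-- A menu: SELECT contains the OPTION entries -->
<P>Topic:
<SELECT NAME="topic">
<OPTION>Praise
<OPTION>Complaint
</SELECT></P>

<!-- A multiline text window, followed by a submit button -->
<P><TEXTAREA NAME="comments" ROWS=4 COLS=40></TEXTAREA></P>
<P><INPUT TYPE=SUBMIT VALUE="Send"></P>
</FORM>
```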

Table 3.4  HTML Tags Allowable in a Table

Tag         Type         Purpose
<CAPTION>   Container    Denotes a table caption
<TABLE>     Container    Denotes the start of a table
<TD>        Container    Signifies the start of a new table data element
<TH>        Container    Signifies the start of a new table header
<TR>        Container    Signifies the start of a new table row
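Here is a sketch of the five table tags working together. The caption and cell contents are chosen only for illustration.

```html
<TABLE BORDER=1>
<CAPTION>Features by HTML Level</CAPTION>
<!-- Each TR is one row; TH cells are headers and TD cells hold data -->
<TR><TH>Feature</TH><TH>HTML 2.0</TH><TH>HTML 3.2</TH></TR>
<TR><TD>Forms</TD><TD>Yes</TD><TD>Yes</TD></TR>
<TR><TD>Tables</TD><TD>No</TD><TD>Yes</TD></TR>
</TABLE>
```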

Table 3.5  Reserved Character Entities

Character               Entity
Ampersand (&)           &amp;
Greater than sign (>)   &gt;
Less than sign (<)      &lt;
Non-breaking space      &nbsp;
Quotation marks (")     &quot;
Copyright symbol (©)    &copy;
Registered symbol (®)   &reg;

Table 3.6  Entities for Characters with Diacritical Marks

Character  Entity
Æ, æ       &AElig;, &aelig;
Á, á       &Aacute;, &aacute;
Â, â       &Acirc;, &acirc;
Å, å       &Aring;, &aring;
Ã, ã       &Atilde;, &atilde;
Ä, ä       &Auml;, &auml;
Ð, ð       &ETH;, &eth;
É, é       &Eacute;, &eacute;
Ê, ê       &Ecirc;, &ecirc;
È, è       &Egrave;, &egrave;
Ë, ë       &Euml;, &euml;
Í, í       &Iacute;, &iacute;
Î, î       &Icirc;, &icirc;
Ì, ì       &Igrave;, &igrave;
Ï, ï       &Iuml;, &iuml;
Ñ, ñ       &Ntilde;, &ntilde;
Ó, ó       &Oacute;, &oacute;
Ô, ô       &Ocirc;, &ocirc;
Ò, ò       &Ograve;, &ograve;
Õ, õ       &Otilde;, &otilde;
Ö, ö       &Ouml;, &ouml;
ß          &szlig;
Þ, þ       &THORN;, &thorn;
Ú, ú       &Uacute;, &uacute;
Û, û       &Ucirc;, &ucirc;
Ù, ù       &Ugrave;, &ugrave;
Ý, ý       &Yacute;, &yacute;
ÿ          &yuml;