by Eric Ladd
Hypertext Markup Language (HTML) is often mistaken for a programming language. HTML is exactly what its name suggests: a markup language. It is a means of providing formatting instructions for presenting text-based content on the World Wide Web. These instructions are embedded right in the content, much like an editor's markup instructions are embedded in the text of a printed document. Because HTML is so critical to governing how things appear on the Web, HTML instructions are considered to be the building blocks of all Web pages.
But HTML itself also has building blocks. This chapter discusses HTML's origins, its strengths and weaknesses, and its basic components.
HTML and SGML have what is best described as a "parent-child" relationship. SGML, the "parent" language, is a document description language that gives content providers a set of very general instructions they can customize to a particular type of document. By creating new custom rules for applying SGML, you can generate all sorts of different "child" languages.
HTML is one such "child" language. It applies SGML instructions
according to a particular set of rules appropriate to presenting
content on the Web. Thus, while HTML is well-suited for Web documents,
it lacks SGML's flexibility because the rules permit only one
way to apply it.
| Who Is Making These Rules Anyway? |
You may be wondering just who it is that determines how SGML becomes HTML. The answer is the World Wide Web Consortium (W3C), a group of academic and industry partners that develops common standards for the World Wide Web. W3C is run by MIT's Laboratory for Computer Science in Cambridge, Massachusetts, and INRIA, a scientific institute in France dedicated to research in computer science and control theory. Rules begin as proposals made to the W3C by member organizations or by members of the larger Internet community. For example, Microsoft might propose the incorporation of the <MARQUEE> tag, an HTML instruction that produces a scrolling text message on the Internet Explorer screen. W3C then considers the merits of the proposal and, if the Consortium accepts it, incorporates the <MARQUEE> tag into the prevailing HTML standard. An accepted proposal is also issued to the Web community for comments, but this is mainly for input on wording of the standard and other details, not for large changes in scope. To learn more about the W3C, visit http://www.w3.org/. You can find W3C's position on future directions for HTML at http://www.w3.org/pub/WWW/MarkUp/Activity |
Stated properly, HTML is an application of SGML whose rules are spelled out in a Document Type Definition (DTD). A DTD is the set of rules that governs a specific application of SGML. The first few lines of the HTML 3.2 DTD are shown in Figure 3.1.
DTDs are written by SGML authors according to the specifications
put forward in the set of rules in use. For example, the DTD in
Figure 3.1 was written in accordance with the standards determined
by the W3C.
| Wow! That Looks Confusing... |
Even seasoned HTML pros can recoil when they see an SGML DTD. SGML code is less rooted in plain language than HTML, making it slightly harder to understand. Consider the following excerpt from the HTML 3.2 DTD:
The first line defines the HTML <IMG> tag (used to place an image on a page) as an empty or stand-alone tag. The rest of the code, starting with <!ATTLIST IMG, defines which attributes are permissible with the <IMG> tag. For example, the "src" attribute takes the form of an URL ("%URL"), is required in every <IMG> tag ("#REQUIRED"), and supplies the URL of the image to place on the page. Other lines provide similar information about an attribute: what form it takes, whether it is required or not, and what its purpose is. You can see then that the HTML DTD provides the details as to the syntax of proper HTML. Agreeing on a common syntax allows software companies to produce browsers that will be able to correctly parse and display any document that conforms to the syntax. Conversely, it allows Web authors to prepare documents with the confidence that anyone with a browser that complies with HTML syntax can view their work. To learn more about SGML, consult Que's Special Edition Using SGML. |
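In practical terms, the DTD rules described in the sidebar above translate into tags like the ones in the following minimal sketch. The file name logo.gif and the ALT text are made up for illustration; only the required SRC attribute and the absence of a closing tag come from the DTD discussion.

```html
<!-- Follows the rules described above: SRC is supplied, and there
     is no closing tag because IMG is declared as an empty tag -->
<IMG SRC="logo.gif" ALT="Company logo">

<!-- Breaks the rules: the required SRC attribute is missing -->
<IMG ALT="Company logo">
```

A validator checking against the HTML 3.2 DTD should flag the second tag, even though a forgiving browser might simply skip it without complaint.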
In an earlier version of his World Wide Web Research Notebook, Daniel Connolly put forward several advantages and disadvantages of deriving HTML constructs and standards from SGML.
Among the advantages of deriving HTML from SGML are
The drawbacks of using SGML to define HTML include:
The set of rules defining HTML that is in place at a given time is called the current HTML standard. As HTML has evolved from its founding at CERN (the European Laboratory for Particle Physics, near Geneva, Switzerland) in 1989, there have been a number of HTML standards:
HTML 3.0 was never adopted by W3C because an agreement could not be reached on every proposal. Instead, W3C jumped to what it called HTML 3.2, an expanded version of HTML 2.0 that included many, though not all, of the proposals put forward in HTML 3.0.
The current HTML standard is HTML 3.2, announced by W3C in May 1996. Among HTML 3.2's new features were
Additionally, W3C announced that it was still working on proposals
for including multimedia objects, client-side scripts, mathematical
expressions, and style sheets (collections of font and block attributes
that can be applied to a document or a portion of a document).
In the cases of client-side scripting and style sheets, W3C reserved
tags to support these items and suggested that details on their
syntax would follow in later releases of the standard.
| NOTE |
You can get all of the details on HTML 3.2 (code name Wilbur) from W3C's Web site at http://www.w3.org/pub/WWW/MarkUp/Wilbur/. Also, be on the lookout for Cougar, the code name for W3C's next revision of the HTML standard. |
When you write an HTML document, you are free to choose the level of HTML (0, 1, 2, 3, or 3.2) you want to use. For example, even though HTML 3.2 is the current standard, you may opt to use HTML 2.0 to author your documents if you're concerned that most of your audience will not have a browser that is HTML 3.2-compatible.
Indeed, your audience should be uppermost in your mind when you
decide which level to use. The best rule of thumb for helping
you make your choice is: Choose the highest level you can (to
get the maximum amount of markup flexibility) without making your
document inaccessible to your audience.
| NOTE |
An important issue as the HTML standard continues to evolve is backwards compatibility. This refers to an older browser's ability to interpret tags in the new standard. For instance, the <IMG> tag in HTML 3.2 can take the USEMAP attribute to signal a client-side image map. A browser that can handle client-side image maps would recognize and process this attribute, but one that was not 3.2-compliant would simply ignore the attribute and still correctly process the <IMG> tag. This points to an important feature that most of today's popular browsers possess: if they don't understand all or part of an HTML instruction, they simply ignore it rather than sending an error message. As W3C works to produce new standards, they will almost certainly keep backwards compatibility in mind so that HTML documents remain accessible to the broadest possible audience. |
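To make the USEMAP case in the note above concrete, here is a small sketch of a client-side image map. The file names, map name, and coordinates are invented for illustration and are not taken from any real site.

```html
<!-- A browser that understands client-side image maps uses USEMAP
     to find the MAP below; an older browser ignores the attribute
     and still displays the image -->
<IMG SRC="navbar.gif" USEMAP="#navmap" ALT="Navigation bar">

<MAP NAME="navmap">
<AREA SHAPE="RECT" COORDS="0,0,100,40" HREF="home.html" ALT="Home">
<AREA SHAPE="RECT" COORDS="100,0,200,40" HREF="search.html" ALT="Search">
</MAP>
```

Because an older browser would show the image without any clickable regions, it is a good idea to repeat the same links as plain text elsewhere on the page.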
Once you've chosen an HTML level for your authoring, you can check your documents for proper syntax by using one of the many HTML validation services available over the Web. Some HTML validators can even be downloaded and run locally on your machine.
Figure 3.2 shows the WebTechs Validation Service Web page. Notice that the first option lets you choose which level of conformance you want to check. Your choices are
Once you've chosen a level, you need only provide an URL (see Figure 3.3) or a chunk of code (see Figure 3.4) and the service takes care of the rest.
Figure 3.3 : You can feed the WebTechs Validation an URL...
Figure 3.4 : ...or a piece of HTML code for testing.
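A chunk of code submitted for validation might look something like the following minimal document. Treat it as a sketch: the title and text are placeholders, and the first line uses the public identifier given in W3C's HTML 3.2 Recommendation, which a particular validation service may or may not expect.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>Validation Test Page</TITLE>
</HEAD>
<BODY>
<H1>Hello, Web!</H1>
<P>This short page exists only to be checked by a validator.</P>
</BODY>
</HTML>
```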
What you'll get back is a report (see Figure 3.5) of any errors in your document or, if there are no errors, an invitation to label your site as conforming to the level you tested against. WebTechs maintains several images that you can place in your documents to indicate the level of conformance you achieved, kind of like the different colored stars you got on your assignments in grade school!
| TIP |
Many validation services will also check things like spelling and the validity of your hyperlinks. Check out Chapter 21, "HTML Validation," for more details. |
Now that you have some sense of where HTML comes from, you can begin to explore the language itself. There are two main kinds of constructs in HTML: elements (also called tags) and entities.
An HTML tag is a signal to a browser that it should do something other than just throw text up on the screen in the default font. Tags are instructions that are embedded directly into the informational text of your document. They are offset from the information text by less than (<) and greater than (>) signs. For example, in the line of text:
<I>Italics</I> are used to emphasize a word or phrase.
the <I> and the </I> are HTML tags. The "I" sandwiched between the less than and greater than signs signals the browser to turn on italic formatting. The "/I" between less than and greater than signs instructs the browser to turn italics off. Figure 3.6 shows an HTML source code listing in which you can see many different tags.
Figure 3.6 : HTML tags are placed directly into the same file as your informational text.
HTML tags come in two varieties: container tags and stand-alone tags.
Container Tags A tag is said to be a container
tag if it, along with a companion tag, flanks something (usually
text). The <I> tag above is an example of a container
tag. <I> and its companion tag </I>
cause the text they contain to be rendered in italics. Similarly,
the effects of other container tags are applied only to the text
they contain.
| NOTE |
In a container tag pair, the first tag (like <I>) is often called the opening tag and the second tag (like </I>) is called the closing tag. |
Most HTML tags are container tags in which the opening tag activates an effect and the closing tag turns the effect off.
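As a quick sketch of container tags at work, the following line uses <B>, <I>, and <P>, all of which are listed later in this chapter. The sentence itself is only filler text.

```html
<P>This sentence mixes <B>bold</B>, <I>italic</I>, and
<B><I>bold italic</I></B> text.</P>
```

Notice that the nested tags are closed in the reverse order in which they were opened. Browsers are often forgiving about this, but keeping the pairs properly nested is the safest habit.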
Stand-Alone Tags The second type of HTML tag is the empty or stand-alone tag. A stand-alone tag does not have a companion tag and does not contain anything (hence the name "empty"). An example of an empty tag that you've already encountered in this chapter is the <IMG> tag. <IMG> simply places an image on a Web page. It produces no effect that needs to be carried over any amount of text, so no </IMG> tag is required. You just put the <IMG> tag at a position in the document that corresponds to where you want the image to appear on-screen.
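A short sketch shows how stand-alone tags sit in the flow of a document. The address text and the file name logo.gif are hypothetical.

```html
<P>123 Main Street<BR>
Anytown, USA</P>
<HR>
<IMG SRC="logo.gif" ALT="Company logo">
```

Each stand-alone tag does its job at the spot where it appears: <BR> breaks the line, <HR> draws a rule, and <IMG> places the image. None of them has a closing tag.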
Every HTML tag has some keyword that indicates what the tag does. "I" for italics and IMG for image are such keywords. In the case of the <I> tag, the keyword is enough to tell a browser what it has to do: turn italics on.
The <IMG> tag is different. A browser that sees the keyword IMG will not have enough information to complete the task of placing an image on the page. At the very least, the browser needs to know where the image file resides so it can retrieve and display the image. Additionally, information on how big the image is, how much space to leave around it, whether or not it should have a border, and what to do if the image file can't be loaded, might also be helpful. This type of extra information is specified by means of tag attributes.
Attributes modify or expand on the effect of a tag by providing the browser with further instructions. They typically are set equal to some value, though some attributes stand on their own. For example, in the following expanded <IMG> tag:
<IMG SRC="images/header.gif" WIDTH=500 HEIGHT=120 HSPACE=5 VSPACE=3 BORDER=2 ISMAP>
SRC, WIDTH, HEIGHT, HSPACE, VSPACE, BORDER, and ISMAP are all attributes of the <IMG> tag. Almost all of them are set equal to some quantity: SRC to the URL of the image file, WIDTH and HEIGHT to the number of pixels that represent the dimensions of the image, HSPACE and VSPACE to the number of pixels of empty space (also called "white space" though it is not necessarily white in color) to leave around the image, and BORDER to the number of pixels wide the image's border should be. The ISMAP attribute indicates that the image is to be part of a server-side image map. Since the word ISMAP is sufficient to signal the browser of this, it is not necessary to set ISMAP equal to anything.
Many HTML tags, both container and stand-alone, have attributes that give document authors many more options in how they design pages. Indeed, many of the "extensions" that have been introduced into HTML come in the form of attributes to existing tags, rather than completely new tags.
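Attributes are not limited to the <IMG> tag. The following sketch, with made-up colors and text, shows a few HTML 3.2 attributes applied to both container and stand-alone tags.

```html
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<H1 ALIGN=CENTER>Welcome</H1>
<P ALIGN=RIGHT>This paragraph hugs the right margin.</P>
<P><FONT SIZE="+2" COLOR="#FF0000">Larger, red text</FONT></P>
<HR WIDTH="50%" NOSHADE>
</BODY>
```

As with ISMAP, the NOSHADE attribute of <HR> stands on its own and is not set equal to anything.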
HTML entities are character sequences that reproduce special characters on a browser screen. Special characters come in two flavors:
An HTML entity always starts with an ampersand and ends with a semicolon. What's between them determines what gets rendered on the browser screen. For example, the entity
&gt;
produces a greater than sign on screen. The foreign language character entities are made up of the base character followed by an abbreviation for the applicable diacritical mark. For example, the entity
&uuml;
produces a lowercase umlauted "u." To produce an uppercase
umlauted "u," just change the first u in the entity
to a U, giving &Uuml;.
| TIP |
A full list of the HTML entities appears in the "An Overview of the HTML Elements" section. |
One special HTML entity is the non-breaking space: &nbsp;. You can put a non-breaking space between two words that should not be separated by a line break.
In addition to the reserved characters, foreign language characters, and non-breaking space, you can represent any character in the ISO Latin-1 character set with a numeric entity. All you need to know is the character's decimal code. For example, if you needed a bullet point (·) and you knew its decimal code was 183, you could use the entity
&#183;
to place a bullet on your page.
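The three kinds of entities discussed above can be combined freely with ordinary text. Here is a brief sketch; the sentences are only sample content.

```html
<P>Write &lt;B&gt; to start bold text &amp; &lt;/B&gt; to end it.</P>
<P>Gr&uuml;&szlig;e aus M&uuml;nchen</P>
<P>Chapter&nbsp;3 &#183; HTML Basics</P>
```

A browser should display the first line with literal angle brackets and an ampersand, the second as "Grüße aus München," and the third with a small raised dot between "Chapter 3" and "HTML Basics," with no line break allowed between "Chapter" and "3."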
| TIP |
Windows users can use the Character Map accessory program to quickly look up a character's decimal code. |
While there will always be individual differences between browsers, there are two rules they follow consistently:
HTML Is Not Case-Sensitive When writing HTML tags, you are free to use any combination of uppercase and lowercase letters inside the tag. This means that each of the following tags will be interpreted the same way:
<IMG SRC="button.gif" WIDTH=50 HEIGHT=50 BORDER=0>
<Img Src="button.gif" Width=50 Height=50 Border=0>
<img src="button.gif" width=50 height=50 border=0>
<iMG SRc="button.gif" WiDtH=50 hEiGhT=50 BoRdEr=0>
The only exception to this rule is any text contained inside quotes.
Text in quotes is interpreted literally by a browser.
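The exception for quoted text matters most with file names. In the following sketch (the file names are hypothetical), the first two tags are interchangeable, but the third may point to a different file, depending on whether the Web server treats file names as case-sensitive.

```html
<!-- Equivalent: tag and attribute names are not case-sensitive -->
<IMG SRC="button.gif" ALT="Next page">
<img src="button.gif" alt="Next page">

<!-- Possibly a different file: the quoted value is taken literally -->
<IMG SRC="Button.GIF" ALT="Next page">
```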
| NOTE |
Most HTML authors choose either an all uppercase or all lowercase approach to writing HTML tags. This helps tags stand out better while editing. The tags in this book are all in uppercase to enhance readability. |
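Extra Space Is Ignored Browsers will recognize the first space after a character, but any spaces after that are ignored. Other space characters (tabs and carriage returns) are treated the same way: a browser displays a run of them as a single space at most, never as extra indentation or a new line.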
This rule can be frustrating for new HTML authors who diligently
place carriage returns in their documents, only to have the browser
treat them like they're not there. It can also be frustrating
for those who want to indent the first word of a paragraph several
spaces. You can put the customary five spaces before the first
word, but the browser will only acknowledge the first one.
| TIP |
You can use the non-breaking space entity to put in extra space where you need it. A browser ignores the last two spaces in a sequence of three space characters, but it does print three spaces if you use &nbsp;&nbsp;&nbsp;. |
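The following sketch puts the extra-space rule to work. The text is filler, and the <PRE> tag, listed in the overview tables later in this chapter, is included as one alternative to non-breaking spaces when spacing must be preserved.

```html
<P>These     words    are
spread over     several
lines in the source.</P>

<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;This paragraph starts with an indent.</P>

<PRE>
Inside PRE,    spacing and
    line breaks are kept.
</PRE>
```

The first paragraph should appear as one run of evenly spaced words, wrapped wherever the browser window requires. The second is indented by five non-breaking spaces, and the preformatted block keeps its spacing and line breaks as typed.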
Tables 3.1 through 3.6 provide an overview of all standard HTML
3.2 elements, both tags and entities. Tables describing tags indicate
whether the tag is a container or a stand-alone tag and what the
tag's purpose is. Proper tag syntax, including the use of attributes,
is discussed over the next several chapters. The entity tables
list characters and their associated entities.
| Tag | Type | Purpose |
| <BASE> | Stand-alone | Defines document baseline information |
| <HEAD> | Container | Denotes the start of the document head |
| <ISINDEX> | Stand-alone | Indicates that the document is a searchable index |
| <LINK> | Stand-alone | Establishes linking relationships with other documents |
| <META> | Stand-alone | Supplies document meta-information |
| <SCRIPT> | Container | Contains code for a client-side script |
| <STYLE> | Container | Supplies style sheet information |
| <TITLE> | Container | Gives the document a descriptive title |
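As a rough illustration of how some of the tags in the table above fit together, here is a sketch of a document head. The title, URL, description, and e-mail address are all placeholders; the exact syntax of each tag is covered in later chapters.

```html
<HEAD>
<TITLE>Widget Catalog</TITLE>
<BASE HREF="http://www.example.com/catalog/">
<META NAME="description" CONTENT="A hypothetical product catalog">
<LINK REV="made" HREF="mailto:webmaster@example.com">
</HEAD>
```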
| Tag | Type | Purpose |
| <A> | Container | Establishes an anchor |
| <ADDRESS> | Container | Denotes an address (postal or e-mail) |
| <APPLET> | Container | Embeds a Java applet in a document |
| <AREA> | Stand-alone | Defines clickable regions in a client-side image map |
| <B> | Container | Produces boldface text |
| <BIG> | Container | Renders text in a larger font size |
| <BLOCKQUOTE> | Container | Denotes a quoted passage |
| <BODY> | Container | Denotes the start of the document body |
| <BR> | Stand-alone | Inserts a line break |
| <CENTER> | Container | Centers contained items on the page |
| <CITE> | Container | Indicates the name or title of a cited work |
| <CODE> | Container | Denotes computer code |
| <DD> | Container | Denotes a term definition |
| <DIR> | Container | Initiates a directory listing |
| <DIV> | Container | Denotes the start of a document division (chapter, appendix, etc.) |
| <DL> | Container | Initiates a definition list |
| <DT> | Container | Denotes a term to be defined |
| <EM> | Container | Signifies text to be emphasized |
| <FONT> | Container | Modifies font characteristics (size and color) |
| <H1> | Container | Denotes a level 1 heading |
| <H2> | Container | Denotes a level 2 heading |
| <H3> | Container | Denotes a level 3 heading |
| <H4> | Container | Denotes a level 4 heading |
| <H5> | Container | Denotes a level 5 heading |
| <H6> | Container | Denotes a level 6 heading |
| <HR> | Stand-alone | Places a horizontal line (rule) on a page |
| <I> | Container | Produces italicized text |
| <IMG> | Stand-alone | Places an image on a page |
| <KBD> | Container | Denotes keyboard input |
| <LI> | Stand-alone | Denotes the start of a list item |
| <MAP> | Container | Contains definitions of clickable regions for a client-side image map |
| <MENU> | Container | Initiates a menu list |
| <OL> | Container | Initiates an ordered (numbered) list |
| <P> | Container | Denotes the start of a new paragraph |
| <PRE> | Container | Signifies text to be treated as preformatted |
| <SAMP> | Container | Denotes sample or literal text |
| <SMALL> | Container | Renders text in a smaller font |
| <STRIKE> | Container | Produces strikethrough text |
| <STRONG> | Container | Denotes text to be strongly emphasized |
| <SUB> | Container | Renders text as a subscript |
| <SUP> | Container | Renders text as a superscript |
| <TT> | Container | Renders text in a fixed-width font (typewriter text) |
| <UL> | Container | Initiates an unordered (bulleted) list |
| <VAR> | Container | Denotes a variable name |
| Tag | Type | Purpose |
| <FORM> | Container | Denotes the start of a form |
| <INPUT> | Stand-alone | Specifies a user input field |
| <OPTION> | Stand-alone | Defines a form menu option |
| <SELECT> | Container | Contains options in a form menu |
| <TEXTAREA> | Container | Establishes a window for multiline text input |
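The form tags in the table above might be combined along the lines of the following sketch. The ACTION URL and the field names are invented for illustration, and a real form needs a program on the server to receive the data.

```html
<FORM ACTION="/cgi-bin/feedback" METHOD="POST">
<P>Name: <INPUT TYPE="TEXT" NAME="visitor" SIZE=30></P>
<P>Topic:
<SELECT NAME="topic">
<OPTION SELECTED>General
<OPTION>Technical
</SELECT></P>
<P><TEXTAREA NAME="comments" ROWS=4 COLS=40></TEXTAREA></P>
<P><INPUT TYPE="SUBMIT" VALUE="Send"></P>
</FORM>
```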
| Tag | Type | Purpose |
| <CAPTION> | Container | Denotes a table caption |
| <TABLE> | Container | Denotes the start of a table |
| <TD> | Container | Signifies the start of a new table data element |
| <TH> | Container | Signifies the start of a new table header |
| <TR> | Container | Signifies the start of a new table row |
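Here is a small sketch that uses all five table tags from the table above. The cell contents are sample data drawn from this chapter's own examples.

```html
<TABLE BORDER=1>
<CAPTION>Two kinds of HTML tags</CAPTION>
<TR><TH>Tag</TH><TH>Type</TH></TR>
<TR><TD>&lt;I&gt;</TD><TD>Container</TD></TR>
<TR><TD>&lt;IMG&gt;</TD><TD>Stand-alone</TD></TR>
</TABLE>
```

Note that the tag names inside the cells are written with the &lt; and &gt; entities so the browser displays them as text instead of treating them as tags.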
| Character | Entity |
| Ampersand (&) | &amp; |
| Greater than sign (>) | &gt; |
| Less than sign (<) | &lt; |
| Non-breaking space | &nbsp; |
| Quotation marks (") | &quot; |
| Copyright symbol (©) | &copy; |
| Registered symbol (®) | &reg; |
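| Character | Entity |
| Æ, æ | &AElig;, &aelig; |
| Á, á | &Aacute;, &aacute; |
| Â, â | &Acirc;, &acirc; |
| Å, å | &Aring;, &aring; |
| Ã, ã | &Atilde;, &atilde; |
| Ä, ä | &Auml;, &auml; |
| Ð, ð | &ETH;, &eth; |
| É, é | &Eacute;, &eacute; |
| Ê, ê | &Ecirc;, &ecirc; |
| È, è | &Egrave;, &egrave; |
| Ë, ë | &Euml;, &euml; |
| Í, í | &Iacute;, &iacute; |
| Î, î | &Icirc;, &icirc; |
| Ì, ì | &Igrave;, &igrave; |
| Ï, ï | &Iuml;, &iuml; |
| Ñ, ñ | &Ntilde;, &ntilde; |
| Ó, ó | &Oacute;, &oacute; |
| Ô, ô | &Ocirc;, &ocirc; |
| Ò, ò | &Ograve;, &ograve; |
| Õ, õ | &Otilde;, &otilde; |
| Ö, ö | &Ouml;, &ouml; |
| ß | &szlig; |
| Þ, þ | &THORN;, &thorn; |
| Ú, ú | &Uacute;, &uacute; |
| Û, û | &Ucirc;, &ucirc; |
| Ù, ù | &Ugrave;, &ugrave; |
| Ý, ý | &Yacute;, &yacute; |
| ÿ | &yuml; |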