Concepts in Computing
CS4 - Winter 2007
Instructor: Fabio Pellacini

Lecture 2: Intro to HTML

Overview

  • Getting a page
  • The meaning of a page
  • Writing a page

How the web fits together

  • We have to precisely specify how everything works; computers can't figure it out on their own.
  • HTTP (HyperText Transfer Protocol): set of rules for how browsers and web servers communicate
  • Ex: Your PC to the CS web server: "GET /~fabio/teaching/cs4-winter07/index.html"
    CS web server to your PC: [the page]
    (There are also particular error responses that are possible.)
  • URL (Uniform Resource Locator): global identity of the page you want
  • Ex: http://www.cs.dartmouth.edu/~fabio/teaching/cs4-winter07/index.html
    • http: -- the protocol (how to obtain the document)
    • www.cs.dartmouth.edu -- the hostname (which machine has it)
    • ~fabio/teaching/cs4-winter07/ -- the path (where on that machine the document is)
    • index.html -- the file name (if absent, often assumed to be index.html)
  • The hostname is specified in a particular hierarchical naming structure
  • Ex: www.cs.dartmouth.edu
    • www -- the web server machine
    • cs -- in the CS department
    • dartmouth -- at Dartmouth
    • edu -- which is an educational institution (compare .com, .gov, .org, etc.)
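
As a rough illustration of the breakdown above, the following Python sketch splits the example URL into its parts using the standard urllib.parse module. The variable names are made up for this example, and the "HTTP/1.0" version on the request line is an assumption chosen for illustration; the sketch does not contact any server.

```python
from urllib.parse import urlsplit

url = "http://www.cs.dartmouth.edu/~fabio/teaching/cs4-winter07/index.html"
parts = urlsplit(url)

protocol = parts.scheme     # "http": how to obtain the document
hostname = parts.hostname   # "www.cs.dartmouth.edu": which machine has it

# Split the path at its last "/": directory on the left, file name on the right.
directory, _, filename = parts.path.rpartition("/")

# Hostnames are hierarchical: the rightmost label is the most general.
labels = hostname.split(".")

# The request line a browser could send for this page (version is an assumption).
request_line = f"GET {parts.path} HTTP/1.0"

print(protocol, hostname, directory, filename)
print(list(reversed(labels)))   # from most general (edu) to most specific (www)
print(request_line)
```

Reading the hostname labels in reverse order walks down the hierarchy: educational institutions, then Dartmouth, then the CS department, then the web server machine.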

The information in a document

  • We have to precisely specify how everything works; computers can't figure it out on their own.
  • Documents have structure
    • Books: chapter, section, paragraph, sentence
    • Recipes: name, ingredients, steps, cooking time, servings
    • Library records: title, author, subject, publisher, date
  • Need to explicitly represent all the information contained in a document, including its words and also its structure / meaning.
  • Problem: we only have the same medium (the text document) to represent the extra info.
  • Solution: markup language: use special extra words, called tags.
  • In general, these tags go in pairs, around what they're describing:
    "I'm starting to tell you the author's name"
    William Shakespeare
    "Okay, that was the author's name"

HTML

  • HTML: "hypertext markup language"
    (A special case of SGML (Standard Generalized Markup Language), which was developed by a committee of book publishers and librarians to solve the problem of how to tell the computer about the structure of a document.)
  • Basic use of HTML tags: <TAGNAME> the tagged text </TAGNAME>
    Note pair of "open" (<TAGNAME>) and "close" (</TAGNAME>) tags for the same name, like parentheses.
  • The tag name indicates how the tagged text is to be treated, e.g.,
    • html -- an HTML document
    • p -- a paragraph
    • h1 -- a level 1 heading
  • While most tags need both an "open" and a "close" tag, some tags don't, e.g.,
    • hr -- horizontal rule
    • br -- line break
    It's considered good style, though (and is required by some standards, such as XHTML), to end a "solo" tag with a slash just before the closing angle bracket, e.g. <br />
  • In HTML, tag names are not case-sensitive: <HTML> = <html>
    But we'll always use lower case, since XHTML requires it.
  • White space doesn't usually matter (the <pre> tag is one exception), so use it to make your code neat and readable. Nicely structured HTML helps people read it later (including you and your graders, who will count off if it is bad).
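
The "like parentheses" rule for open and close tags can be checked mechanically. Below is a minimal sketch using Python's standard html.parser module. The class name TagChecker, the function check, and the SOLO_TAGS set (which covers only the two solo tags named above) are hypothetical names invented for this example, not part of any standard tool.

```python
from html.parser import HTMLParser

# Tags that stand alone and take no close tag (only the two named in these notes).
SOLO_TAGS = {"hr", "br"}

class TagChecker(HTMLParser):
    """Tracks open tags on a stack, the way one matches parentheses."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # tags opened but not yet closed
        self.errors = []      # messages about close tags that do not match

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <P> and <p> look the same here.
        if tag not in SOLO_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in SOLO_TAGS:
            return            # <br /> is reported as a start plus an end; skip it
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")

def check(markup):
    """Return a list of problems; an empty list means every tag is paired."""
    checker = TagChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors + [f"<{t}> never closed" for t in checker.open_tags]

print(check("<P>Hello<br />world</p>"))   # []
print(check("<p><em>oops</em>"))          # ['<p> never closed']
```

The first call passes even though the open tag is upper case and the close tag is lower case, which reflects the case-insensitivity of HTML tag names.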

Sadly, there is no single HTML. There are many dialects: HTML 1, 2, 3, 4, XHTML... What are those? Think of them as different versions of a language that let us represent more information as the version number increases, much as human languages gain new words over time.
XHTML is a special case. HTML was not fully consistent (for example, some tags are not closed). This makes it harder to find mistakes, and (very rarely) browsers get confused by errors. XHTML is completely consistent. In this class we'll cover the HTML subset that is compatible with XHTML.

The structure of an HTML document

  • <html>
    <head>
    <title>The title</title>
    </head>
    <body>
    The contents
    </body>
    </html>
  • The head contains general information about the document, including the title (in the title bar and in the bookmarks), and the body contains the contents of the document (suitably marked up).
  • Apart from markup tags, a web page is simply a regular old text file. You can edit these in whatever program you like (save as text). WARNING: if you save-as html in various word processors, "hidden" mark-up is added that is ugly, often unnecessary, and confusing to web browsers. Don't do it!
  • Now you can understand the document for hw 1.
  • In fact, a good way to learn HTML is to look under the hood at your favorite web pages ("View Source" menu option). There may be some confusing things, and there may be things that aren't strictly "standard HTML", but which work anyway because browsers can be forgiving.
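
To make the head/body split concrete, here is a small sketch that holds the skeleton page in a Python string and reports which text belongs to the title and which to the body. The class name Outline and the variable names are hypothetical, and the <p> tags around the contents are an addition for this example.

```python
from html.parser import HTMLParser

PAGE = """<html>
  <head>
    <title>The title</title>
  </head>
  <body>
    <p>The contents</p>
  </body>
</html>
"""

class Outline(HTMLParser):
    """Records whether each piece of text sits in the title or in the body."""

    def __init__(self):
        super().__init__()
        self.path = []        # currently open tags, outermost first
        self.title = ""
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        if self.path and self.path[-1] == tag:
            self.path.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return            # white space between tags carries no content
        if self.path[:2] == ["html", "head"] and self.path[-1] == "title":
            self.title = text
        elif self.path[:2] == ["html", "body"]:
            self.body_text.append(text)

outline = Outline()
outline.feed(PAGE)
outline.close()
print(outline.title)        # The title
print(outline.body_text)    # ['The contents']
```

Saving the contents of PAGE as plain text in a file whose name ends in .html, and opening that file in a browser, shows the same split: the title appears in the title bar and the contents in the window.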

HW 0

(A walk-through of putting up a page.)

Some tags

  • Paragraphs are surrounded by <p> and </p> tags.
  • You can force a line break with <br /> (with no close tag).
  • You can specify that a horizontal line should be placed in the document with the <hr /> tag (with no close tag). This stands for "horizontal rule".
  • Headings, subheadings, subsubheadings, and so on are surrounded by <h1> and </h1> tags, <h2> and </h2> tags, and so on. The smallest headings available are given by <h6> and </h6> tags.
  • Lists:
    • Bulleted ("unordered") lists are surrounded by <ul> and </ul> tags.
    • Numbered ("ordered") lists are surrounded by <ol> and </ol> tags.
    • Individual items in lists are surrounded by <li> and </li> tags. This stands for "list item". (Historically, the </li> end tag was not used, so you might find some documents that don't have it. It's good practice to use it.)
      The list item can be any text, including another list.
  • Text style:
    • Emphasized text is surrounded by <em> and </em> tags.
    • Strongly emphasized text is surrounded by <strong> and </strong> tags.
    • Typewriter/monospace text is surrounded by <code> and </code> tags.
  • Escape codes:
    • Because they are used to begin and end tags, the symbols < and > cannot be used directly in text. To indicate them we use the escape sequences &lt; ("less than") and &gt; ("greater than"), respectively.
    • Because the & is used to indicate the beginning of an "escape" sequence, when we want an & in the text we need an escape sequence to indicate it: &amp;. Other escape sequences are used for special characters in foreign languages. Escape sequences are case sensitive.
    • To make a space that really matters, use the escape code &nbsp; (non-breaking space).
  • You can put a comment in the HTML (text that is in the file but isn't displayed), in a special type of solo tag whose name starts with an exclamation point and two dashes, and ends with two dashes:
    <!-- Whatever you want to say -->
  • Browsers ignore tags they don't recognize, though the text inside them is still displayed (allowing for future expansion).
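
The escape codes can be tried out with Python's standard html module, which converts text to and from its escaped form. This is only a sketch for experimenting; the sample text and variable names are made up for this example.

```python
import html

raw = "if a < b && b > c"

# Replace &, < and > with their escape sequences (quote marks are left alone).
escaped = html.escape(raw, quote=False)
print(escaped)        # if a &lt; b &amp;&amp; b &gt; c

# Unescaping recovers the original text.
round_trip = html.unescape(escaped)

# &nbsp; stands for the non-breaking space character, U+00A0.
nbsp = html.unescape("&nbsp;")

# Case matters: these two sequences name different characters.
lower = html.unescape("&eacute;")   # é
upper = html.unescape("&Eacute;")   # É
print(lower, upper)
```

Note that each & in the original text becomes &amp;, so text that already looks like an escape sequence survives a round trip unchanged.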

You can now make some pretty nice documents! You can also read this document itself. See a good HTML reference manual for more information on these and related tags.

Logical vs. physical structure

  • Note that HTML addresses two different concerns:
    1. How to convey the logical structure of a document (title, paragraphs, headings, etc.) so that the web browser can display it.
    2. How to actually display the document physically, using various fonts, styles, colors, spacing, etc.
  • Most HTML tags are targeted at the logical structure of the document, including: head, body, p, em, strong, code, cite, ul, ol
  • But HTML also provides tags to let you specify physical attributes of your document, such as:
    • b -- boldface text
    • i -- italic text
    • big -- large text
    • small -- small text
    • tt -- teletype text
  • Rule of Thumb:
    If a tag describes the MEANING of some text, it is LOGICAL.
    If a tag describes the APPEARANCE of some text, it is PHYSICAL.
  • When HTML was first created, physical tags were useful, but now, using physical tags is generally a poor design choice. As we will see, there is a much better way to specify how your documents should be displayed, using stylesheets.
  • Prefer logical tags unless you have some reason to specify the exact format of your text. Trust the browser!
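
As a rough aid for following the rule of thumb, the sketch below scans a piece of HTML for the physical tags listed above. The names LOGICAL_ALTERNATIVE, PhysicalTagFinder, and physical_tags are hypothetical, and the suggested replacements are illustrative pairings, not a rule from any HTML standard.

```python
from html.parser import HTMLParser

# The physical tags listed above, each with one possible logical replacement.
# These pairings are suggestions for illustration only.
LOGICAL_ALTERNATIVE = {
    "b": "strong",
    "i": "em",
    "tt": "code",
    "big": None,      # no logical counterpart; leave size to a stylesheet
    "small": None,
}

class PhysicalTagFinder(HTMLParser):
    """Collects every physical tag that is opened in the markup."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in LOGICAL_ALTERNATIVE:
            self.found.append(tag)

def physical_tags(markup):
    """Return the physical tags used in markup, in order of appearance."""
    finder = PhysicalTagFinder()
    finder.feed(markup)
    finder.close()
    return finder.found

sample = "<p>This is <b>very</b> <I>important</I> and <em>urgent</em>.</p>"
for tag in physical_tags(sample):
    print(tag, "->", LOGICAL_ALTERNATIVE[tag])
```

For the sample text this reports b and i, with strong and em as possible replacements, and leaves the <em> tag alone because it already describes meaning.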