Concepts in Computing
CS4 - Winter 2007
Instructor: Fabio Pellacini

Lecture 16: Building Abstractions

Overview

Recall that we defined a simple computer using the von Neumann architecture. A simple instruction set allows us to do arithmetic, comparisons, branches, and data transfer. This hides the complexity of the circuits, which hide the complexity of the gates, which hide the complexity of the switches. We'll continue moving up abstraction levels, to assembly language and higher-level languages. We'll also discuss how operating systems provide support for hiding some low-level details.

Instruction Set and the Virtual Machine

The von Neumann architecture we discussed describes a kind of "naked machine", which computes by passing signals around between the four functional units of the computer (memory, ALU, control unit, and input/output). To the programmer, however, the important details are the instructions: They determine how you think about writing a program. The instruction set defines a virtual machine; that is to say, a mental model of a machine that is accurate with respect to what the computer does, but abstracts away details that aren't important to the programmer, like wiring and timing.

The virtual machine described in the text is as follows.

  • It has a single register R.
  • It has a program counter PC.
  • It has a memory with 12-bit addresses and 16-bit cells. (Thus, the computer's memory has 4096 locations of 16 bits each.) In describing operations, we act as if the memory is accessed like an array: M[x] gives the contents of the memory cell at address x.
  • It also has an input device from which it can read integer values, and an output device to which it can write integer values.
  • The ALU has three status registers for recording the results of comparisons. These registers are named LT, EQ, and GT, and each holds a single bit (0 or 1).

Each instruction is encoded as a 4-bit opcode plus a 12-bit operand, for a total of 16 bits per instruction.
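To make the packing concrete, here is a tiny JavaScript sketch (the encode and decode helper names are ours, purely for illustration) of how shifts and masks split such a 16-bit word:

function encode(opcode, operand) {
  return (opcode << 12) | (operand & 0xFFF);  // top 4 bits | bottom 12 bits
}

function decode(word) {
  return { opcode: (word >> 12) & 0xF, operand: word & 0xFFF };
}

encode(3, 6);   // ADD 6  ->  0011 0000 0000 0110  (12294 in decimal)
decode(12294);  // -> { opcode: 3, operand: 6 }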

Opcode       Instruction   Function
 0 (0000)    LOAD x        R = M[x]
 1 (0001)    STORE x       M[x] = R
 2 (0010)    CLEAR x       M[x] = 0
 3 (0011)    ADD x         R = R + M[x]
 4 (0100)    INCREMENT x   M[x] = M[x] + 1
 5 (0101)    SUBTRACT x    R = R - M[x]
 6 (0110)    DECREMENT x   M[x] = M[x] - 1
 7 (0111)    COMPARE x     if M[x] > R, then GT = 1
                           if M[x] = R, then EQ = 1
                           if M[x] < R, then LT = 1
 8 (1000)    JUMP x        PC = x
 9 (1001)    JUMPGT x      if GT = 1, then PC = x
10 (1010)    JUMPEQ x      if EQ = 1, then PC = x
11 (1011)    JUMPLT x      if LT = 1, then PC = x
12 (1100)    JUMPNEQ x     if EQ = 0, then PC = x
13 (1101)    IN x          read the input device, store the value in M[x]
14 (1110)    OUT x         write M[x] to the output device
15 (1111)    HALT          processor stops (operand ignored)
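
In fact, the whole machine is small enough to simulate directly. Below is a minimal JavaScript sketch of its fetch-decode-execute cycle; the function name run is ours, the input and output devices are modeled as plain arrays, and we assume COMPARE sets all three status flags on every comparison (the table only says when each flag becomes 1):

function run(memory, input) {
  var R = 0, PC = 0;                    // register and program counter
  var GT = 0, EQ = 0, LT = 0;           // ALU status registers
  var output = [];
  var next = 0;                         // position in the input stream

  while (true) {
    var word = memory[PC];              // fetch
    var op = (word >> 12) & 0xF;        // decode: top 4 bits are the opcode,
    var x = word & 0xFFF;               //   bottom 12 bits are the operand
    PC = PC + 1;

    switch (op) {                       // execute
      case 0:  R = memory[x]; break;              // LOAD
      case 1:  memory[x] = R; break;              // STORE
      case 2:  memory[x] = 0; break;              // CLEAR
      case 3:  R = R + memory[x]; break;          // ADD
      case 4:  memory[x] = memory[x] + 1; break;  // INCREMENT
      case 5:  R = R - memory[x]; break;          // SUBTRACT
      case 6:  memory[x] = memory[x] - 1; break;  // DECREMENT
      case 7:                                     // COMPARE
        GT = memory[x] > R ? 1 : 0;
        EQ = memory[x] === R ? 1 : 0;
        LT = memory[x] < R ? 1 : 0;
        break;
      case 8:  PC = x; break;                     // JUMP
      case 9:  if (GT === 1) PC = x; break;       // JUMPGT
      case 10: if (EQ === 1) PC = x; break;       // JUMPEQ
      case 11: if (LT === 1) PC = x; break;       // JUMPLT
      case 12: if (EQ === 0) PC = x; break;       // JUMPNEQ
      case 13: memory[x] = input[next]; next = next + 1; break;  // IN
      case 14: output.push(memory[x]); break;     // OUT
      case 15: return output;                     // HALT
    }
  }
}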

Example Virtual Machine Program

Compute 5 + 12 and write the result to the output device. The basic idea:

  1. Load 5 from some memory location into the register (R)
  2. Add 12 from some memory location
  3. Store R back to some memory location
  4. Write that memory location out to the output device
  5. Halt

Implications:

  • We have to pick where our instructions go in memory, which means we have to know how many there are.
  • We have to pick memory locations for 5 and 12, and make sure those values are available when the program runs.
  • There are five instructions and two data values, so let's use addresses 0..4 for the instructions, 5..6 for the data, and 7 for the result:

Addr   Contents              Meaning
0      0000 0000 0000 0101   LOAD  5
1      0011 0000 0000 0110   ADD   6
2      0001 0000 0000 0111   STORE 7
3      1110 0000 0000 0111   OUT   7
4      1111 0000 0000 0000   HALT
5      0000 0000 0000 0101   (data value) +5
6      0000 0000 0000 1100   (data value) +12

The values at M[0] and M[5] are identical. Why is one of them an instruction and the other data?
As we've discussed, there is no intrinsic reason; binary values can be interpreted in many different ways!
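
Assuming the encode and run sketches from above, we can load these seven words into memory and execute the program:

var program = [
  encode(0, 5),    // 0: LOAD  5
  encode(3, 6),    // 1: ADD   6
  encode(1, 7),    // 2: STORE 7
  encode(14, 7),   // 3: OUT   7
  encode(15, 0),   // 4: HALT
  5,               // 5: data value +5
  12               // 6: data value +12
];

run(program, []);  // -> [17]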

Assembler

Writing a program like that is painful (though easier than wiring gates...). Translating everything into binary, choosing addresses, looking up opcodes -- this is all tedium the computer should be doing. In fact, that's what an assembler does: you write a program in a symbolic way, and the assembler picks the memory locations and translates everything to binary. It lets you assign names to locations in your program, instead of having to know their addresses in memory.

Example: Get two numbers; print whichever is larger.

begin:    IN      x        ;; input two values from the user
          IN      y

          LOAD    y
          COMPARE x        ;; compare x against y
          JUMPLT  printy   ;; if x < y, skip over the part to print x

          OUT     x        
          JUMP    done     ;; if x ≥ y, skip over the part to print y

printy:   OUT     y

done:     HALT

x         .DATA     0      ;; storage for variables
y         .DATA     0

This program has 9 instructions (say, addresses 0..8) and 2 variable locations (say, x:9, y:10). Thus the assembler replaces x with 9 and y with 10. Similarly, it replaces the label printy with its actual address, 7, and the label done with address 8. The .DATA "pseudo-op" is not really an instruction, just a notation telling the assembler that the memory location is set aside for storing data.

Addr  Value                Meaning
0     1101 0000 0000 1001  IN 9
1     1101 0000 0000 1010  IN 10
2     0000 0000 0000 1010  LOAD 10
3     0111 0000 0000 1001  COMPARE 9
4     1011 0000 0000 0111  JUMPLT 7
5     1110 0000 0000 1001  OUT 9
6     1000 0000 0000 1000  JUMP 8
7     1110 0000 0000 1010  OUT 10
8     1111 0000 0000 0000  HALT
9     0000 0000 0000 0000  (data, variable x)
10    0000 0000 0000 0000  (data, variable y)

Note that if we had neglected to put in the HALT statement, the program would continue on, trying to execute address 9. It would blindly interpret the value it finds there as an instruction -- the top four bits as an opcode (0, i.e. LOAD, whenever x holds a small value) and the rest as an operand. It's all just bits.
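
The translation the assembler performs is essentially a two-pass algorithm: the first pass assigns an address to every label, and the second pass emits the binary words. Here is a minimal JavaScript sketch along those lines (the three-element line format and the assemble name are our own simplification, not the actual assembler):

var OPCODES = { LOAD: 0, STORE: 1, CLEAR: 2, ADD: 3, INCREMENT: 4,
                SUBTRACT: 5, DECREMENT: 6, COMPARE: 7, JUMP: 8, JUMPGT: 9,
                JUMPEQ: 10, JUMPLT: 11, JUMPNEQ: 12, IN: 13, OUT: 14, HALT: 15 };

// Each source line is [label or null, mnemonic, operand or null].
function assemble(lines) {
  var symbols = {};
  lines.forEach(function (line, addr) {       // pass 1: record label addresses
    if (line[0] !== null) symbols[line[0]] = addr;
  });
  return lines.map(function (line) {          // pass 2: emit 16-bit words
    if (line[1] === ".DATA") return line[2];  // a data cell: just its initial value
    var sym = symbols[line[2]];
    var x = line[2] === null ? 0 : (sym !== undefined ? sym : line[2]);
    return (OPCODES[line[1]] << 12) | x;
  });
}

// The larger-of-two-values program above assembles to the same eleven
// words listed in the table:
assemble([
  ["begin", "IN", "x"],      [null, "IN", "y"],
  [null, "LOAD", "y"],       [null, "COMPARE", "x"],
  [null, "JUMPLT", "printy"],
  [null, "OUT", "x"],        [null, "JUMP", "done"],
  ["printy", "OUT", "y"],
  ["done", "HALT", null],
  ["x", ".DATA", 0],         ["y", ".DATA", 0]
]);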

Example: Perform division by repeated subtraction.

begin:    IN       x       ;; input two values from the user
          IN       y

          CLEAR    quot    ;; clear the quotient

loop:     LOAD     x
          COMPARE  y
          JUMPGT   done    ;; if y > x, we're done

          SUBTRACT y       ;; subtract y from x
          STORE    x

          INCREMENT quot   ;; add 1 to the quotient

          JUMP     loop    ;; go back and loop again

done:     OUT      quot    ;; write the quotient to the output device
          HALT

x         .DATA     0      ;; storage for variables
y         .DATA     0
quot      .DATA     0

This program has 12 instructions (say, addresses 0..11) and 3 variable locations (say, x:12, y:13, quot:14). Thus the assembler replaces x with 12, y with 13, and quot with 14. A couple of the lines have labels, which it replaces with their addresses: done with 10, and loop with 3.

Addr  Value                Meaning
0     1101 0000 0000 1100  IN 12
1     1101 0000 0000 1101  IN 13
2     0010 0000 0000 1110  CLEAR 14
3     0000 0000 0000 1100  LOAD 12
4     0111 0000 0000 1101  COMPARE 13
5     1001 0000 0000 1010  JUMPGT 10
6     0101 0000 0000 1101  SUBTRACT 13
7     0001 0000 0000 1100  STORE 12
8     0100 0000 0000 1110  INCREMENT 14
9     1000 0000 0000 0011  JUMP 3
10    1110 0000 0000 1110  OUT 14
11    1111 0000 0000 0000  HALT
12    0000 0000 0000 0000  (data, variable x)
13    0000 0000 0000 0000  (data, variable y)
14    0000 0000 0000 0000  (data, variable quot)

Higher-Level Languages

Now we have a computer we can program, and with the help of an assembler, it's not even that difficult to do. Even so, this is still quite tedious! We'd like to be able to write friendly, high-level programs like:

function divide(x, y)
{
  var quot = 0;

  while (y <= x) {
    x = x - y;
    quot = quot + 1;
  }

  return quot;
}
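
For example, divide(17, 5) runs the loop three times (x goes 17, 12, 7, 2) and returns the quotient 3; the remainder, 2, is left behind in x.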

The computer should figure out the details like where the JUMP instructions go, and how to set up the comparisons, and which values to read in or write out. Furthermore, the original program assumes you're programming the particular computer we described; even with an assembler, it won't work on your Pentium or PowerPC machine. In other words, it's not portable from one type of machine to another.

To get around this, we use high-level languages (HLLs), such as JavaScript. HLLs provide another layer of virtual machine abstraction, which can be translated into a lower-level set of machine instructions. Pro: Much easier to program! Con: Requires a translator program.

Two categories of language translators:

  • Interpreter: Reads the high-level program and performs the actions it specifies, directly. Analogy: Giving a recipe to a cook.
  • Compiler: Translates the high-level program into another language, which can be executed later. Analogy: Having someone translate your recipe.

In practice, there are often "hybrid" systems: the program is read and converted to a kind of assembly language (but not one that runs on a real machine), often called bytecode, to make it easier to interpret. The resulting bytecode is executed directly by an interpreter. This is more or less how the JavaScript interpreter inside your web browser works.
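
As a toy illustration of the idea (real JavaScript engines use a much richer bytecode than this), a hybrid system might compile 5 + 12 into a small stack-based bytecode and hand it to an interpreter loop; everything below is a hypothetical sketch:

var bytecode = [["push", 5], ["push", 12], ["add"], ["print"]];

var stack = [];
bytecode.forEach(function (instr) {    // the interpreter: one action per opcode
  if (instr[0] === "push") stack.push(instr[1]);
  else if (instr[0] === "add") stack.push(stack.pop() + stack.pop());
  else if (instr[0] === "print") console.log(stack.pop());  // prints 17
});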

More examples of compilation (also see the division example above, going from JavaScript to assembly):

JavaScript:

x = y + z

Assembly:

LOAD  y
ADD   z
STORE x

JavaScript:

while (x <= 17) {
  x = x + 1;
}

Assembly:

loop:    LOAD data17 
         COMPARE x
         JUMPGT  done

         INCREMENT x   ;; body of loop
         JUMP    loop

done:    ... ;; rest of the program

;; ... later, after the code
data17   .DATA     +17    ;; data for the value 17

As we discussed before, there are many, many different programming languages out there (recall the diversity of languages we saw earlier). They provide different abstractions, different ways to think about programs, and so forth. Each has a compiler and/or interpreter to allow programs written in that language to run on real machines.

Operating Systems

We've seen now how the high-level programs we write in languages like JavaScript can be translated into machine instructions and executed on a real computer. But we're still missing a lot of detail:

  • How do we load programs into memory in the first place?
  • How do we tell the computer what programs to run?
  • How do we save, print, backup, share files, etc.?

The CPU and its instructions are powerful, but incredibly simple. To make it possible for users to actually control the computer, we need to provide it with an operating system (OS). The OS provides a convenient user interface and coordinates the various pieces of hardware to provide such useful abstractions as "files", "directories", "windows", "cursors", etc.

When the computer is powered up, everything is a "blank slate". The operating system is then loaded into memory as the first program to run. It provides you with the interface to run other programs, access files, display windows and buttons, establish network connections, etc. The operating system is an incredibly complex piece of software. And yet, it's just software... a big sequence of instructions. (Windows XP is estimated to have 40 million source lines of code [Wikipedia].)

One of the most important features of an OS is multitasking: The CPU only does one thing at a time, but we want our computers to do many different things at once (read email, edit a file, surf the web). To create this illusion, the operating system switches between multiple programs many times a second, so that each makes forward progress and they all appear to be running simultaneously.
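
As a heavily simplified JavaScript sketch of that trick (the task format is purely our own invention, not how a real OS represents processes), the loop below plays the role of the OS, giving each "program" one small slice of work per turn, so their output interleaves as if they all ran at once:

function makeTask(name, steps) {
  var i = 0;
  return function () {                 // one "time slice" of work
    if (i >= steps) return false;      // this task is finished
    console.log(name + " step " + i);
    i = i + 1;
    return true;
  };
}

var tasks = [makeTask("email", 3), makeTask("editor", 3), makeTask("browser", 3)];
var busy = true;
while (busy) {                         // round-robin "scheduler"
  busy = false;
  tasks.forEach(function (task) {
    if (task()) busy = true;           // loop again while any task has work
  });
}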

It used to be the case that every new computer that came out had its own operating system, written specially for that computer. They were usually written in assembly language. In the late 1960's and early 1970's, Brian Kernighan, Dennis Ritchie, and Ken Thompson, working for Doug McIlroy at Bell Labs, developed a new operating system they called "Unix".

Unlike earlier systems, this was written in a new high-level language they had invented, called "C". Only a very small piece of the system was written in assembly language. For the first time, there was an operating system that could easily be "ported" to run on new computers as they were developed: you re-write the assembly-language portion (maybe 10,000-15,000 instructions, at the time), and the rest can simply be recompiled.

Unix spread widely throughout academia, and many different versions evolved, some commercial, some open-source. The interface is antique by modern standards: You type out your commands to a program called the "shell". But the system is very simple for programmers to understand, and very powerful. It supports multitasking, and also multiple users who can access the machine simultaneously via terminals or over a network connection.

Early personal computers ran an OS called "CP/M" -- a very clunky, simplistic system that provided files and directories, but ran only one program at a time! In 1981, IBM introduced the PC, creating a standard architecture that became the basis for the widespread Intel-based PCs available even today, and Microsoft released MS-DOS 1.0 for it. In 1984, Apple released the Macintosh, and in 1985, Microsoft released the first version of Windows. In 1991, Linus Torvalds announced the Linux project, to provide a version of Unix for PCs.