Petzold’s capitalization program

One of the most striking, and possibly most intimidating, parts of the reading from Charles Petzold’s Code is the short program he gives for capitalizing ASCII-encoded letters on an Intel 8080 microprocessor. Though it certainly isn’t essential for you to understand this program, it is, I think, possible to follow his little code listing even without having read Petzold’s previous 19 chapters. If you enjoy this sort of puzzle, then, here is a bit of an explanation. As a kind of sneak-preview of our work learning R, I’ve added several translations of this assembly language program into R.

First, you have to know that in this program, the 8080 is storing data in three places: in registers, in memory, and in flags. The registers are referred to by letters: the program uses three registers, A, C, and HL. (HL is really two registers but they are treated together.) A and C each hold a byte (a number from 0 to 255, or, in hex, 00h to FFh).1 HL holds two bytes (from 0000h to FFFFh), and its numeric value is treated as a memory address. Memory, for our purposes, is just a long list of bytes, which we refer to by their number, from 0000h to FFFFh. To refer to the byte stored in memory at address 1234h, we write [1234h]. Flags are single bits that store the results of certain operations.
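If you would like to poke at these number ranges yourself, here is a small sketch in R. It uses only functions built into R; the particular values are just the byte and two-byte limits mentioned above.

```r
# A byte written in hex, decimal, and binary: all the same number
0xFF == 255                     # TRUE: R reads the 0x prefix as hexadecimal
strtoi("FF", base = 16L)        # 255, converting a string of hex digits
strtoi("11111111", base = 2L)   # 255, converting a string of binary digits
format(as.hexmode(255L))        # "ff", going back from a number to hex digits
0xFFFF                          # 65535, the largest two-byte value (as in HL)
```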

The 8080 operations in the program are fairly simple. The mov operation copies (moves) data from one place to another. mov x,y means “move y to x.” The cpi operation means “compare”: it compares the A register with a number and stores the result in the processor’s flags. jz, jc, and jnc all jump to a new place in the program, but only if the flags are set to certain values; if the conditions are not met, the program continues to the next line. jmp jumps unconditionally. These crucial operations thus determine the flow of control in the program. The destination of the jump is given as a label. Thus jz AllDone means “go to the part of the program labeled AllDone: if the previous comparison was of two equal numbers; otherwise continue to the following line.” The remaining operations in the program are simple arithmetic: sbi subtracts a number from a register, inx adds one to a register pair, and dcr subtracts one from a register.
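To make the relation between comparing and jumping concrete, here is a rough sketch in R. The function cpi below is my own hypothetical stand-in, not anything built into R, and it models only the two flags this program relies on: the zero flag (set when the two numbers are equal) and the carry flag (set when A is the smaller number).

```r
# Hypothetical helper imitating what cpi records in two of the 8080's flags
cpi <- function(a, value) {
    list(zero  = (a == value),   # zero flag: the two numbers are equal
         carry = (a < value))    # carry flag: a is less than value
}

flags <- cpi(0x48, 0x61)   # compare 'H' (48h) with 'a' (61h)
flags$zero                 # FALSE, so a following jz would not jump
flags$carry                # TRUE, so jc would jump and jnc would not
```

A real 8080 sets other flags too; this sketch leaves them out.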

At the start of the program, the C register holds the number of characters to convert to uppercase, and the HL register holds the address in memory of the first character to convert (thus the first character is said to be at [HL]). The next character is at [HL + 1], and so on up to [HL + C - 1].

For example, imagine that we want to capitalize the phrase “Hello, friend!”

In ASCII, this is

48 65 6c 6c 6f 2c 20 66 72 69 65 6e 64 21
 H  e  l  l  o  ,     f  r  i  e  n  d  !
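You don’t have to look these codes up by hand. As a sketch, R’s built-in charToRaw function shows the bytes of a string in hex; the variable name phrase is just my choice for this example.

```r
phrase <- "Hello, friend!"
bytes <- charToRaw(phrase)   # the bytes of the string; R prints them as hex
bytes
as.integer(bytes)            # the same bytes as ordinary decimal numbers
length(bytes)                # 14 characters, one byte each
```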

So imagine we store this starting at memory address 0000h:

0000h:  48
0001h:  65
0002h:  6c
...
000Dh:  21
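One way to picture this layout is as a long vector of numbers. The sketch below is only a model of the idea: memory and peek are hypothetical names of my own, and because R numbers vector elements from 1 rather than 0, address n has to live at position n + 1.

```r
memory <- integer(0x10000)   # 65536 zero bytes: addresses 0000h to FFFFh
phrase_bytes <- as.integer(charToRaw("Hello, friend!"))
start <- 0x0000              # address where the phrase begins

# store the phrase; address n lives at vector position n + 1
memory[start + seq_along(phrase_bytes)] <- phrase_bytes

# hypothetical helper: look up the byte stored at a given address
peek <- function(address) memory[address + 1]

format(as.hexmode(peek(0x0000)))   # "48", the code for H
format(as.hexmode(peek(0x000D)))   # "21", the code for !
```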

We’ll set the value of HL to 0000h and the value of C to 0Eh (14 in decimal, the number of characters). Here goes Petzold’s program (I have slightly expanded his explanatory comments; text after the semicolons is commentary for humans and is ignored in running the program)2:

Capitalize:     mov A,C     ; move letters-remaining count in C to A
                cpi A,00h   ; is A zero?
                jz AllDone  ; if yes, jump to line AllDone

                mov A,[HL]  ; move byte at address specified by HL to A
                cpi A,61h   ; is A greater, less, or equal to 61h = 'a'?
                jc SkipIt   ; if less, jump to SkipIt

                cpi A,7Bh   ; compare A to 7Bh, one more than 7Ah = 'z'
                jnc SkipIt  ; if A is 7Bh or more, jump to SkipIt

                sbi A,20h   ; otherwise: it's lowercase. Subtract 20h from A 
                mov [HL],A  ; move A to the address specified by HL

SkipIt:         inx HL      ; add 1 to HL, so [HL] is now the next character
                dcr C       ; set C counter to 1 less than its current value
                jmp Capitalize  ; and loop back

AllDone:        ret         ; consider this to mean "we're done"

At the end of this operation, the memory from 0000h to 000Dh holds:

48 45 4c 4c 4f 2c 20 46 52 49 45 4e 44 21
 H  E  L  L  O  ,     F  R  I  E  N  D  !
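Why subtract 20h in particular? In ASCII each lowercase letter’s code is exactly 20h more than its capital’s, and that difference is a single binary digit. Here is a small check in R using the built-in bitwise functions; take it as an illustration of the encoding, not as part of Petzold’s program.

```r
lower <- utf8ToInt("a")    # 97, that is, 61h
upper <- utf8ToInt("A")    # 65, that is, 41h
lower - upper              # 32, that is, 20h
bitwXor(lower, upper)      # 32: the two codes differ in just one bit

# Switching that one bit off in the code for "f" (66h) gives "F" (46h)
intToUtf8(bitwAnd(0x66L, bitwNot(0x20L)))
```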

R transliteration

I’ll begin by transposing these machine operations very literally into an R program. R does not make use of numerical memory addresses, however. Instead our data will be in a variable, which I will call msg.3 Imagine msg is a vector of character codes, one per letter, and cnt is a count of the length of that vector.

R’s convention for hexadecimal numbers is to prefix them with 0x. To store a series of such numbers, we put them in a vector, called msg. msg is simply a sequence of numbers, which are indexed (or numbered) starting from one: msg[1], msg[2], msg[3], and so on. We will use hl as the name for this simple index, which runs from 1 to the length of the character string we are capitalizing.

Finally, whereas Petzold’s 8080 program used “jump” instructions to set the flow of control, R’s basic flow control looks slightly different though it is in effect the same: here, I will use the keywords while and if.

msg <- c(0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x2c, 0x20, 0x66, 0x72, 0x69,
         0x65, 0x6e, 0x64, 0x21)
cnt <- 0x0e
hl <- 1
while (cnt > 0) {           # As long as our count is positive, repeat:
    a <- msg[hl]            # retrieve the character at index hl, store in a
    if (a >= 0x61) {        # test: is it at least the code for lowercase "a"?
        if (a <= 0x7a) {    # test: is it no more than the code for "z"?
            a <- a - 0x20   # then subtract to make it uppercase
            msg[hl] <- a    # and store the result
        }
    }
    hl <- hl + 1        # increment the hl index 
    cnt <- cnt - 1      # decrement the count of remaining characters
}
msg               # show the result...as decimal numbers
##  [1] 72 69 76 76 79 44 32 70 82 73 69 78 68 33
as.hexmode(msg)   # same, but displayed as hex digits
##  [1] "48" "45" "4c" "4c" "4f" "2c" "20" "46" "52" "49" "45" "4e" "44" "21"

As it happens, R has a function for converting codes like these to letters, intToUtf8, so we can verify the result4:

intToUtf8(msg)
## [1] "HELLO, FRIEND!"

R translation

Though while and if correspond closely to the “jump” instructions for the 8080, the idiomatic way to write an R program that carries out this computation is rather different. It looks like this:

msg <- c(0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x2c, 0x20, 0x66, 0x72, 0x69,
         0x65, 0x6e, 0x64, 0x21)
lc_letters <- msg >= 0x61 & msg <= 0x7a # logical vector: which elements of
                                        # msg are lowercase letters?
msg[lc_letters] <- msg[lc_letters] - 0x20
                                        # pick out those elements and subtract
                                        # 20h
msg
##  [1] 72 69 76 76 79 44 32 70 82 73 69 78 68 33
intToUtf8(msg)                          # verify
## [1] "HELLO, FRIEND!"
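A range test like this is easy to get wrong by one at either end, and our phrase happens not to contain an a or a z. As a sketch, we can wrap the same computation in a function and try the characters just inside and just outside the lowercase range. The name capitalize_codes is my own invention for this example.

```r
# Hypothetical wrapper around the vectorized computation
capitalize_codes <- function(codes) {
    lc <- codes >= 0x61 & codes <= 0x7a   # TRUE for codes from "a" to "z"
    codes[lc] <- codes[lc] - 0x20         # subtract 20h from those only
    codes
}

edge <- utf8ToInt("`az{")             # codes 60h, 61h, 7Ah, 7Bh
intToUtf8(capitalize_codes(edge))     # only the a and the z should change
## [1] "`AZ{"
```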

R idiom

However, a vector of numbers corresponding to ASCII codes is not the way we normally handle strings of text in R. R has a built-in data type for text, called character, whose values are written by putting the text in double quotes. If we use this data type, we can also make use of R’s built-in functions for common operations on strings. To capitalize strings, R supplies a function called toupper(). The idiomatic R version is thus:

msg <- "Hello, friend!"
msg <- toupper(msg) # apply toupper to msg, store the result in msg
msg
## [1] "HELLO, FRIEND!"

This version is agnostic about the underlying numerical representation of the individual characters in the text (it might be ASCII, or it might not). R abstracts the encoding question away for you. Nonetheless, the machine operations for toupper are similar to those in Petzold’s program.
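As a quick check that the two routes agree for our phrase, we can compare toupper with the subtract-20h arithmetic. This is only a sketch: it assumes an ASCII-only phrase, and what toupper does can depend on your R session’s locale settings. The names by_arithmetic and by_toupper are mine.

```r
msg <- "Hello, friend!"

# Route 1: arithmetic on the character codes, as in Petzold's program
codes <- utf8ToInt(msg)
lc <- codes >= 0x61 & codes <= 0x7a
codes[lc] <- codes[lc] - 0x20L
by_arithmetic <- intToUtf8(codes)

# Route 2: R's built-in function
by_toupper <- toupper(msg)

by_arithmetic == by_toupper   # do the two routes give the same string?
```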

Question

When do you think this abstraction could break down while using R? Under what circumstances would you need to know the numeric codes used to encode text you want to analyze?


  1. A refresher on numbers: as you probably know, the ultimate format for all data stored in a digital computer is binary numbers, that is, numbers in base two. A bit is a single binary digit, either 1 or 0. Eight bits is a byte: its value ranges from 00000000₂ (the subscript indicates base 2) to 11111111₂, that is, from 0 to 255₁₀ = 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1. It is convenient to notate bytes in base sixteen or hexadecimal (“hex”) notation. In hex, one counts like this: 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, 10, 11… Petzold notates hex numbers with a lowercase h after them. Each byte is two hex digits, from 00h to FFh. A minor potential confusion: in the program, the letter C alone refers to the C register (ditto A). If I want to designate the number C₁₆ (12₁₀), I write 0Ch.

  2. Charles Petzold, Code: The Hidden Language of Computer Hardware and Software (Redmond, WA: Microsoft, 2000), 293.

  3. In fact, variables correspond pretty closely to memory addresses, but R hides the details away and reserves the right to shuffle memory around without telling you about it.

  4. A bit sneaky. intToUtf8 does what it sounds like and converts integers to characters according to the Unicode “UTF-8” encoding. Why should that have anything to do with ASCII? As Petzold tells you at the end of the chapter you read, for the basic American English alphabet and punctuation, the Unicode encoding (more specifically, the UTF-8 variant of Unicode) and the ASCII encoding are one and the same. A lowercase a is encoded as the single byte 61h in both Unicode and ASCII.