Tuesday, June 16, 2009

Bring your flashlight

Raymond Chen's post yesterday brings up the point that performance engineering is not always straightforward and that academic algorithm analysis can sometimes send you off in the wrong direction. Years after you have taken a course, you may remember the summary of a solution without remembering the basis for that conclusion.

It is easy for developers to think they know where they need to improve the performance of their programs. Alas, they are often wrong, so one learns early on to follow these rules:

1) Collect the scenarios that you wish to improve, and
2) Measure where the program is actually using resources (CPU, memory, network traffic, disk IO, etc.)
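The measuring step can start out very crude: time each scenario end to end before reaching for a real profiler. A minimal C sketch of the idea (the workload here is a made-up stand-in for a real scenario):

```c
#include <stdio.h>
#include <time.h>

/* Stand-in for the scenario being measured (a hypothetical workload). */
static long scenario(void)
{
    long sum = 0;
    for (long i = 0; i < 1000000; i++)
        sum += i % 7;
    return sum;
}

/* Time one run of the scenario; returns elapsed CPU seconds. */
static double time_scenario(void)
{
    clock_t start = clock();
    long result = scenario();
    double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("result=%ld elapsed=%.3fs\n", result, elapsed);
    return elapsed;
}
```

Once the end-to-end numbers point at a slow scenario, a sampling profiler can tell you where inside it the time actually goes.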

If you have a working system with users, then these steps are possible. You might even have team members who specialize in these activities, measuring usage or even building tools to measure actual resource consumption.

But if the system is still being designed, you will have to do some back-of-the-envelope work, or cobble up a prototype to test. You don't want to spend too much time on the prototype because it is, by design, going to be thrown away. It might also be useful for playing with algorithms, internal interfaces, etc., but if you are unlucky it will appear to work well enough to ship. Prototypes really should only be used as learning devices, but things happen on the way to shipping.

I have always found doing this "fit and finish" work to be quite pleasant. You have a working program that people can use, and every day or so you can make an easy improvement that users can see. If something doesn't work out, you don't make the change and your users can still get work done.

Rarely am I not surprised when I measure real scenarios. Probably the biggest surprise for me came in 1990, when I was working on an MS-DOS interpreter for a weird in-house language. We had a large application that ran on a PC and was written in several different special languages, all compiled down into an interpreted language (think p-code if you must). Since this was MS-DOS, the interpreter implemented its own virtual memory and did a fairly good job of squeezing as much out of a small PC as was needed at the time.

I had to add my own performance measurement code, which in this case was just a program counter sampler driven by the timer interrupt. My scenarios were provided by various people who complained about slowness. Not scientific, but most operations on this system took less than a second, so the slow ones were noticeable. I was hoping to find some algorithm or data structure that needed to be fixed, tested and deployed, but instead just one line of C code in the interpreter was using a lot of the time. The C code looked like this:

*ptr = 0.0;

and the underlying machine code looked like this:

fstp qword ptr [bx]

This code was in a loop, initializing a floating point array to zero. That should be fast enough, but my instrumentation said otherwise. Have you figured it out yet?

When the IBM PC came out, it came with the Intel 8088 chip: a low-end 16 bit chip with 8 bit external data paths and no floating point instructions. If you wanted floating point, you could add it later by buying the Intel 8087 chip. That was not cheap, and most people didn't bother. So when one of these machines tried to execute a floating point instruction, it faulted and a bit of software simulated the operation. This emulation was not fast, in part because the IEEE floating point standard is pretty exacting.

In this case, the fix was easy since we just had to initialize memory to certain values. There was no need to use the (missing) floating point chip for this simple operation. I believe it was:

*(long *)ptr = (long)0.0;
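The trick works because IEEE 754 +0.0 is all-zero bits, so plain integer stores can clear a floating point array without the FPU (or its software emulation) ever being involved. A sketch of the idea in modern C; the function name is mine, and this is an illustration, not the original code:

```c
#include <stddef.h>
#include <stdint.h>

/* Clear a double array using only integer stores.  IEEE 754 +0.0 is
   all-zero bits, so writing integer zeros over the same bytes yields
   valid floating point zeros and never touches floating point
   hardware or its emulator.  (In modern C, memset(a, 0, n * sizeof *a)
   is the aliasing-safe spelling of the same idea.) */
static void zero_doubles(double *a, size_t n)
{
    uint32_t *p = (uint32_t *)a;       /* two 32-bit stores per double */
    for (size_t i = 0; i < 2 * n; i++)
        p[i] = 0;
}
```

Today a compiler would likely turn a memset into exactly this kind of loop anyway, but on an 8088 without an 8087 the difference between an integer store and a faulting fstp was dramatic.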

In 1984, Phil Koch worked for True Basic, developing an interpreter for an ANSI Basic system on the Apple Macintosh. He used the system-provided floating point routines and wrote the interpreter in a few weeks. He was proficient at assembler but had never worked with a Motorola 68000 or a personal computer. The first Macs certainly didn't have floating point hardware, but the system code would use the hardware when it was available on later models. The interpreter worked fine, but the floating point part was, again, very slow.

If memory serves me, the Mac interpreter had contagious floating point, just like the IBM PC interpreter being developed at the same time. All numbers in this system could be floating point, but when a number fit into a 16 bit value, it was stored as that integer and flagged as such. The IEEE standard has values called "Not a Number" (NaN) which could be used (or misused) for this purpose. The interpreter would notice these NaN values and avoid the floating point operations, so things only got slow when numbers were not in the 16 bit integer range. (Aside: we didn't invent this idea. Apparently it is an old LISP idea, but I can't find a reference.)
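The tagging scheme can be sketched in C. This is my reconstruction for illustration, not the original interpreter's layout: a quiet NaN's payload bits are free, so a fixed payload pattern marks "this value is really a 16 bit integer," and arithmetic checks the tag before touching floating point:

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t value;   /* one interpreter value, 64 bits */

/* A quiet NaN has all exponent bits set and the quiet bit on; the
   remaining payload bits are free.  A fixed payload marks "this is
   really a 16-bit integer".  The constants are mine, chosen for
   illustration. */
#define INT_TAG  0x7FF8DEAD00000000ULL
#define TAG_MASK 0xFFFFFFFFFFFF0000ULL

static value   box_int(int16_t i)  { return INT_TAG | (uint16_t)i; }
static int     is_int(value v)     { return (v & TAG_MASK) == INT_TAG; }
static int16_t unbox_int(value v)  { return (int16_t)(v & 0xFFFF); }

static value box_double(double d)
{
    value v;
    memcpy(&v, &d, sizeof v);          /* reinterpret the bits */
    return v;
}

static double to_double(value v)
{
    if (is_int(v))
        return unbox_int(v);
    double d;
    memcpy(&d, &v, sizeof d);
    return d;
}

/* Addition stays in integer arithmetic whenever both operands carry
   the tag and the result still fits in 16 bits. */
static value add(value a, value b)
{
    if (is_int(a) && is_int(b)) {
        int32_t sum = (int32_t)unbox_int(a) + (int32_t)unbox_int(b);
        if (sum >= INT16_MIN && sum <= INT16_MAX)
            return box_int((int16_t)sum);     /* fast path: no FP */
    }
    return box_double(to_double(a) + to_double(b));   /* slow path */
}
```

On a machine with no floating point chip, the fast path is just a couple of masks and an integer add, which is the whole point of the contagion scheme.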

Phil typically worked late into the night on his programs, probably from years of working on mainframes -- they gave you better response when no one else was using them. But that was the summer of the Los Angeles Olympics, so he was spending some evenings at home watching the better highlights. One night the TV coverage ended, but it was too late to bicycle into town to work. He tried sleeping, to no avail. So, to pass the time, he wrote out by hand most of the floating point interpreter that replaced the Apple-provided one, and it was very fast. It didn't have to handle all the odd corners of the IEEE standard, it didn't have the system call overhead, and it didn't have to simulate the floating point chip down to the bit level. I think he had to consult Knuth's Art of Computer Programming, Vol. 2 to get the divide right, but otherwise it was done in an evening of idle time.

We later had a curious performance bug reported to us on the IBM PC version. It was exposed with the following program:

let a = 0
let b = 0.5
print a
print b-b

Both print statements were very fast unless you had a floating point chip installed, in which case the second print took about half a second! It turns out that once a number is considered to be a floating point number, it stays that way (hence the term contagious). So b-b computed a floating point zero, which is a value with a very negative exponent (think 1.0 times 10 to the -308th power). The print routine, in the process of figuring out what power of ten the number fell into (think scientific notation), was slowly and carefully doing lots of multiplies until either the result was around 1.0 or it noticed that it had gone on for too long and declared the value to be zero.

If you didn't have a floating point chip, every floating point operation would check whether the result fit in an integer again and, if so, flag it as such. If you did have the chip, that check was skipped, since it would take time. The fix was to speed up the print routine so it found the appropriate power of ten more quickly.

I don't want you to get the idea that floating point is fraught with performance problems. But using floating point does have its challenges and pitfalls, some of which relate to security. A future post will cover some that I know about. But it is probably safe to say that many of these bugs would not be obvious without measurement tools.

BTW, the idea of instrumenting the interpreter to find bugs like this goes way back. I just re-read chapter 10 of Donald Knuth's Literate Programming, titled "The Errors in TeX", where he explains that he did this in a compiler he wrote in the early 1960s. Apparently he wrote the compiler in some interpretive language and then wrote the interpreter so that everything would fit into the tiny machine he was using (which may have been the IBM 650).

1 comment:

Stanley_Krute said...

Such a pleasure to read such musings.