Thursday, June 18, 2009

Notepad stories part I

I picked up development responsibilities for Microsoft's Notepad app in the fall of 1993, right after it had been converted to be a Unicode application. Our test team had converted it from a 16 bit app to a 32 bit app to test some porting tools and found ourselves the proud owners because the Shell team refused to deal with it. I have to admit it wasn't pretty; the port had been done with automated tools, and the Unicode conversion was done rather quickly by someone on a tight schedule.

It was strange that a test team was responsible for maintaining shipping components, but there you are. I also maintained some of the games, most notably Solitaire and Minesweeper, for the same reason. Dealing with these was pretty simple since we tested in the area and they didn't have any Product Managers changing requirements on them, so it was pretty much maintenance mode fixes: keep up with UI changes, use better common dialogs, and in the games' case, use a common cards DLL. But they were heavily used, so there were always some suggestions for improvements.

I cleaned up the Notepad code and started working on the first real problem: performance. It turned out that loading a 1 megabyte Unicode file would take 5 minutes to display on the fastest machine we had at the time: a 50 MHz MIPS box. This had to be fixed in multiple places in GDI; it wasn't a Notepad problem. Apparently GDI cached glyphs and character metrics for computing character widths, but kept at most 256 contiguous characters at a time. With Unicode, if you used characters that were more than 256 code points apart, GDI would recalculate the metrics. And it turned out there were layers of these caches: one in user mode, and at least one in kernel mode. It took a while to find all these caches and do the right thing, especially on small memory configurations.
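
A hypothetical model of that caching behavior (the structure and the measure() stand-in are invented for illustration, not GDI's actual code): the cache holds widths for one contiguous window of 256 code points at a time, so alternating between far-apart characters rebuilds the window on every call.

```c
#include <stdint.h>

/* Width cache holding a single contiguous 256-code-point window.
   A miss rebuilds the whole window; recompute_count tracks how
   often that happens. */
static uint32_t cache_base = 0xFFFFFFFF;   /* sentinel: nothing cached yet */
static int cache_width[256];
int recompute_count = 0;

/* Stand-in for the real (expensive) metric computation. */
static int measure(uint32_t ch) { return 8 + (int)(ch % 4); }

int char_width(uint32_t ch) {
    uint32_t base = ch & ~255u;            /* window containing this char */
    if (base != cache_base) {              /* miss: rebuild entire window */
        for (int i = 0; i < 256; i++)
            cache_width[i] = measure(base + i);
        cache_base = base;
        recompute_count++;
    }
    return cache_width[ch & 255u];
}
```

Alternating between, say, Latin letters and CJK characters makes every lookup a miss, which is the thrashing the Unicode file exposed.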

The next problem was the string find function. It used to be fast, but when the Unicode conversion happened it got very, very slow. This function does a case insensitive search for a string in the edit buffer. In the ANSI character set, converting to upper-case is easy and is typically done with a table lookup. Unicode is much harder. You might think it could be done with a 64K table, but you would be wrong sometimes. There are single characters that expand into two characters when their case is converted (the German ß upper-cases to "SS", for example), along with other non-intuitive things. The proper thing to do is use the Unicode Win32 compare API, but this API has to do all this interesting work, and that was the cause of the slowdown.

The trick is to convert the first character of the string you are looking for into both upper and lower-case and search for either of these in the buffer. If one is found, use the Unicode compare API. Most of the time the first character will not match and the slow API will not be called.
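
A sketch of the trick in portable C, assuming a NUL-terminated buffer. Here slow_equal() stands in for the real Unicode-aware comparison (CompareStringW with NORM_IGNORECASE on Win32); it uses a simplified per-character fold so the example is self-contained, and it inherits the trick's assumption that the first character of the pattern only matches its exact upper or lower case form.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Stand-in for the expensive, fully correct Unicode comparison. */
static int slow_equal(const wchar_t *a, const wchar_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (towlower((wint_t)a[i]) != towlower((wint_t)b[i]))
            return 0;   /* also rejects when the buffer ends early (NUL) */
    return 1;
}

/* Case-insensitive find: scan for either case of the pattern's first
   character and only then call the expensive comparison. */
const wchar_t *find_ci(const wchar_t *buf, const wchar_t *pat) {
    size_t m = wcslen(pat);
    if (m == 0) return buf;
    wchar_t lo = (wchar_t)towlower((wint_t)pat[0]);
    wchar_t up = (wchar_t)towupper((wint_t)pat[0]);
    for (const wchar_t *p = buf; *p; p++) {
        if (*p != lo && *p != up)
            continue;               /* cheap reject: most positions end here */
        if (slow_equal(p, pat, m))
            return p;
    }
    return NULL;
}
```

Most buffer positions fail the one-character test, so the slow path runs only on genuine candidates.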

This performance problem was not seen in Windows 95 because a) that version of Notepad was ANSI only and b) the largest file the Windows 95 Notepad could handle was 32K characters. Really. If the Windows 95 Notepad detected a file larger than that, it ran WordPad on it instead. This led to an interesting bug reported to us later.

Lots of bugs get reported against Notepad, but most of them aren't Notepad bugs per se. Still, I had to triage them when they came in and assign them to the right developer. For example, if Notepad couldn't print right on a certain printer, the bug would be assigned to me when clearly it was a print subsystem bug. There were also common dialog bugs and edit control bugs that I had to deal with. The funniest one was a report of a blue screen (kernel crash) caused by Notepad just days before we shipped NT 3.1. I looked at the stack trace and realized that it was aborting inside the 16 bit DOS/Windows emulator. It turned out that the user was running the 16 bit version of Notepad, which wasn't even in the shipping product, but was on a network share that he was using.

We had one other case where Notepad blue screened the OS. Notepad needs to know the length of the file it is editing, and I used the first API I could find that returned the 64 bit length. If memory serves me, it was GetFileInformationByHandle. This API returned far more information than I needed, but it did the job. But then a user reported crashing the system when he accessed a file on a network drive from Notepad. The code at fault was deep in the remote file system code, but the developer of that part of the system (Larry Osterman) was just down the hall. The problem was in his code, but it only showed up when accessing a share on a DEC VAX system. The API I was using had exercised a little used part of the system that clearly hadn't been tested on ALL known implementations of this network protocol. Larry tried to convince me to use a lighter weight API, but since it was only called once per file open, I left it in. Since Notepad is a popular application, it might find other implementations with the same bug.
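
GetFileInformationByHandle hands the length back as two 32 bit halves (nFileSizeHigh and nFileSizeLow in BY_HANDLE_FILE_INFORMATION); combining them looks roughly like this, with the structure stubbed out so the sketch is portable rather than Windows-only.

```c
#include <stdint.h>

/* Portable stand-in for the two size fields of Win32's
   BY_HANDLE_FILE_INFORMATION; only the fields used here are declared. */
typedef struct {
    uint32_t nFileSizeHigh;
    uint32_t nFileSizeLow;
} file_info;

/* Combine the two 32 bit halves into the full 64 bit file length. */
uint64_t file_size(const file_info *fi) {
    return ((uint64_t)fi->nFileSizeHigh << 32) | fi->nFileSizeLow;
}
```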

More Notepad stories in part II

Wednesday, June 17, 2009

Trust but verify

When we started True Basic in 1983, none of us had real microprocessor experience. A couple of us had worked on minis and a cool bit-sliced machine, but we had no mainstream computer chip experience. So, we immediately went out and got a Z8000 based Unix box for development. Seriously. Then we bought an IBM PC and before the end of the year, we had an Apple Mac and soon an Apple Lisa.

Our strategy was to develop portable software on the Unix box and then load it onto various forms of personal computers. It wasn't obvious at the time that the Intel/IBM architecture was going to be so successful and we also had dreams of getting our software to run on the Apple II which had good numbers back then, plus the Motorola 68000 machines and whatever came next.

The software architecture was pretty standard stuff: Write almost everything in a high-level language, compile this into a byte code for an imaginary machine, then interpret this machine with an interpreter specific to each computer we want to support. Everything but the interpreter would be common to all machines, and the interpreter could be written in either a high-level language, or in assembler if speed was a problem.

Since each machine had requirements for data layouts etc that would impact performance, there was a linker that knew about the particular target machine and set up auxiliary tables as needed. The linker was written in a high-level language, too. So the only thing that, in theory, needed to be changed when porting to a new machine was the interpreter.

So we got to design that ideal machine for ANSI BASIC and then write several interpreters for it. David Pearson architected a storage-to-storage machine with complex effective addressing. There were other language systems at the time that used a stack-based machine, notably the UCSD P-system, but I think our design led to more compact code though it required a larger interpreter. It was an engineering guess that we would save space for the total system doing this.

For example, if you had the statement:

let a= b+c

with a stack machine this would turn into something like:

push b
push c
add
pop a

Assuming that an opcode would take 1 byte and an address would take 1 byte also, this would add up to 7 bytes. If you did the same thing with a memory-to-memory instruction set, it would look like:

add b,c,a

and this would be only 4 bytes long.
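
A toy dispatch loop for the memory-to-memory encoding sketched above. The opcode values, the HALT instruction, and the use of one-byte operand addresses are invented for this sketch; the real byte code was far richer.

```c
#include <stdint.h>

/* Invented opcodes for a minimal memory-to-memory machine. */
enum { OP_HALT = 0, OP_ADD = 1 };

/* Interpret the byte code against a small data memory. Each ADD is
   one opcode byte plus three one-byte addresses: add src1,src2,dst. */
void run(const uint8_t *code, int32_t *mem) {
    const uint8_t *pc = code;
    for (;;) {
        switch (*pc++) {
        case OP_ADD: {
            uint8_t b = *pc++, c = *pc++, a = *pc++;   /* 4 bytes total */
            mem[a] = mem[b] + mem[c];
            break;
        }
        case OP_HALT:
            return;
        }
    }
}
```

The interpreter is bigger than a stack machine's because each opcode decodes its own operand addresses, but each statement compiles to fewer bytes, which was the engineering bet.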

With the ideal machine target settled, we could immediately write an interpreter for it, build a simple compiler and development environment and try stuff out. And so we did. We picked a small subset of the eventual language to implement, wrote the interpreter in Lattice C, and implemented the compiler and IDE (development environment or UI) in, listen carefully, True Basic. Since we didn't have a True Basic compiler (yet), anything written in True Basic was actually written in the form of Basic that was on the Unix box. But only that subset of the language that existed on both the Unix box and True Basic was used.

This version had lots of limitations, by design. It was written in the "near memory model", so only 64K bytes of memory could be used for data. This simplified our compiler and interpreter, but put a serious cap on the size of programs that could be run. And because the interpreter simply used the C memory manager, if an allocation request was rejected there was no way to compact memory and try again. But this version was completed in a couple of months and allowed other people to try the language. It also provided us with a demo version to show to investors or publishers.

For the real version, we had to start over and write the real compiler for the whole language, and the interpreter for the IBM PC had to be written in assembler. C compilers didn't do a great job with optimizing back then, and we needed the speed to compete with other Basics in the marketplace.

Since I was going to be programming in assembler, I read the Intel manuals from cover to cover. I reverse engineered the rules for how to calculate the cycles taken per instruction rather than just looking each one up, so I had a mental model of what was costly and what wasn't. Then I dove in and wrote some code and timed it. Lo and behold, the IBM PC was nowhere near as fast as the Intel manual would lead you to believe. This baffled us for a while.

Then we noticed a little fine print in the back of the manual. It said the timings all assumed the instruction prefetch buffer was full. The 8086 and 8088 had a small lookahead buffer (6 bytes on the 8086, 4 on the 8088) that would fetch the next instructions when there was memory bandwidth free. This might happen on 8086 machines with fast RAM, but on an 8088 with its 8 bit bus and slow RAM, it rarely did.

In the end, the best model for speed on the IBM PC was to count the number of bytes that were read or written and multiply by 1 microsecond. About the only time where the prefetch buffer could get ahead was when a divide operation happened.

I created a list of one byte, two byte and three byte instructions and memorized them. In particular, if I could use a one byte instruction in place of a two or three byte, I did. I pretty much ignored Intel's timings.

Lesson learned: Trust but verify. Or if you have created a mental model, verify it if possible.

I plan in future posts to cover some of the tricks or coding gems from this interpreter, and perhaps some of the cool designs in this language system. I wasn't the designer, so I can say this. It was ported to several machines including the Amiga, Atari, Apple Mac, and a Wang VS machine (something like IBM 360 with a stack). It was later ported to 32 bit Windows, but I had left the company by the time that happened.

Tuesday, June 16, 2009

Bring your flashlight

Raymond Chen's post yesterday brings up the point that performance engineering is not always straightforward and that academic algorithm analysis can sometimes send you off in the wrong direction. Years after you have taken a course, you may remember a summary of a solution without remembering the basis for that conclusion.

It is easy for developers to think they know where they need to improve the performance of their programs. Alas, they are often wrong, so one learns early on to follow these rules:

1) Collect the scenarios that you wish to improve, and
2) Measure where the program is actually using resources (CPU, memory, network traffic, disk IO, etc.)

If you have a working system with users, then these steps are possible. You might even have some team members who specialize in these activities so they can measure usage, or have even developed tools to measure actual resource usage.

But if the system is being designed, you will have to do some back-of-the-envelope work, or cobble up a prototype to test. You don't want to spend too much time on the prototype because it is, by design, going to be thrown away. It might also be useful for playing with algorithms, internal interfaces etc, but if you are unlucky it will appear to work well enough to ship. Prototypes really should only be used as learning devices, but things happen on the way to shipping.

I have always found doing this "fit and finish" work to be quite pleasant. You have a working program that people can use and you can make easy improvements every day or so that can be seen. If something doesn't work out, then you don't make the change and your users can still get work done.

I am almost always surprised when I measure real scenarios. Probably the biggest surprise for me came in 1990, when I was working on an MS-DOS interpreter for a weird in-house language. We had this large application that ran on a PC and was written in several different special languages, but compiled down into an interpreted language (think p-code if you must). Since this was MS-DOS, the interpreter implemented its own virtual memory and did a fairly good job of squeezing as much out of a small PC as was needed at the time.

I had to add my own performance measurement code, which in this case was just a program counter sampler that ran off the timer interrupt. My scenarios were provided by various people who complained about slowness. Not scientific, but most operations took less than a second on this system, so the slow ones were noticeable. I was hoping to find some algorithm or data structure that needed to be fixed, tested and deployed, but instead a single line of C code in the interpreter was using a lot of the time. The C code looked like this:

*ptr = 0.0;

and the underlying machine code looked like this:

fstp qword ptr [bx]

This code was in a loop initializing to zero a floating point array. This should be fast enough, but my instrumentation said otherwise. Have you figured it out yet?

When the IBM PC came out, it came with the Intel 8088 chip. It was a low-end 16 bit chip with 8 bit external data paths and no floating point instructions. If you wanted floating point, you could add it later by buying the Intel 8087 chip. This was not cheap and most people didn't bother. So when a floating point instruction was executed on one of these machines, it faulted, and a bit of software would simulate the operation. This emulation was not fast, in part because the IEEE floating point standard is pretty exacting.

In this case, the fix was easy since we just had to initialize memory to certain values. There was no need to use the (missing) floating point chip for this simple operation. I believe it was:

*(long *)ptr = (long)0.0;
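
The same idea in a self-contained sketch (function name invented): IEEE 754 +0.0 is an all-zero bit pattern, so a floating point array can be zeroed with plain integer stores, never waking the (possibly software-emulated) floating point unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Zero a double array without any floating point instructions.
   The type pun below is in the spirit of the original C; a strictly
   conforming version would memcpy the zero pattern per element. */
void zero_doubles(double *a, size_t n) {
    uint64_t *p = (uint64_t *)a;   /* double and uint64_t are both 8 bytes */
    for (size_t i = 0; i < n; i++)
        p[i] = 0;                  /* writes the bit pattern of +0.0 */
}
```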

In 1984, Phil Koch worked for True Basic developing an interpreter for an ANSI Basic system on the Apple Macintosh. He used the system-provided floating point routines and wrote the interpreter in a few weeks. He was proficient at assembler but had never worked with a Motorola 68000 or a personal computer. The first Macs certainly didn't have floating point hardware, but the system code would use the hardware when it was available on later models. The interpreter worked fine, but floating point was, again, very slow.

If memory serves me, the Mac interpreter had contagious floating point, just like the IBM PC interpreter being developed at the same time. All numbers in this system could be floating point, but when a number fit into a 16 bit value, it was stored as that integer and flagged as such. The IEEE standard has values called "Not a Number" (NaN) which could be used (or misused) for this purpose. The interpreter would notice these NaN values and avoid the floating point operations. So things only got slow when numbers were not in the 16 bit integer range. (Aside: we didn't invent this idea. Apparently it is an old LISP idea, but I can't find a reference.)
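
A minimal sketch of that tagging scheme in modern C. The bit layout, tag value, and helper names here are invented for illustration and are not the True Basic format: a quiet NaN's payload holds the 16 bit integer, and any other bit pattern is an ordinary double.

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t value;

/* Quiet NaN (exponent all ones, top mantissa bit set) plus an invented
   tag bit so boxed integers never collide with a "real" NaN. */
#define TAG 0x7FF8000000010000ULL

static value box_int(int16_t i)  { return TAG | (uint16_t)i; }
static value box_dbl(double d)   { value v; memcpy(&v, &d, 8); return v; }

static int     is_int(value v)   { return (v & 0xFFFFFFFFFFFF0000ULL) == TAG; }
static int16_t as_int(value v)   { return (int16_t)(v & 0xFFFF); }
static double  as_dbl(value v)   { double d; memcpy(&d, &v, 8); return d; }

/* Addition stays on the cheap integer path while results fit in 16
   bits; once a result escapes that range, it becomes (contagiously)
   a floating point value. */
static value add(value x, value y) {
    if (is_int(x) && is_int(y)) {
        int32_t s = as_int(x) + as_int(y);
        if (s >= INT16_MIN && s <= INT16_MAX)
            return box_int((int16_t)s);
        return box_dbl((double)s);          /* overflow: go floating point */
    }
    double dx = is_int(x) ? as_int(x) : as_dbl(x);
    double dy = is_int(y) ? as_int(y) : as_dbl(y);
    return box_dbl(dx + dy);
}
```

The interpreter checks one mask per operand; only values that fail the check pay for the (emulated) floating point hardware.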

Phil typically worked late into the night on his programs, probably because of years of working on mainframes -- they gave you better response when no one else was using them. But that was the summer of the Los Angeles Olympics, so he was spending some evenings at home watching the highlights. One night the TV coverage ended, but it was too late to bicycle into town to work. He tried sleeping, but to no avail. So, to pass the time, he wrote out by hand most of a replacement for the Apple-provided floating point interpreter, only this one was very fast. It didn't have to do all the odd corners of the IEEE standard, it didn't have the system call overhead, and it didn't have to simulate the floating point chip down to the bit level. I think he had to consult Knuth's Art of Computer Programming Vol. 2 to get the divide right, but otherwise it was done in an evening of idle time.

We later had a curious performance bug reported to us on the IBM PC version. It was exposed with the following program:

let a= 0
let b= 0.5
print a
print b-b

Both print statements were very fast unless you had a floating point chip installed, in which case the second print took about half a second! It turns out that once a number is considered to be a floating point number, it stays that way (hence the term contagious). So b-b computed a floating point zero, which is a value with a very negative exponent (think 1.0 times 2 to the -308th power). The print routine, in the process of trying to figure out what power of ten the number fell into (think scientific notation), was slowly and carefully doing lots of multiplies until either the result was around 1.0 or it noticed that it had gone on for too long and declared the value to be zero.

If you didn't have a floating point chip, every floating point operation would check whether the value fit in an integer again and, if so, flag the result as such. If you had the floating point chip, it was decided to skip this check since it would take time. The fix was to speed up the print routine to find the appropriate power of ten more quickly.

I don't want you to get the idea that floating point is fraught with performance problems. But using floating point does have its challenges and pitfalls, some of which relate to security. A future post will cover some that I know about. But it is probably safe to say that many of these bugs would not be obvious without measurement tools.

BTW, the idea of instrumenting the interpreter to find bugs like this goes way back. I just re-read Donald Knuth's Literate Programming, chapter 10, titled "The Errors in TeX", where he explains that he did this in a compiler he wrote in the early 1960s. Apparently, he wrote the compiler in some interpretive language and then wrote the interpreter so that it would fit into the tiny machine he was using (which may have been the IBM 650).

Monday, June 8, 2009

Guest Accounts

Handling guest accounts properly is always a little tricky.

You want them to be useful, but you want to limit the negative effects they may have on authenticated accounts and data. So, you limit the resources they may use and protect data they can access.

The processor itself is pretty easy to protect. Operating systems already have mechanisms to fairly distribute the CPU power to different processes and with any luck will also carry account information that can be used to enforce fairness across accounts. And modern file systems have ways to protect data from being accessed or modified by different accounts.

But it's possible for a guest account to misuse the computer by using lots of disk space and not freeing it before signing off. You could limit the amount of space used during a session, but if you allow disk space to be retained between sessions, several guests, working in concert, could fill up the disk. On DTSS, this was solved by severely limiting the amount of disk space a user could use for permanent storage, and having a separate larger limit for temporary files used in a process. This latter trick allowed compilers to create large intermediate files. Also, by policy, guest accounts were handed out in person, so there were not a lot of them in use.

In the 32 bit version of Windows (Windows NT, Windows 2000, Windows XP etc), there is also the concept of a guest account. When a guest signs in, an empty user directory is made for that guest and when the guest signs off, the directory is deleted. This leads to an interesting security bug.

The process that deleted the guest directory runs with elevated privileges. The code would recurse through the directory, deleting files and directories. Unfortunately, it did not impersonate the guest account before doing this. If the guest left a linked file that pointed to a file it didn't have rights to delete, this deletion process would happily delete it since it had plenty of permissions. In fact, most OS files could be deleted. No good can come from this. And before you ask, yes, this bug is fixed.
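
The real fix was to impersonate the guest before deleting. A related mitigation, sketched here with POSIX nftw() since the NT internals aren't public: walk the tree without following symbolic links, so a planted link deletes only the link itself, never its target.

```c
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

/* Callback: remove() deletes a symlink itself (not its target), a
   file, or an empty directory. */
static int rm_cb(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf) {
    (void)sb; (void)typeflag; (void)ftwbuf;
    return remove(path);
}

/* Recursively delete a tree. FTW_PHYS: never follow symbolic links;
   FTW_DEPTH: visit children before their directory, so directories
   are empty when removed. */
int delete_tree(const char *root) {
    return nftw(root, rm_cb, 16, FTW_DEPTH | FTW_PHYS);
}
```

Even with a link-safe walk, the deleting process should still run as (or impersonate) the account whose files it is cleaning up; defense in depth.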

Of course, this is still a problem if an administrator does the equivalent thing to a file without thinking.

Sunday, June 7, 2009

Ancient Social Engineering Attack

Various Social Engineering attacks are in the news these days. By taking advantage of people's vulnerabilities, criminals can get personal information or have programs run on their computers that can harvest information later, or be used as part of a botnet.

A simple example is where the attacker sends email to a victim saying they have won a prize and that they should visit a web site and confirm their identity by providing sensitive data like their Social Security Number, date of birth or mother's maiden name. The attacker then uses this data for fraud while the victim awaits their prize.

You might think this sort of attack is new, but here is a story of a social engineering attack from the early 1970s.

Dartmouth College developed a time-sharing system in the 1960s and allowed wide scale usage of it in the community. There were lots of non-science related programs available to use, including word processing and early text games. All students, faculty and staff were given user accounts and guest accounts were readily available for visitors.

Old mechanical terminals that ran at 10 characters per second were used to interact with the machine, but most of them were in private offices or clusters that weren't open to the public. The computer center did have what was called the "Public Teletype Room" that had 20 or so terminals and a student available to answer questions in person or by phone. If you were into computing, this was a great job. It was low pressure, you had access to computers and you got to help people with their problems. The only job better than this was being a systems programmer (sysprog), one of those guys who got to write parts of the operating system. This latter job was the highest paid job on campus, though you were only paid for 10 hours a week. Given the excitement of doing real work, it was hard to work only 10 hours, and some students' grades suffered. But the Public Room Assistant job was great and gave plenty of time for working on personal projects.

Many kinds of people used the public area. High school students and even elementary school kids were in there either doing serious work, or just playing computer games. College students had priority, but there were often free terminals and the townspeople used them a lot. Like the town library, kids would go there after school and wait until their parents picked them up after work. It was a safe environment for kids, and with any luck the kids would get the computer bug.

One of the most basic problems people had was how to turn on the computer and type some simple commands. So, they might have wanted to play tic-tac-toe, but they didn't know how to sign in and specify the program to run. After all, these people were not computer science students; they were just trying to use the computer and shouldn't be expected to learn or remember such arcana. Each terminal had a cheat sheet (what we would now call an FAQ) taped next to it which explained where the power switch was, what a user account was, and a few simple commands with examples.

This was a time-sharing system, so any additional user would slow down other users' programs. Most of the games and other programs didn't require many resources, but it could be annoying if you had real work to do and a casual user was playing a game or sending email. So one of the assistants came up with a plan.

He developed a cheat sheet that looked similar to the official one. But on page two, he wrote that the computer had two CPUs and mentioned that one was better than the other. He then said that the user could select which CPU was to be used. This was a complete lie, but not harmful. But then he went on to say that during heavy loads, the user should select the good CPU by typing the command "GOOD". He followed this with the observation that during these times the computer might sign them off because the machine was too busy, and that they should wait and come back later during non-peak times.

The "GOOD" command didn't exist, but the command line interpreter would parse the first three letters and that would match with the "GOODBYE" command. Needless to say, this command would sign the user out. Now remember, to the user, this was explained, so he never thought he had done anything wrong. This was just a clever denial of service attack which would only work on the casual user who went to the trouble of actually reading the documentation.

It was several months before someone in power discovered this bit of fakery.

It makes sense to allow abbreviations like "GOO" for the longer "GOODBYE", but it causes problems when a parser allows matches that are not close. Early language compilers typically ignored characters in an identifier after a certain length. For example, FORTRAN might ignore characters after 6 or 8 characters, and BASIC would scan only the first three letters of a statement, allowing RESET and RESTORE to mean the same thing. While allowing statements to have different spellings may only cause problems when porting programs to another machine, ignoring identifier differences can lead to immediate bugs which may be hard to see.