Monday, October 19, 2009

Notepad stories part II

In the late 1990s, Microsoft was shipping two different versions of Notepad. One shipped on the 16 bit versions of Windows (Windows 95, 98, ME) while the other shipped on the true 32 bit versions (Windows NT 4.0 and Windows 2000). The 16 bit version only handled ANSI character files, and only files that would fit in an edit control. I think this was 64K characters, but it could have been 32K. The 32 bit version handled both ANSI and Unicode character files and could theoretically handle files of up to 2^32 characters.

Since Notepad should be an easy application to use, some design decisions were made to keep it simple, and these led to a couple of interesting bugs.

The limitation on file sizes for Windows 95 was not acceptable. I heard that at one time the Windows 95 team considered eliminating Notepad and just shipping WordPad, which handled large files and also Rich Text Format files. But WordPad had performance problems at the time, so both were shipped. Notepad was changed to run WordPad if the file to be edited was too large. This was all done under the covers, but users could see that a different program was running. It was important to have a program called Notepad because many programs had explicit references to the application. For example, setup programs would launch Notepad if the user wanted to view the License Agreement.

WordPad looks at the file contents to see if it is plain text or rich text. Rich text has all sorts of formatting information embedded in it for things like bold and italic, as well as more advanced formatting options. If the text appears to be rich text, WordPad will display it as such. Notepad doesn't do this: it considers all text to be just characters, except for the special interpretation of the end of line sequence and of the character that marks the file as Unicode. More about this in a future post.

So, later a bug was reported to me from the Application Compatibility team. It turned out that a program that worked just fine on Windows 95 was displaying what was considered to be garbage on Windows 2000. The non-Microsoft program was running Notepad on a text file that turned out to be rich text, and the file was large enough that Notepad on Windows 95 would run WordPad on it. Windows 95 worked fine, but Windows 2000 failed. Actually, it was stranger than that. The application didn't specify Notepad, but just referred to the .TXT file. Explorer knows that to view this file means to run Notepad. If the application had named the file with the .RTF suffix, it would have worked fine on both Windows 95 and Windows 2000.

The fix could have been in either Notepad or in the app compat layer for ShellExecute. If you fixed it in Notepad, then whenever Notepad opened a file, it would scan it to see if it looked like a Rich Text file and, if so, let WordPad handle it. The problem with this solution is that the user may have actually wanted to see the actual characters in the file, not how the file looks after WordPad has interpreted the various formatting commands in it. In fact, you can imagine that some people may generate Rich Text files using Notepad.
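
For what it's worth, rich text is easy to recognize: an RTF file starts with the literal signature "{\rtf". Here is a minimal sketch in C of the kind of check the hypothetical Notepad fix would have needed (illustrative only, not the actual Windows code):

#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the file begins with the RTF
   signature "{\rtf". A sketch of the idea, not Notepad/WordPad code. */
static int LooksLikeRichText(const char *path)
{
    char header[5] = {0};
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    size_t n = fread(header, 1, sizeof header, f);
    fclose(f);
    return n == sizeof header && memcmp(header, "{\\rtf", 5) == 0;
}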

So, in the end, Notepad was not changed to get around this one application's problem. The fix, if I remember correctly, was put into the ShellExecute app compat layer: when the faulty application called ShellExecute, another bit of code ran that detected this file being passed and launched WordPad instead.

Thursday, June 18, 2009

Notepad stories part I


I picked up development responsibilities for Microsoft's Notepad app in the fall of 1993, right after it had been converted to be a Unicode application. Our test team had converted it from a 16 bit app to a 32 bit app to test some porting tools and found ourselves the proud owners because the Shell team refused to deal with it. I have to admit it wasn't pretty; the port had been done with automated tools and the Unicode conversion was done rather quickly by someone on a tight schedule.

It was strange that a test team was responsible for maintaining shipping components, but there you are. I also maintained some of the games, most notably Solitaire and Minesweeper, for the same reason. Dealing with these was pretty simple since we tested in the area and they didn't have any Product Managers changing requirements on them, so it was pretty much maintenance mode fixes: keep up with UI changes, use better common dialogs, and in the games' case, use a common cards DLL. But they were heavily used, so there were always some suggestions for improvements.

I cleaned up the Notepad code and started working on the first real problem: performance. It turned out that loading a 1 megabyte Unicode file would take 5 minutes to display on the fastest machine we had at the time: a 50 MHz MIPS box. This problem had to be fixed in multiple places in GDI and wasn't a Notepad problem. Apparently GDI cached glyphs and character metrics for computing character widths, but only kept at most 256 contiguous characters at a time. With Unicode, if you used characters that were more than 256 code points apart, GDI would recalculate the metrics. And it turned out there were layers of these caches: one in user mode, and at least one in kernel mode. It took a while to find all these caches and do the right thing, especially on small memory configurations.

The next problem was the string find function. It used to be fast, but when the Unicode conversion happened it got very, very slow. This function does a case insensitive search for a string in the edit buffer. In the ANSI character set, converting to upper-case is easy and is typically done with a table lookup. Unicode is much harder. You might think it could be done with a 64K table, but you would sometimes be wrong. There are single characters that turn into two characters when you convert their case, and other non-intuitive things. The proper thing to do is use the Unicode Win32 compare API, but this API has to do all this interesting work and was the cause of the slowdown.

The trick is to convert the first character of the string you are looking for into both upper and lower-case and search for either of these in the buffer. If one is found, use the Unicode compare API. Most of the time the first character will not match and the slow API will not be called.
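
A minimal sketch of that trick in C, using the Win32 CompareStringW API for the full comparison (illustrative only; the real Notepad code certainly differed in the details):

#include <windows.h>
#include <wctype.h>

/* Sketch of the search trick described above, not the actual Notepad code.
   buf/bufLen is the edit buffer, pat/patLen the string being searched for. */
static const WCHAR *FindStringNoCase(const WCHAR *buf, int bufLen,
                                     const WCHAR *pat, int patLen)
{
    WCHAR upper = (WCHAR)towupper(pat[0]);
    WCHAR lower = (WCHAR)towlower(pat[0]);

    for (int i = 0; i + patLen <= bufLen; i++) {
        /* Cheap filter: most positions fail here, so the expensive
           locale-aware comparison below is rarely called. */
        if (buf[i] != upper && buf[i] != lower)
            continue;
        if (CompareStringW(LOCALE_USER_DEFAULT, NORM_IGNORECASE,
                           buf + i, patLen, pat, patLen) == CSTR_EQUAL)
            return buf + i;
    }
    return NULL;
}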

This performance problem was not seen in Windows 95 because a) that version of Notepad was ANSI only and b) the largest file the Windows 95 Notepad could handle was 32K characters. Really. If the Windows 95 Notepad detected a file larger than that, it ran WordPad on it instead. This led to an interesting bug reported to us later.

Lots of bugs get reported against Notepad, but most of them aren't Notepad bugs per se. But I had to triage them when they came in and assign them to the right developer. For example, if Notepad couldn't print right on a certain printer, the bug would be assigned to me when clearly it was a print subsystem bug. There were also common dialog bugs and edit control bugs that I had to deal with. The funniest one was a report of a blue screen (kernel crash) caused by Notepad just days before we shipped NT 3.1. I looked at the stack trace and realized that it was aborting inside the 16 bit DOS/Windows emulator. It turned out that the user was running the 16 bit version of Notepad, which wasn't even in the shipping product, but was on a network share that he was using.

We had one other case where Notepad blue screened the OS. Notepad needs to know the length of the file it is editing, and I used the first API I could find that returned the 64 bit length. If memory serves me, it was GetFileInformationByHandle. This API returned far more information than I needed, but it met my needs. But then a user reported crashing the system when he accessed a file on a network drive from Notepad. The code at fault was deep in the remote file system code, but the developer of that part of the system (Larry Osterman) was just down the hall. The problem was in his code, but only showed up when accessing a share on a DEC VAX system. The API that I was using had exercised a little-used part of the system that clearly hadn't been tested on ALL known implementations of this network protocol. Larry tried to convince me to use a lighter weight API, but since it was only called once per file open, I left it in. Since Notepad is a popular application, it might find other implementations with the same bug.
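
For reference, this is roughly what that call looks like. GetFileInformationByHandle fills in a BY_HANDLE_FILE_INFORMATION structure that carries the 64 bit size along with a lot of other data (a sketch, not the original Notepad source):

#include <windows.h>

/* Return the 64 bit size of an open file via GetFileInformationByHandle. */
static BOOL GetFileSize64(HANDLE hFile, ULONGLONG *size)
{
    BY_HANDLE_FILE_INFORMATION info;
    if (!GetFileInformationByHandle(hFile, &info))
        return FALSE;
    *size = ((ULONGLONG)info.nFileSizeHigh << 32) | info.nFileSizeLow;
    return TRUE;
}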

More Notepad stories in part II

Wednesday, June 17, 2009

Trust but verify




When we started True Basic in 1983, none of us had real microprocessor experience. A couple of us had worked on minis and a cool bit-sliced machine, but we had no mainstream computer chip experience. So, we immediately went out and got a Z8000 based Unix box for development. Seriously. Then we bought an IBM PC and before the end of the year, we had an Apple Mac and soon an Apple Lisa.

Our strategy was to develop portable software on the Unix box and then load it onto various forms of personal computers. It wasn't obvious at the time that the Intel/IBM architecture was going to be so successful and we also had dreams of getting our software to run on the Apple II which had good numbers back then, plus the Motorola 68000 machines and whatever came next.

The software architecture was pretty standard stuff: Write almost everything in a high-level language, compile this into a byte code for an imaginary machine, then interpret this machine with an interpreter specific to each computer we want to support. Everything but the interpreter would be common to all machines, and the interpreter could be written in either a high-level language, or in assembler if speed was a problem.

Since each machine had requirements for data layouts etc that would impact performance, there was a linker that knew about the particular target machine and set up auxiliary tables as needed. The linker was written in a high-level language, too. So the only thing that, in theory, needed to be changed when porting to a new machine was the interpreter.

So we got to design that ideal machine for ANSI BASIC and then write several interpreters for it. David Pearson architected a storage-to-storage machine with complex effective addressing. There were other language systems at the time that used a stack-based machine, notably the UCSD P-system, but I think our design led to more compact code though it required a larger interpreter. It was an engineering guess that we would save space for the total system doing this.

For example, if you had the statement:


let a= b+c

with a stack machine this would turn into something like:

push b
push c
add
pop a

Assuming that an opcode would take 1 byte and an address would take 1 byte also, this would add up to 7 bytes. If you did the same thing with a memory-to-memory instruction set, it would look like:

add b,c,a

and this would be only 4 bytes long.
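
A sketch of what the inner loop of an interpreter for such a memory-to-memory machine might look like, assuming one-byte opcodes and one-byte operand addresses into a data area (a hypothetical encoding, not True Basic's actual one):

#include <stdint.h>

enum { OP_HALT = 0, OP_ADD = 1 };    /* hypothetical opcodes */

/* code[] is the byte code stream; data[] is the variable area. */
static void Run(const uint8_t *code, double *data)
{
    const uint8_t *pc = code;
    for (;;) {
        switch (*pc++) {
        case OP_ADD: {               /* add b,c,a  ->  4 bytes total */
            uint8_t b = *pc++, c = *pc++, a = *pc++;
            data[a] = data[b] + data[c];
            break;
        }
        case OP_HALT:
            return;
        }
    }
}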

With the ideal machine target settled, we could immediately write an interpreter for it, build a simple compiler and development environment and try stuff out. And so we did. We picked a small subset of the eventual language to implement, wrote the interpreter in Lattice C, and implemented the compiler and IDE (development environment or UI) in, listen carefully, True Basic. Since we didn't have a True Basic compiler (yet), anything written in True Basic was actually written in the form of Basic that was on the Unix box. But only that subset of the language that existed on both the Unix box and True Basic was used.

This version had lots of limitations, by design. It was written in the "near memory model" so only 64K bytes of memory could be used for data. This simplified our compiler and interpreter, but put a serious cap on the size of programs that could be run. And because the interpreter simply used the C memory manager, if an allocation request was rejected there was no way to compact memory and try again. But this version was completed in a couple of months and allowed other people to try the language. It also provided us with a demo version to show to investors or publishers.

For the real version, we had to start over and write the real compiler for the whole language, and the interpreter for the IBM PC had to be written in assembler. C compilers didn't do a great job with optimizing back then, and we needed the speed to compete with other Basics in the marketplace.

Since I was going to be programming in assembler, I read the Intel manuals from cover to cover. I reverse engineered the rules for how to calculate the cycles taken per instruction rather than just looking each one up, so I had a mental model of what was costly and what wasn't. Then I dove in and wrote some code and timed it. Lo and behold, the IBM PC was nowhere near as fast as the Intel manual would lead you to believe. This baffled us for a while.

Then we noticed a little fine print in the back of the manual. It said the timings all assumed the instruction prefetch buffer was full. The 8086 and 8088 had a small lookahead buffer (6 bytes on the 8086, 4 on the 8088) that would fetch the next instructions if there was memory bandwidth free. This might happen on 8086 machines with fast RAM, but on an 8088 with an 8 bit bus and slow RAM, it would rarely happen.

In the end, the best model for speed on the IBM PC was to count the number of bytes that were read or written and multiply by 1 microsecond. About the only time where the prefetch buffer could get ahead was when a divide operation happened.

I created a list of one byte, two byte and three byte instructions and memorized them. In particular, if I could use a one byte instruction in place of a two or three byte, I did. I pretty much ignored Intel's timings.

Lesson learned: Trust but verify. Or if you have created a mental model, verify it if possible.

I plan in future posts to cover some of the tricks or coding gems from this interpreter, and perhaps some of the cool designs in this language system. I wasn't the designer, so I can say this. It was ported to several machines including the Amiga, Atari, Apple Mac, and a Wang VS machine (something like IBM 360 with a stack). It was later ported to 32 bit Windows, but I had left the company by the time that happened.

Tuesday, June 16, 2009

Bring your flashlight



Raymond Chen's post yesterday brings up the point that performance engineering is not always straightforward and that academic algorithm analysis can sometimes send you off in the wrong direction. Years after you have taken a course, you may remember a summary of a solution without remembering the basis for that conclusion.

It is easy for developers to think they know where they need to improve the performance of their programs. Alas, they are often wrong, so one learns early on to follow these rules:

1) Collect the scenarios that you wish to improve, and
2) Measure where the program is actually using resources (CPU, memory, network traffic, disk IO, etc.)

If you have a working system with users, then these steps are possible. You might even have some team members who specialize in these activities so they can measure usage, or have even developed tools to measure actual resource usage.

But if the system is being designed, you will have to do some back-of-the-envelope work, or cobble up a prototype to test. You don't want to spend too much time on the prototype because it is, by design, going to be thrown away. It might also be useful for playing with algorithms, internal interfaces, etc., but if you are unlucky it will appear to work well enough to ship. Prototypes really should only be used as learning devices, but things happen on the way to shipping.

I have always found doing this "fit and finish" work to be quite pleasant. You have a working program that people can use and you can make easy improvements every day or so that can be seen. If something doesn't work out, then you don't make the change and your users can still get work done.

Rarely am I not surprised when I measure real scenarios. Probably the biggest surprise for me happened in 1990, when I was working on an MS-DOS interpreter for a weird in-house language. We had this large application that ran on a PC and was written in several different special languages but compiled down into an interpreted language (think p-code if you must). Since this was MS-DOS, the interpreter implemented its own virtual memory and did a fairly good job of squeezing as much out of a small PC as was needed at the time.

I had to add my own performance measurement code, which in this case was just a program counter sampler that ran off the timer interrupt. My scenarios were provided by various people who complained about slowness. Not scientific, but most operations took less than a second on this system, so the slow ones were noticeable. I was hoping to find some algorithm or data structure that needed to be fixed, tested and deployed, but instead there was just one line of C code in the interpreter that was using a lot of the time. The C code looked like this:

*ptr = 0.0;


and the underlying machine code looked like this:

fldz
fstp qword ptr [bx];

This code was in a loop initializing to zero a floating point array. This should be fast enough, but my instrumentation said otherwise. Have you figured it out yet?

When the IBM PC came out, it came with the Intel 8088 chip. It was a low-end 16 bit chip with 8 bit external data paths and no floating point instructions. If you wanted floating point, you could add it later by buying the Intel 8087 chip. This was not cheap and most people didn't bother. So when a floating point instruction was executed on these machines, it faulted and a bit of software would simulate the operation. This emulation was not fast, in part because the IEEE floating point standard is pretty exacting.

In this case, the fix was easy since we just had to initialize memory to certain values. There was no need to use the (missing) floating point chip for this simple operation. I believe it was:

*(long*)ptr = (long) 0.0 ;
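
Because an IEEE double with value 0.0 is all zero bits, the array can be cleared with plain integer stores (or memset) and the floating point emulator never gets involved. A sketch of the idea behind the fix:

#include <string.h>

/* Zero a double array without executing any floating point
   instructions; IEEE 754 +0.0 is the all-zero bit pattern. */
static void ZeroDoubles(double *a, size_t count)
{
    memset(a, 0, count * sizeof(double));
}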

In 1984, Phil Koch worked for True Basic developing an interpreter for an ANSI Basic system on the Apple Macintosh. He used the system-provided floating point routines and wrote the interpreter in a few weeks. He was proficient at assembler, but had never worked with a Motorola 68000 or a personal computer. The first Macs certainly didn't have floating point hardware, but the system code would use the hardware when it was available on later models. The interpreter worked fine, but the floating point part was, again, very slow.

If memory serves me, the Mac interpreter had contagious floating point just like the IBM PC interpreter being developed at the same time. All numbers in this system could be floating point, but for the cases where the number fit into a 16 bit value, it was stored as that integer and flagged as such. The IEEE standard has values called "Not a number" (NAN) which could be used (or misused) for this purpose. The interpreter would notice these NAN values and avoid the floating point operations. So things only got slow when numbers were not in the 16 bit integer range. (Aside: we didn't invent this idea. Apparently it was an old LISP idea, but I can't find a reference)
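
The same idea shows up today under the name "NaN boxing". A sketch in C of how a 16 bit integer might be tucked into a NaN payload, purely to illustrate the scheme (the bit layout here is made up; the actual True Basic representation is not documented in this post):

#include <stdint.h>
#include <string.h>

/* Hypothetical tag: a quiet NaN whose low 16 bits hold the integer. */
#define TAG_BASE 0x7FF8000000000000ULL

static double BoxInt16(int16_t v)
{
    uint64_t bits = TAG_BASE | (uint16_t)v;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Returns 1 and extracts the integer if d is a boxed 16 bit value. */
static int IsBoxedInt16(double d, int16_t *out)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof d);
    if ((bits & ~0xFFFFULL) != TAG_BASE)
        return 0;                    /* an ordinary floating point value */
    *out = (int16_t)(bits & 0xFFFF);
    return 1;
}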

Phil typically worked late into the night on his programs, probably because of years of working on mainframes -- they gave you better response when no one else was using them. But that was the summer of the Los Angeles Olympics, so he was spending some evenings at home watching the highlights. One night the TV coverage ended, but it was too late to bicycle into town to work. He tried sleeping, but to no avail. So, to pass the time, he wrote out by hand most of a floating point interpreter that was used to replace the Apple-provided one, only it was very fast. It didn't have to do all the odd corners of the IEEE standard, it didn't have the system call overhead and it didn't have to simulate the floating point chip down to the bit level. I think he had to consult Knuth's Art of Computer Programming Vol. 2 to get the divide right, but otherwise it was done in an evening of idle time.

We later had a curious performance bug reported to us on the IBM PC version. It was exposed with the following program:

let a= 0
let b= 0.5
print a
print b-b

Both print statements were very fast unless you had a floating point chip installed, in which case the second print took about half a second! It turns out that once a number is considered to be a floating point number, it stays that way (hence the term contagious). So b-b computed a floating point zero, which is a value with a very negative exponent (think 1.0 times 2 to the -308th power). The print routine, in the process of trying to figure out what power of ten the number fell into (think scientific notation), was slowly and carefully doing lots of multiplies until the result was around 1.0, or until it noticed that it had gone on for too long and declared the value to be zero.

If you didn't have a floating point chip, every floating point operation would check to see if the value was an integer again and, if so, flag the result as such. If you had the floating point chip, it was decided to skip this check since it would take time. The fix was to speed up the print routine to find the appropriate power of ten more quickly.

I don't want you to get the idea that floating point is fraught with performance problems. But using floating point does have its challenges and pitfalls, some of which relate to security. A future post will cover some that I know about. But it is probably safe to say that many of these bugs would not be obvious without measurement tools.

BTW, the idea of instrumenting the interpreter to find bugs like this goes way back. I just re-read Donald Knuth's Literate Programming chapter 10, titled "The Errors in TeX", where he explains that he did this in a compiler he wrote in the early 1960s. Apparently, he wrote the compiler in some interpretive language and then wrote the interpreter so that it would fit into the tiny machine he was using (which may have been the IBM 650).

Monday, June 8, 2009

Guest Accounts


Handling guest accounts properly is always a little tricky.

You want them to be useful, but you want to limit the negative effects they may have on authenticated accounts and data. So, you limit the resources they may use and protect data they can access.

The processor itself is pretty easy to protect. Operating systems already have mechanisms to fairly distribute the CPU power to different processes and with any luck will also carry account information that can be used to enforce fairness across accounts. And modern file systems have ways to protect data from being accessed or modified by different accounts.

But it's possible for a guest account to misuse the computer by using lots of disk space and not freeing it before signing off. You could limit the amount of space used during a session, but if you allow disk space to be retained between sessions, several guests, working in concert, could fill up the disk. On DTSS, this was solved by severely limiting the amount of disk space a user could use for permanent storage, and having a separate larger limit for temporary files used in a process. This latter trick allowed compilers to create large intermediate files. Also, by policy, guest accounts were handed out in person, so there were not a lot of them in use.

In the 32 bit version of Windows (Windows NT, Windows 2000, Windows XP etc), there is also the concept of a guest account. When a guest signs in, an empty user directory is made for that guest and when the guest signs off, the directory is deleted. This leads to an interesting security bug.

The process that deleted the guest directory ran with elevated privileges. The code would recurse through the directory, deleting files and directories. Unfortunately, it did not impersonate the guest account before doing this. If the guest left behind a link that pointed to a file the guest didn't have rights to delete, this deletion process would happily delete the target since it had plenty of permissions. In fact, most OS files could be deleted. No good can come from this. And before you ask, yes, this bug is fixed.
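
The general pattern that avoids this class of bug is to impersonate the account whose files are being cleaned up, so that the normal access checks apply to every delete. A minimal Win32 sketch (illustrative; the real profile cleanup code is more involved and also has to be careful not to follow links):

#include <windows.h>

/* hGuestToken is a token for the account being cleaned up;
   path is one file inside that account's directory. */
static BOOL DeleteAsUser(HANDLE hGuestToken, LPCWSTR path)
{
    if (!ImpersonateLoggedOnUser(hGuestToken))
        return FALSE;
    BOOL ok = DeleteFileW(path);     /* access checked as the guest */
    RevertToSelf();
    return ok;
}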

Of course, this is still a problem if an administrator does the equivalent thing to a file without thinking.

Sunday, June 7, 2009

Ancient Social Engineering Attack



Various social engineering attacks are in the news these days. By taking advantage of people's vulnerabilities, criminals can get personal information or have programs run on their computers that can harvest information later, or be used as part of a botnet.

A simple example is where the attacker sends email to a victim saying they have won a prize and that they should visit a web site and confirm their identity by providing sensitive data like their Social Security Number, date of birth or mother's maiden name. The attacker then uses this data for fraud while the victim awaits their prize.

You might think this sort of attack is new, but here is a story of a social engineering attack from the early 1970s.

Dartmouth College developed a time-sharing system in the 1960s and allowed wide scale usage of it in the community. There were lots of non-science related programs available to use, including word processing and early text games. All students, faculty and staff were given user accounts and guest accounts were readily available for visitors.

Old mechanical terminals that ran at 10 characters per second were used to interact with the machine, but most of them were in private offices or clusters that weren't open to the public. The computer center did have what was called the "Public Teletype Room" that had 20 or so terminals and a student available to answer questions in person or by phone. If you were into computing, this was a great job. It was low pressure, you had access to computers and you got to help people with their problems. The only job better than this was being a systems programmer (sysprog), one of those guys who got to write parts of the operating system. This latter job was the highest paid job on campus, but you were only paid for 10 hours a week. Given the excitement of doing real work, it was hard to work only 10 hours and some students' grades suffered. But the Public Room Assistant job was great and gave plenty of time for working on personal projects.

Many kinds of people used the public area. High school students and even elementary school kids were in there either doing serious work, or just playing computer games. College students had priority, but there were often free terminals and the townspeople used them a lot. Like the town library, kids would go there after school and wait until their parents picked them up after work. It was a safe environment for kids, and with any luck the kids would get the computer bug.

One of the most basic problems people had was how to turn on the computer and type some simple commands. So, they might have wanted to play tic-tac-toe, but they didn't know how to sign in and specify the program to run. After all, these people were not computer science students; they were just trying to use the computer and shouldn't be expected to learn or remember such arcana. Each terminal had a cheat sheet (what we would now call an FAQ) taped next to it which explained where the power switch was, what a user account was and a few simple commands with examples.

This was a time-sharing system, so any additional user would slow down other users' programs. Most of the games and other programs didn't require many resources, but it could be annoying if you had real work to do and a casual user was playing a game or sending email. So one of the assistants came up with a plan.

He developed a cheat sheet that looked similar to the official one. But on page two, he wrote about the fact that the computer had two CPUs and then mentioned that one was better than the other. He then said that the user could select which CPU was to be used. This was a complete lie, but not harmful. But then he went on to say that during heavy loads, the user should select this CPU by typing the command "GOOD". He followed this with the observation that during these times the computer might sign them off because the machine was too busy, and that they should wait and come back later during non-peak load times.

The "GOOD" command didn't exist, but the command line interpreter would parse the first three letters and that would match with the "GOODBYE" command. Needless to say, this command would sign the user out. Now remember, to the user, this was explained, so he never thought he had done anything wrong. This was just a clever denial of service attack which would only work on the casual user who went to the trouble of actually reading the documentation.

It was several months before someone in power discovered this bit of fakery.

It makes sense to allow abbreviations like "GOO" for the longer "GOODBYE", but it causes problems when a parser allows matches that are not close. Early language compilers typically ignored characters in an identifier after a certain length. For example, FORTRAN might ignore everything after the first 6 or 8 characters of an identifier, and BASIC would only scan the first three letters of a statement, allowing RESET and RESTORE to mean the same thing. While allowing statements to have different spellings may only cause problems when porting programs to another machine, ignoring identifier differences can lead to immediate bugs which may be hard to see.
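
The parser behavior behind the prank is easy to picture: match only the first three letters of the input against the command table, as in this sketch (a hypothetical table in C; the actual DTSS command interpreter was, of course, not written this way):

#include <stdio.h>
#include <string.h>

static const char *commands[] = { "GOODBYE", "HELLO", "LIST", "RUN" };

/* Return the first command whose first three letters match the input. */
static const char *MatchCommand(const char *input)
{
    for (size_t i = 0; i < sizeof commands / sizeof commands[0]; i++)
        if (strncmp(input, commands[i], 3) == 0)
            return commands[i];
    return NULL;
}

int main(void)
{
    printf("%s\n", MatchCommand("GOOD"));   /* prints GOODBYE */
    return 0;
}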

Sunday, May 31, 2009

Code Gems - Signum function


 

 

 


The signum function returns the sign of its argument. There are several different definitions, but a common one can be written in C as


int Signum( int x )
{
if( x < 0 ) return -1;
if( x == 0 ) return 0;
return 1;
}

 

In x86 instructions, this can be done with:


CDQ ; (1) x (eax) is sign extended into edx
NEG EAX ; (2) negate x (carry set if x != 0)
ADC EDX,EDX ; (2) add with carry either -1 or 0
; result in edx

 

Not bad for 5 bytes of code.

This code was discovered by Henry Massalin and documented in his 1987 paper "Superoptimizer: a look at the smallest program". This short 5 page paper had a number of interesting ideas, but the main one for generating small functions was:

1) Generate all reasonable functions of size 1, 2, 3... bytes long, and
2) Eliminate the ones that clearly don't compute the same value as the function needed.

The first step is relatively easy, but you have to eliminate wild jumps or memory references to arbitrary addresses. From my reading, he just eliminated all jumps and memory references and only worked on values in registers.

The second step was accomplished by sending test vectors of values to the known correct function (e.g. the signum function above) and comparing the results with those of the generated function. Since most generated functions will be wrong, the chances of a generated function getting the correct answer for all the test vectors are very small. Still, the author examined any generated function that passed all the test vectors.

The test vectors could be hand generated (if you know a lot about the function), or generated randomly. Obviously, if the user generates them it takes more time initially, but that may speed up the process later. I suspect for many functions, generating random numbers is good enough.
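
A sketch in C of the test-vector filter: run a candidate and the known-good reference over the same inputs and discard the candidate at the first mismatch (candidates here are just function pointers; the real superoptimizer generated and executed machine code sequences):

#include <stddef.h>

/* Known good reference implementation. */
static int Signum(int x) { return (x < 0) ? -1 : (x == 0) ? 0 : 1; }

/* Return 1 if the candidate matches the reference on every test input. */
static int PassesTestVectors(int (*candidate)(int),
                             const int *tests, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (candidate(tests[i]) != Signum(tests[i]))
            return 0;        /* most candidates are rejected here quickly */
    return 1;                /* a survivor still needs hand inspection */
}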

There have been many follow-up papers to this. While it was easy to minimize the code size, with a proper CPU model of timings, the program could also compute minimum time routines. And since the number of different routines grows exponentially with the number of instructions, you will need a lot of computing time to generate and test routines greater than 10-15 bytes long without some significant changes to the generator.

Massalin mentioned a not so obvious usage: giving feedback to the assembly language programmer. He mentioned that he used it to write a 500 byte printf routine. New assembly language is not written much anymore, but it still comes up. For example, Excel's recalc engine is written in 32 bit x86 assembler and if its performance wasn't good enough, this type of optimization could come in handy.

There is a parallel between this idea and the idea of testing two or more implementations of the same function. In testing, you may have a known good model for what the code is supposed to do and the implementation under test. In this case, you take your test vector, run it through both your model and the implementation being tested and compare the results. If both are correct, by definition the results will be the same. The model may be correct but quite slow, but this is fine for testing.

This is one of the ways to test graphics drivers. The model for every operation may be easy enough to write but way too slow for real systems while the driver is optimized for speed both because of the code and the graphics chip. But having one model of what is right allows you to test many different graphics drivers and chips.

One can even use this idea to test computer languages if you have more than one implementation. All you have to do is know how to generate valid programs, and then let the different implementations run them. If the results differ, you need to investigate. There are many limitations to this scheme, but it can improve the testing over hand coded tests. More about this later...

Thursday, May 28, 2009

Abstractions and Core Memory


One of the most enduring abstractions in computer science is that memory is located by address, holds values for a long time (but typically not forever), and is uniform in the cost of access. Various kinds of storage like disks or tape are meant for the longest term storage and hold values, to a first order approximation, forever. Glossed over in this abstraction are a few obvious things like different performance characteristics and size limitations.

It takes a while to learn that this model of memory is not exactly right. To get a program working, you need to know lots of other things and this simple model works well. If you are lucky, you won't have to deal with the differences between reality and the model.

The concept that memory is of infinite size is the first to go. Your first programs never need much space, but then you start down the road of saving computations, collecting data from external sources that need rapid access etc. One day, you either get an out of memory error, or you notice that your program takes way too much time because you have less real memory than the virtual memory that your program needs. Solving these problems usually takes some major surgery to the program, and anyone who has discovered these problems just before delivering will learn to do some estimating early on in the design.

The next thing you might discover is that the pattern by which the program addresses memory can change the speed of access. This is largely due to cache memory in the CPU, but could also be due to memory that hasn't been referenced in a while being paged out. Programmers who care about performance learn about how memory caches operate, but there are many different implementations with different characteristics. Unless you are lucky enough to be programming for just one implementation, your code should act reasonably well on all supported processors, which is not going to be easy given that they keep coming out with new processors. You probably need to make some benchmark programs to run tests on any processors you decide to support, and you probably need to keep up with what the chip manufacturers are planning.
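
A classic illustration of the access-pattern effect: walking a two-dimensional array row by row touches memory sequentially and is cache friendly, while walking it column by column strides through memory and is usually much slower. The exact ratio depends on the processor and its caches, which is exactly the point. A small sketch in C:

#define ROWS 1024
#define COLS 1024

static int grid[ROWS][COLS];

/* Sequential access: consecutive elements share cache lines. */
static long SumRowMajor(void)
{
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += grid[r][c];
    return sum;
}

/* Strided access: each read likely touches a different cache line. */
static long SumColumnMajor(void)
{
    long sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += grid[r][c];
    return sum;
}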

In the end, you learn that memory has lots of features. It may be shared between processes, it is not uniform in its performance, it may disappear because of access permissions and it may change value from one instruction to another because of another CPU or thread.

These cases are fairly well covered in various processor manuals, device driver kits and text books. So even though we as an industry continue to see mistakes made because people don't understand these, at least they are discoverable by most programmers.

This is all background to two stories from computing days before semiconductor memories took over the market. I started computing on machines that used core memory and even worked with a machine later that used drum memory as its main memory.

Core memories work by storing state in magnets: magnetic fields are created in the right direction with electric currents in wires. Each magnetic donut had two electric wires running at right angles through it. The current in one wire wasn't enough to flip the magnet, but if both were on, then that donut would flip its magnetic direction. Later, to read the value, the field is set in one direction and if the magnet changes its direction, it will induce a current in a sense wire. If the magnet doesn't change direction, no current will be created in the sense wire. So, in this way, the memory subsystem would know which way the magnet was set. More than you want to know about this technology can be found at http://en.wikipedia.org/wiki/Core_memory

It is worth pointing out that reading the memory actually sets the values to zero, so the memory subsystem has to re-write the value. Since it takes time to write the old value, specs at the time would cite how long it would take to read the value and how long it would take before the next operation could start. On one machine I worked on, the architect also figured out that some instructions could be sped up if the re-write was aborted and the memory system waited for a new value to write. This was great for instructions like "Add to Memory". The memory location would be read, the CPU would compute the result and then give the memory system the new value to write to the location that was just read/zeroed. A documented side-effect of this was to lock out any other memory operations during this read-alter-re-write cycle. So these instructions were also used for semaphore and thread locking operations. This caused problems later when semiconductor memory was used and multiple processors were more common.
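
That read-alter-rewrite idea survives today in atomic read-modify-write instructions, which is why it could be used for locking. A minimal C11 spin lock built on atomic_exchange, just to illustrate the connection (not code from any of these old systems):

#include <stdatomic.h>

static atomic_int lock = 0;

static void Acquire(void)
{
    /* atomic_exchange reads the old value and stores 1 in one
       indivisible step, like the read-alter-rewrite cycle above. */
    while (atomic_exchange(&lock, 1) == 1)
        ;                    /* spin until the previous owner releases */
}

static void Release(void)
{
    atomic_store(&lock, 0);
}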

The first time I saw an actual core memory was in the basement of my father's business partner's house. He had been offered an opportunity to sell GE computers in the NYC area and he thought he should know something about them before he took the job. So he bought some surplus parts on Canal Street in Manhattan and put together a working CPU, core stack, and teletype interface. His conclusion was that computers were not as interesting as he thought so he didn't take the job. He gave me the memory pictured above. I think it is a register set, but I really don't know.

OK. Here is the first story:

Dartmouth developed a time-sharing system that was operational by mid-1964. Basic was also invented there by John Kemeny and Tom Kurtz, and John wrote the first compiler. The computer system was actually two separate computers: The GE 235 ran the compilers and user programs and the GE Datanet-30 communicated with the terminals and handled user commands like CMD.exe does on Windows. (Arnold Spielberg was the designer of the 235 for General Electric.)

The computer was available to everyone in the Dartmouth community. This was pretty radical for its time given the expense of computing, but it didn't cost that much more than a more restrictive policy and allowed for interesting non-traditional uses of the computer. The funny thing was that every Monday morning, when the first person signed on, the system would crash. Easy enough to recover from, but annoying and unexplained. I believe it was John McGeachie who finally figured it out, because I heard the story from him.

It turns out that a) no one used the computer on weekends and b) when the machine was idle it just executed a "jump to self" waiting for an interrupt. Why no one would use the computer on the weekends might not be understandable today, but computers back then were islands not connected to other computers. Computer games were basically unknown and really not that interesting on a 10 character per second terminal. But more importantly, Dartmouth was all-male at the time, so non-academic activities were pursued with vigor on the weekends. :-)

So the machine was idle for days executing the same instruction over and over. The net result was that those parts of memory that should not have changed value when that memory was read (remember, reading clears memory) were changed. I'm not clear on the physics of why this happened, but it was observed. The solution was to change the idle loop to also sweep through every address in memory, just reading. This spread out the magnetic fields inside the memory. On the next machine the college got, they implemented the same idea but with the addition of an instruction that you won't see on many machines: the Gray-to-Binary instruction. It was slower than a divide instruction and didn't reference memory. This was done to cut down on memory traffic in case the other CPU or IO channels needed the bandwidth.

The second story comes from the mid-1970s:

College hires are typically given jobs that are challenging, but if they fail, they won't bring the company down with them. My former business partner started a job at Data General where he was given the task of writing the file system repair program, while someone hired at the same time as him got the memory test program.

You might think a memory test is simple. And a simple test is. But to do a good job, you should have a model of how the memory works and what kinds of errors it could have. A good memory test, when it detects an error, will give instructions on what to repair. There are tons of papers and practical knowledge in this area.

Apparently, this new college hire was better than they thought. Or perhaps they were too busy to notice that he had completed what was needed and had moved on to more extensive tests. He noticed that when a core flipped (this was at the end of the core era), it vibrated a little bit. He soon figured out that if you made the cores flip at the right frequency, you could set up a standing wave on the wires. And for a good percentage of the memories, if you did this long enough, it broke the wires.

Management was impressed, but banned the program.

This to me is another example of how even a boring task could be made interesting if you try hard enough. And how, if you hire good people, you may not get exactly what you want.

Wednesday, May 27, 2009

Code gems - Absolute value function

Most of the time, little bits of code are straightforward. But every machine I've worked on has little code sequences that are very clever. Sometimes the code has instructions that are rarely seen, and other times the instructions may be used in odd ways. The older the architecture or the more complex the instruction set, the more chance you have of finding these little gems.

This post is about the ABS function in the Intel x86 architecture.

One of the simpler functions to code would have to be the absolute value function. Given the value X, the absolute value would be defined as


int abs1( int x)
{
if( x < 0) x= -x;
return x;
}

 
Pretty simple. This would easily translate into x86 assembler like so:
(Assume register eax has the value to be operated on in these examples)
(I have left out the prologue and epilogue from the function which wouldn't exist if the function were in-lined)
(The numbers in parens are the number of code bytes)


TEST EAX,EAX ; (2) compare eax with zero
JGE SKIP1 ; (2) jump if eax >= zero
NEG EAX ; (2) negate eax
SKIP1:


 
While this is not a lot of code, it has a jump instruction which on many processors can be expensive. Some recent processors can cut down on this expense if they can predict the jump, but it would be nice to find a solution with no jumps.

Let's look at a different way of doing this:


int abs2( int x )
{
int temp= (x>>31); // assumes a 32 bit integer
return (x ^ temp ) - temp;
}


 
It helps to remember that the expression "x>>31" does an arithmetic right shift of x by 31 bits and that "x ^ temp" means "x exclusive-or temp". The value of temp has either the value of 0 if x is positive or 0xFFFFFFFF (or -1) if x is negative.

At this point, you may want to remember that in two's complement notation, -x = ~x+1.

So, if x is positive, temp will be 0 and x^temp will just be x and the function returns x.

If x is negative, temp will be -1, and x^temp is the same as ~x, so the expression is the same as ~x - (-1), or ~x+1, which is the same as -x.

The abs2 function won't generate great code, but it will avoid any jumps:


MOV ECX,EAX ; (2) copy x to ecx
SAR ECX,1FH ; (3) sign extend ecx thru all
; of ecx ->temp
XOR EAX,ECX ; (2) x ^ temp
SUB EAX,ECX ; (2) (x ^ temp) - temp


 
Not too bad, but even this can be improved upon:

CDQ ; (1) extend the sign of eax
; into edx (temp)
XOR EAX,EDX ; (2) exclusive-xor of temp into x
SUB EAX,EDX ; (2) subtract temp from (x^temp)


 
The CDQ instruction extends the sign of EAX into EDX. It is a 1 byte instruction that dates back to at least the 8086.

If your numbers are in signed-magnitude or one's complement form, then the sign bit just needs to be turned off. Floating point formats typically have an explicit sign bit, so the absolute value function is easier on these formats. But these formats have other corner cases to worry about, like two forms of zero, which makes checking for equality a bit tricky.

I found this implementation buried in ntdll:


MOV EAX,ECX ; (2) make copy of X into EAX
NEG EAX ; (2) negate X
TEST ECX,ECX ; (2) what sign does X have?
CMOVNS EAX,ECX ; (3) conditional move X to EAX if
; X is positive


 
This code also has the feature that there are no jumps, but it takes more code. And even though the CMOV instructions have been around for more than 10 years on Intel processors, there seems to be some question as to whether or not they improve performance. It would be interesting to know how this code was generated.

It is worth noting that there is one negative value for which the absolute value function will return a negative value: the most negative value, 0x80000000. This could cause problems if the code assumes that values that have gone through the absolute value function are in the positive number range.
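
A two-line demonstration of that corner case (negating the most negative value is formally undefined behavior in C, but on typical two's complement hardware it wraps back to the same negative value, and the assembly versions above behave the same way):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    int x = INT_MIN;                   /* 0x80000000 for a 32 bit int */
    printf("%d\n", (x < 0) ? -x : x);  /* typically prints -2147483648 */
    return 0;
}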

References:

http://en.wikipedia.org/wiki/Two%27s_complement

Tuesday, May 26, 2009

Manual Labor

There is an interesting article in this last Sunday's New York Times magazine
http://www.nytimes.com/2009/05/24/magazine/24labor-t.html

entitled "The Case for Working with Your Hands" made lots of interesting observations.

The author, Matthew B. Crawford, noticed that not all jobs that require advanced formal education are either well paid or interesting and that not all jobs that require little formal education are boring or poorly paid. He described several white collar jobs that were truly awful. Sometimes it is the actual work that is awful and sometimes it is the product itself that is hard to stand behind.

Crawford now owns a motorcycle repair shop specializing in older models and speaks of the joy of the job. He knows his customers, is presented with problems with no obvious right answer, apparently has no boss, and has to solve problems daily. He communicates with a community of other mechanics who have a lot of skills and experience. And it sounds like every day is interesting.

Contrast his description of being in a traditional blue collar job with many software engineering jobs. Many programming jobs are pretty boring. Modern corporations are typically organized to isolate programmers from actual customers by having layers of program managers, sales people, customer support organizations etc. And for that matter, the customer support people are isolated from contributing to the code. It is easy under these conditions for important information to be lost and for individuals to become more and more specialized and isolated from the actual customer or problem.

The author uses his jobs as a proof point that people should seriously consider blue collar jobs. But he is kind of mixing apples with oranges. I have certainly worked in software companies where the work was rewarding, management understood the quality of the work and what it was used for and programmers met with customers on a regular basis. And I've certainly worked in entry level blue collar jobs that I would never go back to.

Bill Gates, in an interview sometime before 1992, said that his job as the company got bigger was to make its employees feel like they were working for a small company. He clearly understood that programmers do not work in a vacuum, but work better in small groups. Bill was very quotable, but many of his ideas were not translated down into the trenches. For example, in the mid-1990s, he said that Windows was designed to be Year 2000 compliant. Needless to say, there were many changes made after that to make this statement true, but it wasn't driven by his initiative.

When Windows 2000 shipped, the leader of that effort, Brian Valentine, announced that any software engineer who wished to do so could visit a field office and/or visit some customers with a sales person present. If you wanted to take him up on the offer, you put your name and area of expertise into a database and the sales offices picked who they wanted. Some people went to India, Europe or New Zealand while others stayed in the USA. I was lucky enough to visit Nashville. The sales people kept us busy and we got to hear first hand what issues customers had with their plans for Microsoft software and Windows 2000 deployment specifically. This visitation program was never repeated. There were never any required trip reports, nor was there any effort to see what issues were commonly discussed. It was great for morale, but probably little else.

Finding a career that you love and a company that you love to work at is hard. People often go to college because it is expected of them, and they major in what interests them at the moment. But young people often have broad interests and it is not altogether surprising that what interests them at the age of 18 is not the same as when they are 30.

I once attended a talk by Ray Kroc given to an MBA class. Ray had started McDonalds hamburgers when he was in his 50s and was still quite active in his 80s when I saw him. One of the students, after hearing his talk, asked him what field he would go into to make money if he were graduating from a business school that spring. Ray responded simply: there is no guarantee that you will be rich, but if you do become rich, you will have to work hard. Therefore, he would recommend finding a job that, when you looked back on it in the future, you would feel good about. The problem, he said, was to find that job. This answer didn't make the student happy, but his thoughts have certainly stayed with me over the years. And now, as a parent, I find myself torn: I've always encouraged my daughters to figure out what they loved, so they could do that and feel good about it. But there are times when I wish I'd encouraged them in directions with well paying careers. I'll just have to trust that -- like the book says -- they'll be able to "Do what you love and the money will follow."

Tuesday, May 19, 2009

Programmer's Disease


I am a programmer and have been for 40+ years. I'm not just that of course. I'm a husband, parent of two daughters, a motorcycle rider and I enjoy films. But through the many changes in my life, the joy of the act of getting a program to work still is as thrilling as when my first program worked. The act of thinking about a solution, working out how to express that in a computer language and then making sure it is correct is time consuming and at times difficult, but is almost always rewarding in a way that is hard to explain to non-programmers.

The problem is that this process is addicting. If you are working on a large program, you have to pace yourself. Break down the problem into smaller problems, solve those problems, assure yourself that they are correct and then at the end get them to all work together. You may test the solutions to these sub-problems, and trust that they will work together at the end. But at the end, there are always "integration issues" that were not thought about and take time to work out.

This integration process is the most exciting and the most frustrating part of the process. And it is where the disease is really noticed. You try to get the program to do something and it almost works. Then you have to track down where it went wrong. This can be a time consuming task and requires that you know what intermediate results should be. If you are lucky, your code has enough comments and internal documentation to remind you of what various parts are supposed to do.

During this process, concentration is required to hold all this information in your head. Anything that distracts you from this process takes time to recover from. Time doesn't so much slow down as it gets ignored.

This is where the addiction becomes apparent to your friends, family and colleagues. You miss meetings, you skip meals, you come home hours after you told your wife you were just turning off the computer. You become unreliable and get irritated by anything that takes you away from your program. In short, you act like an addict.

John Kemeny was the first one to point this condition out to me in the mid-1980s. Earlier in his career, he had worked on the Manhattan project doing calculations for Richard Feynman. He went on to be Einstein's assistant, invented the programming language BASIC with Tom Kurtz and became President of Dartmouth College. But he, too, suffered from this addiction to programming. Actually, I don't think he suffered as much as the people around him suffered. He could work really long hours even in his 60s. John wrote the first BASIC compiler and it is well known that it compiled its first program early in the morning of May 1, 1964.

Here is an example of how exciting and time consuming things can be during integration. When we were creating True Basic, we developed a language system with an integrated development environment consisting of a compiler, a UI and an interpreter. Everything except the interpreter was written in Basic, and the interpreter was written initially in C and later in assembler on the machines it ran on. The compiler emitted code for an ideal "Basic" machine. I wrote the initial C interpreter and then re-wrote it in assembler for the actual product. This allowed the compiler writer and the UI writer to make progress while I worked out making the interpreter as fast as possible. The language was fairly large, but only a small subset of the language was needed to support the compiler and UI. I did this work first so we could debug the compiler and UI earlier in the process as well as getting actual feedback from users.

It looked like we were close to getting things running natively on the IBM PC, so John decided to come by and help. In general, he could be a distraction. He would be nice and get pizza, but if he detected that we needed technical help, he would politely ask if we would like help. Of course, he was quite capable of helping, but being who we are, we enjoyed solving these problems. Things were going pretty well, but then I had to leave to catch a 6:30am plane to my grandmother's 90th birthday before things were quite running.

At around 5:30am while packing, I got a phone call from John and David Pearson. They almost had it working, but there was a problem with the character input routine where it masked off the lowest bit instead of the highest bit of any typed-in character. This made it hard to type in programs and get them to run. They had found the problem in my code and were calling to see what the process was to assemble and link it. This was simple but time consuming on the IBM PC; we only had floppy disks at the time and the assembler was slow.


They made the fix and waited for the new interpreter. But they were, shall I say, a bit impatient. So, they figured out what commands they could use and what a simple program would be that would only use the even characters from the keyboard. They proudly told me later that it worked the first time somewhere around 6:00 in the morning.

John was famous for his affection of his wife Jean, so I'm sure he didn't wake her up when he went home. But I'll bet you it was the first thing he spoke to her about after she did awaken.