Sunday, May 31, 2009

Code Gems - Signum function

The signum function returns the sign of its argument. There are several different definitions, but a common one can be written in C as


int Signum( int x )
{
    if( x < 0 ) return -1;
    if( x == 0 ) return 0;
    return 1;
}

 

In x86 instructions, this can be done with:


CDQ ; (1) x (eax) is sign extended into edx (0 or -1)
NEG EAX ; (2) negate x (carry set if x != 0)
ADC EDX,EDX ; (2) edx = edx + edx + carry
; result (-1, 0, or 1) in edx

 

Not bad for 5 bytes of code.

This code was discovered by Henry Massalin and documented in his paper
"Superoptimizer: A Look at the Smallest Program," published in 1987. This short five-page paper had a number of interesting ideas, but the major one for generating small functions was:

1) Generate all reasonable functions of size 1, 2, 3... bytes long, and
2) Eliminate the ones that clearly don't compute the same value as the function needed.

The first step is relatively easy, but you have to eliminate wild jumps or memory references to arbitrary addresses. From my reading, he just eliminated all jumps and memory references and only worked on values in registers.

The second step was accomplished by sending test vectors of values to the known correct function (e.g. the signum function above) and comparing the results with those of the generated function. Since most generated functions will be wrong, the chances of a generated function getting the correct answer for all the test vectors are very small. Still, the author examined any generated function that passed all the test vectors.

The test vectors could be hand generated (if you know a lot about the function), or generated randomly. Obviously, if the user generates them it takes more time initially, but that may speed up the process later. I suspect for many functions, generating random numbers is good enough.
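
As a rough sketch of the elimination step, a test-vector filter might look something like this in C (the names and structure here are my own illustration, not Massalin's actual code, which worked on generated machine-code sequences):

#include <stddef.h>

typedef int (*IntFn)(int);

/* Return 1 if the candidate matches the reference function on every
   test vector, 0 at the first mismatch. Survivors still need to be
   examined by hand, since passing finitely many tests proves nothing. */
int MatchesReference( IntFn candidate, IntFn reference,
                      const int *tests, size_t count )
{
    for( size_t i = 0; i < count; i++ )
        if( candidate( tests[i] ) != reference( tests[i] ) )
            return 0;
    return 1;
}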

There have been many follow-up papers to this. While minimizing code size was straightforward, with a proper model of CPU timings the program could also search for minimum-time routines. And since the number of different routines grows exponentially with the number of instructions, you need a lot of computing time to generate and test routines longer than 10-15 bytes without some significant changes to the generator.

Massalin mentioned a not-so-obvious use: giving feedback to the assembly language programmer. He used it himself to write a 500-byte printf routine. New assembly code is not written much anymore, but the need still comes up. For example, Excel's recalc engine is written in 32-bit x86 assembler, and if its performance weren't good enough, this type of optimization could come in handy.

There is a parallel between this idea and the idea of testing two or more implementations of the same function. In testing, you may have a known good model of what the code is supposed to do and the implementation under test. In this case, you take your test vector, run it through both your model and the implementation being tested, and compare the results. If the implementation is correct, by definition the results will be the same. The model may be correct but quite slow, but that is fine for testing.
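
As a minimal sketch of this kind of differential testing, here is the signum function from above as the slow-but-obvious model, compared against a branchless version of my own as the implementation under test:

#include <stdio.h>
#include <stdlib.h>

int SignumModel( int x )        /* the known good model */
{
    if( x < 0 ) return -1;
    if( x == 0 ) return 0;
    return 1;
}

int SignumFast( int x )         /* the implementation under test */
{
    return (x > 0) - (x < 0);   /* branchless signum */
}

int main( void )
{
    for( int i = 0; i < 1000000; i++ )
    {
        int x = rand() - rand();    /* crude mix of signs */
        if( SignumModel( x ) != SignumFast( x ) )
        {
            printf( "mismatch at x = %d\n", x );
            return 1;
        }
    }
    printf( "all tests passed\n" );
    return 0;
}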

This is one of the ways to test graphics drivers. The model for every operation may be easy enough to write but way too slow for real systems, while the driver is optimized for speed both in its code and in the graphics chip. But having one model of what is correct allows you to test many different graphics drivers and chips.

One can even use this idea to test computer languages if you have more than one implementation. All you have to do is know how to generate valid programs and then let the different implementations run them. If the results differ, you need to investigate. There are many limitations to this scheme, but it can improve testing over hand-coded tests. More about this later...

Thursday, May 28, 2009

Abstractions and Core Memory


One of the most enduring abstractions in computer science is that memory is located by address, holds values for a long time (but typically not forever), and is uniform in its cost of access. Various kinds of storage like disks or tape are meant for the longest-term storage and hold values, to a first-order approximation, forever. Glossed over in this abstraction are a few obvious things like different performance characteristics and size limitations.

It takes a while to learn that this model of memory is not exactly right. To get a program working, you need to know lots of other things and this simple model works well. If you are lucky, you won't have to deal with the differences between reality and the model.

The concept that memory is of infinite size is the first to go. Your first programs never need much space, but then you start down the road of saving computations, collecting data from external sources that need rapid access etc. One day, you either get an out of memory error, or you notice that your program takes way too much time because you have less real memory than the virtual memory that your program needs. Solving these problems usually takes some major surgery to the program, and anyone who has discovered these problems just before delivering will learn to do some estimating early on in the design.

The next thing you might discover is that the pattern by which the program addresses memory can change the speed of access. This is largely due to cache memory in the CPU, but could also be due to memory that hasn't been referenced in a while being paged out. Programmers who care about performance learn about how memory caches operate, but there are many different implementations with different characteristics. Unless you are lucky enough to be programming for just one implementation, your code should act reasonably well on all supported processors, which is not going to be easy given that they keep coming out with new processors. You probably need to make some benchmark programs to run tests on any processors you decide to support, and you probably need to keep up with what the chip manufacturers are planning.
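
As a trivial example of such a benchmark, something like the following sketch shows the effect of access stride on running time (real cache benchmarks need far more care with timers, compiler optimizations, and warm-up):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE (64 * 1024 * 1024)     /* 64 MB buffer */
#define ACCESSES (1 << 24)          /* same work for every stride */

int main( void )
{
    char *buf = calloc( SIZE, 1 );
    if( buf == NULL ) return 1;

    /* Each pass does the same number of accesses; larger strides get
       less use out of each cache line, so they usually run slower. */
    for( size_t stride = 1; stride <= 512; stride *= 2 )
    {
        size_t index = 0;
        clock_t start = clock();
        for( long i = 0; i < ACCESSES; i++ )
        {
            buf[index]++;
            index = (index + stride) % SIZE;
        }
        printf( "stride %3zu: %.3f seconds\n", stride,
                (double)(clock() - start) / CLOCKS_PER_SEC );
    }
    free( buf );
    return 0;
}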

In the end, you learn that memory has lots of features. It may be shared between processes, it is not uniform in its performance, it may disappear because of access permissions and it may change value from one instruction to another because of another CPU or thread.

These cases are fairly well covered in various processor manuals, device driver kits and textbooks. So even though we as an industry continue to see mistakes made because people don't understand them, at least they are discoverable by most programmers.

This is all background to two stories from computing days before semiconductor memories took over the market. I started computing on machines that used core memory and even worked with a machine later that used drum memory as its main memory.

Core memories work by storing state in magnets: electric currents in wires create magnetic fields in the right direction. Each magnetic donut (core) had two electric wires running through it at right angles. The current in one wire wasn't enough to flip the magnet, but if both were on, that donut would flip its magnetic direction. Later, to read the value, the field is set in one direction; if the magnet changes its direction, it induces a current in a sense wire, and if not, no current is created. So, in this way, the memory subsystem would know which way the magnet was set. More than you want to know about this technology can be found at http://en.wikipedia.org/wiki/Core_memory

It is worth pointing out that reading the memory actually sets the values to zero, so the memory subsystem has to re-write the value. Since it takes time to write the old value, specs at the time would cite how long it would take to read the value and how long it would take before the next operation could start. On one machine I worked on, the architect also figured out that some instructions could be sped up if the re-write was aborted and the memory system waited for a new value to write. This was great for instructions like "Add to Memory". The memory location would be read, the CPU would compute the result and then give the memory system the new value to write to the location that was just read/zeroed. A documented side-effect of this was to lock out any other memory operations during this read-alter-re-write cycle. So these instructions were also used for semaphore and thread locking operations. This caused problems later when semiconductor memory was used and multiple processors were more common.

The first time I saw an actual core memory was in the basement of my father's business partner's house. He had been offered an opportunity to sell GE computers in the NYC area and he thought he should know something about them before he took the job. So he bought some surplus parts on Canal Street in Manhattan and put together a working CPU, core stack, and teletype interface. His conclusion was that computers were not as interesting as he thought so he didn't take the job. He gave me the memory pictured above. I think it is a register set, but I really don't know.

OK. Here is the first story:

Dartmouth developed a time-sharing system that was operational by mid-1964. Basic was also invented there by John Kemeny and Tom Kurtz, and John wrote the first compiler. The computer system was actually two separate computers: The GE 235 ran the compilers and user programs and the GE Datanet-30 communicated with the terminals and handled user commands like CMD.exe does on Windows. (Arnold Spielberg was the designer of the 235 for General Electric.)

The computer was available to everyone in the Dartmouth community. This was pretty radical for its time given the expense of computing, but it didn't cost that much more than a more restrictive policy would have, and it allowed for interesting non-traditional uses of the computer. The funny thing was that every Monday morning, when the first person signed on, the system would crash. It was easy enough to recover from, but annoying and unexplained. I believe it was John McGeachie who finally figured it out, because I heard the story from him.

It turns out that a) no one used the computer on weekends and b) when the machine was idle it just executed a "jump to self" waiting for an interrupt. Why no one would use the computer on the weekends might not be understandable today, but computers back then were islands not connected to other computers. Computer games were basically unknown and really not that interesting on a 10 character per second terminal. But more importantly, Dartmouth was all-male at the time, so non-academic activities were pursued with vigor on the weekends. :-)

So the machine was idle for days, executing the same instruction over and over. The net result was that parts of memory that should not have changed value when memory was read (remember, reading clears memory) were changed. I'm not clear on the physics of why this happened, but it was observed. The solution was to change the idle loop to also sweep through every address in memory, just reading. This spread out the magnetic fields inside the memory. On the next machine the college got, they implemented the same idea, but with the addition of an instruction that you won't see on many machines: the Gray-to-Binary instruction. It was slower than a divide instruction and didn't reference memory. This was done to cut down on memory traffic in case the other CPU or I/O channels needed the bandwidth.
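
For the curious, Gray-to-binary conversion makes each output bit the exclusive-or of all the input bits at and above it, which is why it is a plausible slow, memory-free time-waster. A sketch in C (my illustration; I don't know how that machine's instruction actually worked):

/* Gray-to-binary: each binary bit is the XOR of the Gray-code bits
   at and above its position. The doubling shift-and-xor loop computes
   this prefix XOR for every bit position at once. */
unsigned GrayToBinary( unsigned gray )
{
    unsigned binary = gray;
    for( unsigned shift = 1; shift < sizeof(unsigned) * 8; shift *= 2 )
        binary ^= binary >> shift;
    return binary;
}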

The second story comes from the mid-1970s:

College hires are typically given jobs that are challenging, but if they fail, they won't bring the company down with them. My former business partner started at Data General, where he was given the job of writing the file system repair program, while someone hired at the same time got the memory test program.

You might think a memory test is simple. And a simple test is. But to do a good job, you should have a model of how the memory works and what kinds of errors it could have. A good memory test, when it detects an error, will give instructions on what to repair. There are tons of papers and practical knowledge in this area.
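
For a flavor of what even a basic test looks like, here is a sketch of the classic walking-ones test (my example, far simpler than what this story describes):

/* Walking-ones data-bus test: write a single 1 bit in each position
   and verify it reads back, catching stuck or shorted data lines.
   A real memory test would also cover the address lines and
   cell-to-cell interference patterns. */
int WalkingOnesTest( volatile unsigned *addr )
{
    for( unsigned bit = 1; bit != 0; bit <<= 1 )
    {
        *addr = bit;
        if( *addr != bit )
            return 0;   /* failure at this bit position */
    }
    return 1;
}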

Apparently, this new college hire was better than they thought. Or perhaps they were too busy to notice that he had completed what was needed and had moved on to more extensive tests. He noticed that when a core flipped (this was at the end of the core era), it vibrated a little bit. He soon figured out that if you made the cores flip at the right frequency, you could set up a standing wave on the wires. And for a good percentage of the memories, if you did this long enough, it broke the wires.

Management was impressed, but banned the program.

This to me is another example of how even a boring task could be made interesting if you try hard enough. And how, if you hire good people, you may not get exactly what you want.

Wednesday, May 27, 2009

Code gems - Absolute value function

Most of the time, little bits of code are straightforward. But every machine I've worked on has little code sequences that are very clever. Sometimes the code has instructions that are rarely seen, and other times the instructions may be used in odd ways. The older the architecture or the more complex the instruction set, the more chance you have of finding these little gems.

This post is about the ABS function in the Intel x86 architecture.

One of the simpler functions to code would have to be the absolute value function. Given the value X, the absolute value would be defined as


int abs1( int x )
{
    if( x < 0 ) x = -x;
    return x;
}

 
Pretty simple. This would easily translate into x86 assembler like so:
(Assume register eax has the value to be operated on in these examples)
(I have left out the prologue and epilogue from the function which wouldn't exist if the function were in-lined)
(The numbers in parens are the number of code bytes)


TEST EAX,EAX ; (2) compare eax with zero
JGE SKIP1 ; (2) jump if eax >= zero
NEG EAX ; (2) negate eax
SKIP1:


 
While this is not a lot of code, it has a jump instruction which on many processors can be expensive. Some recent processors can cut down on this expense if they can predict the jump, but it would be nice to find a solution with no jumps.

Let's look at a different way of doing this:


int abs2( int x )
{
    int temp = (x >> 31);   // assumes a 32 bit integer
    return (x ^ temp) - temp;
}


 
It helps to remember that the expression "x>>31" does an arithmetic right shift of x by 31 bits and that "x ^ temp" means "x exclusive-or temp". The value of temp is either 0, if x is non-negative, or 0xFFFFFFFF (-1), if x is negative.

At this point, you may want to remember that in two's-complement notation, -x = ~x + 1.

So, if x is non-negative, temp will be 0, x^temp will just be x, and the function returns x - 0, which is x.

If x is negative, temp will be -1, and x^temp is the same as ~x, so the expression is ~x - (-1), or ~x + 1, which is the same as -x. For example, with x = -5, temp is -1, x ^ temp is 4, and 4 - (-1) = 5.

The abs2 function won't generate great code, but it will avoid any jumps:


MOV ECX,EAX ; (2) copy x to ecx
SAR ECX,1FH ; (3) arithmetic shift right 31 bits:
; ecx is now 0 or -1 (temp)
XOR EAX,ECX ; (2) x ^ temp
SUB EAX,ECX ; (2) (x ^ temp) - temp


 
Not too bad, but even this can be improved upon:

CDQ ; (1) extend the sign of eax
; into edx (temp)
XOR EAX,EDX ; (2) exclusive-xor of temp into x
SUB EAX,EDX ; (2) subtract temp from (x^temp)


 
The CDQ instruction extends the sign of EAX into EDX. It is a one-byte instruction whose 16-bit form, CWD, dates back to the 8086.

If your numbers are in signed-magnitude or ones'-complement form, then the sign bit just needs to be turned off. Floating point formats typically have an explicit sign bit, so the absolute value function is easier in these formats. But these formats have other corner cases to worry about, like two forms of zero, which makes checking for equality a bit tricky.
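
For example, clearing the sign bit of an IEEE-754 single-precision float can be done with a bit mask (a sketch; in practice you would just call fabsf, which compilers typically turn into exactly this kind of masking):

#include <stdint.h>
#include <string.h>

/* Absolute value of a float by turning off the explicit sign bit.
   memcpy is the portable way to get at the bits without violating
   strict-aliasing rules. */
float FloatAbs( float x )
{
    uint32_t bits;
    memcpy( &bits, &x, sizeof bits );
    bits &= 0x7FFFFFFF;             /* clear the sign bit */
    memcpy( &x, &bits, sizeof x );
    return x;
}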

I found this implementation buried in ntdll:


MOV EAX,ECX ; (2) make copy of X into EAX
NEG EAX ; (2) negate X
TEST ECX,ECX ; (2) what sign does X have?
CMOVNS EAX,ECX ; (3) conditional move X to EAX if
; X is non-negative


 
This code also has the feature that there are no jumps, but it takes more code. And even though the CMOV instructions have been around for more than 10 years on Intel processors, there seems to be some question as to whether or not they improve performance. It would be interesting to know how this code was generated.

It is worth noting that there is one negative value for which the absolute value function will return a negative value: the most negative value, 0x80000000, whose negation overflows back to itself. This could cause problems if the code assumes that values that have gone through the absolute value function are in the positive number range.

References:

http://en.wikipedia.org/wiki/Two%27s_complement

Tuesday, May 26, 2009

Manual Labor

There is an interesting article in last Sunday's New York Times Magazine,
http://www.nytimes.com/2009/05/24/magazine/24labor-t.html

entitled "The Case for Working with Your Hands," that makes lots of good observations.

The author, Matthew B. Crawford, noticed that not all jobs that require advanced formal education are either well paid or interesting and that not all jobs that require little formal education are boring or poorly paid. He described several white collar jobs that were truly awful. Sometimes it is the actual work that is awful and sometimes it is the product itself that is hard to stand behind.

Crawford now owns a motorcycle repair shop specializing in older models and speaks of the joy of the job. He knows his customers, is presented with problems with no obvious right answer, apparently has no boss, and has to solve problems daily. He communicates with a community of other mechanics who have a lot of skills and experience. And it sounds like every day is interesting.

Contrast his description of being in a traditional blue collar job with many software engineering jobs. Many programming jobs are pretty boring. Modern corporations are typically organized to isolate programmers from actual customers by having layers of program managers, sales people, customer support organizations etc. And for that matter, the customer support people are isolated from contributing to the code. It is easy under these conditions for important information to be lost and for individuals to become more and more specialized and isolated from the actual customer or problem.

The author uses his jobs as a proof point that people should seriously consider blue collar jobs. But he is kind of mixing apples with oranges. I have certainly worked in software companies where the work was rewarding, management understood the quality of the work and what it was used for and programmers met with customers on a regular basis. And I've certainly worked in entry level blue collar jobs that I would never go back to.

Bill Gates, in an interview sometime before 1992, said that his job as the company got bigger was to make its employees feel like they were working for a small company. He clearly understood that programmers do not work in a vacuum, but work better in small groups. Bill was very quotable, but many of his ideas were not translated down into the trenches. For example, in the mid-1990s, he said that Windows was designed to be Year 2000 compliant. Needless to say, there were many changes made after that to make this statement true, but it wasn't driven by his initiative.

When Windows 2000 shipped, the leader of that effort, Brian Valentine, announced that any software engineer who wished to do so could visit a field office and/or visit some customers with a sales person present. If you wanted to take him up on the offer, you put your name and area of expertise into a database and the sales offices picked who they wanted. Some people went to India, Europe or New Zealand while others stayed in the USA. I was lucky enough to visit Nashville. The sales people kept us busy and we got to hear first hand what issues customers had with their plans for Microsoft software and Windows 2000 deployment specifically. This visitation program was never repeated. There were never any required trip reports, nor was there any effort to see what issues were commonly discussed. It was great for morale, but probably little else.

Finding a career that you love and a company that you love to work at is hard. People often go to college because it is expected of them, and they major in what interests them at the moment. But young people often have broad interests and it is not altogether surprising that what interests them at the age of 18 is not the same as when they are 30.

I once attended a talk by Ray Kroc given to an MBA class. Ray had started McDonalds hamburgers when he was in his 50s and was still quite active in his 80s when I saw him. One of the students, after hearing his talk, asked him what field he would go into to make money if he were graduating from a business school that spring. Ray responded simply: there is no guarantee that you will be rich, but if you do become rich, you will have to work hard. Therefore, he would recommend a job that, when you looked back on it in the future, you would feel good about your work. The problem, he said, was to find that job. This answer didn't make the student happy, but his thoughts have certainly stayed with me over the years. And now, as a parent, I find myself torn: I've always encouraged my daughters to figure out what they loved, so they could do that and feel good about it. But there are times when I wish I'd encouraged them in directions with well paying careers. I'll just have to trust that -- like the book says -- they'll be able to "Do what you love and the money will follow."

Tuesday, May 19, 2009

Programmer's Disease


I am a programmer and have been for 40+ years. I'm not just that of course. I'm a husband, parent of two daughters, a motorcycle rider and I enjoy films. But through the many changes in my life, the act of getting a program to work is still as thrilling as when my first program worked. The act of thinking about a solution, working out how to express that in a computer language and then making sure it is correct is time consuming and at times difficult, but is almost always rewarding in a way that is hard to explain to non-programmers.

The problem is that this process is addictive. If you are working on a large program, you have to pace yourself. Break down the problem into smaller problems, solve those problems, assure yourself that they are correct and then at the end get them all to work together. You may test the solutions to these sub-problems and trust that they will work together. But at the end, there are always "integration issues" that were not thought about and take time to work out.

This integration process is the most exciting and the most frustrating part of the process. And it is where the disease is really noticed. You try to get the program to do something and it almost works. Then you have to track down where it went wrong. This can be a time consuming task and requires that you know what intermediate results should be. If you are lucky, your code has enough comments and internal documentation to remind you of what various parts are supposed to do.

During this process, concentration is required to hold all this information in your head. Anything that distracts you from this process takes time to recover from. Time doesn't so much slow down as it gets ignored.

This is where the addiction becomes apparent to your friends, family and colleagues. You miss meetings, you skip meals, you come home hours after you told your wife you were just turning off the computer. You become unreliable and get irritated by anything that takes you away from your program. In short, you act like an addict.

John Kemeny was the first one to point this condition out to me in the mid-1980s. Earlier in his career, he had worked on the Manhattan project doing calculations for Richard Feynman. He went on to be Einstein's assistant, invented the programming language BASIC with Tom Kurtz and became President of Dartmouth College. But he, too, suffered from this addiction to programming. Actually, I don't think he suffered as much as the people around him suffered. He could work really long hours even in his 60s. John wrote the first BASIC compiler and it is well known that it compiled its first program early in the morning of May 1, 1964.

Here is an example of how exciting and time consuming things can be during integration. When we were creating True Basic, we developed a language system with an integrated development environment consisting of a compiler, a UI and an interpreter. Everything except the interpreter was written in Basic, and the interpreter was written initially in C and later in assembler on the machines it ran on. The compiler emitted code for an ideal "Basic" machine. I wrote the initial C interpreter and then re-wrote it in assembler for the actual product. This allowed the compiler writer and the UI writer to make progress while I worked out making the interpreter as fast as possible. The language was fairly large, but only a small subset of the language was needed to support the compiler and UI. I did this work first so we could debug the compiler and UI earlier in the process as well as getting actual feedback from users.

It looked like we were close to getting things running natively on the IBM PC, so John decided to come by and help. In general, he could be a distraction. He would be nice and get pizza, but if he detected that we needed technical help, he would politely ask if we would like help. Of course, he was quite capable of helping, but being who we are, we enjoyed solving these problems. Things were going pretty well, but then I had to leave to catch a 6:30am plane to my grandmother's 90th birthday before things were quite running.

At around 5:30am, while I was packing, I got a phone call from John and David Pearson. They almost had it working, but there was a problem with the character input routine: it masked off the lowest bit instead of the highest bit of any typed-in character. This made it hard to type in programs and get them to run. They had found the problem in my code and were calling to see what the process was to assemble and link it. This was simple but time consuming on the IBM PC; we only had floppy disks at the time, and the assembler was slow.


They made the fix and waited for the new interpreter. But they were, shall I say, a bit impatient. So they figured out which commands they could use and wrote a simple program that used only the even characters from the keyboard. They proudly told me later that it worked the first time, somewhere around 6:00 that morning.

John was famous for his affection for his wife Jean, so I'm sure he didn't wake her up when he went home. But I'll bet it was the first thing he spoke to her about after she awoke.