A Eureka Moment!
Back on the 13th August I posted a long email to the ICFP contest discussion list. What follows below is basically a slightly reworked version of that.
I wanted to share a "discovery" that I had made. I knew it could (and still might) turn out to be something already well known in certain circles (or be obvious to others) but I couldn't find anything on the net that seemed to be describing the same kind of thing. It certainly wasn't obvious to me, so I felt rather excited that perhaps this might be something "new" - even if it ultimately also turned out to have little or no real value!
In essence, I first found a way (for some simple cases at least) to precisely describe the number of instructions executed by a particular self-interpreter (when running another program) using a matrix, plus a less significant vector, both derived directly from a static analysis of the internals of that self-interpreter. Later on I stumbled onto the discovery that the dominant eigenvalue of that same matrix was the limit, as N goes to infinity, of the ratio of the total instructions executed by a stack of N+1 levels of the interpreter to the total instructions executed by a stack of just N levels (with both "stacks" running the same top level program). This also meant that as long as the time required to execute each opcode is some fixed positive amount, the same ratio describes the limiting ratio of the total running times as well.
In the same way that the set of eigenvalues and associated eigenvectors characterise a matrix, this ratio appears to characterise a particular self-interpreter. Therefore I decided to name this ratio the eigenratio of that self-interpreter.
Here's a blow-by-blow description of how I got to this point:
In an earlier post I described how I had done some experiments, and it seemed that as long as you had a high enough stack of um.um self-interpreters running some other UM binary, then adding yet another instance of um.um to the stack made the whole thing about 24.5 times slower. (Just under that value actually.)
That is, you can construct a stack of N um.um interpreters with another program to top it all off, and then run the whole thing with some other valid interpreter (your own C implementation, for example). If we then look at the ratio of the time taken to run the stack with N+1 interpreters to the time taken to run the stack with just N interpreters, it appears that the ratio converges to a value of around 24.5 as N increases. That is, as N increases, N+1 levels of the interpreter will take about 24.5 times as long as N levels to run the same top level program. Furthermore, it seemed that the limit value of this ratio didn't change when a different top level program was used (on the very top of the stack).
Later I added a few lines of code to my own UM interpreter (the C implementation) so it would report exactly how many times each instruction in the UM instruction set was being executed at the lowest level of the "stack of self-interpreters" (at the level where things were "really" happening!). I collected those statistics with different numbers of um.um in the stack, and then put them into a spreadsheet where I could experiment with assigning more or less arbitrary, but still fixed positive, "costs" to each instruction. Once again, it seemed that no matter what cost (e.g., execution time) I assigned to individual instructions, in the end the ratio of the total "cost" for N+1 versus just N levels of um.um running some other top level program still converged to the same value of around 24.5! This was intriguing...
Next I decided to implement my own simple virtual machine/instruction set, and then to write a self-interpreter for it (in the same style as um.um) but with the specific goal of achieving a lower ratio than the 24.5 (in the limiting case) that um.um produced. I did this with a very simple machine with (initially) only 6 instructions and finally got my own self-interpreter running. It showed the same kind of behaviour as um.um when run as a stack of multiple copies of itself, but with a ratio of about 15.2. After examining the code for this interpreter I added another specialised instruction to replace a commonly used sequence of three others, and the updated VM and self-interpreter achieved a ratio of about 11.5. (Still a long way to go to get anywhere near the "Golden Ratio" (~1.618...) that I had plucked out of thin air for my earlier (and not entirely serious) "conjecture", but at least some progress was being made in the right direction!) Another subsequent smaller tweak to the instruction set reduced the ratio to about 11.2.
At this stage I decided to try a different approach, and after spending some time with pen and paper found a tidy way to describe and directly calculate the number of times each instruction would be executed in order to run a stack of N self-interpreters plus some other program on top. The solution was to use matrices.
The structure of my own various self-interpreters was simple. A very short section of code executed once at start-up, and then a loop was entered that, on each iteration, decoded the opcode of the next instruction from the program being interpreted (the next level up) before branching to some code to emulate that instruction (adjusting memory addresses etc. along the way).
Therefore for any instruction being interpreted, I could follow the execution path through the interpreter and determine exactly which instructions it executed and also how many times. From this I built a matrix M with m rows and m columns (each row and column corresponding to one of the m instructions in the instruction set). Each column of the matrix corresponds to the emulation of one instruction from the instruction set, and the values in that column (reading the rows from top to bottom) are exactly the number of times each instruction is executed in the current level of the interpreter in order to handle that task. In other words, the value in row i and column j is the number of times instruction i has to be executed during the main loop in order to handle one instance of instruction j in the code being interpreted.
Next, I constructed an m by 1 vector (call it P) where each value is the number of times the corresponding instruction is executed during a full run of the top level program. (The order of the values in this vector must match the order used for the instructions in the rows and columns of M. In my case the opcodes were integers from 0 to m-1, and the same ordering was used for the rows/columns of the matrices.)
Thirdly, construct one more vector (S) with the same dimensions as P where the values are again the numbers of times the various instructions are executed, but this time covering any code in the start-up section of the interpreter (before it enters its main loop to handle instructions from the program being interpreted).
Now (using juxtaposition to represent matrix multiplication), we can say:
P is the number of times each instruction is executed to run the top level program.
MP + S is the number of times each instruction is executed by a self-interpreter running the top level program.
M(MP + S) + S = MMP + MS + S is the number of times each instruction is executed by a stack of two self-interpreters running the top level program.
And so on for higher levels...
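To make that recurrence concrete, here is a minimal NumPy sketch. The matrix M and the vectors P and S below are made-up numbers for an imaginary 3-instruction machine (they don't come from any of my real interpreters); only the shape of the calculation matters.

```python
import numpy as np

# Imaginary 3-instruction machine; all numbers are made up for illustration.
# M[i, j] = times instruction i runs in the interpreter's main loop to handle
#           one instance of instruction j from the program one level up.
M = np.array([[4.0, 2.0, 3.0],
              [1.0, 5.0, 2.0],
              [2.0, 1.0, 6.0]])
P = np.array([100.0, 40.0, 60.0])   # counts for the top level program itself
S = np.array([5.0, 3.0, 2.0])       # one-off start-up counts of the interpreter

def counts_for_stack(n):
    """Instruction counts at the bottom of a stack of n self-interpreters
    running the top level program: apply counts -> M @ counts + S, n times."""
    counts = P.copy()
    for _ in range(n):
        counts = M @ counts + S
    return counts

for n in range(1, 8):
    ratio = counts_for_stack(n).sum() / counts_for_stack(n - 1).sum()
    print(n, ratio)   # settles towards the dominant eigenvalue of M
```

With a real interpreter you would fill in M from the static analysis described above, and P and S from the top level program and the interpreter's start-up code.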
So far, so good. I was then able to use the NumPy package for Python (which provides support for various matrix operations), and after fixing a couple of errors in my matrices I was able to reproduce exactly the same instruction counts that I had previously obtained only by having my own C interpreter track them.
I was pretty happy at that point, having found a nice way to use matrices to predict exactly the number of times each instruction would be executed for a given stack of N copies of some self-interpreter plus some other program on top of that. But now I found myself wondering what characteristic of the matrix M (ignoring S because it is very small in comparison and therefore has very little effect on the big picture) produced the apparently fixed ratios observed as described earlier.
Now it's been a long time since I've had any need to use matrix determinants and so on, but I did vaguely remember the names of a few things of that nature, if not the actual uses of them! So more or less on a whim I googled "eigenvalues", found an online calculator, copied and pasted the contents of my own 'M' matrix, and clicked "calculate". Imagine my surprise when the first listed eigenvalue matched the (11.20...) ratio I had observed - to some number of decimal digits!! Definitely one of those "eureka moments"!
To cut a long story slightly shorter, I have now refreshed my memory on eigenvalues and eigenvectors and have also subsequently found the description of the "Power iteration" algorithm which seems to provide at least a basis for understanding why repeated multiplication by the matrix M eventually generates results that are essentially identical to multiplying by the magnitude of the dominant eigenvalue of M instead. Building a stack of interpreters is essentially equivalent to executing the Power Iteration algorithm using the matrix derived from the interpreter being used in that stack!
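For what it's worth, here is a tiny sketch of power iteration itself, again using a made-up matrix rather than one extracted from a real interpreter. Repeatedly multiplying a starting vector by M (and renormalising) makes the per-step growth factor converge towards the magnitude of the dominant eigenvalue, and the vector itself line up with the dominant eigenvector - which is essentially what a growing stack of interpreters is doing.

```python
import numpy as np

# Made-up matrix standing in for a real interpreter's M, for illustration only.
M = np.array([[4.0, 2.0, 3.0],
              [1.0, 5.0, 2.0],
              [2.0, 1.0, 6.0]])

v = np.ones(M.shape[0])   # any start vector with some component along the dominant eigenvector
estimate = 0.0
for _ in range(50):
    w = M @ v
    estimate = np.linalg.norm(w) / np.linalg.norm(v)   # growth factor for this step
    v = w / np.linalg.norm(w)                          # renormalise so the numbers stay manageable

print("dominant eigenvalue estimate:", estimate)
print("dominant eigenvector estimate (normalised):", v)
```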
In summary, as we let N increase towards infinity, the ratio of the total instructions executed to run N+1 interpreters (plus some other fixed program as the very top level) to the total instructions executed when running just N interpreters (plus the same top-level program) approaches the magnitude of the dominant eigenvalue of the matrix M that describes the internals of the interpreter (as described earlier).
The eigenvalue equation tells us that:
M.(eigenvector k) = (eigenvalue k).(eigenvector k)
So (disregarding the effect of S) we can see that as we pile more and more interpreters onto the stack, the vector holding the number of times each instruction has been executed converges in direction towards the "dominant eigenvector" of the matrix M (which describes the heart of the interpreter), and the ratio between successive levels converges to the magnitude of the corresponding "dominant eigenvalue".
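If you'd rather not paste matrices into an online calculator, NumPy will do the same job; M here is just an illustrative placeholder for a matrix derived from a real self-interpreter.

```python
import numpy as np

# Placeholder values; in practice M comes from the static analysis of the interpreter.
M = np.array([[4.0, 2.0, 3.0],
              [1.0, 5.0, 2.0],
              [2.0, 1.0, 6.0]])

eigenvalues, eigenvectors = np.linalg.eig(M)
k = np.argmax(np.abs(eigenvalues))                 # pick the dominant eigenvalue
print("eigenratio (dominant eigenvalue):", abs(eigenvalues[k]))
print("dominant eigenvector:", eigenvectors[:, k].real)
```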
So far I've just been talking about counts of instructions executed. We can, however, easily assign a different cost to each instruction using another 1 by m vector. The dot product of this cost vector with the final counts vector gives the total "cost". But when you calculate the ratio with the costs included, the terms cancel: in the limit each level's counts vector is simply the previous level's scaled by the dominant eigenvalue, so weighting both by the same fixed costs leaves the ratio unchanged, and we get the same value as before using just the raw instruction counts.
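A quick numerical check of that claim, again with made-up values for M, P, S and an arbitrary cost vector (none of these come from a real interpreter):

```python
import numpy as np

# Same style of made-up example as before, purely illustrative.
M = np.array([[4.0, 2.0, 3.0],
              [1.0, 5.0, 2.0],
              [2.0, 1.0, 6.0]])
P = np.array([100.0, 40.0, 60.0])
S = np.array([5.0, 3.0, 2.0])
cost = np.array([1.0, 7.0, 2.5])      # arbitrary fixed positive cost per instruction

def stack_counts(n):
    """Instruction counts for n levels of the self-interpreter plus the top level program."""
    counts = P.copy()
    for _ in range(n):
        counts = M @ counts + S
    return counts

n = 12
plain_ratio = stack_counts(n + 1).sum() / stack_counts(n).sum()
cost_ratio = (cost @ stack_counts(n + 1)) / (cost @ stack_counts(n))
print(plain_ratio, cost_ratio)        # both approach the same dominant eigenvalue
```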
Where to from here? At one point I was hoping that once I understood the relationship between a particular matrix M and the corresponding ratio, it might help me find the smallest possible ratio - and therefore (in some sense) describe the "most efficient" self-interpreter possible. However, clearly M could (at least in a purely hypothetical sense) be the identity matrix, and in that case the ratio is 1. Adding new levels of the interpreter wouldn't make anything run more slowly, because if M equals the identity matrix (with ones down the leading diagonal and zeroes elsewhere) then we are saying that each instruction from level N+1 of our stack of programs can be emulated by a single instruction at level N. That doesn't make any sense to me without appealing to some kind of magic, or perhaps some as yet undiscovered computing device that can do something for nothing. So the lower limit on the "eigenratio" for any self-interpreter (and any "computing machine") remains an open question as far as I can see.