Anatomy of a compiled sentence

Yazoo bytecode is a free-standing format on its own, and one can write perfectly legitimate bytecode programs in it without ever compiling from a script. In fact, one can write many programs which cannot be compiled from any script. For now, though, we will always take some script as our starting point, and work out its bytecode translation. The teaser at the end of the last section came from the following script:

print("This is a sample script")

This one turns out to be a surprisingly tough nut to crack, and it will take us until the end of the chapter to get it open. But many other expressions have translations that can be inferred, in rough form, from a zeroth-order prescription which we'll give below.

The basic rule for translating a script into bytecode is to think of each operator as a function, and re-arrange its arguments so that they follow the operator. For example, in the following expression

b + 5

the addition operator is essentially a function with two arguments, and the expression could be heuristically rewritten as

Add(b, 5)

Notice how the word order gets scrambled -- the operator comes first, followed by its left-hand argument, then its right-hand argument.

Just as functions sometimes modify their arguments directly instead of returning values (we often call these `commands'), we can find Yazoo operators that do the same. For example, the equation

a = 12

can be thought of as the following command:

Equate(a, 12)

When expressions are built up from sub-expressions, the rule is that the outermost operator in a standard expression is also the outermost -- i.e. first -- operator in the functional representation. This ensures that the outermost operator is evaluated last, since functions evaluate their arguments before they themselves do anything. Putting our last two examples together:

a = b + 5

we see that the equate operator goes last, after b and 5 have been added. The functional representation is thus:

Equate(a, Add(b, 5))

In such a way, we can build up arbitrarily complex sentences of functions that correspond to any reasonable sequence of operator expressions that we may encounter.
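
As a rough illustration of this zeroth-order rule, the sketch below borrows Python's own parser (the standard ast module) as a stand-in for the Yazoo compiler and prints the operator-first form. This is an assumption made for convenience: it only works for expressions that Python and Yazoo happen to write the same way, and the function names translate() and to_functional() are hypothetical.

```python
import ast

# Only the two operators discussed in the text are handled.
def to_functional(node):
    """Rewrite a parsed expression with each operator ahead of its arguments."""
    if isinstance(node, ast.Module):
        return to_functional(node.body[0])
    if isinstance(node, ast.Expr):
        return to_functional(node.value)
    if isinstance(node, ast.Assign):
        # The outermost operator of the sentence comes first in the output.
        return "Equate(%s, %s)" % (to_functional(node.targets[0]),
                                   to_functional(node.value))
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        return "Add(%s, %s)" % (to_functional(node.left),
                                to_functional(node.right))
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise ValueError("unsupported syntax: " + type(node).__name__)

def translate(source):
    return to_functional(ast.parse(source))

print(translate("a = b + 5"))  # Equate(a, Add(b, 5))
```

The recursion mirrors the evaluation order described above: the arguments are rendered (and, in a real interpreter, evaluated) before the operator that encloses them.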

What we have been writing is fairly close to bytecode, but there is one further thing we would have had to change besides the word order. To make execution easier, the compiler replaces each variable name with an ID number. For example, `a' would be replaced by `5' if it was the fifth member name that the compiler had encountered when processing the script. This brings up a potential problem: when Yazoo encounters "Add(..., 5)" should it add 5, or the contents of variable `a'? To make matters worse, operators themselves are denoted by ID numbers; `5' corresponds to the invocation of a user-defined function which could also be legitimate to use here.

Yazoo discriminates numbers, variables, etc. by tagging each of these with an operator that specifies what kind of thing it is. For numbers we use either a `slong' or `double' operator; for variables the operator is really a `member' operator (we have been speaking loosely when calling those things variables). Thus our expression turns into

Equate(member a, Add(member b, slong 5))

where we have to keep in mind that `a' and `b' will appear as numbers in the final reckoning.

Let's compare our guess to what the disassembler gives us. To disassemble the code officially, we can write:

compile("a = b + 5")
bc_str := R_string
print(disassemble(bc_str))

It's important to make sure that the compilation worked -- R_error_code should be 0 after the compile() call. If we pass a bad R_string then the disassembler will probably crash, disassemble itself in confusion and bail out of Yazoo. In our case, we should get the following output:

equ ( sm 1 , add ( sm 2 , csl 5 ) )

If we look at it long enough, we can see that we hit the nail on the head. The disassembler does have its own abbreviations for commands, such as `sm' for search-member and `csl' for constant-signed-long. More confusing is that it writes member IDs rather than member names, since it doesn't know the mapping between the two. We can improve our output by compiling into the AllNames namespace, and then passing AllNames as the optional third argument of the disassembler so that it can read the member names from there.

compile("a = b + 5", AllNames)
bc_str := R_string
print(disassemble(bc_str, *, AllNames))

(Notice that, if we want to pass the third argument of disassemble(), we also need a placeholder for the second -- just set it void.) The output is now:

equ ( sm $a , add ( sm $b , csl 5 ) )

which is very close to what we originally guessed.

We've been using the disassembler, so we are still a step away from what Yazoo actually looks at when it runs the program. The final step in reducing our script is to replace the operators with the operator ID numbers. Equate is special; it is one particular instance of a general-purpose define/equate operator (other instances being, e.g., `::' and `=@'). All define-equate operators have the ID number 11, but the variants are distinguished from each other by a second flags word, which in the case of equate is set to 1.

Let's go back to the original disassembly that made no mention of AllNames, where the member IDs show up as 1 and 2. Using the reference section to obtain the operator IDs (and replacing the member names with IDs), we obtain the following machine-level version of our bytecode:

11 1 ( 18 1 , 40 ( 18 2 , 50 5 ) )

The raw output of the compiler is a string; converting that string to an array of long words (declared as ulong below) gives us something that we can compare with:

> words[*] :: ulong
> compile("a = b + 5")
> words[*] =! R_string
> sprint(words)
{ 11, 1, 18, 1, 40, 18, 2, 50, 5, 0 }

so we were correct. The only new item in the `real' compiled array is the terminating zero, which signifies the end of the script. Every script ends with a null word, which tells the interpreter to fall back to the enclosing function -- or, if that was the starting script, to exit the program.
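
The hand translation above can be replayed mechanically. The following is a minimal sketch, not the actual compiler: it takes the nested form as Python tuples, uses only the operator IDs quoted in this section (11 with flag 1 for equate, 18 for search-member, 40 for add, 50 for constant-signed-long), and appends the terminating null word. The function names and the tuple layout are hypothetical.

```python
# Operator IDs quoted in the text; FLAG_EQUATE is the second word of define/equate.
OP_DEFINE_EQUATE = 11
FLAG_EQUATE = 1
OP_SEARCH_MEMBER = 18
OP_ADD = 40
OP_CONST_SLONG = 50
END_OF_SCRIPT = 0

def flatten(node, member_ids):
    """Return the bytecode words for a nested (operator, args...) tuple."""
    kind = node[0]
    if kind == "equ":
        return ([OP_DEFINE_EQUATE, FLAG_EQUATE]
                + flatten(node[1], member_ids)
                + flatten(node[2], member_ids))
    if kind == "add":
        return [OP_ADD] + flatten(node[1], member_ids) + flatten(node[2], member_ids)
    if kind == "sm":
        # Member names are replaced by their ID numbers.
        return [OP_SEARCH_MEMBER, member_ids[node[1]]]
    if kind == "csl":
        return [OP_CONST_SLONG, node[1]]
    raise ValueError("unknown operator: %r" % (kind,))

def compile_sentence(tree, member_ids):
    """Flatten one sentence and add the null word that ends every script."""
    return flatten(tree, member_ids) + [END_OF_SCRIPT]

tree = ("equ", ("sm", "a"), ("add", ("sm", "b"), ("csl", 5)))
words = compile_sentence(tree, {"a": 1, "b": 2})
print(words)  # [11, 1, 18, 1, 40, 18, 2, 50, 5, 0]
```

Note that the parentheses and commas of the disassembly leave no trace in the word list: each operator knows how many arguments to expect, so the nesting is implied by the order alone.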

Pathnames

A Yazoo pathname looks superficially like a C pathname: it starts with the name of some member, which may be followed by some combinations of dots and brackets that refine the path. For example, the expression

my_struct.array[5].x

could be a legitimate path in either C or a Yazoo script. However, the roles of those dots and brackets are completely different. In C the compiler reduces the entire pathname to some fixed offset from my_struct. By contrast, Yazoo with its dynamic memory cannot know at compile-time where, or even if, each subsequent piece of the path can be found; it has to search for those at runtime. Thus the entire path is stored member-by-member in the compiled bytecode, with the dots and brackets retained as operators which move Yazoo's search-beam from one variable to the next.

A path typically begins with some member that Yazoo can find on its own, by searching backwards from the current function, followed (optionally) by members or indices which blaze a new trail forward one-step-at-a-time from that first member. The ordering in compiled code is, basically, backwards. Again thinking of the step-operators as functions, we reason that since the innermost function gets evaluated first, that must correspond to the first step. In heuristic language we might write our path as

step_to_member( step_to_index( step_to_member( search_member my_struct, array ), 5 ), x )

where, as before, we have explicitly written the operators out longhand. We can compare with the disassembly.

dqa* ( sm $550 , sID ( sti ( sID ( sm $my_struct , $array ) , csl 5 ) , $x ) )

For now we will ignore the first few operators, up through sm $550 (that number will change) -- those will be explained at the end of the chapter. The step-to-member operator is abbreviated sID (since member names are replaced by ID numbers), and the step-to-index is abbreviated sti. Notice that step-to-member is a different operator from search-member (sm), since the latter begins a path, whereas the former continues a path and so requires an additional first argument specifying the path up to that point.

Some array operators require two index arguments. The most basic example is the step-to-indices command (as opposed to step-to-index), which accepts two additional arguments after the initial path. The compiled forms of the array-element-insertion operators `[+...]' and `+[...]' also take index ranges, although the compiler accepts single indices and simply duplicates the entry to generate a range. For example, the code
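
The inside-out nesting of a path can be reproduced with a short loop. The sketch below is only an illustration in Python: it starts from the search-member operator and wraps one step operator around the expression for each later component, reusing the disassembler's abbreviations (sm, sID, sti, csl). The function name nest_path() and the (kind, value) step format are hypothetical, and the leading dqa* ( sm $550 , ... ) wrapper is left out.

```python
def nest_path(first_member, steps):
    """Wrap a starting member in one step operator per later path component."""
    expr = "sm $%s" % first_member          # the path begins with a search
    for kind, value in steps:
        if kind == "member":                # '.name'  -> step-to-member
            expr = "sID ( %s , $%s )" % (expr, value)
        elif kind == "index":               # '[n]'    -> step-to-index
            expr = "sti ( %s , csl %d )" % (expr, value)
        else:
            raise ValueError("unknown step kind: %r" % (kind,))
    return expr

print(nest_path("my_struct", [("member", "array"), ("index", 5), ("member", "x")]))
# sID ( sti ( sID ( sm $my_struct , $array ) , csl 5 ) , $x )
```

The first step ends up innermost and the last step outermost, which is why the compiled ordering reads backwards relative to the script.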

array[+5]

translates as

iiu ( sm $array , csl 5 , csl 5 )

When confronted with a register (built-in variable), the disassembler will simply print out a register abbreviation, for example `[R_sl]' for R_slong. The actual bytecode contains two words: the first is the register operator, and the second is an ID number of the register to access. The ID numbers of the registers are given in the reference section.

The compiler does something funny to the pathname to the left of a define-operator: it replaces the `search-member' operator with a `step-from-this-to' sequence. For example, the sentence

my_var :: ulong

compiles into

def ( sID ( this , $my_var ) , ul )

This little oddity exists to prevent compiled define statements from accidentally re-defining members of enclosing functions. For example, if my_var had been defined in some class that also contains the active function, this command will define a new my_var member in the active function rather than modify the class's member; subsequent my_var invocations within the function will then access the new member. This allows functions to reuse common names, such as for-loop counters, without having to worry about accidentally overwriting global variables. Admittedly, that means we need a workaround when a script actually does want to redefine part of its parent. In the end this convention was adopted because it seemed the lesser of the two evils, and because it saves us from some rather uncouth programming practices.
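
The difference between searching for a member and stepping from `this' can be pictured with a toy model. The Python sketch below is not how Yazoo stores its members; the Scope class and its methods are hypothetical and exist only to show why a define that always starts from the current function cannot disturb an enclosing one.

```python
class Scope:
    """A toy function scope: its own members plus a link to the enclosing scope."""

    def __init__(self, parent=None):
        self.members = {}
        self.parent = parent

    def search_member(self, name):
        """Search backwards from this scope, as the first member of a path does."""
        scope = self
        while scope is not None:
            if name in scope.members:
                return scope
            scope = scope.parent
        return None

    def define(self, name, value):
        """Model 'sID ( this , $name )': step from this scope, never search."""
        self.members[name] = value

outer = Scope()                      # stands in for the enclosing class
outer.define("my_var", "class-level")
inner = Scope(parent=outer)          # stands in for the active function

found_before = inner.search_member("my_var")   # finds the class's member
inner.define("my_var", "local")                # creates a new local member
found_after = inner.search_member("my_var")    # now finds the local one
```

After the define, lookups from the inner scope stop at the new member, while the outer scope's my_var is left untouched.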

Inlined constants

Inlined constants -- numbers and strings written directly into a script -- survive compilation relatively intact and are copied in one piece into the bytecode. If we were to try reading bytecode in the raw, without the aid of an operator ID table or a disassembler, these occasional sprinkles would be the only recognizable parts of the original script. There are three types of inlined constants in bytecode, each associated with a unique operator. The raw data follows in subsequent words of bytecode.

Numeric constants are stored either as signed longs or double-precision floating point numbers. The compiler chooses the latter only if the number cannot be stored in a slong integer, either because it has a fractional component or it is too large to be stored as an integer. Only numbers typed directly into the script are stored as constants; all numeric expressions (even trivial ones such as 5+2) are evaluated at runtime. Signed longs are written with a constant-signed-long operator (1 word), followed by the signed long value (1 word). Doubles are written in bytecode with a constant-double operator (1 word) followed by the double-precision value (2 words).
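
The choice between the two numeric formats can be summarized in a few lines. The sketch below is a guess at the rule as stated above, written in Python: it assumes a 32-bit signed long and 4-byte bytecode words, and it treats a literal such as 5.0 as an integer because it has no fractional component, which the text does not confirm. The function name constant_kind() is hypothetical.

```python
SLONG_MIN, SLONG_MAX = -2**31, 2**31 - 1   # assumption: 32-bit signed long

def constant_kind(literal):
    """Return (format, total bytecode words) for a numeric literal string."""
    value = float(literal)
    if value.is_integer() and SLONG_MIN <= value <= SLONG_MAX:
        return ("slong", 2)    # operator word + 1 value word
    return ("double", 3)       # operator word + 2 value words

print(constant_kind("5"))           # ('slong', 2)
print(constant_kind("2.5"))         # ('double', 3)
print(constant_kind("3000000000"))  # ('double', 3)
```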

String constants have the added complication of having indefinite length. Since Yazoo strings are often used for storing binary data, Yazoo opts for the `Pascal' string convention rather than the C format: the byte-length of the string is stored explicitly, and there is no terminating character. The constant-string operator (1 word) is followed by the character length N of the string (1 word, signed long, must be positive), followed by the bytes of the string (N/size(slong) words, rounded up). Notice that the string length refers to the number of characters, not the number of bytecode words; and that the final byte of string data is followed by anywhere from 0 to size(slong)-1 null bytes to fill out the last word of bytecode.

The disassembler doesn't show the string-length word -- it just writes out the string in quotes. The phrase "Hello" disassembles to

cst "Hello"

which corresponds to the following four bytecode words:

{ 52, 5, 1214606444, 1862270976 }

Here, the five string characters fill all four bytes of the third word and the most significant byte of the fourth; the remaining three bytes of the last word are null padding.
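
The four words above can be reproduced outside of Yazoo. The following Python sketch assumes 4-byte words and reads the padded string bytes as big-endian signed integers, which is the interpretation that matches the listing; the byte order on a given machine may differ. The operator ID 52 is taken from the example, and the function name encode_string_constant() is hypothetical.

```python
import struct

OP_CONST_STRING = 52   # operator word shown in the "Hello" example
WORD_SIZE = 4          # assumption: size(slong) is 4 bytes

def encode_string_constant(text):
    """Return the bytecode words for an inlined string constant."""
    data = text.encode("ascii")
    # Pad with 0 to WORD_SIZE-1 null bytes to fill out the last word.
    padding = (-len(data)) % WORD_SIZE
    padded = data + b"\x00" * padding
    # Assumption: words are displayed as big-endian signed longs.
    count = len(padded) // WORD_SIZE
    body = list(struct.unpack(">%di" % count, padded))
    # The length word counts characters, not bytecode words.
    return [OP_CONST_STRING, len(data)] + body

print(encode_string_constant("Hello"))  # [52, 5, 1214606444, 1862270976]
```

In hexadecimal the last two words are 0x48656C6C ("Hell") and 0x6F000000 ("o" plus three null bytes).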

Last update: July 28, 2013