05 Feb, 2009, Silenus wrote in the 1st comment:
Votes: 0
Hi everyone,

This might be a somewhat obscure topic, but I am having some difficulty designing an appropriate intermediate language/representation for a compiler which will map LPC onto a variety of different backends (both the FluffOS stack machine and LLVM are currently planned). The difficulty I am having is with typing and the IL. Certain ILs are typed, like JVM bytecode and LLVM bitcode, but certain ones are not. What are the issues involved? My desire was to use a tree-based intermediate representation for this, similar to the one described in Appel's compiler book, but with modifications which allow for const/load/store of different types (int, float, string, array, mapping, function, class), run-time type identification, object cloning/program loading and dynamic calling.

The problem is I am not sure how to do this with non-simple types. I.e. LPC has types like arrays of ints, and perhaps should have types like mappings with string keys and integer values. I am not really quite sure what to do here. There are several possibilities. Just have type conversions for simple types and somehow write in the IR some tree expressions which can…

Any compiler people around here :-)?
05 Feb, 2009, David Haley wrote in the 2nd comment:
Votes: 0
I'm not necessarily a "compiler person", but I know a few things about the subject. So with that in mind… :smile:

I'm not sure I follow why you care about the typedness of the "machine language". I don't really think of JVM bytecode as an intermediate representation, and I'm not sure it's appropriate to do so. After all, the bytecode is generated at the end of the process, after all kinds of optimizations etc. have been applied to the program structure.

Your IR should obviously be able to capture all of the information you want to express. If types are important to you, then clearly your IR needs to be able to handle types. It is then up to your code generation routines to do the appropriate thing for platforms that don't have types.

You should also think about whether you need the typing to be enforced by the backend. Keep in mind the history of why the Java virtual machine has types: the idea is that you don't always trust code being loaded, and you want to keep funny business to a minimum. Just look at assembly to see that you don't need any typing at all in the machine language to provide types in higher-level languages.

I'm not entirely sure what your question is, and you have some unfinished sentences :wink:

What exactly is the problem? Why can't you just design your IR to capture the semantics of your language, then do your semantic analysis over that, and leave code generation to the very last minute?
05 Feb, 2009, Silenus wrote in the 3rd comment:
Votes: 0
Oops I forgot to finish that sentence before posting ;-). Thanks for the reply.

Basically I am trying to keep certain aspects of the system abstract in the IR (this is post-AST, of course). I am essentially mapping the AST -> IR, then again to possibly three different targets: the FluffOS system, which is completely runtime typed; the LLVM system, which is also typed; and perhaps some other typed stack machine. So basically it's LPC -> AST -> IR -> FluffOS | LLVM | JVM. The problem is, say I completely delete typing from the IR and express notions such as arrays as pointers with offsets, for example: it becomes extremely problematic mapping this onto the FluffOS system, because I would have to reconvert this lower-level representation back into a higher-level one given how high level the machine is. That, I think, is my rationale for typing the IR like this.

Most of the semantic analysis will already be done in the AST -> IR pass. I am just adding essentially an extra layer to keep things a bit cleaner (given three possible targets and the possibility I may add a different language frontend in the future). The question I have is: are there standard methods of adding types to existing IR representations? I was looking over an old book of mine on the JVM bytecode specification, and I noticed that though it has certain additional things I might need, like simple type conversion, it may also lack certain things I need given the strangeness that is LPC (mixed types). I suspect after posting this I am getting closer to a solution, but I am not quite there yet. I suspect that to make it all work I will have to introduce a general Convert(type, exp) opcode and maybe a general TypeCheck(type set, exp) opcode.

But how about more complex types with nesting? E.g. an associative array with string keys and values which are arrays of floats. If I mangle these into the convert and type-check operators (possibly using some sort of unification-style checking plus minimum-compatible-type computation), I am worried that might be considered bad style, because the IR opcode would be so high level and would essentially be doing a ton of work.
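To make this concrete, here is a rough Java sketch of the kind of typed tree IR I have in mind – all of the class names (IrType, Convert, TypeCheck, etc.) are made up, and the real thing would carry more node kinds:

    import java.util.List;

    // Type descriptors: simple types plus nestable composites.
    abstract class IrType {}

    class SimpleType extends IrType {
        enum Kind { INT, FLOAT, STRING }
        final Kind kind;
        SimpleType(Kind kind) { this.kind = kind; }
    }

    class ArrayType extends IrType {
        final IrType element;                      // e.g. array of int
        ArrayType(IrType element) { this.element = element; }
    }

    class MappingType extends IrType {
        final IrType key, value;                   // e.g. mapping string -> array of float
        MappingType(IrType key, IrType value) { this.key = key; this.value = value; }
    }

    // Expression nodes.
    abstract class IrExp {}

    class Const extends IrExp {                    // Const: one per base type
        final Object value; final IrType type;
        Const(Object value, IrType type) { this.value = value; this.type = type; }
    }

    class Convert extends IrExp {                  // Convert(type, exp): simple-type conversion
        final IrType target; final IrExp exp;
        Convert(IrType target, IrExp exp) { this.target = target; this.exp = exp; }
    }

    class TypeCheck extends IrExp {                // TypeCheck(typeSet, exp): runtime error on mismatch
        final List<IrType> allowed; final IrExp exp;
        TypeCheck(List<IrType> allowed, IrExp exp) { this.allowed = allowed; this.exp = exp; }
    }

The point of the nestable descriptors is that a type like "mapping from string to array of float" is just MappingType(STRING, ArrayType(FLOAT)) rather than a special case.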
06 Feb, 2009, David Haley wrote in the 4th comment:
Votes: 0
Why do you have an IR if you already have the AST? The AST, or some form of augmented AST, is normally the intermediate representation – I guess that sometimes some form of three-address code is used as an IR, depending on the target platform. Anyhow, especially since you're generating code for VMs, I don't understand why you need an IR after the AST.

You should never pick a representation that loses information you will later need to recreate. That is (obviously, I should think) a terrible thing to do. So you would never pick an IR that loses typing information if you later need to type things for the VM.

Don't forget that you don't necessarily need to do all your type-checking in the VM. There's a reason why some VMs do type checking, and if you don't have that reason, you shouldn't worry too much about it. Even in Java, for example, with generics, the type checking happens at specific points, with casts introduced into the bytecode, and with some checks done at compile-time. It's not like the entire generics system is implemented in the VM – you don't need to.
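For example, here's roughly what I mean in Java – the compile-time check happens at the generics level, and the VM only ever sees an ordinary checkcast at the use site:

    import java.util.ArrayList;
    import java.util.List;

    public class ErasureDemo {
        public static void main(String[] args) {
            List<String> strings = new ArrayList<>();
            List raw = strings;            // raw view defeats the compile-time check
            raw.add(Integer.valueOf(42));  // unchecked warning, but the VM allows it
            // The compiler inserted a checkcast on the next line; the failure
            // happens here at the use site, not in VM-wide generics machinery.
            String s = strings.get(0);     // throws ClassCastException at run time
            System.out.println(s);
        }
    }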

To summarize, I think you really need to motivate the existence of an IR if you already have an AST. (I would count a slightly augmented AST as an AST – the IR you're talking about seems completely different.) Even if you really do need an IR after the AST, you should never discard information in your IR if you need it later on.

If you need some kind of type checking that isn't in your target VM, add it yourself in the form of a function hook that the VM can call like any other function, and associate with each object a type object at some known offset in memory (or something like that). Really, you're in charge of the memory here, and if you need to add more stuff, well, go ahead and do so. Think as if you were writing assembly or something, and then realize that your life is easier because you're not. :wink:
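A sketch of what I mean, in Java for convenience (all the names are invented, and in a C-level backend the tag would just live at a fixed offset in the object header):

    // Each runtime value carries a type tag; the check is an ordinary call.
    class TypeTag {
        final String name;                 // in a C backend: a word at a fixed offset
        TypeTag(String name) { this.name = name; }
    }

    class RuntimeValue {
        final TypeTag tag;
        final Object payload;
        RuntimeValue(TypeTag tag, Object payload) { this.tag = tag; this.payload = payload; }
    }

    class RuntimeChecks {
        // The code generator emits a call to this like any other function.
        // Tags are assumed unique/interned, so identity comparison suffices.
        static RuntimeValue checkType(RuntimeValue v, TypeTag expected) {
            if (v.tag != expected)
                throw new RuntimeException("type error: expected " + expected.name
                                           + ", got " + v.tag.name);
            return v;
        }
    }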
06 Feb, 2009, Silenus wrote in the 5th comment:
Votes: 0
Well, an AST is a form of IR, but certain compilers use multiple IRs for different purposes. It has crossed my mind that translating directly from the AST to whatever the target is may in some ways be easier, but it also means the language is tightly coupled to the code generation, since an unaugmented AST is still pretty close to the structure of the original language. That being said, it is definitely the simpler approach. I was using Appel's compiler project as a template for some of what I was doing, and he does use multiple IRs, though not being a compiler person I am not entirely sure why. The justification he gives is that with multiple frontends and multiple code generation targets, the added indirection makes it easier to implement a new frontend or backend, but there might be other reasons.

Also, the ASTs I have are basically generated by the parser generation tool, so unlike some AST designs they are extremely close to the language and somewhat cumbersome to work with directly. Yes, I already have an AST, but it is basically just a tree where the various symbols needed for parsing a sentence are removed, since the ambiguities they help resolve are captured by the tree structure.
06 Feb, 2009, quixadhal wrote in the 6th comment:
Votes: 0
For the AIP (Acronym Impaired Population), AST is "Abstract Syntax Tree", and IR is "Internal Representation".

Now that you've been thinking about the design from the top down, as all good programmers are beaten into submission with large hammers to do…. try thinking about the design from the bottom up as well. :)

You have two fixed points in the process. You have a bytecode-level VM, your FluffOS stack machine. You have the dialect of LPC which will be compatible with FluffOS/MudOS mudlibs – possibly extended if you like. Since the stuff in the middle is the part you have control over, design from both ends towards each other. You might find a natural spot to collapse type info that isn't obvious when looking from the top down.

At the end of the day, you only have a handful of base types… integer, real, string – perhaps character, perhaps address. Arrays, mappings, structures are all composites that will (eventually) be broken down into those base types. I don't know at what level your VM will work with things, but if it deals with composite types, you may end up breaking things down to verify correctness before passing it along anyways.
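Something like this little sketch (Java for convenience, Java 16+ pattern matching, names invented) is what I mean by breaking a composite down to its base types to verify it:

    import java.util.List;
    import java.util.Map;

    // Verify a composite value bottom-up by reducing it to base types.
    class BaseTypeCheck {
        static boolean isValid(Object v) {
            if (v instanceof Integer || v instanceof Double || v instanceof String)
                return true;                        // base types: integer, real, string
            if (v instanceof List<?> list)          // arrays decompose element-wise
                return list.stream().allMatch(BaseTypeCheck::isValid);
            if (v instanceof Map<?, ?> map)         // mappings decompose key/value-wise
                return map.keySet().stream().allMatch(BaseTypeCheck::isValid)
                    && map.values().stream().allMatch(BaseTypeCheck::isValid);
            return false;                           // anything else is not a known type
        }
    }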
06 Feb, 2009, David Haley wrote in the 7th comment:
Votes: 0
It makes sense to have multiple IRs in some cases, but it never makes sense to have an IR that loses information needed later on. One thing you could do is to have a typed IR and an untyped IR, depending on whether your target VM has types or not.

I think that the kind of IR you're talking about is quite close to three-address code: something that is almost but not quite machine code. I wasn't sure if you wanted something like that, or something like an augmented AST. In any case, I think an augmented AST can be quite useful.

ASTs (or similar tree structures) are much more easily dealt with using some kind of visitor pattern, at least in my experience.
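In case it helps, a minimal sketch of what I mean by the visitor pattern – an evaluator pass over a toy expression tree (names are just for illustration):

    // A toy expression tree with an evaluator written as a visitor.
    interface ExpVisitor<R> {
        R visitConst(int value);
        R visitBinOp(char op, Exp left, Exp right);
    }

    abstract class Exp {
        abstract <R> R accept(ExpVisitor<R> v);
    }

    class ConstExp extends Exp {
        final int value;
        ConstExp(int value) { this.value = value; }
        <R> R accept(ExpVisitor<R> v) { return v.visitConst(value); }
    }

    class BinOpExp extends Exp {
        final char op; final Exp left, right;
        BinOpExp(char op, Exp left, Exp right) { this.op = op; this.left = left; this.right = right; }
        <R> R accept(ExpVisitor<R> v) { return v.visitBinOp(op, left, right); }
    }

    class Evaluator implements ExpVisitor<Integer> {
        public Integer visitConst(int value) { return value; }
        public Integer visitBinOp(char op, Exp l, Exp r) {
            int a = l.accept(this), b = r.accept(this);
            return op == '+' ? a + b : a * b;       // toy: only + and *
        }
    }

Each new pass (type checking, lowering, code generation) is then just another visitor, and you never have to touch the node classes again.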

So anyhow, at this point, what's the question? :tongue: I lost track of it, but maybe the general "why" is the question…
06 Feb, 2009, David Haley wrote in the 8th comment:
Votes: 0
Hmm, Quix posted while I was writing. Technically, IR is intermediate, not internal, representation.

Just keep in mind that you wouldn't collapse type information when targeting something that actually has typing built-in.
06 Feb, 2009, Silenus wrote in the 9th comment:
Votes: 0
The original question had to do with how to design an intermediate representation which is typed, starting from minimally typed/untyped representations: what sorts of additional node types/opcodes need to be added to an untyped IR to make it type safe? As noted, I don't want to throw out type information at the IR level, because FluffOS is runtime typed in a manner very close to the original language's types. My design for this IR is actually another tree derived from the AST. It is based on Appel's Tree language, but because of differences in the base types the system is a bit more complicated.

Now, with FluffOS we evidently have a number of types built into the system. With LLVM, however, they will need to be emulated to some extent in a thin runtime. These would need to be in the IR because of FluffOS, but they represent pretty high-level constructs for LLVM. My speculation is as follows (for a FluffOS-compatible system):

A new Const/Constructor opcode for each of int, float and string; perhaps also a Constructor opcode for arrays, mappings and functions. A Conversion opcode for converting between the various (simple) types. And some sort of TypeCheck opcode/method for generating runtime errors when types mismatch, with these checks inserted at the relevant places.
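As a sketch of where those TypeCheck opcodes would get inserted, reusing the hypothetical node classes from my earlier sketch (Index and ANY_ARRAY are also made-up names):

    // Index and ANY_ARRAY are invented; TypeCheck etc. are from the earlier sketch.
    class Index extends IrExp {
        final IrExp array, index;
        Index(IrExp array, IrExp index) { this.array = array; this.index = index; }
    }

    class Lowering {
        static final IrType ANY_ARRAY = new ArrayType(null);  // element type unknown
        static final IrType INT = new SimpleType(SimpleType.Kind.INT);

        // Lower an LPC index expression a[i] whose operands are only known as mixed:
        // guard both operands, then emit the plain Index opcode.
        static IrExp lowerIndex(IrExp array, IrExp index) {
            return new Index(new TypeCheck(java.util.List.of(ANY_ARRAY), array),
                             new TypeCheck(java.util.List.of(INT), index));
        }
    }

When the frontend can prove the static types, the same lowering would simply skip the TypeCheck wrappers.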

The issue I have with this is: A) I have no real model for what I am doing; B) it is pretty unsystematic. So the questions are (to reiterate):

1) Are there standard methods for constructing tree-like IRs which are typed?

2) Or are there models of such things in the wild?

3) For complex/composite types, how do you cope with type-checking them in an IR which, maybe for simplicity, should just have simple type checks?
06 Feb, 2009, quixadhal wrote in the 10th comment:
Votes: 0
Since FluffOS is one of your targets, it might be worth poking a query in at http://lpmuds.net/forum/index.php. I know wodan frequents that site, and he's the maintainer for FluffOS (as well as Discworld), so he'd probably be a good source of knowledge about how FluffOS's own LPC parser translates the language into the stack machine code it runs.

I'm just thinking that knowing how the current system works, and perhaps having some insight in why it was done the way it was done, might make it easier to build your own system.
06 Feb, 2009, Silenus wrote in the 11th comment:
Votes: 0
Well, I am actually now passably familiar with the aspects of the FluffOS internals which are relevant, with help from people like Wodan. However, there are many aspects of the FluffOS system I do not wish to emulate in my own driver target, and one of the goals of this project is to modernize the frontend of the FluffOS compiler system. The reason why I am asking here, and might try asking in some other places as well (maybe LtU and comp.compilers), is that the issues I am struggling with are more general and not specific to any of the LPMud driver implementations. In fact, I would go so far as to speculate that my implementation will differ in many ways from FluffOS, LDMud and DGD.
06 Feb, 2009, David Haley wrote in the 12th comment:
Votes: 0
Quote
1) Are there standard methods for constructing tree-like IRs which are typed?

I would imagine that if you didn't find what you were looking for in sufficiently advanced compiler textbooks, the answer is probably no. But yes, there are general principles.

Quote
2) Or are there models of such things in the wild?

Yes, we've talked about several already e.g. Java. You can look into the Java quad representation for an IR for Java, which is really not that different from the bytecode.

Quote
3) For complex/composite types, how do you cope with type-checking them in an IR which, maybe for simplicity, should just have simple type checks?

I don't understand your question. If you have a full notion of types in your IR, then just check that 'obj' is of type Array<Array<Map<Integer,Integer>>> or whatever.

Don't do something "for simplicity" if it makes your life harder. You're not trying to write a machine language here that needs to be fast. You're trying to write a representation from which it is convenient to generate bytecode for some virtual machines.
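For instance, if your IR carries full type descriptors (using the hypothetical node classes from your sketch above), the nested check is just structural recursion:

    // Structural type equality over the nested descriptors (Java 16+ patterns).
    class TypeEq {
        static boolean sameType(IrType a, IrType b) {
            if (a instanceof SimpleType sa && b instanceof SimpleType sb)
                return sa.kind == sb.kind;
            if (a instanceof ArrayType aa && b instanceof ArrayType ab)
                return sameType(aa.element, ab.element);
            if (a instanceof MappingType ma && b instanceof MappingType mb)
                return sameType(ma.key, mb.key) && sameType(ma.value, mb.value);
            return false;
        }
    }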
06 Feb, 2009, Silenus wrote in the 13th comment:
Votes: 0
Thanks for the reply.

I haven't looked at many advanced compiler texts yet, only some comprehensive undergrad/graduate ones like Appel's and of course the purple book. Maybe I need to really read those two Pierce volumes all the way through to get a real handle on this issue.

With question 2 I meant whether there were tree-like IRs of this sort. Obviously JVM bytecode is one example, as is LLVM bitcode.

I guess the question is to what extent you should have complex/composite types in such a language, or whether you could somehow reduce things to just having a container as a type and then do some sort of type-checking process on the elements, expressed in the IR as a more complex tree expression. In many cases, for a given operation, the contained type in LPC at least is quite irrelevant on many levels. The only case where you need to fully check the type is during an assignment operation. You have, however, answered my question: the simplest method would be to hold the full nested type information in the tree.

There is, however, some rather tricky stuff in LPC w.r.t. mixed types and unknown types (where you don't know the type of something when you do something like a Java invoke()). The problem is that the runtime type of a mixed value can influence the operator which should be selected and applied. I suspect I might have to have some sort of special case for situations when I cannot deduce the types of both operands to a binary operator, or have some weird TypeOf + Branch tree expressions which would do the job for me. Introducing types into the BinOps would probably be simplest.
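Roughly, the TypeOf + Branch idea would lower a mixed '+' into something like this (a Java sketch with invented names; the real thing would be IR nodes rather than a method):

    class MixedOps {
        // Dispatch on runtime tags only when the static types weren't deducible.
        static Object addMixed(Object l, Object r) {
            if (l instanceof Integer a && r instanceof Integer b)
                return a + b;                              // int + int
            if (l instanceof Number a && r instanceof Number b)
                return a.doubleValue() + b.doubleValue();  // promote to float
            if (l instanceof String a && r instanceof String b)
                return a + b;                              // string concatenation
            throw new RuntimeException("bad operand types to +");
        }
    }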

Thanks for the help.
06 Feb, 2009, David Haley wrote in the 14th comment:
Votes: 0
I haven't heard of the Appel book; I used the "dragon book" in all of my compiler classes. ("Compilers: Principles, Techniques, and Tools" by Aho, Sethi & Ullman in the 1st edition, plus Lam in the 2nd edition, who incidentally taught one of my classes…) Not sure what the purple book is, to be honest. I'll have a look at it over the weekend to see what they have to say about this.

But yes, it might be worth it to read it all the way through. You're doing something pretty hard, frankly. So it should be expected that this takes a while to understand, let alone actually implement. To be honest I don't fully understand all the issues either.

Silenus said:
I guess the question is to what extent you should have complex/composite types in such a language, or whether you could somehow reduce things to just having a container as a type and then do some sort of type-checking process on the elements, expressed in the IR as a more complex tree expression. In many cases, for a given operation, the contained type in LPC at least is quite irrelevant on many levels. The only case where you need to fully check the type is during an assignment operation. You have, however, answered my question: the simplest method would be to hold the full nested type information in the tree.

I think that there is a question of type theory here. You can think of a container as having one of two types of type: (heh)

1- a container of elements of type T
2- a container of untyped elements

These are quite different things, really. If you adopt the first convention, you fully specify the type of the container and its elements. If you still want something like untyped elements, you could leave the inner type "blank" by specifying the top-most type in your type system (e.g. Object in Java).

I believe that if you have types in the first place, you might as well go with the first convention. This is what C++ does, incidentally: all types are fully specified. In C++, it really matters: you cannot pass vector<int> to a function that expects vector<string>. By contrast, in Java, an ArrayList can be passed around to anything that expects an ArrayList, regardless of the element type(s) – at the risk of creating a runtime exception. It is interesting to look at how generics work w.r.t. this issue.
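To illustrate the two conventions in Java (a toy example, nothing more):

    import java.util.ArrayList;
    import java.util.List;

    public class ContainerTypes {
        // Convention 1: element type fully specified, checked at compile time.
        static int sumTyped(List<Integer> xs) {
            int s = 0;
            for (int x : xs) s += x;
            return s;
        }

        // Convention 2: untyped elements; any assumption becomes a runtime cast.
        static int sumUntyped(List<?> xs) {
            int s = 0;
            for (Object x : xs) s += (Integer) x;   // may throw ClassCastException
            return s;
        }

        public static void main(String[] args) {
            List<Integer> ints = new ArrayList<>(List.of(1, 2, 3));
            System.out.println(sumTyped(ints));     // misuse rejected at compile time
            System.out.println(sumUntyped(ints));   // checked at each cast instead
        }
    }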

You also need to think about how the types matter. Does the IR assume that type checking has already taken place? Well, then you only need as much information as the target VM cares about. For Java, the type of the container is ArrayList – not ArrayList<Integer> or something. Does the VM want to do full nested type checking itself? Well, then you have your answer by necessity.

(Un)Fortunately, depending on how you look at it, the common case is kind of irrelevant. If LPC usually doesn't care about contained types but sometimes does, you always have to care about them. Where the common case would matter is in optimizing your runtime type checking etc. – i.e. it becomes a performance issue, not a type theory or representational issue.

Silenus said:
There is, however, some rather tricky stuff in LPC w.r.t. mixed types and unknown types (where you don't know the type of something when you do something like a Java invoke()). The problem is that the runtime type of a mixed value can influence the operator which should be selected and applied. I suspect I might have to have some sort of special case for situations when I cannot deduce the types of both operands to a binary operator, or have some weird TypeOf + Branch tree expressions which would do the job for me. Introducing types into the BinOps would probably be simplest.

In Java, you always know an object's type to some extent: it is always a subclass of Object. This is actually remarkably useful information, and basically solves the issue of unknown types: if you don't know what it is, you still know that it "is" an Object.

When you have operators that depend on the run-time characteristics of your objects, the easy (and perhaps standard? dunno) solution is to do it all at run-time. That is, look up the left-hand side's type, and then find the appropriate function, and then call it with the right-hand side. Ruby, Python etc. give a good example of this, where things are quite simple really: calling the operator means just looking up some magic addition method on the object – if it exists, use it, if it doesn't, use the default. Problem solved.
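A sketch of that lookup in Java, just to illustrate the idea (a real VM would hang the operator table off the type objects themselves):

    import java.util.Map;

    class RuntimeDispatch {
        interface BinOp { Object apply(Object lhs, Object rhs); }

        // Per-type operator table for '+'; entries here are just examples.
        static final Map<Class<?>, BinOp> ADD = Map.<Class<?>, BinOp>of(
            Integer.class, (l, r) -> (Integer) l + (Integer) r,
            String.class,  (l, r) -> (String) l + r
        );

        static Object add(Object lhs, Object rhs) {
            BinOp op = ADD.get(lhs.getClass());     // look up on the LHS runtime type
            if (op != null) return op.apply(lhs, rhs);
            throw new RuntimeException("no + defined for " + lhs.getClass());
        }
    }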

I'm not really sure how else you could do it if you truly have absolutely unknown types without going to an awful lot of trouble analyzing your entire program to deduce all possible types at all entry points…
07 Feb, 2009, Silenus wrote in the 15th comment:
Votes: 0
Actually, by purple book I was referring to the dragon book. The green one is the 1977 original edition, the red one the 1986 edition, and the purple one the 2nd edition. At the time I was looking for compiler books the dragon book was still in the 1986 edition and seemed a bit dated, so I went with Appel's introductory book. I think Andrew Appel is a Princeton CS prof; he wrote "Compiling with Continuations", and CPS was used in the original SML/NJ compiler.

Unfortunately, because of the mixed type, both container cases you mention occur. I am finding LPC, compared to cleaner language designs, actually quite a challenge to implement, since it is such a weird hybrid. It has first-class untyped objects like Smalltalk, first-class functions, a mixed type for polymorphic behavior, and it defaults to an invoke-type call for calling functions in other objects. It also supports structs, because objects are so heavyweight. Where to insert unification-style type checking is quite a challenge, especially since it's slow and sometimes will need to be done at run time. This is obviously mostly an optimization issue, as you indicated.

Basically, FluffOS does all dispatch and function selection at runtime. I, however, don't like this design, since good FluffOS/MudOS code is often carefully type-annotated and mixed is only used in certain limited places. Someone suggested to me that it might be of some interest to do a form of procedure-based analysis of type flows inside each function, i.e. something like HM-style type inference. But even if you don't go this far, it should be possible in many cases to know which operator you need to use, since in most cases things do reduce to a single type pair. At the cost of introducing some complexity, I suspect the best way to go is to have both a general runtime system for the cases you cannot figure out and a faster common case when you can statically determine the types.

I suspect that for builtin library functions you could use parametric types too, to get stricter type control over functions such as map, filter, etc. when these "efuns" are called.
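E.g. something like this (a Java sketch; the real signatures would of course range over LPC's types, not Java's):

    import java.util.List;
    import java.util.function.Function;
    import java.util.function.Predicate;
    import java.util.stream.Collectors;

    class Efuns {
        // Parametric signatures for builtin efuns: the element types are
        // related across arguments and result, so calls get checked for free.
        static <A, B> List<B> map(List<A> xs, Function<A, B> f) {
            return xs.stream().map(f).collect(Collectors.toList());
        }

        static <A> List<A> filter(List<A> xs, Predicate<A> p) {
            return xs.stream().filter(p).collect(Collectors.toList());
        }
    }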
07 Feb, 2009, David Haley wrote in the 16th comment:
Votes: 0
Silenus said:
Actually, by purple book I was referring to the dragon book. The green one is the 1977 original edition, the red one the 1986 edition, and the purple one the 2nd edition. At the time I was looking for compiler books the dragon book was still in the 1986 edition and seemed a bit dated, so I went with Appel's introductory book. I think Andrew Appel is a Princeton CS prof; he wrote "Compiling with Continuations", and CPS was used in the original SML/NJ compiler.

Oh, I've never heard people refer to the books by colors. Come to think of it, I have heard of Appel, I believe in the context you mentioned, I just have never read his book(s).

The 80s version of the dragon book is actually pretty good, although you are correct that it is lacking somewhat when it comes to dynamic run-time environments. I own both that one and the most recent one. (Well, I "borrowed" the 80s one from my dad and never gave it back… :tongue:)

Silenus said:
Unfortunately, because of the mixed type, both container cases you mention occur.

Note that from a type theory perspective the second type of types is actually a special case of the first type of types, where all containers only contain "Object". Calling something of type "Object" is basically saying that it has no type (or more accurately, no known type). I'm not sure how much this matters for you in practice, although it suggests to me that you could go with a fully typed system, and occasionally cop out and say that contained elements are "Object".

Silenus said:
It also supports structs, because objects are so heavyweight.

What difference is there between structs and objects? Is it that structs have no methods, and are essentially aggregations of variables? I know next to nothing about LPC's internals…

Silenus said:
Someone suggested to me that it might be of some interest to do a form of procedure-based analysis of type flows inside each function, i.e. something like HM-style type inference.

Type flow is fascinating and I agree that you can get a lot out of it. Depending on the complexity of the programming language, it is more or less complicated to infer types. When you start having nested types, it can get pretty messy…

Silenus said:
But even if you don't go this far, it should be possible in many cases to know which operator you need to use, since in most cases things do reduce to a single type pair.

Unless things can be modified at runtime, in which case you'd have to recompute all of your "direct links". There are languages where you can add and remove methods at any point you like. At the first point of invocation, you might have a single possible match, but at the next point of invocation, there can be multiple matches. Furthermore, you might have to consider not just a line of code, but all possible control flows into that line of code! Basically, taking the runtime-only approach can really make your life simpler when it's possible to add and remove methods at will.
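One standard middle ground, by the way, is a per-call-site cache (an "inline cache"): remember the last receiver type and target, and redo the lookup whenever that changes. A rough Java sketch using reflection, purely illustrative:

    import java.lang.reflect.Method;

    // A per-call-site cache: remember the last receiver type and target,
    // and redo the lookup when the receiver's type differs (a language that
    // allows method redefinition would also need an invalidation hook).
    class CallSite {
        private Class<?> cachedType;
        private Method cachedTarget;

        Object invoke(Object receiver, String name) throws Exception {
            if (receiver.getClass() != cachedType) {   // cache miss
                cachedType = receiver.getClass();
                cachedTarget = cachedType.getMethod(name);
            }
            return cachedTarget.invoke(receiver);
        }
    }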

But, if your language doesn't allow that, then yes, it is quite possible to infer to some extent which actual function will be invoked at a given point in the code.

In any case, it sounds like you will always need the runtime system, so you can start with just that and take some timing measurements. If it turns out to be a serious impediment to performance, you can try thinking about optimizing it with some kind of static inference.
08 Feb, 2009, Silenus wrote in the 17th comment:
Votes: 0
Sorry for the delayed reply.

Quote
The 80s version of the dragon book is actually pretty good, although you are correct that it is lacking somewhat when it comes to dynamic run-time environments.


I actually have this book as well and it is quite good – but definitely a bit dated; e.g. there is no discussion of the important SSA form, for one.

Quote
What difference is there between structs and objects? Is it that structs have no methods, and are essentially aggregations of variables? I know next to nothing about LPC's internals…


I really don't know too much about how objects are implemented in FluffOS. I do know their data storage requirements are probably somewhat higher than one would expect, since they contain extra information like fields for (typically) pointers to the program, the environment and the list of contained objects, time of load, and other information, which makes them relatively inefficient for allocating small objects. Also, for these objects you cannot access fields directly and must go through functions via the call_other mechanism, which is the default mechanism for invoking functions in other objects and is dynamic like Java's invoke. If you are interested, you can look at the object.c file in the /src/ directory of the current incarnation of FluffOS, v2.15.

The implementation in FluffOS of structs (oddly, called classes) is based on the existing array implementation, where the array is restricted to a fixed size and the fields are type-checked in the relevant manner. Like C structs, you cannot have either static or virtual functions in these constructs, though you can have function variables.

Quote
But, if your language doesn't allow that, then yes, it is quite possible to infer to some extent which actual function will be invoked at a given point in the code.

In general you cannot move/remove functions at will from extant objects, though you can recompile those objects. Different versions of LPC behave differently, however, when it comes to inherited programs. When an inheritable is updated in FluffOS, and I believe in LDMud (but not DGD), the old version remains extant so that programs which inherited from it don't alter their behavior – so to some extent you can perhaps do interprocedural analysis of some sort as well.

Thanks. I think I have a pretty good picture of what needs to be done, though I am somewhat loath to actually do some of it. The LPC (FluffOS) internals are unfortunately a bit messy compared to DGD, where the code is quite clean. What I might end up doing is writing both a compatible LPC parser and one which doesn't quite have so much baggage, though I suspect very few people will use the latter.