Recoder-CS "manual"

Introduction

This manual describes RECODER-CS, the port of the RECODER framework for C# language (RECODER originally worked on JAVA files).

In this description we will suppose that you already know how to write programs using RECODER, and will only be focusing on the changes and differences that have been made to the program. We will not explain C# terms and definitions either.

What's in this document

This document is not a real manual. It only explains the differences between RECODER-CS and RECODER. We will try to explain briefly, what has been changed and how (mostly describe what is not possible, and what is possible). We will not go into details, since there were a big number of changes, and there is still a lot of todos.

If you want to get more detailed information about these changes you need to look at the sources. We have marked all codepieces that have beem disabled or are incomplete. Disabled codes are commented with // and are always marked with a // DISABLED comment and a description of the reason. If there is a todo item, you will find a // TODO comment before the critical section. If you find code commented without any tag: it should not be our change (possibly comes from RECODER).

First of all

RECODER-CS is a modification of RECODER, but is is neither compatible, nor interoperable with RECODER. This means that

you should not use RECODER-CS and RECODER in the same application, because some classes still have the same name, but work pretty different. (By the way, there is also a name clash, since both programs use recoder as the main namespace).
you have to modify your programs that were working with RECODER, so that they work with RECODER-CS.
you have to reimplement your transformations (because of the different semantics and AST of the two languages).

However: During the modifications we were trying to make as little changes as possible, in order to maintain maximal compatibility with RECODER. This means that you will not have to change much in your programs, since RECODER-CS is almost API-compatible with RECODER.

Features

RECODER-CS can parse C# code, build up an AST, and run a semantical analysis on it. It can show you what classes are available, what methods and fields they have, handle variables and resolve references to variables, fields and methods. Sources can be pretty printed and transformed (in a limited way).

Limitations

The abilities of Recoder-CS are still limited. The biggest limitation is that you can not parse code which contains preprocessor directives, unsafe extensions, or is not available in source form.You can use transactions on the program model, but you cannot unroll/revert/undo them. Also the kits in recoder.kit are still pretty incomplete. Also the semantical analysis has still some bottlenecks (see later).

You must also consider that RECODER-CS implements a parser for the ECMA C# language, and not some vendor specific extension.

Infrastructure

Just like RECODER, RECODER-CS is also based on a central program model of the sources, and service modules, which can manipulate, analyze, build and update that model.

These services have almost the same API, as in RECODER, except the services, which were working on bytecode - those have been removed. Some additions have been made to SourceInfo to deal with the new types and constructs of C#.

In the following sections, we will just briefly describe each service of recoder, and the changes we made to them.

Source file repositories (recoder.io)

Source file repositories are responsible for loading and saving compilation units, loading classes and etc.

Major changes:

First of all: RECODER-CS supports neither bytecode parsing, nor loading classes by reflection (which would be impossible to implement, since we program in JAVA...)

Another important change has been done to the source file repository as well. Since compilation units in C# may have multiple public class declarations with multiple namespaces it is impossible to find files (compilation units) by their class names. The way we solved this problem was that whenever the source file repository is created, it parses each and every compilation unit available in the input path.

This means however that you must have the sources of the core library (e.g. System.Object, System.String), etc available in the input path to get the source info working normally. In the CVS you can find the corlib of the MONO project, which shall be usable for this purpose.

The input path may be (should be) specified by the input.path system property, the environment variable CLASSPATH is ignored. To set the input path property, you either have to use the -D parameter at program start, or

System.getProperties().put("input.path","<whatever>")

at the beginning of your program, or alternatively, you can use the

ServiceConfiguration.getProjectSettings().getSearchPathList().add("<whatever>")

method. As usual, the path must be specfied in a list, separated by a semicolon (;) or a colon (:) according to your platform (Windows/UNIX).

In some cases the source file repository may die, if it finds an error (for example an unresolved reference) while loading the classes in the input path. This is because the repository uses its own error handler at initialization, and this error handler terminates at the first error. A workaround is that you only place correct files into the input path and read the ather files only later using ServiceConfiguration.getProjectSettings().getSearchPathList().add("<filename>"). Before those classes will be loaded you should replace the DefaultErrorHandler with your own handler, which tolerates errors. Note, that this "bug" should have been fixed...

Minor changes:

ProjectSettings does not want to ensure anymore that system classes are in path. It is your responsibility to add the source of those to the input path.

When creating a new compilation unit in a transformation, you have to create its DataLocation first. (This is because the repository can not decide by itself where the file belongs to.)

Room for improvement:

Currently, the repository loads all compilation units, and runs the TypeFinderVisitor to collect the types in that compilation unit. But it runs it for each type separately. We should add a cache here.

Service configurations (recoder.ServiceConfiguration)

The changes in the recoder.io architecture made necessary some changes in the ServiceConfiguration classes. Because the SourceFileRepository now parses all the classes on initialization, we needed to include

error handling to the initialization processes. The service initializator methods now may throw an InitializeException.
support for creating stripped down ServiceConfigurations. For tasks that do not require semantical analysis, and special source repository handling (like pretty printing), it is now possible to define ServiceConfigurations that only create the ProgramFactory and the ProjectSettings services. We use such a configuration in our test suite. This allows pretty printing of sources with incomplete semantical information (e.g. missing references too).

TODO: Clean up the structure of the ServiceConfigurations.

Parser (recoder.parser)

The parser is still an LL-parser generated by JAVACC.

Since we could not find a grammar for C# (the only available ECMA grammar was left-recursive and so not suitable for JAVACC) we have decided to derive the parser from the JAVA parser instead of implementing the ECMA specification directly. The two parsers (although they seem to be similar) have a very similar set of rules, but inside they basically have not much in common.

Using the same parser as in RECODER allowed us a big reuse of the tree classes already written for JAVA (classes in recoder.java). You should though not forget that almost all files have changed a little bit, and so the JAVA and C# classes are not compatible with each other.

At some places (because of the left-recursiveness) we had to use a pretty big lookahead, which might make the parser a bit slower, than the JAVA parser. We don't think that these problems can be solved by any LL parser, so we let it as it is.

Abstract tree - AST(recoder.csharp)

There were a big number of changes, which we can not all count. You should see the API documentation and/or the parser grammar for reference. (Since this part is also very weakly documented in RECODER, these are your only hopes....) Here we only mention the most important changes.

Attributes (recoder.csharp.attributes)

C# allows associating metadata stored in the attributes of some program elements. The interface AttributableElement is implemented by all elements, which can have attributes. On those, you can use getAttributeSectionCount() and getAttributeSectionAt() to get the attribute sections of the element. Then on the attribute sections you can use getAttributeCount() and getAttributeAt() to get the attrbiutes defined by the section, and the attributes you can decompose as well (see the API-doc, it is very straightforward). You can also read the AttributeTarget of the attribute section (those modifiers are stored in recoder.csharp.attributes.modifiers).

Note: there is no semantic analysis on attributes (you can not obtain the meaning of the metadata), only type references are resolved. This is because interpreting the metadata would require a knowledge of all system attributes, and so would be a lot of work.

Expressions (recoder.csharp.expression)

Existing expressions have not been changed. Introduced new operators like CheckedOperator, UncheckedOperator, TypeofOperator, AsOperator.

Literals have been changed to implement the ReferencePrefix interface, since in C# 123.ToString() is a valid expression (this is called boxing and it is resolved by the semantical analysis).

References (recoder.csharp.reference)

ArrayLengthReference is obsolete. In C# arrays are boxed to the System.Array type when used as a prefix. The type System.Array then has a Length property (among others).

A new reference is the UncollatedMethodCallReference (a subclass of UncollatedReferenceQualifier), which is created instead of the MethodReference. This is needed, because you cannot distinguish a delegate call from a method call in the plain analysis. TheUncollatedMethodCallReference will be resolved by the semantical analysis, and replaced by either MethodReference or the DelegateCallReference (also a new class).

Multidimensional arrays

Multidimensional arrays have been added to the model. In RECODER dimensions of an array have been stored in an integer (a[][][] was stored as dimension 3), C# however makes difference between real multidimensional arrays and arrays of arrays (as in JAVA). So you can write something like a[,][] which is not equivalent to a[][,] (although dimension is 3 in each cases).

The concept we used for storing the new dimensions was to use an array of integers instead of a single integer. (Another possible solution would have been to introduce an type reference to a type, whose basetype is also an array - this would have been more complex). With our solution a dimension of a[][,,][,] maps to an integer array of int[1][3][2], while the single expression a has the dimension of either null, or int[0] (int array with length 0) - both is possible.

This kind of dimension mapping was used in FieldDeclarations and VariableDeclarations, TypeReferences.

In ArrayReference (array indexing operator) we used however an other solution. The reason is that here you can also use expressions for indexing the reference. Here, we really use references to array references, and we have added dimensions too. This was needed, because here we also had to store the expressions of the indexes too. So a[3,2][4,5][6] maps to an array reference with dimension 1, expression "6", base type of an array reference with dimension 2, expressions "4","5", basetype of an array reference to the basetype a, dimension 2, expressions "3","2". Basically we nest the references into each other.

Namespaces, imports

In C# there is no longer a relation between namespace (package), compilation unit and assembly (library).

PackageReference is now NamespaceReference (which is by the way more logical) and PackageDeclaration was replaced by NamespaceSpecification. When declaring namespaces you must be aware of the fact that C# uses a completely different semantics to declare namespaces. In C# namespaces are not implicitly specified, but there can be multiple NamespaceSpecifications in a unit. These specifications can be put inside each other, and every specification may have its own imports (called usings). So you can write something like

using x.y;


namespace a {

    namespace b.c {
	
	    using z;
	
        namespace d {
		
			class A {}
		
	    }
	}
}

Here A is in the namespace a.b.c.d. While the three NamespaceSpecifications only have names "a", "b.c", and "d". To make life more easier there is a method getFullName(), which returns the full name of the namespace (for example "a.b.c" for the second namespace). Todo: you can make a more performing (but less robust) implementation of this method by caching the full name.

An other problem was with the different structure of C# compilation units and JAVA compilation units.So, when working on the AST you should consider this.

C# has also support for namespace and type aliases called using-alias (e.g. using ws = System.Web.Services makes a namespace alias). There is no support for these in the semantical analysis yet, but it is not too difficult to add support for those.

Type declarations

In RECODER there was an assertion that types are either primitive types or array types, or class types; which is true in JAVA, but not true in C#. So, we had to introduce a new level of abstraction called declared type (DeclaredType), which represents a type that is declared in the program. Class types are declared types which can have members, but enums and delegates are also type declarations, but they don't have members. So we have introduced TypeDeclaration and ClassTypeDeclaration abstract classes. TypeDeclaration implements DeclaredType only, while ClassTypeDeclaration implements ClassType.

However, we had to leave the member declarations in the TypeDeclaration, instead of pushing them down into ClassTypeDeclaration becaus there would have been too many modifications, and enums (which are not class types) can also have fields (but not methods). This "cheat" however is hidden from the outside world, since only ClassTypeDeclarations have methods to report members.

Class types

Inheritance was also a problem, since C# makes no syntactic difference between inheritance and implementing an interface.

A new class type is the StructDeclaration, which declares a struct. There are some semantical differences between classes and structs, but in the AST the only difference is that a struct can have no destructor, and can not inherit from other classes. These constraints are currently not checked by the parser, which means that you need to check at the semantical analysis.

Declared types

Two new classes are the DelegateDeclaration and EnumDeclaration.

Delegates are types for methods, which have a defined signature (parameters). A variable with a delegate type can have a number of methods assigned to it, and these methods can be invoked by using the variable as if it was a method.

Enums have members (with optional initializers) and a basetype. Enum members (EnumMemberDeclaration / EnumMemberSpecification) behave exactly like fields, therefore they extend FieldDeclaration and FieldSpecification. The initializers are not checked for semantic correctness by the parser, but by the type analysis.

About enum members you also have to know that - since enum members are always declared one-by-one without a type declaration - there is always one specification and one declaration for each member. And since EnumMemberDeclarations have no type implicitly given they return the basetype of the enum as their types (which is true). An eleganter solution could have been to introduce only EnumMemberDeclaration wich could act as a declaration and a specification, but we found it too complicated (the semantical analysis should have been rewritten).

In the future RECODER-CS and RECODER shall be refactored so that they no longer make differences between specifications and declarations (since this is senseless in JAVA and in C#). Then we could also correct this problem.

Class members

Fields and methods are the same.

Properties are special fields with accessors (baseclass Accessor) those you can get with the getGetAccessor() and getSetAccessor() method. As at the enums, the same rule applies here too: since properties are fields they have a declaration and a specification subclassing the field declarations and specifications, but C# allows only one specification at a time. So again, there is always one PropertyDeclaration with one PropertySpecification.

Events are also fields with a type of a delegate, and supporting some "advances" operations. Therefore they have EventDeclaration and EventSpecifications. However (to make life a bit harder) C# also allows events to be defined like properties. In this case we have one EventDeclaration and exactly one EventSpecification with two accessors of course. For convenience in the AST we made no difference between normal and property-like events (maybe we should have?), but stored the two accessors in EventDeclaration. (If it is a normal event these accessors are null of course.) We might want to change this hack in the future.

Operator overloads overload operators. Instead of implementing many classes (e.g. PlusOperatorOverload) we have one field indicating the type of the operator. We consider this a better solution, since we can use a switch statement instead of a bunch of instanceof operators. And again: here is also no semantical check (e.g. we should check that a binary operator has exactly two arguments one of which must have the same type as the class itself).

Indexers are basically overloads of the array ( [] ) operator, but they differ from the operators, so we gave them an own class (IndexerDeclaration).

Abstract program model (recoder.abstraction)

The abstract model is pretty good self-explaining. Looking at the following figure shall make it clear (click to enlarge)

Methods, fields are the same, but as methods don't report their exceptions in C#, getExceptions() is deprecated, and returns null.

The most important change was to introduce the DeclaredType as an abstraction for delegates, enums and class types (which now also includes structures), and changing classes and namespaces so that they can hold declared types instead of class types.

New are the abstract interfaces Enum, Delegate, OperatorOverload, Indexer, Event and Property, they are very self-explaining. However currently there are no methods to report these more specific members, you can only use the usual getMethods() and getFields() methods to get them. Events and properies are reported as fields, operators and indexers as methods. In a similar way: you can use the getDeclaredTypes() method to get all enums, delegates and class types in a class or namespace, but you can not query them one-by-one.

The DelegateConstructor class is needed, because a variable with delegate type must be also constructed (imagine this as a default constructor of a delegate) with a method as parameter, e.g.


  public delegate void MyMethod (int a);
  
  class X {
  
     MyMethod mm;       // Field with delegate type
  
     public m(int a) {} // This is a method which will be bound
	 
	 
	 public void Main(string [] args) {
	    mm = new MyMethod(m); // This is not a real constructor!    
		mm(1); // Calls m(1)
	 }

So here, there is a virtual call to a delegate constructor which has as parameter the name of the variable.

Semantical analysis (classes in recoder.service)

These services are used by the abstract model to synthetise information.

Name analysis (DefaultNameInfo) can now handle the new primitive types, deal with C#'s namespaces, load types from the source file repository.

Source analysis (DefaultSourceInfo). Works, except for a few "limitations". First, there is no support for operator overload (this is mostly the method getType(Expression)), secondly there is no using-alias support. Thirdly the problem with the delegates: in the previous example an access to the delegate mm (which is bound to method m) also means, that m has a reference in the method Main.

Visibility handling. Yes, here are also problems: C# has visibility modifier internal, which is visible to members of the same assembly. However, the components of an assembly are not determinable until compile time, what means, that we don't know which classes belong together. Therefore we assume, that internal equals public for convenience.

There is support for the new primitive types, the two type-aliases (object and string). Boxing is resolved when you use a variable as a reference prefix. There is no boxing support, when you interpret expressions (again method getType(Expression))

DefaultConstantEvaluator should also handle the new primitive types (unsigned int for example), and their new literals, such as "123ul" (unsigned long). However there is possibly a design issue with the latest feature, since unsigned longs will be mapped as supercharged JAVA longs, and interpreted as a negative number.

The classes DefaultConstantEvaluator and the method getType(Expression) in DefaultSourceInfo are incomplete. Also there are some methods which need to be reviewed, whether they match also the C# specifications not only the JAVA specs.Check the // TODO tags in the sources to get more detailed information.

Transformations (recoder.kit)

You can do transformation directly on the AST (e.g. insert/remove/replace a node), using the attach() and detach() methods in ChangeHistory. You can write your own transformations by subclassing the Transformation class, and using the attach(foo, bar) methods. These methods are the same as in JAVA, and so you cannot attach the new classes of C# (e.g. can not attach an EnumDeclaration directly). It may be possible to attach those too, if the attached node has a superclass which can be attached. For example you do can attach an EnumDeclaration using the attach(TypeDeclaration) method, since EnumDeclaration is a TypeDeclaration.

You can do the partial parsing feature in CSharpProgramFactory using the parseWhatever() methods.

There is limited support for the higher level transactions in the recoder.kit packages:

CommentKit is senseless, since it creates JavaDOC comments.
ExpressionKit should be usable, since there is not much difference between the expression logic in C# and JAVA.
NameGenerator was not changed. (Seems to contain nothing language dependant).
ModifierKit can now create C# modifiers. Constants no longer come from recoder.bytecode.AccessModifier, but are kept in the Modifiers interface.
NamespaceKit can only be used to create namespace references.
StatementKit is unchanged, but shall work mostly since C#'s statements and JAVA statements are very similar. We don't know if it can deal with get and set accessors.
VariableKit was not changed, but should be usable.
UnitKit contained a lot of methods managing imports (usings), some of them is rewritten, some of them became senseless, since C# only has multiple imports.

Please note that transformations are still untested, use them with care and lot of testing.

The complete transformations (recoder.kit.transformation) were removed.

Utilities (recoder.util)

These files have not been touched. Suggested is therefore that you use them with care.

Examples

Here are some examples, which demonstrate how you can use RECODER-CS.

Example 1: Pretty Printer

The pretty printer is effectively the Hello World application of the RECODER-CS.

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;

import recoder.DefaultServiceConfiguration;
import recoder.ParserException;
import recoder.convenience.Naming;
import recoder.csharp.CompilationUnit;
import recoder.csharp.PrettyPrinter;
import recoder.list.CompilationUnitList;
import recoder.service.SourceInfo;



public class ExamplePrinter extends PrettyPrinter {

	public static void main(String[] args)
		throws IOException, ParserException, Exception {
		DefaultServiceConfiguration sc = new DefaultServiceConfiguration();

		ExamplePrinter epr = new ExamplePrinter(sc, new PrintWriter(System.out));
		CompilationUnitList list = sc.getSourceFileRepository().getCompilationUnits();
		epr.printCompilationUnits(list);
	}

	private DefaultServiceConfiguration serviceConfiguration;
	private SourceInfo sourceInfo; // cached	

	public ExamplePrinter(DefaultServiceConfiguration sc, Writer out) {
		super(out, sc.getProjectSettings().getProperties());
		this.serviceConfiguration = sc;
		sourceInfo = sc.getSourceInfo();
	}

	public void printCompilationUnits(CompilationUnitList cus) throws IOException {
		for (int i = 0, s = cus.size(); i < s; i += 1) {
			CompilationUnit cu = cus.getCompilationUnit(i);
			printCompilationUnit(cu);
		}
	}

	public void printCompilationUnit(CompilationUnit cu) throws IOException {
		String name = cu.getDataLocation().toString();
		System.out.println("Visiting compilation unit:" + name);
		visitCompilationUnit(cu);
		getWriter().flush();
	}
}

First, we create the ServiceConfiguration. In this case we use the DefaultServiceConfiguration. On initialization the service configuration loads, parses and analyses every source file in the input path.

Then we instantiate our ExamplePrinter class, which inherits from the recoder.csharp.PrettyPrinter class. This is the actual implementation of the pretty printer using the Visitor design pattern.

Now we ask the SourceFileRepository service to give us every compilation unit in the model. Now we can use the pretty printer to print out the compilation units.

Notes:

The startup time for this example is longer than one would anticipate. This is because recoder performs semantical analysis on startup. For a more performant pretty printer we could use the stripped down ServiceConfiguration and the ProgramFactory service to parse the given source files, and feed the created compilation units to the PrettyPrinter.
The String name = cu.getDataLocation().toString(); line assumes that every CU has an associated DataLocation. This is the case when we read them from files. However, programmaticaly generated CUs not always have a Location.

Other examples

You can find more examples in the examples directory.

The SyntaxPrinter program reads and displays the AST of the given source file. It can be used to debug and view the output of the parser.

The PlainAnalysis program gives you information about all classes in the input path. It writes you information about class members and references. Looking at the source you will be able to understand how RECODER-CS works.

The Sourcerer program is the back-ported version of Sourcerer from RECODER. It visualizes your classes and there members, you can see the results of the analysis.

Tests

Here is what we have used for testing:

To test the AST: We have taken the test from the Mono project (about 400 files). There is a JUnit test, which parses each file, and fails, if there was an error. There were no errors.
Pretty printer test: We took the same files read them and printed them out. Then we used a special diff function (which ignores newline and whitespace differences) to check the result. There were no differences (of course except for whitespace).
Testing the semantical analysis: We could not find out a good automated test for this. We have ported the SOURCERER example from RECODER. It is in the CVS. We use this for testing.

Todo-s

Well, there is still a lot to do. Some of them are here:

Support unsafe extensions. - This is complicated, because basically you have to write a C++ analyzer, which can handle pointers, malloc, etc.
Implement preprocessor - This is a pretty simple task, maybe we can get somebody do it for us.
Add "advanced" reference resolving to delegates. If a method is bound to a delegate, then every access to that delegate (DelegateCallReference) must also be interpreted as a call to the method.
Operator overload in expressions, constants. Rewrite DefaultConstantEvaluator and getType(Expression) in DefaultSourceInfo - This will be worked on in the future.
"Default operator overloads". For example to delegates you can add a method with the operation += .
Type checking enums. Variables with type of an enum have the type of the basetype of the enum. This must be considered.
Bytecode parser - This is also a nice task. In the assembly metadata there should be enough information to build a simple model, but it is not examined yet.
Rewrite transformations - Rewrite all transformations and kits for C#.
Add support for using alias.
Add support for parameter array while checking signature matches.