Does your static analysis tool see the C# source underlying your C# source? I am a compiler engineer at ShiftLeft, the designer and (main) implementor of the programming language layer of our static analysis tool for C# and Python. In this article, I discuss a bit about the static analysis of C# programs.
When you compile your C# program in Visual Studio, a number of DLL and/or EXE files is produced. These files consist of, among other things, CIL (Common Intermediate Language) bytecode—which is translated from C# source by the Roslyn compiler. Upon the launch of an application, such bytecode is JIT-compiled and executed under the environment of the CLR (Common Language Runtime). At this point, it would be good if you have static analysed your program.
Static analysis tools do not execute programs; they reason about a program’s properties by inspecting its syntax—in this case, that of its C# source or its CIL bytecode. On one hand, a static analysis tool benefits from inspecting syntax close to the one that is executed, so the issues that it diagnoses are actual issues during execution. On another hand, a static analysis tool benefits from inspecting syntax close to the one in which a program is written, so the issues that it diagnoses better capture the intent of a programmer and are identifiable during development, i.e., before execution. For security-oriented tools, in particular, an extra benefit of source syntax static analysis is that certain vulnerabilities are API-driven and, to be diagnosed, require the program names are recognisable (what is not always in bytecode syntax).
Due to the trade-off in the static analysis of C# source and CIL bytecode, it is natural that a compiler engineer is interested in the correspondence between the two syntaxes. Perhaps you, as programmer or security expert, is interested in learning about that too…
A Peek at the Syntax of CIL Bytecode Syntax
Consider the C# source below (on the top left). Its syntax is rather similar to the — selected — syntax of its CIL bytecode translation (on the right). For instance, in both syntaxes, we see the declaration of a class whose name is T
. Yet, there are little differences between the source and the bytecode:
- C#’s keyword for a class declaration is
class
; CIL’s keyword is.class
. - In C#, a class implicitly extends
object
(on the middle left); in CIL, inheritance is explicit. - C#’s syntax allows certain “abbreviations” (on the bottom left), e.g.,
object
abbreviatesSystem.Object
andint
abbreviatesSystem.Int32
. - In CIL, a field of
T
, likeV
, is declared with keyword.field
.
Despite the obvious correspondences in the program above, in general there are a lot more dissimilarities — rather than similarities — between the syntax of C# source and CIL bytecode. (Otherwise, what would be the purpose of the latter?) For instance, there are no expressions in the C# source and I omitted all instructions from its CIL bytecode. Nevertheless, the correspondence between the two syntaxes still is interesting, as there are many constructs in which they overlap, e.g., in regards to declarations, as we have just seen.
The C# Source and CIL Bytecode Syntax Duality
Consider the C# source below (on the top left), and the declaration of class U
: it declares an auto propertyP
. Now, observe the CIL bytecode translation (on the right) of U
: it declares, as members of U
, one field'<P>K__BackingField'
and two functionsget_P
and set_P
. A static analysis tool that inspects the syntax of C# source should understand (i) the declaration of propertyP
as that of a field, (ii) every expression that reads P
, like V = P
, as a call to get_P
, and (iii) every expression that writes to P
, like P = 1
, as a call to set_P
; even so (ii) and (iii) look like assignment expressions. Conversely, a static analysis tool that inspects they syntax of CIL bytecode should understand those fields and functions as belonging to an auto property. In fact, there exists a C# source (on the bottom left) that is semantically equivalent to the original one but whose correspondence to CIL bytecode is quite exact.
There are a number of approaches to devise a static analysis tool that is enabled with a dual understanding of C# source / CIL bytecode. For instance, we could inspect the syntax of CIL bytecode and, as a post_-_analysis action, map — through a separate mechanism — the results back to C# source. But elaborating such a “reverse intelligence” a mechanism is not trivial. Also, this approach would fall short for API names in the C# source that are not recognisable in the CIL bytecode.
Another approach to the dual understanding of C# source / CIL bytecode is to inspect the syntax of C# sources and, as an in-analysis action, desugar (i.e., eliminate syntactic sugar) any “high level” syntax into an equivalent “low level” syntax that is closer to CIL bytecode. Such a desugaring may be logical or physical. The idea behind a logical desugaring is to tweak the algorithms and/or data structures of your static analysis to reason about the original C# source as if it was the desugared one. The idea behind a physical desugaring is to actually modify the syntax of a C# source and submit to static analysis the desugared C# source, instead of the original one.
In a succinct comparison between the logical and physical desugaring, one advantage of the latter is that we can verify whether or not our modification to the syntax of a program is valid (and, to a certain extent, that it is semantically equivalent) by recompiling the C# source; in the former, a bug in the tweaking of an algorithm and/or data structure may not be noticed — until a false negative/positive error “detects” it.
C# Source Syntax Rewriting
ShiftLeft’s static analysis adopts the approach of inspecting the syntax of C# sources, with physical desugaring as the in-analysis action. Physical desugaring is accomplished by means of technique known as syntax rewriting. Our syntax rewriting tool is built with the Roslyn compiler; specifically, by taking advantage of its CSharpSyntaxRewriter
. We have made this tool open source at under the Apache 2.0 license; you can find it at https://github.com/ShiftLeftSecurity/SharpSyntaxRewriter.
In The (C) Sharp Syntax Rewriter Tool, you will find a variety of rewriters. One of them is the DecomposeNullConditional
rewriter: it rewrites a statement containing a null-conditional expression like obj?.f()
into the statement if ((object)obj != null) obj.f()
. Another rewriter is UninterpolateString
: it rewrites an expression like $"hi {name}"
into the expression string.Format("hi {0}", name)
. There are others… Below is an example in which a program is shown with its original C# source and CIL bytecode translation (on the top), and with its desugared C# source and CIL bytecode translation (on the bottom). Observe the calls to Add
items to a list, GetEnumerator
, get_Current
, and MoveNext
. All of the rewriters in The © Sharp Syntax Rewriter Tool desugar the syntax of a C# source as according to (a best effort interpretation of) the language specification.
Applications of The (C) Sharp Syntax Rewriter Tool
At ShiftLeft, the end goal of C# source syntax rewriting is static analysis. But there are other applications of our tool. Suppose that a former colleague of yours wrote a custom C# analyser a while ago, and, due to language evolution, it no longer is usable, as recent C# constructs are not recognised: you could use our tool to rewrite unrecognised constructs into (legacy) recognised ones. Or suppose that you organisation would like to ensure a coding convention for a code base: you could use our tool as the basis of a code refactoring plugin. Of course, these two examples assume that the The (C) Sharp Syntax Rewriter Tool offers the rewriter for the task you need — otherwise, you could still extend it (let us know if you need support).