Team

Nurudeen Lameed
McGill University, Montreal.

Motivation

AToM³ [1] is a graphical meta-modeling tool under development by the Simulation and Modeling group at McGill University. It is written in Python and can create models graphically in several formalisms, including Causal Block Diagrams (CBD) and Petri nets. A causal block diagram is a graph of connected operational blocks; AToM³ transforms the concrete syntax (the block diagram) into an abstract syntax (the dependency graph). The software generates Python code representing the models. An experiment (simulation) is then set up for some number of iterations, and results are generated by computing each block at every iteration. Scientific models often require intensive computations over a large number of iterations, yet Python interprets the computation for every block in the CBD at every iteration. Furthermore, in many cases the values of some blocks remain constant throughout the simulation, but the current implementation of CBD in AToM³ recomputes these blocks at each iteration. Together, these factors result in low performance for large models.

Proposal/Requirements

As noted above, code interpretation combined with redundant computations generally results in low performance for large models. To speed up computation-intensive applications, we propose the following optimizations over the existing implementation.

Constant propagation:- this technique may improve performance of CBD models. For instance, suppose
```
    z = x + y
```
where x = 2 and y = 3; computing z over many iterations, say 1000000, i.e.
```
    for iteration = 1 to 1000000
       z.signal[iteration] = x + y
```
is unnecessary. Because x and y are constants, we can replace them with their values (x with 2 and y with 3) and perform the addition only once. An optimized execution is
```
    z.signal[0] = 2 + 3	  # x = 2; y = 3
    for iteration = 1 to 1000000
       z.signal[iteration] = 5	  # or simply print z for the current iteration
```
Computing z at each of the remaining iterations thus reduces to copying (or printing) the value computed at time 0, which removes one addition per iteration.
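
The same idea can be sketched in Python. This is a minimal illustration rather than AToM³ code: the function names and the plain list standing in for `z.signal` are assumptions made for the example.

```python
def simulate_naive(x, y, iterations):
    """Recompute z = x + y at every iteration."""
    signal = []
    for _ in range(iterations):
        signal.append(x + y)
    return signal


def simulate_folded(x, y, iterations):
    """Compute z once, then copy the value for every iteration."""
    z = x + y  # folded once because x and y are constants
    return [z] * iterations


if __name__ == "__main__":
    # Both versions produce the same signal; only the amount of work differs.
    print(simulate_naive(2, 3, 1000) == simulate_folded(2, 3, 1000))
```

The folded version performs a single addition regardless of the number of iterations.
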
Eliminating redundant computations:- this can improve performance significantly. For instance, consider an adder block of a model that performs this computation
```
    for iteration = 1 to 1000000
       s.signal[iteration] = x + y + z + a + b
```
x, y, z are constants; a, b are variables.
Repeated computation of x + y + z is unnecessary and can therefore be avoided using certain algebraic properties of the block. By employing reassociation [2], it is possible to separate the expression into parts that are constant, invariant and variable. In this case, the fact that addition is commutative and associative allows us to split the expression above into a constant part and a variable part, thus
```
   x + y + z + a + b = temp + var
```
where
```
   temp = x + y + z 
```
and
```
   var = a + b 
```
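An optimized execution of the adder is
```
    temp = x + y + z
    for iteration = 1 to 1000000
       s.signal[iteration] = temp + a + b
```
The original loop performs 4 additions per iteration (4 * 1000000 in total), while the optimized version performs 2 additions per iteration plus 2 to compute temp (2 * 1000000 + 2 in total). This results in savings of 2 * 1000000 - 2 additions!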
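
A minimal Python sketch of this reassociation follows. The function names and the lists `a_values` and `b_values` (one value of a and b per iteration) are assumptions made for the example; they are not part of AToM³.

```python
def adder_naive(x, y, z, a_values, b_values):
    """Evaluate x + y + z + a + b in full at every iteration."""
    return [x + y + z + a + b for a, b in zip(a_values, b_values)]


def adder_hoisted(x, y, z, a_values, b_values):
    """Hoist the constant part out of the loop, then add the variable part."""
    temp = x + y + z  # constant part, computed once
    return [temp + a + b for a, b in zip(a_values, b_values)]


def additions_saved(iterations):
    """Additions saved by hoisting: 4 per iteration versus 2 per iteration plus 2."""
    return 4 * iterations - (2 * iterations + 2)


if __name__ == "__main__":
    a_values = list(range(10))
    b_values = list(range(10, 20))
    print(adder_naive(1, 2, 3, a_values, b_values) == adder_hoisted(1, 2, 3, a_values, b_values))
    print(additions_saved(1000000))
```

With integer inputs the two versions agree exactly. With floating-point inputs, reassociation can change rounding, so results may differ in the last digits.
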
Compiling code:- compiled code typically runs faster than interpreted code. To further improve the performance of CBDs, the optimizing compiler translates the computation for each block of the CBD into C code and integrates all the computations of the model into one optimized C program. The C programming language is both efficient and portable.
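
To make the translation step concrete, here is a small Python sketch that emits C source for a list of blocks. It is a hypothetical illustration only: the block representation (name, operation, inputs), the helper names `emit_block` and `emit_program`, and the shape of the emitted C are assumptions, not the code generator used in this project.

```python
# Hypothetical mapping from block operations to C infix operators.
C_OPERATORS = {"add": " + ", "mul": " * "}


def emit_block(name, op, inputs):
    """Return one C statement computing block `name` from its inputs."""
    return f"double {name} = {C_OPERATORS[op].join(inputs)};"


def emit_program(blocks):
    """Return a complete C program; blocks must be in dependency order."""
    lines = ["#include <stdio.h>", "", "int main(void)", "{"]
    for name, op, inputs in blocks:
        lines.append("    " + emit_block(name, op, inputs))
        lines.append(f'    printf("{name} = %f\\n", {name});')
    lines += ["    return 0;", "}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # z = 2.0 + 3.0 feeds w = z * 4.0; inputs are block names or C literals.
    blocks = [("z", "add", ["2.0", "3.0"]), ("w", "mul", ["z", "4.0"])]
    print(emit_program(blocks))
```

The emitted text can be saved to a file and built with any C compiler.
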

Design/Models

The optimizing compiler developed for this project generates efficient C programs for causal block diagrams. It works by in-lining C code into the AToM³ simulator program. AToM³ flattens models during the first iteration and generates a dependency graph for every model. The optimizer analyzes these dependency graphs to determine, among other things, which inputs are constants, which blocks require repeated computation, and which do not. Nodes of the dependency graph are marked according to whether they are part of varying computations. For example, a block with mark = 1 should be computed only once; any other value indicates that the block must be recomputed at every iteration. The following rules are used to mark nodes:

1. If a block is a constant block, then the mark for the node is 1.
2. If a block is a time-delayed block, then the mark for the node is 0.
3. If a block is not a constant block but all its input blocks are marked 1, then the mark for the node is 1; otherwise, the mark is 0. Loops are handled by considering all the input blocks to the current block and checking rules 1 and 2 above.

Combining the above rules with the dependency graph, appropriate code is generated for every block and constant values propagated, where necessary.
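
The marking rules can be sketched in Python as a fixed-point iteration over the dependency graph. This is an illustrative sketch under stated assumptions: the dictionary representation, the kind labels ("constant", "delay", and anything else), and the function name `mark_nodes` are hypothetical. The sketch also makes a conservative choice that may differ from AToM³: blocks inside a loop stay marked 0, and a non-constant block with no inputs is treated as varying.

```python
def mark_nodes(blocks):
    """Mark each block 1 (compute once) or 0 (recompute every iteration).

    blocks maps a block name to (kind, list of input block names).
    """
    # Rule 1: constants start at 1. Everything else starts at 0,
    # which also covers rule 2 for time-delayed blocks.
    marks = {name: 1 if kind == "constant" else 0
             for name, (kind, _) in blocks.items()}

    changed = True
    while changed:
        changed = False
        for name, (kind, inputs) in blocks.items():
            if kind in ("constant", "delay") or marks[name] == 1:
                continue
            # Rule 3: promote a block once all of its inputs are marked 1.
            if inputs and all(marks[i] == 1 for i in inputs):
                marks[name] = 1
                changed = True
    return marks


if __name__ == "__main__":
    model = {
        "x": ("constant", []),
        "y": ("constant", []),
        "sum": ("adder", ["x", "y"]),    # fed only by constants
        "d": ("delay", ["acc"]),         # time-delayed block
        "acc": ("adder", ["sum", "d"]),  # depends on the delay
    }
    print(mark_nodes(model))
```

In this example, `sum` is marked 1 because both of its inputs are constants, while `d` and `acc` keep mark 0 and are recomputed at every iteration.
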

Implementation

This project is implemented in Python. Code generation is in-lined with the AToM³ code. The implementation uses the LAPACK [3] library to solve the systems of linear equations generated when an algebraic loop is detected. A model may contain more than one loop; each loop is unique and generates its own system of linear equations. A loop generates a system of linear equations of the form:

      Ax = b

where A is the coefficient matrix formed from the relationships between the variables given by the vector x, and b is the vector containing the right-hand sides of the equations. Because the dependency graph at time = 0 might differ from the dependency graph at time > 0, code generation occurs only during the first two iterations (time = 0 and time = 1). The generated C code includes functions for printing the results to the standard output device or to a file, and for collecting timing statistics.
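
As an illustration of how an algebraic loop becomes a system Ax = b, consider a hypothetical loop with two blocks, x1 = 0.5 * x2 + 1 and x2 = 0.5 * x1 + 2 (these equations are invented for the example). Moving the unknowns to the left-hand side gives x1 - 0.5 * x2 = 1 and -0.5 * x1 + x2 = 2. The project solves such systems with LAPACK; the sketch below uses a pure-Python Gaussian elimination with partial pivoting as a stand-in so that it runs without external libraries. The function name `solve_linear_system` is an assumption.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]

    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("singular coefficient matrix")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from the rows below the pivot.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


if __name__ == "__main__":
    coefficients = [[1.0, -0.5], [-0.5, 1.0]]
    right_hand_side = [1.0, 2.0]
    print(solve_linear_system(coefficients, right_hand_side))
```

The exact solution of this example is x1 = 8/3 and x2 = 10/3; the printed values agree up to floating-point rounding.
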

Experiments

This section describes some of the experiments that were developed and run to test the capabilities of the optimizing compiler.

The compiler was tested with many models, one of which is a model with two linear loops. The generated C program can be found here. The program contains the computations for all blocks at time zero and for further iterations. The existence of a loop causes code that calls the LAPACK library routine dgesv to be generated. The values of the variables in the strong components are the same for further iterations and are therefore copied for subsequent iterations. The optimizing compiler was also tested using the circle test model; the generated code is shown here and the result of running the generated C program is shown below.

Performance Evaluation

The circle test and Physbe experiments were run with both AToM³ and the corresponding C programs. The following table compares the performance of the models in AToM³ (Python) and in the corresponding C programs.

Table: Performance comparison of AToM³ with the generated C program

| Application | Python (time in seconds) | C (time in seconds) |
|---|---|---|
| Physbe (3,000 iterations) | 23.87 | 0.82 |
| Circle test (60,000 iterations) | 22.33 | 0.94 |

AToM³ also performs tasks that convert the concrete syntax of a model into abstract syntax (flattening, building the dependency graphs, and others). However, this does not have a significant effect on the total time: collecting the timing statistics from the third iteration to the last iteration yields similar results. The C program generally outperforms Python, especially for a large number of iterations. In these experiments, the generated C program is over 20 times faster than Python (about 29 times for Physbe and about 24 times for the circle test).

Conclusions

Through this project, we show that the hierarchical ordering in CBDs provides opportunities for optimization, and that eliminating redundant computations and compiling code, rather than interpreting it as in Python, results in performance gains.

References

[1] AToM³. http://atom3.cs.mcgill.ca/
[2] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, CA, 1997. ISBN-13: 978-1-55860-320-2.
[3] LAPACK. http://www.netlib.org/lapack/
[4] J. McLeod. PHYSBE...a physiological simulation benchmark experiment. SIMULATION, vol. 7, no. 6, pp. 324-329, 1966.
[5] The MathWorks. http://www.mathworks.com

Presentation

Presentation for the project can be found here.