COPPER

"Compilation and OPtimization with PERformance feedback" ::.



Overview ::.


The COPPER integrated environment for application tuning combines robust, existing open source software: the OpenUH compiler, the Dragon program analysis tool, and two performance tools, KOJAK and PerfSuite. The environment provides automated, scalable performance measurement, analysis, and optimization, increasing user productivity by reducing the manual effort that existing approaches require.

Interaction with the compiler has enabled the performance tools to instrument programs more accurately and to measure more selectively, while keeping the process as automatic as possible. The result is a scalable strategy for performance analysis, which is essential if performance tuning tools are to address the needs of emerging very large scale systems. Work in this project has enabled the compiler to selectively provide the performance software with fine-grained information on the control flow within a program's procedures, with minimal user intervention. We have determined that the performance tools can exploit this information to provide more accurate feedback on the behavior of a program. As a result, the approach scales while the information remains detailed.

We have also created interfaces that permit performance information to be fed back to the compiler. This feedback is the basis for even higher levels of scalability: the compiler uses the data to further refine the performance measurement process, and it can also exploit the information to improve its translation of the program.

Thus, integrating the compiler and tools not only benefits the application developer, who no longer has to deal with multiple, disparate sources of information, but can also significantly improve the ability of each component to perform its function. The resulting environment is consequently considerably more than the sum of its parts: it provides a coherent infrastructure for application tuning and enables a scalable approach to investigating and overcoming performance problems.
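The feedback loop described above, in which measurements from one run narrow down what is instrumented in the next, can be illustrated with a minimal sketch. This is not COPPER's actual interface: the function name `select_regions`, the 5% threshold, the region names, and the timings are all hypothetical and chosen only for illustration.

```python
# Hypothetical sketch of feedback-driven selective instrumentation.
# Nothing here is an OpenUH, KOJAK, or PerfSuite API.

def select_regions(profile, min_share=0.05):
    """Return the regions whose share of total time is at least min_share,
    ordered from most to least expensive."""
    total = sum(profile.values())
    if total == 0:
        return []
    selected = [name for name, seconds in profile.items()
                if seconds / total >= min_share]
    return sorted(selected, key=lambda name: profile[name], reverse=True)


# Made-up timings (in seconds) from a first, coarse-grained run.
first_run = {"solver": 8.0, "exchange_halo": 1.5, "init": 0.3, "io": 0.2}

# Only the expensive regions would be instrumented in detail next time.
print(select_regions(first_run))  # ['solver', 'exchange_halo']
```

Restricting detailed measurement to the regions that dominate run time is one simple way to keep trace volume manageable as the program and the machine grow.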


Interactions: OpenUH and KOJAK ::.


KOJAK is responsible for delivering performance feedback to the OpenUH compiler. It can also provide feedback directly to the programmer in manual performance tuning mode, and the compiler helps it reduce the amount of data generated in trace files. This interaction has produced the following: by using OpenUH's rich set of profiling interface routines, KOJAK can now generate more fine-grained trace events. In addition to the previously supported MPI communication operations, OpenMP constructs, and user-defined regions, trace records are now supported for loop nests, conditional branches, and switches. The final analysis result gives application developers detailed information on control flow.
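To make the idea of fine-grained trace events concrete, the following sketch records enter and exit events for a loop and for the branches taken inside it. It is an illustration only, assuming a simple in-memory event list; the `TraceRecorder` class, the region kinds, and the region names are hypothetical and do not reflect KOJAK's trace format or OpenUH's profiling interface.

```python
import time
from contextlib import contextmanager


class TraceRecorder:
    """Hypothetical in-memory recorder of (event, kind, name, timestamp) tuples."""

    def __init__(self):
        self.events = []

    @contextmanager
    def region(self, kind, name):
        # Record entry, run the enclosed code, and always record exit.
        self.events.append(("enter", kind, name, time.perf_counter()))
        try:
            yield
        finally:
            self.events.append(("exit", kind, name, time.perf_counter()))


trace = TraceRecorder()
with trace.region("loop", "outer_i"):
    for i in range(3):
        branch = "i_is_even" if i % 2 == 0 else "i_is_odd"
        with trace.region("branch", branch):
            pass  # the work of the branch would go here

for event, kind, name, _ in trace.events:
    print(event, kind, name)
```

Because every control-flow region produces a pair of events, trace size grows quickly, which is why selective instrumentation matters at scale.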

Interactions: OpenUH and PerfSuite ::.


In the COPPER framework, PerfSuite is employed so that the OpenUH compiler and KOJAK can invoke it flexibly for low-cost, selective performance monitoring and runtime data collection. This interaction has produced the following: the OpenUH runtime library now supports direct online performance measurement of parallel constructs without requiring user intervention. The measurements include hardware performance counter events as well as low-overhead wall clock timings using the "mmtimer" interface that PerfSuite employs. Further extensions that allow call stack walking from within PerfSuite will enable the runtime to determine the calling context associated with performance data sampling, so that compiler optimizations can be invoked more selectively and finer-grained feedback can be provided to the compiler and user.

Performance Experiments ::.


Platform: Cobalt, an SGI Altix with 2 SMP systems running Linux. Each system has 512 Intel Itanium 2 processors.

Experiments on ASPCG were performed to investigate how performance varies across different MPI/OpenMP configurations, i.e. different values of M and N in an MxN run, where M is the number of MPI processes and N is the number of threads per process. Running ASPCG on 32 processors, we found that the configuration of 8 MPI processes x 4 OpenMP threads is 12.8% slower than 4 MPI processes x 8 OpenMP threads. Further analysis of the data uncovered the following reasons:

GenIDLEST was instrumented with KOJAK directives using the OpenUH compiler. The performance of OpenMP and MPI for GenIDLEST was analyzed using the KOJAK results. Execution times were compared at the module level to identify the subroutines responsible for the less than optimal performance of OpenMP. The modules responsible for high execution time were identified, and an analysis of the data returned the following preliminary conclusions:

The causes of these bottlenecks and their resolution are still under investigation.
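The MxN configurations considered in the ASPCG experiment can be enumerated with a short sketch. The helper names `hybrid_configs` and `percent_slower` are hypothetical, and the two timings are made-up values chosen only to reproduce a 12.8% difference; they are not measurements from the experiment. The sketch also assumes that "slower" refers to execution time relative to the faster configuration.

```python
def hybrid_configs(total_procs):
    """All (M, N) pairs with M MPI processes x N OpenMP threads
    that use exactly total_procs processors."""
    return [(m, total_procs // m)
            for m in range(1, total_procs + 1)
            if total_procs % m == 0]


def percent_slower(slow_time, fast_time):
    """How much longer slow_time is than fast_time, in percent."""
    return 100.0 * (slow_time - fast_time) / fast_time


print(hybrid_configs(32))
# [(1, 32), (2, 16), (4, 8), (8, 4), (16, 2), (32, 1)]

# Illustrative timings only (seconds): 4x8 taken as baseline, 8x4 as slower.
time_4x8 = 100.0
time_8x4 = 112.8
print(round(percent_slower(time_8x4, time_4x8), 1))  # 12.8
```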

