Overview ::.
The COPPER integrated environment for application tuning combines robust, existing open source software: the OpenUH compiler, the Dragon program analysis tool, and two performance tools, KOJAK and PerfSuite. The environment provides automated, scalable performance measurement, analysis, and optimization, increasing user productivity by reducing the manual effort that existing approaches require.
Interaction with the compiler has enabled the performance tools to instrument more accurately and to measure more selectively, while keeping the process as automatic as possible. The result is a scalable strategy for performance analysis, which is essential if performance tuning tools are to address the needs of emerging very large scale systems. Work in this project has enabled the compiler to give the performance software fine grained, selective information on the control flow within a program's procedures, with minimal user intervention. We have determined that the performance tools can exploit this information to provide more accurate feedback on program behavior, so the approach scales while the information remains detailed.
We have also created interfaces that permit performance information to be fed back to the compiler. This feedback is the basis for even higher levels of scalability: the compiler uses the data to further refine the performance measurement process and can also exploit it to improve its translation of the program. Integration of the compiler and tools therefore benefits not only the application developer, who no longer has to deal with multiple, disparate sources of information, but also each component's ability to perform its function. The resulting environment is considerably more than the sum of its parts: it provides a coherent infrastructure for application tuning and enables a scalable approach to investigating and overcoming performance problems.
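The feedback loop described above, in which measured data is used to narrow what gets instrumented on the next run, can be illustrated with a small sketch. This is not the OpenUH or KOJAK implementation: the `RegionProfile` record, the `select_regions` function, the region names, and both thresholds are hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class RegionProfile:
    """Hypothetical per-region summary from a first profiling run."""
    name: str
    calls: int
    total_seconds: float


def select_regions(profiles, min_seconds_per_call=1e-5, min_share=0.01):
    """Return names of regions worth instrumenting on the next run.

    A region is kept only if each call is long enough that probe overhead
    is small by comparison, and if it accounts for a noticeable share of
    the total measured time. Both thresholds are illustrative assumptions.
    """
    total = sum(p.total_seconds for p in profiles)
    keep = []
    for p in profiles:
        per_call = p.total_seconds / p.calls if p.calls else 0.0
        share = p.total_seconds / total if total else 0.0
        if per_call >= min_seconds_per_call and share >= min_share:
            keep.append(p.name)
    return keep


# Made-up numbers, for illustration only.
profiles = [
    RegionProfile("solver_loop", 100, 8.0),
    RegionProfile("tiny_getter", 1_000_000, 1.0),  # too short per call
    RegionProfile("setup", 1, 0.01),               # negligible share of time
]
selected = select_regions(profiles)
```

With these made-up inputs only `solver_loop` survives the filter, so a second run would trace far fewer events while still covering the region that dominates execution time.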

Interactions: OpenUH and KOJAK ::.
KOJAK is responsible for delivering performance feedback to the OpenUH compiler. In manual performance tuning mode, it can also provide feedback directly to the programmer. The compiler, in turn, helps KOJAK reduce the amount of data generated in trace files. This interaction has produced the following:
- A profiling interface supporting automatic instrumentation of sequential and OpenMP C/Fortran codes
- A subscriber module to process events from the OpenUH profiling interface
- A new KOJAK instrumentation script, kinst-uh64, that simplifies instrumenting an application with OpenUH
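The relationship between a profiling interface and a subscriber module can be sketched as a simple publish/subscribe pattern. The sketch below is only an illustration under stated assumptions: `ProfilingInterface`, `TimingSubscriber`, the event kinds `"enter"` and `"exit"`, and the region names are all hypothetical and do not reflect the actual OpenUH or KOJAK interfaces.

```python
from collections import defaultdict


class ProfilingInterface:
    """Hypothetical stand-in for compiler-inserted instrumentation hooks."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        """Register a callable that receives (kind, region, timestamp)."""
        self._subscribers.append(callback)

    def emit(self, kind, region, timestamp):
        """Forward one instrumentation event to every subscriber."""
        for callback in self._subscribers:
            callback(kind, region, timestamp)


class TimingSubscriber:
    """Hypothetical subscriber that accumulates inclusive time per region."""

    def __init__(self):
        self.inclusive = defaultdict(float)
        self._stack = []  # (region, entry timestamp) of currently open regions

    def __call__(self, kind, region, timestamp):
        if kind == "enter":
            self._stack.append((region, timestamp))
        elif kind == "exit":
            open_region, start = self._stack.pop()
            if open_region != region:
                raise ValueError(f"unbalanced exit for {region!r}")
            self.inclusive[region] += timestamp - start


# Simulated event stream; in a real tool the events would come from
# instrumented code rather than explicit calls.
interface = ProfilingInterface()
subscriber = TimingSubscriber()
interface.subscribe(subscriber)
interface.emit("enter", "main", 0.0)
interface.emit("enter", "parallel_region", 1.0)
interface.emit("exit", "parallel_region", 3.5)
interface.emit("exit", "main", 4.0)
```

Keeping the event source separate from the consumer means the instrumentation can stay fixed while different tools decide what to record.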

Interactions: OpenUH and PerfSuite ::.
In the COPPER framework, PerfSuite is used so that the OpenUH compiler and KOJAK can invoke it flexibly for low cost, selective performance monitoring and runtime data collection. This interaction has produced the following:
- Extensions to the PerfSuite library to support fine grain, customized performance measurements
- Implementation of a new OpenMP profiling interface consistent with the recently standardized OpenMP runtime profiling interface
- Instrumentation of the OpenMP profiling interface inside the OpenMP runtime library

Performance Experiments ::.
Platform: Cobalt, an SGI Altix with 2 SMP systems running Linux. Each system has 512 Intel Itanium 2 processors.
Experiments on ASPCG investigated how performance varies across MPI/OpenMP configurations, i.e. different values of M and N in an MxN run, where M is the number of MPI processes and N is the number of threads per process. Running ASPCG on 32 processors, we found that the configuration of 8 MPI processes x 4 OpenMP threads is 12.8% slower than 4 MPI processes x 8 OpenMP threads. Further analysis of the data uncovered the following reasons:
- 8 x 4: load imbalance, 5.6% in barrier
- 8 x 4 has more MPI overhead than 4 x 8
- 4 x 8: better load balance, 1.4% in barrier
- Critical overheads identified in OpenMP are idle threads and synchronization barriers, which take about 28% of total execution time
- The modules responsible for idle threads are linked to the dynamic allocation of memory
- The number of implicit barriers is higher in OpenMP than in MPI, which accounts for the synchronization overheads
- Additionally, about five to six modules were identified that take a disproportionate amount of execution time under OpenMP
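The MxN configuration space and the relative comparison used above can be sketched as follows. The helper names `hybrid_configs` and `percent_slower` are hypothetical, the timings in the example are placeholders rather than measured ASPCG results, and the formula assumes that "slower by 12.8%" is measured relative to the faster run, which the text does not state explicitly.

```python
def hybrid_configs(total_processors):
    """List every (M, N) pair with M MPI processes x N threads per process
    that uses exactly `total_processors` processors."""
    return [
        (m, total_processors // m)
        for m in range(1, total_processors + 1)
        if total_processors % m == 0
    ]


def percent_slower(t_slow, t_fast):
    """Slowdown of the slower run as a percentage of the faster run's time.
    Assumed definition; other conventions are possible."""
    return 100.0 * (t_slow - t_fast) / t_fast


configs = hybrid_configs(32)

# Placeholder wall-clock times in seconds, NOT measurements from the
# experiments described here.
t_8x4, t_4x8 = 112.8, 100.0
slowdown = percent_slower(t_8x4, t_4x8)
```

For 32 processors this enumerates six candidate configurations, including the 8 x 4 and 4 x 8 runs compared above.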