Mandalika's scratchpad [ Work blog @Oracle | Stock Market Notes | My Music Compositions ]


Thursday, July 14, 2005
Sun Studio C/C++: Profile Feedback Optimization (PFO)

Profile Feedback

In general, compilers optimize based on the flags supplied at compilation time, plus whatever other transformations they judge appropriate even when the user has not asked for them specifically, e.g., inlining certain routines even when they are not marked with the inline keyword, loop unrolling, etc. However, since the compiler cannot predict the run-time behavior of the application, it cannot perform certain optimizations, such as block reordering and hot-path register allocation, at compile time. So it is up to the developer to lay out the instructions carefully for better application performance. As developers are usually not the end users, it is a cumbersome exercise for them to get the run-time data, identify the hot code (where the application spends most of its time), and rewrite some of that code to improve performance.

The developer can be relieved of such tasks by assisting the optimizer with a program profile, i.e., the run-time behavior of the program: the expected probabilities of branches and the execution frequencies of program blocks.

Sun Studio C/C++ compilers support the profile feedback technique to automate the tasks mentioned above. Profile feedback is a mechanism by which a user can gather information about the run-time behavior of an application and let the compiler use that information to optimize the application further.

Typical steps involved in using profile feedback mechanism:
  1. Compilation for profile data collection

    • Build the application with -xprofile=collect option. In this step, the source code is instrumented to gather data; counters are inserted into the source code to facilitate determining the number of times the code was executed. This data will be used to build a control flow graph.

    • -xprofile=collect may suppress optimizations that modify the structure of the control flow graph, in order to preserve the accuracy of the description used to generate control flow instrumentation

    • You can also specify the name of the program that is being analyzed, during compilation with -xprofile=collect flag. The flag will be -xprofile=collect:name. The name is optional and, if not specified, is assumed to be a.out. When compiling different object files with the -xprofile=collect:<name> option, <name> must be the same for all the object files used to produce the final program. If they are not consistent, then the result will be dependent on the final link order of the program

    • Note that profile feedback works only at optimization levels -xO2 and above

    • eg., (trivial C code, just for example)
      % cat bubblesort.c
      #include <stdio.h>
      #include <stdlib.h>

      #define COUNT 10000

      void swap (int *Array, int i, int j) {
          int temp;
          temp = Array[i];
          Array[i] = Array[j];
          Array[j] = temp;
      }

      void bubblesort(int *Array) {
          int i, j;

          for (i = 0; i < COUNT; ++i) {
              for (j = (i + 1); j < COUNT; ++j) {
                  if (Array[i] > Array[j]) {
                      swap (Array, i, j);
                  }
              }
          }
      }

      int main() {
          int i, *Array;

          Array = (int *) malloc (sizeof (int) * COUNT);

          for (i = COUNT; i > 0; --i)
              Array[COUNT - i] = i;

          bubblesort(Array);

          for (i = 0; i < COUNT; ++i)
              printf("\nArray[%d] = %d", i, Array[i]);

          free(Array);
          return (0);
      }

      % cc -o bubblesort -xO4 -xprofile=collect bubblesort.c

  2. Data collection run

    • Run the application built in (1) with one or more representative workloads. If the workload is representative, then the branches normally taken in the training workload are normally taken in the real workload as well

    • In this phase, the compiler-instrumented code collects branch frequencies for all branches and execution counts for all basic blocks. Running the program creates a directory named after the program with .profile appended to it. The feedbin file under the program.profile directory holds the execution frequencies for later use by the optimizer with -xprofile=use

    • The data collection is additive, which means that if you keep running the same executable (even with different inputs), the data on the latest run will get added to the data about previous runs. Therefore, the data will be an aggregate of all your runs with the profiled executable.

    • If you have profile data from earlier runs, and if you recompile the program with -xprofile=collect and rerun it, the code that writes out the profile data will detect that this is a different program and overwrite the old data

    • If you only run your program with a single input file, then you can just run that input file and you will have collected good data. However, if you are creating a general purpose application which can have a variety of inputs which cause execution of different parts of your program, you should choose different kinds of sample input so that the samples are representative of all the inputs that your program will receive. Using only certain kinds of input will bias the compiler in favoring the executed parts of the program more than the unexecuted parts.

    • By default, the program.profile directory will be created in the directory from which the executable is run. If you wish to change the directory in which the profile data goes, use the SUN_PROFDATA_DIR environment variable.

    • eg.,
      % ./bubblesort
      Array[0] = 1
      Array[1] = 2
      Array[9998] = 9999
      Array[9999] = 10000

      % ls -dF *.profile

      % ls -dF /tmp/*.profile
      No match

      % setenv SUN_PROFDATA_DIR /tmp

      % ./bubblesort
      Array[0] = 1
      Array[1] = 2

      % ls -dF /tmp/*.profile

  3. Rebuild the application with feedback

    • Once we have gathered the feedback data from the application, we can give this data back to the compiler so that it can do a better job of optimizing. To do this, use the compiler flag -xprofile=use:<profile data directory>. Make sure to supply the profile data directory. If you use only -xprofile=use, the compiler does not know what the profile data is called and will assume it is called a.out.profile. Note that it is not necessary to specify .profile when supplying the profile data directory name to -xprofile=use. Hence, from the previous example, it is valid to specify either -xprofile=use:bubblesort.profile or -xprofile=use:bubblesort on the compile line

    • Except for the -xprofile option, which changes from -xprofile=collect to -xprofile=use, the source files and other compiler options must be exactly the same as those used for the compilation that created the program which in turn generated the feedback file. The same version of the compiler must be used for both the collect build and the use build. If compiled with -xprofile=collect:name, the same program name name must appear in the optimizing compilation: -xprofile=use:name

    • -xprofile=use incorporates information collected by running binaries compiled using -xprofile=collect. The information (execution counts associated with basic blocks and control flow edges) is read from a feedback file and is attached to the compiler intermediate representation at the beginning of optimization. Many compiler optimizations subsequently both use the execution counts and update them when the control flow graph is modified as part of an optimizing transformation

    • Based on the profile feedback data, the compiler can perform optimizations of the following types:

      1. Code layout
        • Place code in a way that the frequently executed code in a routine is grouped together
          eg., Block chaining, block ordering, block splitting, register allocation
        • Place code to improve instruction cache (I$) utilization, instruction prefetching etc.,
      2. Loop unrolling, loop unswitching
      3. Inline routines that are frequently called, to both remove the cost of calling the routine, and potentially to enable further optimisation of the inlined code

    • eg.,
      % cc -o bubblesort -xO4 -xprofile=use:bubblesort.profile bubblesort.c
      % ./bubblesort
      Array[0] = 1
      Array[1] = 2

    • With the following cg (Code Generator) options, the compiler dumps the execution count of each basic block, in an assembly listing

      • % {f90,CC} -xprofile=use:<feedback dir> -Qoption cg -assembly,-Qcg-V
        % cc -xprofile=use:<feedback dir> -Wc,-assembly,-Qcg-V

        The -assembly option will generate a .s file with the same basename and dirname as the object file (e.g., bubblesort.o will be accompanied by bubblesort.s in the same directory). The -Qcg-V option adds more information as assembler comments to the generated .s file. If -xprofile=use has been specified, this information includes execution counts derived from the <profile data directory>
      • eg.,
        % cc -xprofile=use:bubblesort -Wc,-assembly,-Qcg-V -xO4 bubblesort.c
        ! 13 !void bubblesort(int *Array) {
        ! 14 ! int i, j;
        ! 16 ! for (i = 0; i < COUNT; ++i) {

        ! SUBROUTINE bubblesort

        .global bubblesort

        bubblesort: /* frequency 1.0 confidence 1.0 */
        /* 000000 16 ( 0 1) */ sethi %hi(0x2400),%o5 ! const ! ssar:9216
        /* 0x0004 ( 0 1) */ or %g0,0,%o2
        /* 0x0008 ( 1 2) */ add %o5,783,%o3 ! const ! ssar:9999 ! urw: 32
        /* 0x000c ( 1 2) */ add %o5,784,%o1 ! const ! ssar:10000 ! urw: 32

        ! 17 ! for (j = (i + 1); j < COUNT; ++j) {

        /* 0x0010 17 ( 2 3) */ or %g0,1,%o5

        ! Registers live out of bubblesort:
        ! o0 o1 o2 o3 o5 sp o7 fp gsr

        ! predecessor blocks : bubblesort .L900000205

        .L900000206: /* frequency 10000.0 confidence 1.0 */
        /* 0x0014 17 ( 0 1) */ cmp %o5,%o1 ! urw: 32
        /* 0x0018 ( 0 1) */ bge,pn %icc,.L77000113 ! tprob=0.00 ! urw: 32
        /* 0x001c ( 0 1) */ add %o0,4,%g3 ! hoisted ! urw: 32

        ! Registers live out of .L900000206:
        ! g3 o0 o1 o2 o3 o5 sp o7 fp gsr

  4. Measure the application performance, and compare with the baseline run

Things to note:

  1. In general, any profile feedback improves performance, as long as the training workload resembles the real workload

  2. Designing good training runs is a complex issue and nearly impossible for some programs. So care must be taken to run the application using a typical data set or several typical data sets. It is important to use data that is representative of the data that will be used by your application in a real-world scenario

  3. Because the process requires compiling the entire application twice, it is intended to be used only after other debugging and tuning is finished, as one of the last steps before putting the application into production

  4. The profile data collection is synchronous. The instrumented code dumps the profile data during the shutdown of the application process. Since the collection is not asynchronous, multi-threaded applications may experience some (profile) data loss due to race conditions between threads updating the counters. There is some work in progress on asynchronous profile data collection in the presence of multiple threads, without the need to shut down the process. Hopefully this feature will be available in Sun Studio 11

Patch releases

If the application is very big, and only a few modules were changed, profile only those binaries (executables or shared libraries) that are rebuilt for the patch. However, in order to collect a meaningful profile, there need to be -xprofile=collect versions of all object files comprising a rebuilt executable or shared library. For example, if the executable mtserver is built with object files smiwork.o and smiutil.o, then rebuild those object files with -xprofile=collect, along with mtserver. Then simply replace the old binaries of the collect build with the new patched binaries, re-run the training run, and collect the feedback data for the entire build; finally, recompile all object files in the binary (executable or library) with -xprofile=use

In other words:
Let's say the shared library, built from objects model.o and buscomp.o, has to be patched. Under PFO, this will be done as follows:
  1. Compile model, buscomp objects with -xprofile=collect

  2. Build the shared library with -xprofile=collect

  3. Simply replace the shared library of the previous complete collect build with the (latest) patched one

  4. Collect profile data for the entire application

  5. Compile model, buscomp objects again with -xprofile=use and with the feedback data from step 4 (see above)

  6. Re-build with -xprofile=use and with the feedback data from step 4 (see above)

  7. Release as a patch
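The steps above can be sketched roughly as follows; the file and library names (model.c, buscomp.c, libapp.so) are illustrative, and the sketch assumes the final program is named app:

```shell
# 1. Recompile the patched objects with -xprofile=collect
cc -xO4 -xprofile=collect:app -c model.c
cc -xO4 -xprofile=collect:app -c buscomp.c

# 2. Rebuild the shared library with -xprofile=collect
cc -G -xprofile=collect:app -o libapp.so model.o buscomp.o

# 3. Replace the library in the previous complete collect build
cp libapp.so /path/to/collect_build/lib/

# 4. Re-run the training workload to collect profile data
#    for the entire application
/path/to/collect_build/bin/app training_input

# 5. Recompile the same objects with -xprofile=use and the fresh feedback data
cc -xO4 -xprofile=use:app -c model.c
cc -xO4 -xprofile=use:app -c buscomp.c

# 6. Rebuild the shared library with -xprofile=use
cc -G -xprofile=use:app -o libapp.so model.o buscomp.o

# 7. Release the rebuilt libapp.so as the patch
```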

Other optimizations that would work well with profile feedback:
  1. Profile feedback works best with crossfile optimisation (controlled by the flag -xipo), since this allows the compiler to look for potential optimisations across all source files

  2. Mapfiles work at the routine level, and profile feedback works within routines; it would seem to be a simple progression to do both optimisations at the same time. This is possible with link-time optimisation (controlled by the flag -xlinkopt). This is also called post-optimisation

The whole discussion on profile feedback optimization can be summarized in the following 3 steps:

  1. Build the application with the flags -xprofile=collect -xipo
    % cc -xO2 -xprofile=collect:application.profile -xipo -o application application.c

  2. Run the application with one or more representative workloads
    % ./application args

  3. Rebuild the application with -xprofile=use -xipo -xlinkopt
    % cc -xO2 -xprofile=use:application.profile -xipo -xlinkopt -o application application.c

Suggested Reading:
  1. Sun Studio C/C++ compiler options
  2. Improving Code Layout Can Improve Application Performance by Darryl Gove
Chris Aoki (Sun Microsystems)