Mandalika's scratchpad


Monday, November 14, 2005
 
Sun Studio C/C++: Tuning iropt for inline control

It is desirable to inline as many hot routines as possible to reduce the run-time overhead of CPU-intensive applications. Compilers, however, follow their own rules when deciding whether or not to inline a routine. This blog post introduces some of the lesser-known (and rarely used) internal compiler flags that can be used to tweak those pre-defined rules.

Consider the following trivial C code:
% cat inline.c
#include <stdio.h>
#include <stdlib.h>

inline void freememory(int *ptr)
{
    free(ptr);
}

extern inline void swapdata(int *ptr1, int *ptr2)
{
    int *temp;

    temp = (int *) malloc (sizeof (int));
    printf("\nswapdata(): before swap ->");

    *temp = *ptr1;
    *ptr1 = *ptr2;
    *ptr2 = *temp;

    printf("\nswapdata(): after swap ->");

    free (temp);
}

inline void printdata(int *ptr)
{
    printf("\nAddress = %p\tStored Data = %d", (void *) ptr, *ptr);
}

inline void storedata(int *ptr, int data)
{
    *ptr = data;
}

inline int *getintptr()
{
    int *ptr;
    ptr = (int *) malloc (sizeof(int));
    return (ptr);
}

inline void AllocLoadAndSwap(int val1, int val2)
{
    int *intptr1, *intptr2;

    intptr1 = getintptr();
    intptr2 = getintptr();
    storedata(intptr1, val1);
    storedata(intptr2, val2);
    printf("\nBefore swapping .. ->");
    printdata(intptr1);
    printdata(intptr2);
    swapdata(intptr1, intptr2);
    printf("\nAfter swapping .. ->");
    printdata(intptr1);
    printdata(intptr2);
    freememory(intptr1);
    freememory(intptr2);
}

inline void InitAllocLoadAndSwap()
{
    printf("\nSnapshot 1\n___________");
    AllocLoadAndSwap(100, 200);
    printf("\n\nSnapshot 2\n___________");
    AllocLoadAndSwap(435, 135);
}

int main() {
    InitAllocLoadAndSwap();
    return (0);
}
By default, auto inlining is turned off with Sun compilers; to turn it on, the code has to be compiled at -xO4 or a higher optimization level. This example asks the compiler to inline all the routines by marking them with the inline keyword. Note that the inline keyword is only a suggestion/request to the compiler; there is no guarantee that the compiler honors it. Just like any other useful system, the compiler has a pre-defined set of rules, and it does its best as long as those rules are not violated. If the compiler chooses to inline a routine, the function body is expanded at all the call sites (just like a macro expansion).
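
To make the "macro expansion" analogy concrete, here is a minimal sketch of what inlining amounts to, using a routine shaped like storedata() from inline.c. The names sum_with_calls(), sum_hand_inlined() and check_equivalence() are made up for this illustration, and the second function is only a hand-written approximation of the result, not actual compiler output.

```c
#include <assert.h>

/* Same shape as storedata() in inline.c; 'static' keeps the sketch in one file. */
static inline void storedata(int *ptr, int data)
{
    *ptr = data;
}

/* What the programmer writes: two calls to a tiny routine. */
int sum_with_calls(int a, int b)
{
    int x, y;

    storedata(&x, a);
    storedata(&y, b);
    return x + y;
}

/* Roughly what is left once both calls are expanded in place
   (hand-written approximation, not compiler output). */
int sum_hand_inlined(int a, int b)
{
    int x, y;

    x = a;   /* body of storedata(&x, a) */
    y = b;   /* body of storedata(&y, b) */
    return x + y;
}

/* Both versions must compute the same result. */
void check_equivalence(void)
{
    assert(sum_with_calls(100, 200) == sum_hand_inlined(100, 200));
}
```

The call, the argument passing and the return disappear in the second version; only the callee's body remains at each call site.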

When this code is compiled with the Sun Studio C compiler, no diagnostic information is printed on stdout/stderr; so tools such as nm or elfdump are one way to find out which routines were inlined and which were not.
% cc -xO3 -c inline.c
% nm inline.o

inline.o:

[Index] Value Size Type Bind Other Shndx Name

[4] | 0| 0|NOTY |LOCL |0 |3 |Bbss.bss
[6] | 0| 0|NOTY |LOCL |0 |4 |Ddata.data
[8] | 0| 0|NOTY |LOCL |0 |5 |Drodata.rodata
[16] | 0| 0|NOTY |GLOB |0 |ABS |__fsr_init_value
[14] | 0| 0|FUNC |GLOB |0 |UNDEF |InitAllocLoadAndSwap
[1] | 0| 0|FILE |LOCL |0 |ABS |inline.c
[15] | 0| 20|FUNC |GLOB |0 |2 |main
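From this output, we can see that InitAllocLoadAndSwap() is not inlined: it still shows up as an undefined (UNDEF) global function symbol, which means the object file still references it. However, we have no information as to why this function was not inlined.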

Compiler commentary with er_src tool

To get some useful diagnostic information, the Sun Studio compiler collection offers a tool called er_src. When the source code is compiled with a debug flag (-g or -g0), er_src can print the compiler commentary. Unfortunately, since the compiler does auto inlining only at -xO4 or higher optimization levels, compiler commentary for inlining is not available at the -xO3 level.

iropt's inlining report

The iropt component is the global optimizer in the Sun Studio compiler collection, and it is iropt that takes care of inlining. It performs inlining only for callees in the same file, unless cross-file optimization options such as -xipo or -xcrossfile are specified on the compile line.

Fortunately, iropt has some internal flags that we can use to control the inlining heuristics. Note that these flags do not depend on the optimization level.

Getting the list of iropt phases, and the corresponding flags

Sun C/C++ compilers on the SPARC platform support a variety of options for inlining control. iropt -help displays the list of supported flags.
% /opt/SS9/SUNWspro/prod/bin/iropt -help

****** General Usage Information about IROPT ******

To get general help information about IROPT, use -help
To list all the optimization phases in IROPT, use -phases
To get help on a particular phase, use -help=phase
To turn on phases, use -A<phase_name>+<phase_name>+...+<phase_name>
To turn off phases, use -R<phase_name>+<phase_name>+...+<phase_name>
To use phase-specific flags, use -A<phase_name>:<flags list>

% /opt/SS9/SUNWspro/prod/bin/iropt -phases

****** List of Optimization Phases in IROPT ******

Phase Name Description
-------------------------------------------------------------
loop Loop Invariant Code Motion
copy Copy ProPaGation
const Const ProPaGation and folding
reg Virtual Register Allocation
reassoc Reconstruction of associative and/or distributive expressions
rename Scalar Rename
mvl Two-version loops for parallelization
loop_dist Loop Distribution
ddint Loop Interchange
fusion Loop Fusion
eliminate Scalar Replacement on def-def and def-use
private Private Array Analysis
scalarrep Scalar Replacement for use-use
tile Cache Blocking
ujam Register Blocking
ddrefs Loop Invariant Array References Moving
invcc Invariant Conditional Code Motion
restrict_g Assume global pointers as restrict
dead Dead code elimination
pde Partial dead code elimination
ansi_alias Apply ANSI Aliase Rules to Pointer References
yas Scalar Replacement for reduction arrays
cond_elim Conditional Code Elimination
vector Vectorizing Some Intrinsics Functions Calls in Loops
whole Whole Program Mode
bopt Branches Reordering based on Profile Data
invccexp Invariant Conditional Code Expansion
bcopy Memcpy and Memset Transformations
ccse Cross Iteration CSE
data_access Array Access Regions Analysis
ipa Interprocedual Analysis
contract Array Contraction Analysis
symbol Symbolic Analysis
ppg2 optimistic strategy of constant propagation
parallel Parallelization
pcg Parallel Code Generator
lazy Lazy Code Motion
region Region-based Optimization
loop_peeling Loop Peeling
loop_shifting Loop Shifting
loop_collapsing Loop Collapsing
memopt Merge memory allocations
sr Strength reduction (new)
ivsub3 Induction Variable Substitution
crit Critical path optimisations
scalar_repl
loop_bound
loop_condition
measurement
memopt_pattern

% /opt/SS9/SUNWspro/prod/bin/iropt -help=inline

NAME
inline - Qoption for IPA-based inlining phase.

SYNOPSIS
-Ainline[:<op1>][:<op2>]:...[:<opn>] - turn on inline.
-Rinline - turn off inline

DESCRIPTION
inline is on by default now. -Ainline turns it on.
-Rinline turns it off.

NOTE: the following is a brief description of the old inliner qoptions
1. Old inliner qoptions that do not have equivalent
options in the new inliner--avoid to use them later:
-Ml -Mi -Mm -Ma -Mc -Me -Mg -Mw -Mx -Mx -MC -MS

2. Old inliner qoptions that have equivalent option
in the new inliner--use the new options later:
Old options new options
-Msn recursion=n
-Mrn irs=n
-Mtn cs=n
-Mpn cp=n
-MA chk_alias
-MR chk_reshape
-MI chk_reshape=no
-MF mi

The acceptable sub-options are:

report[=n] - dump inlining report.
n=chain:
show to-be-inlined call chains.
n=0: show inlined calls only.
n=1: (default): show both inlined and
non-inlined calls and reasons for
inlining/non-inlining.
n=2: n=1 plus call id and node id
n=3: show inlining summary only
n=4: n=2 and iropt aborts after the
inlining report is dumped out.
cgraph - dump cgraph.
call_in_pragma[=no|yes]:
- call_in_pragma or call_in_pragma=yes:
Inline a call that in the Parallel region
into the original routine
- call_in_pragma=no: (default)
Don't inline a call that in the Parallel region
into the original routine
inline_into_mfunction[=no|yes]:(only for Fortran)
- inline_into_mfunction or inline_into_mfunction=yes:(default)
Inline a call into the mfunction if it is in the
Parallel Region
- inline_into_mfunction=no:
Don't inline a call into the mfunction if it
in the Parallel Region
NOTE: for other languages, if you specify inline_into_mfunction=yes
The compiler will silently ignore this qoption. As a result,
Calls in parallel region will still be inlined into pragma constructs
rs=n - max number of triples in inlinable routines.
iropt defines a routine as inlinable or not
based on this number. So no routines over
this limit will be considered for inlining.
irs=n - max number of triples in a inlining routine,
including size increase by inlining its calls
cs=n - max number of triples in a callee.
In general, iropt only inline calls whose
callee is inlinable (defined by rs) AND
whose callee size is not greater than n.
But some calls to inlinable routines are
actually inlined because of other factors
such as constant actuals, etc.
recursion=n
- max level of resursive call that is
considered for inlining.
cp=n - minimum profile feedback counter of a call.
No call with counter less than this limit
would be inlined.
inc=n - percentage of the total number of triples
increased after inlining. No inlining over
this percentage. For instance, 'inc=30'
means inlining is allowed to increase the
total program size by 30%.
create_iconf=<filename>:
use_iconf=<filename>:
This creates/uses an inlining configuration.
The file lists calls and routines that are
inlined and routines that inline their calls.
Its format is:
air /* actual inlining routines */
n11 n12 n13 ...
n21 n22 n23 ...
.....
ari /* actual routines inlined */
n11 n12 n13 ...
n21 n22 n23 ...
.....
aci /* actual calls inlined */
n11 n12 n13 ...
n21 n22 n23 ...
.....
The numbers are call ids and node ids
printed out when report=2. It is used for
debugging. The usual usage is to use
create_iconf= to create a config file.
then, comment (by preceding numbers line
with #) to disallow inlining for those
calls or routines. For instance,
aci
2 5 6 90
10 234 45 6
# 21 34 46
with the above config file, calls whose
call ids are 21, 34, or 46 will not be
inlined.
do_inline=<routine_name>:
- guide inliner to do inlining for a given
routine only.
mi:
- Do maximum inlining for given routines if do_inline
is used; otherwise, do maximum inlining for main routine.
(The inliner will not check inlining parameters.
remove_ip[=no|yes]:
- remove_ip or remove_ip=yes:
removing inliningPlan after inlining.
- remove_ip=no [default]:
keep inliningPlan after inlining.
chk_alias[=no|yes]:
- chk_alias or chk_alias=yes [default]:
Don't inline a call if inlining it causes
aliases among callee's formal arrays.
- chk_alias=no:
Ignore such checking.
chk_reshape[=no|yes]:
- chk_reshape or chk_reshape=yes [default]:
Don't inline a call if its array argument
is reshaped between caller and callee.
- chk_reshape=no:
Ignore such checking.
chk_mismatch[=no|yes]:
- chk_mismatch or chk_mismatch=yes [default]:
Don't inline a call if any real argument
mismatches with its formal in type.
- chk_mismatch=no:
Ignore such checking.
do_chain[=no|yes]:
- do_chain or do_chain=yes [default]:
Enable inlining for call chains.
- do_chain=no:
Disable inlining for call chains.
callonce[=no|yes]:
- callonce=no [default]:
Disable inlining a routine that is
called only once.
- callonce or callonce=yes:
Enable inlining a routine that is
called only once.

All of a sudden we have an overwhelming amount of information about the heuristics applied at compile time. Looking carefully at the options listed above, there is a sub-option to -Ainline, report, that dumps an inlining report. To pass special flags to iropt from the C compiler, we need to specify -W2,<option>:<sub-option> on the compile line.

Here's how:
% cc -xO3 -c -W2,-Ainline:report=2 inline.c

INLINING SUMMARY

inc=400: percentage of program size increase.
irs=4096: max number of triples allowed per routine after inlining.
rs=450: max routine size for an inlinable routine.
cs=400: call size for inlinable call.
recursion=1: max level for inlining recursive calls.
Auto inlining: OFF

Total inlinable calls: 14
Total inlined calls: 36
Total inlined routines: 7
Total inlinable routines: 7
Total inlining routines: 3
Program size: 199
Program size increase: 744
Total number of call graph nodes: 11

Notes for selecting inlining parameters

1. "Not inlined, compiler decision":
If a call is not inlined by this reason, try to
increase inc in order to inline it by
-Qoption iropt -Ainline:inc= for FORTRAN, C++
-W2,-Ainline:inc= for C

2. "Not inlined, routine too big after inlining":
If a call is not inlined by this reason, try to
increase irs in order to inline it by
-Qoption iropt -Ainline:irs= for FORTRAN, C++
-W2,-Ainline:irs= for C

3. "Not inlined, callee's size too big":
If a call is not inlined by this reason, try to
increase cs in order to inline it by
-Qoption iropt -Ainline:cs= for FORTRAN, C++
-W2,-Ainline:cs= for C

4. "Not inlined, recursive call":
If a call is not inlined by this reason, try to
increase recursion level in order to inline it by
-Qoption iropt -Ainline:recrusion= for FORTRAN, C++
-W2,-Ainline:recrusion= for C

5. "Routine not inlined, too many operations":
If a routine is not inlinable by this reason, try to
increase rs in order to make it inlinable by
-Qoption iropt -Ainline:rs= for FORTRAN, C++
-W2,-Ainline:rs= for C


ROUTINES NOT INLINABLE:

main [id=7] (inline.c)
Routine not inlined, user requested

CALL INLINING REPORT:

Routine: freememory [id=0] (inline.c)
Nothing inlined.

Routine: swapdata [id=1] (inline.c)
Nothing inlined.

Routine: printdata [id=2] (inline.c)
Nothing inlined.

Routine: storedata [id=3] (inline.c)
Nothing inlined.

Routine: getintptr [id=4] (inline.c)
Nothing inlined.

Routine: AllocLoadAndSwap [id=5] (inline.c)
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined

Routine: InitAllocLoadAndSwap [id=6] (inline.c)
AllocLoadAndSwap [call_id=22], line 64: Not inlined, compiler decision
(inc limit reached. See INLININING SUMMARY)
AllocLoadAndSwap [call_id=24], line 66: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined

Routine: main [id=7] (inline.c)
InitAllocLoadAndSwap [call_id=25], line 70: Auto inlined
AllocLoadAndSwap [call_id=22], line 64: Not inlined, compiler decision
(inc limit reached. See INLININING SUMMARY)
AllocLoadAndSwap [call_id=24], line 66: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined
Finally, there's some very useful information. The above report shows the threshold values used while making decisions, all the routines, and whether each call is inlined or not; for calls that are not inlined, it gives the reason and some suggestions on how to make the inlining succeed. This is very cool!
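
One way to read the summary and its notes is as a series of checks, one per threshold. The sketch below is only a rough model built from the descriptions in iropt -help=inline and the "Notes for selecting inlining parameters"; it is not iropt's actual algorithm. The names inline_limits, verdict and judge_call() are hypothetical, the order of the checks is an assumption, and factors such as constant actuals, profile counters (cp) and recursion are ignored.

```c
#include <assert.h>

/* Limits as printed in the INLINING SUMMARY of the report. */
struct inline_limits {
    long inc;   /* max program size increase, in percent */
    long irs;   /* max triples in a routine after inlining */
    long rs;    /* max triples in an inlinable routine */
    long cs;    /* max triples in a callee */
};

/* One value per "not inlined" reason listed in the report notes. */
enum verdict {
    INLINED,
    TOO_MANY_OPERATIONS,      /* "Routine not inlined, too many operations": raise rs */
    CALLEE_TOO_BIG,           /* "callee's size too big": raise cs */
    ROUTINE_TOO_BIG_AFTER,    /* "routine too big after inlining": raise irs */
    INC_LIMIT_REACHED         /* "compiler decision (inc limit reached)": raise inc */
};

/*
 * Hypothetical model; all sizes are in triples.
 *   callee_size    - size of the routine being called
 *   caller_after   - size of the calling routine if the call were inlined
 *   program_size   - program size before any inlining
 *   increase_after - total growth so far, including this call
 */
enum verdict judge_call(const struct inline_limits *lim,
                        long callee_size, long caller_after,
                        long program_size, long increase_after)
{
    assert(lim != 0 && program_size > 0);

    if (callee_size > lim->rs)
        return TOO_MANY_OPERATIONS;
    if (callee_size > lim->cs)
        return CALLEE_TOO_BIG;
    if (caller_after > lim->irs)
        return ROUTINE_TOO_BIG_AFTER;
    /* growth is compared as a percentage of the original program size */
    if (increase_after * 100 > lim->inc * program_size)
        return INC_LIMIT_REACHED;
    return INLINED;
}
```

Each non-INLINED value corresponds to one of the messages in the report, and the comment next to it names the sub-option the report suggests raising.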

From the report: the compiler tries to inline all the routines, as long as the program size doesn't increase by more than 400% over the original size (i.e., the size without inlining). Unfortunately, inlining one of the calls to AllocLoadAndSwap() would exceed that limit, and hence the compiler decides not to inline it. Fair enough. If we don't care about the size of the binary, and we really need this call inlined, we can increase the value of inc so that the inlining of AllocLoadAndSwap() fits within the new limit.

e.g.,
% cc -xO3 -c -W2,-Ainline:report=2,-Ainline:inc=650 inline.c
INLINING SUMMARY

inc=650: percentage of program size increase.
irs=4096: max number of triples allowed per routine after inlining.
rs=450: max routine size for an inlinable routine.
cs=400: call size for inlinable call.
recursion=1: max level for inlining recursive calls.
Auto inlining: OFF

Total inlinable calls: 14
Total inlined calls: 60
Total inlined routines: 7
Total inlinable routines: 7
Total inlining routines: 3
Program size: 199
Program size increase: 1260
Total number of call graph nodes: 11

Notes for selecting inlining parameters

... skip ... (see prev reports for the text that goes here)

ROUTINES NOT INLINABLE:

main [id=7] (inline.c)
Routine not inlined, user requested


CALL INLINING REPORT:

Routine: freememory [id=0] (inline.c)
Nothing inlined.

Routine: swapdata [id=1] (inline.c)
Nothing inlined.

Routine: printdata [id=2] (inline.c)
Nothing inlined.

Routine: storedata [id=3] (inline.c)
Nothing inlined.

Routine: getintptr [id=4] (inline.c)
Nothing inlined.

Routine: AllocLoadAndSwap [id=5] (inline.c)
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined

Routine: InitAllocLoadAndSwap [id=6] (inline.c)
AllocLoadAndSwap [call_id=22], line 64: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined
AllocLoadAndSwap [call_id=24], line 66: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined

Routine: main [id=7] (inline.c)
InitAllocLoadAndSwap [call_id=25], line 70: Auto inlined
AllocLoadAndSwap [call_id=22], line 64: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined
AllocLoadAndSwap [call_id=24], line 66: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined
From the above output, both calls to AllocLoadAndSwap() were inlined by the compiler once we allowed the program size to increase by up to 650%.
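
The numbers in the two summaries are consistent with this. Here is a small sketch of the arithmetic, assuming that inc is compared against "Program size increase" expressed as a percentage of "Program size" (the reports do not spell out the exact formula). The function names growth_percent() and within_inc() are made up for this example.

```c
#include <assert.h>

/* Growth in percent of the original size, rounded down.
   Both arguments are triple counts taken from the INLINING SUMMARY. */
long growth_percent(long size_increase, long program_size)
{
    assert(program_size > 0);
    return size_increase * 100 / program_size;
}

/* Nonzero if the growth stays within the given inc limit.
   Compared without division so that rounding cannot hide an overshoot. */
int within_inc(long size_increase, long program_size, long inc)
{
    assert(program_size > 0);
    return size_increase * 100 <= inc * program_size;
}
```

With the figures from the first report, growth_percent(744, 199) gives 373, which is under the default inc=400. With the figures from the second report, growth_percent(1260, 199) gives 633: that is over 400 but under 650, which would explain why the extra call was rejected with the default limit and accepted with inc=650.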

Notes:
  1. Multiple iropt options, separated by commas (,), can be specified after -W2
    e.g., -W2,-Ainline:report=2,-Ainline:inc=650

  2. For C++ programs, -Qoption can be used to pass internal flags to iropt.
    e.g., -Qoption iropt -Ainline:report=2
    -Qoption iropt -Ainline:report=2,-Ainline:inc=650

  3. Inlining functions whose call overhead is large relative to the function's code improves performance. The obvious reason for the improvement is the elimination of the function call, the stack frame manipulation, and the function return.

  4. Even though inlining may improve the run-time performance of an application, do not try to inline too many functions. Inline only those functions that profiling data shows could benefit from inlining.

  5. In general, the compiler's default threshold values are good enough for inlining. Use iropt's flags only if some very hot routines did not get inlined for some reason. Turn on auto inlining with the -xO4 option.

  6. Inlining functions increases build times and program sizes. Sometimes very large routines, once inlined, may no longer fit into the processor's cache, which can lead to poor performance due to the increased cache miss rate.
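
For build scripts that drive both the C and the C++ compiler, the two spellings from notes 1 and 2 can be generated from one place. The helper below is a hypothetical sketch (format_inline_flags() is not part of any Sun tool); it only reproduces the exact flag strings shown in this post for the report and inc sub-options.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Writes the compiler flags that pass "-Ainline:report=R,-Ainline:inc=N"
 * to iropt.  lang is "c" for the C compiler (-W2,...); anything else
 * selects the -Qoption iropt form used for C++.
 * Returns the number of characters written, or -1 if buf is too small.
 */
int format_inline_flags(char *buf, size_t size, const char *lang,
                        int report, int inc)
{
    const char *prefix;
    int n;

    assert(buf != NULL && lang != NULL);

    prefix = (strcmp(lang, "c") == 0) ? "-W2," : "-Qoption iropt ";
    n = snprintf(buf, size, "%s-Ainline:report=%d,-Ainline:inc=%d",
                 prefix, report, inc);

    return (n < 0 || (size_t) n >= size) ? -1 : n;
}
```

Calling it with "c", 2 and 650 yields -W2,-Ainline:report=2,-Ainline:inc=650, and with any other language string it yields -Qoption iropt -Ainline:report=2,-Ainline:inc=650.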

Relevant information:
  1. Sun C/C++ compilers: Inlining routines
  2. Sun Studio: Advanced Compiler Options for Performance


