<HTML>
<HEAD>
<TITLE>Debugging Garbage Collector Related Problems</title>
</head>
<BODY>
<H1>Debugging Garbage Collector Related Problems</h1>
This page contains some hints on
debugging issues specific to
the Boehm-Demers-Weiser conservative garbage collector.
It applies both to debugging issues in client code that manifest themselves
as collector misbehavior, and to debugging the collector itself.
<P>
If you suspect a bug in the collector itself, it is strongly recommended
that you try the latest collector release, even if it is labelled as "alpha",
before proceeding.
<H2>Bus Errors and Segmentation Violations</h2>
<P>
If the fault occurred in <TT>GC_find_limit</tt>, or with incremental collection enabled,
this is probably normal.  The collector installs handlers to take care of
these faults.  You will not see them unless you are using a debugger.
Your debugger <I>should</i> allow you to continue.
It's often preferable to tell the debugger to ignore SIGBUS and SIGSEGV
("<TT>handle SIGSEGV SIGBUS nostop noprint</tt>" in gdb,
"<TT>ignore SIGSEGV SIGBUS</tt>" in most versions of dbx)
and set a breakpoint in <TT>abort</tt>.
The collector will call <TT>abort</tt> if the signal had another cause
and no other handler was previously installed.
<P>
We recommend debugging without incremental collection if possible.
(This applies directly to UNIX systems.
Debugging with incremental collection under win32 is worse.  See README.win32.)
<P>
If the application generates an unhandled SIGSEGV or equivalent, it is
often easiest to set the environment variable <TT>GC_LOOP_ON_ABORT</tt>.  On many
platforms, this will cause the collector to loop in a handler when the
SIGSEGV is encountered (or when the collector aborts for some other reason),
and a debugger can then be attached to the looping
process.  This sidesteps common operating system problems related
to incomplete core files for multithreaded applications, etc.
<H2>Other Signals</h2>
On most platforms, the multithreaded version of the collector needs one or
two other signals for internal use in stopping threads.
It is normally wise to tell the debugger to ignore these.  On Linux,
the collector currently uses SIGPWR and SIGXCPU by default.
<H2>Warning Messages About Needing to Allocate Blacklisted Blocks</h2>
The garbage collector generates warning messages of the form
<PRE>
Needed to allocate blacklisted block at 0x...
</pre>
when it needs to allocate a block at a location that it knows to be
referenced by a false pointer.  These false pointers can be either permanent
(<I>e.g.</i> a static integer variable that never changes) or temporary.
In the latter case, the warning is largely spurious, and the block will
eventually be reclaimed normally.
In the former case, the program will still run correctly, but the block
will never be reclaimed.  Unless the block is intended to be
permanent, the warning indicates a memory leak.
<OL>
<LI>Ignore these warnings while you are using <TT>GC_DEBUG</tt>.  Some of the routines
mentioned below don't have debugging equivalents.  (Alternatively, write
the missing routines and send them to me.)
<LI>Replace allocator calls that request large blocks with calls to
<TT>GC_malloc_ignore_off_page</tt> or
<TT>GC_malloc_atomic_ignore_off_page</tt>.  You may want to set a
breakpoint in <TT>GC_default_warn_proc</tt> to help you identify such calls.
Make sure that a pointer to somewhere near the beginning of the resulting block
is maintained in a (preferably volatile) variable as long as
the block is needed.
<LI>
If the large blocks are allocated with realloc, we suggest instead allocating
them with something like the following.  Note that the realloc size increment
should be fairly large (e.g. a factor of 3/2) for this to exhibit reasonable
performance.  But we all know we should do that anyway.
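<PRE>
#include &lt;string.h&gt;   /* memcpy */
#include "gc.h"

/* realloc replacement that gets large blocks from GC_malloc_ignore_off_page. */
void * big_realloc(void *p, size_t new_size)
{
    size_t old_size;
    void * result;

    /* A null p and small requests are left to the ordinary GC_realloc. */
    if (p == 0 || new_size &lt;= 10000) return(GC_realloc(p, new_size));
    old_size = GC_size(p);
    if (new_size &lt;= old_size) return(p);   /* already big enough */
    result = GC_malloc_ignore_off_page(new_size);
    if (result == 0) return(0);
    memcpy(result,p,old_size);
    GC_free(p);                             /* the old block is dead now */
    return(result);
}
</pre>
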
<LI> In the unlikely case that even relatively small object
(&lt;20KB) allocations are triggering these warnings, your address
space contains lots of "bogus pointers", i.e. values that appear to
be pointers but aren't.  Usually this can be solved by using <TT>GC_malloc_atomic</tt>
or the routines in <TT>gc_typed.h</tt> to allocate large pointer-free regions of bitmaps, etc.
Sometimes the problem can be solved with trivial changes of encoding
in certain values.  It is possible to identify the source of the bogus
pointers by building the collector with <TT>-DPRINT_BLACK_LIST</tt>,
which will cause it to print the "bogus pointers", along with their location.

<LI> If you get only a fixed number of these warnings, you are probably only
introducing a bounded leak by ignoring them.  If the data structures being
allocated are intended to be permanent, then it is also safe to ignore them.
The warnings can be turned off by calling <TT>GC_set_warn_proc</tt> with a procedure
that ignores these warnings (e.g. by doing absolutely nothing).
</ol>
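<P>
The advice in item 3 about growing by a factor of roughly 3/2 can be sketched
as a small capacity helper.  This is only an illustration: the name
<TT>next_capacity</tt> and the saturating overflow check are assumptions made
here, not part of the collector's interface, and the result would be passed
to whatever reallocation routine you use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the collector): choose the next
 * capacity for a growing buffer.  Growing geometrically by about 3/2
 * keeps the number of large reallocations, and thus copies, small. */
size_t next_capacity(size_t current, size_t needed)
{
    size_t grown = current + current / 2;      /* about 3/2 of current */

    if (grown < current) grown = SIZE_MAX;     /* saturate on overflow */
    return grown > needed ? grown : needed;    /* never less than needed */
}
```

A caller would compute <TT>next_capacity(old_capacity, bytes_required)</tt>
once per growth step and request that many bytes, rather than growing by a
small fixed increment.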
<H2>The Collector References a Bad Address in <TT>GC_malloc</tt></h2>

This typically happens while the collector is trying to remove an entry from
its free list, and the free list pointer is bad because the free list link
in the last allocated object was bad.
<P>
With &gt; 99% probability, you wrote past the end of an allocated object.
Try defining <TT>GC_DEBUG</tt> before including <TT>gc.h</tt> and
allocating with <TT>GC_MALLOC</tt>.  This will try to detect such
overwrite errors.

<H2>Unexpectedly Large Heap</h2>

Unexpected heap growth can be due to one of the following:
<OL>
<LI> Data structures that are being unintentionally retained.  This
is commonly caused by data structures that are no longer being used,
but were not cleared, or by caches growing without bounds.
<LI> Pointer misidentification.  The garbage collector is interpreting
integers or other data as pointers and retaining the "referenced"
objects.
<LI> Heap fragmentation.  This should never result in unbounded growth,
but it may account for larger heaps.  This is most commonly caused
by allocation of large objects.  On some platforms it can be reduced
by building with <TT>-DUSE_MUNMAP</tt>, which will cause the collector to unmap
memory corresponding to pages that have not been recently used.
<LI> Per-object overhead.  This is usually a relatively minor effect, but
it may be worth considering.  If the collector recognizes interior
pointers, object sizes are increased, so that one-past-the-end pointers
are correctly recognized.  The collector can be configured not to do this
(<TT>-DDONT_ADD_BYTE_AT_END</tt>).
<P>
The collector rounds up object sizes so the result fits well into the
chunk size (<TT>HBLKSIZE</tt>, normally 4K on 32 bit machines, 8K
on 64 bit machines) used by the collector.   Thus it may be worth avoiding
objects of size 2K + 1 (or 2K if a byte is being added at the end).
</ol>
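<P>
To see why a size just above half a chunk is costly, the following sketch
models a chunk as a plain array of equally sized objects.  It is a deliberate
simplification: the function names are made up for this example, and the
model ignores the collector's real size classes and any per-chunk
bookkeeping.

```c
#include <stddef.h>

/* Simplified model, not collector code: how many objects of obj_size
 * bytes fit into one chunk of chunk_size bytes. */
size_t objects_per_chunk(size_t chunk_size, size_t obj_size)
{
    return obj_size == 0 ? 0 : chunk_size / obj_size;
}

/* Bytes of the chunk that cannot hold another object of this size. */
size_t wasted_bytes(size_t chunk_size, size_t obj_size)
{
    return chunk_size - objects_per_chunk(chunk_size, obj_size) * obj_size;
}
```

Under this model a 4096-byte chunk holds two 2048-byte objects with nothing
left over, but only one 2049-byte object, leaving 2047 bytes unused.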
The last two cases can often be identified by looking at the output
of a call to <TT>GC_dump()</tt>.  Among other things, it will print the
list of free heap blocks, and a very brief description of all chunks in
the heap, the object sizes they correspond to, and how many live objects
were found in the chunk at the last collection.
<P>
Growing data structures can usually be identified by
<OL>
<LI> Building the collector with <TT>-DKEEP_BACK_PTRS</tt>,
<LI> Preferably using debugging allocation (defining <TT>GC_DEBUG</tt>
before including <TT>gc.h</tt> and allocating with <TT>GC_MALLOC</tt>),
so that objects will be identified by their allocation site,
<LI> Running the application long enough so
that most of the heap is composed of "leaked" memory, and
<LI> Then calling <TT>GC_generate_random_backtrace()</tt> from <TT>backptr.h</tt>
a few times to determine why some randomly sampled objects in the heap are
being retained.
</ol>
<P>
The same technique can often be used to identify problems with false
pointers, by noting whether the reference chains printed by
<TT>GC_generate_random_backtrace()</tt> involve any misidentified pointers.
An alternate technique is to build the collector with
<TT>-DPRINT_BLACK_LIST</tt>, which will cause it to report values that
almost, but not quite, look like heap pointers.  It is very likely that
actual false pointers will come from similar sources.
<P>
In the unlikely case that false pointers are an issue, it can usually
be resolved using one or more of the following techniques:
<OL>
<LI> Use <TT>GC_malloc_atomic</tt> for objects containing no pointers.
This is especially important for large arrays containing compressed data,
pseudo-random numbers, and the like.  It is also likely to improve GC
performance, perhaps drastically so if the application is paging.
<LI> If you allocate large objects containing only
one or two pointers at the beginning, either try the typed allocation
primitives in <TT>gc_typed.h</tt>, or separate out the pointer-free component.
<LI> Consider using <TT>GC_malloc_ignore_off_page()</tt>
to allocate large objects.  (See <TT>gc.h</tt> and above for details.
Large means &gt; 100K in most environments.)
</ol>
<H2>Prematurely Reclaimed Objects</h2>
The usual symptom of this is a segmentation fault, or an obviously overwritten
value in a heap object.  This should, of course, be impossible.  In practice,
it may happen for reasons like the following:
<OL>
<LI> The collector did not intercept the creation of threads correctly in
a multithreaded application, <I>e.g.</i> because the client called
<TT>pthread_create</tt> without including <TT>gc.h</tt>, which redefines it.
<LI> The last pointer to an object in the garbage collected heap was stored
somewhere where the collector couldn't see it, <I>e.g.</i> in an
object allocated with system <TT>malloc</tt>, in certain types of
<TT>mmap</tt>ed files,
or in some data structure visible only to the OS.  (On some platforms,
thread-local storage is one of these.)
<LI> The last pointer to an object was somehow disguised, <I>e.g.</i> by
XORing it with another pointer.
<LI> Incorrect use of <TT>GC_malloc_atomic</tt> or typed allocation.
<LI> An incorrect <TT>GC_free</tt> call.
<LI> The client program overwrote an internal garbage collector data structure.
<LI> A garbage collector bug.
<LI> (Empirically less likely than any of the above.) A compiler optimization
that disguised the last pointer.
</ol>
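<P>
The "disguised pointer" case in the list above is easy to produce by
accident, for example in XOR-linked lists.  The sketch below uses
hypothetical helper names and ordinary variables rather than collected
objects.  It shows that the stored word no longer equals the object's
address, so a conservative scan has nothing to recognize.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only.  The value returned by
 * disguise() is not the address of the object, so if it is the only
 * stored reference, a conservative collector cannot see the object. */
uintptr_t disguise(const void *p, const void *key)
{
    return (uintptr_t)p ^ (uintptr_t)key;
}

/* Recover the original pointer from the disguised word and the key. */
void *reveal(uintptr_t disguised, const void *key)
{
    return (void *)(disguised ^ (uintptr_t)key);
}
```

If an object is reachable only through such a word, it can be reclaimed
even though the program can still reconstruct its address; keeping an
ordinary pointer to it somewhere the collector scans avoids the problem.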
The following relatively simple techniques should be tried first to narrow
down the problem:
<OL>
<LI> If you are using the incremental collector, try turning it off for
debugging.
<LI> If you are using shared libraries, try linking statically.  If that works,
ensure that <TT>DYNAMIC_LOADING</tt> is defined on your platform.
<LI> Try to reproduce the problem with fully debuggable unoptimized code.
This will eliminate the last possibility, as well as making debugging easier.
<LI> Try replacing any suspect typed allocation and <TT>GC_malloc_atomic</tt>
calls with calls to <TT>GC_malloc</tt>.
<LI> Try removing any <TT>GC_free</tt> calls (<I>e.g.</i> with a suitable
<TT>#define</tt>).
<LI> Rebuild the collector with <TT>-DGC_ASSERTIONS</tt>.
<LI> If the following works on your platform (i.e. if gctest still works
if you do this), try building the collector with
<TT>-DREDIRECT_MALLOC=GC_malloc_uncollectable</tt>.  This will cause
the collector to scan memory allocated with malloc.
</ol>
If all else fails, you will have to attack this with a debugger.
Suggested steps:
<OL>
<LI> Call <TT>GC_dump()</tt> from the debugger around the time of the failure.  Verify
that the collector's idea of the root set (i.e. static data regions which
it should scan for pointers) looks plausible.  If not, i.e. if it doesn't
include some static variables, report this as
a collector bug.  Be sure to describe your platform precisely, since this sort
of problem is nearly always very platform dependent.
<LI> Especially if the failure is not deterministic, try to isolate it to
a relatively small test case.
<LI> Set a breakpoint in <TT>GC_finish_collection</tt>.  This is a good
point to examine what has been marked, i.e. found reachable, by the
collector.
<LI> If the failure is deterministic, run the process
up to the last collection before the failure.
Note that the variable <TT>GC_gc_no</tt> counts collections and can be used
to set a conditional breakpoint in the right one.  It is incremented just
before the call to <TT>GC_finish_collection</tt>.
If object <TT>p</tt> was prematurely recycled, it may be helpful to
look at <TT>*GC_find_header(p)</tt> at the failure point.
The <TT>hb_last_reclaimed</tt> field will identify the collection number
during which its block was last swept.
<LI> Verify that the offending object still has its correct contents at
this point.
Then call <TT>GC_is_marked(p)</tt> from the debugger to verify that the
object has not been marked, and is about to be reclaimed.
<LI> Determine a path from a root, i.e. static variable, stack, or
register variable,
to the reclaimed object.  Call <TT>GC_is_marked(q)</tt> for each object
<TT>q</tt> along the path, trying to locate the first unmarked object, say
<TT>r</tt>.
<LI> If <TT>r</tt> is pointed to by a static root,
verify that the location
pointing to it is part of the root set printed by <TT>GC_dump()</tt>.  If it
is on the stack in the main (or only) thread, verify that
<TT>GC_stackbottom</tt> is set correctly to the base of the stack.  If it is
in another thread stack, check the collector's thread data structure
(<TT>GC_threads[]</tt> on several platforms) to make sure that stack bounds
are set correctly.
<LI> If <TT>r</tt> is pointed to by heap object <TT>s</tt>, check that the
collector's layout description for <TT>s</tt> is such that the pointer field
will be scanned.  Print <TT>*GC_find_header(s)</tt> to look at the descriptor
for the heap chunk.  The <TT>hb_descr</tt> field specifies the layout
of objects in that chunk.  See <TT>gc_mark.h</tt> for the meaning of the descriptor.
(If its low-order 2 bits are zero, then it is just the length of the
object prefix to be scanned.  This form is always used for objects allocated
with <TT>GC_malloc</tt> or <TT>GC_malloc_atomic</tt>.)
<LI> If the failure is not deterministic, you may still be able to apply some
of the above techniques at the point of failure.  But remember that objects
allocated since the last collection will not have been marked, even if the
collector is functioning properly.  On some platforms, the collector
can be configured to save call chains in objects for debugging.
Enabling this feature will also cause it to save the call stack at the
point of the last GC in <TT>GC_arrays._last_stack</tt>.
<LI> When looking at GC internal data structures remember that a number
of <TT>GC_</tt><I>xxx</i> variables are really macro defined to
<TT>GC_arrays._</tt><I>xxx</i>, so that
the collector can avoid scanning them.
</ol>
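<P>
The descriptor check described in the steps above (low-order 2 bits zero
means "length of the prefix to scan") can be written down as a small
decoder for use while reading <TT>hb_descr</TT> values in a debugger
session.  The constant and function names below are local to this sketch,
and the tag layout is assumed to match the description in
<TT>gc_mark.h</TT> for your collector version, so check that header before
relying on it.

```c
#include <stddef.h>

/* Local constants for this sketch only; the authoritative definitions
 * are in gc_mark.h.  Assumption: a 2-bit tag, with tag 0 meaning that
 * the descriptor is a plain length. */
enum { DS_TAG_BITS = 2, DS_TAG_MASK = (1 << DS_TAG_BITS) - 1, DS_LENGTH = 0 };

/* Nonzero if the descriptor is a simple length descriptor. */
int descr_is_length(size_t descr)
{
    return (descr & DS_TAG_MASK) == DS_LENGTH;
}

/* For a length descriptor, the number of leading bytes of each object
 * that are scanned for pointers.  For any other tag this sketch returns
 * 0, meaning "not decoded here", not "nothing is scanned". */
size_t scanned_prefix_bytes(size_t descr)
{
    return descr_is_length(descr) ? descr : 0;
}
```

For instance, a descriptor of 64 decodes as "scan the first 64 bytes",
while a descriptor of 0 is a length descriptor with an empty prefix, so a
pointer stored in such an object will not keep its target alive.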
</body>
</html>