[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Python's GC approach



During my Python talk, Matthias asked me why Python had its own
garbage collection instead of using Hans Boehms.  My answer wasn't
particularly good, because I wasn't much involved in the design of the
collector.  Historical considerations, of course, played a role;
Python started with a reference-counting-only scheme in the early 90s
and the relatively recent addition of a cycle-detection phase built
on reference counting rather than replacing it.

I discussed some of the considerations with colleagues here and can
offer a better answer now.  I'm not familiar with the details of
Boehm's approach, so I still wouldn't call this a fully informed
answer.

First a subjective comment: I believe Python uses (mallocs/frees) a
lot of memory relative to the sorts of C/C++ programs that the Boehm
collector is typically used with.  Take the simple example:

    for i in xrange(100000):
        print i # or, generally, do something with the ints

The range() builtin generates 100,000 integer objects, which consumes
12 bytes + malloc overhead on a 32-bit platform.  Since Python
programs spend a lot of time allocating and deallocating, there is a
dedicated free list for integers.  Reference counting does pretty well
here, as a very small number of allocated ints can handle all of the
allocations because of the free list.

As I emphasized in my talk, Python's implementation favors simplicity
and portability over performance.  I think the Boehm-gc approach is at
odds with this design goal.  I just glanced at the various platform
READMEs in the current implementation.  They have warnings like this
in the Linux file:

    To use threads, you need to abide by the following requirements:

    1) You need to use LinuxThreads (which are included in libc6).

       The collector relies on some implementation details of the LinuxThreads
       package....

    3a) Every file that makes thread calls should define LINUX_THREADS and 
       _REENTRANT and then include gc.h.  Gc.h redefines some of the
       pthread primitives as macros which also provide the collector with
       information it requires....
     
    5) The combination of LINUX_THREADS, REDIRECT_MALLOC, and incremental
       collection fails in seemingly random places....

The README for HP-UX is similar:

    Thread support for HP/UX 11 Pthreads was also recently added.  It
    is still flakey in this release.  (It has only been tested on a
    uniprocessor.  Even there some fraction of thread creation calls
    fail with a not-yet-understood error return from sem_wait.)

None of this sounds particularly portable or reliable.  Python runs
with several different pthreads implementations and it is often
extended with or embedded in foreign C code that uses threads.  It's
also straightforward to port Python to new platforms like the
Playstation or Palm OS, which would presumably be difficult if the
collector had to be ported first.

An earlier effort to integrate the collector with Python ran into
problems with the Python Tk bindings.  Apparently Tk had references to
Python objects that the collector couldn't find; thus, objects
referenced by Tk were collected when they shouldn't have been.  See
the newsgroup thread for details:
http://groups.google.com/groups?selm=000401bed023%24d08d3360%2406a22299%40tim