FreeBSD 8 + R14B problems
Jan Koum
jan@REDACTED
Wed Nov 24 23:59:25 CET 2010
hi there,
until a few months ago we were running FreeBSD 7.3 and R13B4. our cluster started
having major problems, so we upgraded to FreeBSD 8 and R14B. ever since the
upgrade we have had two nagging issues which have caused us user-visible
downtime:
1. every few days we get "corrupted external term" from
erts/emulator/beam/external.c:bad_dist_ext(), which in turn calls
erts_kill_dist_connection(). the atom cache listing that follows is:
ATOM_CACHE_REF translations: 0='ejabberd@REDACTED', 1='',
2=xmlelement, 3=xmlcdata, 4=jid, 5=never, 6=offline_msg
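for what it's worth, next time this happens we are planning to subscribe to
nodedown messages with reasons so we can at least correlate the kill with a
specific connection. a rough, untested sketch (dist_watch is just a module
name we made up; the interesting part is net_kernel:monitor_nodes/2 with the
nodedown_reason option):

    -module(dist_watch).
    -export([start/0]).

    %% subscribe to nodeup/nodedown messages, asking for the nodedown reason
    start() ->
        spawn(fun() ->
                      ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
                      loop()
              end).

    loop() ->
        receive
            {nodedown, Node, Info} ->
                %% Info should contain {nodedown_reason, Reason}
                error_logger:warning_msg("nodedown ~p: ~p~n", [Node, Info]),
                loop();
            {nodeup, Node, _Info} ->
                error_logger:info_msg("nodeup ~p~n", [Node]),
                loop()
        end.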
2. our front-end servers run on a single-socket six-core Xeon 5670 (Westmere) with
hyperthreading, in essence giving us 12 logical CPUs:
FreeBSD/SMP: Multiprocessor System Detected: 12 CPUs
FreeBSD/SMP: 1 package(s) x 6 core(s) x 2 SMT threads
every few days one of our front-end nodes goes into a weird state where only
four (or two) threads are running and doing work instead of 12. each of them
eventually hits its CPU limit and the node starts building a backlog, bringing
the entire cluster down with it. it almost feels like a bug in the erlang
migration logic, in how it handles schedulers and run queues.
here is a comparison of the top(1) output:
healthy machine:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
28414 whatsapp 64 0 23073M 13302M CPU0 0 25:55 27.49% {beam.smp}
28414 whatsapp 65 0 23073M 13302M ucond 6 25:52 26.56% {beam.smp}
28414 whatsapp 64 0 23073M 13302M ucond 5 25:49 26.56% {beam.smp}
28414 whatsapp 67 0 23073M 13302M ucond 6 24:51 26.37% {beam.smp}
28414 whatsapp 64 0 23073M 13302M ucond 3 25:46 26.27% {beam.smp}
28414 whatsapp 63 0 23073M 13302M ucond 6 26:10 26.17% {beam.smp}
28414 whatsapp 64 0 23073M 13302M ucond 8 25:41 25.98% {beam.smp}
28414 whatsapp 64 0 23073M 13302M ucond 11 25:49 25.88% {beam.smp}
28414 whatsapp 64 0 23073M 13302M ucond 2 25:47 25.88% {beam.smp}
28414 whatsapp 62 0 23073M 13302M ucond 1 25:46 25.88% {beam.smp}
28414 whatsapp 62 0 23073M 13302M ucond 9 25:52 25.59% {beam.smp}
28414 whatsapp 63 0 23073M 13302M ucond 10 25:56 25.49% {beam.smp}
28414 whatsapp 44 0 23073M 13302M ucond 4 0:12 0.00% {beam.smp}
28414 whatsapp 44 0 23073M 13302M ucond 9 0:00 0.00% {beam.smp}
28414 whatsapp 44 0 23073M 13302M ucond 8 0:00 0.00% {beam.smp}
problematic machine:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
33329 whatsapp 109 0 31021M 17250M CPU11 11 17.3H 62.79% {beam.smp}
33329 whatsapp 76 0 31021M 17250M ucond 1 17.3H 62.70% {beam.smp}
33329 whatsapp 76 0 31021M 17250M kqread 0 17.4H 61.57% {beam.smp}
33329 whatsapp 109 0 31021M 17250M CPU2 2 17.5H 60.06% {beam.smp}
33329 whatsapp 71 0 31021M 17250M ucond 1 557:40 0.00% {beam.smp}
33329 whatsapp 70 0 31021M 17250M ucond 1 517:10 0.00% {beam.smp}
33329 whatsapp 76 0 31021M 17250M ucond 4 435:15 0.00% {beam.smp}
33329 whatsapp 76 0 31021M 17250M ucond 11 259:11 0.00% {beam.smp}
33329 whatsapp 66 0 31021M 17250M ucond 1 178:32 0.00% {beam.smp}
33329 whatsapp 59 0 31021M 17250M ucond 6 137:56 0.00% {beam.smp}
33329 whatsapp 54 0 31021M 17250M ucond 0 71:03 0.00% {beam.smp}
33329 whatsapp 54 0 31021M 17250M ucond 3 65:09 0.00% {beam.smp}
sorry in advance about possible text alignment issues, but you can clearly
see that in the problematic case only four beam.smp threads are doing work,
and the CPU time for each of those threads is quite high, roughly 2x that of
the healthy machine. from the TIME column you can also see that the other
threads did work at some point but then simply stopped.
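next time a node gets into this state we plan to attach a remote shell and
grab some basic scheduler numbers before restarting it. a rough sketch of what
we have in mind, all plain erlang:* calls (nothing custom), in case anyone
thinks other counters would be more useful:

    %% from a remote shell attached to the sick node:
    io:format("schedulers: ~p, online: ~p~n",
              [erlang:system_info(schedulers),
               erlang:system_info(schedulers_online)]),
    io:format("total run queue length: ~p~n", [erlang:statistics(run_queue)]),
    io:format("bind type: ~p~nbindings: ~p~n",
              [erlang:system_info(scheduler_bind_type),
               erlang:system_info(scheduler_bindings)]),
    io:format("process count: ~p~n", [erlang:system_info(process_count)]).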
we don't use any +S or +s options. procstat(1) output was identical on both
machines. is there anything we should be looking at in the future when this
bug happens again?
we would appreciate any help or suggestions regarding 1) or 2).
thanks,
-- jan