FreeBSD 8 + R14B problems
Jan Koum
jan@REDACTED
Wed Nov 24 23:59:25 CET 2010
hi there,
until a few months ago we were running FreeBSD 7.3 and R13B4. our cluster started
having major problems, so we upgraded to FreeBSD 8 and R14B. ever since the
upgrade we have had two nagging issues which have caused us user-visible
downtime:
1. every few days we get "corrupted external term" from
erts/emulator/beam/external.c:bad_dist_ext(), which in turn calls
erts_kill_dist_connection(). the atom cache listing that follows is:
ATOM_CACHE_REF translations: 0='ejabberd@REDACTED', 1='',
2=xmlelement, 3=xmlcdata, 4=jid, 5=never, 6=offline_msg
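for what it's worth, next time this happens we are planning to subscribe to
nodedown messages with reasons so we can at least correlate the kill with a
specific connection. a rough, untested sketch (dist_watch is just a module
name we made up; the interesting part is net_kernel:monitor_nodes/2 with the
nodedown_reason option):

    -module(dist_watch).
    -export([start/0]).

    %% subscribe to nodeup/nodedown messages, asking for the nodedown reason
    start() ->
        spawn(fun() ->
                      ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
                      loop()
              end).

    loop() ->
        receive
            {nodedown, Node, Info} ->
                %% Info should contain {nodedown_reason, Reason}
                error_logger:warning_msg("nodedown ~p: ~p~n", [Node, Info]),
                loop();
            {nodeup, Node, _Info} ->
                error_logger:info_msg("nodeup ~p~n", [Node]),
                loop()
        end.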
2. our front-end servers run on a single-socket six-core Xeon 5670 (Westmere) with
hyperthreading, in essence giving us 12 logical CPUs:
FreeBSD/SMP: Multiprocessor System Detected: 12 CPUs
FreeBSD/SMP: 1 package(s) x 6 core(s) x 2 SMT threads
every few days one of our front-end nodes goes into a weird state where only
four (or two) threads are running and doing work instead of 12. each of them
eventually hits its CPU limit and the node starts building a backlog, bringing
the entire cluster down with it. it almost feels like a bug in the erlang
migration logic, in how it handles schedulers and run queues.
here is a comparison of the top(1) output:
healthy machine:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
28414 whatsapp 64 0 23073M 13302M CPU0 0 25:55 27.49% {beam.smp}
28414 whatsapp 65 0 23073M 13302M ucond 6 25:52 26.56% {beam.smp}
28414 whatsapp 64 0 23073M 13302M ucond 5 25:49 26.56% {beam.smp}
28414 whatsapp 67 0 23073M 13302M ucond 6 24:51 26.37% {beam.smp}
28414 whatsapp 64 0 23073M 13302M ucond 3 25:46 26.27% {beam.smp}
28414 whatsapp 63 0 23073M 13302M ucond 6 26:10 26.17% {beam.smp}
28414 whatsapp 64 0 23073M 13302M ucond 8 25:41 25.98% {beam.smp}
28414 whatsapp 64 0 23073M 13302M ucond 11 25:49 25.88% {beam.smp}
28414 whatsapp 64 0 23073M 13302M ucond 2 25:47 25.88% {beam.smp}
28414 whatsapp 62 0 23073M 13302M ucond 1 25:46 25.88% {beam.smp}
28414 whatsapp 62 0 23073M 13302M ucond 9 25:52 25.59% {beam.smp}
28414 whatsapp 63 0 23073M 13302M ucond 10 25:56 25.49% {beam.smp}
28414 whatsapp 44 0 23073M 13302M ucond 4 0:12 0.00% {beam.smp}
28414 whatsapp 44 0 23073M 13302M ucond 9 0:00 0.00% {beam.smp}
28414 whatsapp 44 0 23073M 13302M ucond 8 0:00 0.00% {beam.smp}
problematic machine:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
33329 whatsapp 109 0 31021M 17250M CPU11 11 17.3H 62.79% {beam.smp}
33329 whatsapp 76 0 31021M 17250M ucond 1 17.3H 62.70% {beam.smp}
33329 whatsapp 76 0 31021M 17250M kqread 0 17.4H 61.57% {beam.smp}
33329 whatsapp 109 0 31021M 17250M CPU2 2 17.5H 60.06% {beam.smp}
33329 whatsapp 71 0 31021M 17250M ucond 1 557:40 0.00% {beam.smp}
33329 whatsapp 70 0 31021M 17250M ucond 1 517:10 0.00% {beam.smp}
33329 whatsapp 76 0 31021M 17250M ucond 4 435:15 0.00% {beam.smp}
33329 whatsapp 76 0 31021M 17250M ucond 11 259:11 0.00% {beam.smp}
33329 whatsapp 66 0 31021M 17250M ucond 1 178:32 0.00% {beam.smp}
33329 whatsapp 59 0 31021M 17250M ucond 6 137:56 0.00% {beam.smp}
33329 whatsapp 54 0 31021M 17250M ucond 0 71:03 0.00% {beam.smp}
33329 whatsapp 54 0 31021M 17250M ucond 3 65:09 0.00% {beam.smp}
sorry in advance about possible text alignment issues, but you can clearly
see that in the problematic case only four beam.smp threads are doing work,
and the CPU time for each of those threads is quite high, roughly 2x that of
the healthy machine. from the TIME column you can also see that the other
threads did work at some point but then simply stopped.
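next time a node gets into this state we plan to attach a remote shell and
grab some basic scheduler numbers before restarting it. a rough sketch of what
we have in mind, all plain erlang:* calls (nothing custom), in case anyone
thinks other counters would be more useful:

    %% from a remote shell attached to the sick node:
    io:format("schedulers: ~p, online: ~p~n",
              [erlang:system_info(schedulers),
               erlang:system_info(schedulers_online)]),
    io:format("total run queue length: ~p~n", [erlang:statistics(run_queue)]),
    io:format("bind type: ~p~nbindings: ~p~n",
              [erlang:system_info(scheduler_bind_type),
               erlang:system_info(scheduler_bindings)]),
    io:format("process count: ~p~n", [erlang:system_info(process_count)]).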
we don't use any +S or +s options. procstat(1) output was identical on both
machines. is there anything we should be looking at in the future when this
bug happens again?
we would appreciate any help or suggestions regarding 1) or 2).
thanks,
-- jan