<div dir="ltr"><div>Hi Lukas & Gopi,</div><div><br></div><div><a href="https://groups.google.com/g/rabbitmq-users/c/BYJzgySEdr8">There is a week-long thread in rabbitmq-users</a>, the RabbitMQ equivalent of erlang-questions, which has some interesting details from David Cook that you may find useful Lukas. For example, David picked up on some interesting behaviour coming from <a href="https://github.com/erlang/otp/blob/master/erts/lib_src/pthread/ethr_event.c#L65-L161">https://github.com/erlang/otp/blob/master/erts/lib_src/pthread/ethr_event.c#L65-L161</a> when <font face="monospace">+sbwt none</font> is used.</div><div><br></div><div>Sharing some insights that might help:</div><div><br></div><div><a href="https://github.com/docker-library/rabbitmq/blob/35b41e318d9d9272126f681be74bcbfd9712d71b/Dockerfile-ubuntu.template#L134">Erlang in RabbitMQ Docker image is compiled with extra microstate accounting</a>, which will help better understand where scheduler time is spent. This is what I'm getting in a freshly booted RabbitMQ node, with no queues / connections / channels:</div><div><br></div><div><font face="monospace"><span style="background-color:rgb(255,242,204)">rabbitmq-diagnostics runtime_thread_stats --sample-interval</span><br>Will collect runtime thread stats on rabbit@5e8752d68ad2 for 60 seconds...<br>Average thread real-time    : 60001084 us<br>Accumulated system run-time :  1498785 us<br>Average scheduler run-time  :   179363 us<br><br>        Thread    alloc      aux      bifbusy_wait check_io emulator      ets       gc  gc_full      nif    other     port     send    sleep   timer<br></font></div><div><font face="monospace">Stats per type:<br>         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%<br>           aux    0.00%    0.00%    0.00%    0.00%    0.01%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.98%    0.00%<br>dirty_cpu_sche    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%<br>dirty_io_sched    0.00%    0.00%    0.00%    0.01%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.99%    0.00%<br>          poll    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%<br>     scheduler    0.01%    0.01%    0.02%    0.19%    0.00%    0.04%    0.00%    0.01%    0.00%    0.00%    0.01%    0.00%    0.00%   99.70%    0.00%</font><br></div><div><br></div><div>All beam.smp flags that RabbitMQ v3.8.9 (latest stable) uses on Erlang/OTP v23.1.1 by default:</div><div><br></div><div><font face="monospace">/usr/local/lib/erlang/erts-11.1.1/bin/beam.smp -W w -K true -A 128 -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048576 -t 5000000 -stbt db -zdbbl 128000 -B i</font><br></div><div><br></div><div><a href="https://github.com/rabbitmq/tgir/blob/2d4dc1bde3f9ead0fe3555bcc2dffafae4422978/s01/e01/Makefile#L50-L58">It takes 1 gcloud CLI command</a> to spin up this container image on a dedicated VM that will have the same performance as if RabbitMQ was installed from package.</div><div><br></div><div>By default, every 5 seconds, a lot of metrics are collected from all queues / channels / connections & even the OS itself and stored in ETS. This will cause periodic CPU spikes (every 5s) even when there is no publishing/consuming activity. 
Erlang processes to watch for: <font face="monospace">rabbit_mgmt_external_stats</font> (~6k reductions every 5s), <font face="monospace">vm_memory_monitor</font> (~2.5k reductions every 5s) & <font face="monospace">rabbit_disk_monitor</font> (~2k reductions every 5s).</div><div><br></div><div>rabbitmq-diagnostics observer will show us which processes use the most reductions in your case, Gopi. Quick how-to: <a href="https://gerhard.io/slides/observe-understand-rabbitmq/#/53">start it</a>, then <a href="https://gerhard.io/slides/observe-understand-rabbitmq/#/54">filter by reductions</a> (there is also a recon one-liner in the P.S. below that returns the same list without the UI). This is what the text form looks like for me on a freshly booted node:</div><div><br></div><div><font face="monospace">|Home(H)|Network(N)|System(S)|Ets(E)/Mnesia(M)|App(A)|Doc(D)|Plugin(P)recon:proc_window(reductions, 29, 5000) Interval | 0Days 0:8:28     |<br>|Erlang/OTP 23 [erts-11.1.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:128]                                                  |<br>|System     | Count/Limit           | System Switch             | Status                | Memory Info          | Size                     |<br>|Proc Count | 565/1048576           | Version                   | 23.1.1                | Allocted Mem         | 123.8125 MB     | 100.0% |<br>|Port Count | 11/1048576            | Multi Scheduling          | enabled               | Use Mem              | 88.6950 MB      | 71.64% |<br>|Atom Count | 46994/5000000         | Logical Processors        | 8                     | Unuse Mem            | 35.1175 MB      | 28.36% |<br>|Mem Type   | Size                  | Mem Type                  | Size                  | IO/GC                | Interval: 5000ms         |<br>|Total      | 99.6300 MB   | 100.0% | Binary                    | 147.4844 KB  | 00.14% | IO Output            | 7.1914 KB                |<br>|Process    | 36.7505 MB   | 36.89% | Code                      | 31.0206 MB   | 31.14% | IO Input             | 140 B                    |<br>|Atom       | 1.3694 MB    | 01.37% | Reductions                | 266508                | Gc Count             | 20                       |<br>|Ets        | 3.5130 MB    | 03.53% | RunQueue/ErrorLoggerQueue | 1/0                   | Gc Words Reclaimed   | 434099                   |<br>||1                         00.52% |5                         00.01% |9                         00.00% |13                         00.00% |<br>||2                         00.01% |6                         00.01% |10                        00.00% |14                         00.00% |<br>||3                         00.01% |7                         00.01% |11                        00.00% |15                         00.00% |<br>||4                         00.01% |8                         00.01% |12                        00.00% |16                         00.03% |<br>|No | Pid        |     Reductions      |Name or Initial Call                  |      Memory | MsgQueue |Current Function                  |<br>|1  |<0.702.0>   |5827                 |rabbit_mgmt_external_stats            |  139.6016 KB| 0        |gen_server:loop/7                 |<br>|2  |<0.352.0>   |2472                 |vm_memory_monitor                     |  119.2305 KB| 0        |gen_server:loop/7                 |<br>|3  |<0.354.0>   |2095                 |rabbit_disk_monitor                   |  119.2305 KB| 0        |gen_server:loop/7                 |<br>|4  |<0.237.0>   |1277                 |aten_emitter                          |    6.7227 KB| 0        |gen_server:loop/7           
      |<br>|5  |<0.59.0>    |496                  |file_server_2                         |   66.3164 KB| 0        |gen_server:loop/7                 |<br>|6  |<0.44.0>    |403                  |application_controller                |  674.3008 KB| 0        |gen_server:loop/7                 |<br>|7  |<0.724.0>   |374                  |node_coarse_metrics_metrics_collector |   33.6211 KB| 0        |gen_server:loop/7                 |<br>|8  |<0.599.0>   |305                  |rabbit_node_monitor                   |   21.2813 KB| 0        |gen_server:loop/7                 |<br>|9  |<0.357.0>   |290                  |file_handle_cache                     |   10.2695 KB| 0        |gen_server2:process_next_msg/1    |<br>|10 |<0.723.0>   |262                  |node_persister_metrics_metrics_collect|   25.8945 KB| 0        |gen_server:loop/7                 |<br>|11 |<0.591.0>   |237                  |rabbit_event                          |   33.5430 KB| 0        |gen_event:fetch_msg/6             |<br>|12 |<0.725.0>   |236                  |node_metrics_metrics_collector        |  106.7383 KB| 0        |gen_server:loop/7                 |<br>|13 |<0.726.0>   |234                  |node_node_metrics_metrics_collector   |   28.8398 KB| 0        |gen_server:loop/7                 |<br>|14 |<0.209.0>   |219                  |timer_server                          |   11.7031 KB| 0        |gen_server:loop/7                 |<br>|15 |<0.727.0>   |178                  |connection_churn_metrics_metrics_colle|   21.1211 KB| 0        |gen_server:loop/7                 |<br>|16 |<0.618.0>   |175                  |rabbit_memory_monitor                 |   11.6758 KB| 0        |gen_server2:process_next_msg/1    |<br>|17 |<0.266.0>   |156                  |sysmon_handler_filter                 |   24.0313 KB| 0        |gen_server:loop/7                 |<br>|18 |<0.741.0>   |139                  |cowboy_clock                          |    5.6836 KB| 0        |gen_server:loop/7                 |<br>|19 |<0.3.0>     |139                  |erts_dirty_process_signal_handler:star|    2.5391 KB| 0        |erts_dirty_process_signal_handler:|<br>|20 |<0.999.0>   |84                   |inet_tcp_dist:do_accept/7             |    8.5078 KB| 0        |dist_util:con_loop/2              |<br>|21 |<0.350.0>   |72                   |rabbit_alarm                          |    8.6094 KB| 0        |gen_event:fetch_msg/6             |<br>|22 |<0.238.0>   |63                   |aten_detector                         |   23.9883 KB| 0        |gen_server:loop/7                 |<br>|23 |<0.722.0>   |53                   |queue_coarse_metrics_metrics_collector|    5.6836 KB| 0        |gen_server:loop/7                 |<br>|24 |<0.50.0>    |47                   |code_server                           |  416.4180 KB| 0        |code_server:loop/1                |<br>|25 |<0.721.0>   |44                   |queue_metrics_metrics_collector       |    5.6836 KB| 0        |gen_server:loop/7                 |<br>|26 |<0.720.0>   |44                   |consumer_created_metrics_collector    |    5.6836 KB| 0        |gen_server:loop/7                 |<br>|27 |<0.719.0>   |44                   |channel_process_metrics_metrics_collec|    5.6836 KB| 0        |gen_server:loop/7                 |<br>|28 |<0.718.0>   |44                   |channel_exchange_metrics_metrics_colle|    5.6836 KB| 0        |gen_server:loop/7                 |<br>|29 |<0.717.0>   |44                   |channel_queue_metrics_metrics_collecto|    5.6836 KB| 0        
|gen_server:loop/7                 |<br>|q(quit) p(pause) r/rr(reduction) m/mm(mem) b/bb(binary mem) t/tt(total heap size) mq/mmq(msg queue) 9(proc 9 info) F/B(page forward/back |</font><br></div><div><br></div><div><a href="https://rabbitmq.github.io/rabbitmq-perf-test/stable/htmlsingle/#using-environment-variables-as-options">PerfTest</a> can be configured to periodically stop all publishing via e.g. <font face="monospace">--variable-rate 0:60 --variable-rate 100:60</font>, which translates to: don't publish for 60 seconds, then publish 100 msg/s for 60 seconds, and repeat (a full invocation is sketched in the P.S. below). Consumers will not put any pressure on RabbitMQ if messages are pushed rather than pulled (the default PerfTest behaviour), so as soon as all messages are consumed and publishing stops, there should be a few seconds of inactivity between the periodic metric collection runs.</div><div><br></div><div>In some cases, applications pull messages from RabbitMQ and run constant <font face="monospace">basic.get</font> operations, which can result in excessive CPU usage even when there are no messages. I don't think that this is your case, Gopi, but I'm mentioning it since it is known to cause massive expeditions down rabbit holes.</div><div><br></div><div>I hope this moves the discussion in the right direction.</div><div><br></div><div>Thanks all, Gerhard.</div><div></div></div>
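<div dir="ltr"><div><br></div><div>P.S. A few sketches that may help with hands-on experimentation. To try <font face="monospace">+sbwt none</font> on a RabbitMQ node, extra emulator flags can be passed via the <font face="monospace">RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS</font> environment variable, e.g. in <font face="monospace">rabbitmq-env.conf</font>. Treat this as a sketch, not a recommendation: the path below is just the usual Debian/Ubuntu location, <font face="monospace">+sbwtdcpu</font> / <font face="monospace">+sbwtdio</font> need OTP 20.2.3+ (fine on 23.1.1), and whether disabling scheduler busy waiting helps or hurts depends entirely on the workload:</div><div><br></div><div><font face="monospace"># /etc/rabbitmq/rabbitmq-env.conf (path assumed, adjust for your install)<br># Disable scheduler busy waiting on normal, dirty CPU & dirty IO schedulers<br>RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none +sbwtdcpu none +sbwtdio none"</font></div></div>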
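<div dir="ltr"><div><br></div><div>The reductions view can also be pulled without the observer UI. RabbitMQ ships recon (the observer output above is literally calling <font face="monospace">recon:proc_window/3</font>), so something along these lines should return the top 5 processes by reductions over a 5-second window; the exact shape of the output is recon's, not RabbitMQ's:</div><div><br></div><div><font face="monospace">rabbitmqctl eval 'recon:proc_window(reductions, 5, 5000).'</font></div></div>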
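<div dir="ltr"><div><br></div><div>And a minimal PerfTest invocation with the variable publish rate mentioned above, assuming the binary distribution launcher; everything apart from <font face="monospace">--variable-rate</font> (the local URI, 1 publisher, 1 auto-acking consumer) is just a placeholder to keep the example small:</div><div><br></div><div><font face="monospace">bin/runjava com.rabbitmq.perf.PerfTest \<br>  --uri amqp://localhost:5672 \<br>  --producers 1 --consumers 1 --autoack \<br>  --variable-rate 0:60 --variable-rate 100:60</font></div></div>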