[erlang-questions] Automated Stripping of otp libraries / modules

Thu Jun 23 16:49:07 CEST 2011

Hi,

Dale Harvey asked about how to figure out what modules are actually
needed in a system. I promised to try out some different approaches on
the control system for the hardware I work on.

Executive summary:

   - Using my code:all_loaded() approach nets me 153 modules
   - Xref looks like a dead end. Xref gives me 475 modules in a simple analysis
   - Dialyzer might be able to do better. I don't know.

The rest of this post is long. Sorry. I should probably get a personal
blog, but then nobody would read it.

Hacky approach:

   I previously described what we actually do: run our test suite and
   then call code:all_loaded(). Simple, but only pulls in modules the
   test suite touches. Seems unbeatable for my purposes.
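
   A minimal sketch of that step, assuming the node runs in interactive
   mode (so modules are loaded on first use) and that the call is made in
   the same node right after the test suite finishes. The module name
   loaded_mods is made up for illustration:

```erlang
-module(loaded_mods).
-export([names/0]).

%% Return the sorted names of all modules currently loaded in this VM.
%% code:all_loaded/0 returns {Module, Loaded} pairs; only the module
%% name is kept here.
names() ->
    lists:sort([Mod || {Mod, _Loaded} <- code:all_loaded()]).
```

   The result also contains whatever the test framework itself loaded,
   so it is an upper bound on what the tested code touched.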

Use Xref to find module dependencies: 475 modules

   Xref is OTP's cross-referencing tool.

   There's an analyzer in xref which reads .beam files and produces a
   call graph. A call graph is just a (big) list of which function
   calls which other function. So we could use that to see which modules
   are actually needed in a system.

   The other part of xref is a query language which lets you determine
   things about the call graph.

   Anyway, let's just dive in. The call

      {ok, Modules} = xref:analyse(Xref, {module_call, [gth_mop]}) 

   returns a list of modules used by gth_mop. gth_mop is the 'entry point'
   for my system. So if I keep calling

      {ok, More} = xref:analyse(Xref, {module_call, Modules}) 

   and adding the results to Modules until no new modules turn up, then
   I've got every module the system needs.
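
   The repeated call can be sketched as a fixed-point loop. This is an
   illustration, not the exact code I ran: the module name mod_closure and
   the helper fixpoint/3 are made up, and it assumes every module reached
   has already been added to the xref server (otherwise xref:analyse/2
   returns an error and the match fails).

```erlang
-module(mod_closure).
-export([closure/2, fixpoint/3]).

%% Xref is the name or pid of an already populated xref server.
%% Seeds is the list of entry-point modules, e.g. [gth_mop].
closure(Xref, Seeds) ->
    Calls = fun(Mods) ->
                {ok, Called} = xref:analyse(Xref, {module_call, Mods}),
                Called
            end,
    Set = ordsets:from_list(Seeds),
    fixpoint(Calls, Set, Set).

%% Frontier holds modules whose callees have not been looked up yet.
%% Seen accumulates every module found so far. Both are ordsets.
fixpoint(_Calls, [], Seen) ->
    Seen;
fixpoint(Calls, Frontier, Seen) ->
    Called = ordsets:from_list(Calls(Frontier)),
    New = ordsets:subtract(Called, Seen),
    fixpoint(Calls, New, ordsets:union(Seen, New)).
```

   Passing the lookup in as a fun keeps the loop itself independent of
   xref, so it can be tried out on a hand-written module table.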

   The list produced this way is huge, 475 modules, and includes a
   bunch of things which are obviously _not_ needed, e.g. 'wx' on 
   a system which can't possibly run 'wx'. Not so good.

Use reltool (Håkan Mattsson's suggestion)

   It looks like reltool just uses 'xref'. Instead of using
   xref:analyse/2, it uses a query, xref:q(Pid, "UM").  I'm not sure
   if it does that recursively or not, but I can't see how it can
   solve this problem better than xref. But I'm no reltool expert.

Can Xref do a better job?

   My first approach with xref is crap. Imagine this system with three
   modules:

     system entry point is m:f
     m:f calls n:f
     n:f calls nothing
     n:g calls o:f

   o:f can't be reached from m:f, but xref's module_call analysis will
   include 'o' in the results. So the module_call analysis is not 
   the right way to go for this problem. We need to use the call graph.

   There's probably an xref query to do what I want, but thinking in
   terms of xref's query language is beyond me*. So I just get xref to
   give me the call graph edges, like this (E stands for edge):

      > xref:q(Xref, "E").
      {ok,[{{m,f,0},{n,f,0}},  {{n,g,0},{o,f,0}}]}

   You can see from the call graph that o:f/0 isn't reachable from m:f/0.
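
   Walking those edges from the entry point is straightforward. A rough
   sketch, with a made-up module name (reach) and a naive scan of the
   edge list for every function, which is fine for illustration but slow
   on a real system:

```erlang
-module(reach).
-export([modules/2]).

%% Edges is a list of {{M,F,A}, {M2,F2,A2}} pairs, as returned inside
%% the {ok, Edges} result of xref:q(Xref, "E").
%% Entry is the {M,F,A} to start from.
%% Returns the sorted list of modules that own a reachable function.
modules(Edges, Entry) ->
    Funs = walk([Entry], Edges, sets:new()),
    lists:usort([M || {M, _F, _A} <- sets:to_list(Funs)]).

%% Depth-first walk over the call graph, skipping functions already seen.
walk([], _Edges, Seen) ->
    Seen;
walk([Fun | Rest], Edges, Seen) ->
    case sets:is_element(Fun, Seen) of
        true ->
            walk(Rest, Edges, Seen);
        false ->
            Callees = [To || {From, To} <- Edges, From =:= Fun],
            walk(Callees ++ Rest, Edges, sets:add_element(Fun, Seen))
    end.
```

   On the example edges, starting from {m,f,0} this yields m and n only;
   o is left out because nothing reachable calls into it.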

   But this falls in a heap as soon as you use 'spawn' or M:F in
   even slightly tricky ways, e.g.

       go() ->
          B = b,
          spawn(fun() -> B:f() end).

   call graph: [{{mml,go,0},{'$M_EXPR',f,0}}, {{mml,go,0},{erlang,spawn,1}}]

   '$M_EXPR' is xref-speak for "I don't know which module this is".   
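
   At least these cases are easy to spot in the edge list. A small sketch
   (the module name unresolved is made up) which picks out the edges whose
   callee module xref could not determine, so they can be checked by hand:

```erlang
-module(unresolved).
-export([calls/1]).

%% Edges is a list of {{M,F,A}, {M2,F2,A2}} pairs from xref:q(Xref, "E").
%% Return only the edges where the called module is '$M_EXPR',
%% i.e. where xref does not know which module is being called.
calls(Edges) ->
    [Edge || {_From, {'$M_EXPR', _F, _A}} = Edge <- Edges].
```

   Any function that shows up on the left of such an edge is a place
   where a reachability walk over the call graph may miss modules.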

Can Dialyzer do this better?

   Dialyzer is remarkably good at finding dead code, so I wonder if
   it can produce a call graph better than xref does. But I've already
   spent the better part of a day on xref, so poking around in dialyzer
   will have to wait.

Matt

 * The xref manpage says that xref has a "simple query language". The
   language has more than 20 predefined variables, a bunch of
   operators including |, || and |||, regular expressions and a cast
   syntax. I don't think that qualifies as "simple", unless your hobby
   is designing query languages.