[erlang-questions] Fast directory walker

Sat Dec 10 14:13:16 CET 2016

I just had a look at the GNU find source, it uses fts, man fts(3). I didn't bother digging deeper into glibc to see how fts is handling the inodes.

Frank's earlier non-Erlang examples seem to be using dedicated file hierarchy walker modules, perhaps those are using the fts, or similar, C functions. In which case they are being compared with the 
pure erlang implementation of the tree walker.

Staying with pure erlang version, a concurrent/multi-process tree walker could be an interesting project in itself :)

Cheers,
f.

On 10/12/16 11:25, Joe Armstrong wrote:
> This is very interesting - I've often wondered why directory traversal
> is faster in C than
> Erlang since all Erlang is doing is calling C primitives -
>
> I think measuring the times for this is difficult - if for example you
> run a python test immediately followed by an erlang test, I'd expect
> you get a different result than if you run the erlang
> test with a "cold" cache - running the python program will have the
> side effect of loading
> various memory buffers - I'd also expect different results on
> different OS and different
> results depending upon your physical memory sizes and the size of the
> trees you are traversing.
>
> I've often wondered if fast C programs achieve their speed by directly
> poking around in the
> underlying inode structures, but this would involve detailed knowledge
> of the underlying
> file system (ie is an ext3, HFS+, FAT32 etc.)
>
> Programs like rsync and the Dropbox sync algorithm seem unreasonably
> fast to me -
> I wonder if they poke around in the underlying OS representation of
> the file system and not
> use a more portable and easier to use interface.
>
> /Joe
>
>
>
>
> On Sat, Dec 10, 2016 at 11:33 AM, Fred Youhanaie <fly@REDACTED> wrote:
>>
>> Out of interest, what do you get when you run the same benchmark twice in
>> succession?
>
>
>
>>
>> The buffer cache would play a role on the second and subsequent attempts.
>> For your code below, running on my oldish laptop, I get {35667060,158949}
>> and {8606920,158949}.
>>
>> You can use "blockdev --flushbufs <device>" to ensure the buffers are clean
>> before running the benchmarks, especially when comparing different versions
>> and languages.
>>
>> Having said that, the erlang version does look slow, especially when
>> compared with the shell equivalent "time find /usr/share -print | wc -l"
>>
>> $ sudo blockdev --flushbufs /dev/...
>>
>> $ time find /usr/share -print | wc -l
>> 186911
>>
>> real    0m32.446s
>> user    0m0.796s
>> sys     0m1.808s
>>
>> $ time find /usr/share -print | wc -l
>> 186911
>>
>> real    0m0.336s
>> user    0m0.152s
>> sys     0m0.200s
>>
>>
>> Perhaps there is room for improvement within the library itself!
>>
>> Cheers,
>> f.
>>
>>
>> On 10/12/16 09:20, Frank Muller wrote:
>>>
>>> Combining previous hints (Benoit, Sergej):
>>>
>>> -module(directory).
>>> -include_lib("kernel/include/file.hrl").
>>> -export([walker/1]).
>>>
>>> walker(Path) ->
>>>     case file:read_file_info(Path, [raw]) of
>>>         {ok, #file_info{type = regular}} ->
>>>             1;
>>>         _ -> %% not care about symlink for now, assume a directory
>>>             Children = filelib:wildcard(Path ++ "/*"),
>>>             lists:foldl(fun(P, N) -> N + walker(P) end, 0, Children)
>>> end.
>>>
>>>> timer:tc(fun() -> directory:walker("/usr/share") end).
>>>
>>> {1611688, <tel:1611688,28953>/28953 <tel:1611688,28953>/}
>>>
>>> I'm only counting number of files in this case.
>>>
>>> /Frank
>>>
>>> Le sam. 10 déc. 2016 à 10:05, Sergej Jurečko <sergej.jurecko@REDACTED
>>> <mailto:sergej.jurecko@REDACTED>> a écrit :
>>>
>>>     read_file_info does the job of is_dir and file_size in a single call.
>>> That was the intention.
>>>
>>>     Also use file:read_file_info(name,[raw])
>>>
>>>
>>>     Sergej
>>>
>>>>     On 10 Dec 2016, at 09:42, Benoit Chesneau <bchesneau@REDACTED
>>>> <mailto:bchesneau@REDACTED>> wrote:
>>>>
>>>>     this is kind of bullshit (sorry ;).... at the end this is what does
>>>> the helpers in filelib:
>>>>
>>>> https://github.com/erlang/otp/blob/maint/lib/stdlib/src/filelib.erl#L257
>>>>
>>>>     except if you have a better algorithm in mind i don't se the point of
>>>> rewriting something that is aleaready existing ...
>>>>
>>>>     On Sat, 10 Dec 2016 at 09:36, Sergej Jurečko
>>>> <sergej.jurecko@REDACTED <mailto:sergej.jurecko@REDACTED>> wrote:
>>>>
>>>>         Stop using filelib functions. Use file:read_file_info and
>>>> file:list_dir.
>>>>
>>>>         Sergej
>>>>
>>>>         On Dec 10, 2016 9:29 AM, "Frank Muller"
>>>> <frank.muller.erl@REDACTED <mailto:frank.muller.erl@REDACTED>> wrote:
>>>>
>>>>             Hi Stanislaw
>>>>
>>>>             First, I don't care if I've to use documented/undocumented
>>>> calls as long as I can achieve my goal: faster dir walking.
>>>>
>>>>             And you're right, here is a detailed comparison with other
>>>> scripting languages:
>>>>
>>>>             In my /usr/share, there’s:
>>>>             2580 directories
>>>>             28953 files
>>>>
>>>>             1. Erlang (no io:format/1, just recurse):
>>>>
>>>>             walk(Dir) ->
>>>>                 {ok, Files} = file:list_dir(Dir),
>>>>                 walk(Dir, Files).
>>>>
>>>>             walk(Dir, [ Basename | Rest ]) ->
>>>>                 Path = filename:join([ Dir, Basename ]),
>>>>                 case filelib:is_dir(Path) of
>>>>                     true  ->
>>>>                         walk(Path);
>>>>                     false ->
>>>>                       %%  io:format("~s~n", [Path]),
>>>>                         filelib:file_size(Path)
>>>>                 end,
>>>>                 walk(Dir, Rest);
>>>>             walk(_, []) ->
>>>>                 ok.
>>>>
>>>>             timer:tc(fun() -> directoy:walker("/usr/share") end).
>>>>             {4662361 <tel:4662361>,ok}
>>>>
>>>>             2. Python (this code even count the size of dir):
>>>>             From:
>>>> http://stackoverflow.com/questions/1392413/calculating-a-directory-size-using-python
>>>>
>>>>             import os
>>>>             def get_size(start_path = '.'):
>>>>                 total_size = 0
>>>>                 for dirpath, dirnames, filenames in os.walk(start_path):
>>>>                     for f in filenames:
>>>>                         fp = os.path.join(dirpath, f)
>>>>                         total_size += os.path.getsize(fp)
>>>>                 return total_size
>>>>
>>>>             print get_size()
>>>>
>>>>             $ cd /usr/share
>>>>             $ time dir_walker.py
>>>>             432034130 <tel:432034130>
>>>>             0.25 real         0.13 user         0.10 sys
>>>>
>>>>             2. Perl (same, count dir size)
>>>>             http://www.perlmonks.org/?node_id=168974
>>>>
>>>>             use File::Find;
>>>>             my $size = 0;
>>>>             find(sub { $size += -s if -f $_ }, "/usr/share");
>>>>
>>>>             $ time perl dir_walker.pl <http://dir_walker.pl/>
>>>>             432034130 <tel:432034130>
>>>>             0.13 real         0.05 user         0.08 sys
>>>>
>>>>             3. Ruby (same, count dir size):
>>>>
>>>>             def directory_size(path)
>>>>               path << '/' unless path.end_with?('/')
>>>>               raise RuntimeError, "#{path} is not a directory" unless
>>>> File.directory?(path)
>>>>               total_size = 0
>>>>               Dir["#{path}**/*"].each do |f|
>>>>                 total_size += File.size(f) if File.file?(f) &&
>>>> File.size?(f)
>>>>               end
>>>>               total_size
>>>>             end
>>>>             puts directory_size '/usr/share’
>>>>
>>>>             $ time walker.rb
>>>>             432028422 <tel:432028422>
>>>>
>>>>             0.21 real         0.09 user         0.11 sys
>>>>
>>>>             4. Lua:
>>>>             From: http://lua-users.org/wiki/DirTreeIterator
>>>>
>>>>             require "lfs"
>>>>
>>>>             function dirtree(dir)
>>>>               assert(dir and dir ~= "", "directory parameter is missing
>>>> or empty")
>>>>               if string.sub(dir, -1) == "/" then
>>>>                 dir=string.sub(dir, 1, -2)
>>>>               end
>>>>
>>>>               local function yieldtree(dir)
>>>>                 for entry in lfs.dir(dir) do
>>>>                   if entry ~= "." and entry ~= ".." then
>>>>                     entry=dir.."/"..entry
>>>>             local attr=lfs.attributes(entry)
>>>>             coroutine.yield(entry,attr)
>>>>             if attr.mode == "directory" then
>>>>               yieldtree(entry)
>>>>             end
>>>>                   end
>>>>                 end
>>>>               end
>>>>
>>>>               return coroutine.wrap(function() yieldtree(dir) end)
>>>>             end
>>>>
>>>>             for filename, attr in dirtree("/usr/share") do
>>>>                   print(attr.mode, filename)
>>>>             end
>>>>
>>>>             $ luarocks install luafilesystem
>>>>             $ time lua walker.lua > /dev/null
>>>>             0.30 real         0.16 user         0.14 sys
>>>>
>>>>             Do you need more?
>>>>
>>>>             Thanks for you help.
>>>>             /Frank
>>>>
>>>>             Le sam. 10 déc. 2016 à 00:51, Stanislaw Klekot
>>>> <erlang.org@REDACTED <mailto:erlang.org@REDACTED>> a écrit :
>>>>
>>>>                 On Fri, Dec 09, 2016 at 11:15:58PM +0000, Frank Muller
>>>> wrote:
>>>>
>>>>                 > I would like to improve the speed of my directory
>>>> walker.
>>>>
>>>>                 >
>>>>
>>>>                 > walk(Dir) ->
>>>>
>>>>                 >     {ok, Files} = prim_file:list_dir(Dir),
>>>>
>>>>                 >     walk(Dir, Files).
>>>>
>>>>
>>>>
>>>>                 Why prim_file:list_dir() instead of file:list_dir()? The
>>>> former is
>>>>
>>>>                 undocumented internal function.
>>>>
>>>>
>>>>
>>>>                 [...]
>>>>
>>>>                 > Compared to almost anything i found on the web, it’s
>>>> still very slow:
>>>>
>>>>                 > > timer:tc(fun() -> dir:walk("/usr/share") end).
>>>>
>>>>                 > {4662361,ok}
>>>>
>>>>
>>>>
>>>>                 What is it this "anything you found on the web"? And how
>>>> did you run
>>>>
>>>>                 your comparisons? There's a large difference between
>>>> first and second
>>>>
>>>>                 consequent run caused by OS' directory cache, and there's
>>>> large
>>>>
>>>>                 difference between simply walking through the directory
>>>> and walking with
>>>>
>>>>                 printing something to the screen for every file.
>>>>
>>>>
>>>>
>>>>                 Then there's also your using filelib:is_dir() and then
>>>>
>>>>                 filelib:file_size(), which means two stat(2) calls, while
>>>> you only need
>>>>
>>>>                 to do it once per file (file:read_file_info()).
>>>>
>>>>
>>>>
>>>>                 --
>>>>
>>>>                 Stanislaw Klekot
>>>>
>>>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>