[erlang-questions] widefinder update

Mon Oct 29 16:25:02 CET 2007

On 10/29/07, Thomas Lindgren <thomasl_erlang@REDACTED> wrote:
>
> --- Steve Vinoski <vinoski@REDACTED> wrote:
>
> > On 10/28/07, Anders Nygren <anders.nygren@REDACTED>
> > wrote:
> > >
> > > On 10/28/07, Hynek Vychodil
> > <vychodil.hynek@REDACTED> wrote:
> > > > Hi Anders,
> > > > I rewrote your code a little. I removed all
> > remaining binary bindings
> > > > and it is noticeable faster again. Try
> > wf_pichi3.erl.
> > > >
> > >
> > > Hynek
> > > that was great, Your change brings my wfinder1_1 +
> > wfbm4_ets1_1
> > > down to
> > > real    0m1.118s
> > > user    0m1.640s
> > > sys     0m0.368s
> > > on my 1.66 GHz dual core laptop.
> >
> >
> > And on the 8-core 2.33GHz Intel Xeon Linux box with
> > 2 GB RAM, this version
> > is extremely fast:
> >
> > real 0m0.567s
> > user 0m2.249s
> > sys 0m0.956s
>
> (I'll ignore the unexplained sys time below. That
> makes the discussion a bit preliminary; perhaps the
> derived results should be computed some other way.
> Apply grain of salt appropriately.)
>
> For those keeping track, the latest result is fully
> 2.7 times faster than the best previous version (which
> was block read), and 17.3 times faster than the
> initial version. The latest speedup is basically due
> to doing less work.

Are we doing less work?
Not really, over 3 versions wfinder, wfinder1 and wfinder1_1
the only differences are how we do binary matching
- do NOT create sub binaries
- keep "pointers" for offsets inside one big binary
The reason that it get so much faster seems to be that
we have much less garbage collection.
wfinder
real    0m1.995s
user    0m3.164s
sys     0m0.396s
garbage collections:  46301
words reclaimed:   501744350

wfinder1
real    0m1.786s
user    0m2.792s
sys     0m0.364s
garbage collections:  36102
words reclaimed:  392187156

wfinder1_1
real    0m1.219s
user    0m1.660s
sys     0m0.376s
garbage collections: 10729
words reclaimed:  114930430

So the amount of garbage has been reduced by a factor > 4

>However, note that user time fell
> by a somewhat greater ratio than real time, which
> might mean parallelization overheads are becoming
> visible.
>
> Also, the user time of 2.249 seconds is now close to
> the Ruby user time, which were 2.095s on the same
> hardware, while the Erlang parallelization speedup
> (user/real) on top of this is 3.95 out of 8. Comparing
> the real times of Erlang (0.567s) and Ruby (2.21s), we
> get about the same execution time speedup, 3.9. Not
> too shabby, huh?
>
> Is there anything more to be wrung out of this
> program?

As I said yesterday, I think that the
current solution has a lower bound of ~0.5s no matter how
many cores You trow at it.

As Steve measured it
Workers  Real time
1              1.252
2              1.079
4              0.701
8              0.575
16            0.567

I think we have reached the limit on this track, we need to look at
this another way to make a solution, that probably is slower on
a low number of cores but that scales better on more > 4 cores.

>Well, apart from further tuning, one can note
> that Ruby had 0.1s sys time, while Erlang apparently
> needs a bit more, 0.5-1.0s. Why?
>
> Taking a wider view, it would also be very interesting
> to see how to apply these lessons to more general
> problems and/or libraries.

Some random thoughts
-The compiler should treat don't care variable (_Var) the same
as (_), and not bind them.
I like to be able to write
<<_Type:TLen/binary,_X:XLen/binary, Val:Len/binary....>>
instead of
SkipLen=TLen+XLen,
<<_:SkipLen/binary,Val:Len/binary...>>

- When doing folds over large binaries do not repetedly
split the binary in head and tail parts for recursive calls.
Keep an offset counter to track the position in the binary.

- One reason that it is even possible to solve the wide finder
in parallel is that it is possible to "resync" at a random place
in the file by locating a newline.
 I have an interest in processing files with BER coded data, so
I have been trying to make a BER version of widefinder, but since
it is necessary to scan the data sequentially to identify each
(TLV) block, it does not lend itself to the type of solution we
have for widefinder. My BER version currently takes
real    0m7.378s
user    0m6.768s
sys     0m0.876s

/Anders