Time series db: please help me optimize it
Ulf Wiger
ulf@REDACTED
Sun Jul 17 19:21:15 CEST 2005
I changed one line of code in tickplus.erl and added three
to tickdb.erl, and now the program seems to be flying along.
What I did: changed the gen_server:cast(DB, {'ADD'...}) to
a gen_server:call().
What was happening was that the process reading the tick
data sucked up the 119,275 TRADE lines in ES4U.csv,
and then asynchronously told the disk_log server to store
the parsed data to disk. The poor logging server's message
queue became extremely large, and garbage collection ate
up all the CPU time. Monitoring the werl.exe process in
Win XP (sigh), it kept growing, and the message queue
length of the DB process grew to at least 56,000 messages
before it started shrinking.
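The change can be sketched as a minimal gen_server (module and
message names here are hypothetical; the real tickdb does a
disk_log write in the handler):

```erlang
%% Minimal sketch of the cast->call change (hypothetical names;
%% the real tickdb writes the parsed tick to disk_log here).
-module(tickdb_sketch).
-behaviour(gen_server).
-export([start_link/0, add/2,
         init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Before: gen_server:cast(DB, {'ADD', Tick}) returned at once,
%% so a fast reader could queue up tens of thousands of messages.
%% After: the synchronous call blocks the reader until this tick
%% has been handled, so the DB mailbox never builds up a backlog.
add(DB, Tick) ->
    gen_server:call(DB, {'ADD', Tick}).

init([]) ->
    {ok, []}.

handle_call({'ADD', Tick}, _From, State) ->
    %% the real code would disk_log:log/2 the tick here
    {reply, ok, [Tick | State]}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

The call doubles as flow control: the reader can never run more
than one message ahead of the writer.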
My result from the modified code:
2> tickplus:load("ES4U","ES4U.csv").
New log: tick/price/20040901
New log: tick/size/20040901
New log: tick/price/20040902
New log: tick/size/20040902
New log: tick/price/20040903
New log: tick/size/20040903
done: 8641.000000ms
ok
Same thing with the original code:
1> tickplus:load("ES4U","ES4U.csv").
New log: tick/price/20040901
New log: tick/size/20040901
New log: tick/price/20040902
New log: tick/size/20040902
done: 263094.000000ms
ok
New log: tick/price/20040903
New log: tick/size/20040903
Obviously, "done" here only means that the
reader has sent its last message to the DB
process, but the DB process still has a backlog
to take care of.
Adding a 'sync' call to make sure that
the DB process is done:
3> tickplus:load("ES4U","ES4U.csv").
New log: tick/price/20040901
New log: tick/size/20040901
New log: tick/price/20040902
New log: tick/size/20040902
New log: tick/price/20040903
New log: tick/size/20040903
synced: 410234.000000ms
ok
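The 'sync' call itself could look like the sketch below
(hypothetical names again): because a gen_server serves its
mailbox in order, the reply to a synchronous no-op proves that
every 'ADD' sent before it has already been processed.

```erlang
%% Sketch of a 'sync' barrier on top of asynchronous adds
%% (hypothetical module; counts the adds so the effect is visible).
-module(syncdemo).
-behaviour(gen_server).
-export([start_link/0, add/2, sync/1,
         init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

add(DB, Tick) ->             %% asynchronous, as in the original code
    gen_server:cast(DB, {'ADD', Tick}).

sync(DB) ->                  %% blocks until the backlog is drained
    gen_server:call(DB, sync, infinity).

init([]) ->
    {ok, 0}.

handle_cast({'ADD', _Tick}, N) ->
    {noreply, N + 1}.

%% By the time 'sync' is answered, every 'ADD' queued before it
%% has already been handled, so N is the full count.
handle_call(sync, _From, N) ->
    {reply, {ok, N}, N}.
```

Note the infinity timeout: with a large backlog the default
5-second call timeout would give up long before the queue drains.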
BTW:
$ grep TRADE ES4U.csv > trade_lines.csv
$ ls -l *.csv *.LOG
-rwx------+ 1 Ägaren Ingen 682113 Jul 17 18:28 1826599643.LOG
-rwx------+ 1 Ägaren Ingen 664686 Jul 17 18:25 2449200400.LOG
-rwx------+ 1 Ägaren Ingen 1495347 Jul 17 18:28 3804582371.LOG
-rwx------+ 1 Ägaren Ingen 1847757 Jul 17 18:28 4022035944.LOG
-rwx------+ 1 Ägaren Ingen 1552230 Jul 17 18:25 676106329.LOG
-rwx------+ 1 Ägaren Ingen 802032 Jul 17 18:25 92519506.LOG
-rwx------+ 1 Ägaren Ingen 18386970 Jul 13 00:55 ES4U.csv
-rw-r--r-- 1 Ägaren Ingen 4920505 Jul 17 18:30 trade_lines.csv
5647435 bytes for the disk_logs vs 4920505 bytes for the original
TRADE data. That's about 15% storage overhead. Doesn't seem
excessive to me.
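(The 15% figure is just the byte counts above:)

```erlang
%% storage overhead in percent, from the byte counts above
Overhead = (5647435 - 4920505) / 4920505 * 100.
%% ~14.8, which rounds to the 15% quoted
```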
Did all the ticks get logged properly?
26> timer:tc(count,logs,[[4022035944,1826599643,3804582371]]).
{219000,119278}
using this program to count the objects:
============================================================
-module(count).
-export([log/1, logs/1]).
%% Count the terms in a single disk_log.
log(L) ->
    {ok, _} = disk_log:open([{name, L}]),
    Count = count(L, disk_log:chunk(L, start), 0),
    disk_log:close(L),
    Count.

%% Sum the term counts over a list of logs.
logs(Ls) ->
    lists:foldl(fun(L, Sum) ->
                        Sum + log(L)
                end, 0, Ls).

count(L, {Cont, Terms}, C) ->
    count(L, disk_log:chunk(L, Cont), C + length(Terms));
count(_L, eof, C) ->
    C.
=============================================================
/Uffe
On 2005-07-17 13:00:14, Joel Reymont <joelr1@REDACTED> wrote:
> Folks,
>
> This is my attempt at creating a time-series database in Erlang. I
> specifically want to store a day's worth of data in a separate disk
> log and keep the columns separate. You can read more on the whys
> here: http://groups-beta.google.com/group/uptick/
>
> You can get the data from http://wagerlabs.com/ES4U.csv.gz. Run it
> like this:
>
> tickdb:load("ES4U", "ES4U.csv").
>
> I had a version that did not cache open disk logs but used tickdb:add
> directly, without going through a gen_server. It ran almost 3x faster!
>
> Any suggestions on how to optimize this are appreciated. There are
> 119,275 trade records in the CSV file. I divide the total run time by
> this number to get the average per insert. I would like to do at
> least 100 inserts per second.
>
> The code is part of the open source trading platform that I'm creating.
>
> Thanks, Joel
>
> --
> http://wagerlabs.com/uptick
>
>
More information about the erlang-questions mailing list