Hi Richard,<div><br></div><div>Let me see if I understood your points correctly:</div><div>1. Is a string the right tool? Alternatives: a list/iolist (or even a binary tree) of hashes for the chunks.</div><div>2. Use compressed data, as Joe suggested a few messages ago.</div>
<div><br></div><div>If that is the case, here are a few thoughts:</div><div><br></div><div>1. a) Using a binary tree structure is something I will definitely take into account wherever possible. That is why I was asking for opinions about processing lists vs. binaries vs. something else. Once the basic storage type is set, the rest follows accordingly. My only question was about that storage type, because my experience with Erlang has a gap here (as I said, I am relatively new to Erlang) and I wanted some opinions from those who have used these types more extensively. So far I have received helpful opinions, and I thank everyone who shared them here.</div>
<div><br></div><div>1. b) Using hashes for the chunks may be a solution worth considering. I am not overly enthusiastic about it, as converting chunks into hashes may slow down the overall process. But since I have no idea how much such a conversion would slow down the overall processing time, I should definitely take it into account.</div>
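To make the idea concrete, here is a minimal sketch of hashing chunks: each chunk is stored once in an ETS table keyed by its MD5 digest, and a string becomes a list of 16-byte hashes instead of the chunks themselves. The table name and variable names are illustrative, not from this thread.

```erlang
%% Store each chunk once, keyed by its MD5 hash; duplicate chunks
%% (shared between strings) then cost only one copy plus 16 bytes
%% per reference. Names here are illustrative.
T = ets:new(chunks, [set]),
Chunk = <<"the same 64 KiB chunk, possibly shared by many strings">>,
Hash = erlang:md5(Chunk),                % 16-byte digest
true = ets:insert(T, {Hash, Chunk}),     % re-inserting collapses duplicates
[{Hash, Chunk}] = ets:lookup(T, Hash),   % chunk recovered from its hash
16 = byte_size(Hash).
```

Timing erlang:md5/1 on a representative chunk size would answer the "how much does it slow things down" question directly.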
<div><br></div><div>2. As I said, this is something worth taking into account, even if I need to find a way to mark each string as "read-only" or "read-write".</div><div><br></div><div>
Of course, until I have some test cases, this is purely a discussion. So the first step now should be to think of some test cases, I suppose.</div><div><br></div><div>Thanks a lot for your input. It definitely made me think twice about how to approach the problem.</div>
<div><br></div><div>CGS</div><div><br><br><div class="gmail_quote">On Mon, Jul 16, 2012 at 12:53 AM, Richard O'Keefe <span dir="ltr"><<a href="mailto:ok@cs.otago.ac.nz" target="_blank">ok@cs.otago.ac.nz</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><br>
On 13/07/2012, at 12:46 AM, CGS wrote:<br>
> I am trying to find a balance in between processing speed and RAM consumption for sets of large strings (over 1 M characters per string).<br>
<br>
</div>I've been watching this thread with some interest.<br>
<br>
Just last week I told a 3rd-year software engineering class about the<br>
importance of making sure you are solving the right problem.<br>
<br>
Are all the strings over 1M characters?<br>
Can you characterise the size distribution more clearly?<br>
How many of these strings do you have?<br>
Do you have fixed sets, or do you have them flooding in and<br>
out again?<br>
Where do they come from?<br>
Are you doing anything to these strings,<br>
or just holding them and passing them on?<br>
<br>
Smalltalk systems keep track of the source code of every method,<br>
but they do this by keeping the characters in any of several files<br>
and internally just keeping what-and-where tuples; I used that<br>
successfully in a Prolog program once (think CLOBs).<br>
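[The what-and-where idea above can be sketched in Erlang as keeping only a descriptor of where the characters live on disk and reading them back on demand. The record shape and function name below are assumptions for illustration, not from the Smalltalk or Prolog systems mentioned:]

```erlang
%% "What-and-where" sketch: in memory we keep only a descriptor of
%% where the characters live; the bytes stay in a file until needed.
%% Record shape and names are illustrative.
-record(clob, {file   :: file:filename(),
               offset :: non_neg_integer(),
               length :: non_neg_integer()}).

%% Read the characters back when (and only when) they are needed.
fetch(#clob{file = F, offset = Off, length = Len}) ->
    {ok, Fd} = file:open(F, [read, raw, binary]),
    {ok, Bin} = file:pread(Fd, Off, Len),
    ok = file:close(Fd),
    Bin.
```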
<br>
If you might be editing the strings, have you considered using<br>
a tree of binaries? (I'm thinking of "piece tables" and AVL DAGs.)<br>
<br>
Haskell's lazy ByteStrings<br>
<a href="http://www.haskell.org/ghc/docs/7.0-latest/html/libraries/bytestring-0.9.1.10/Data-ByteString-Lazy.html" target="_blank">http://www.haskell.org/ghc/docs/7.0-latest/html/libraries/bytestring-0.9.1.10/Data-ByteString-Lazy.html</a><br>
are in effect lists of binaries (default chunk size = 64k).<br>
<div class="im">> About each string, it is constructed from chunks of fixed size,<br>
> usually, much smaller than the string itself, hopefully.<br>
</div>This sounds a lot like the list-of-chunk representation.<br>
Are all the chunks the same size? (Not that it matters much.<br>
Erlang iodata is basically list-of-chunk.)<br>
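[A small illustration of iodata as list-of-chunk: a 1M-character "string" can be a list of 64 KiB binaries that is never flattened until output, since file:write/2 and friends accept iodata directly. The chunk size is just an example:]

```erlang
%% A 1 MiB "string" as iodata: a list of sixteen 64 KiB binary chunks.
%% Prepending another chunk is O(1) (one cons cell); only output
%% needs the flat bytes, and I/O functions take iodata as-is.
Chunk = binary:copy(<<"x">>, 65536),
Chunks = lists:duplicate(16, Chunk),   % 16 * 64 KiB = 1 MiB
1048576 = iolist_size(Chunks),         % total size without flattening
Flat = iolist_to_binary(Chunks).       % flatten only when really needed
```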
<br>
Do the contents of these strings have any structure?<br>
Are strings (however represented) *really* the right tool for the job?<br>
<br>
I just have such a hard time believing in 1M-char strings that<br>
are *just* strings and not something structured that has been<br>
represented as a string.<br>
<br>
Do they have shareable parts?<br>
(When I want to process XML or SGML, I have a thing I call the<br>
"Document Value Model", which uses hash consing in C. Even plain<br>
text files usually save at least a few percent.)<br>
Could the chunks be shareable, or are they just an artefact of<br>
packetising?<br>
<br>
Are the data already compressed? Are they of a kind that might benefit<br>
from compression? gzip, which is not state of the art, gets text down<br>
to about 30%. Heck, even random decimal digit sequences compress to<br>
less than 50%. The zlib module may be helpful. Some of the chunks in<br>
a string could be compressed and others not.<br>
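[A sketch of the zlib suggestion: compress a chunk and get it back with the stdlib zlib module, so "cold" chunks can be held compressed while "hot" ones stay plain. The sample text is illustrative:]

```erlang
%% Round-trip a chunk through zlib. Plain text typically shrinks to
%% something like the 30% figure above; already-compressed data will
%% not shrink and should be left alone.
Text = binary:copy(<<"some fairly repetitive text. ">>, 1000),
Z = zlib:compress(Text),
true = byte_size(Z) < byte_size(Text),  % smaller in memory
Text = zlib:uncompress(Z).              % exact round-trip
```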
<br>
What do the strings signify? Why are they so big?<br>
What is the purpose of the Erlang program?<br>
What is the *real* problem?<br>
<br>
</blockquote></div><br></div>