On Thu, Nov 20, 2008 at 8:08 AM, damien morton <span dir="ltr"><<a href="mailto:dmorton@bitfurnace.com">dmorton@bitfurnace.com</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
traditionally, at least in C, this stuff is done with a bitvector, or<br>
rather a vector of bits.<br>
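<br></blockquote></div>Here's my 'C-like' contribution: a 256-entry tuple used as a lookup table, indexed by byte value, where each entry is a bitmask of character classes. It's pretty quick, too, at roughly 24 MB/sec on the run below (the figure excludes the time taken to read the file into a binary).<br><br>32> charclass:bm("/home/efine/erlang/otp_src_R12B-3.tar.gz").<br>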
File "/home/efine/erlang/otp_src_R12B-3.tar.gz" size is 42195557 bytes<br>Speed = 23965801 bytes/sec<br><br><span style="font-family: courier new,monospace;">-module(charclass).</span><br style="font-family: courier new,monospace;">
<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">-define(CLS_CNTRL, 2#00000001).</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">-define(CLS_UPPER, 2#00000010).</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">-define(CLS_LOWER, 2#00000100).</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">-define(CLS_DIGIT, 2#00001000).</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">-define(CLS_PUNCT, 2#00010000).</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">-define(CLS_BLANK, 2#00100000).</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">-define(CLS_SPACE, 2#01000000).</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">-define(CLS_8BIT, 2#10000000).</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">-define(CLS_ALPHA, (?CLS_UPPER bor ?CLS_LOWER)).</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">-define(CLS_ALNUM, (?CLS_UPPER bor ?CLS_LOWER bor ?CLS_DIGIT)).</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">-define(CLS_PRINT, (bnot (?CLS_CNTRL bor ?CLS_8BIT))).</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">-compile([export_all]).</span><br style="font-family: courier new,monospace;">
<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">is_cntrl(Ch) -> class(Ch) band ?CLS_CNTRL /= 0.</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">is_upper(Ch) -> class(Ch) band ?CLS_UPPER /= 0.</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">is_lower(Ch) -> class(Ch) band ?CLS_LOWER /= 0.</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">is_digit(Ch) -> class(Ch) band ?CLS_DIGIT /= 0.</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">is_punct(Ch) -> class(Ch) band ?CLS_PUNCT /= 0.</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">is_alpha(Ch) -> class(Ch) band ?CLS_ALPHA /= 0.</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">is_alnum(Ch) -> class(Ch) band ?CLS_ALNUM /= 0.</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">is_blank(Ch) -> class(Ch) band ?CLS_BLANK /= 0.</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">is_space(Ch) -> class(Ch) band ?CLS_SPACE /= 0.</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">is_print(Ch) -> class(Ch) band ?CLS_PRINT /= 0.</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">is_8bit(Ch) -> class(Ch) band ?CLS_8BIT /= 0.</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">class_names(Ch) -></span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> Cls = class(Ch),</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> Masks = [</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> ?CLS_UPPER, ?CLS_LOWER, ?CLS_DIGIT, ?CLS_PUNCT, ?CLS_PRINT,</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> ?CLS_BLANK, ?CLS_SPACE, ?CLS_ALPHA, ?CLS_ALNUM, ?CLS_CNTRL,</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> ?CLS_8BIT</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> ],</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> [class_name(Mask) || Mask <- Masks, Cls band Mask /= 0].</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">class_name(?CLS_CNTRL) -> cntrl;</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">class_name(?CLS_UPPER) -> upper;</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">class_name(?CLS_LOWER) -> lower;</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">class_name(?CLS_DIGIT) -> digit;</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">class_name(?CLS_PUNCT) -> punct;</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">class_name(?CLS_SPACE) -> space;</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">class_name(?CLS_BLANK) -> blank;</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">class_name(?CLS_ALPHA) -> alpha;</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">class_name(?CLS_ALNUM) -> alnum;</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">class_name(?CLS_PRINT) -> print;</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">class_name(?CLS_8BIT) -> '8bit'.</span><br style="font-family: courier new,monospace;">
<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">class(Ch) -></span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> element(Ch + 1, </span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> { % 0 1 2 3 4 5 6 7 8 9 A B C D E F </span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 1, 1, 1, 1, 1, 1, 1, 1, 1, 97, 65, 65, 65, 65, 1, 1, % 0</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, % 1</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 96, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, % 2</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, % 3</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 16, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, % 4</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 16, 16, 16, 16, 16, % 5</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 16, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, % 6</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 16, 16, 16, 16, 1, % 7</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 128,128,128,128,128,128,128,128,128,128,128,128,128,128,128,128, % 8</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 128,128,128,128,128,128,128,128,128,128,128,128,128,128,128,128, % 9</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 128,128,128,128,128,128,128,128,128,128,128,128,128,128,128,128, % A</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 128,128,128,128,128,128,128,128,128,128,128,128,128,128,128,128, % B</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 128,128,128,128,128,128,128,128,128,128,128,128,128,128,128,128, % C</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 128,128,128,128,128,128,128,128,128,128,128,128,128,128,128,128, % D</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 128,128,128,128,128,128,128,128,128,128,128,128,128,128,128,128, % E</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 128,128,128,128,128,128,128,128,128,128,128,128,128,128,128,128 % F</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> }).</span><br style="font-family: courier new,monospace;">
<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">test() -></span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> [{Ch, [Ch], class_names(Ch)}|| Ch <- lists:seq(0,255)].</span><br style="font-family: courier new,monospace;">
<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">bm(File) -></span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> {ok, B} = file:read_file(File),</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> Size = byte_size(B),</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> io:format("File ~p size is ~B bytes~n", [File, Size]),</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> {Micros,_} = timer:tc(?MODULE, classify_binary, [B]),</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> io:format("Speed = ~b bytes/sec~n", [Size * 1000000 div Micros]).</span><br style="font-family: courier new,monospace;">
<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">classify_binary(<<>>) -></span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> ok;</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">classify_binary(<<Ch,Rest/bytes>>) -></span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> class(Ch),</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> classify_binary(Rest).</span><br style="font-family: courier new,monospace;"><br><br>
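The table in the post is typed out by hand, and the benchmark throws away each lookup result. As a rough sketch only, the same 256 values could be generated from a few rules and then used to count the bytes belonging to a class. The module name charclass_sketch and the functions table/0, bits/1, mask/1 and count/2 are made up for this illustration and are not part of the posted module; the bit values are copied from the CLS_* macros above.

```erlang
-module(charclass_sketch).
-export([table/0, class/1, count/2]).

%% Same bit assignments as the CLS_* macros in the post.
-define(CLS_CNTRL, 2#00000001).
-define(CLS_UPPER, 2#00000010).
-define(CLS_LOWER, 2#00000100).
-define(CLS_DIGIT, 2#00001000).
-define(CLS_PUNCT, 2#00010000).
-define(CLS_BLANK, 2#00100000).
-define(CLS_SPACE, 2#01000000).
-define(CLS_8BIT,  2#10000000).

%% Build the 256-entry tuple from rules instead of typing it by hand.
table() ->
    list_to_tuple([bits(Ch) || Ch <- lists:seq(0, 255)]).

%% Clause order matters: the special control characters come first.
bits($\t) -> ?CLS_CNTRL bor ?CLS_BLANK bor ?CLS_SPACE;            % 97
bits(Ch) when Ch >= $\n, Ch =< $\r -> ?CLS_CNTRL bor ?CLS_SPACE;  % 65
bits(Ch) when Ch < 32; Ch =:= 127 -> ?CLS_CNTRL;
bits($\s) -> ?CLS_BLANK bor ?CLS_SPACE;                           % 96
bits(Ch) when Ch >= $0, Ch =< $9 -> ?CLS_DIGIT;
bits(Ch) when Ch >= $A, Ch =< $Z -> ?CLS_UPPER;
bits(Ch) when Ch >= $a, Ch =< $z -> ?CLS_LOWER;
bits(Ch) when Ch >= 128 -> ?CLS_8BIT;
bits(_) -> ?CLS_PUNCT.

%% Look up the class bitmask of a single byte value (0..255).
class(Ch) -> element(Ch + 1, table()).

%% Hypothetical name-to-mask mapping for the caller's convenience.
mask(cntrl) -> ?CLS_CNTRL;
mask(upper) -> ?CLS_UPPER;
mask(lower) -> ?CLS_LOWER;
mask(digit) -> ?CLS_DIGIT;
mask(punct) -> ?CLS_PUNCT;
mask(blank) -> ?CLS_BLANK;
mask(space) -> ?CLS_SPACE;
mask('8bit') -> ?CLS_8BIT;
mask(alpha) -> ?CLS_UPPER bor ?CLS_LOWER;
mask(alnum) -> ?CLS_UPPER bor ?CLS_LOWER bor ?CLS_DIGIT.

%% Count the bytes in Bin whose class shares at least one bit with the
%% mask for Name. The table is built once per call and passed along.
count(Bin, Name) when is_binary(Bin) ->
    count(Bin, mask(Name), table(), 0).

count(<<Ch, Rest/binary>>, Mask, Table, N) ->
    case element(Ch + 1, Table) band Mask of
        0 -> count(Rest, Mask, Table, N);
        _ -> count(Rest, Mask, Table, N + 1)
    end;
count(<<>>, _Mask, _Table, N) ->
    N.
```

After compiling with c(charclass_sketch). in the shell, charclass_sketch:count(<<"abc 123">>, digit) returns 3 and charclass_sketch:class($\t) returns 97. Note that class/1 here rebuilds the tuple on every call, so for a tight loop the literal tuple from the post, or a table passed in as an argument as count/2 does, is the better fit.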