[erlang-questions] Erlang re:run regular exp, match problem
Jesse Gumm
sigmastar@REDACTED
Sun Oct 31 17:58:04 CET 2010
The problem in that last example is that * is greedy by default, and .
doesn't match a linefeed (which is why putting \n at the end of each entity
worked).
"<point.*\/>" will match from the first instance of "<point" to the very
last "/>" on the same line.
Changing it to:
"<point.*?\/>" makes * "ungreedy", so each match ends at the first "/>"
that follows the "<point" it started from.
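
To make the difference concrete, here is a minimal sketch. The module and
function names are made up for illustration, and the input is the two-entity
string from earlier in the thread. It runs the greedy and the ungreedy pattern
over the same input, then shows why adding linefeeds appeared to fix the
greedy one.

```erlang
-module(greedy_demo).
-export([input/0, greedy/0, lazy/0, per_line/0]).

%% Two adjacent elements on one line, 27 characters each.
input() ->
    "<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>".

%% Greedy: .* runs on to the last "/>", so the whole input is one match.
%% ("/" is not special in a regexp, so it needs no backslash.)
greedy() ->
    re:run(input(), "<point.*/>", [global]).

%% Ungreedy: .*? stops at the first "/>" that completes a match.
lazy() ->
    re:run(input(), "<point.*?/>", [global]).

%% Same greedy pattern, but every element is followed by a linefeed.
%% Since . does not match "\n" by default, .* cannot cross a line.
per_line() ->
    Lines = "<point x=\"1\"/>\n<point x=\"2\"/>\n",
    re:run(Lines, "<point.*/>", [global]).
```

In the shell, greedy_demo:greedy() should give {match,[[{0,54}]]},
greedy_demo:lazy() should give {match,[[{0,27}],[{27,27}]]}, and
greedy_demo:per_line() should give {match,[[{0,14}],[{15,14}]]}.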
Alternatively, you could use
"<point [^>]*\/>". Since [^>] can never run past the closing ">", you don't
have to worry about greediness at all.
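
A small sketch of the character-class variant, again with made-up module and
function names and an invented input. One side effect worth knowing: [^>]
also matches a linefeed, so an element that is wrapped across lines is still
found, whereas the ungreedy dot version skips it. This assumes no attribute
value contains a literal ">".

```erlang
-module(class_demo).
-export([multiline/0, negated_class/0, lazy_dot/0]).

%% The first element is wrapped across two lines, the second is not.
multiline() ->
    "<point x=\"1\"\n y=\"2\"/><point x=\"3\"/>".

%% [^>]* matches anything except ">", linefeeds included, then backtracks
%% one character so that "/>" can match. Both elements are returned.
negated_class() ->
    re:run(multiline(), "<point [^>]*/>", [global, {capture, all, list}]).

%% .*? cannot cross the linefeed, so only the second element is returned.
lazy_dot() ->
    re:run(multiline(), "<point.*?/>", [global, {capture, all, list}]).
```

The {capture, all, list} option returns the matched text instead of
{Offset, Length} pairs, which is usually easier to eyeball. If you want the
dot version to cross lines as well, re also accepts the dotall option.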
-Jesse
On Sun, Oct 31, 2010 at 11:41 AM, Mathias <mathiasstalas@REDACTED> wrote:
> Hi,
>
> That expression was actually the first one I tried, the only
> difference being that I ran it from within my application. This was before I
> knew that file:read_file and UTF-8 don't blend well. I used file:read_file to
> read in my UTF-8 encoded file... I only tried dlfen's expression from the
> CLI, and there it succeeded. At the point of posting I had tried so many
> different solutions that I was tired and just wanted my rather simple
> laboratory application to work. Later I found out that neither the
> sample he gave nor the one you proposed works from within my
> application.
>
> Further investigation has led me to believe that re:run/3 has an
> issue with strings that lack linefeeds.
> Putting some non-valid XML in a file and running my rather simple program
> always yields [[{0,162}]], i.e. a single match from the first char to the
> last point's (see attached doc) closing '>'. It doesn't split them up as
> expected, which is either the intended module behaviour, which I find a bit
> odd, or (god forbid) a programming fault on my part. Here is the code I use:
>
> -module(mock).
> -export([start/0, read_file/1, decode_data/1, find_pattern/3]).
>
> start() ->
>     Bin = read_file("point.xml"),
>     UnicodeString = decode_data(Bin),
>     NodeList = find_pattern(UnicodeString, "<point.*\/>", [unicode, global]),
>     NodeList.
>
> read_file(File) ->
>     case file:read_file(File) of
>         {ok, Bin} -> Bin;
>         _ -> []
>     end.
>
> decode_data(Data) ->
>     case unicode:characters_to_list(Data, utf8) of
>         {error, Encoded, Rest} ->
>             %% io:format/2 takes its arguments as a single list.
>             io:format("Caught error: ~w ~w~n", [Encoded, Rest]),
>             [];
>         List ->
>             List
>     end.
>
> find_pattern(Str, Pattern, Options) ->
>     case re:run(Str, Pattern, Options) of
>         {match, Part} ->
>             io:format("find_pattern: ~w~n", [Part]),
>             Part;
>         nomatch -> []
>     end.
>
> However, adding a linefeed '\n' after each entity in the doc will give the
> expected result:
> [[{0,27}],[{28,27}],[{56,27}],[{84,27}],[{112,27}],[{140,27}]]
> which looks strange to me. I haven't read up on the re module that much, but
> this is my experience.
>
> I have resigned myself to using xmerl_xpath, which seems to do the job. I
> guess that, coming from the Java world, I am a bit spoiled by strong support
> for string manipulation; doing the above would have taken me less than 10 min.
>
> Anyway, thank you both for the effort.
>
> BR,
> Mathias Stalås
>
>
>
>
> On Sat, Oct 30, 2010 at 8:47 PM, Hynek Vychodil <hynek@REDACTED> wrote:
>
>> I wouldn't thank him if I were you. It doesn't do what you want; it
>> works only by accident in this particular example. [^<point]* means
>> any run of chars other than '<', 'p', 'o', 'i', 'n', 't'. [^<]* would work
>> the same way in this particular example.
>>
>> This would work much more generally
>>
>> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>",
>>        "<point.*?\/>", [global])
>>
>> but anyway, you should use an XML parser for XML parsing. XML cannot be
>> described by a regular grammar, so a regular expression is not the proper
>> tool for it. You will end up with an error-prone solution.
>>
>> On Sat, Oct 30, 2010 at 1:34 PM, Mathias <mathiasstalas@REDACTED> wrote:
>> > Works like a charm!
>> >
>> > Many thanks dlfen!
>> >
>> > BR,
>> > Mathias
>> >
>> > On Sat, Oct 30, 2010 at 1:03 PM, dlfen <erlangdlf@REDACTED> wrote:
>> >
>> >> try this.
>> >> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\"
>> >> z=\"14\"/>", "<point[^<point]*\/>",[global]).
>> >>
>> >>
>> >> On 2010-10-30, at 6:44 PM, Mathias wrote:
>> >>
>> >> > Hi there,
>> >> >
>> >> > I'm trying to figure out how Erlang's re:run function works.
>> >> >
>> >> > When executing this:
>> >> > 1> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\"
>> >> > z=\"14\"/>", "<point(?:\s|.)*\/>").
>> >> > {match,[{0,54}]}
>> >> >
>> >> > I can see that it gives me a match on the complete XML representation
>> >> > {match,[{0,54}]}.
>> >> >
>> >> > But what I really would like is for it to give me one match per
>> >> > entity, similar to {match,[{0,26},{27, 26}]}.
>> >> >
>> >> > So the output would yield something like this:
>> >> > 0-26 gives the first XML entity complete with its attributes,
>> >> > <point x="12" y="2" z="4"/>, and
>> >> > match 27,26 gives the remaining entity.
>> >> >
>> >> > If anyone can spot why my regexp <point(?:\s|.)*\/> is failing and
>> >> > guide me in the right direction toward a solution, it will be greatly
>> >> > appreciated.
>> >> >
>> >> > I know about xmerl but for my trivial case it seems like overkill.
>> >> >
>> >> > Thx in advance.
>> >> >
>> >> > BR,
>> >> > Mathias Stalås
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> --Hynek (Pichi) Vychodil
>>
>>
--
Jesse Gumm
Sigma Star Systems
414.940.4866
gumm@REDACTED
http://www.sigma-star.com