[erlang-questions] Erlang re:run regular exp, match problem

Morten Krogh mk@REDACTED
Sun Oct 31 22:08:38 CET 2010


Hi

Take care, '>' is an allowed character in attribute values, e.g.,

<point id="/>"/>

is valid xml.
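
A minimal sketch of how this affects the patterns suggested further down the thread. The module name point_re_demo and the helper spans/2 are made up for illustration; only re:run/3 and its options come from OTP.

```erlang
-module(point_re_demo).
-export([spans/2, demo/0]).

%% Return the {Offset, Length} span of every match of Pattern in Subject.
%% {capture, first, index} keeps only the span of the whole match.
spans(Subject, Pattern) ->
    case re:run(Subject, Pattern, [global, {capture, first, index}]) of
        {match, Matches} -> [Span || [Span] <- Matches];
        nomatch -> []
    end.

%% Run the three patterns from the thread against a plain input and
%% against an input whose attribute value contains "/>".
demo() ->
    Two = "<point x=\"1\"/><point x=\"2\"/>",   % 28 characters
    Tricky = "<point id=\"/>\"/>",              % 16 characters
    [{greedy, spans(Two, "<point.*/>")},
     {lazy, spans(Two, "<point.*?/>")},
     {negated_class, spans(Two, "<point [^>]*/>")},
     {lazy_tricky, spans(Tricky, "<point.*?/>")},
     {negated_class_tricky, spans(Tricky, "<point [^>]*/>")}].
```

On the two-element input the lazy and [^>]* patterns each give {0,14} and {14,14}, while the greedy one gives a single {0,28}. On the tricky input both non-greedy variants stop at offset 13, inside the attribute value, instead of covering all 16 characters.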

If you control the input yourself it is no problem, of course.
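
If you don't control the input, a parser sidesteps the issue. A rough sketch using xmerl_scan and xmerl_xpath from OTP; the module name point_xml_demo, the function points/1 and the <points> wrapper element in the usage line are assumptions for illustration (xmerl wants a single root element).

```erlang
-module(point_xml_demo).
-export([points/1]).

%% Record definitions for #xmlElement{} and #xmlAttribute{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Parse an XML string and return, for each <point> element, its
%% attributes as a list of {Name, Value} pairs.
points(Xml) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    [[{A#xmlAttribute.name, A#xmlAttribute.value}
      || A <- E#xmlElement.attributes]
     || E <- xmerl_xpath:string("//point", Doc)].
```

For example, point_xml_demo:points("<points><point id=\"/>\"/><point id=\"b\"/></points>") should return one attribute list per element, with the "/>" inside the first attribute value left intact.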

Morten.


On 10/31/10 5:58 PM, Jesse Gumm wrote:
> The problem in that last example is that by default * is greedy and .
> doesn't match a linefeed (which is why putting \n at the end of each
> element worked).
>
> "<point.*\/>" will match from the first instance of "<point" to the very
> last "/>".
>
> Changing it to:
>
> "<point.*?\/>" will make * act "ungreedy", so each match ends at the
> first "/>" that follows its "<point".
>
> Alternatively, you could use
>
> "<point [^>]*\/>" then you don't really have to worry about greediness or
> not.
>
> -Jesse
>
>
> On Sun, Oct 31, 2010 at 11:41 AM, Mathias <mathiasstalas@REDACTED> wrote:
>
>> Hi,
>>
>> That expression was actually the first one I tried, the only difference
>> being that I ran it from within my application. This was before I knew
>> that file:read_file and UTF-8 don't blend well; I used file:read_file to
>> read in my UTF-8 encoded file. I only tried dlfen's expression from the
>> CLI, and there it succeeded. At the point of posting I had tried so many
>> different solutions that I was tired and just wanted my rather simple
>> laboratory application to work. Later I found out that neither the sample
>> he gave nor the one you proposed works from within my application.
>>
>> Further investigation has led me to believe that re:run/3 has an issue
>> with strings that lack linefeeds.
>> Putting some non-valid XML in a file and running my rather simple program
>> always yields [[{0,162}]], a single match running from the first character
>> to the closing '>' of the last point (see attached doc). It doesn't split
>> the input into one match per entity as I expected, which is either the
>> intended module behaviour, which I find a bit odd, or (god forbid) a
>> programming fault on my part. Here is the code I use:
>>
>> -module(mock).
>> -export([start/0, read_file/1, decode_data/1, find_pattern/3]).
>>
>> start() ->
>>          Bin = read_file("point.xml"),
>>          UnicodeString = decode_data(Bin),
>>          NodeList = find_pattern(UnicodeString, "<point.*\/>",
>>                                  [unicode, global]),
>>          NodeList.
>>
>> read_file(File) ->
>>          case file:read_file(File) of
>>                  {ok, Bin} ->  Bin;
>>                  _ ->  []
>>          end.
>>
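>> decode_data(Data) ->
>>          case unicode:characters_to_list(Data, utf8) of
>>                  {error, Encoded, Rest} ->
>>                          %% io:format/2 takes its arguments as one list.
>>                          io:format("Caught Error ~w ~w~n", [Encoded, Rest]),
>>                          [];
>>                  {incomplete, Encoded, Rest} ->
>>                          %% Input ended in the middle of a UTF-8 sequence.
>>                          io:format("Incomplete input ~w ~w~n", [Encoded, Rest]),
>>                          [];
>>                  List ->
>>                          List
>>          end.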
>>
>> find_pattern(Str, Pattern, Options) ->
>>          case re:run(Str, Pattern, Options) of
>>                  {match, Part} ->
>>                          io:format("find_pattern: ~w~n", [Part]),
>>                          Part;
>>                  nomatch ->  []
>>          end.
>>
>> However, adding a linefeed '\n' after each entity in the doc will give the
>> expected result:
>> [[{0,27}],[{28,27}],[{56,27}],[{84,27}],[{112,27}],[{140,27}]]
>> which to me looks strange. I haven't read up on the re module that much,
>> but this is my experience.
>>
>> I have resigned myself to using xmerl_xpath, which seems to do the job. I
>> guess that, coming from the Java world, I am a bit spoiled by strong
>> support for string manipulation; doing the above there would have taken me
>> less than 10 min.
>>
>> Anyway, thank you both for the effort.
>>
>> BR,
>> Mathias Stalås
>>
>>
>>
>>
>> On Sat, Oct 30, 2010 at 8:47 PM, Hynek Vychodil <hynek@REDACTED> wrote:
>>
>>> I would not thank him if I were you. It doesn't do what you want; it
>>> works only by accident in this particular example. [^<point]* means
>>> any run of characters other than '<', 'p', 'o', 'i', 'n', 't'. [^<]*
>>> would work the same way in this particular example.
>>>
>>> This would work much more generally:
>>>
>>> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\"
>>> y=\"2\"z=\"14\"/>","<point.*?\/>",[global])
>>>
>>> but anyway, you should use an XML parser for parsing XML: XML is not a
>>> regular language, so a regular expression is not the proper tool for
>>> the job. You will end up with an error-prone solution.
>>>
>>> On Sat, Oct 30, 2010 at 1:34 PM, Mathias <mathiasstalas@REDACTED> wrote:
>>>> Works like a charm!
>>>>
>>>> Many thanks dlfen!
>>>>
>>>> BR,
>>>> Mathias
>>>>
>>>> On Sat, Oct 30, 2010 at 1:03 PM, dlfen <erlangdlf@REDACTED> wrote:
>>>>
>>>>> try this.
>>>>> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\"
>>>>> z=\"14\"/>","<point[^<point]*\/>",[global]).
>>>>>
>>>>>
>>>>> On 2010-10-30, at 6:44 PM, Mathias wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I'm trying to figure out how Erlang's re:run function works.
>>>>>>
>>>>>> When executing this:
>>>>>> 1>  re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\"
>>>>>> z=\"14\"/>","<point(?:\s|.)*\/>").
>>>>>> {match,[{0,54}]}
>>>>>>
>>>>>> I can see that it gives me a single match covering the complete XML
>>>>>> representation, {match,[{0,54}]}.
>>>>>>
>>>>>> But what I really would like is for it to give me one match per
>>>>>> entity, similar to {match,[{0,26},{27,26}]}.
>>>>>>
>>>>>> So the output would yield something like this:
>>>>>> 0-26 gives the first XML entity complete with its attributes,
>>>>>> <point x="12" y="2" z="4"/>, and
>>>>>> match 27,26 gives the remaining entity.
>>>>>>
>>>>>> If anyone can spot why my regexp <point(?:\s|.)*\/> is failing and
>>>>>> guide me in the right direction towards a solution, it will be
>>>>>> greatly appreciated.
>>>>>>
>>>>>> I know about xmerl, but for my trivial case it seems like overkill.
>>>>>>
>>>>>> Thx in advance.
>>>>>>
>>>>>> BR,
>>>>>> Mathias Stalås
>>>>>
>>>
>>>
>>> --
>>> --Hynek (Pichi) Vychodil
>>>
>>
>>
>
>


