[erlang-questions] Erlang re:run regular exp, match problem
Morten Krogh
Sun Oct 31 22:08:38 CET 2010
Take care, '>' is an allowed character in attribute values, e.g.,
<point id="/>"/>
is valid xml.
If you control the input yourself it is no problem, of course.
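To make that concrete, here is a minimal sketch. The module name gt_in_attr and both helper names are made up for illustration, and the quote-aware pattern is only one assumed way of coping with this, not a real XML tokenizer. It compares the "<point [^>]*\/>" pattern quoted further down with a pattern that steps over quoted attribute values:

```erlang
-module(gt_in_attr).
-export([naive/1, quote_aware/1]).

%% Hypothetical helpers for illustration; not part of the original thread.

%% The negated-class pattern from the thread: it stops at the first '>'.
naive(Xml) ->
    first_match(Xml, "<point [^>]*/>").

%% Assumed alternative: consume name="value" / name='value' pairs, so a
%% '>' inside a quoted value is skipped rather than taken as the tag end.
quote_aware(Xml) ->
    first_match(Xml, "<point(?:\\s+[\\w:.-]+\\s*=\\s*(?:\"[^\"]*\"|'[^']*'))*\\s*/>").

%% Return the first match as a string, or the atom nomatch.
first_match(Subject, Pattern) ->
    case re:run(Subject, Pattern, [{capture, first, list}]) of
        {match, [Match]} -> Match;
        nomatch -> nomatch
    end.
```

For the input "<point id=\"/>\"/>", naive/1 returns "<point id=\"/>", which is cut off inside the attribute value, while quote_aware/1 returns the whole tag. A real XML parser is still the safer choice.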
On 10/31/10 5:58 PM, Jesse Gumm wrote:
> The problem in that last example is that by default * is greedy and .
> doesn't match the linefeed (which is why putting \n at the end of each
> element worked).
> "<point.*\/>" will match from the first instance of "<point" to the very
> last "/>".
> Changing it to:
> "<point.*?\/>" will make * act "ungreedy", so each match ends at the
> first "/>" it finds.
> Alternatively, you could use
> "<point [^>]*\/>"; then you don't really have to worry about greediness
> at all.
> -Jesse
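The three patterns discussed above can be compared side by side. This is only a sketch: the module name greedy_demo and the function positions/1 are hypothetical, and the input is the two-element string from the first message in the thread, with no whitespace between the elements.

```erlang
-module(greedy_demo).
-export([positions/1]).

%% Hypothetical demo; the input is the two-element string used in the thread.
-define(INPUT, "<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>").

%% Return the {Offset, Length} pair of every match of Pattern in ?INPUT.
positions(Pattern) ->
    case re:run(?INPUT, Pattern, [global]) of
        %% Without capture groups, each match is a one-element list.
        {match, Matches} -> [Pos || [Pos] <- Matches];
        nomatch -> []
    end.
```

greedy_demo:positions("<point.*/>") returns [{0,54}], one match spanning both elements. Both "<point.*?/>" and "<point [^>]*/>" return [{0,27},{27,27}], one match per element.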
> On Sun, Oct 31, 2010 at 11:41 AM, Mathias <mathiasstalas@REDACTED> wrote:
>> Hi,
>> That expression was actually the first one I tried, the only
>> difference being that I ran it from within my application. This was
>> before I knew that file:read_file and UTF-8 don't blend well. I used
>> file:read_file to read in my UTF-8 encoded file... I only tried dlfen's
>> expression from the CLI, and there it succeeded. At the point of posting
>> I had tried so many different solutions that I was tired and just wanted
>> my rather simple laboratory application to work. Later I found out that
>> neither the sample he gave nor the one you proposed works from within
>> my application.
>> Further investigation has led me to believe that re:run/3 has an issue
>> with strings that lack linefeeds.
>> Putting some non-valid XML in a file and running my rather simple
>> program always yields [[{0,162}]]: a single match running from the first
>> character to the closing '>' of the last point (see attached doc). It
>> doesn't split the points up as expected. Either this is the intended
>> module behaviour, which I find a bit odd, or (god forbid) a programming
>> fault on my part. Here is the code I use:
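>> -module(mock).
>> -export([start/0, read_file/1, decode_data/1, find_pattern/3]).
>>
>> start() ->
>>     Bin = read_file("point.xml"),
>>     UnicodeString = decode_data(Bin),
>>     NodeList = find_pattern(UnicodeString, "<point.*\/>", [unicode, global]),
>>     NodeList.
>>
>> %% Read the whole file as a binary; fall back to an empty list on error.
>> read_file(File) ->
>>     case file:read_file(File) of
>>         {ok, Bin} -> Bin;
>>         _ -> []
>>     end.
>>
>> %% Decode UTF-8 bytes into a list of Unicode code points.
>> decode_data(Data) ->
>>     case unicode:characters_to_list(Data, utf8) of
>>         {error, Encoded, Rest} ->
>>             %% io:format/2 takes its arguments as a single list.
>>             io:format("Caught error: ~w ~w~n", [Encoded, Rest]),
>>             [];
>>         {incomplete, Encoded, Rest} ->
>>             io:format("Incomplete input: ~w ~w~n", [Encoded, Rest]),
>>             [];
>>         List ->
>>             List
>>     end.
>>
>> %% Run the pattern and return the match positions, or [] on no match.
>> find_pattern(Str, Pattern, Options) ->
>>     case re:run(Str, Pattern, Options) of
>>         {match, Part} ->
>>             io:format("find_pattern: ~w~n", [Part]),
>>             Part;
>>         nomatch -> []
>>     end.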
>> However, adding a linefeed '\n' after each element in the doc gives the
>> expected result:
>> [[{0,27}],[{28,27}],[{56,27}],[{84,27}],[{112,27}],[{140,27}]]
>> which looks strange to me. I haven't read up on the re module that much,
>> but this is my experience.
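The linefeed effect follows from the default behaviour of '.', which does not match a linefeed, so a greedy .* cannot run past the end of a line. A small sketch, with a hypothetical module name (dot_newline) and a made-up two-line input, shows the difference the dotall option of re:run/3 makes:

```erlang
-module(dot_newline).
-export([positions/2]).

%% Hypothetical demo: run the greedy pattern from the thread with extra
%% re options and return the {Offset, Length} pair of every match.
positions(Subject, Options) ->
    case re:run(Subject, "<point.*/>", [global | Options]) of
        {match, Matches} -> [Pos || [Pos] <- Matches];
        nomatch -> []
    end.
```

With Lines = "<point a=\"1\"/>\n<point a=\"2\"/>", dot_newline:positions(Lines, []) returns [{0,14},{15,14}], because each greedy match is confined to its own line. dot_newline:positions(Lines, [dotall]) returns [{0,29}], because '.' may now cross the linefeed and the greedy match swallows everything up to the last "/>".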
>> I have resigned myself to using xmerl_xpath, which seems to do the job.
>> I guess that, coming from the Java world, I am a bit spoiled by strong
>> support for string manipulation; doing the above would have taken me
>> less than 10 minutes.
>> Anyway, thank you both for the effort.
>> BR,
>> Mathias Stalås
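For reference, the xmerl route mentioned above could look roughly like this. It is a sketch under assumptions: the module name xmerl_points and the function attributes/1 are invented here, and the input is assumed to be a well-formed document with a single root element wrapping the points (the actual point.xml is not shown in the thread).

```erlang
-module(xmerl_points).
-export([attributes/1]).

%% Record definitions for #xmlElement{} and #xmlAttribute{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical helper: parse XmlString and return, for every <point>
%% element, its attributes as a list of {Name, Value} pairs.
attributes(XmlString) ->
    {Doc, _Rest} = xmerl_scan:string(XmlString),
    [[{Name, Value} || #xmlAttribute{name = Name, value = Value} <- Attrs]
     || #xmlElement{attributes = Attrs} <- xmerl_xpath:string("//point", Doc)].
```

Called with "<points><point x=\"12\"/><point x=\"4\"/></points>", this should return [[{x,"12"}],[{x,"4"}]]. The parser rather than a pattern decides where each element ends.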
>> On Sat, Oct 30, 2010 at 8:47 PM, Hynek Vychodil <hynek@REDACTED> wrote:
>>> I would not be thanking him in your place. It doesn't do what you want;
>>> it works only by accident in this particular example. [^<point]* means
>>> any run of characters other than '<', 'p', 'o', 'i', 'n', 't'. [^<]*
>>> would work in the same way in this particular example.
>>> This would work much more generally:
>>> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>","<point.*?\/>",[global])
>>> but anyway, you should use an XML parser for XML parsing, because XML
>>> cannot be described by a regular grammar, so a regular expression is
>>> not the proper tool for it. You will end up with an error-prone solution.
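The character-class point is easy to check. In this sketch the module name char_class, the function matches/2 and the attribute name pos are all invented for illustration; any attribute containing one of the letters p, o, i, n or t would do.

```erlang
-module(char_class).
-export([matches/2]).

%% Hypothetical helper: true if Pattern matches anywhere in Subject.
%% With {capture, none}, re:run/3 returns the bare atoms match or nomatch.
matches(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, none}]) =:= match.
```

char_class:matches("<point x=\"12\"/>", "<point[^<point]*/>") is true only because none of the excluded letters occur after the tag name. char_class:matches("<point pos=\"1\"/>", "<point[^<point]*/>") is false, since the class refuses the 'p' of pos, while "<point.*?/>" matches both inputs.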
>>> On Sat, Oct 30, 2010 at 1:34 PM, Mathias <mathiasstalas@REDACTED> wrote:
>>>> Works like a charm!
>>>> Many thanks dlfen!
>>>> BR,
>>>> Mathias
>>>> On Sat, Oct 30, 2010 at 1:03 PM, dlfen <erlangdlf@REDACTED> wrote:
>>>>> Try this:
>>>>> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>","<point[^<point]*\/>",[global]).
>>>>> On 2010-10-30, at 6:44 PM, Mathias wrote:
>>>>>> Hi there,
>>>>>> I'm trying to figure out how Erlang's re:run function works.
>>>>>> When executing this:
>>>>>> 1> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>","<point(?:\s|.)*\/>").
>>>>>> {match,[{0,54}]}
>>>>>> I can see that it gives me a single match on the complete XML
>>>>>> representation, {match,[{0,54}]}.
>>>>>> But what I would really like is for it to give me one match per
>>>>>> element, similar to {match,[{0,26},{27,26}]},
>>>>>> so that the output would yield something like this:
>>>>>> match 0,26 gives the first XML element complete with its attributes,
>>>>>> <point x="12" y="2" z="4"/>, and
>>>>>> match 27,26 gives the remaining element.
>>>>>> If anyone can spot why my regexp <point(?:\s|.)*\/> is failing and
>>>>>> guide me in the right direction towards a solution, it will be
>>>>>> greatly appreciated.
>>>>>> I know about xmerl, but for my trivial case it seems like overkill.
>>>>>> Thx in advance.
>>>>>> BR,
>>>>>> Mathias Stalås
>>> --
>>> --Hynek (Pichi) Vychodil