[apparmor] [PATCH] parser - more regex unittests and fixes (was Re: [PATCH] [parsers] allow for nested alternations expressions)

Fri Dec 6 02:40:04 UTC 2013

On 12/05/2013 06:01 PM, Seth Arnold wrote:
> On Thu, Nov 07, 2013 at 11:35:53AM -0800, John Johansen wrote:
>> Good question. I'm not really sure. I know I don't want to support all of it
>> but I would like to have more than we have. I think the set available in
>> aare globbing should be smaller than the set we make available in the
>> pcre syntax.
>>
>> eg.
>> @{var} is the variable expansion in aare, but for the pcre syntax I was
>> considering using \@{var}
> 
> This is alright.
> 
> However, I think it'd feel more natural to use a named-group syntax to
> look up the variables by name. (Not that I like the variety of named-group
> syntaxes available -- it just feels like the languages already support a
> named lookup of some sort that we can leverage.)
> 
hrmm maybe

>> I know on the pcre side I want positive and negative lookaheads and
>> back references (though it would not use the confusing \# syntax of
>> pcre, but might use \g#). I'm not sure it makes sense to expose these to
>> aare
> 
> I'm not a fan of \g#. It isn't documented at all on either of these pages
> from one of the best sources of multiple-language regex info:
> http://www.regular-expressions.info/backref.html
> http://www.regular-expressions.info/backref2.html
> (I went there to kick-start my memory on which engine used \g# rather than
> \#)
> 
The problem with \# is that it is ambiguous: \10 could mean group 10, or
group 1 followed by a literal 0.
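
To make the ambiguity concrete, here is a small sketch using Python's re
module purely as a stand-in (it is not the AppArmor parser, and Python's
\g<n> only exists in replacement strings, unlike pcre's \g{n}). The helper
names are made up for illustration.

```python
import re

def numbered_form_fails():
    """r"\\10" is read as "group 10", which this pattern does not have."""
    try:
        re.sub(r"(a)", r"\10", "a")
    except re.error:
        return True
    return False

def explicit_form():
    """r"\\g<1>0" is unambiguous: group 1 followed by a literal 0."""
    return re.sub(r"(a)", r"\g<1>0", "a")
```

The delimited form states where the group number ends, which is the property
a \g#-style syntax is after.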

>> I think it would be nice to support some of the posix character classes
>> and maybe \d \D.
> 
> Yes.
> 
>> The big ones I want are a way to escape into pcre syntax and back to aare,
>> and to accept permission embedding, which saves a fair bit of duplication
>> and extra state creation (and then removal) on the backend.
>> Eg.
>> for mount instead of having to provide 5 rules
>> part1 <perm>
>> part1\0part2 <perm>
>> part1\0part2\0part3 <perm>
>> part1\0part2\0part3\0part4 <perm>
>> part1\0part2\0part3\0part4\0part5 <perm>
>>
>> we could get away with encoding a single rule
>> part1\<perm>\0part2\<perm>\0part3\<perm>\0part4\<perm>\0part5\<perm>
> 
> This seems plausible enough. Would it only make sense for mount?
> 
No, we actually do this kind of thing all over: link pairs, dbus,
change_profile, networking, labels, ipc, ...
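
As a rough illustration of the duplication being saved, the sketch below
generates both encodings from the same list of parts. The function names are
hypothetical and the output is only the textual form quoted above, not
anything the parser emits today.

```python
def prefix_rules(parts, perm):
    """Current style: one rule per prefix of the \\0-separated parts."""
    return [r"\0".join(parts[:i]) + " " + perm
            for i in range(1, len(parts) + 1)]

def embedded_rule(parts, perm):
    """Proposed style: a single rule with the perm embedded after each part."""
    return r"\0".join(part + "\\" + perm for part in parts)

if __name__ == "__main__":
    parts = ["part1", "part2", "part3", "part4", "part5"]
    for rule in prefix_rules(parts, "<perm>"):
        print(rule)
    print(embedded_rule(parts, "<perm>"))
```

With N parts the first form needs N rules that all share a common prefix,
while the second carries the same information in one rule.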

>> I think there are 2 questions to answer, what set should we provide
>> for the pcre style syntax, and what subset for aare?
>>
>> Below are some notes I have from the last time I was looking at it
>> (not that they will really clear things up any)
>>
>> ---
>>
>> \@{variable}  variable reference
>> \^	?start regex
>> \$	?end regex (return to globbing)
> 
> Can we (ab)use \A and \Z here instead? Those match beginning and end of
> strings, so they'd presumably never appear in a regex.
> 
No, I actually want to be able to specify start and end for the wider regex
engine.

But I completely understand that \^ and \$ aren't ideal either. I am open
to alternatives

>> \#{perm}    ?embedded perm
> 
> This seems too easy to overlook when reviewing profiles; where would it be
> useful?
> 
It's useful in encoding rules. I would actually expect to see it in aare or
in actual rules.

>> \-	?logical set operation minus?
>> \&	?logical set operation and?
>>
>>
>> see man pcrepattern
>>
>> \	general escape character
>> ^	assert start of string
>> $	assert end of string
> 
> When would start and end be useful?
> 
I don't know, it may not ever be useful. It might only be useful when/if the
backend is used outside its current context.

It might be useful, for example, in my plan to extend the parser to
accept/ignore new rule types that it has no knowledge of: in policy you
would define a keyword that starts the rule and the pattern that the rule
matches. This will give us the ability to cleanly extend the parser to at
least ignore unknown policy rules.

>> .	match any char including newline
>> []	character class
>> [^]	negative character class
>> [x-y]  range
>> [[:xxx:]]	POSIX named set
>> [[:^xxx:]]	negative POSIX named set
> 
> Unicode character classes might also be a nice addition. Maybe.
> 
Yes, let's just say reserved. Character classes are evil :)

>> ()	subpattern
>> (?)	extended meaning for subpattern
>> |	alternation
>> ?	0 or 1 match, greedy, equiv to {0,1}
>> +	1 or more, greedy, equiv to {1,}
>> *	0 or more, greedy, equiv to {0,}
>> {n}	min/max qualifier exactly n
>> {,n}	min/max qualifier up to n
>> {n,m}	min/max qualifier at least n, no more than m, greedy
>> {n,}	min/max qualifier n or more, greedy
> 
> Will we want to expose lazy and possessive quantifiers too?
> 
No? I'm willing to be convinced otherwise but possessive quantifiers
are nasty things that have two uses.
- tell the regex engine which order to evaluate match permutation,
  primarily for performance
- abused to eliminate certain matches

The problem is the order is not well defined, nor would we reliably
eliminate certain matches. PCRE is based on a modified NFA; we are
using a DFA which will at some point be a pushdown HFA.

The DFA walks all permutations simultaneously, and even when we get
to being a pushdown HFA we will be walking multiple (usually all)
permutations simultaneously, with sets that are being walked
simultaneously being determined by the type of compression used,
which will govern the maximum number of delayed states (pushdown).

This is necessary as this allows us to have known worst case bounds
properties for the state machine, ie. no input can result in a blown
stack, or failed match when it shouldn't.
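
A toy sketch of what "walking all permutations simultaneously" means, written
as a set-of-states simulation in Python. This is not the apparmor backend;
the transition table is a hand-built, hypothetical automaton for the pattern
(a|ab)(c|bc), and the names are invented for the example.

```python
# Hypothetical automaton for (a|ab)(c|bc); state numbers are arbitrary.
NFA = {
    (0, "a"): {1, 2},  # "a" alone finishes the first group (1) or starts "ab" (2)
    (2, "b"): {1},     # "ab" finishes the first group
    (1, "c"): {4},     # second group: "c"
    (1, "b"): {3},     # second group: "bc" started
    (3, "c"): {4},     # second group: "bc" finished
}
START = 0
ACCEPT = {4}

def matches(text):
    """Track every live state at once: one pass, no backtracking, no stack."""
    states = {START}
    for ch in text:
        states = set().union(*(NFA.get((s, ch), set()) for s in states))
        if not states:
            return False  # every alternative has died
    return bool(states & ACCEPT)
```

For "abc" both readings (a + bc and ab + c) are followed in the same pass, so
there is no evaluation order for a lazy or possessive quantifier to
influence, and the work per input character is bounded by the number of
states.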

>>
>> \a	alarm - hex 07
>> \e	escape - hex 1B
>> \f	formfeed - hex 0C
>> \n	newline - hex 0A
>> \r	carriage return - hex 0D
>> \t	tab - hex 09
> 
> Maybe we can drop bell, escape, ff, backspace, tab, vertical tab, and so
> forth. It's the year 2000, people don't put those in filenames any more. :)
> I know they come practically free in the implementation, but it just feels
> so crufty to spend documentation on these oddballs.
> 
You are thinking this will only be used with filenames. We are talking about
arbitrary match extensions, where a trusted helper can match any data it
deems needed or appropriate.

>> \ddd	octal code
>> \xhh	hex code
>>
>> \cx	control-x where x is any ascii character
>>
>> .	any character including newline
>> \b	backspace
>> \d	decimal digit  [0-9]
>> \D	not decimal digit [^0-9]
>> \h	horizontal whitespace character
>> \H	not horizontal whitespace character
>> \N	not a newline
>> \s	white space character
>> \S	not a white space character
>> \v	vertical whitespace character
>> \V	not a vertical whitespace character
> 
> I'm pretty retrogrouchy but these seem a bit much :) hehehe
> 
yes some of them are, again this is just about trying to determine our needs. We
are trying to leverage what has been done before and people are familiar with but
we certainly won't implement all of pcre

>> \w	a "word" character
>> \W	not a "word" character
>> \l	lower case
>> \L	
>> \u
>> \U	upper case
>> \p	property
>> \P	not Property
>> \R	Unicode newline sequence
> 
> I love the Unicode here, though this does open a potential rat's nest:
> 
> How do we want patterns to be written?
> 
> - UTF8 only?
yes, maybe eventually. I have no plans to initially support this, just reserve
stuff so we can do clean extensions in the future if it makes sense.

> - UTF16? BE? LE?
> - UCS16? BE? LE?
> - UCS32? BE? LE?
> - UTF32? BE? LE?
> - Do we want to normalize:
>   - Filenames?
>   - Regular Expressions?
> 
> Do we want to internalize it all to codepoints to allow a UTF16 pattern
> to match a UTF8 filename? (At least I presume the kernel would forbid
> UCS- and UTF- -16 and -32 pathnames from the very start, what with "/"
> being difficult to express without a 0x00 byte in these encodings.)
> 
> The tables for upper-case and lower-case Unicode code points might be more
> kernel memory than we want to monopolize.
> 
If/when we support code points, I would assume it would be optional and the
tables would never go in the kernel, but I suppose they might be something
available to a trusted helper.

> We can't assume all filenames are valid unicode; should we provide some
> mechanism to require Unicode or other encoding schemes? Since we've
> been providing bytestream-oriented interfaces up until now, it's been
> easy enough to ignore. But if we're going to provide more features like
> this, some encoding-enforcing feels like a natural next step.
> 
The kernel only guarantees a byte stream for filenames, so we will never
enforce an encoding on filenames because of how the kernel works.

However, other mediation points might.
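
A small Python sketch of the two encoding points raised above: a filename is
just bytes and need not be valid UTF-8, and "/" in UTF-16 carries a 0x00
byte. The helper name is made up for illustration.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# "/" encoded as UTF-16 contains a NUL byte in either byte order.
SLASH_UTF16_LE = "/".encode("utf-16-le")
SLASH_UTF16_BE = "/".encode("utf-16-be")

# A Latin-1 encoded name: legal bytes for a filename, but not valid UTF-8.
LATIN1_NAME = "caf\u00e9".encode("latin-1")
```

So a byte-oriented matcher can handle LATIN1_NAME without caring what it
means, while anything that wants code points has to decide what to do when
is_valid_utf8() is false.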

>>
>> (?= )	look ahead assertion
>> (?! )	negative look ahead assertion
>> (?<= )	look behind assertion
>> (?<! )	negative look behind assertion
>> (?(conditional)yes-pattern)
>> (?(conditional)yes-pattern|no-pattern)
> 
> Sounds good..
> 
>> ({ } )	callout to fn
>>
> 
> Interesting; I didn't know you had this in your plans for world
> domination. :) My encoding tests would fit naturally here.
> 
hehe, yes within limits I think this can make a lot of sense
depending on the type of data being matched. This likely won't
be exposed at the general level but could be quite useful just
like embedded perms

>>
>> \p and \P   reserved
>>
>>
>> NOTE: \n can NOT be used as a back reference
>>
>> \gn	back reference by number
>> \g{n}	back reference by number
>> \g{-n}	relative back reference by number
>> \k<name>	 back reference by name
>> \k'name'	 back reference by name
>> \g{name}	 back reference by name
>> \k{name}	 back reference by name (.Net)
>> (?P=name)	 back reference by name (Python)
> 
> I think I missed the grouping that assigns the names.
> 
> It seems odd to offer five different syntaxes to refer to captures by
> name; it's very pleasant and kind for authoring policy but doubtless more
> than one person will have to look up in the manpage if there is any
> difference among the different syntaxes. <> and '' is polite though.
> 
oh I don't want to offer all of these, I was just noting what has been used.
I would like us to choose one syntax
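
For reference on the grouping that assigns the names, this is how one of the
listed styles looks end to end. Python's re module is used only because it
implements the (?P<name>...) / (?P=name) pair; it is not a statement about
which syntax we should pick, and the pattern name is invented for the
example.

```python
import re

# (?P<dir>...) names the capture; (?P=dir) must match the same text again.
REPEATED_COMPONENT = re.compile(r"(?P<dir>[a-z]+)/(?P=dir)")

def has_repeated_component(path: str) -> bool:
    """True when the path is exactly two identical components, e.g. tmp/tmp."""
    return REPEATED_COMPONENT.fullmatch(path) is not None
```

Whichever single syntax we choose, the definition and the reference should
look this obviously related.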