Improve set handling #55

Merged on Aug 28, 2018 (29 commits)
Commits
35a5588
Make set members subexpressions instead
jaynetics Apr 13, 2018
4e63c3a
Merge branch 'master' into improve_set_handling
jaynetics Apr 22, 2018
26701ef
Make CharacterSet non-terminal
jaynetics Apr 22, 2018
b0dcf49
Extract char_type scanner and use it in set and main scanner
jaynetics Apr 27, 2018
55efa8b
Emit properties with type :property/:nonproperty in sets, too
jaynetics Apr 27, 2018
4b14a40
Replace member_hex and range_hex with std tokens
jaynetics Apr 27, 2018
70c9097
Replace set member token with literal that is not merged in Lexer
jaynetics Apr 28, 2018
a90a24e
Replace :set, :escape token by reusing escape_sequence scanner
jaynetics Apr 28, 2018
7aba19e
Remove :escape, :space as \s is always a char type
jaynetics Apr 28, 2018
9fee8d7
Handle props and char types in sets through shared escape scanner
jaynetics Apr 28, 2018
259cdf0
Add more tests
jaynetics Apr 28, 2018
bcf2213
Remove some unused tokens and expressions
jaynetics Apr 28, 2018
0014db2
Use std #unshift instead of overridden #insert
jaynetics Apr 28, 2018
2efd57e
Simplify whitelisting of warnings
jaynetics Apr 29, 2018
0be70c0
Introduce Intersection and Range Subexpression classes
jaynetics Apr 29, 2018
ab38475
Merge branch 'master' into improve_set_handling
jaynetics Apr 29, 2018
5270318
Revert merge mistake
jaynetics Apr 30, 2018
203dc06
Classify backspace in sets as escape sequence - more informative
jaynetics Apr 29, 2018
eea0bfe
Use new Property-like CharacterClass exp, not Literal, for [:...:]
jaynetics Apr 29, 2018
9205cc5
Add missing codepoint and use unused hex escape classes
jaynetics Apr 30, 2018
fdcc3d8
Remove wide hex escapes, not supported in Ruby >= 1.8.6
jaynetics Apr 30, 2018
e43e9d3
Sharpen set parse tests, extract #include? test to exp test file
jaynetics Apr 30, 2018
3d1a13c
Add SequenceOperation to match Alternation behavior in Intersection
jaynetics May 3, 2018
c9aa2f0
Let Expressions keep track of their overall #nesting_level ...
jaynetics May 4, 2018
a098c88
Fix and actually run CharacterClass tests, add #nesting_level test
jaynetics May 5, 2018
ce37e12
Rename CharacterClass->PosixClass, avoid confusion with character sets
jaynetics May 21, 2018
9b7a0bd
Emit previously unused EscapeSequence::Octal for :escape, :octal tokens
jaynetics May 21, 2018
7df8243
Prepare ChangeLog entry
jaynetics May 21, 2018
21b7e13
Merge branch 'master' into improve_set_handling
jaynetics Aug 28, 2018
19 changes: 16 additions & 3 deletions ChangeLog
@@ -1,13 +1,26 @@
UPCOMING

* Breaking changes to character set and property handling:
* Changed parsing of sets (a.k.a. character classes or "bracket expressions")
- see PR #55 / issue #47 for details
- sets are now parsed to expression trees like other nestable expressions
- #scan now emits the same tokens as outside sets (no longer :set, :member)
- new Range and Intersection classes represent corresponding syntax features
- a new PosixClass expression class represents e.g. [[:ascii:]]
- PosixClass instances behave like Property ones, e.g. support #negative?
- #scan emits :(non)posixclass, :<type> instead of :set, :char_(non)<type>
* Changed Expression emissions for some escape sequences
- EscapeSequence::Codepoint, CodepointList, Hex and Octal are now all used
- they already existed, but were all parsed as EscapeSequence::Literal
- e.g. \x97 is now EscapeSequence::Hex instead of EscapeSequence::Literal
* Changed naming of many property tokens (emitted for \p{...})
- if you work with these tokens, see PR #56 for details
* Added support for all previously missing properties (about 250)
* Added Expression::UnicodeProperty#shortcut (e.g. returns 'm' for '\p{mark}')
- e.g. :punct_dash is now :dash_punctuation
* Fixed ruby version mapping of some properties
* Fixed scanning of some property spellings, e.g. with dashes
* Fixed some incorrect property alias normalizations
* Improved the speed of the properties machine
* Added support for all previously missing properties (about 250)
* Added Expression::UnicodeProperty#shortcut (e.g. returns 'm' for '\p{mark}')
* Bumped version to XXX

Sun Apr 29 2018 Janosch Müller <janosch84@gmail.com>
7 changes: 6 additions & 1 deletion lib/regexp_parser/expression.rb
@@ -3,7 +3,7 @@ module Regexp::Expression
class Base
attr_accessor :type, :token
attr_accessor :text, :ts
attr_accessor :level, :set_level, :conditional_level
attr_accessor :level, :set_level, :conditional_level, :nesting_level

attr_accessor :quantifier
attr_accessor :options
@@ -16,6 +16,7 @@ def initialize(token, options = {})
self.level = token.level
self.set_level = token.set_level
self.conditional_level = token.conditional_level
self.nesting_level = 0
self.quantifier = nil
self.options = options
end
@@ -169,6 +170,7 @@ def self.parsed(exp)
require 'regexp_parser/expression/quantifier'
require 'regexp_parser/expression/subexpression'
require 'regexp_parser/expression/sequence'
require 'regexp_parser/expression/sequence_operation'

require 'regexp_parser/expression/classes/alternation'
require 'regexp_parser/expression/classes/anchor'
@@ -179,7 +181,10 @@ def self.parsed(exp)
require 'regexp_parser/expression/classes/group'
require 'regexp_parser/expression/classes/keep'
require 'regexp_parser/expression/classes/literal'
require 'regexp_parser/expression/classes/posix_class'
require 'regexp_parser/expression/classes/property'
require 'regexp_parser/expression/classes/root'
require 'regexp_parser/expression/classes/set'
require 'regexp_parser/expression/classes/set/intersection'
require 'regexp_parser/expression/classes/set/range'
require 'regexp_parser/expression/classes/type'
33 changes: 5 additions & 28 deletions lib/regexp_parser/expression/classes/alternation.rb
@@ -1,33 +1,10 @@
module Regexp::Expression

# This is not a subexpression really, but considering it one simplifies
# the API when it comes to handling the alternatives.
class Alternation < Regexp::Expression::Subexpression
alias :alternatives :expressions

def starts_at
expressions.first.starts_at
end
alias :ts :starts_at

def <<(exp)
expressions.last << exp
end

def alternative(exp = nil)
expressions << (exp ? exp : Alternative.new(level, set_level, conditional_level))
end

def quantify(token, text, min = nil, max = nil, mode = :greedy)
alternatives.last.last.quantify(token, text, min, max, mode)
end

def to_s(format = :full)
alternatives.map{|e| e.to_s(format)}.join('|')
end
end

# A sequence of expressions, used by Alternation as one of its alternative.
class Alternative < Regexp::Expression::Sequence; end

class Alternation < Regexp::Expression::SequenceOperation
OPERAND = Alternative

alias :alternatives :expressions
end
end
4 changes: 2 additions & 2 deletions lib/regexp_parser/expression/classes/escape.rb
@@ -11,13 +11,13 @@ class Bell < EscapeSequence::Base; end
class FormFeed < EscapeSequence::Base; end
class Newline < EscapeSequence::Base; end
class Return < EscapeSequence::Base; end
class Space < EscapeSequence::Base; end
class Tab < EscapeSequence::Base; end
class VerticalTab < EscapeSequence::Base; end

class Codepoint < EscapeSequence::Base; end
class CodepointList < EscapeSequence::Base; end
class Octal < EscapeSequence::Base; end
class Hex < EscapeSequence::Base; end
class HexWide < EscapeSequence::Base; end

class Control < EscapeSequence::Base; end
class Meta < EscapeSequence::Base; end
11 changes: 11 additions & 0 deletions lib/regexp_parser/expression/classes/posix_class.rb
@@ -0,0 +1,11 @@
module Regexp::Expression
class PosixClass < Regexp::Expression::Base
def negative?
type == :nonposixclass
end

def name
token.to_s
end
end
end
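
A minimal, self-contained sketch of how the two methods in this new file behave. `PosixClassSketch` is a hypothetical stand-in, not the library's class. It assumes that `type` is `:posixclass` or `:nonposixclass` and that `token` holds the class name as a symbol, which is what the ChangeLog entry above suggests.

```ruby
# Hypothetical stand-in for Regexp::Expression::PosixClass.
# Only the two methods from the diff are reproduced here.
class PosixClassSketch
  attr_reader :type, :token

  # type:  :posixclass or :nonposixclass (assumed from the ChangeLog)
  # token: the class name as a symbol, e.g. :alpha (assumed)
  def initialize(type, token)
    @type = type
    @token = token
  end

  # A negated POSIX class such as [[:^digit:]] is flagged by its type.
  def negative?
    type == :nonposixclass
  end

  # The name is simply the token rendered as a string.
  def name
    token.to_s
  end
end

alpha = PosixClassSketch.new(:posixclass, :alpha)
non_digit = PosixClassSketch.new(:nonposixclass, :digit)

puts alpha.negative?     # false
puts non_digit.negative? # true
puts non_digit.name      # digit
```

Deriving `negative?` from the type rather than storing a flag mirrors how the ChangeLog describes `Property` expressions, so callers can treat both kinds the same way.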
139 changes: 51 additions & 88 deletions lib/regexp_parser/expression/classes/set.rb
@@ -1,110 +1,73 @@
module Regexp::Expression

class CharacterSet < Regexp::Expression::Base
attr_accessor :members
class CharacterSet < Regexp::Expression::Subexpression
attr_accessor :closed, :negative

alias :negative? :negative
alias :negated? :negative
alias :closed? :closed

def initialize(token, options = {})
@members = []
@negative = false
@closed = false
self.negative = false
self.closed = false
super
end

# Override base method to clone set members as well.
def clone
copy = super
copy.members = @members.map {|m| m.clone }
copy
end

def <<(member)
if @members.last.is_a?(CharacterSubSet) and not @members.last.closed?
@members.last << member
else
@members << member
end
end

def include?(member, directly = false)
@members.each do |m|
if m.is_a?(CharacterSubSet) and not directly
return true if m.include?(member)
else
return true if member == m.to_s
end
end; false
end

def each(&block)
@members.each {|m| yield m}
def negate
self.negative = true
end

def each_with_index(&block)
@members.each_with_index {|m, i| yield m, i}
def close
self.closed = true
end

def length
@members.length
def to_s(format = :full)
"#{text}#{'^' if negated?}#{expressions.join}]#{quantifier_affix(format)}"
end

def negate
if @members.last.is_a?(CharacterSubSet)
@members.last.negate
else
@negative = true
# TODO: these made more sense with string members. remove/replace in v1.0.0?
module LegacyCompatibilityMethods
def members
expressions.map { |exp| exp.is_a?(CharacterSet) ? exp : exp.to_s }
end
end

def negative?
@negative
end
alias :negated? :negative?

def close
if @members.last.is_a?(CharacterSubSet) and not @members.last.closed?
@members.last.close
else
@closed = true
# Returns an array of the members with any shorthand members like \d and \W
# expanded to either traditional form or unicode properties.
def expand_members(use_properties = false)
members.map do |member|
case member
when "\\d"
use_properties ? '\p{Digit}' : '0-9'
when "\\D"
use_properties ? '\P{Digit}' : '^0-9'
when "\\w"
use_properties ? '\p{Word}' : 'A-Za-z0-9_'
when "\\W"
use_properties ? '\P{Word}' : '^A-Za-z0-9_'
when "\\s"
use_properties ? '\p{Space}' : ' \t\f\v\n\r'
when "\\S"
use_properties ? '\P{Space}' : '^ \t\f\v\n\r'
when "\\h"
use_properties ? '\p{Xdigit}' : '0-9A-Fa-f'
when "\\H"
use_properties ? '\P{Xdigit}' : '^0-9A-Fa-f'
else
member
end
end
end
end

def closed?
@closed
end

# Returns an array of the members with any shorthand members like \d and \W
# expanded to either traditional form or unicode properties.
def expand_members(use_properties = false)
@members.map do |member|
case member
when "\\d"
use_properties ? '\p{Digit}' : '0-9'
when "\\D"
use_properties ? '\P{Digit}' : '^0-9'
when "\\w"
use_properties ? '\p{Word}' : 'A-Za-z0-9_'
when "\\W"
use_properties ? '\P{Word}' : '^A-Za-z0-9_'
when "\\s"
use_properties ? '\p{Space}' : ' \t\f\v\n\r'
when "\\S"
use_properties ? '\P{Space}' : '^ \t\f\v\n\r'
when "\\h"
use_properties ? '\p{Xdigit}' : '0-9A-Fa-f'
when "\\H"
use_properties ? '\P{Xdigit}' : '^0-9A-Fa-f'
else
member
def include?(member, directly = false)
members.any? do |m|
if m.is_a?(CharacterSet)
!directly && m.include?(member)
else
m == member
end
end
end
end

def to_s(format = :full)
"#{text}#{'^' if negative?}#{members.join}]#{quantifier_affix(format)}"
end
include LegacyCompatibilityMethods
end

class CharacterSubSet < CharacterSet
end

end # module Regexp::Expression
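
The legacy `members` and `include?` methods above can be sketched in isolation as follows. `SetSketch` is a hypothetical stand-in that takes its child expressions as plain strings or nested `SetSketch` instances; the real class works on parsed expression objects.

```ruby
# Hypothetical stand-in for CharacterSet with only the legacy lookups.
class SetSketch
  attr_reader :expressions

  def initialize(*expressions)
    @expressions = expressions
  end

  # Nested sets are kept as objects, everything else becomes a string.
  def members
    expressions.map { |exp| exp.is_a?(SetSketch) ? exp : exp.to_s }
  end

  # With directly = true, nested sets are not searched.
  def include?(member, directly = false)
    members.any? do |m|
      if m.is_a?(SetSketch)
        !directly && m.include?(member)
      else
        m == member
      end
    end
  end
end

inner = SetSketch.new('c', 'd')
outer = SetSketch.new('a', '\\d', inner)

puts outer.include?('a')       # true
puts outer.include?('c')       # true, found in the nested set
puts outer.include?('c', true) # false, nested sets are skipped
puts outer.include?('\\d')     # true
```

Note that the lookup compares member strings only; it does not test whether a character would be matched by a range or a shorthand such as `\d`.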
9 changes: 9 additions & 0 deletions lib/regexp_parser/expression/classes/set/intersection.rb
@@ -0,0 +1,9 @@
module Regexp::Expression
class CharacterSet < Regexp::Expression::Subexpression
class IntersectedSequence < Regexp::Expression::Sequence; end

class Intersection < Regexp::Expression::SequenceOperation
OPERAND = IntersectedSequence
end
end
end
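
The `sequence_operation.rb` file is not part of this excerpt, so the following is only a guess at the pattern that `Alternation` and `Intersection` now share, based on the methods removed from `Alternation` above. All class names ending in `Sketch`, the `OPERATOR` constant, and the `add_sequence` method are assumptions made for illustration.

```ruby
# Hypothetical stand-in for Regexp::Expression::Sequence.
class SequenceSketch
  attr_reader :expressions

  def initialize
    @expressions = []
  end

  def <<(exp)
    @expressions << exp
    self
  end

  def to_s
    @expressions.join
  end
end

# Hypothetical base class: holds several sequences joined by an operator.
class SequenceOperationSketch
  OPERAND = SequenceSketch
  OPERATOR = ''

  attr_reader :expressions

  def initialize
    @expressions = []
  end

  # Starts a new operand; subclasses choose the class via OPERAND.
  def add_sequence
    @expressions << self.class::OPERAND.new
    self
  end

  # New expressions always go into the most recent operand.
  def <<(exp)
    add_sequence if @expressions.empty?
    @expressions.last << exp
    self
  end

  def to_s
    @expressions.map(&:to_s).join(self.class::OPERATOR)
  end
end

class AlternativeSketch < SequenceSketch; end
class AlternationSketch < SequenceOperationSketch
  OPERAND = AlternativeSketch
  OPERATOR = '|'
end

class IntersectedSequenceSketch < SequenceSketch; end
class IntersectionSketch < SequenceOperationSketch
  OPERAND = IntersectedSequenceSketch
  OPERATOR = '&&'
end

alternation = AlternationSketch.new
alternation << 'a' << 'b'
alternation.add_sequence
alternation << 'c'
puts alternation.to_s # ab|c

intersection = IntersectionSketch.new
intersection << 'a-z'
intersection.add_sequence
intersection << '^aeiou'
puts intersection.to_s # a-z&&^aeiou
```

With this shape each subclass only declares its operand class, which matches how small the `Alternation` and `Intersection` definitions in the diff have become.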
23 changes: 23 additions & 0 deletions lib/regexp_parser/expression/classes/set/range.rb
@@ -0,0 +1,23 @@
module Regexp::Expression
class CharacterSet < Regexp::Expression::Subexpression
class Range < Regexp::Expression::Subexpression
def starts_at
expressions.first.starts_at
end
alias :ts :starts_at

def <<(exp)
complete? && raise("Can't add more than 2 expressions to a Range")
super
end

def complete?
count == 2
end

def to_s(_format = :full)
expressions.join(text)
end
end
end
end
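
A self-contained sketch of the two-element guard in `Range`. `RangeSketch` is a hypothetical stand-in that stores plain strings and assumes the range's own `text` is the `-` separator, which is what the `expressions.join(text)` call in `to_s` implies.

```ruby
# Hypothetical stand-in for CharacterSet::Range.
class RangeSketch
  attr_reader :expressions, :text

  # text is assumed to be the range operator itself.
  def initialize(text = '-')
    @text = text
    @expressions = []
  end

  # A range holds exactly a start and an end; a third element is an error.
  def <<(exp)
    complete? && raise("Can't add more than 2 expressions to a Range")
    @expressions << exp
    self
  end

  def complete?
    @expressions.count == 2
  end

  def to_s
    @expressions.join(text)
  end
end

range = RangeSketch.new
range << 'a'
puts range.complete? # false
range << 'z'
puts range.complete? # true
puts range.to_s      # a-z

begin
  range << 'x'
rescue RuntimeError => e
  puts e.message     # Can't add more than 2 expressions to a Range
end
```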
10 changes: 6 additions & 4 deletions lib/regexp_parser/expression/methods/strfregexp.rb
@@ -40,14 +40,16 @@ def strfregexp(format = '%a', indent_offset = 0, index = nil)

part = {}

print_level = nesting_level > 0 ? nesting_level - 1 : nil

# Order is important! Fields that use other fields in their
# definition must appear before the fields they use.
part_keys = %w{a m b o i l x s e S y k c q Q z Z t ~t T >}
part.keys.each {|k| part[k] = "<?#{k}?>"}

part['>'] = level ? (' ' * (level + indent_offset)) : ''
part['>'] = print_level ? (' ' * (print_level + indent_offset)) : ''

part['l'] = level ? "#{'%d' % level}" : 'root'
part['l'] = print_level ? "#{'%d' % print_level}" : 'root'
part['x'] = "#{'%d' % index}" if have_index

part['s'] = starts_at
@@ -101,9 +103,9 @@ class Subexpression < Regexp::Expression::Base
def strfregexp_tree(format = '%a', include_self = true, separator = "\n")
output = include_self ? [self.strfregexp(format)] : []

output += map {|exp, index|
output += map do |exp, index|
exp.strfregexp(format, (include_self ? 1 : 0), index)
}
end

output.join(separator)
end
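
The switch from `level` to `print_level` in the `%>` and `%l` fields can be traced with a small sketch. It assumes that the root expression keeps the default `nesting_level` of 0 and that each nested expression sits one level deeper; how `nesting_level` is assigned is not shown in this excerpt. The helper names and the `INDENT` constant are hypothetical, and the indent width is a guess because whitespace in the excerpt is collapsed.

```ruby
# Assumed indent unit; the original literal is not recoverable from the excerpt.
INDENT = '  '

# Mirrors the new line in strfregexp: nil marks the root expression.
def print_level_for(nesting_level)
  nesting_level > 0 ? nesting_level - 1 : nil
end

# Builds only the two fields that depend on print_level.
def level_fields(nesting_level, indent_offset = 0)
  print_level = print_level_for(nesting_level)
  {
    '>' => print_level ? INDENT * (print_level + indent_offset) : '',
    'l' => print_level ? format('%d', print_level) : 'root'
  }
end

p level_fields(0)    # {">"=>"", "l"=>"root"} on Ruby versions before 3.4
p level_fields(1)    # direct child of the root: level label "0", no indent
p level_fields(3, 1) # label "2", indented by three units
```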
18 changes: 4 additions & 14 deletions lib/regexp_parser/expression/methods/tests.rb
@@ -7,22 +7,12 @@ class Base
# # is it a :group expression
# exp.type? :group
#
# # is it a :set, :subset, or :meta
# exp.type? [:set, :subset, :meta]
# # is it a :set, or :meta
# exp.type? [:set, :meta]
#
def type?(test_type)
case test_type
when Array
if test_type.include?(:*)
return (test_type.include?(type) or test_type.include?(:*))
else
return test_type.include?(type)
end
when Symbol
return (type == test_type or test_type == :*)
else
raise "Array or Symbol expected, #{test_type.class.name} given"
end
test_types = Array(test_type).map(&:to_sym)
test_types.include?(:*) || test_types.include?(type)
end

# Test if this expression has the given test_token, and optionally a given
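
The rewritten `type?` can be exercised on its own. `TypeTestSketch` is a hypothetical stand-in that carries only a `type`; the method body is copied from the diff. One visible difference from the removed version is that strings are now accepted because of the `to_sym` conversion, while other inputs (for example integers) fail with `NoMethodError` instead of the old custom error message.

```ruby
# Hypothetical stand-in carrying only the type attribute.
class TypeTestSketch
  attr_reader :type

  def initialize(type)
    @type = type
  end

  # Kernel#Array wraps a single value and leaves arrays untouched,
  # so one code path handles both call styles.
  def type?(test_type)
    test_types = Array(test_type).map(&:to_sym)
    test_types.include?(:*) || test_types.include?(type)
  end
end

set = TypeTestSketch.new(:set)

puts set.type?(:set)             # true
puts set.type?([:group, :meta])  # false
puts set.type?([:set, :meta])    # true
puts set.type?(:*)               # true, wildcard matches any type
puts set.type?('set')            # true, strings are converted to symbols
```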