Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add method to contains for regex needle #18028

Closed
wants to merge 4 commits into from

Conversation

dhoegh
Copy link
Contributor

@dhoegh dhoegh commented Aug 14, 2016

This PR defines contains with a regex search pattern. Besides the new method I have also widen the signature of the original method to include Char as: contains(haystack::AbstractString, needle::Union{AbstractString, Char}). Besides the additions to contains, I have:

  • Moved the docs for the functions in search inline
  • Moved the test for search(::String, ::Regex) from test/strings/search to test/regex.jl as the method is defined in regex.jl.
  • Added methods to contains and test for the methods plus additional tests for contains in general as it seemed not to be tested directly anywhere.
  • Fixed an reference to ASCII and UTF8 string.

It would be good to backport this to 0.5.1 as it increases test coverage and fixes docs. The additional methods for contains are probably not to be viewed as a "new" feature?

@tkelman tkelman added the strings "Strings!" label Aug 14, 2016
@ararslan
Copy link
Member

The additional methods for contains are probably not to be viewed as a "new" feature?

It seems to me like it would constitute a new feature, though that's of course up to one of the owners to decide.

This PR defines contains with a regex search pattern.

We already have this functionality with the ismatch function, so IMHO it would be kind of confusing to have it duplicated in contains, especially because the order of the arguments is reversed. (ismatch takes the regex first, your contains method takes the regex second.)

The method with Char seems okay to me, though for that we also have in.

Anyway, that's my $0.02.

@dhoegh
Copy link
Contributor Author

dhoegh commented Aug 15, 2016

@ararslan you are correct, the functionality are there today. But from my point of view the current state is cluttered, as a test if a string contains a substring/regex/char requires different functions, see example:

#For characters use `in`
'H' in "Hello"
# For strings 
contains("Hello", "He")
# For regex pattern
ismatch(r"He","Hello")

From my point of view, Julia should unify this interface better using multiple dispatch. I would propose these three tests should use the same function and then deprecate all others. Then this is definitely first in 0.6. I do not care for the name or argument order. But i think the name should reflect the order. in suggest the needle argument first, contains suggest needle last, ismatch do to me not reflect a certain order of the arguments.

in(::String, ::String) has implemented an error that states you should use contians. The error message was implemented in 143395e. The error was probly implemented due to a decision. cc @JeffBezanson.

From my perspective the best choice seem to unify on using in as:

#For characters use `in`
'H' in "Hello"
# For strings 
"He" in "Hello"
# For regex pattern
r"He" in "Hello"

Python do also use in for characters and strings, but not for regex.

@ararslan
Copy link
Member

Hm, I'm not so much a fan of using in that way. In a broader sense, in typically means "element in collection," and strings are essentially collections of Chars. So IMO "regex in collection" or even "string in collection" seems a little iffy based on that definition, as neither regular expressions nor strings are elements of strings.

With this in mind, I think the current separation of in, contains, and ismatch makes sense, as they all mean slightly different things.

  • a in b: Is a an element in the collection b?
  • contains(b, a): Does the string b contain exactly the string a?
  • ismatch(r, b): Does the string b have at least one match of the regular expression r?

And at least IMO, ismatch shouldn't be deprecated as it belongs in the match and matchall family.

But as I say often around these parts, my opinion here doesn't really matter since I'm just some guy, I just figured I'd state my position on it anyway, just in case. 😄

@kmsquire
Copy link
Member

But as I say often around these parts, my opinion here doesn't really matter since I'm just some guy, I just figured I'd state my position on it anyway, just in case. 😄

@ararslan, most of us are just "some guy/gal", and your opinion certainly counts, especially with your recent involvement. (IMHO, only uninformed and rudely expressed opinions count less, so I think you're safe...)

I actually agree that, in general, a streamlined interface makes sense. However, my preference is probably to get rid of contains all together.

If we decide to keep it, though, I think this is a good change.

@StefanKarpinski
Copy link
Member

I could see merging ismatch into contains since there's no other sensible way in which a string could contain a regex pattern, but as @ararslan points out, in is quite different and test for element containment.

@StefanKarpinski StefanKarpinski added speculative Whether the change will be implemented is speculative design Design of APIs or of the language itself labels Aug 16, 2016
@StefanKarpinski StefanKarpinski added this to the 0.6.0 milestone Aug 16, 2016
@dhoegh dhoegh closed this Aug 16, 2016
@dhoegh
Copy link
Contributor Author

dhoegh commented Aug 16, 2016

Sorry hit the wrong button.

I see a regex as a way of specifying what you look for. As an example a regex could explain to look for a needle with a size of 4-5cm in the haystack. With this in mind I think it makes sense to use contains.

@dhoegh dhoegh reopened this Aug 16, 2016
function contains(haystack::Union{SubString{String}, String}, re::Regex)
compile(re)
PCRE.exec(re.regex, haystack, start(haystack)-1, re.match_options, re.match_data)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just call ismatch(re, haystack)?

@simonbyrne
Copy link
Contributor

I hadn't seen this PR when I opened #19250. Is there any chance you could move the docs changes to a separate PR (which should be straightforward to merge), then we can hash out the changes here?

@tkelman tkelman modified the milestones: 1.0, 0.6.0 Dec 29, 2016
@JeffBezanson JeffBezanson added the collections Data structures holding multiple items, e.g. sets label Sep 11, 2017
@JeffBezanson
Copy link
Member

Bump. I think we should do two things here. (1) Merge ismatch and contains. (2) contains has another method (contains(eq::Function, itr, x)) that's unrelated to substrings and has a weird signature; it should be contains(predicate, itr). The signature should be changed, and the function possibly renamed.

@JeffBezanson JeffBezanson added the triage This should be discussed on a triage call label Sep 11, 2017
@JeffBezanson
Copy link
Member

From triage: ismatch sounds like it requires the entire string to match the regex, when in fact it tests whether the string contains a match of the regex. So absorbing ismatch into contains seems like a good idea to me. (Plus as @simonbyrne points out in #19250, search and replace already accept both strings and regexes interchangeably.)

@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Sep 14, 2017
@StefanKarpinski
Copy link
Member

Are we planning on replacing match as well? Python does actually use re.match(str) to mean that the regular expression matches the entire string rather than a substring of it.

@stevengj
Copy link
Member

If it were up to me, I'd be tempted to rename match to findfirst, since we already have the latter and "findfirst" is more descriptive of what it actually does.

@ararslan
Copy link
Member

findfirst returns an index though, whereas match returns a RegexMatch. I don't think findfirst would be an appropriate replacement.

@nalimilan
Copy link
Member

@dhoegh Given what seems to be the consensus above, could you rebase the PR and keep only the changes deprecating ismatch in favor of contains? Docstring changes would better go to another PR (and anyway I'm working on deprecating search altogether).

@nalimilan
Copy link
Member

I've added a commit merging ismatch into contains to PR #24673. Closing.

@nalimilan nalimilan closed this Dec 16, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets design Design of APIs or of the language itself search & find The find* family of functions speculative Whether the change will be implemented is speculative strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants