Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated email address matching pattern #9

Merged
merged 9 commits into from
Apr 22, 2023
Merged

Updated email address matching pattern #9

merged 9 commits into from
Apr 22, 2023

Conversation

hiteshbedre
Copy link
Contributor

@hiteshbedre hiteshbedre commented Apr 22, 2023

Handled most of the cases mentioned under "LegacyTests.cs" file.

Sample Email Correctly Matched
Mary&Jane@example.org
""test\""blah""@example.com
customer/department@example.com
$A12345@example.com
!def!xyz%abc@example.com
_Yosemite.Sam@example.com
Ima.Fool@example.com
foobar@x.com
foobar@c0m.com
foobar@c_m.com
foo@bar_.com
foo@666.com
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111@example.com

Invalid cases handled:

Invalid Email Address Regex Invalidating?
char)8 + "ar.com
char)9 + "ar.com
char)127 + "ar.com
.wooly@example.com
pootietang.@example.com
.@example.com
foo@bar
foo@bar.a

Pattern not being handled via Regex:

Email Description Handled via Code?
pootietang.@example.com dot_before_at
wo..oly@example.com consecutive_dots
foo@bar.1com tld_starting_with_number
foobar@_.com domain_with_underscore
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111@example.com email_of_256_chars

Fix for issue #5

Copy link

@SmithPlatts SmithPlatts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of \w here, instead of [a-zA-Z0-9], means that accented characters are also matched (unless .NET Regex does something special with [a-zA-Z0-9], otherwise it's an ASCII match.)

All round, a good change.

@KonajuGames
Copy link

Regexes are one of these where there is no one correct answer. Everyone has a different take, and that is just how they seem to be designed. And I will admit I am no magician in regex. I have generally tried to avoid them in my professional career.

I've got it down to one test fail (unescaped_double_quote_is_invalid), but it does need some code filtering along with the regex. The regex expression simply cannot do it all by itself.

        public List<string> ExtractAddresses(string content)
        {
            if (string.IsNullOrWhiteSpace(content))
                return new();
            // Expression allows some false positives through. These will be filtered later.
            string addressPattern = @"[\p{L}\d\-_\+\/\&\!\%""\\\.\*]+[^\.]@[\p{L}\d][\p{L}\d\.\-_]*\.[a-z]{2,}\b";
            var matches = Regex.Matches(content, addressPattern, RegexOptions.IgnoreCase);
            var uniqueAddresses = new HashSet<string>();

            foreach (Match match in matches)
            {
                var address = match.Value;
                // Filter out the false positives that the regex expression could not handle
                if (address[0] == '.')
                    continue;
                if (address.Contains('*'))
                    continue;
                if (address.Contains(".."))
                    continue;
                if (address.Length >= 256)
                    continue;
                uniqueAddresses.Add(match.Value.ToLower());
            }

            return uniqueAddresses.OrderBy(a => a).ToList();
        }

@hiteshbedre
Copy link
Contributor Author

Regexes are one of these where there is no one correct answer.

Completely agree with your statement.

@hiteshbedre
Copy link
Contributor Author

accented characters

Its a new learning for me. Thanks for sharing, updated the pull request accordingly.

@troyhunt troyhunt merged commit aafadd4 into HaveIBeenPwned:main Apr 22, 2023
@troyhunt
Copy link
Contributor

Good progress @hiteshbedre, thank you! Still got a few failing legacy tests, want to take a stab at those?

@hiteshbedre
Copy link
Contributor Author

Yeah sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants