Email Address Recognition with a Deterministic Finite Automaton

Introduction

This project uses a determinstic finite automaton(DFA) to recognize strings that are in a regular language of valid email addresses $L$, fully defined below. This markdown document contains the description of $L$, the illustration of the DFA for $L$, and the formal mathematical 5-tuple defintion for this DFA. The source file dfa.cpp contains an implementation of the DFA that allows a user to enter strings and view the real-time processing of them, and will determine if the string is or is not in the language. A more in depth descritpion of each item is contained below.

The Language $L$

Alphabet of $L$

The Language $L$ is defined over the alphabet $\Sigma$, which itself is defined as $\Sigma = \Psi \cup \Pi \cup \Phi$ with:

$\Psi =$ {a, b, c,. . .,z} being the set of all lower case roman letters
$\Pi =$ {.}
$\Phi =$ {@}

In short, the alphabet for this language is all lowercase roman numerals, the "." and the "@" symbol.

Definition of $L$

From here, we define the language $L$ to be the set of strings over the alphabet $\Sigma$ that are of the form: $L = S_{1}S_{2}^\star\Phi S_{1}S_{2}^\star(S_{3} \cup S_{4})$, where:

$S_{1} = \Psi\Psi^\star$, which defines strings of lowercase roman numerals of length at least 1
$S_{2} = \Pi\Psi\Psi^\star$, which defines strings like $S_{1}$, except that they begin with a dot
$S_{3} =$ {.gov}, one possible valid ending of email addresses in $L$
$S_{4} =$ {.gr}, the other possible valid ending of email addresses in $L$

To make this more grounded, two examples of strings that are defined over the Language L are: "abc@learner.gr" or "a.b.c.d@d.c.b.a.gov". Notice how every "." is followed by at least one letter, and there is only one "@" symbol. Strings like "my@website@example.gov" would not be accepted, even though this string ends properly in a ".gov", the use of two "@" symbols makes it incorrect. Strings like "abc@abc.com" are easier to identify as not being in the language, because stings in $L$ must end in ".gov" or ".gr", regardless of proper grammar otherwise.

$L$ is a Regular Language

By definition, a langauge is regular if there exist a Deterministic Finite Automaton that accepts every string in the language and at the same time rejects every string not in the language. As we will soon see, there is a DFA $M$ that recognizes every string in $L$ and rejects every string not in $L$. The existence of $M$ shows us in practice that $L$ is a regular language, by definition.

Deterministic Finite Automaton(DFA) $M$, for $L$

5-Tuple Definition for $M$

Formally, the DFA that recognizes the language $L$ can be defined as the 5-Tuple $M = (Q, \Sigma, \delta, q_{1}, F)$ with:

$Q =$ { $q_{1}, q_{2}, q_{3}, q_{4}, q_{5}, q_{6}, q_{7}, q_{8}, q_{9}, q_{10}$ } is the set of all states in $M$
$\Sigma = \Psi \cup \Pi \cup \Phi$ is the alphabet that $L$ is defined over, as described above

$\delta: Q \times \Sigma \rightarrow Q$ is the transition function, which is defined as:

Q	$\Psi_{-g, r, o, v}$	g	r	o	v	$\Pi$	$\Phi$
$q_{1}$	$q_{2}$	$q_{2}$	$q_{2}$	$q_{2}$	$q_{2}$	$q_{10}$	$q_{10}$
$q_{2}$	$q_{2}$	$q_{2}$	$q_{2}$	$q_{2}$	$q_{2}$	$q_{1}$	$q_{3}$
$q_{3}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{10}$	$q_{10}$
$q_{4}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{5}$	$q_{10}$
$q_{5}$	$q_{4}$	$q_{6}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{10}$	$q_{10}$
$q_{6}$	$q_{4}$	$q_{4}$	$q_{7}$	$q_{8}$	$q_{4}$	$q_{5}$	$q_{10}$
$q_{7}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{5}$	$q_{10}$
$q_{8}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{9}$	$q_{5}$	$q_{10}$
$q_{9}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{4}$	$q_{5}$	$q_{10}$
$q_{10}$	$q_{10}$	$q_{10}$	$q_{10}$	$q_{10}$	$q_{10}$	$q_{10}$	$q_{10}$

$q_{1}$ is the starting state
$F =$ { $q_{7}$, $q_{9}$ } is the set of accepting states

Note

Once the DFA goes into state $q_{10}$, there is no way to leave it. This is because $q_{10}$ is a so-called "trap state", meaning that once the DFA enters it, it is trapped in it until it is done reading the string. Strings enter $q_{10}$ after there is a fatal grammar error. For example, reading a "." as the first character will send the DFA to $q_{10}$ for the rest of the string. However, being in $q_{10}$ does not cause the DFA to terminate immediately. This is because, by definition, the DFA must process the entire string, and then accept or reject it.

Drawing of $M$

Though useful for rigorously defining the DFA $M$, the 5-Tuple definition is very abstract and not easy to understand at a glance. Below is an illustration of $M$, to better show how it works. $M$ can be represented as a directed graph, with the nodes being the states from $Q$ and the edges being labeled with what symbol must be read to advance to the next state. States that have a double circle border are accepting states.

Here is an example of how the DFA would process the string "a.bc@ex.gov"

Begin in start state $q_{1}$
Read symbol a $\in \Psi$, move to $q_{2}$
Read symbol "." $\in \Pi$, move to $q_{1}$
Read symbol b $\in \Psi$, move to $q_{2}$
Read symbol c $\in \Psi$, move to $q_{2}$
Read symbol "@" $\in \Phi$, move to $q_{3}$
Read symbol e $\in \Psi$, move to $q_{4}$
Read symbol x $\in \Psi$, move to $q_{4}$
Read symbol "." $\in \Pi$, move to $q_{5}$
Read symbol g, move to $q_{6}$
Read symbol o, move to $q_{8}$
Read symbol v, move to $q_{9}$

The entire string has been processed, and $M$ ended in $q_{9} \in F$ which is an accepting state, so the string "a.bc@ex.gov" is accepted as being in the language $L$

Programmatic Implementation of $M$

To demonstrate the functionality of these concepts, the program dfa.cpp implements $M$ as a CPP program using a one switch statement as a state machine. The transitions and logic in the program are exactly the same as in the drawing of $M$, so I will not reiterate them here. The code itself is well documented and very easy to follow. Upon execution, the program will ask the user for a number of strings to enter, and then enter a loop wherein it asks the user to enter a string, processes it in the DFA, and prints out whether or not the DFA $M$ accepts or rejects the string. I would encourage anyone interested to compile and run the program for themselves. There is a test file: test_cases.txt, which gives 20 test cases, and can be easily "piped" into the program using Bash redirection by running a command like ./"Path-to-compiled-file" < test_cases.txt. In case you want to see the output without running the program, the file tests_output.txt contains the output of all of these tests being run.

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
test_cases.txt		test_cases.txt
tests_output.txt		tests_output.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Email Address Recognition with a Deterministic Finite Automaton

Introduction

The Language $L$

Alphabet of $L$

Definition of $L$

$L$ is a Regular Language

Deterministic Finite Automaton(DFA) $M$, for $L$

5-Tuple Definition for $M$

Drawing of $M$

Programmatic Implementation of $M$

About

Releases

Packages

Languages

jackr276/Email-Address-Recognition-with-a-DFA

Folders and files

Latest commit

History

Repository files navigation

Email Address Recognition with a Deterministic Finite Automaton

Introduction

The Language $L$

Alphabet of $L$

Definition of $L$

$L$ is a Regular Language

Deterministic Finite Automaton(DFA) $M$, for $L$

5-Tuple Definition for $M$

Drawing of $M$

Programmatic Implementation of $M$

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages