By: Brett Balquist
This project was made for HackKu24 and was inspired by a talk given by NYU distinguished professor of mathematics Percy Deift about iterative algorithms and random data.
- As part of this project I created my own data structure which resembled DNA.
- The DNA class had inner classes of a right_backbone and left_backbone which consisted of quadruply linked sentinels which resembled the sugar phosphate backbone
- It also had AT and CG classes which consisted of quadruply linked nodes which resembled the Adenine and Thymine, and Cytosine and Guanine bases respectively.
- This made traversing, crossover, and mutation quite easy to code as a result of this data strucutre.
-
Crossover
- This function crosses two parents DNA and produces two children. This function occurs in nature during meiosis when chromosomes are lined up, which in this case are the
$\mathbf{AT}$ and$\mathbf{CG}$ vectors respectively.
- This function crosses two parents DNA and produces two children. This function occurs in nature during meiosis when chromosomes are lined up, which in this case are the
-
Intialize Population
- This creates a population with a predetermined population size populated with DNA strands.
-
Selection
- Selection takes the top 75% most fit of the population and creates a new population with them. This ensures that each next population is generally more fit than the previous generation.
-
Mutation
- Mutation uses a standard mutation rate of 5% or a random rate if smaller against random values to determine the percentage of the population that is mutated. If random.random() generates a number less than mutation_rate, the condition evaluates to True, indicating that a mutation should occur. If random.random() generates a number greater than or equal to mutation_rate, the condition evaluates to False, indicating that no mutation should occur. The mutation then performs a multiplication of the curent entry against a value between 0.1 and 1.0 which will keep it inclusive within the sin domain.
-
Evaluating Fitness
- Evaluating fitness stores the inner product value with the associated strand
Fitness is the function that you want to optuimize for. In stochastic algorithms this can mean a number of thngs including solving a maze, maximizing your ability to perform
a task based on certain parameters. In this scenario we are optimizing for inner product which in this scenario is the angle between two vectors,
Generall, the inner product is any calculation on a vector space s.t.
The inner product between two vectors
In this project
Where:
-
$\mathbf{x} \cdot \mathbf{y}$ denotes the dot product (or inner product) between vectors$\mathbf{x}$ and$\mathbf{y}$ . -
$|\mathbf{x}|$ and$|\mathbf{y}|$ represent the norms (or magnitudes) of vectors$\mathbf{x}$ and$\mathbf{y}$ , respectively.
The dot product
Where
The norms
This was implemented in this project by using angles closer to 0 to signify a relatedness in vectors that was positive.
This section utilizes all the previous methods and the DNA data structure to construct a genetic algorithm that uses generations to tend towards a more fit population.
The length of a DNA strand, the size of the population, the number of generations, and the mutation rate are all passed as parameters. From here an intial population is generated and then and then from that the fitness of each strand is evaluated and parents are generated. Next, the parents are bread to produce two children and the children are possibly mutated and then appended to the new population. From this new population we repreat this until the end of the selected number of generations.
Over this time the population grows more fit and the most fit is then selected from the final population completing the genetic algorithm.
My understanding of this topic has much room to improve and so could the data structure and algorithm that I wrote. This could be used in a number of ways including finding approximates for eigenvalues, maximizing data based on certain key metrics while retaining less information, or optimizing other vector metrics.