Genetic Algorithm Optimized Feature Selection for Random Forest Classifier

This project implements a genetic algorithm to optimize feature selection for a Random Forest classifier using a diabetes dataset. The goal is to enhance the accuracy of the classifier by selecting the most relevant features.

Introduction

Feature selection is a critical step in building a machine learning model, as it helps improve the model's performance by reducing overfitting, improving accuracy, and reducing training time. This project leverages a genetic algorithm to select the optimal subset of features for a Random Forest classifier to predict diabetes outcomes.

Dataset

The dataset used in this project is the Diabetes dataset, which contains the following columns:

Pregnancies
Glucose
BloodPressure
SkinThickness
Insulin
BMI
DiabetesPedigreeFunction
Age
Outcome

The target variable is Outcome, indicating whether a patient has diabetes.

Installation

To run this project, you need to have Python installed along with the following libraries:

numpy
pandas
scikit-learn
joblib

You can install the required libraries using pip:

pip install numpy pandas scikit-learn joblib

Usage

1.Clone this repository:

git clone https://github.com/yourusername/genetic-algorithm-feature-selection.git
cd genetic-algorithm-feature-selection

2.Ensure you have the diabetes.csv file in the same directory as the script.

3.Run the script:

python genetic_algorithm_feature_selection.py

Project Structure

genetic_algorithm_feature_selection.py: Main script containing the implementation of the genetic algorithm and the Random Forest classifier.

Functions Explained

calculate_fitness_v2
Calculates the fitness of an individual in the population. Fitness is measured as the accuracy of the Random Forest classifier using the selected features.

def calculate_fitness_v2(individual, X_train, X_test, y_train, y_test):
    X_train_selected = X_train.iloc[:, individual == 1]
    X_test_selected = X_test.iloc[:, individual == 1]

    model = RandomForestClassifier(random_state=48)

    model.fit(X_train_selected, y_train)

    y_pred = model.predict(X_test_selected)

    accuracy = accuracy_score(y_test, y_pred)

    return accuracy

parallel_fitness
Evaluates the fitness of the entire population in parallel to speed up the process.

def parallel_fitness(population, X_train, X_test, y_train, y_test):
    fitness_values = Parallel(n_jobs=-1)(delayed(calculate_fitness_v2)(individual, X_train, X_test, y_train, y_test) for individual in population)
    return np.array(fitness_values)

genetic_algorithm_optimized
Main function to run the genetic algorithm. It iterates through the specified number of runs, performing selection, crossover, and mutation to evolve the population.

def genetic_algorithm_optimized(X_train, X_test, y_train, y_test, population_size, crossover_probability, mutation_probability, tournament_size, n_runs):
    population = np.random.randint(2, size=(population_size, X_train.shape[1]))

    for n_run in n_runs:
        for run in range(n_run):
            fitness_values = parallel_fitness(population, X_train, X_test, y_train, y_test)

            parents = tournament_selection(population, fitness_values, tournament_size)

            children = crossover(parents, crossover_probability)

            children = mutation(children, mutation_probability)

            new_population = np.vstack((population, children))

            new_fitness_values = parallel_fitness(new_population, X_train, X_test, y_train, y_test)

            best_individuals = fitness_based_selection(new_population, new_fitness_values, population_size)

            population = new_population[best_individuals]

        best_individual_index = np.argmax(fitness_values)
        best_individual = population[best_individual_index]
        best_accuracy = fitness_values[best_individual_index]

        print(f"Run {n_run}: Best individual: {best_individual}, Accuracy: {best_accuracy}")

tournament_selection
Selects parents for the next generation using tournament selection.

def tournament_selection(population, fitness_values, tournament_size):
    parents = []

    for _ in range(len(population)):
        tournament_indices = np.random.choice(len(population), tournament_size, replace=False)
        tournament_fitness_values = fitness_values[tournament_indices]
        best_parent_index = tournament_indices[np.argmax(tournament_fitness_values)]
        parents.append(population[best_parent_index])

    return np.array(parents)

crossover
Performs crossover between pairs of parents to produce children.

def crossover(parents, crossover_probability):
    children = []

    for i in range(0, len(parents), 2):
        parent1 = parents[i]
        parent2 = parents[i + 1]

        if np.random.rand() < crossover_probability:
            point = np.random.randint(len(parent1))
            child1 = np.hstack((parent1[:point], parent2[point:]))
            child2 = np.hstack((parent2[:point], parent1[point:]))
        else:
            child1, child2 = parent1.copy(), parent2.copy()

        children.append(child1)
        children.append(child2)

    return np.array(children)

mutation
Mutates the children based on the mutation probability.

def mutation(children, mutation_probability):
    for i in range(len(children)):
        for j in range(len(children[i])):
            if np.random.rand() < mutation_probability:
                children[i][j] = 1 - children[i][j]

    return children

fitness_based_selection
Selects the best individuals from the combined population of parents and children based on their fitness values.

def fitness_based_selection(population, fitness_values, new_population_size):
    best_individuals_index = np.argsort(fitness_values)[-new_population_size:]
    return best_individuals_index

Results

The output of the script provides the best individual (feature subset) and the corresponding accuracy for different numbers of runs:

Run 100: Best individual: [1 1 1 0 0 1 1 1], Accuracy: 0.7792207792207793
Run 500: Best individual: [1 1 1 0 1 1 1 1], Accuracy: 0.7792207792207793
Run 1000: Best individual: [1 1 1 0 1 1 1 1], Accuracy: 0.7792207792207793
Run 2000: Best individual: [1 1 1 0 0 1 1 1], Accuracy: 0.7792207792207793

Due to "Run 2000"

Pregnancies
Glucose
BloodPressure
BMI
DiabetesPedigreeFunction
Age
These features are selected by genetic algorithm.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request or open an Issue.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.gitattributes		.gitattributes
Feature Selection.ipynb		Feature Selection.ipynb
README.md		README.md
diabetes.csv		diabetes.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Genetic Algorithm Optimized Feature Selection for Random Forest Classifier

Table of Contents

Introduction

Dataset

Installation

Usage

Project Structure

Functions Explained

Results

Contributing

About

Releases

Packages

Languages

erenuito/Genetic-Algorithm-Feature-Selection

Folders and files

Latest commit

History

Repository files navigation

Genetic Algorithm Optimized Feature Selection for Random Forest Classifier

Table of Contents

Introduction

Dataset

Installation

Usage

Project Structure

Functions Explained

Results

Contributing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages