GAFeatureSelector


class mlproject.training.feature_selection.GAFeatureSelector(X, y, model, num_features=25, population_size=50, generations=100, feature_names=None, cxpb=0.5, mutpb=0.2, early_stop_patience=5, mutation_boost_threshold=3, mutation_boost_factor=2.0, cv=5, scoring='r2', min_diversity=5, entropy_threshold=0.3, X_test=None, y_test=None, test_scoring=None, n_jobs=1, return_train_score=True, error_score=np.nan, sissopp_binary_path=None, mpi_tasks=8, sissopp_inputs=None)

Bases: object

Genetic Algorithm Feature Selector using DEAP.

Parameters:

X (pd.DataFrame or np.ndarray) – Feature matrix.
y (pd.Series or np.ndarray) – Target values.
model (sklearn-like estimator) – Model to evaluate feature subsets.
num_features (int, default=25) – Number of features to select.
population_size (int, default=50) – Size of the GA population.
generations (int, default=100) – Number of generations to run.
feature_names (list of str, optional) – Names of the features. If None, indices will be used.
cxpb (float, default=0.5) – Crossover probability.
mutpb (float, default=0.2) – Mutation probability.
early_stop_patience (int, default=5) – Generations to wait for improvement before stopping.
mutation_boost_threshold (int, default=3) – Generations without improvement to trigger mutation boost.
mutation_boost_factor (float, default=2.0) – Factor to increase mutation probability when boosting.
cv (int, default=5) – Number of cross-validation folds.
scoring (str, default="r2") – Scoring metric for evaluation.
min_diversity (int, default=5) – Minimum Hamming distance for diversity.
entropy_threshold (float, default=0.3) – Entropy threshold to trigger mutation boost.
X_test (pd.DataFrame or np.ndarray, optional) – Test feature matrix for diagnostic evaluation.
y_test (pd.Series or np.ndarray, optional) – Test target values for diagnostic evaluation.
test_scoring (str or callable, optional) – Scoring metric for test evaluation. Defaults to scoring.
n_jobs (int, default=1) – Number of parallel jobs for evaluation.
return_train_score (bool, default=True) – Whether to return training scores in cross-validation.
error_score (float, default=np.nan) – Value to assign to failed evaluations.
sissopp_binary_path (str, optional) – Path to SISSO++ binary for model evaluation.
mpi_tasks (int, default=8) – Number of MPI tasks for SISSO++.
sissopp_inputs (dict, optional) – Additional inputs for SISSO++.
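
The min_diversity and entropy_threshold parameters refer to population-level statistics. The page does not define how they are computed, so the following is only a sketch of one common reading: individuals are assumed to be equal-length bit lists, diversity is taken as the mean pairwise Hamming distance, and entropy as the mean per-position Shannon entropy. The helper names and the final trigger rule are hypothetical and not part of GAFeatureSelector.

```python
import math
from itertools import combinations


def hamming(a, b):
    """Count positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_hamming(pop):
    """Average Hamming distance over all pairs of individuals."""
    pairs = list(combinations(pop, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def mean_bit_entropy(pop):
    """Average per-position Shannon entropy (in bits) across the population."""
    n = len(pop)
    total = 0.0
    for column in zip(*pop):
        p = sum(column) / n
        # Positions where every individual agrees contribute zero entropy.
        if 0.0 < p < 1.0:
            total -= p * math.log2(p) + (1 - p) * math.log2(1 - p)
    return total / len(pop[0])


population = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
diversity = mean_pairwise_hamming(population)
entropy = mean_bit_entropy(population)

# Hypothetical trigger using the documented defaults (min_diversity=5,
# entropy_threshold=0.3): boost mutation when the population looks converged.
needs_boost = diversity < 5 or entropy < 0.3
```

With realistic feature counts the Hamming distances are much larger than in this toy population, which is why a threshold such as 5 is only meaningful relative to the number of candidate features.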

evaluate_individual(individual)

Evaluate an individual using cross-validation.

evaluate_on_test(individual)

Evaluate the best individual on the test set.
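
As a rough illustration of what a cross-validated fitness evaluation involves, the sketch below turns an individual into column indices, averages per-fold scores, and falls back to an error score on failure. It assumes the individual is a binary mask and uses a hypothetical score_fold callable in place of the real estimator and scorer; it is not the class's implementation. The one library fact relied on is that DEAP expects an evaluation function to return a tuple, even for a single objective.

```python
import math
from statistics import mean


def evaluate_mask(mask, score_fold, n_folds=5, error_score=float("nan")):
    """Return a one-element fitness tuple for a binary feature mask.

    `score_fold(columns, fold)` is a hypothetical callable that returns the
    score of one cross-validation fold using only the given columns.
    """
    columns = [i for i, bit in enumerate(mask) if bit]
    if not columns:
        # No features selected: nothing to fit, so report the error score.
        return (error_score,)
    try:
        scores = [score_fold(columns, fold) for fold in range(n_folds)]
    except Exception:
        # A failed fit maps to error_score instead of aborting the search.
        return (error_score,)
    return (mean(scores),)


def toy_score(columns, fold):
    """Stand-in scorer: rewards column 0 and penalises subset size."""
    return (1.0 if 0 in columns else 0.5) - 0.01 * len(columns)


fitness = evaluate_mask([1, 0, 1, 0], toy_score)
empty = evaluate_mask([0, 0, 0, 0], toy_score)
```

Returning error_score for a failed evaluation mirrors the role the error_score parameter plays in scikit-learn's cross-validation helpers; whether the class also guards against empty subsets is an assumption of this sketch.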

differential_evolution_ga(pop, F=0.5, mutpb=0.1)

Differential Evolution GA offspring generation.

Parameters:

pop (list of Individuals) – Current population.
F (float, default=0.5) – Differential weight.
mutpb (float, default=0.1) – Mutation probability.

Returns:

New population produced by the DE operation.

Return type:

list of np.ndarray
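
The exact operator used by differential_evolution_ga is not specified here. The sketch below shows one way a DE-style update can be adapted to bit vectors, under these assumptions: individuals are plain lists of 0/1, the mutant value a + F * (b - c) is clipped to [0, 1] and treated as the probability of setting the bit, and mutpb is applied as an independent bit flip. Crossover with the target and greedy selection, which classic DE includes, are omitted, and de_offspring is a hypothetical name.

```python
import random


def de_offspring(pop, F=0.5, mutpb=0.1, rng=None):
    """Build one trial bit vector per individual with a DE-style update."""
    if len(pop) < 4:
        raise ValueError("need at least 4 individuals to pick 3 distinct donors")
    rng = rng or random.Random()
    offspring = []
    for i, target in enumerate(pop):
        # Pick three distinct donors, none of which is the target itself.
        donors = [ind for j, ind in enumerate(pop) if j != i]
        a, b, c = rng.sample(donors, 3)
        trial = []
        for j in range(len(target)):
            # Classic DE mutant: base vector plus scaled difference.
            v = a[j] + F * (b[j] - c[j])
            # Clip to [0, 1] and use the value as the probability of a 1.
            p = min(1.0, max(0.0, v))
            bit = 1 if rng.random() < p else 0
            # Independent bit-flip mutation with probability mutpb.
            if rng.random() < mutpb:
                bit = 1 - bit
            trial.append(bit)
        offspring.append(trial)
    return offspring


rng = random.Random(0)
parents = [[rng.randint(0, 1) for _ in range(6)] for _ in range(5)]
children = de_offspring(parents, F=0.5, mutpb=0.1, rng=rng)
```

Treating the clipped mutant value as a probability keeps the update symmetric when F=0.5, where a hard threshold at 0.5 would bias bits toward one value.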

parse_postfix(model, selected_indices)

Extract features and operators from a postfix expression string.

Parameters:

model (SISSO model object) – The SISSO model containing features.
selected_indices (list of int) – Indices of selected features.
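
To make the idea concrete, here is a minimal sketch of splitting a postfix expression into feature references and operators. It assumes a "|"-separated token string in which integer tokens are positions within the reduced feature matrix and all other tokens are operator names; the actual format produced by the SISSO model object may differ. Both function names are hypothetical. Mapping positions back through selected_indices is a guess at why that argument is needed.

```python
def split_postfix(expr, sep="|"):
    """Split a postfix expression into feature positions and operator names."""
    features = []
    operators = []
    for token in expr.split(sep):
        if not token:
            continue
        if token.isdigit():
            position = int(token)
            # Keep each feature once, in order of first appearance.
            if position not in features:
                features.append(position)
        else:
            operators.append(token)
    return features, operators


def to_original_columns(positions, selected_indices):
    """Map positions in the reduced matrix back to original column indices."""
    return [selected_indices[p] for p in positions]


positions, operators = split_postfix("0|3|add|sq")
columns = to_original_columns(positions, selected_indices=[4, 7, 9, 12])
```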

run(plot=True, strategy='standard')

Run the Genetic Algorithm for feature selection.

Parameters:

plot (bool, default=True) – Whether to plot the fitness history.
strategy (str, default="standard") – Offspring generation strategy: "standard" or "de" (differential evolution).

Returns:

Selected feature names.

Return type:

list of str
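
The constructor parameters early_stop_patience, mutation_boost_threshold, and mutation_boost_factor suggest that the run loop tracks how many generations have passed without improvement. The sketch below replays a best-fitness history to show how such counters could interact. It assumes that a higher score is better, that the boost multiplies mutpb and is capped at 1.0, and that the counter resets on any improvement; plan_generations is a hypothetical helper, not a method of the class.

```python
def plan_generations(best_per_gen, mutpb=0.2, early_stop_patience=5,
                     mutation_boost_threshold=3, mutation_boost_factor=2.0):
    """Replay a best-fitness history and report the mutation rate used.

    Returns (rates, stopped_at), where rates holds the mutation probability
    for each generation that runs and stopped_at is the generation index at
    which early stopping fires, or None if it never does.
    """
    best = float("-inf")
    stale = 0  # generations since the last improvement
    rates = []
    for gen, score in enumerate(best_per_gen):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
        if stale >= early_stop_patience:
            return rates, gen
        boosted = stale >= mutation_boost_threshold
        rates.append(min(1.0, mutpb * mutation_boost_factor) if boosted else mutpb)
    return rates, None


history = [0.50, 0.55, 0.55, 0.55, 0.55, 0.60, 0.60, 0.60, 0.60, 0.60, 0.60]
rates, stopped_at = plan_generations(history)
```

With the defaults, the boost threshold (3) is lower than the patience (5), so the search gets two boosted generations to escape a plateau before it stops.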

plot_fitness(strategy)

Plot the fitness history of the GA.

Parameters:

strategy (str) – Offspring generation strategy used.

Return type:

None

get_selected_feature_indices()

Get indices of selected features from the best individual.

Returns:

Indices of selected features.

Return type:

list of int
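
The relationship between the best individual, the selected indices, and the feature names returned by run can be sketched as follows. This assumes the best individual is a binary mask over columns and that, when feature_names is None, indices are reported as strings; both points are assumptions, and the function names are hypothetical.

```python
def mask_to_indices(mask):
    """Return the positions of the bits that are set in a binary mask."""
    return [i for i, bit in enumerate(mask) if bit]


def mask_to_names(mask, feature_names=None):
    """Return feature names for the set bits, falling back to indices."""
    indices = mask_to_indices(mask)
    if feature_names is None:
        # Assumed fallback: use the column index itself as the name.
        return [str(i) for i in indices]
    return [feature_names[i] for i in indices]


best = [0, 1, 1, 0, 1]
indices = mask_to_indices(best)
names = mask_to_names(best, ["a", "b", "c", "d", "e"])
unnamed = mask_to_names(best)
```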
