GAFeatureSelector

class mlproject.training.feature_selection.GAFeatureSelector(X, y, model, num_features=25, population_size=50, generations=100, feature_names=None, cxpb=0.5, mutpb=0.2, early_stop_patience=5, mutation_boost_threshold=3, mutation_boost_factor=2.0, cv=5, scoring='r2', min_diversity=5, entropy_threshold=0.3, X_test=None, y_test=None, test_scoring=None, n_jobs=1, return_train_score=True, error_score=np.nan, sissopp_binary_path=None, mpi_tasks=8, sissopp_inputs=None)

Bases: object

Genetic Algorithm Feature Selector using DEAP.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Feature matrix.

  • y (pd.Series or np.ndarray) – Target values.

  • model (sklearn-like estimator) – Model to evaluate feature subsets.

  • num_features (int, default=25) – Number of features to select.

  • population_size (int, default=50) – Size of the GA population.

  • generations (int, default=100) – Number of generations to run.

  • feature_names (list of str, optional) – Names of the features. If None, indices will be used.

  • cxpb (float, default=0.5) – Crossover probability.

  • mutpb (float, default=0.2) – Mutation probability.

  • early_stop_patience (int, default=5) – Number of generations without improvement to wait before stopping early.

  • mutation_boost_threshold (int, default=3) – Number of generations without improvement that triggers a mutation boost.

  • mutation_boost_factor (float, default=2.0) – Factor to increase mutation probability when boosting.

  • cv (int, default=5) – Number of cross-validation folds.

  • scoring (str, default="r2") – Scoring metric for evaluation.

  • min_diversity (int, default=5) – Minimum Hamming distance for diversity.

  • entropy_threshold (float, default=0.3) – Entropy threshold to trigger mutation boost.

  • X_test (pd.DataFrame or np.ndarray, optional) – Test feature matrix for diagnostic evaluation.

  • y_test (pd.Series or np.ndarray, optional) – Test target values for diagnostic evaluation.

  • test_scoring (str or callable, optional) – Scoring metric for test evaluation. Defaults to scoring.

  • n_jobs (int, default=1) – Number of parallel jobs for evaluation.

  • return_train_score (bool, default=True) – Whether to return training scores in cross-validation.

  • error_score (float, default=np.nan) – Value to assign to failed evaluations.

  • sissopp_binary_path (str, optional) – Path to SISSO++ binary for model evaluation.

  • mpi_tasks (int, default=8) – Number of MPI tasks for SISSO++.

  • sissopp_inputs (dict, optional) – Additional inputs for SISSO++.
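The stagnation and diversity parameters above can be read together as a small control rule. The following is a minimal sketch of one plausible interpretation, not the class's actual logic: the helper names `population_entropy` and `adapt_mutation` are hypothetical, individuals are assumed to be 0/1 lists, and the boost is assumed to multiply `mutpb` by `mutation_boost_factor` (capped at 1.0).

```python
import math


def population_entropy(population):
    """Mean per-gene Shannon entropy (in bits) of a population of 0/1 lists."""
    n = len(population)
    n_genes = len(population[0])
    total = 0.0
    for g in range(n_genes):
        p = sum(ind[g] for ind in population) / n  # fraction of ones at gene g
        for q in (p, 1.0 - p):
            if q > 0.0:
                total -= q * math.log2(q)
    return total / n_genes


def adapt_mutation(mutpb, stagnant_gens, entropy, *,
                   mutation_boost_threshold=3, mutation_boost_factor=2.0,
                   entropy_threshold=0.3, early_stop_patience=5):
    """Return (effective_mutpb, should_stop) under the assumed rule."""
    # Assumption: stop once the best fitness has stalled for `early_stop_patience` generations.
    should_stop = stagnant_gens >= early_stop_patience
    # Assumption: boost on stagnation OR when the population has lost diversity.
    boost = stagnant_gens >= mutation_boost_threshold or entropy < entropy_threshold
    effective = min(1.0, mutpb * mutation_boost_factor) if boost else mutpb
    return effective, should_stop
```

With the defaults, three stagnant generations would raise a `mutpb` of 0.2 to 0.4, and five would additionally signal an early stop.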

evaluate_individual(individual)

Evaluate an individual using cross-validation.
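As a rough illustration of what cross-validated evaluation of a feature subset can look like, the sketch below scores a 0/1 mask with scikit-learn's `cross_validate`. It assumes a binary mask encoding and a mean-of-folds fitness, neither of which this page confirms; `score_mask` is a hypothetical helper, not a method of the class.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate


def score_mask(mask, X, y, model, cv=5, scoring="r2", n_jobs=1,
               return_train_score=True, error_score=np.nan):
    """Mean cross-validated test score of `model` on the columns where mask == 1."""
    cols = np.flatnonzero(np.asarray(mask))
    if cols.size == 0:
        # Assumption: an empty subset gets the worst possible fitness.
        return float("-inf")
    result = cross_validate(
        model, np.asarray(X)[:, cols], y,
        cv=cv, scoring=scoring, n_jobs=n_jobs,
        return_train_score=return_train_score, error_score=error_score,
    )
    return float(np.mean(result["test_score"]))


# Synthetic data: only the first two columns carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

informative = score_mask([1, 1, 0, 0], X, y, LinearRegression())
noise_only = score_mask([0, 0, 1, 1], X, y, LinearRegression())
```

On this synthetic data the mask covering the two informative columns should score far higher than the mask covering only noise columns, which is the signal the GA would exploit.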

evaluate_on_test(individual)

Evaluate the given individual (typically the best one) on the test set (X_test, y_test), scored with test_scoring.

differential_evolution_ga(pop, F=0.5, mutpb=0.1)

Generate offspring using differential evolution (DE).

Parameters:
  • pop (list of Individuals) – Current population.

  • F (float, default=0.5) – Differential weight.

  • mutpb (float, default=0.1) – Mutation probability.

Returns:

New population produced by the DE operation.

Return type:

list of np.ndarray
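For readers unfamiliar with DE on binary chromosomes, here is a minimal sketch of one common binary variant. It is an assumption about the general technique, not this method's implementation: `de_offspring` and its `rng` argument are hypothetical, individuals are plain 0/1 lists rather than np.ndarray, and no constraint on the number of selected features is enforced.

```python
import random


def de_offspring(pop, F=0.5, mutpb=0.1, rng=None):
    """Build one child per individual from three other, randomly chosen individuals."""
    if len(pop) < 4:
        raise ValueError("binary DE needs at least 4 individuals")
    if rng is None:
        rng = random.Random()
    offspring = []
    for i in range(len(pop)):
        others = [j for j in range(len(pop)) if j != i]
        a, b, c = (pop[j] for j in rng.sample(others, 3))
        child = []
        for ga, gb, gc in zip(a, b, c):
            gene = ga
            # Where b and c disagree, flip the base gene with probability F
            # (a binary stand-in for a + F * (b - c)).
            if gb != gc and rng.random() < F:
                gene = 1 - gene
            # Independent bit-flip mutation with probability mutpb.
            if rng.random() < mutpb:
                gene = 1 - gene
            child.append(gene)
        offspring.append(child)
    return offspring
```

In this variant a fully converged population produces unchanged children when `mutpb` is 0, because no two donors ever disagree; that is one reason a mutation term is usually kept alongside the differential step.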

parse_postfix(model, selected_indices)

Extract features and operators from a postfix expression string.

Parameters:
  • model (SISSO model object) – The SISSO model containing features.

  • selected_indices (list of int) – Indices of selected features.
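The sketch below shows one way such a postfix string could be split into feature indices and operator names. The string format is an assumption: tokens are taken to be separated by `|`, with integer tokens naming primary features and every other token naming an operator. The helpers `split_postfix` and `to_original_indices` are hypothetical, and the mapping through `selected_indices` is a guess at why that argument is passed.

```python
def split_postfix(expr, sep="|"):
    """Split a postfix expression into (feature_indices, operator_names)."""
    features, operators = [], []
    for token in expr.split(sep):
        token = token.strip()
        if not token:
            continue
        if token.isdigit():
            features.append(int(token))  # operand: index of a primary feature
        else:
            operators.append(token)      # anything else is treated as an operator
    return features, operators


def to_original_indices(local_features, selected_indices):
    """Map indices within the selected subset back to columns of the full matrix."""
    return [selected_indices[k] for k in local_features]


features, operators = split_postfix("0|1|add|sq")
original = to_original_indices(features, [7, 12, 30])
```

Here `features` is `[0, 1]`, `operators` is `["add", "sq"]`, and `original` is `[7, 12]`.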

run(plot=True, strategy='standard')

Run the Genetic Algorithm for feature selection.

Parameters:
  • plot (bool, default=True) – Whether to plot the fitness history.

  • strategy (str, default="standard") – Offspring generation strategy: "standard" or "de" (differential evolution).

Returns:

Selected feature names.

Return type:

list of str

plot_fitness(strategy)

Plot the fitness history of the GA.

Parameters:

strategy (str) – Offspring generation strategy used.

Return type:

None

get_selected_feature_indices()

Get indices of selected features from the best individual.

Returns:

Indices of selected features.

Return type:

list of int
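If the best individual is stored as a 0/1 mask (an assumption; this page does not state the encoding), recovering indices and names reduces to the sketch below. Both helpers are hypothetical, and falling back to stringified indices when `feature_names` is None is one reading of "indices will be used".

```python
def mask_to_indices(individual):
    """Positions of the genes set to 1."""
    return [i for i, bit in enumerate(individual) if bit]


def indices_to_names(indices, feature_names=None):
    """Look up names for the selected indices, falling back to the indices themselves."""
    if feature_names is None:
        return [str(i) for i in indices]
    return [feature_names[i] for i in indices]


best = [0, 1, 1, 0, 1]
idx = mask_to_indices(best)
names = indices_to_names(idx, ["a", "b", "c", "d", "e"])
```

For this mask, `idx` is `[1, 2, 4]` and `names` is `["b", "c", "e"]`.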