GAFeatureSelector
- class mlproject.training.feature_selection.GAFeatureSelector(X, y, model, num_features=25, population_size=50, generations=100, feature_names=None, cxpb=0.5, mutpb=0.2, early_stop_patience=5, mutation_boost_threshold=3, mutation_boost_factor=2.0, cv=5, scoring='r2', min_diversity=5, entropy_threshold=0.3, X_test=None, y_test=None, test_scoring=None, n_jobs=1, return_train_score=True, error_score=np.nan, sissopp_binary_path=None, mpi_tasks=8, sissopp_inputs=None)
Bases: object
Genetic Algorithm Feature Selector using DEAP.
- Parameters:
X (pd.DataFrame or np.ndarray) – Feature matrix.
y (pd.Series or np.ndarray) – Target values.
model (sklearn-like estimator) – Model to evaluate feature subsets.
num_features (int, default=25) – Number of features to select.
population_size (int, default=50) – Size of the GA population.
generations (int, default=100) – Number of generations to run.
feature_names (list of str, optional) – Names of the features. If None, indices will be used.
cxpb (float, default=0.5) – Crossover probability.
mutpb (float, default=0.2) – Mutation probability.
early_stop_patience (int, default=5) – Generations to wait for improvement before stopping.
mutation_boost_threshold (int, default=3) – Generations without improvement to trigger mutation boost.
mutation_boost_factor (float, default=2.0) – Factor to increase mutation probability when boosting.
cv (int, default=5) – Number of cross-validation folds.
scoring (str, default="r2") – Scoring metric for evaluation.
min_diversity (int, default=5) – Minimum Hamming distance for diversity.
entropy_threshold (float, default=0.3) – Entropy threshold to trigger mutation boost.
X_test (pd.DataFrame or np.ndarray, optional) – Test feature matrix for diagnostic evaluation.
y_test (pd.Series or np.ndarray, optional) – Test target values for diagnostic evaluation.
test_scoring (str or callable, optional) – Scoring metric for test evaluation. Defaults to scoring.
n_jobs (int, default=1) – Number of parallel jobs for evaluation.
return_train_score (bool, default=True) – Whether to return training scores in cross-validation.
error_score (float, default=np.nan) – Value to assign to failed evaluations.
sissopp_binary_path (str, optional) – Path to SISSO++ binary for model evaluation.
mpi_tasks (int, default=8) – Number of MPI tasks for SISSO++.
sissopp_inputs (dict, optional) – Additional inputs for SISSO++.
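The stagnation and diversity parameters above interact, and a small sketch may help show one plausible reading of them. It is not taken from the class's source. It assumes individuals are binary masks, that min_diversity is compared against the mean pairwise Hamming distance, that entropy_threshold is compared against the mean per-gene Shannon entropy in bits, and that a boost multiplies mutpb by mutation_boost_factor, capped at 1.0. All helper names are hypothetical.

```python
import math
from itertools import combinations


def mean_hamming(population):
    """Mean pairwise Hamming distance between binary masks."""
    pairs = list(combinations(population, 2))
    if not pairs:
        return 0.0
    return sum(sum(a != b for a, b in zip(p, q)) for p, q in pairs) / len(pairs)


def mean_gene_entropy(population):
    """Mean per-gene Shannon entropy in bits (0 = converged, 1 = evenly mixed)."""
    n = len(population)
    total = 0.0
    for gene in zip(*population):
        p = sum(gene) / n
        if 0.0 < p < 1.0:
            total -= p * math.log2(p) + (1 - p) * math.log2(1 - p)
    return total / len(population[0])


def effective_mutpb(mutpb, stagnant, population, *, mutation_boost_threshold=3,
                    mutation_boost_factor=2.0, min_diversity=5,
                    entropy_threshold=0.3):
    """Mutation probability for the next generation (assumed rule, not the class's)."""
    low_diversity = (mean_hamming(population) < min_diversity
                     or mean_gene_entropy(population) < entropy_threshold)
    if stagnant >= mutation_boost_threshold or low_diversity:
        return min(1.0, mutpb * mutation_boost_factor)
    return mutpb


def should_stop(stagnant, early_stop_patience=5):
    """Stop after `early_stop_patience` generations without improvement."""
    return stagnant >= early_stop_patience
```

Under these assumptions, the default mutation_boost_threshold=3 triggers a boost two generations before the default early_stop_patience=5 would end the run, which gives the search a chance to escape a plateau first.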
- differential_evolution_ga(pop, F=0.5, mutpb=0.1)
Differential Evolution GA offspring generation.
- Parameters:
pop (list of Individuals) – Current population.
F (float, default=0.5) – Differential weight.
mutpb (float, default=0.1) – Mutation probability.
- Returns:
New population produced by the DE operation.
- Return type:
list of np.ndarray
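For readers unfamiliar with differential evolution on feature masks, the sketch below shows one common binary analogue of DE/rand/1. It is an illustration only and may differ from what differential_evolution_ga does. The assumptions: individuals are 0/1 lists, a donor's bit is flipped with probability F wherever two other donors disagree, and mutpb is then applied as a per-gene bit-flip probability. The function name de_offspring and the rng argument are hypothetical. The sketch does not repair children to a fixed number of selected features.

```python
import random


def de_offspring(pop, F=0.5, mutpb=0.1, rng=None):
    """Build one child per parent with a DE-style step on 0/1 masks (illustrative)."""
    rng = rng or random.Random()
    if len(pop) < 4:
        raise ValueError("this DE step needs at least 4 individuals")
    children = []
    for i in range(len(pop)):
        # Pick three distinct donors, none of them the current parent.
        others = [j for j in range(len(pop)) if j != i]
        a, b, c = (pop[j] for j in rng.sample(others, 3))
        child = []
        for ga, gb, gc in zip(a, b, c):
            bit = ga
            # Binary stand-in for a + F * (b - c): where b and c disagree,
            # flip the base donor's bit with probability F.
            if gb != gc and rng.random() < F:
                bit ^= 1
            # Independent bit-flip mutation (assumed role of mutpb).
            if rng.random() < mutpb:
                bit ^= 1
            child.append(bit)
        children.append(child)
    return children
```

With F=0 and mutpb=0 each child is an unchanged copy of a randomly chosen donor, which is a quick sanity check on the operator.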
- parse_postfix(model, selected_indices)
Extract the features and operators from a postfix expression string.
- Parameters:
model (SISSO model object) – The SISSO model containing features.
selected_indices (list of int) – Indices of selected features.
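The entry above does not show the expression format or the return value, so the following is only a sketch of the general idea. It assumes a postfix string whose tokens are separated by "|", where integer tokens are feature positions and all other tokens are operator names (for example "0|1|add|2|mult"). It also assumes that positions index into selected_indices and should be mapped back to the original columns. The function name split_postfix is hypothetical and works on a plain string instead of a SISSO model object.

```python
def split_postfix(expr, selected_indices=None, sep="|"):
    """Split a postfix expression into (feature indices, operator names)."""
    features, operators = [], []
    for token in expr.split(sep):
        token = token.strip()
        if not token:
            continue
        if token.isdigit():
            idx = int(token)
            # Assumed: positions refer to the selected subset, so map them
            # back to the original column indices when a mapping is given.
            if selected_indices is not None:
                idx = selected_indices[idx]
            if idx not in features:
                features.append(idx)
        else:
            operators.append(token)
    return features, operators
```

Features are returned in order of first appearance with duplicates removed, while operators keep their order and repetitions.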
- run(plot=True, strategy='standard')
Run the Genetic Algorithm for feature selection.
- Parameters:
plot (bool, default=True) – Whether to plot the fitness history.
strategy (str, default="standard") – Offspring generation strategy: "standard" or "de" (differential evolution).
- Returns:
Selected feature names.
- Return type:
list of str
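To make the overall flow concrete, here is a toy version of a "standard" generational loop with early stopping, written in plain Python. It is a sketch, not the class's implementation: the real selector uses DEAP and cross-validated model scores, whereas this example takes an arbitrary fitness callable. The tournament size of 3, one-point crossover, single bit-flip mutation, and the names run_toy_ga and selected_names are all assumptions made for illustration.

```python
import random


def run_toy_ga(fitness, n_genes, population_size=20, generations=50,
               cxpb=0.5, mutpb=0.2, early_stop_patience=5, seed=0):
    """Maximize `fitness` over 0/1 masks; return (best mask, best-so-far history)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)]
           for _ in range(population_size)]
    best = max(pop, key=fitness)[:]
    history, stagnant = [fitness(best)], 0
    for _ in range(generations):
        # Tournament selection (size 3), copying the winners.
        offspring = [max(rng.sample(pop, 3), key=fitness)[:]
                     for _ in range(population_size)]
        # One-point crossover on consecutive pairs with probability cxpb.
        for a, b in zip(offspring[::2], offspring[1::2]):
            if rng.random() < cxpb:
                cut = rng.randint(1, n_genes - 1)
                a[cut:], b[cut:] = b[cut:], a[cut:]
        # Flip one random bit of an individual with probability mutpb.
        for ind in offspring:
            if rng.random() < mutpb:
                ind[rng.randrange(n_genes)] ^= 1
        pop = offspring
        challenger = max(pop, key=fitness)
        if fitness(challenger) > fitness(best):
            best, stagnant = challenger[:], 0
        else:
            stagnant += 1
        history.append(fitness(best))
        # Early stopping after too many generations without improvement.
        if stagnant >= early_stop_patience:
            break
    return best, history


def selected_names(mask, feature_names):
    """Translate a 0/1 mask into the list of selected feature names."""
    return [name for name, bit in zip(feature_names, mask) if bit]
```

The history list plays the role of the fitness history that plot=True would draw, and selected_names mirrors the documented return type of a list of feature names.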