get_relevant_features

mlproject.training.feature_selection.get_relevant_features(X_train, y_train, const_feat_tol=0.95, collinearity_tol=0.9, grootcv_nfolds=5, grootcv_n_iter=20, grootcv_lgbm_objective='mae', **pipeline_kwargs)

Build and apply a feature selection pipeline to remove correlated and irrelevant features.

The pipeline applies the following steps:

DropConstantFeatures: Removes constant and near-constant features, i.e. features dominated by a single value.
SmartCorrelatedSelection: Removes highly correlated features based on Pearson correlation.
GrootCV: Selects relevant features using cross-validation with a LightGBM-based model.
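
To illustrate the idea behind the correlation step, here is a minimal pure-Python sketch. It is not the code used by SmartCorrelatedSelection or by this function: the helper names `pearson` and `drop_correlated` are hypothetical, and the rule for choosing which feature of a correlated pair survives (keep the first one encountered) is an assumption made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column; treat it as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Return the names of the columns to keep.

    Assumption for this sketch: columns are visited in order, and a column is
    dropped when its absolute correlation with an already kept column exceeds
    the threshold.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly 2 * a, so strongly correlated with a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly correlated with a
}
kept = drop_correlated(columns, threshold=0.9)
print(kept)
```

With this toy data, "b" is dropped because it is almost a linear function of "a", while "c" is kept.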

Parameters:

X_train (pd.DataFrame | np.ndarray) – Training feature matrix.
y_train (np.ndarray) – Target values corresponding to X_train, as a 1D NumPy array.
const_feat_tol (float, default=0.95) – Threshold for removing near-constant features. A feature is removed if a single value accounts for at least this proportion of observations.
collinearity_tol (float, default=0.9) – Pearson correlation threshold above which features are considered highly correlated and become candidates for removal.
grootcv_nfolds (int, default=5) – Number of folds for cross-validation in GrootCV.
grootcv_n_iter (int, default=20) – Number of iterations for feature selection in GrootCV.
grootcv_lgbm_objective (str, default="mae") – Objective function for the LightGBM model inside GrootCV.
**pipeline_kwargs (dict) – Additional keyword arguments passed to the underlying Pipeline.
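
The const_feat_tol rule described above can be sketched in a few lines of plain Python. This is only an illustration of the threshold semantics, not the implementation used by DropConstantFeatures; the helper names `dominant_share` and `near_constant_features` are hypothetical.

```python
from collections import Counter


def dominant_share(values):
    """Fraction of observations taken by the most frequent value."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)


def near_constant_features(columns, tol=0.95):
    """Names of columns whose most frequent value covers at least `tol` of the rows."""
    return [name for name, values in columns.items() if dominant_share(values) >= tol]


columns = {
    "flag": [0] * 19 + [1],            # 0 appears in 19 of 20 rows (95%)
    "mostly_zero": [0] * 18 + [1, 2],  # 0 appears in 18 of 20 rows (90%)
    "varied": list(range(20)),         # every value appears exactly once
}
to_drop = near_constant_features(columns, tol=0.95)
print(to_drop)
```

Under the default tolerance of 0.95, only "flag" would be flagged here; lowering the tolerance to 0.9 would also flag "mostly_zero".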

Returns:

The Pipeline instance and the transformed training set, with only the relevant features retained.

Return type:

tuple[Pipeline, pd.DataFrame]
