get_relevant_features

mlproject.training.feature_selection.get_relevant_features(X_train, y_train, const_feat_tol=0.95, collinearity_tol=0.9, grootcv_nfolds=5, grootcv_n_iter=20, grootcv_lgbm_objective='mae', **pipeline_kwargs)

Build and apply a feature selection pipeline to remove correlated and irrelevant features.

The pipeline applies the following steps:

  1. DropConstantFeatures: Removes constant and near-constant features, i.e. those in which a single value accounts for at least const_feat_tol of the observations.

  2. SmartCorrelatedSelection: Removes highly correlated features, i.e. those whose pairwise Pearson correlation exceeds collinearity_tol.

  3. GrootCV: Selects relevant features using cross-validation with a LightGBM-based model.
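
To make the second step concrete, here is a minimal, dependency-free sketch of correlation-based pruning. It is not the SmartCorrelatedSelection implementation: the helper names `pearson` and `drop_correlated` are hypothetical, and the sketch assumes the first feature seen in each correlated group is the one kept, whereas the real transformer may choose differently. Steps 1 and 3 are not shown.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Return the names of features to keep.

    `columns` maps feature name -> list of values (hypothetical layout).
    A feature is dropped if its absolute correlation with any feature
    already kept exceeds `threshold`.
    """
    kept = []
    for name, values in columns.items():
        if any(abs(pearson(values, columns[k])) > threshold for k in kept):
            continue  # redundant with a feature we already kept
        kept.append(name)
    return kept


features = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly 2 * a, so strongly correlated
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly correlated with a
}
selected = drop_correlated(features, threshold=0.9)
print(selected)  # ['a', 'c']
```

With a threshold of 0.9, `b` is dropped because it is almost a linear function of `a`, while `c` (correlation of -0.3 with `a`) survives.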

Parameters:
  • X_train (pd.DataFrame | np.ndarray) – Training feature matrix.

  • y_train (np.ndarray) – 1D NumPy array of target values corresponding to X_train.

  • const_feat_tol (float, default=0.95) – Threshold for removing near-constant features. A feature is removed if a single value accounts for at least this proportion of observations.

  • collinearity_tol (float, default=0.9) – Pearson correlation threshold above which features are treated as highly correlated and become candidates for removal.

  • grootcv_nfolds (int, default=5) – Number of folds for cross-validation in GrootCV.

  • grootcv_n_iter (int, default=20) – Number of iterations for feature selection in GrootCV.

  • grootcv_lgbm_objective (str, default="mae") – Objective function for the LightGBM model inside GrootCV.

  • **pipeline_kwargs (dict) – Additional keyword arguments passed to the underlying Pipeline.
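
As an illustration of how const_feat_tol is described above, the following sketch checks whether a single value accounts for at least the given proportion of observations. The helpers `dominant_share` and `is_near_constant` are hypothetical names used only for this example; they are not part of mlproject or of the DropConstantFeatures transformer.

```python
from collections import Counter


def dominant_share(values):
    """Fraction of observations taken by the most frequent value."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)


def is_near_constant(values, tol=0.95):
    """True if one value covers at least `tol` of the observations."""
    return dominant_share(values) >= tol


flag = [0] * 96 + [1] * 4       # 96% of rows are 0
balanced = [0] * 60 + [1] * 40  # 60% of rows are 0

print(is_near_constant(flag))      # True: 0.96 >= 0.95
print(is_near_constant(balanced))  # False: 0.60 < 0.95
```

Lowering the tolerance makes the filter more aggressive: at `tol=0.6`, the `balanced` column would also be flagged.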

Returns:

The Pipeline instance and the transformed training set, with only the relevant features retained.

Return type:

tuple[Pipeline, pd.DataFrame]