RENT base functions
RENT base is an abstract class that contains the constructor of RENT. It also provides the methods and functions that apply to both classes, RENT_Classification and RENT_Regression.
-
class RENT.RENT.RENT_Base(data, target, feat_names=[], C=[1, 10], l1_ratios=[0.6], autoEnetParSel=True, poly='OFF', testsize_range=(0.2, 0.6), K=100, scale=True, random_state=None, verbose=0)
The constructor initializes the variables common to RENT_Classification and RENT_Regression. Initializations that are specific to classification or regression are described in detail in RENT for binary classification and RENT for regression, respectively.
Parameters:
- data (<numpy array> or <pandas dataframe>) – Dataset on which feature selection is performed. Variables must be numeric (float or integer).
- target (<numpy array> or <pandas dataframe>) – Response variable of data.
- feat_names (<list>) – List of feature names, preferably strings. If empty, feature names are generated automatically. Default: feat_names=[].
- C (<list of int or float values>) – List of regularization parameters for the K models. The lower the value, the stronger the regularization. Default: C=[1,10].
- l1_ratios (<list of int or float values>) – List of ratios between the l1 and l2 penalties. Values must be in [0,1]. Use 0 for pure l2 and 1 for pure l1. Default: l1_ratios=[0.6].
- autoEnetParSel (<boolean>) – Cross-validated elastic net hyperparameter selection. Default: autoEnetParSel=True.
  - autoEnetParSel=True: perform a cross-validated hyperparameter pre-search so that RENT runs with only one hyperparameter setting.
  - autoEnetParSel=False: perform RENT with each combination of C and l1_ratios.
- poly (<str>) – Create non-linear features. Default: poly='OFF'.
  - poly='OFF': no feature interactions.
  - poly='ON': feature interactions and squared features (second-degree polynomials).
  - poly='ON_only_interactions': only feature interactions, no squared features.
- testsize_range (<tuple of float>) – Inside RENT, K models are trained, where the test size defines the proportion of the training data that is held out for testing a single model. The test size is either selected randomly inside testsize_range for each model or fixed by setting both tuple entries to the same value. Both entries must lie in (0,1). Default: testsize_range=(0.2, 0.6).
- K (<int>) – Number of unique train-test splits. Default: K=100.
- scale (<boolean>) – Column-wise standardization of each of the K training datasets. Default: scale=True.
- random_state (<None or int>) – Set a random state to reproduce your results. Default: random_state=None.
  - random_state=None: no random seed.
  - random_state={0,1,2,...}: random seed set.
- verbose (<int>) – Track the training process if the value is > 1. If verbose=1, only the overview of the RENT input is shown. Default: verbose=0.
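To illustrate how testsize_range, K and random_state interact, the following sketch draws one test size per model. The helper name draw_test_sizes and the uniform sampling are assumptions made for this example; the parameter description only states that the test size is selected randomly inside the range.

```python
import random


def draw_test_sizes(testsize_range=(0.2, 0.6), K=100, random_state=None):
    """Draw one test-set proportion per model (hypothetical helper, not part of RENT)."""
    low, high = testsize_range
    if not (0 < low <= high < 1):
        raise ValueError("testsize_range must lie inside (0, 1) with low <= high")
    # random_state=None leaves the generator unseeded, an int makes it reproducible.
    rng = random.Random(random_state)
    if low == high:
        # Equal tuple entries fix the test size for all K models.
        return [low] * K
    # Assumption: sizes are drawn uniformly from the range.
    return [rng.uniform(low, high) for _ in range(K)]


sizes = draw_test_sizes((0.2, 0.6), K=5, random_state=0)
fixed = draw_test_sizes((0.3, 0.3), K=5)
```

Calling the helper twice with the same integer random_state returns the same list, which mirrors the reproducibility that random_state is meant to provide.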
-
get_cv_matrices()
Three pandas data frames showing the cross-validated results for all combinations of C and l1_ratio. Only applicable if autoEnetParSel=True.
Returns:
- dataFrame_1: average scores for predictive performance. The higher the score, the better the parameter combination.
- dataFrame_2: average percentage of feature weights that are set to zero. The higher the average percentage, the stronger the feature selection with the corresponding parameter combination.
- dataFrame_3: harmonic means between normalized dataFrame_1 and normalized dataFrame_2. The parameter combination with the highest harmonic mean is selected.
Return type: <list> of <pandas dataframes>
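The relation between the three data frames can be sketched in plain Python. This is an illustration under assumptions, not the library code: min-max normalization over the whole matrix, the layout (rows for l1_ratio values, columns for C values), the helper names and the example numbers are all made up here.

```python
def min_max_normalize(matrix):
    """Scale all entries of a nested list to [0, 1] (assumed normalization)."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]


def harmonic_mean_matrix(scores, zero_pct):
    """Element-wise harmonic mean of the two normalized matrices."""
    s_norm = min_max_normalize(scores)
    z_norm = min_max_normalize(zero_pct)
    return [
        [2 * s * z / (s + z) if s + z > 0 else 0.0 for s, z in zip(s_row, z_row)]
        for s_row, z_row in zip(s_norm, z_norm)
    ]


def best_combination(combined, C, l1_ratios):
    """Return the (C, l1_ratio) pair with the highest harmonic mean."""
    cells = [(i, j) for i in range(len(l1_ratios)) for j in range(len(C))]
    i, j = max(cells, key=lambda ij: combined[ij[0]][ij[1]])
    return C[j], l1_ratios[i]


# Made-up example values: rows are l1_ratio values, columns are C values.
C = [0.1, 1, 10]
l1_ratios = [0.5, 1.0]
scores = [[0.60, 0.80, 0.90], [0.55, 0.78, 0.88]]   # plays the role of dataFrame_1
zero_pct = [[90, 60, 20], [95, 70, 30]]             # plays the role of dataFrame_2

combined = harmonic_mean_matrix(scores, zero_pct)   # plays the role of dataFrame_3
best = best_combination(combined, C, l1_ratios)
```

The harmonic mean rewards combinations that do well on both criteria at once: a setting with the best score but almost no zeroed weights ends up near 0, as does a setting that removes nearly all features but predicts poorly.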
-
get_enetParam_matrices()
Three pandas data frames showing the results for all combinations of l1_ratio and C.
Returns:
- dataFrame_1: holds average scores for predictive performance.
- dataFrame_2: holds average percentage of feature weights that were set to zero.
- dataFrame_3: holds harmonic means between dataFrame_1 and dataFrame_2.
Return type: <list> of <pandas dataframes>
-
get_enet_params()
Get the current hyperparameter combination of C and l1_ratio that is used in RENT analyses. By default, it is the best combination found. If autoEnetParSel=False, the user can change the combination with set_enet_params().
Returns: A tuple (C, l1_ratio).
Return type: <tuple>
-
get_runtime()
Total RENT training time in seconds.
Returns: Time.
Return type: <numeric value>
-
get_scores_list()
Prediction scores over the K models.
Returns: Scores list.
Return type: <list>
-
get_summary_criteria()
Summary statistic of the selection criteria tau_1, tau_2 and tau_3 (described in select_features()) for each feature. All three criteria are in [0,1].
Returns: Matrix where rows represent selection criteria and columns represent features.
Return type: <pandas dataframe>
-
get_weight_distributions(binary=False)
In each of the K models, feature weights are fitted, i.e. an individual weight is assigned to feature 1 for model 1, model 2, up to model K. This method returns the weight for every feature and model (1:K) combination.
Parameters:
- binary (<boolean>) – Default: binary=False.
  - binary=True: binary matrix where an entry is 1 for each weight unequal to 0.
  - binary=False: original weight matrix.
Returns: Weight matrix. Rows represent models (1:K), columns represent features.
Return type: <pandas dataframe>
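The effect of the binary option can be sketched on a small, made-up weight matrix. The helper names are hypothetical. Reading the column means of the binary matrix as the selection frequency of each feature is an assumption made here for illustration.

```python
def to_binary(weights):
    """Replace every non-zero weight by 1 and every zero weight by 0."""
    return [[1 if w != 0 else 0 for w in row] for row in weights]


def selection_frequency(weights):
    """Fraction of models in which each feature has a non-zero weight."""
    binary = to_binary(weights)
    n_models = len(binary)
    # zip(*binary) iterates over columns, i.e. over features.
    return [sum(col) / n_models for col in zip(*binary)]


# Made-up weights: 4 models (rows) and 3 features (columns).
weights = [
    [0.5, 0.0, -1.2],
    [0.4, 0.0, 0.0],
    [0.6, 0.1, -0.9],
    [0.5, 0.0, -1.0],
]

binary = to_binary(weights)         # analogous to binary=True
freq = selection_frequency(weights)
```

In this example the first feature is selected in every model, the second in one of four models and the third in three of four models.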
-
plot_elementary_models()
Two line plots. The first curve shows the prediction score over the K models. The second curve shows the percentage of weights set to 0 over the K models.
-
plot_object_PCA(cl=0, comp1=1, comp2=2, problem='class', hoggorm=True, hoggorm_plots=[1, 2, 3, 4, 6], sel_vars=True)
PCA analysis. For classification problems, PCA can be computed either on a single class separately or on both classes. Different coloring options for the scores are provided. Besides the scores plot, loadings, correlation loadings, biplot, and explained variance plots are available.
Parameters:
- cl (<int>, <str>) – Perform PCA on cl. Default: cl=0.
  - cl=0: Class 0.
  - cl=1: Class 1.
  - cl='both': All objects (incorrect predictions coloring).
  - cl='continuous': All objects (gradient coloring). For regression problems, this is the only valid option.
- comp1 (<int>) – First PCA component to plot. Default: comp1=1.
- comp2 (<int>) – Second PCA component to plot. Default: comp2=2.
- problem (<str>) – Classification or regression problem. Default: problem='class'.
  - problem='class': Classification problem. Can be used with all possible cl inputs.
  - problem='regression': Regression problem. Can only be used with cl='continuous'.
- hoggorm (<boolean>) – Set hoggorm=False to not use plots from the hoggormplot package. Default: hoggorm=True.
- hoggorm_plots (<list>) – Choose which plots from hoggormplot are plotted. Only plots that are relevant for RENT are possible options. hoggorm=True must be set. Default: hoggorm_plots=[1,2,3,4,6].
  - 1: scores plot
  - 2: loadings plot
  - 3: correlation loadings plot
  - 4: biplot
  - 6: explained variance plot
- sel_vars (<boolean>) – Only use the features selected with RENT for PCA. Default: sel_vars=True.
-
plot_selection_frequency()
Bar plot of the tau_1 value for each feature.
-
plot_validation_study(test_data, test_labels, num_drawings, num_permutations, metric='mcc', alpha=0.05)
Two validation studies based on a Student’s t-test. The null hypotheses claim that
- RENT is not better than random feature selection (VS1).
- RENT performs equally well on the real target and on a randomly permuted target (VS2).
If poly='ON' or poly='ON_only_interactions' was set in the RENT initialization, the test data is automatically polynomially transformed.
Parameters:
- test_data (<numpy array> or <pandas dataframe>) – Dataset used to evaluate the predictive models in the validation study. Must be independent of the data on which RENT is computed.
- test_labels (<numpy array> or <pandas dataframe>) – Response variable of test_data.
- num_drawings (<int>) – Number of independent feature subset drawings for VS1.
- num_permutations (<int>) – Number of independent test_labels permutations for VS2.
- metric (<str>) – The metric used to evaluate the K models. Only relevant for classification tasks. For regression, the R2 score is used. Default: metric='mcc'.
  - metric='accuracy': Accuracy
  - metric='f1': F1-score
  - metric='precision': Precision
  - metric='recall': Recall
  - metric='mcc': Matthews Correlation Coefficient
- alpha (<float>) – Significance level for the t-test. Default: alpha=0.05.
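The sampling steps behind the two validation studies can be sketched as follows. This is a simplified illustration, not the RENT implementation. All function names are hypothetical, the reference scores are made up, and the one-sample form of the t statistic is an assumption. In a full study, each drawn subset or permuted label vector would be used to fit and score a model on the test data.

```python
import math
import random
import statistics


def draw_feature_subsets(n_features, subset_size, num_drawings, seed=None):
    """VS1 sketch: draw random feature index subsets of a fixed size."""
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(n_features), subset_size))
        for _ in range(num_drawings)
    ]


def permute_labels(labels, num_permutations, seed=None):
    """VS2 sketch: create independently shuffled copies of the labels."""
    rng = random.Random(seed)
    labels = list(labels)
    return [rng.sample(labels, len(labels)) for _ in range(num_permutations)]


def one_sample_t_statistic(reference_scores, rent_score):
    """t statistic for H0: the mean reference score equals the RENT score."""
    n = len(reference_scores)
    mean = statistics.mean(reference_scores)
    sd = statistics.stdev(reference_scores)  # sample standard deviation
    return (mean - rent_score) / (sd / math.sqrt(n))


subsets = draw_feature_subsets(n_features=20, subset_size=4, num_drawings=3, seed=1)
shuffled = permute_labels([0, 0, 1, 1, 1], num_permutations=2, seed=1)

# Made-up scores of models built on random feature subsets vs. a made-up RENT score.
t_stat = one_sample_t_statistic([0.50, 0.55, 0.45, 0.52, 0.48], rent_score=0.70)
```

A strongly negative statistic indicates that the reference models score clearly below the RENT model. Whether the null hypothesis is rejected then depends on comparing the statistic with a Student's t distribution with n - 1 degrees of freedom at level alpha.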
-
select_features(tau_1_cutoff=0.9, tau_2_cutoff=0.9, tau_3_cutoff=0.975)
Selects features based on the cutoff values tau_1_cutoff, tau_2_cutoff and tau_3_cutoff.
Parameters:
- tau_1_cutoff (<float>) – Cutoff value for the tau_1 criterion. Choose a value between 0 and 1. Default: tau_1_cutoff=0.9.
- tau_2_cutoff (<float>) – Cutoff value for the tau_2 criterion. Choose a value between 0 and 1. Default: tau_2_cutoff=0.9.
- tau_3_cutoff (<float>) – Cutoff value for the tau_3 criterion. Choose a value between 0 and 1. Default: tau_3_cutoff=0.975.
Returns: Array with selected features.
Return type: <numpy array>
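How the three cutoffs could act on the criteria matrix returned by get_summary_criteria() is sketched below. The function name and the example values are made up. The sketch assumes that a feature is kept only if all three criteria reach their cutoffs, and that the comparison is inclusive.

```python
def select_by_cutoffs(summary, tau_1_cutoff=0.9, tau_2_cutoff=0.9, tau_3_cutoff=0.975):
    """Return the indices of features whose three criteria all reach the cutoffs.

    summary holds one row per criterion (tau_1, tau_2, tau_3) and one
    column per feature, mirroring the layout of get_summary_criteria().
    """
    tau_1, tau_2, tau_3 = summary
    return [
        idx
        for idx, (t1, t2, t3) in enumerate(zip(tau_1, tau_2, tau_3))
        # Assumption: inclusive comparison, all three conditions must hold.
        if t1 >= tau_1_cutoff and t2 >= tau_2_cutoff and t3 >= tau_3_cutoff
    ]


# Made-up criteria for 4 features.
summary = [
    [1.00, 0.95, 0.40, 0.92],  # tau_1
    [1.00, 0.85, 0.40, 0.92],  # tau_2
    [0.99, 0.98, 0.60, 0.97],  # tau_3
]

selected_default = select_by_cutoffs(summary)
selected_relaxed = select_by_cutoffs(summary, 0.9, 0.8, 0.95)
```

With the default cutoffs only the first feature survives. Relaxing tau_2_cutoff and tau_3_cutoff admits two more features, which shows how the cutoffs trade strictness against the number of selected features.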
-
set_enet_params(C, l1_ratio)
Set the hyperparameter combination of C and l1_ratio that is used for analyses. Only useful if autoEnetParSel=False.
Parameters:
- C (<float>) – Regularization parameter.
- l1_ratio (<float>) – l1 ratio with value in [0,1].
-
train()
If autoEnetParSel=False, this method trains K * len(C) * len(l1_ratios) models in total. The number of models using the same hyperparameters is K. Otherwise, if the best parameter combination is selected with cross-validation, only K models are trained. For each model, elastic net regularization is applied for feature selection. Internally, train() calls the run_parallel() function for classification or regression, respectively.
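The model counts stated above can be checked with a few lines of Python. The function name count_trained_models is hypothetical and only restates the arithmetic from the description.

```python
from itertools import product


def count_trained_models(C, l1_ratios, K, autoEnetParSel):
    """Number of models trained by train(), following the description above."""
    if autoEnetParSel:
        # One hyperparameter setting was selected beforehand: K models.
        return K
    # Every (C, l1_ratio) combination is trained on K train-test splits.
    combinations = list(product(C, l1_ratios))
    return K * len(combinations)


default_grid = count_trained_models([1, 10], [0.6], K=100, autoEnetParSel=False)
auto_selected = count_trained_models([1, 10], [0.6], K=100, autoEnetParSel=True)
larger_grid = count_trained_models([0.1, 1, 10], [0, 0.5, 1], K=100, autoEnetParSel=False)
```

With the default lists C=[1, 10] and l1_ratios=[0.6] and K=100, turning off the automatic selection doubles the number of trained models, and a 3 x 3 grid raises it ninefold.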