RENT base functions
RENT base is an abstract class and contains the constructor of RENT. Furthermore, the class comprises methods and functions that are applicable for both classes, RENT_Classification and RENT_Regression.
-
class RENT.RENT.RENT_Base(data, target, feat_names=[], C=[1, 10], l1_ratios=[0.6], autoEnetParSel=True, BIC=False, poly='OFF', testsize_range=(0.2, 0.6), K=100, scale=True, random_state=None, verbose=0)
The constructor initializes the variables shared by RENT_Classification and RENT_Regression. Initializations that are specific to classification or regression are described in detail in RENT for binary classification and RENT for regression, respectively.
Parameters: - data (<numpy array> or <pandas dataframe>) – Dataset on which feature selection shall be performed. Variable types must be numeric or integer.
- target (<numpy array> or <pandas dataframe>) – Response variable of data.
- feat_names (<list>) – List holding feature names, preferably strings. If empty, feature names are generated automatically. Default: feat_names=[].
- C (<list of int or float values>) – List of regularisation parameters for the K models. The lower the value, the stronger the regularisation. Default: C=[1, 10].
- l1_ratios (<list of int or float values>) – List holding ratios between l1 and l2 penalty. Values must be in [0,1]. For pure l2 use 0, for pure l1 use 1. Default: l1_ratios=[0.6].
- autoEnetParSel (<boolean>) – Cross-validated elastic net hyperparameter selection. Default: autoEnetParSel=True.
  - autoEnetParSel=True: perform a cross-validated hyperparameter pre-search, so that RENT runs with only one hyperparameter setting.
  - autoEnetParSel=False: perform RENT with each combination of C and l1_ratios.
- BIC (<boolean>) – Use the Bayesian information criterion to select hyperparameters. Default: BIC=False.
  - BIC=True: use BIC to select RENT hyperparameters.
  - BIC=False: BIC is not used.
- poly (<str>) – Create non-linear features. Default: poly='OFF'.
  - poly='OFF': no feature interaction.
  - poly='ON': feature interactions and squared features (second-degree polynomials).
  - poly='ON_only_interactions': only feature interactions, no squared features.
- testsize_range (<tuple float>) – Inside RENT, K models are trained, where the test size defines the proportion of the training data held out for testing a single model. The test size is either selected randomly within testsize_range for each model or fixed by setting the two tuple entries to the same value. Both entries must lie in (0,1). Default: testsize_range=(0.2, 0.6).
- K (<int>) – Number of unique train-test splits. Default: K=100.
- scale (<boolean>) – Column-wise standardization of each of the K train datasets. Default: scale=True.
- random_state (<None or int>) – Set a random state to reproduce your results. Default: random_state=None.
  - random_state=None: no random seed.
  - random_state={0,1,2,...}: random seed set.
- verbose (<int>) – Track the training process if value > 1. If verbose=1, only the overview of the RENT input is shown. Default: verbose=0.
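The interplay of K and testsize_range can be illustrated with a minimal sketch. It assumes that each test size is drawn uniformly from the range and that a split is a random partition of row indices; RENT's internal sampling may differ. The helpers draw_test_sizes and split_indices are hypothetical names and not part of the RENT API.

```python
import random


def draw_test_sizes(K, testsize_range, seed=None):
    """Draw one test-set proportion per model (hypothetical helper, not RENT's API)."""
    low, high = testsize_range
    if not (0 < low <= high < 1):
        raise ValueError("testsize_range entries must lie in (0, 1) with low <= high")
    rng = random.Random(seed)
    # Equal tuple entries fix the test size; otherwise draw uniformly (an assumption).
    return [low if low == high else rng.uniform(low, high) for _ in range(K)]


def split_indices(n_samples, test_size, rng):
    """Randomly partition row indices into a train part and a test part."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = max(1, round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]


# K=100 splits of a dataset with 50 rows, using the default range.
sizes = draw_test_sizes(K=100, testsize_range=(0.2, 0.6), seed=0)
rng = random.Random(0)
splits = [split_indices(50, size, rng) for size in sizes]

# Setting both entries to the same value fixes the test size for every model.
fixed_sizes = draw_test_sizes(K=3, testsize_range=(0.3, 0.3))
```

Each entry of splits is a (train, test) pair of row indices; one model would be fitted per pair.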
-
BIC_cutoff_search(parameters)
Compute the Bayesian information criterion for each combination of tau_1, tau_2 and tau_3.
Parameters: parameters (<dict> or) – Cutoff parameters to evaluate.
Returns: Array with the BIC values.
Return type: <numpy array>
-
get_BIC_matrix()
Dataframe with the BIC value for each combination of C and l1_ratio.
Returns: Dataframe of BIC values.
Return type: <pandas dataframe>
-
get_cv_matrices()
Three pandas data frames showing the cross-validated results for all combinations of C and l1_ratio. Only applicable if autoEnetParSel=True.
Returns:
- dataFrame_1: average scores for predictive performance. The higher the score, the better the parameter combination.
- dataFrame_2: average percentage of feature weights set to zero. The higher the average percentage, the stronger the feature selection with the corresponding parameter combination.
- dataFrame_3: harmonic means between normalized dataFrame_1 and normalized dataFrame_2. The parameter combination with the highest harmonic mean is selected.
Return type: <list> of <pandas dataframes>
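The selection rule behind dataFrame_3 can be sketched in a few lines. This is an illustration under assumptions, not RENT's implementation: it uses min-max normalization (the text only says "normalized"), plain dictionaries instead of data frames, and made-up example values. The function names are hypothetical.

```python
def min_max_normalize(table):
    """Scale the values of a {(C, l1_ratio): value} table to [0, 1] (assumed normalization)."""
    lo, hi = min(table.values()), max(table.values())
    if hi == lo:
        return {key: 1.0 for key in table}
    return {key: (value - lo) / (hi - lo) for key, value in table.items()}


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; 0 if both are 0."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def select_best_combination(scores, zero_fractions):
    """Return the combination with the highest harmonic mean and all harmonic means."""
    norm_scores = min_max_normalize(scores)
    norm_zeros = min_max_normalize(zero_fractions)
    combined = {
        key: harmonic_mean(norm_scores[key], norm_zeros[key]) for key in scores
    }
    return max(combined, key=combined.get), combined


# Made-up cross-validation results, keyed by (C, l1_ratio).
scores = {(1, 0.6): 0.80, (10, 0.6): 0.90, (100, 0.6): 0.92}
zero_fractions = {(1, 0.6): 0.70, (10, 0.6): 0.55, (100, 0.6): 0.20}

best, combined = select_best_combination(scores, zero_fractions)
```

In this example the middle combination wins because it balances predictive performance against sparsity, whereas the two extremes each score zero on one of the normalized criteria.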
-
get_enetParam_matrices()
Three pandas data frames showing the results for all combinations of l1_ratio and C.
Returns:
- dataFrame_1: holds average scores for predictive performance.
- dataFrame_2: holds average percentage of how many feature weights were set to zero.
- dataFrame_3: holds harmonic means between dataFrame_1 and dataFrame_2.
Return type: <list> of <pandas dataframes>
-
get_enet_params()
Get the current hyperparameter combination of C and l1_ratio that is used in RENT analyses. By default, it is the best combination found. If autoEnetParSel=False, the user can change the combination with set_enet_params().
Returns: A tuple (C, l1_ratio).
Return type: <tuple>
-
get_runtime()
Total RENT training time in seconds.
Returns: Time.
Return type: <numeric value>
-
get_scores_list()
Prediction scores over the K models.
Returns: Scores list.
Return type: <list>
-
get_summary_criteria()
Summary statistic of the selection criteria tau_1, tau_2 and tau_3 (described in select_features()) for each feature. All three criteria are in [0,1].
Returns: Matrix where rows represent selection criteria and columns represent features.
Return type: <pandas dataframe>
-
get_weight_distributions(binary=False)
In each of the K models, feature weights are fitted, i.e. an individual weight is assigned to feature 1 in model 1, model 2, up to model K. This method returns the weight for every combination of feature and model (1:K).
Parameters: binary (<boolean>) – Default: binary=False.
- binary=True: binary matrix where an entry is 1 for each weight not equal to 0.
- binary=False: original weight matrix.
Returns: Weight matrix. Rows represent models (1:K), columns represent features.
Return type: <pandas dataframe>
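The difference between the two return modes can be illustrated with a small sketch. It uses nested lists instead of a pandas dataframe and made-up weights; binarize_weights is a hypothetical helper, not a RENT function.

```python
def binarize_weights(weights):
    """Map each weight to 1 if it is non-zero and to 0 otherwise."""
    return [[1 if w != 0 else 0 for w in row] for row in weights]


# Made-up weight matrix: rows are models (here K=3), columns are features.
weights = [
    [0.0, 1.2, -0.4],
    [0.0, 0.9, 0.0],
    [0.1, 1.1, -0.3],
]

binary = binarize_weights(weights)

# Share of models in which each feature received a non-zero weight.
nonzero_share = [sum(column) / len(binary) for column in zip(*binary)]
```

Here binary corresponds to the binary=True view of the matrix, and nonzero_share summarizes it per feature.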
-
plot_elementary_models()
Two line plots: the first curve shows the prediction score over the K models, the second curve shows the percentage of weights set to 0 in each model.
-
plot_object_PCA(cl=0, comp1=1, comp2=2, problem='class', hoggorm=True, hoggorm_plots=[1, 2, 3, 4, 6], sel_vars=True)
PCA analysis. For classification problems, PCA can be computed either on a single class or on both classes. Different coloring options for the scores are provided. Besides scores, plots of loadings, correlation loadings, biplot and explained variance are available.
Parameters: - cl (<int>, <str>) – Perform PCA on cl. Default: cl=0.
  - cl=0: Class 0.
  - cl=1: Class 1.
  - cl='both': All objects (incorrect predictions coloring).
  - cl='continuous': All objects (gradient coloring). For regression problems, this is the only valid option.
- comp1 (<int>) – First PCA component to plot. Default: comp1=1.
- comp2 (<int>) – Second PCA component to plot. Default: comp2=2.
- problem (<str>) – Classification or regression problem. Default: problem='class'.
  - problem='class': Classification problem. Can be used with all possible cl inputs.
  - problem='regression': Regression problem. Can only be used with cl='continuous'.
- hoggorm (<boolean>) – To not use plots from the hoggormplot package, set hoggorm=False. Default: hoggorm=True.
- hoggorm_plots (<list>) – Choose which plots from hoggormplot are plotted. Only plots that are relevant for RENT are possible options. hoggorm=True must be set. Default: hoggorm_plots=[1,2,3,4,6].
  - 1: scores plot
  - 2: loadings plot
  - 3: correlation loadings plot
  - 4: biplot
  - 6: explained variance plot
- sel_vars (<boolean>) – Only use the features selected with RENT for PCA. Default: sel_vars=True.
-
plot_selection_frequency()
Bar plot of the tau_1 value for each feature.
-
plot_validation_study(test_data, test_labels, num_drawings, num_permutations, metric='mcc', alpha=0.05)
Two validation studies based on a Student’s t-test. The null hypotheses claim that
- RENT is not better than random feature selection (VS1).
- RENT performs equally well on the real and on a randomly permuted target (VS2).
If poly='ON' or poly='ON_only_interactions' was set in the RENT initialization, the test data is automatically polynomially transformed.
Parameters: - test_data (<numpy array> or <pandas dataframe>) – Dataset used to evaluate predictive models in the validation study. Must be independent of the data RENT is computed on.
- test_labels (<numpy array> or <pandas dataframe>) – Response variable of test_data.
- num_drawings (<int>) – Number of independent feature subset drawings for VS1.
- num_permutations (<int>) – Number of independent test_labels permutations for VS2.
- metric (<str>) – The metric to evaluate K models. Default: metric='mcc'. Only relevant for classification tasks. For regression, the R2-score is used.
  - metric='accuracy': Accuracy
  - metric='f1': F1-score
  - metric='precision': Precision
  - metric='recall': Recall
  - metric='mcc': Matthews Correlation Coefficient
- alpha (<float>) – Significance level for the t-test. Default: alpha=0.05.
-
select_features(tau_1_cutoff=0.9, tau_2_cutoff=0.9, tau_3_cutoff=0.975)
Selects features based on the cutoff values tau_1_cutoff, tau_2_cutoff and tau_3_cutoff.
Parameters: - tau_1_cutoff (<float>) – Cutoff value for the tau_1 criterion. Choose a value between 0 and 1. Default: tau_1_cutoff=0.9.
- tau_2_cutoff (<float>) – Cutoff value for the tau_2 criterion. Choose a value between 0 and 1. Default: tau_2_cutoff=0.9.
- tau_3_cutoff (<float>) – Cutoff value for the tau_3 criterion. Choose a value between 0 and 1. Default: tau_3_cutoff=0.975.
Returns: Array with selected features.
Return type: <numpy array>
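How the three cutoffs act on a criteria matrix can be sketched as follows. The sketch assumes that a feature is selected only if all three criteria reach their cutoffs (compared with >=); this rule is an assumption made for illustration and should be checked against the RENT source. The matrix layout follows get_summary_criteria() (rows are criteria, columns are features), the values are made up, and select_by_cutoffs is a hypothetical helper.

```python
def select_by_cutoffs(criteria, tau_1_cutoff=0.9, tau_2_cutoff=0.9, tau_3_cutoff=0.975):
    """Return indices of features whose three criteria all reach their cutoffs.

    criteria: list of three rows (tau_1, tau_2, tau_3), one column per feature.
    The "all criteria >= cutoff" rule is an assumption of this sketch.
    """
    cutoffs = (tau_1_cutoff, tau_2_cutoff, tau_3_cutoff)
    n_features = len(criteria[0])
    return [
        j
        for j in range(n_features)
        if all(row[j] >= cutoff for row, cutoff in zip(criteria, cutoffs))
    ]


# Made-up criteria for four features.
criteria = [
    [1.00, 0.95, 0.40, 0.92],  # tau_1
    [1.00, 0.91, 0.40, 0.85],  # tau_2
    [0.99, 0.98, 0.10, 0.99],  # tau_3
]

selected_default = select_by_cutoffs(criteria)
selected_relaxed = select_by_cutoffs(criteria, tau_2_cutoff=0.8)
```

Lowering a single cutoff, as in the second call, admits features that fail only that criterion.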
-
set_enet_params(C, l1_ratio)
Set the hyperparameter combination of C and l1_ratio that is used for analyses. Only useful if autoEnetParSel=False.
Parameters: - C (<float>) – Regularization parameter.
- l1_ratio (<float>) – l1 ratio with value in [0,1].
-
train()
If autoEnetParSel=False, this method trains K * len(C) * len(l1_ratios) models in total, with K models sharing each hyperparameter combination. Otherwise, if the best parameter combination is selected with cross-validation, only K models are trained. For each model, elastic net regularisation is applied for feature selection. Internally, train() calls the run_parallel() function for classification or regression, respectively.
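The model count described above follows directly from the size of the hyperparameter grid. The following sketch only reproduces that arithmetic; count_models is a hypothetical helper and does not train anything.

```python
from itertools import product


def count_models(K, C, l1_ratios, autoEnetParSel):
    """Number of elementary models trained, following the rule stated for train()."""
    if autoEnetParSel:
        # One hyperparameter combination is pre-selected, so only K models are trained.
        return K
    # Otherwise K models are trained for every (C, l1_ratio) combination.
    combinations = list(product(C, l1_ratios))
    return K * len(combinations)


models_without_search = count_models(K=100, C=[1, 10], l1_ratios=[0.6], autoEnetParSel=False)
models_with_search = count_models(K=100, C=[1, 10], l1_ratios=[0.6], autoEnetParSel=True)
larger_grid = count_models(K=100, C=[0.1, 1, 10], l1_ratios=[0, 0.5, 1], autoEnetParSel=False)
```

With the default grid the full search doubles the number of models, and the count grows multiplicatively with each additional value of C or l1_ratios.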