RENT base functions¶

RENT_Base is an abstract class that contains the RENT constructor. It also provides the methods that apply to both classes, RENT_Classification and RENT_Regression.

class RENT.RENT.RENT_Base(data, target, feat_names=[], C=[1, 10], l1_ratios=[0.6], autoEnetParSel=True, poly='OFF', testsize_range=(0.2, 0.6), K=100, scale=True, random_state=None, verbose=0)¶

The constructor initializes the variables common to RENT_Classification and RENT_Regression. Initializations that are specific to classification or regression are described in detail in RENT for binary classification and RENT for regression, respectively.

Parameters:

data (<numpy array> or <pandas dataframe>) – Dataset on which feature selection shall be performed. Variable types must be numeric or integer.
target (<numpy array> or <pandas dataframe>) – Response variable of data.
feat_names (<list>) – List holding feature names. Preferably a list of string values. If empty, feature names will be generated automatically. Default: feat_names=[].
C (<list of int or float values>) – List of regularization parameters for the K models. The lower the value, the stronger the regularization. Default: C=[1,10].
l1_ratios (<list of int or float values>) – List holding ratios between l1 and l2 penalty. Values must be in [0,1]. For pure l2 use 0, for pure l1 use 1. Default: l1_ratios=[0.6].
autoEnetParSel (<boolean>) –
Cross-validated elastic net hyperparameter selection.
- autoEnetParSel=True : perform a cross-validated hyperparameter pre-search, so that RENT runs with only one hyperparameter setting.
- autoEnetParSel=False : perform RENT with each combination of C and l1_ratios. Default: autoEnetParSel=True.
poly (<str>) –
Create non-linear features. Default: poly='OFF'.
- poly='OFF' : no feature interaction.
- poly='ON' : feature interactions and squared features (second-degree polynomial features).
- poly='ON_only_interactions' : only feature interactions, no squared features.
testsize_range (<tuple float>) – RENT trains K models; the testsize is the proportion of the train data that is held out for testing a single model. The testsize is either drawn randomly from testsize_range for each model or fixed by setting both tuple entries to the same value. Both entries must lie in (0,1). Default: testsize_range=(0.2, 0.6).
K (<int>) – Number of unique train-test splits. Default K=100.
scale (<boolean>) – Column-wise standardization of each of the K train datasets. Default scale=True.
random_state (<None or int>) –
Set a random state to reproduce your results. Default: random_state=None.
- random_state=None : no random seed.
- random_state={0,1,2,...} : random seed set.
verbose (<int>) – Track the training process if verbose > 1. If verbose=1, only an overview of the RENT input is shown. Default: verbose=0.
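As a rough illustration of how testsize_range, K and random_state interact, the sketch below draws one test proportion per model. It assumes the proportions are drawn uniformly from the range, which the description above does not specify; draw_test_sizes is a hypothetical helper and not part of RENT.

```python
import random

def draw_test_sizes(testsize_range=(0.2, 0.6), K=100, random_state=None):
    """Draw one test-set proportion per model (hypothetical helper, not RENT code)."""
    low, high = testsize_range
    if not (0 < low <= high < 1):
        raise ValueError("both entries of testsize_range must lie in (0, 1)")
    # random_state=None leaves the generator unseeded; an int makes the draws reproducible.
    rng = random.Random(random_state)
    # Assumption: uniform draws. Equal tuple entries give a fixed test size.
    return [low if low == high else rng.uniform(low, high) for _ in range(K)]

sizes = draw_test_sizes((0.2, 0.6), K=5, random_state=0)  # varies per model
fixed = draw_test_sizes((0.25, 0.25), K=5)                # same size for every model
```

Calling the helper twice with the same integer random_state returns the same list, which is the behaviour the random_state parameter is meant to provide.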

get_cv_matrices()¶

Three pandas data frames showing the cross-validated results for all combinations of C and l1_ratio. Only applicable if autoEnetParSel=True.

Returns:

dataFrame_1: average scores for predictive performance. The higher the score, the better the parameter combination.
dataFrame_2: average percentage of feature weights that are set to zero. The higher the average percentage, the stronger the feature selection with the corresponding parameter combination.
dataFrame_3: harmonic means between normalized dataFrame_1 and normalized dataFrame_2. The parameter combination with the highest harmonic mean is selected.

Return type: <list> of <pandas dataframes>
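The selection rule behind dataFrame_3 can be sketched in plain Python. The example below assumes min-max normalization over each whole matrix and a layout with rows indexed by l1_ratio and columns by C; neither detail is stated above, the numbers are made up, and min_max, harmonic_mean and best_combination are hypothetical names.

```python
def min_max(matrix):
    """Scale all entries of a nested list to [0, 1] (assumed normalization)."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    return [[(v - lo) / span if span else 0.0 for v in row] for row in matrix]

def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; 0 if both are 0."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

def best_combination(scores, zeros, C, l1_ratios):
    """Return the (C, l1_ratio) pair with the highest harmonic mean."""
    s, z = min_max(scores), min_max(zeros)
    best = None
    for i, l1 in enumerate(l1_ratios):
        for j, c in enumerate(C):
            hm = harmonic_mean(s[i][j], z[i][j])
            if best is None or hm > best[0]:
                best = (hm, c, l1)
    return best[1], best[2]

C = [0.1, 1, 10]
l1_ratios = [0.1, 0.5, 0.9]
scores = [[0.60, 0.80, 0.85],   # made-up average prediction scores
          [0.55, 0.78, 0.84],
          [0.50, 0.75, 0.83]]
zeros = [[0.70, 0.30, 0.05],    # made-up share of weights set to zero
         [0.85, 0.50, 0.20],
         [0.95, 0.65, 0.35]]
chosen = best_combination(scores, zeros, C, l1_ratios)
```

The harmonic mean rewards combinations that do well on both matrices at once: a combination with the top score but almost no zero weights, or the reverse, ends up near 0.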

get_enetParam_matrices()¶

Three pandas data frames showing the results for all combinations of l1_ratio and C.

Returns:

dataFrame_1: average scores for predictive performance.
dataFrame_2: average percentage of feature weights that were set to zero.
dataFrame_3: harmonic means between dataFrame_1 and dataFrame_2.

Return type: <list> of <pandas dataframes>

get_enet_params()¶

Get current hyperparameter combination of C and l1_ratio that is used in RENT analyses. By default it is the best combination found. If autoEnetParSel=False the user can change the combination with set_enet_params().

Returns:	A tuple (C, l1_ratio).
Return type:	<tuple>

get_runtime()¶

Total RENT training time in seconds.

Returns:	Time.
Return type:	<numeric value>

get_scores_list()¶

Prediction scores over the K models.

Returns:	Scores list.
Return type:	<list>

get_summary_criteria()¶

Summary statistics of the selection criteria tau_1, tau_2 and tau_3 (described in select_features()) for each feature. All three criteria are in [0,1].

Returns:	Matrix where rows represent selection criteria and columns represent features.
Return type:	<pandas dataframe>

get_weight_distributions(binary=False)¶

In each of the K models, feature weights are fitted, i.e. every feature is assigned an individual weight in model 1, model 2, up to model K. This method returns the weight for every combination of feature and model (1:K).

Parameters:	binary (<boolean>) – If binary=True, return a binary matrix in which an entry is 1 for each weight unequal to 0. If binary=False, return the original weight matrix. Default: binary=False.
Returns:	Weight matrix. Rows represent models (1:K), columns represent features.
Return type:	<pandas dataframe>
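The relation between the two output forms can be shown with a small made-up weight matrix. This is only a sketch in plain Python: RENT itself returns a pandas data frame, and to_binary is a hypothetical helper.

```python
# Made-up weights: rows are models 1..K (here K=4), columns are features.
weights = [
    [0.8, 0.0, -0.3],
    [0.5, 0.0, 0.0],
    [0.9, 0.1, -0.2],
    [0.7, 0.0, 0.0],
]

def to_binary(matrix):
    """1 where a weight is unequal to 0, else 0 (mirrors binary=True)."""
    return [[int(w != 0) for w in row] for row in matrix]

binary = to_binary(weights)

# Share of the K models in which each feature received a non-zero weight.
nonzero_share = [sum(col) / len(binary) for col in zip(*binary)]
```

Column means of the binary matrix show at a glance how often each feature survived the elastic net regularization across the K models.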

plot_elementary_models()¶: Two line plots. The first shows the prediction score of each of the K models, the second shows the percentage of weights set to 0 in each model.

plot_object_PCA(cl=0, comp1=1, comp2=2, problem='class', hoggorm=True, hoggorm_plots=[1, 2, 3, 4, 6], sel_vars=True)¶

PCA analysis. For classification problems, PCA can be computed either on a single class separately or on both classes. Different coloring possibilities for the scores are provided. Besides scores, loadings, correlation loadings, biplot, and explained variance plots are available.

Parameters:

cl (<int>, <str>) –
Perform PCA on cl. Default: cl=0.
- cl=0: Class 0.
- cl=1: Class 1.
- cl='both': All objects (incorrect predictions coloring).
- cl='continuous': All objects (gradient coloring). For regression problems, this is the only valid option.
comp1 (<int>) – First PCA component to plot. Default: comp1=1.
comp2 (<int>) – Second PCA component to plot. Default: comp2=2.
problem (<str>) –
Classification or regression problem. Default: problem='class'.
- problem='class': Classification problem. Can be used with all possible cl inputs.
- problem='regression': Regression problem. Can only be used with cl='continuous'.
hoggorm (<boolean>) – Set hoggorm=False to skip the plots from the hoggormplot package. Default: hoggorm=True.
hoggorm_plots (<list>) –
Choose which plots from hoggormplot are plotted. Only plots that are relevant for RENT are possible options. hoggorm=True must be set. Default: hoggorm_plots=[1,2,3,4,6].
- 1: scores plot
- 2: loadings plot
- 3: correlation loadings plot
- 4: biplot
- 6: explained variance plot
sel_vars (<boolean>) – Only use the features selected with RENT for PCA. Default: sel_vars=True.

plot_selection_frequency()¶: Barplot of tau_1 value for each feature.

plot_validation_study(test_data, test_labels, num_drawings, num_permutations, metric='mcc', alpha=0.05)¶

Two validation studies (VS1 and VS2), each based on a Student’s t-test. The null hypotheses are:

VS1: RENT is not better than random feature selection.
VS2: RENT performs equally well on the real and a randomly permuted target.

If poly='ON' or poly='ON_only_interactions' in the RENT initialization, the test data is automatically polynomially transformed.

Parameters:

test_data (<numpy array> or <pandas dataframe>) – Dataset used to evaluate predictive models in the validation study. Must be independent of the data that RENT is computed on.
test_labels (<numpy array> or <pandas dataframe>) – Response variable of test_data.
num_drawings (<int>) – Number of independent feature subset drawings for VS1.
num_permutations (<int>) – Number of independent test_labels permutations for VS2.
metric (<str>) –
The metric used to evaluate the K models. Only relevant for classification tasks; for regression, the R2-score is used. Default: metric='mcc'.
- metric='accuracy' : Accuracy
- metric='f1' : F1-score
- metric='precision' : Precision
- metric='recall' : Recall
- metric='mcc' : Matthews Correlation Coefficient
alpha (<float>) – Significance level for the t-test. Default alpha=0.05.
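The idea behind VS1 can be illustrated with a hand-computed t statistic. The sketch assumes a one-sample t-test that compares the scores obtained with randomly drawn feature subsets against the single score obtained with the RENT-selected features; the exact test variant is not stated above. All numbers are made up, and one_sample_t is a hypothetical helper.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t statistic for the null hypothesis mean(sample) == mu0."""
    if len(sample) < 2:
        raise ValueError("need at least two observations")
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("sample has zero variance")
    return (statistics.mean(sample) - mu0) / (sd / math.sqrt(len(sample)))

# Made-up scores of models built on num_drawings=8 random feature subsets.
random_scores = [0.52, 0.48, 0.55, 0.50, 0.47, 0.53, 0.49, 0.51]
rent_score = 0.71  # made-up score of the model built on RENT-selected features

t_stat = one_sample_t(random_scores, rent_score)
```

A strongly negative t_stat means the random subsets score well below the RENT score. Turning it into a p-value to compare with alpha requires the t-distribution with n - 1 degrees of freedom, which the Python standard library does not provide.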

select_features(tau_1_cutoff=0.9, tau_2_cutoff=0.9, tau_3_cutoff=0.975)¶

Selects features based on the cutoff values tau_1_cutoff, tau_2_cutoff and tau_3_cutoff.

Parameters:

tau_1_cutoff (<float>) – Cutoff value for the tau_1 criterion. Choose a value between 0 and 1. Default: tau_1_cutoff=0.9.
tau_2_cutoff (<float>) – Cutoff value for the tau_2 criterion. Choose a value between 0 and 1. Default: tau_2_cutoff=0.9.
tau_3_cutoff (<float>) – Cutoff value for the tau_3 criterion. Choose a value between 0 and 1. Default: tau_3_cutoff=0.975.
Returns:	Array with selected features.
Return type:	<numpy array>
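A minimal sketch of cutoff-based selection follows. It assumes that a feature is kept only if all three criteria reach their cutoff and that the comparison is "greater than or equal"; both are assumptions, not statements from this page. The criteria values are made up and select is a hypothetical stand-in for the method.

```python
def select(criteria, tau_1_cutoff=0.9, tau_2_cutoff=0.9, tau_3_cutoff=0.975):
    """Return indices of features whose three criteria all reach their cutoff.

    criteria maps 'tau_1', 'tau_2' and 'tau_3' to lists with one value per feature.
    """
    cutoffs = (tau_1_cutoff, tau_2_cutoff, tau_3_cutoff)
    rows = (criteria["tau_1"], criteria["tau_2"], criteria["tau_3"])
    return [
        j for j, values in enumerate(zip(*rows))
        if all(v >= c for v, c in zip(values, cutoffs))
    ]

criteria = {  # made-up summary criteria for three features
    "tau_1": [1.00, 0.95, 0.40],
    "tau_2": [0.98, 0.85, 1.00],
    "tau_3": [0.99, 0.99, 0.60],
}
strict = select(criteria)                     # default cutoffs
relaxed = select(criteria, tau_2_cutoff=0.8)  # lower tau_2 cutoff
```

Lowering a cutoff can only add features to the selection, so the cutoffs act as a dial between a small, stable feature set and a larger, more permissive one.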

set_enet_params(C, l1_ratio)¶

Set the hyperparameter combination of C and l1_ratio that is used for analyses. Only useful if autoEnetParSel=False.

Parameters:	C (<float>) – Regularization parameter. l1_ratio (<float>) – l1 ratio with value in [0,1].

train()¶: If autoEnetParSel=False, this method trains K * len(C) * len(l1_ratios) models in total, with K models for each hyperparameter combination. Otherwise, if the best parameter combination is selected with cross-validation, only K models are trained. For each model, elastic net regularization is applied for feature selection. Internally, train() calls the run_parallel() function for classification or regression, respectively.
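The model counts stated above can be checked with a few lines of Python. n_models is a hypothetical helper, and the count excludes any models fitted during the cross-validated pre-search itself.

```python
from itertools import product

def n_models(K, C, l1_ratios, autoEnetParSel):
    """Number of elastic net models fitted by train(), per the description above."""
    if autoEnetParSel:
        return K  # a single pre-selected (C, l1_ratio) pair
    return K * len(C) * len(l1_ratios)

K = 100
C = [0.1, 1, 10]
l1_ratios = [0.1, 0.5, 0.9]

grid = list(product(C, l1_ratios))  # every (C, l1_ratio) combination
full_run = n_models(K, C, l1_ratios, autoEnetParSel=False)
auto_run = n_models(K, C, l1_ratios, autoEnetParSel=True)
```

With three values each for C and l1_ratios, switching off the automatic selection multiplies the number of fitted models by nine, which is worth keeping in mind for larger datasets.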
