RENT base functions

RENT_Base is an abstract class and contains the constructor of RENT. Furthermore, the class comprises methods and functions that are applicable to both classes, RENT_Classification and RENT_Regression.

class RENT.RENT.RENT_Base(data, target, feat_names=[], C=[1, 10], l1_ratios=[0.6], autoEnetParSel=True, BIC=False, poly='OFF', testsize_range=(0.2, 0.6), K=100, scale=True, random_state=None, verbose=0)

The constructor initializes common variables of RENT_Classification and RENT_Regression. Initializations that are specific for classification or regression are described in detail in RENT for binary classification and RENT for regression, respectively.

Parameters:
  • data (<numpy array> or <pandas dataframe>) – Dataset on which feature selection shall be performed. Variable types must be numeric or integer.
  • target (<numpy array> or <pandas dataframe>) – Response variable of data.
  • feat_names (<list>) – List holding feature names. Preferably a list of string values. If empty, feature names will be generated automatically. Default: feat_names=[].
  • C (<list of int or float values>) – List with regularisation parameters for the K models. The lower the value, the stronger the regularisation. Default: C=[1,10].
  • l1_ratios (<list of int or float values>) – List holding ratios between l1 and l2 penalty. Values must be in [0,1]. For pure l2 use 0, for pure l1 use 1. Default: l1_ratios=[0.6].
  • autoEnetParSel (<boolean>) –
    Cross-validated elastic net hyperparameter selection.
    • autoEnetParSel=True : perform a cross-validated pre-hyperparameter search, such that RENT runs with only one hyperparameter setting.
    • autoEnetParSel=False : perform RENT with each combination of C and l1_ratios. Default: autoEnetParSel=True.
  • BIC (<boolean>) –
    Use the Bayesian information criterion to select hyperparameters. Default: BIC=False.
    • BIC=True : use BIC to select RENT hyperparameters.
    • BIC=False : no use of BIC.
  • poly (<str>) –
    Create non-linear features. Default: poly='OFF'.
    • poly='OFF' : no feature interaction.
    • poly='ON' : feature interaction and squared features (2-polynoms).
    • poly='ON_only_interactions' : only feature interactions, no squared features.
  • testsize_range (<tuple float>) – Inside RENT, K models are trained, and the test size defines the proportion of the data used for testing a single model. The test size is either drawn at random from within testsize_range for each model, or fixed by setting both tuple entries to the same value. Both entries must lie in the range (0,1). Default: testsize_range=(0.2, 0.6).
  • K (<int>) – Number of unique train-test splits. Default K=100.
  • scale (<boolean>) – Columnwise standardization of each of the K train datasets. Default: scale=True.
  • random_state (<None or int>) –
    Set a random state to reproduce your results. Default: random_state=None.
    • random_state=None : no random seed.
    • random_state={0,1,2,...} : random seed set.
  • verbose (<int>) – Track the training process if verbose > 1. If verbose=1, only an overview of the RENT input is shown. Default: verbose=0.
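The three modes of the poly parameter can be illustrated on a single two-feature sample. The helper below is an illustrative re-implementation of the documented behaviour, not RENT's internal feature expansion, which may differ in ordering and implementation:

```python
from itertools import combinations, combinations_with_replacement

def expand(row, poly="OFF"):
    """Illustrative expansion of one sample for the three poly modes.

    Mimics the documented behaviour of the poly parameter; RENT's
    internal implementation may differ.
    """
    row = list(row)
    if poly == "OFF":
        # no feature interaction
        return row
    if poly == "ON":
        # original features plus all degree-2 terms
        # (squared features and pairwise interactions)
        return row + [a * b for a, b in combinations_with_replacement(row, 2)]
    if poly == "ON_only_interactions":
        # original features plus pairwise interactions, no squared terms
        return row + [a * b for a, b in combinations(row, 2)]
    raise ValueError("poly must be 'OFF', 'ON' or 'ON_only_interactions'")
```

For a sample with features [2, 3], poly='ON' yields [2, 3, 4, 6, 9], while poly='ON_only_interactions' yields [2, 3, 6].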

Compute the Bayesian information criterion for each combination of tau_1, tau_2 and tau_3.

Parameters: parameters (<dict>) – Cutoff parameters to evaluate.
Returns: Array with the BIC values.
Return type: <numpy array>
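The documentation does not spell out the exact criterion; a common textbook form of the BIC for regression models is BIC = n·ln(RSS/n) + k·ln(n), sketched below. The exact variant RENT evaluates for each cutoff combination may differ:

```python
import math

def bic(n_samples, rss, n_params):
    """Textbook Bayesian information criterion for a regression model.

    n_samples: number of observations
    rss:       residual sum of squares of the fitted model
    n_params:  number of free parameters (e.g. selected features)

    Illustrative only; RENT's internal BIC variant may differ.
    """
    return n_samples * math.log(rss / n_samples) + n_params * math.log(n_samples)
```

Lower BIC trades goodness of fit against the number of selected features, so among cutoff combinations the one with the smallest BIC is preferred.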
get_BIC_matrix()

Dataframe with the BIC value for each combination of C and l1_ratio.

Returns: Dataframe of BIC values.
Return type: <pandas dataframe>

get_cv_matrices()

Three pandas data frames showing cross-validated results for all combinations of C and l1_ratio. Only applicable if autoEnetParSel=True.

Returns:
  • dataFrame_1: average scores for predictive performance. The higher the score, the better the parameter combination.
  • dataFrame_2: average percentage of feature weights set to zero. The higher the average percentage, the stronger the feature selection with the corresponding parameter combination.
  • dataFrame_3: harmonic means between normalized dataFrame_1 and normalized dataFrame_2. The parameter combination with the highest harmonic mean is selected.
Return type: <list> of <pandas dataframes>
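The selection logic behind dataFrame_3 can be sketched in plain Python. The helper below is a simplification (it works on flat lists of per-combination averages, whereas RENT normalises whole matrices) and is illustrative only:

```python
def pick_best(scores, zero_fracs):
    """Pick the index of the hyperparameter combination with the best
    harmonic mean of normalised predictive score and normalised sparsity.

    scores:     average predictive score per (C, l1_ratio) combination
    zero_fracs: average fraction of zero weights per combination

    Simplified sketch of the documented selection rule, not RENT's code.
    """
    def normalise(vals):
        # min-max scale to [0, 1]; degenerate case maps everything to 1
        lo, hi = min(vals), max(vals)
        return [(v - lo) / (hi - lo) if hi > lo else 1.0 for v in vals]

    s, z = normalise(scores), normalise(zero_fracs)
    # elementwise harmonic mean; 0 whenever either component is 0
    harm = [2 * a * b / (a + b) if (a + b) > 0 else 0.0 for a, b in zip(s, z)]
    return max(range(len(harm)), key=harm.__getitem__)
```

The combination that balances high predictive performance with strong sparsity wins, even if it is best in neither criterion alone.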
get_enetParam_matrices()

Three pandas data frames showing results for all combinations of l1_ratio and C.

Returns:
  • dataFrame_1: holds average scores for predictive performance.
  • dataFrame_2: holds average percentage of how many feature weights were set to zero.
  • dataFrame_3: holds harmonic means between dataFrame_1 and dataFrame_2.
Return type: <list> of <pandas dataframes>
get_enet_params()

Get the current hyperparameter combination of C and l1_ratio that is used in RENT analyses. By default this is the best combination found. If autoEnetParSel=False, the user can change the combination with set_enet_params().

Returns: A tuple (C, l1_ratio).
Return type: <tuple>
get_runtime()

Total RENT training time in seconds.

Returns: Time.
Return type: <numeric value>
get_scores_list()

Prediction scores over the K models.

Returns: Scores list.
Return type: <list>

get_summary_criteria()

Summary statistics of the selection criteria tau_1, tau_2 and tau_3 (described in select_features()) for each feature. All three criteria are in [0,1].

Returns: Matrix where rows represent selection criteria and columns represent features.
Return type: <pandas dataframe>
get_weight_distributions(binary=False)

In each of the K models, feature weights are fitted, i.e. an individual weight is assigned to every feature in model 1, model 2, up to model K. This method returns the weight for every feature and model (1:K) combination.

Parameters: binary (<boolean>) –
Default: binary=False.
  • binary=True : binary matrix where each entry is 1 for every weight unequal to 0.
  • binary=False : original weight matrix.
Returns: Weight matrix. Rows represent models (1:K), columns represent features.
Return type: <pandas dataframe>
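The effect of the binary flag can be sketched on a plain list-of-lists weight matrix. This mimics the documented behaviour only; RENT's implementation returns pandas dataframes:

```python
def weight_distributions(weights, binary=False):
    """Sketch of get_weight_distributions on a list-of-lists weight
    matrix (rows = models 1..K, columns = features).

    binary=True replaces every non-zero weight with 1; binary=False
    returns a copy of the matrix unchanged. Illustrative only.
    """
    if not binary:
        # original weight matrix (copied so the caller's data is untouched)
        return [row[:] for row in weights]
    # 1 wherever a feature received a non-zero weight in that model
    return [[1 if w != 0 else 0 for w in row] for row in weights]
```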
plot_elementary_models()

Two line plots: the first curve shows the prediction score of each of the K models; the second shows the percentage of weights set to 0 in each model.

plot_object_PCA(cl=0, comp1=1, comp2=2, problem='class', hoggorm=True, hoggorm_plots=[1, 2, 3, 4, 6], sel_vars=True)

PCA analysis. For classification problems, PCA can be computed either on a single class separately or on both classes. Different coloring possibilities for the scores are provided. Besides scores, loadings, correlation loadings, biplot, and explained variance plots are available.

Parameters:
  • cl (<int>, <str>) –
    Perform PCA on cl. Default: cl=0.
    • cl=0: Class 0.
    • cl=1: Class 1.
    • cl='both': All objects (incorrect predictions coloring).
    • cl='continuous': All objects (gradient coloring). For regression problems, this is the only valid option.
  • comp1 (<int>) – First PCA component to plot. Default: comp1=1.
  • comp2 (<int>) – Second PCA component to plot. Default: comp2=2.
  • problem (<str>) –
    Classification or regression problem. Default: problem='class'.
    • problem='class': Classification problem. Can be used with all possible cl inputs.
    • problem='regression': Regression problem. Can only be used with cl='continuous'.
  • hoggorm (<boolean>) – Set hoggorm=False to disable the plots from the hoggormplot package. Default: hoggorm=True.
  • hoggorm_plots (<list>) –
    Choose which plots from hoggormplot are plotted. Only plots that are relevant for RENT are possible options. hoggorm=True must be set. Default: hoggorm_plots=[1,2,3,4,6].
    • 1: scores plot
    • 2: loadings plot
    • 3: correlation loadings plot
    • 4: biplot
    • 6: explained variance plot
  • sel_vars (<boolean>) – Only use the features selected with RENT for PCA. Default: sel_vars=True.
plot_selection_frequency()

Barplot of tau_1 value for each feature.

plot_validation_study(test_data, test_labels, num_drawings, num_permutations, metric='mcc', alpha=0.05)
Two validation studies based on a Student’s t-test. The null hypotheses claim that
  • RENT is not better than random feature selection.
  • RENT performs equally well on the real and a randomly permutated target.

If poly='ON' or poly='ON_only_interactions' in the RENT initialization, the test data is automatically polynomially transformed.

Parameters:
  • test_data (<numpy array> or <pandas dataframe>) – Dataset used to evaluate predictive models in the validation study. Must be independent of the data on which RENT is computed.
  • test_labels (<numpy array> or <pandas dataframe>) – Response variable of test_data.
  • num_drawings (<int>) – Number of independent feature subset drawings for VS1.
  • num_permutations (<int>) – Number of independent test_labels permutations for VS2.
  • metric (<str>) –

    The metric used to evaluate the K models. Only relevant for classification tasks; for regression, the R2-score is used. Default: metric='mcc'.

    • metric='accuracy' : Accuracy
    • metric='f1' : F1-score
    • metric='precision' : Precision
    • metric='recall' : Recall
    • metric='mcc' : Matthews Correlation Coefficient
  • alpha (<float>) – Significance level for the t-test. Default: alpha=0.05.
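The idea behind the second null hypothesis can be illustrated with a small permutation study in plain Python. Everything below (the scoring rule, the data, the empirical p-value) is invented for illustration; RENT itself refits real models and uses a Student's t-test rather than an empirical count:

```python
import random

def permutation_p_value(score_fn, y, num_permutations, seed=0):
    """Empirical p-value for H0: the model scores as well on a
    permuted target as on the real one.

    score_fn: callable mapping a target vector to a score
              (higher = better); stands in for a trained model.
    Illustrative sketch only, not RENT's validation study.
    """
    rng = random.Random(seed)
    real_score = score_fn(y)
    at_least_as_good = 0
    for _ in range(num_permutations):
        perm = y[:]
        rng.shuffle(perm)
        # count permutations on which the "model" does at least as well
        if score_fn(perm) >= real_score:
            at_least_as_good += 1
    return (at_least_as_good + 1) / (num_permutations + 1)

# Toy illustration: accuracy of a fixed prediction vector against the
# real target versus permuted targets (all values invented).
pred = [0, 1, 0, 1, 1, 0, 1, 0, 1, 1]
truth = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
acc = lambda y: sum(p == t for p, t in zip(pred, y)) / len(y)
p_val = permutation_p_value(acc, truth, num_permutations=200)
```

A small p-value indicates the score on the real target is unlikely under random label assignment, i.e. the second null hypothesis is rejected.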
select_features(tau_1_cutoff=0.9, tau_2_cutoff=0.9, tau_3_cutoff=0.975)

Selects features based on the cutoff values for tau_1_cutoff, tau_2_cutoff and tau_3_cutoff.

Parameters:
  • tau_1_cutoff (<float>) – Cutoff value for the tau_1 criterion. Choose a value between 0 and 1. Default: tau_1_cutoff=0.9.
  • tau_2_cutoff (<float>) – Cutoff value for the tau_2 criterion. Choose a value between 0 and 1. Default: tau_2_cutoff=0.9.
  • tau_3_cutoff (<float>) – Cutoff value for the tau_3 criterion. Choose a value between 0 and 1. Default: tau_3_cutoff=0.975.
Returns: Array with selected features.
Return type: <numpy array>
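The cutoff mechanism can be sketched for the tau_1 criterion alone, assuming (as plot_selection_frequency() suggests) that tau_1 is the fraction of the K models in which a feature receives a non-zero weight. The tau_2 and tau_3 criteria are omitted here, so this is an illustrative simplification, not the full selection rule:

```python
def select_by_tau1(binary_weights, tau_1_cutoff=0.9):
    """Sketch of the tau_1 criterion only: keep features whose
    selection frequency across the K models reaches the cutoff.

    binary_weights: list of K rows, one per model; entry 1 means the
    feature had a non-zero weight in that model.
    """
    K = len(binary_weights)
    n_feat = len(binary_weights[0])
    # per-feature selection frequency over the K models
    freq = [sum(row[j] for row in binary_weights) / K for j in range(n_feat)]
    return [j for j, f in enumerate(freq) if f >= tau_1_cutoff]
```

With four models and three features, a feature selected in all four models survives a cutoff of 0.9, one selected in three of four survives only a cutoff of 0.75 or lower.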

set_enet_params(C, l1_ratio)

Set the hyperparameter combination of C and l1_ratio that is used for analyses. Only useful if autoEnetParSel=False.

Parameters:
  • C (<float>) – Regularization parameter.
  • l1_ratio (<float>) – l1 ratio with value in [0,1].
train()

If autoEnetParSel=False, this method trains K * len(C) * len(l1_ratios) models in total; the number of models sharing the same hyperparameters is K. Otherwise, if the best parameter combination is selected with cross-validation, only K models are trained. For each model, elastic net regularisation is applied for feature selection. Internally, train() calls the run_parallel() function for classification or regression, respectively.
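The model count described above can be written down directly. This tiny helper is not part of RENT's API, just a restatement of the counting rule:

```python
def n_models(K, C, l1_ratios, autoEnetParSel=True):
    """Number of elementary models train() fits, per the description
    above: K when one hyperparameter combination was pre-selected,
    otherwise K per combination of C and l1_ratios."""
    return K if autoEnetParSel else K * len(C) * len(l1_ratios)
```

With the defaults (K=100, C=[1, 10], l1_ratios=[0.6]) and autoEnetParSel=False, 200 models would be trained; with autoEnetParSel=True, only 100.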