RENT for binary classification

RENT feature selection for classification problems.

class RENT.RENT.RENT_Classification(data, target, feat_names=[], C=[1, 10], l1_ratios=[0.6], autoEnetParSel=True, BIC=False, poly='OFF', testsize_range=(0.2, 0.6), scoring='accuracy', classifier='logreg', K=100, scale=True, random_state=None, verbose=0)

This class carries out RENT on a given binary classification dataset.

Parameters:
  • data (<numpy array> or <pandas dataframe>) – Dataset on which feature selection is performed. Variable types must be numeric or integer.
  • target (<numpy array> or <pandas dataframe>) – Response variable of data.
  • feat_names (<list>) – List holding feature names. Preferably a list of string values. If empty, feature names will be generated automatically. Default: feat_names=[].
  • C (<list of int or float values>) – List of regularization parameters for the K models. Lower values give stronger regularization. Default: C=[1,10].
  • l1_ratios (<list of int or float values>) – List holding ratios between l1 and l2 penalty. Values must be in [0,1]. For pure l2 use 0, for pure l1 use 1. Default: l1_ratios=[0.6].
  • autoEnetParSel (<boolean>) –
    Cross-validated elastic net hyperparameter selection.
    • autoEnetParSel=True : perform a cross-validated hyperparameter pre-search, so that RENT runs with only one hyperparameter setting.
    • autoEnetParSel=False : perform RENT with each combination of C and l1_ratios. Default: autoEnetParSel=True.
  • poly (<str>) –
    Create non-linear features. Default: poly='OFF'.
    • poly='OFF' : no feature interaction.
    • poly='ON' : feature interactions and squared features (degree-2 polynomial terms).
    • poly='ON_only_interactions' : only feature interactions, no squared features.
  • testsize_range (<tuple float>) – Inside RENT, K models are trained; the test size defines the proportion of the supplied data that is held out for testing a single model. The test size is either drawn at random from testsize_range for each model or fixed by setting both tuple entries to the same value. Both entries must lie in (0,1). Default: testsize_range=(0.2, 0.6).
  • scoring (<str>) –
    The metric used to evaluate the K models. Default: scoring='accuracy'.
    • scoring='accuracy' : Accuracy
    • scoring='f1' : F1-score
    • scoring='precision' : Precision
    • scoring='recall': Recall
    • scoring='mcc' : Matthews Correlation Coefficient
  • classifier (<str>) –
    Classifier with which models are trained.
    • classifier='logreg' : Logistic Regression
  • K (<int>) – Number of unique train-test splits. Default: K=100.
  • scale (<boolean>) – Columnwise standardization of the K train datasets. Default: scale=True.
  • random_state (<None or int>) –
    Set a random state to reproduce your results. Default: random_state=None.
    • random_state=None : no random seed.
    • random_state={0,1,2,...} : random seed set.
  • verbose (<int>) – Track the training process if verbose > 1. If verbose = 1, only an overview of the RENT input is shown. Default: verbose=0.
Returns:

A class that contains the RENT classification model.

Return type:

<class>
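The testsize_range behaviour described in the parameter list can be sketched in a few lines of standard-library Python. This is only an illustration: the helper draw_test_sizes is hypothetical and not part of RENT, and drawing from a uniform distribution is an assumption, since the documentation only states that the test size is selected at random inside the range.

```python
import random

def draw_test_sizes(testsize_range=(0.2, 0.6), K=100, random_state=None):
    """Draw one test-set proportion per model.

    Hypothetical helper for illustration; not part of the RENT API.
    Assumes a uniform draw inside testsize_range.
    """
    low, high = testsize_range
    if not (0 < low <= high < 1):
        raise ValueError("testsize_range must satisfy 0 < low <= high < 1")
    rng = random.Random(random_state)  # seeded generator for reproducibility
    # uniform(a, a) always returns a, so equal entries fix the test size
    return [rng.uniform(low, high) for _ in range(K)]

sizes = draw_test_sizes((0.2, 0.6), K=5, random_state=0)  # varies per model
fixed = draw_test_sizes((0.3, 0.3), K=5)                  # same for all models
```

With equal tuple entries every model uses the same proportion; with a seed, repeated calls return the same sequence of proportions.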

get_object_probabilities()

Logistic Regression probabilities for each combination of object and model. The method can only be used if classifier='logreg'.

Returns: Matrix, where rows represent objects and columns represent logistic regression probability outputs (probability of belonging to class 1).
Return type: <pandas dataframe>
get_summary_objects()

Across the K models trained inside RENT, each object of the dataset is part of the test set between 0 (never) and K (always) times. This method computes a summary of the classification results for each object across all models in which the object was part of the test set. The summary shows how often an object has been misclassified.

Returns: Data matrix. Rows represent objects, columns represent generated variables. The first column gives the number of times the object was part of the test set, the second the true class of the object, the third the number of times the object was misclassified, and the fourth the corresponding percentage of misclassifications.
Return type: <pandas dataframe>
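The bookkeeping behind the four summary columns can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the library's implementation: summarize_objects and its arguments are hypothetical names, the result is a list of tuples rather than a pandas dataframe, and reporting NaN for an object that was never in a test set is an assumption.

```python
def summarize_objects(y_true, test_indices_per_model, predictions_per_model):
    """Return one (n_test, true_class, n_wrong, pct_wrong) tuple per object.

    Hypothetical helper; mirrors the column order described above.
    """
    n = len(y_true)
    n_test = [0] * n   # how often each object was in a test set
    n_wrong = [0] * n  # how often it was misclassified there
    for test_idx, preds in zip(test_indices_per_model, predictions_per_model):
        for i, pred in zip(test_idx, preds):
            n_test[i] += 1
            if pred != y_true[i]:
                n_wrong[i] += 1
    rows = []
    for i in range(n):
        # Assumption: objects never tested get NaN as percentage
        pct = 100.0 * n_wrong[i] / n_test[i] if n_test[i] else float("nan")
        rows.append((n_test[i], y_true[i], n_wrong[i], pct))
    return rows

# Toy example: 4 objects, 3 models, each model tests two objects
y_true = [0, 1, 1, 0]
test_idx = [[0, 1], [1, 2], [0, 1]]
preds = [[0, 0], [1, 1], [1, 1]]
summary = summarize_objects(y_true, test_idx, preds)
```

In this toy example, object 0 is tested twice and misclassified once, while object 3 never appears in a test set.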
plot_object_probabilities(object_id, binning='auto', lower=0, upper=1, kde=False, norm_hist=False)

Histograms of predicted probabilities from get_object_probabilities().

Parameters:
  • object_id (<list of int or str>) – Objects whose histograms are to be plotted. Type depends on the index format of the dataframe.
  • lower (<float>) – Lower bound of the x-axis. Default: lower=0.
  • upper (<float>) – Upper bound of the x-axis. Default: upper=1.
  • kde (<boolean>) – Kernel density estimation, from seaborn distplot. Default: kde=False.
  • norm_hist (<boolean>) – Normalize the histogram, from seaborn distplot. Default: norm_hist=False.
run_parallel(K)

If autoEnetParSel=False, parallel computation of K * len(C) * len(l1_ratios) classification models. Otherwise, computation of K models.

Parameters:
  • K – Range of train-test splits. The parameter cannot be set directly by the user but is used for internal parallelization.

train()

If autoEnetParSel=False, this method trains K * len(C) * len(l1_ratios) models in total, K for each hyperparameter combination. Otherwise, if the best parameter combination is selected with cross-validation, only K models are trained. For each model, elastic net regularisation is applied for feature selection. Internally, train() calls the run_parallel() function for classification or regression, respectively.
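The model counts stated above follow directly from enumerating the hyperparameter grid. The sketch below only lists which fits would be run; the function enumerate_fits and its selected argument are hypothetical, with selected standing in for the single (C, l1_ratio) pair that the cross-validated pre-search would return when autoEnetParSel=True.

```python
from itertools import product

def enumerate_fits(C, l1_ratios, K, selected=None):
    """List the (C, l1_ratio, split) triples that would be fitted.

    Hypothetical helper for illustration; not part of the RENT API.
    `selected` stands in for the pair chosen by the pre-search.
    """
    if selected is not None:
        grid = [selected]                   # one setting only
    else:
        grid = list(product(C, l1_ratios))  # every combination
    return [(c, l1, k) for (c, l1) in grid for k in range(K)]

full = enumerate_fits(C=[1, 10], l1_ratios=[0.1, 0.6], K=100)
single = enumerate_fits(C=[1, 10], l1_ratios=[0.1, 0.6], K=100,
                        selected=(10, 0.6))
print(len(full), len(single))  # 400 100
```

With two values of C and two l1 ratios, the full grid costs four times as many fits as the pre-selected setting, which is the main practical reason to leave autoEnetParSel enabled.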