RENT for regression

RENT feature selection for regression problems.

class RENT.RENT.RENT_Regression(data, target, feat_names=[], C=[1, 10], l1_ratios=[0.6], autoEnetParSel=True, BIC=False, poly='OFF', testsize_range=(0.2, 0.6), K=100, scale=True, random_state=None, verbose=0)

This class carries out RENT on a given regression dataset.

Parameters:
  • data (<numpy array> or <pandas dataframe>) – Dataset on which feature selection shall be performed. All variables must be numeric (integer or float).
  • target (<numpy array> or <pandas dataframe>) – Response variable of data.
  • feat_names (<list>) – List holding feature names. Preferably a list of string values. If empty, feature names will be generated automatically. Default: feat_names=[].
  • C (<list of int or float values>) – List of regularisation parameters for the K models. The lower the value, the stronger the regularisation. Default: C=[1,10].
  • l1_ratios (<list of int or float values>) – List holding ratios between l1 and l2 penalty. Values must be in [0,1]. For pure l2 use 0, for pure l1 use 1. Default: l1_ratios=[0.6].
  • autoEnetParSel (<boolean>) –
    Cross-validated elastic net hyperparameter selection.
    • autoEnetParSel=True : perform a cross-validated hyperparameter pre-search, such that RENT runs with only one hyperparameter setting.
    • autoEnetParSel=False : perform RENT with each combination of C and l1_ratios. Default: autoEnetParSel=True.
  • BIC (<boolean>) –
    Use the Bayesian information criterion to select hyperparameters.
    • BIC=True : use BIC to select RENT hyperparameters.
    • BIC=False : do not use BIC. Default: BIC=False.
  • poly (<str>) –
    Create non-linear features. Default: poly='OFF'.
    • poly='OFF' : no feature interaction.
    • poly='ON' : feature interactions and squared features (second-degree polynomial terms).
    • poly='ON_only_interactions' : only feature interactions, no squared features.
  • testsize_range (<tuple float>) – Inside RENT, K models are trained, where testsize is the proportion of the training data held out for testing a single model. The testsize can either be randomly selected from within testsize_range for each model or fixed by setting the two tuple entries to the same value. Both tuple entries must lie in (0,1). Default: testsize_range=(0.2, 0.6).
  • K (<int>) – Number of unique train-test splits. Default K=100.
  • scale (<boolean>) – Columnwise standardization of the K train datasets. Default scale=True.
  • random_state (<None or int>) –
    Set a random state to reproduce your results. Default: random_state=None.
    • random_state=None : no random seed.
    • random_state={0,1,2,...} : random seed set.
  • verbose (<int>) – Track the training process if verbose > 1. If verbose = 1, only an overview of the RENT input is shown. Default: verbose=0.
Returns:

A class that contains the RENT regression model.

Return type:

<class>
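
The role of K and testsize_range can be illustrated with a minimal sketch. It is not RENT's own implementation: the helper draw_splits is hypothetical, and drawing the test size uniformly from testsize_range is an assumption (the description above only states that it is selected randomly inside the range). The sketch also does not enforce uniqueness of the splits.

```python
import random

def draw_splits(n_samples, K=100, testsize_range=(0.2, 0.6), random_state=None):
    """Hypothetical helper: return K (train_idx, test_idx) pairs."""
    rng = random.Random(random_state)
    low, high = testsize_range
    splits = []
    for _ in range(K):
        # Equal tuple entries fix the test size; otherwise it is drawn
        # uniformly from the range (an assumption made for this sketch).
        test_size = low if low == high else rng.uniform(low, high)
        n_test = max(1, round(test_size * n_samples))
        indices = list(range(n_samples))
        rng.shuffle(indices)
        test_idx = sorted(indices[:n_test])
        train_idx = sorted(indices[n_test:])
        splits.append((train_idx, test_idx))
    return splits

splits = draw_splits(n_samples=50, K=5, testsize_range=(0.2, 0.6), random_state=0)
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

With 50 samples and the default range, each test set in this sketch holds between 10 and 30 objects, and the remaining objects form the training set of that model.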

get_object_errors()

Absolute errors for samples that were part of a test set at least once among the K models.

Returns: Matrix. Rows represent objects, columns represent generated variables.
Return type: <pandas dataframe>
get_summary_objects()

Each object of the dataset is part of the test set between 0 (never) and K (always) times during RENT training. This method computes, for each sample, the mean absolute error across all models in which the sample was part of the test set.

Returns: Data matrix. Rows represent objects, columns represent generated variables. The first column denotes how often the object was part of the test set, the second column shows the average absolute error.
Return type: <pandas dataframe>
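
The relation between the per-object errors and their summary can be sketched as follows. This is an illustration only, not the library's code: the data are made up, the helper summarise is hypothetical, and the layout (one row per object, one entry per model, NaN where the object was not in the test set) is an assumption made for the example.

```python
import math

nan = math.nan

# Hypothetical absolute errors for K = 4 models. Each row is one object;
# NaN marks a model in which the object was not part of the test set.
abs_errors = {
    "obj_0": [0.5, nan, 1.5, nan],
    "obj_1": [nan, nan, 2.0, nan],
    "obj_2": [1.0, 3.0, 2.0, 2.0],
}

def summarise(errors):
    """Return {object: (times in test set, mean absolute error)}."""
    summary = {}
    for obj, row in errors.items():
        observed = [e for e in row if not math.isnan(e)]
        if observed:  # objects that were never tested are left out
            summary[obj] = (len(observed), sum(observed) / len(observed))
    return summary

summary = summarise(abs_errors)
for obj, (n_tests, mean_abs_error) in summary.items():
    print(obj, n_tests, mean_abs_error)
```

Here obj_2 was in the test set of all four models, so its count equals K, while obj_1 was tested only once and its average is that single error.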
plot_object_errors(object_id, binning='auto', lower=0, upper=100, kde=False, norm_hist=False)

Histograms of absolute errors from get_object_errors().

Parameters:
  • object_id (<list of int or str>) – Objects whose histograms shall be plotted. Type depends on the index format of the dataframe.
  • lower (<float>) – Lower bound of the x-axis. Default lower=0.
  • upper (<float>) – Upper bound of the x-axis. Default upper=100.
  • kde (<boolean>) – Kernel density estimation, from seaborn distplot. Default: kde=False.
  • norm_hist (<boolean>) – Normalize the histogram, from seaborn distplot. Default: norm_hist=False.
run_parallel(K)

If autoEnetParSel=False, parallel computation of K * len(C) * len(l1_ratios) linear regression models. Otherwise, computation of K models.

Parameters: K – Range of train-test splits. This parameter cannot be set directly by the user; it is used for internal parallelisation.
train()

If autoEnetParSel=False, this method trains K * len(C) * len(l1_ratios) models in total. The number of models using the same hyperparameters is K. Otherwise, if the best parameter combination is selected with cross-validation, only K models are trained. For each model, elastic net regularisation is applied for feature selection. Internally, train() calls the run_parallel() function for classification or regression, respectively.
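
The model counts stated above can be checked with a short sketch. The helper n_models is hypothetical and only reproduces the arithmetic from the description; it does not call RENT.

```python
from itertools import product

def n_models(K, C, l1_ratios, autoEnetParSel):
    """Hypothetical helper: number of models trained, per the text above."""
    if autoEnetParSel:
        # One hyperparameter setting is pre-selected, so only K models remain.
        return K
    return K * len(C) * len(l1_ratios)

K = 100
C = [1, 10]
l1_ratios = [0.6]

# Every (C, l1_ratio) combination that is used when autoEnetParSel=False.
grid = list(product(C, l1_ratios))
print(grid)                                              # [(1, 0.6), (10, 0.6)]
print(n_models(K, C, l1_ratios, autoEnetParSel=False))   # 200
print(n_models(K, C, l1_ratios, autoEnetParSel=True))    # 100
```

With the default values of C, l1_ratios and K, disabling the automatic selection doubles the number of trained models, because two hyperparameter combinations are each fitted on all K splits.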