Class UnivariateFeatureSelector

All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, UnivariateFeatureSelectorParams, Params, HasFeaturesCol, HasLabelCol, HasOutputCol, DefaultParamsWritable, Identifiable, MLWritable, scala.Serializable

public final class UnivariateFeatureSelector extends Estimator<UnivariateFeatureSelectorModel> implements UnivariateFeatureSelectorParams, DefaultParamsWritable
Feature selector based on univariate statistical tests against labels. Spark currently supports three univariate score functions: chi-squared, ANOVA F-test, and F-value. The user chooses among them indirectly by setting featureType and labelType, and Spark picks the score function that matches the specified combination.

The following combinations of featureType and labelType are supported:
- featureType categorical and labelType categorical: Spark uses chi-squared, i.e. chi2 in sklearn.
- featureType continuous and labelType categorical: Spark uses ANOVA F-test, i.e. f_classif in sklearn.
- featureType continuous and labelType continuous: Spark uses F-value, i.e. f_regression in sklearn.
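
The mapping above can be sketched as a small lookup. This is a plain-Java illustration only, not Spark's implementation: the class ScoreFunctionChooser, its method choose, and the returned labels are hypothetical names introduced here, and rejecting the unlisted combination with an exception is an assumption.

```java
/** Hypothetical helper, not part of Spark: mirrors the documented combinations. */
final class ScoreFunctionChooser {

    private ScoreFunctionChooser() {}

    /** Returns the name of the score function for a featureType/labelType pair. */
    static String choose(String featureType, String labelType) {
        if (featureType.equals("categorical") && labelType.equals("categorical")) {
            return "chi-squared";   // chi2 in sklearn
        }
        if (featureType.equals("continuous") && labelType.equals("categorical")) {
            return "ANOVA F-test";  // f_classif in sklearn
        }
        if (featureType.equals("continuous") && labelType.equals("continuous")) {
            return "F-value";       // f_regression in sklearn
        }
        // Assumption: any combination not listed above is rejected.
        throw new IllegalArgumentException(
            "Unsupported combination: featureType=" + featureType
                + ", labelType=" + labelType);
    }

    public static void main(String[] args) {
        System.out.println(choose("continuous", "categorical"));
    }
}
```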

The UnivariateFeatureSelector supports different selection modes: numTopFeatures, percentile, fpr, fdr, fwe.
- numTopFeatures chooses a fixed number of top features according to a hypothesis test.
- percentile is similar but chooses a fraction of all features instead of a fixed number.
- fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
- fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
- fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
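
To make the five modes concrete, the following plain-Java sketch applies each rule to an array of per-feature p-values. It is an illustration under stated assumptions, not Spark's code: the class SelectionModeSketch and its methods are hypothetical, features are ranked by ascending p-value, percentile truncates numFeatures * fraction toward zero, and the threshold comparisons are strict for fpr and fwe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the selection modes; operates on raw p-values. */
final class SelectionModeSketch {

    private SelectionModeSketch() {}

    /** Feature indices ordered by ascending p-value (stable for ties). */
    static Integer[] rankByPValue(double[] p) {
        Integer[] order = new Integer[p.length];
        for (int i = 0; i < p.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(p[a], p[b]));
        return order;
    }

    /** numTopFeatures: the k features with the smallest p-values. */
    static List<Integer> numTopFeatures(double[] p, int k) {
        Integer[] order = rankByPValue(p);
        int count = Math.max(0, Math.min(k, order.length));
        List<Integer> out = new ArrayList<>(Arrays.asList(order).subList(0, count));
        Collections.sort(out); // report indices in their original order
        return out;
    }

    /** percentile: like numTopFeatures, with k given as a fraction of all features. */
    static List<Integer> percentile(double[] p, double fraction) {
        return numTopFeatures(p, (int) (p.length * fraction)); // assumption: truncation
    }

    /** fpr: every feature whose p-value is below alpha. */
    static List<Integer> fpr(double[] p, double alpha) {
        return below(p, alpha);
    }

    /** fwe: every feature whose p-value is below alpha / numFeatures. */
    static List<Integer> fwe(double[] p, double alpha) {
        return below(p, alpha / p.length);
    }

    /** fdr: Benjamini-Hochberg step-up procedure at level alpha. */
    static List<Integer> fdr(double[] p, double alpha) {
        Integer[] order = rankByPValue(p);
        int n = p.length;
        int cutoff = 0; // largest 1-based rank whose p-value passes its own threshold
        for (int rank = 1; rank <= n; rank++) {
            if (p[order[rank - 1]] <= alpha * rank / n) {
                cutoff = rank;
            }
        }
        return numTopFeatures(p, cutoff); // keep everything up to that rank
    }

    private static List<Integer> below(double[] p, double threshold) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < p.length; i++) {
            if (p[i] < threshold) {
                out.add(i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] p = {0.001, 0.045, 0.035, 0.20, 0.012}; // made-up p-values
        System.out.println("numTopFeatures(2): " + numTopFeatures(p, 2));
        System.out.println("percentile(0.5):   " + percentile(p, 0.5));
        System.out.println("fpr(0.05):         " + fpr(p, 0.05));
        System.out.println("fdr(0.05):         " + fdr(p, 0.05));
        System.out.println("fwe(0.05):         " + fwe(p, 0.05));
    }
}
```

With the made-up p-values in main and a threshold of 0.05, fpr keeps four features, fdr keeps two, and fwe keeps one, which shows how the three error-rate modes become progressively stricter for the same threshold.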

By default, the selection mode is numTopFeatures.

See Also: