Class ChiSqSelector

Object
org.apache.spark.mllib.feature.ChiSqSelector
All Implemented Interfaces:
Serializable

public class ChiSqSelector extends Object implements Serializable
Creates a ChiSquared feature selector. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.
- numTopFeatures chooses a fixed number of top features according to a chi-squared test.
- percentile is similar but chooses a fraction of all features instead of a fixed number.
- fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
- fdr uses the Benjamini-Hochberg procedure (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
- fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is numTopFeatures, with the default number of top features set to 50.
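The five selection rules described above can be sketched in plain Java, given one chi-squared p-value per feature. This is an illustration under stated assumptions, not Spark's implementation: the class SelectionRuleSketch and all of its methods are hypothetical names, ties are broken by feature index, and the percentile count is assumed to be truncated toward zero.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

// Hypothetical helper, not part of Spark: applies the five selection rules
// to an array holding one p-value per feature.
final class SelectionRuleSketch {

    // Feature indices ordered by ascending p-value (most significant first).
    static Integer[] rankByPValue(double[] pValues) {
        Integer[] order = new Integer[pValues.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> pValues[i]));
        return order;
    }

    // Indices of the k features with the smallest p-values, sorted ascending.
    static int[] takeTop(double[] pValues, int k) {
        Integer[] order = rankByPValue(pValues);
        int n = Math.max(0, Math.min(k, order.length));
        int[] selected = new int[n];
        for (int i = 0; i < n; i++) {
            selected[i] = order[i];
        }
        Arrays.sort(selected);
        return selected;
    }

    // Indices of all features whose p-value is strictly below the threshold.
    static int[] belowThreshold(double[] pValues, double threshold) {
        return IntStream.range(0, pValues.length)
                .filter(i -> pValues[i] < threshold)
                .toArray();
    }

    // numTopFeatures: a fixed number of top features.
    static int[] numTopFeatures(double[] pValues, int k) {
        return takeTop(pValues, k);
    }

    // percentile: a fraction of all features (assumed truncated toward zero).
    static int[] percentile(double[] pValues, double fraction) {
        return takeTop(pValues, (int) (pValues.length * fraction));
    }

    // fpr: every feature with a p-value below alpha.
    static int[] fpr(double[] pValues, double alpha) {
        return belowThreshold(pValues, alpha);
    }

    // fwe: same as fpr, with the threshold scaled by 1/numFeatures.
    static int[] fwe(double[] pValues, double alpha) {
        return belowThreshold(pValues, alpha / pValues.length);
    }

    // fdr (Benjamini-Hochberg): find the largest 1-based rank k whose sorted
    // p-value satisfies p_(k) <= alpha * k / n, then keep the first k features.
    static int[] fdr(double[] pValues, double alpha) {
        Integer[] order = rankByPValue(pValues);
        int n = pValues.length;
        int keep = 0;
        for (int rank = 1; rank <= n; rank++) {
            if (pValues[order[rank - 1]] <= alpha * rank / n) {
                keep = rank;
            }
        }
        return takeTop(pValues, keep);
    }
}
```

With the p-values {0.001, 0.20, 0.015, 0.04, 0.5} and a threshold of 0.05, fpr keeps three features, fdr keeps two, and fwe keeps one, which shows how the three threshold-based methods become progressively stricter.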
  • Constructor Details

    • ChiSqSelector

      public ChiSqSelector()
    • ChiSqSelector

      public ChiSqSelector(int numTopFeatures)
      This is the same as calling this() and setNumTopFeatures(numTopFeatures).
      Parameters:
      numTopFeatures - the number of top features to select
  • Method Details

    • supportedSelectorTypes

      public static String[] supportedSelectorTypes()
      Set of selector types that ChiSqSelector supports.
    • numTopFeatures

      public int numTopFeatures()
    • percentile

      public double percentile()
    • fpr

      public double fpr()
    • fdr

      public double fdr()
    • fwe

      public double fwe()
    • selectorType

      public String selectorType()
    • setNumTopFeatures

      public ChiSqSelector setNumTopFeatures(int value)
    • setPercentile

      public ChiSqSelector setPercentile(double value)
    • setFpr

      public ChiSqSelector setFpr(double value)
    • setFdr

      public ChiSqSelector setFdr(double value)
    • setFwe

      public ChiSqSelector setFwe(double value)
    • setSelectorType

      public ChiSqSelector setSelectorType(String value)
    • fit

      public ChiSqSelectorModel fit(RDD<LabeledPoint> data)
      Returns a ChiSquared feature selector.

      Parameters:
      data - an RDD[LabeledPoint] containing the labeled dataset with categorical features. Real-valued features are treated as categorical, with each distinct value forming its own category. Apply a feature discretizer before using this function.
      Returns:
      the fitted ChiSqSelectorModel
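
Because every distinct real value counts as its own category, a continuous feature can produce a very large number of categories for the chi-squared test. One possible way to discretize beforehand is equal-width binning, sketched below in plain Java. The class EqualWidthDiscretizerSketch, its methods, and the bin width of 16 are assumptions for illustration, not part of Spark.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: equal-width binning of one
// feature column before it is passed to a chi-squared based selector.
final class EqualWidthDiscretizerSketch {

    // Replaces each value with its bin index, floor(value / binWidth).
    static double[] discretize(double[] column, double binWidth) {
        if (binWidth <= 0) {
            throw new IllegalArgumentException("binWidth must be positive");
        }
        double[] binned = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            binned[i] = Math.floor(column[i] / binWidth);
        }
        return binned;
    }

    // Number of distinct values, i.e. the number of categories the
    // chi-squared test would see for this column.
    static long distinctCount(double[] column) {
        return Arrays.stream(column).distinct().count();
    }
}
```

For example, the column {0.0, 15.9, 16.0, 47.5, 255.0} has five distinct values; with a bin width of 16 it maps to bins {0, 0, 1, 2, 15}, which leaves four categories.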