Package org.apache.spark.mllib.feature
Class ChiSqSelector
Object
org.apache.spark.mllib.feature.ChiSqSelector
All Implemented Interfaces: Serializable
Creates a ChiSquared feature selector.
The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.
- numTopFeatures chooses a fixed number of top features according to a chi-squared test.
- percentile is similar but chooses a fraction of all features instead of a fixed number.
- fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
- fdr uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
- fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
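The five rules above can be sketched in plain Java, operating on an array of per-feature p-values. This is an illustrative sketch only, not Spark's implementation: the class `ChiSqSelectionSketch`, its method names, the sample p-values, and the exact comparison operators are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical helper, not part of Spark: applies each selection rule to p-values.
final class ChiSqSelectionSketch {

    // Feature indices ordered by ascending p-value (most significant first).
    public static Integer[] rankByPValue(double[] p) {
        Integer[] idx = IntStream.range(0, p.length).boxed().toArray(Integer[]::new);
        Arrays.sort(idx, (a, b) -> Double.compare(p[a], p[b]));
        return idx;
    }

    // Keep the first k ranked indices and return them in ascending index order.
    public static int[] firstK(Integer[] ranked, int k) {
        int[] out = new int[Math.min(k, ranked.length)];
        for (int i = 0; i < out.length; i++) {
            out[i] = ranked[i];
        }
        Arrays.sort(out);
        return out;
    }

    // numTopFeatures: a fixed number of the most significant features.
    public static int[] numTopFeatures(double[] p, int n) {
        return firstK(rankByPValue(p), n);
    }

    // percentile: a fraction of all features instead of a fixed number.
    public static int[] percentile(double[] p, double fraction) {
        return firstK(rankByPValue(p), (int) (p.length * fraction));
    }

    // fpr: every feature whose p-value is below the threshold.
    public static int[] fpr(double[] p, double alpha) {
        return IntStream.range(0, p.length).filter(i -> p[i] < alpha).toArray();
    }

    // fwe: same as fpr, with the threshold scaled by 1/numFeatures.
    public static int[] fwe(double[] p, double alpha) {
        return fpr(p, alpha / p.length);
    }

    // fdr: Benjamini-Hochberg. Find the largest rank k with p_(k) <= alpha * k / m
    // and keep the k most significant features.
    public static int[] fdr(double[] p, double alpha) {
        Integer[] ranked = rankByPValue(p);
        int k = 0;
        for (int i = 0; i < ranked.length; i++) {
            if (p[ranked[i]] <= alpha * (i + 1) / ranked.length) {
                k = i + 1;
            }
        }
        return firstK(ranked, k);
    }

    public static void main(String[] args) {
        double[] p = {0.001, 0.045, 0.025, 0.20, 0.008}; // made-up p-values
        System.out.println(Arrays.toString(numTopFeatures(p, 2))); // [0, 4]
        System.out.println(Arrays.toString(percentile(p, 0.6)));   // [0, 2, 4]
        System.out.println(Arrays.toString(fpr(p, 0.05)));         // [0, 1, 2, 4]
        System.out.println(Arrays.toString(fdr(p, 0.05)));         // [0, 2, 4]
        System.out.println(Arrays.toString(fwe(p, 0.05)));         // [0, 4]
    }
}
```

With the same threshold of 0.05, fwe is the strictest rule, fpr the most permissive, and fdr falls between the two on this sample.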
By default, the selection method is numTopFeatures, with the default number of top features set to 50.
Constructor Summary
| Constructor | Description |
| --- | --- |
| ChiSqSelector() | |
| ChiSqSelector(int numTopFeatures) | This is the same as calling this() and setNumTopFeatures(numTopFeatures). |
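Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| double | fdr() | |
| ChiSqSelectorModel | fit(RDD<LabeledPoint> data) | Returns a ChiSquared feature selector. |
| double | fpr() | |
| double | fwe() | |
| int | numTopFeatures() | |
| double | percentile() | |
| String | selectorType() | |
| ChiSqSelector | setFdr(double value) | |
| ChiSqSelector | setFpr(double value) | |
| ChiSqSelector | setFwe(double value) | |
| ChiSqSelector | setNumTopFeatures(int value) | |
| ChiSqSelector | setPercentile(double value) | |
| ChiSqSelector | setSelectorType(String value) | |
| static String[] | supportedSelectorTypes() | Set of selector types that ChiSqSelector supports. |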
Constructor Details
- ChiSqSelector
  public ChiSqSelector()
- ChiSqSelector
  public ChiSqSelector(int numTopFeatures)
  This is the same as calling this() and setNumTopFeatures(numTopFeatures).
  Parameters: numTopFeatures - (undocumented)
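Method Details
- supportedSelectorTypes
  public static String[] supportedSelectorTypes()
  Set of selector types that ChiSqSelector supports.
- numTopFeatures
  public int numTopFeatures()
- percentile
  public double percentile()
- fpr
  public double fpr()
- fdr
  public double fdr()
- fwe
  public double fwe()
- selectorType
  public String selectorType()
- setNumTopFeatures
  public ChiSqSelector setNumTopFeatures(int value)
- setPercentile
  public ChiSqSelector setPercentile(double value)
- setFpr
  public ChiSqSelector setFpr(double value)
- setFdr
  public ChiSqSelector setFdr(double value)
- setFwe
  public ChiSqSelector setFwe(double value)
- setSelectorType
  public ChiSqSelector setSelectorType(String value)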
- fit
  public ChiSqSelectorModel fit(RDD<LabeledPoint> data)
  Returns a ChiSquared feature selector.
  Parameters: data - an RDD[LabeledPoint] containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value. Apply a feature discretizer before using this function.
  Returns: (undocumented)
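The note on fit about real-valued features can be illustrated with a small, Spark-free sketch. Because every distinct value counts as its own category, a raw real-valued column tends to produce almost as many categories as rows; binning the values first collapses them into a handful. The class `DiscretizeSketch`, the equal-width binning scheme, and the sample values are assumptions chosen for illustration, and any other discretizer would serve the same purpose.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: equal-width binning of one feature column.
final class DiscretizeSketch {

    // Map a real value to one of `bins` equal-width categories 0 .. bins-1 over [min, max].
    // Values outside the range are clamped to the first or last bin.
    public static double toBin(double x, double min, double max, int bins) {
        if (x <= min) {
            return 0;
        }
        if (x >= max) {
            return bins - 1;
        }
        return Math.floor((x - min) / (max - min) * bins);
    }

    // Number of distinct values, i.e. the number of categories a chi-squared test would see.
    public static long distinctCount(double[] values) {
        return Arrays.stream(values).distinct().count();
    }

    public static void main(String[] args) {
        double[] raw = {0.12, 3.70, 7.45, 9.99, 0.13, 3.71}; // made-up feature values
        double[] binned = Arrays.stream(raw).map(x -> toBin(x, 0.0, 10.0, 4)).toArray();

        System.out.println(distinctCount(raw));      // 6 categories before binning
        System.out.println(distinctCount(binned));   // 4 categories after binning
        System.out.println(Arrays.toString(binned)); // [0.0, 1.0, 2.0, 3.0, 0.0, 1.0]
    }
}
```

The binned values are still doubles, so they can be stored as feature values as before; only the number of distinct categories changes.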