Package org.apache.spark.ml.feature
Class ChiSqSelector
Object
org.apache.spark.ml.PipelineStage
org.apache.spark.ml.Estimator<T>
org.apache.spark.ml.feature.ChiSqSelector
- All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, SelectorParams, Params, HasFeaturesCol, HasLabelCol, HasOutputCol, DefaultParamsWritable, Identifiable, MLWritable
Deprecated. Use UnivariateFeatureSelector instead. Since 3.1.1.
Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label.
The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.
- numTopFeatures chooses a fixed number of top features according to a chi-squared test.
- percentile is similar but chooses a fraction of all features instead of a fixed number.
- fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
- fdr uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
- fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is numTopFeatures, with the default number of top features set to 50.
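All of the selection methods above rank or filter features by the outcome of a chi-squared test between each feature and the label. As a rough illustration of the statistic behind such a test, the sketch below computes the Pearson chi-squared statistic from a table of observed counts. The class and method names (`ChiSquaredSketch`, `statistic`, `degreesOfFreedom`) are hypothetical and are not part of the Spark API; the p-value lookup from the chi-squared distribution is deliberately left out, and Spark's own implementation may differ in its details.

```java
/** Illustrative only: Pearson chi-squared statistic for one feature against the label. */
final class ChiSquaredSketch {

    /**
     * observed[i][j] is the number of rows with feature category i and label category j.
     * Returns the sum over all cells of (observed - expected)^2 / expected.
     */
    static double statistic(long[][] observed) {
        int rows = observed.length;
        int cols = observed[0].length;
        double[] rowSums = new double[rows];
        double[] colSums = new double[cols];
        double total = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSums[i] += observed[i][j];
                colSums[j] += observed[i][j];
                total += observed[i][j];
            }
        }
        double chi2 = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                // Expected count if feature and label were independent.
                double expected = rowSums[i] * colSums[j] / total;
                if (expected > 0.0) {
                    double diff = observed[i][j] - expected;
                    chi2 += diff * diff / expected;
                }
            }
        }
        return chi2;
    }

    /** Degrees of freedom of the test: (rows - 1) * (cols - 1). */
    static int degreesOfFreedom(long[][] observed) {
        return (observed.length - 1) * (observed[0].length - 1);
    }

    public static void main(String[] args) {
        long[][] counts = {{10, 20}, {20, 10}};
        System.out.println("chi2 = " + statistic(counts)
                + ", df = " + degreesOfFreedom(counts));
    }
}
```

A larger statistic at fixed degrees of freedom corresponds to a smaller p-value, which is why the selection methods can order features by ascending p-value.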
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
-
Constructor Summary
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| | copy | Deprecated. Creates a copy of this instance with the same UID and some extra params. |
| final DoubleParam | fdr() | The upper bound of the expected false discovery rate. |
| | featuresCol() | Param for features column name. |
| | fit | Deprecated. Fits a model to the input data. |
| final DoubleParam | fpr() | The highest p-value for features to be kept. |
| final DoubleParam | fwe() | The upper bound of the expected family-wise error rate. |
| | labelCol() | Param for label column name. |
| static ChiSqSelector | load | Deprecated. |
| final IntParam | numTopFeatures() | Number of features that selector will select, ordered by ascending p-value. |
| | outputCol() | Param for output column name. |
| final DoubleParam | percentile() | Percentile of features that selector will select, ordered by ascending p-value. |
| static MLReader<T> | read() | Deprecated. |
| | selectorType() | The selector type. |
| | setFdr(double value) | Deprecated. |
| | setFeaturesCol(String value) | Deprecated. |
| | setFpr(double value) | Deprecated. |
| | setFwe(double value) | Deprecated. |
| | setLabelCol(String value) | Deprecated. |
| | setNumTopFeatures(int value) | Deprecated. |
| | setOutputCol(String value) | Deprecated. |
| | setPercentile(double value) | Deprecated. |
| | setSelectorType(String value) | Deprecated. |
| | transformSchema(StructType schema) | Deprecated. Check transform validity and derive the output schema from the input schema. |
| | uid() | Deprecated. An immutable unique ID for the object and its derivatives. |

Methods inherited from class org.apache.spark.ml.PipelineStage
params
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable
write
Methods inherited from interface org.apache.spark.ml.param.shared.HasFeaturesCol
getFeaturesCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasLabelCol
getLabelCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasOutputCol
getOutputCol
Methods inherited from interface org.apache.spark.ml.util.Identifiable
toString
Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
Methods inherited from interface org.apache.spark.ml.util.MLWritable
save
Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn
Methods inherited from interface org.apache.spark.ml.feature.SelectorParams
getFdr, getFpr, getFwe, getNumTopFeatures, getPercentile, getSelectorType
-
Constructor Details
-
ChiSqSelector
Deprecated.
-
ChiSqSelector
public ChiSqSelector()
Deprecated.
-
-
Method Details
-
load
Deprecated.
-
read
Deprecated.
-
uid
Deprecated.
Description copied from interface: Identifiable
An immutable unique ID for the object and its derivatives.
- Returns:
- (undocumented)
-
setNumTopFeatures
Deprecated.
-
setPercentile
Deprecated.
-
setFpr
Deprecated.
-
setFdr
Deprecated.
-
setFwe
Deprecated.
-
setSelectorType
Deprecated.
-
setFeaturesCol
Deprecated.
-
setOutputCol
Deprecated.
-
setLabelCol
Deprecated.
-
fit
Deprecated.
Description copied from class: Estimator
Fits a model to the input data.
- Parameters:
dataset - (undocumented)
- Returns:
- (undocumented)
-
transformSchema
Deprecated.
Description copied from class: PipelineStage
Check transform validity and derive the output schema from the input schema.
We check validity for interactions between parameters during transformSchema and raise an exception if any parameter value is invalid. Parameter value checks which do not depend on other parameters are handled by Param.validate().
A typical implementation should first conduct verification on schema change and parameter validity, including complex parameter interaction checks.
- Parameters:
schema - (undocumented)
- Returns:
- (undocumented)
-
copy
Deprecated.
Description copied from interface: Params
Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
-
fdr
Description copied from interface: SelectorParams
The upper bound of the expected false discovery rate. Only applicable when selectorType = "fdr". Default value is 0.05.
- Specified by:
fdr in interface SelectorParams
- Returns:
- (undocumented)
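To make the role of this bound concrete, here is a minimal sketch of a Benjamini-Hochberg step-up selection over an array of per-feature p-values. It assumes the usual form of the procedure: sort p-values in ascending order, find the largest rank k with p(k) <= fdr * k / n, and keep the k features with the smallest p-values. The class `FdrSelectionSketch` is hypothetical and is not Spark's implementation, which may handle ties and boundaries differently.

```java
import java.util.Arrays;

/** Illustrative only: Benjamini-Hochberg selection on plain p-value arrays. */
final class FdrSelectionSketch {

    /** Returns the indices (ascending) of the features kept at the given fdr bound. */
    static int[] select(double[] pValues, double fdr) {
        int n = pValues.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Feature indices ordered by ascending p-value.
        Arrays.sort(order, (a, b) -> Double.compare(pValues[a], pValues[b]));

        // Largest rank (1-based) whose p-value is within its scaled threshold.
        int keep = 0;
        for (int rank = 1; rank <= n; rank++) {
            if (pValues[order[rank - 1]] <= fdr * rank / n) {
                keep = rank;
            }
        }

        int[] selected = new int[keep];
        for (int k = 0; k < keep; k++) {
            selected[k] = order[k];
        }
        Arrays.sort(selected);
        return selected;
    }

    public static void main(String[] args) {
        double[] pValues = {0.02, 0.03, 0.035, 0.9};
        System.out.println(Arrays.toString(select(pValues, 0.05)));
    }
}
```

Note the step-up behavior: a feature whose p-value misses its own rank threshold can still be kept if a feature at a later rank passes.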
-
featuresCol
Description copied from interface: HasFeaturesCol
Param for features column name.
- Specified by:
featuresCol in interface HasFeaturesCol
- Returns:
- (undocumented)
-
fpr
Description copied from interface: SelectorParams
The highest p-value for features to be kept. Only applicable when selectorType = "fpr". Default value is 0.05.
- Specified by:
fpr in interface SelectorParams
- Returns:
- (undocumented)
-
fwe
Description copied from interface: SelectorParams
The upper bound of the expected family-wise error rate. Only applicable when selectorType = "fwe". Default value is 0.05.
- Specified by:
fwe in interface SelectorParams
- Returns:
- (undocumented)
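The fpr and fwe methods are both plain threshold filters and differ only in how the threshold is derived: fpr compares each p-value with the bound directly, while fwe first scales the bound by 1/numFeatures. The sketch below contrasts the two on the same p-values. `ThresholdSelectionSketch` and its methods are hypothetical names, and the strict "below" comparison is an assumption taken from the class description rather than from Spark's source.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Illustrative only: threshold-based selection for the fpr and fwe methods. */
final class ThresholdSelectionSketch {

    /** Indices of all features whose p-value is strictly below the threshold. */
    static int[] selectBelow(double[] pValues, double threshold) {
        return IntStream.range(0, pValues.length)
                .filter(i -> pValues[i] < threshold)
                .toArray();
    }

    /** fpr: the bound is used as the threshold as-is. */
    static int[] selectFpr(double[] pValues, double fpr) {
        return selectBelow(pValues, fpr);
    }

    /** fwe: the bound is scaled by 1/numFeatures before comparing. */
    static int[] selectFwe(double[] pValues, double fwe) {
        return selectBelow(pValues, fwe / pValues.length);
    }

    public static void main(String[] args) {
        double[] pValues = {0.001, 0.02, 0.04, 0.3};
        System.out.println("fpr: " + Arrays.toString(selectFpr(pValues, 0.05)));
        System.out.println("fwe: " + Arrays.toString(selectFwe(pValues, 0.05)));
    }
}
```

With the same bound, fwe is never less strict than fpr, so it keeps a subset of the features that fpr keeps.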
-
labelCol
Description copied from interface: HasLabelCol
Param for label column name.
- Specified by:
labelCol in interface HasLabelCol
- Returns:
- (undocumented)
-
numTopFeatures
Description copied from interface: SelectorParams
Number of features that selector will select, ordered by ascending p-value. If the number of features is less than numTopFeatures, then this will select all features. Only applicable when selectorType = "numTopFeatures". The default value of numTopFeatures is 50.
- Specified by:
numTopFeatures in interface SelectorParams
- Returns:
- (undocumented)
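The rule described here, including the fallback to all features when fewer than numTopFeatures exist, can be sketched as follows. `TopFeaturesSketch` is a hypothetical helper operating on a plain array of p-values, not Spark code; how Spark breaks ties between equal p-values is not covered.

```java
import java.util.Arrays;

/** Illustrative only: keep the numTopFeatures features with the smallest p-values. */
final class TopFeaturesSketch {

    /** Returns the indices (ascending) of the selected features. */
    static int[] select(double[] pValues, int numTopFeatures) {
        int n = pValues.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Feature indices ordered by ascending p-value.
        Arrays.sort(order, (a, b) -> Double.compare(pValues[a], pValues[b]));

        // With fewer features than requested, every feature is kept.
        int keep = Math.min(numTopFeatures, n);
        int[] selected = new int[keep];
        for (int k = 0; k < keep; k++) {
            selected[k] = order[k];
        }
        Arrays.sort(selected);
        return selected;
    }

    public static void main(String[] args) {
        double[] pValues = {0.2, 0.01, 0.5, 0.03};
        System.out.println(Arrays.toString(select(pValues, 2)));
        System.out.println(Arrays.toString(select(pValues, 50)));
    }
}
```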
-
outputCol
Description copied from interface: HasOutputCol
Param for output column name.
- Specified by:
outputCol in interface HasOutputCol
- Returns:
- (undocumented)
-
percentile
Description copied from interface: SelectorParams
Percentile of features that selector will select, ordered by ascending p-value. Only applicable when selectorType = "percentile". Default value is 0.1.
- Specified by:
percentile in interface SelectorParams
- Returns:
- (undocumented)
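Since percentile is a fraction of the total feature count, the number of features kept depends on how the product is rounded. The sketch below assumes truncation toward zero, which is an assumption for illustration and should be checked against the Spark version in use. `PercentileSelectionSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

/** Illustrative only: keep a fraction of the features, smallest p-values first. */
final class PercentileSelectionSketch {

    /** Assumed rounding rule: truncate numFeatures * percentile toward zero. */
    static int featuresToKeep(int numFeatures, double percentile) {
        return (int) (numFeatures * percentile);
    }

    /** Returns the indices (ascending) of the selected features. */
    static int[] select(double[] pValues, double percentile) {
        int keep = featuresToKeep(pValues.length, percentile);
        return IntStream.range(0, pValues.length)
                .boxed()
                .sorted(Comparator.comparingDouble((Integer i) -> pValues[i]))
                .limit(keep)
                .mapToInt(Integer::intValue)
                .sorted()
                .toArray();
    }

    public static void main(String[] args) {
        double[] pValues = {0.2, 0.01, 0.5, 0.03, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3};
        System.out.println(Arrays.toString(select(pValues, 0.1)));
    }
}
```

Under this assumption, the default of 0.1 keeps no features at all when there are fewer than ten, so small feature sets may call for a larger percentile or a different selector type.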
-
selectorType
Description copied from interface: SelectorParams
The selector type. Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe".
- Specified by:
selectorType in interface SelectorParams
- Returns:
- (undocumented)
-