Package org.apache.spark.ml.feature
Class RegexTokenizer
Object
org.apache.spark.ml.PipelineStage
org.apache.spark.ml.Transformer
org.apache.spark.ml.UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>
org.apache.spark.ml.feature.RegexTokenizer
- All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, Params, HasInputCol, HasOutputCol, DefaultParamsWritable, Identifiable, MLWritable, scala.Serializable
public class RegexTokenizer
extends UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>
implements DefaultParamsWritable
A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split the text (the default) or by repeatedly matching the regex (if gaps is false).
An optional parameter also allows filtering tokens by a minimum length.
It returns an array of strings that can be empty.
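The description above can be illustrated without a Spark session. The following is a minimal sketch in plain Java that mimics the documented parameters with java.util.regex; it is not Spark's implementation. The class RegexTokenizerSketch and its tokenize method are hypothetical names introduced only for this illustration, and the use of Locale.ROOT for lowercasing is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark: mimics the documented parameters.
final class RegexTokenizerSketch {

    static List<String> tokenize(String text, String pattern, boolean gaps,
                                 int minTokenLength, boolean toLowercase) {
        // Lowercasing is applied before tokenizing (Locale.ROOT is an assumption).
        String input = toLowercase ? text.toLowerCase(Locale.ROOT) : text;
        Pattern regex = Pattern.compile(pattern);

        List<String> candidates = new ArrayList<>();
        if (gaps) {
            // gaps = true: the pattern describes delimiters, so split on it.
            for (String piece : regex.split(input)) {
                candidates.add(piece);
            }
        } else {
            // gaps = false: the pattern describes tokens, so collect every match.
            Matcher matcher = regex.matcher(input);
            while (matcher.find()) {
                candidates.add(matcher.group());
            }
        }

        // Keep only tokens whose length is at least minTokenLength.
        List<String> tokens = new ArrayList<>();
        for (String candidate : candidates) {
            if (candidate.length() >= minTokenLength) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Documented defaults: pattern "\\s+", gaps = true, minTokenLength = 1, toLowercase = true.
        System.out.println(tokenize("Hello  Spark ML", "\\s+", true, 1, true));
        // Matching mode with a case-sensitive token pattern and lowercasing disabled.
        System.out.println(tokenize("Hello  Spark ML", "[A-Z]\\w*", false, 1, false));
    }
}
```

Because lowercasing happens before the pattern is applied, a pattern that relies on uppercase character classes would find nothing in this sketch when toLowercase is left at true.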
-
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.SparkShellLoggingFilter
-
Constructor Summary
-
Method Summary
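Modifier and Type | Method | Description
RegexTokenizer | copy(ParamMap extra) | Creates a copy of this instance with the same UID and some extra params.
BooleanParam | gaps() | Indicates whether regex splits on gaps (true) or matches tokens (false).
boolean | getGaps() |
int | getMinTokenLength() |
String | getPattern() |
boolean | getToLowercase() |
static RegexTokenizer | load(String path) |
IntParam | minTokenLength() | Minimum token length, greater than or equal to 0.
Param<String> | pattern() | Regex pattern used to match delimiters if gaps() is true or tokens if gaps() is false.
static MLReader<T> | read() |
RegexTokenizer | setGaps(boolean value) |
RegexTokenizer | setMinTokenLength(int value) |
RegexTokenizer | setPattern(String value) |
RegexTokenizer | setToLowercase(boolean value) |
final BooleanParam | toLowercase() | Indicates whether to convert all characters to lowercase before tokenizing.
String | toString() |
String | uid() | An immutable unique ID for the object and its derivatives.
Methods inherited from class org.apache.spark.ml.UnaryTransformer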
inputCol, outputCol, setInputCol, setOutputCol, transform, transformSchema
Methods inherited from class org.apache.spark.ml.Transformer
transform, transform, transform
Methods inherited from class org.apache.spark.ml.PipelineStage
params
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable
write
Methods inherited from interface org.apache.spark.ml.param.shared.HasInputCol
getInputCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasOutputCol
getOutputCol
Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq
Methods inherited from interface org.apache.spark.ml.util.MLWritable
save
Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn
-
Constructor Details
-
RegexTokenizer
-
RegexTokenizer
public RegexTokenizer()
-
-
Method Details
-
load
-
read
-
uid
Description copied from interface: Identifiable
An immutable unique ID for the object and its derivatives.
- Specified by:
uid in interface Identifiable
- Returns:
- (undocumented)
-
minTokenLength
Minimum token length, greater than or equal to 0. Default: 1, to avoid returning empty strings.
- Returns:
- (undocumented)
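To see why a default of 1 helps avoid empty strings, consider how regex splitting behaves when the text begins with a delimiter. The sketch below uses plain java.util.regex and assumes that splitting in the tokenizer behaves the same way; MinTokenLengthSketch and splitAndFilter are hypothetical names, not part of Spark.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Spark: split on a delimiter regex, then filter by length.
final class MinTokenLengthSketch {

    static List<String> splitAndFilter(String text, String delimiterRegex, int minTokenLength) {
        return Arrays.stream(text.split(delimiterRegex))
                // Keep only tokens whose length is at least minTokenLength.
                .filter(token -> token.length() >= minTokenLength)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A leading delimiter makes split() produce an empty first element.
        System.out.println(splitAndFilter(" a b", "\\s+", 0).size()); // 3 elements, the first is ""
        // With a minimum length of 1, the empty element is dropped.
        System.out.println(splitAndFilter(" a b", "\\s+", 1));        // [a, b]
    }
}
```

Setting the minimum to 0 keeps such empty elements, which is rarely what a downstream feature stage expects.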
-
setMinTokenLength
-
getMinTokenLength
public int getMinTokenLength()
-
gaps
Indicates whether regex splits on gaps (true) or matches tokens (false). Default: true
- Returns:
- (undocumented)
-
setGaps
-
getGaps
public boolean getGaps()
-
pattern
Regex pattern used to match delimiters if gaps() is true or tokens if gaps() is false. Default: "\\s+"
- Returns:
- (undocumented)
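The two roles a pattern can play are easiest to compare side by side. The sketch below uses plain java.util.regex rather than Spark; PatternRoleSketch, splitOn, and matchAll are hypothetical names for illustration. Note that "\\s+" as written in the default is a Java string literal for the regex \s+.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark: contrasts delimiter patterns with token patterns.
final class PatternRoleSketch {

    // Pattern as delimiter (corresponds to gaps = true).
    static List<String> splitOn(String text, String delimiterRegex) {
        return Arrays.asList(Pattern.compile(delimiterRegex).split(text));
    }

    // Pattern as token (corresponds to gaps = false).
    static List<String> matchAll(String text, String tokenRegex) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(tokenRegex).matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "price: $42.50, qty: 3";
        // Splitting on non-word characters breaks the decimal number apart.
        System.out.println(splitOn(text, "\\W+"));                  // [price, 42, 50, qty, 3]
        // Matching tokens directly can keep the decimal number whole.
        System.out.println(matchAll(text, "\\d+(\\.\\d+)?|[a-z]+")); // [price, 42.50, qty, 3]
    }
}
```

Matching mode tends to suit cases where the shape of a token is easier to describe than the shape of everything between tokens.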
-
setPattern
-
getPattern
-
toLowercase
Indicates whether to convert all characters to lowercase before tokenizing. Default: true
- Returns:
- (undocumented)
-
setToLowercase
-
getToLowercase
public boolean getToLowercase()
-
copy
Description copied from interface: Params
Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
- Specified by:
copy in interface Params
- Overrides:
copy in class UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>
- Parameters:
extra - (undocumented)
- Returns:
- (undocumented)
-
toString
- Specified by:
toString in interface Identifiable
- Overrides:
toString in class Object
-