Package org.apache.spark.ml.feature
Class RegexTokenizer
Object
org.apache.spark.ml.PipelineStage
org.apache.spark.ml.Transformer
org.apache.spark.ml.UnaryTransformer<String,scala.collection.immutable.Seq<String>,RegexTokenizer>
  
org.apache.spark.ml.feature.RegexTokenizer
All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, Params, HasInputCol, HasOutputCol, DefaultParamsWritable, Identifiable, MLWritable
public class RegexTokenizer
extends UnaryTransformer<String,scala.collection.immutable.Seq<String>,RegexTokenizer>
implements DefaultParamsWritable  
A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split
the text (default) or by repeatedly matching the regex (if gaps is false).
Optional parameters also allow filtering tokens by a minimum length.
It returns an array of strings that can be empty.
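The behavior described above can be sketched in plain Java with java.util.regex. This is a minimal illustration of the documented semantics, not Spark's implementation: the class RegexTokenizerSketch and the helper tokenize are hypothetical names, and the order of steps (lowercase, then split or match, then filter by length) is an assumption drawn from the parameter descriptions on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizerSketch {
    // Hypothetical helper that mirrors the documented parameters; not part of Spark.
    static List<String> tokenize(String text, String regex, boolean gaps,
                                 int minTokenLength, boolean toLowercase) {
        // Assumption: lowercasing happens before the regex is applied.
        String input = toLowercase ? text.toLowerCase(Locale.ROOT) : text;
        Pattern pattern = Pattern.compile(regex);
        List<String> tokens = new ArrayList<>();
        if (gaps) {
            // gaps = true: the regex describes delimiters, so split on it.
            for (String piece : pattern.split(input)) {
                tokens.add(piece);
            }
        } else {
            // gaps = false: the regex describes tokens, so collect every match.
            Matcher matcher = pattern.matcher(input);
            while (matcher.find()) {
                tokens.add(matcher.group());
            }
        }
        // Drop tokens shorter than the minimum length.
        tokens.removeIf(token -> token.length() < minTokenLength);
        return tokens;
    }

    public static void main(String[] args) {
        // Both calls print [hello, spark, ml]
        System.out.println(tokenize("Hello  Spark ML", "\\s+", true, 1, true));
        System.out.println(tokenize("Hello  Spark ML", "\\w+", false, 1, true));
    }
}
```

With the documented defaults (pattern "\\s+", gaps true, minTokenLength 1, toLowercase true), the first call corresponds to an unconfigured tokenizer under the assumptions above.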

Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter

Constructor Summary
Constructors

Method Summary
Modifier and Type, Method, Description
- copy: Creates a copy of this instance with the same UID and some extra params.
- gaps(): Indicates whether regex splits on gaps (true) or matches tokens (false).
- boolean getGaps()
- int getMinTokenLength()
- getPattern
- boolean getToLowercase()
- static RegexTokenizer load
- minTokenLength: Minimum token length, greater than or equal to 0.
- pattern()
- static MLReader<T> read()
- setGaps(boolean value)
- setMinTokenLength(int value)
- setPattern(String value)
- setToLowercase(boolean value)
- final BooleanParam toLowercase: Indicates whether to convert all characters to lowercase before tokenizing.
- toString()
- uid(): An immutable unique ID for the object and its derivatives.

Methods inherited from class org.apache.spark.ml.UnaryTransformer
inputCol, outputCol, setInputCol, setOutputCol, transform, transformSchema

Methods inherited from class org.apache.spark.ml.Transformer
transform, transform, transform

Methods inherited from class org.apache.spark.ml.PipelineStage
params

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable
write

Methods inherited from interface org.apache.spark.ml.param.shared.HasInputCol
getInputCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasOutputCol
getOutputCol

Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logBasedOnLevel, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, MDC, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext

Methods inherited from interface org.apache.spark.ml.util.MLWritable
save

Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, estimateMatadataSize, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn

Constructor Details
RegexTokenizer

RegexTokenizer
public RegexTokenizer()

Method Details
load

read

uid
Description copied from interface: Identifiable
An immutable unique ID for the object and its derivatives.
Specified by:
uid in interface Identifiable
Returns:
(undocumented)

minTokenLength
Minimum token length, greater than or equal to 0. Default: 1, to avoid returning empty strings.
Returns:
(undocumented)

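To illustrate why a minimum length of 1 avoids empty strings, the sketch below splits a string that begins with a delimiter. It uses only java.util.regex; MinTokenLengthSketch and splitAndFilter are hypothetical names, and the code is an approximation of the documented filtering, not Spark's own code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class MinTokenLengthSketch {
    // Hypothetical helper: split on the regex, then keep tokens of at least minTokenLength characters.
    static List<String> splitAndFilter(String text, String regex, int minTokenLength) {
        // Copy into an ArrayList so that removeIf is supported.
        List<String> tokens = new ArrayList<>(Arrays.asList(Pattern.compile(regex).split(text)));
        tokens.removeIf(token -> token.length() < minTokenLength);
        return tokens;
    }

    public static void main(String[] args) {
        String text = " a bc def"; // leading whitespace yields a leading empty string when split

        // Minimum length 0 keeps the empty leading token, so four tokens remain.
        System.out.println(splitAndFilter(text, "\\s+", 0).size()); // prints 4

        // Minimum length 1 drops the empty string.
        System.out.println(splitAndFilter(text, "\\s+", 1)); // prints [a, bc, def]

        // Larger values also drop short tokens.
        System.out.println(splitAndFilter(text, "\\s+", 2)); // prints [bc, def]
    }
}
```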
setMinTokenLength

getMinTokenLength
public int getMinTokenLength()

gaps
Indicates whether regex splits on gaps (true) or matches tokens (false). Default: true
Returns:
(undocumented)

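The two modes suit different inputs. As a rough illustration in plain Java (GapsModesSketch, splitOnGaps and matchTokens are hypothetical names, and this is not Spark's code), splitting fits text with a well-defined delimiter, while matching fits cases where the tokens themselves are easier to describe than what separates them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GapsModesSketch {
    // Corresponds to gaps = true: the regex matches the delimiters between tokens.
    static List<String> splitOnGaps(String text, String regex) {
        return Arrays.asList(Pattern.compile(regex).split(text));
    }

    // Corresponds to gaps = false: the regex matches the tokens themselves.
    static List<String> matchTokens(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The delimiter is simple to describe: a comma.
        System.out.println(splitOnGaps("red,green,blue", ",")); // prints [red, green, blue]

        // The tokens are simple to describe: runs of digits.
        System.out.println(matchTokens("order 66 shipped 2 items", "\\d+")); // prints [66, 2]
    }
}
```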
setGaps

getGaps
public boolean getGaps()

pattern
Regex pattern used to match delimiters if gaps() is true or tokens if gaps() is false. Default: "\\s+"
Returns:
(undocumented)

setPattern

getPattern

toLowercase
Indicates whether to convert all characters to lowercase before tokenizing. Default: true
Returns:
(undocumented)

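Because lowercasing is described as happening before tokenizing, it can change which substrings a case-sensitive pattern matches. The plain-Java sketch below illustrates this under that assumption; LowercaseSketch and matchTokens are hypothetical names and the code does not come from Spark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LowercaseSketch {
    // Hypothetical helper: optionally lowercase the text, then collect every regex match.
    static List<String> matchTokens(String text, String regex, boolean toLowercase) {
        String input = toLowercase ? text.toLowerCase(Locale.ROOT) : text;
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Lowercasing first lets a lowercase-only pattern match whole words.
        System.out.println(matchTokens("Hello World", "[a-z]+", true));  // prints [hello, world]

        // Without lowercasing, the capital letters fall outside the pattern.
        System.out.println(matchTokens("Hello World", "[a-z]+", false)); // prints [ello, orld]
    }
}
```

If the original casing matters, a pattern such as "[A-Za-z]+" together with toLowercase set to false would be one way to keep it.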
setToLowercase

getToLowercase
public boolean getToLowercase()

copy
Description copied from interface: Params
Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
Specified by:
copy in interface Params
Overrides:
copy in class UnaryTransformer<String,scala.collection.immutable.Seq<String>,RegexTokenizer>
Parameters:
extra - (undocumented)
Returns:
(undocumented)

toString
Specified by:
toString in interface Identifiable
Overrides:
toString in class Object