public class RegexTokenizer extends UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer> implements DefaultParamsWritable
A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false).
Optional parameters also allow filtering tokens using a minimal length.
It returns an array of strings that can be empty.

| Constructor and Description |
|---|
| RegexTokenizer() |
| RegexTokenizer(String uid) |
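The class description above distinguishes two modes: with gaps = true the pattern splits the text on delimiters, and with gaps = false it repeatedly matches the tokens themselves. A minimal sketch of that behavior using plain java.util.regex (a hypothetical mimic for illustration, not Spark's implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper mimicking RegexTokenizer's core semantics:
// gaps = true splits on the pattern, gaps = false repeatedly matches it.
public class TokenizerSketch {
    public static List<String> tokenize(String text, String regex, boolean gaps,
                                        int minTokenLength, boolean toLowercase) {
        String input = toLowercase ? text.toLowerCase(Locale.ROOT) : text;
        List<String> out = new ArrayList<>();
        if (gaps) {
            // Pattern names the delimiters; split produces the tokens.
            for (String t : Pattern.compile(regex).split(input)) {
                if (t.length() >= minTokenLength) out.add(t);
            }
        } else {
            // Pattern names the tokens; each match is kept.
            Matcher m = Pattern.compile(regex).matcher(input);
            while (m.find()) {
                String t = m.group();
                if (t.length() >= minTokenLength) out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Split on whitespace (the default style: "\\s+", gaps = true).
        System.out.println(tokenize("Hello to the World", "\\s+", true, 1, true));
        // Match word tokens directly (gaps = false), dropping tokens shorter than 3.
        System.out.println(tokenize("Hello to the World", "\\w+", false, 3, true));
    }
}
```

Note that the minTokenLength filter also discards the empty strings that splitting can produce, which is why the summary warns the result array "can be empty".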
| Modifier and Type | Method and Description |
|---|---|
| RegexTokenizer | copy(ParamMap extra) Creates a copy of this instance with the same UID and some extra params. |
| BooleanParam | gaps() Indicates whether regex splits on gaps (true) or matches tokens (false). |
| boolean | getGaps() |
| int | getMinTokenLength() |
| String | getPattern() |
| boolean | getToLowercase() |
| static RegexTokenizer | load(String path) |
| IntParam | minTokenLength() Minimum token length, greater than or equal to 0. |
| Param<String> | pattern() Regex pattern used to match delimiters if gaps is true or tokens if gaps is false. |
| static MLReader<T> | read() |
| RegexTokenizer | setGaps(boolean value) |
| RegexTokenizer | setMinTokenLength(int value) |
| RegexTokenizer | setPattern(String value) |
| RegexTokenizer | setToLowercase(boolean value) |
| BooleanParam | toLowercase() Indicates whether to convert all characters to lowercase before tokenizing. |
| String | toString() |
| String | uid() An immutable unique ID for the object and its derivatives. |
Methods inherited from class org.apache.spark.ml.UnaryTransformer: inputCol, outputCol, setInputCol, setOutputCol, transform, transformSchema

Methods inherited from class org.apache.spark.ml.Transformer: transform, transform, transform

Methods inherited from class org.apache.spark.ml.PipelineStage: params

Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable: write

Methods inherited from interface org.apache.spark.ml.util.MLWritable: save

Methods inherited from interface org.apache.spark.ml.param.shared.HasInputCol: getInputCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasOutputCol: getOutputCol

Methods inherited from interface org.apache.spark.ml.param.Params: clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn

Methods inherited from interface org.apache.spark.internal.Logging: $init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize

public RegexTokenizer(String uid)
public RegexTokenizer()
public static RegexTokenizer load(String path)
public static MLReader<T> read()
public String uid()
Specified by: uid in interface Identifiable

public IntParam minTokenLength()
public RegexTokenizer setMinTokenLength(int value)
public int getMinTokenLength()
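The minTokenLength parameter above filters out tokens shorter than the threshold after tokenization. A small hypothetical illustration (not Spark code) of that filtering step:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of the minTokenLength filter: tokens shorter
// than the threshold are dropped after the text has been tokenized.
public class MinLenDemo {
    public static List<String> filterShort(String[] tokens, int minTokenLength) {
        return Arrays.stream(tokens)
                     .filter(t -> t.length() >= minTokenLength)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String[] tokens = "a be sea word".split("\\s+");
        // minTokenLength = 2 drops the single-character token "a".
        System.out.println(filterShort(tokens, 2)); // [be, sea, word]
    }
}
```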
public BooleanParam gaps()
public RegexTokenizer setGaps(boolean value)
public boolean getGaps()
public Param<String> pattern()
Regex pattern used to match delimiters if gaps is true or tokens if gaps is false.
Default: "\\s+"

public RegexTokenizer setPattern(String value)
public String getPattern()
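To make the pattern parameter's dual role concrete, the two interpretations can be sketched side by side with java.util.regex (a hypothetical sketch of the semantics, not Spark's implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of how the pattern parameter is interpreted:
// with gaps = true the regex names the delimiters; with gaps = false
// it names the tokens themselves.
public class PatternDemo {
    public static List<String> splitOnDelimiters(String text, String regex) {
        return Arrays.asList(Pattern.compile(regex).split(text));
    }

    public static List<String> matchTokens(String text, String regex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        // The default pattern "\\s+" treats runs of whitespace as delimiters.
        System.out.println(splitOnDelimiters("one  two\tthree", "\\s+"));
        // The same text tokenized with a token-matching pattern instead.
        System.out.println(matchTokens("one  two\tthree", "[a-z]+"));
    }
}
```

Both calls yield the same three tokens here; the choice matters when delimiters are irregular or tokens have a precise shape.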
public final BooleanParam toLowercase()
public RegexTokenizer setToLowercase(boolean value)
public boolean getToLowercase()
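When toLowercase is true, the input is lowercased before tokenization, so case variants collapse to a single token form. A hypothetical sketch of that effect (plain Java, not Spark code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: lowercasing happens before tokenization,
// so "Spark", "SPARK", and "spark" all become the same token.
public class LowercaseDemo {
    public static List<String> tokenize(String text, boolean toLowercase) {
        String input = toLowercase ? text.toLowerCase(Locale.ROOT) : text;
        return Arrays.asList(input.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Spark SPARK spark", true));  // [spark, spark, spark]
        System.out.println(tokenize("Spark SPARK spark", false)); // [Spark, SPARK, spark]
    }
}
```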
public RegexTokenizer copy(ParamMap extra)
Creates a copy of this instance with the same UID and some extra params. See defaultCopy().
Specified by: copy in interface Params
Overrides: copy in class UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>
Parameters: extra - (undocumented)

public String toString()
Specified by: toString in interface Identifiable
Overrides: toString in class Object