public class RegexTokenizer extends UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer> implements DefaultParamsWritable
A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.

| Constructor and Description |
|---|
| RegexTokenizer() |
| RegexTokenizer(String uid) |
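As a quick orientation, here is a minimal sketch of the class in use: it splits on the default whitespace pattern and writes the tokens to a new array column. The local SparkSession and the sentence/words column names are illustrative, not part of the API.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class RegexTokenizerExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("RegexTokenizerExample")
        .master("local[*]")
        .getOrCreate();

    List<Row> data = Arrays.asList(
        RowFactory.create("Hi I heard about Spark"),
        RowFactory.create("Logistic,regression,models,are,neat"));
    StructType schema = new StructType(new StructField[]{
        new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
    });
    Dataset<Row> sentences = spark.createDataFrame(data, schema);

    // Split on one-or-more whitespace characters (the default pattern).
    // Note the comma-separated row stays a single token under this pattern.
    RegexTokenizer tokenizer = new RegexTokenizer()
        .setInputCol("sentence")
        .setOutputCol("words")
        .setPattern("\\s+");

    tokenizer.transform(sentences).select("words").show(false);

    spark.stop();
  }
}
```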
| Modifier and Type | Method and Description |
|---|---|
| RegexTokenizer | copy(ParamMap extra) Creates a copy of this instance with the same UID and some extra params. |
| BooleanParam | gaps() Indicates whether regex splits on gaps (true) or matches tokens (false). |
| boolean | getGaps() |
| int | getMinTokenLength() |
| String | getPattern() |
| boolean | getToLowercase() |
| static RegexTokenizer | load(String path) |
| IntParam | minTokenLength() Minimum token length, greater than or equal to 0. |
| Param<String> | pattern() Regex pattern used to match delimiters if gaps is true or tokens if gaps is false. |
| static MLReader<T> | read() |
| RegexTokenizer | setGaps(boolean value) |
| RegexTokenizer | setMinTokenLength(int value) |
| RegexTokenizer | setPattern(String value) |
| RegexTokenizer | setToLowercase(boolean value) |
| BooleanParam | toLowercase() Indicates whether to convert all characters to lowercase before tokenizing. |
| String | toString() |
| String | uid() An immutable unique ID for the object and its derivatives. |
Methods inherited from class org.apache.spark.ml.UnaryTransformer: inputCol, outputCol, setInputCol, setOutputCol, transform, transformSchema

Methods inherited from class org.apache.spark.ml.Transformer: transform, transform, transform

Methods inherited from class org.apache.spark.ml.PipelineStage: params

Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable: write

Methods inherited from interface org.apache.spark.ml.util.MLWritable: save

Methods inherited from interface org.apache.spark.ml.param.shared.HasInputCol: getInputCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasOutputCol: getOutputCol

Methods inherited from interface org.apache.spark.ml.param.Params: clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn

Methods inherited from interface org.apache.spark.internal.Logging: $init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize
public RegexTokenizer(String uid)
public RegexTokenizer()
public static RegexTokenizer load(String path)
public static MLReader<T> read()
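load(String path) restores an instance previously written through the MLWritable side of the class. A minimal round trip might look like the sketch below; the /tmp path is illustrative, and an active SparkSession is assumed because ML persistence writes its metadata through Spark.

```java
import java.io.IOException;

import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.SparkSession;

public class TokenizerPersistence {
  public static void main(String[] args) throws IOException {
    // A session must be active for ML persistence to write metadata.
    SparkSession spark = SparkSession.builder()
        .appName("TokenizerPersistence").master("local[*]").getOrCreate();

    RegexTokenizer tokenizer = new RegexTokenizer()
        .setInputCol("sentence")
        .setOutputCol("words")
        .setMinTokenLength(2);

    tokenizer.save("/tmp/regex-tokenizer");           // from MLWritable
    RegexTokenizer restored = RegexTokenizer.load("/tmp/regex-tokenizer");
    System.out.println(restored.getMinTokenLength()); // 2: params survive the round trip

    spark.stop();
  }
}
```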
public String uid()
An immutable unique ID for the object and its derivatives.
Specified by:
uid in interface Identifiable
public IntParam minTokenLength()
Minimum token length, greater than or equal to 0. Default: 1, to avoid returning empty strings.
public RegexTokenizer setMinTokenLength(int value)
public int getMinTokenLength()
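For illustration, a short sketch of the length filter; with the minimum raised to 3, a sentence like "a be sea door" would tokenize to ["sea", "door"]. The column names are illustrative.

```java
import org.apache.spark.ml.feature.RegexTokenizer;

public class MinTokenLengthDemo {
  public static void main(String[] args) {
    // Tokens shorter than 3 characters are dropped from the output array;
    // the default of 1 already excludes empty strings.
    RegexTokenizer tokenizer = new RegexTokenizer()
        .setInputCol("sentence")
        .setOutputCol("words")
        .setMinTokenLength(3);
    System.out.println(tokenizer.getMinTokenLength()); // 3
  }
}
```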
public BooleanParam gaps()
Indicates whether regex splits on gaps (true) or matches tokens (false). Default: true
public RegexTokenizer setGaps(boolean value)
public boolean getGaps()
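The two modes are easiest to see side by side. In the sketch below (input text and column names are illustrative), the first tokenizer treats the pattern as the delimiter to split on, while the second treats it as the shape of the tokens to extract; on this input both yield the same tokens.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class GapsDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("GapsDemo").master("local[*]").getOrCreate();

    StructType schema = new StructType(new StructField[]{
        new StructField("sentence", DataTypes.StringType, false, Metadata.empty())});
    List<Row> data = Arrays.asList(RowFactory.create("foo,bar, baz"));
    Dataset<Row> df = spark.createDataFrame(data, schema);

    // gaps = true (default): the pattern describes the delimiters to split on.
    RegexTokenizer splitter = new RegexTokenizer()
        .setInputCol("sentence").setOutputCol("words")
        .setGaps(true).setPattern("[\\s,]+");

    // gaps = false: the pattern describes the tokens themselves.
    RegexTokenizer matcher = new RegexTokenizer()
        .setInputCol("sentence").setOutputCol("words")
        .setGaps(false).setPattern("\\w+");

    splitter.transform(df).select("words").show(false); // [foo, bar, baz]
    matcher.transform(df).select("words").show(false);  // [foo, bar, baz]

    spark.stop();
  }
}
```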
public Param<String> pattern()
Regex pattern used to match delimiters if gaps is true or tokens if gaps is false.
Default: "\\s+"
public RegexTokenizer setPattern(String value)
public String getPattern()
public final BooleanParam toLowercase()
Indicates whether to convert all characters to lowercase before tokenizing. Default: true
public RegexTokenizer setToLowercase(boolean value)
public boolean getToLowercase()
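A small sketch of opting out of the default lowercasing (column names illustrative):

```java
import org.apache.spark.ml.feature.RegexTokenizer;

public class CaseDemo {
  public static void main(String[] args) {
    // toLowercase defaults to true; with it disabled, "Spark" stays "Spark"
    // rather than becoming "spark" in the output tokens.
    RegexTokenizer caseSensitive = new RegexTokenizer()
        .setInputCol("sentence")
        .setOutputCol("words")
        .setToLowercase(false);
    System.out.println(caseSensitive.getToLowercase()); // false
  }
}
```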
public RegexTokenizer copy(ParamMap extra)
Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See Params.defaultCopy().
Specified by:
copy in interface Params
Overrides:
copy in class UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>
Parameters:
extra - (undocumented)

public String toString()
Specified by:
toString in interface Identifiable
Overrides:
toString in class Object
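To illustrate the copy contract, a minimal sketch: one param is overridden through the extra ParamMap while the UID and the remaining params carry over. The ParamMap.put/w plumbing is the standard Params API.

```java
import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.ml.param.ParamMap;

public class CopyDemo {
  public static void main(String[] args) {
    RegexTokenizer base = new RegexTokenizer().setMinTokenLength(2);

    // Override gaps in the copy; the UID and the other params carry over.
    ParamMap extra = new ParamMap().put(base.gaps().w(false));
    RegexTokenizer copied = base.copy(extra);

    System.out.println(copied.uid().equals(base.uid())); // true: same UID
    System.out.println(copied.getGaps());                // false: from the extra map
    System.out.println(copied.getMinTokenLength());      // 2: carried over
  }
}
```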