org.apache.spark.ml.feature
Class RegexTokenizer

java.lang.Object
  extended by org.apache.spark.ml.PipelineStage
      extended by org.apache.spark.ml.Transformer
          extended by org.apache.spark.ml.UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>
              extended by org.apache.spark.ml.feature.RegexTokenizer
All Implemented Interfaces:
java.io.Serializable, Logging, Params

public class RegexTokenizer
extends UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>

:: Experimental :: A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split the text (the default) or by repeatedly matching the regex (if gaps is false). An optional parameter also allows filtering out tokens shorter than a minimum length. It returns an array of strings that can be empty.
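The tokenizer's behavior can be sketched with plain java.util.regex. This is a hedged illustration of the semantics described above, not Spark's actual implementation; in Spark, the transformer applies this logic to each row of the input column.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizerSketch {
    // Emulates the documented semantics: split on the pattern when gaps is
    // true, otherwise collect every match of the pattern as a token, then
    // drop tokens shorter than minTokenLength.
    static List<String> tokenize(String text, String regex, boolean gaps, int minTokenLength) {
        List<String> tokens = new ArrayList<>();
        if (gaps) {
            for (String t : Pattern.compile(regex).split(text)) {
                tokens.add(t);
            }
        } else {
            Matcher m = Pattern.compile(regex).matcher(text);
            while (m.find()) {
                tokens.add(m.group());
            }
        }
        tokens.removeIf(t -> t.length() < minTokenLength);
        return tokens;
    }

    public static void main(String[] args) {
        // Defaults: pattern = "\\s+", gaps = true, minTokenLength = 1
        System.out.println(tokenize("Hello  Spark ML", "\\s+", true, 1));
        // → [Hello, Spark, ML]
        // gaps = false: the pattern matches the tokens themselves
        System.out.println(tokenize("Hello, Spark-ML!", "\\w+", false, 1));
        // → [Hello, Spark, ML]
    }
}
```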

See Also:
Serialized Form

Constructor Summary
RegexTokenizer()
           
RegexTokenizer(String uid)
           
 
Method Summary
 RegexTokenizer copy(ParamMap extra)
          Creates a copy of this instance with the same UID and some extra params.
 BooleanParam gaps()
          Indicates whether regex splits on gaps (true) or matches tokens (false).
 boolean getGaps()
           
 int getMinTokenLength()
           
 String getPattern()
           
 IntParam minTokenLength()
          Minimum token length, >= 0.
 Param<String> pattern()
          Regex pattern used to match delimiters if gaps is true or tokens if gaps is false.
 RegexTokenizer setGaps(boolean value)
           
 RegexTokenizer setMinTokenLength(int value)
           
 RegexTokenizer setPattern(String value)
           
 String uid()
           
 
Methods inherited from class org.apache.spark.ml.UnaryTransformer
setInputCol, setOutputCol, transform, transformSchema
 
Methods inherited from class org.apache.spark.ml.Transformer
transform, transform, transform
 
Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.spark.Logging
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
 
Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, paramMap, params, set, set, set, setDefault, setDefault, setDefault, shouldOwn, validateParams
 

Constructor Detail

RegexTokenizer

public RegexTokenizer(String uid)

RegexTokenizer

public RegexTokenizer()
Method Detail

uid

public String uid()

minTokenLength

public IntParam minTokenLength()
Minimum token length, >= 0. Default: 1, to avoid returning empty strings.

Returns:
the minimum token length param
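The effect of this parameter can be sketched in plain Java (a hedged illustration of the documented behavior, not Spark's implementation): after tokenization, tokens shorter than the minimum length are filtered out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MinTokenLengthDemo {
    public static void main(String[] args) {
        // Splitting "a db spark" on whitespace yields [a, db, spark];
        // with minTokenLength = 2 the single-character token "a" is dropped.
        List<String> tokens = Arrays.stream("a db spark".split("\\s+"))
                .filter(t -> t.length() >= 2)   // minTokenLength = 2
                .collect(Collectors.toList());
        System.out.println(tokens); // → [db, spark]
    }
}
```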

setMinTokenLength

public RegexTokenizer setMinTokenLength(int value)

getMinTokenLength

public int getMinTokenLength()

gaps

public BooleanParam gaps()
Indicates whether the regex splits on gaps (true) or matches tokens (false). Default: true.

Returns:
the gaps param
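The two modes can be contrasted with plain java.util.regex (a hedged sketch of the documented semantics, not Spark's code): with gaps true the pattern describes the delimiters between tokens; with gaps false the pattern describes the tokens themselves.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GapsDemo {
    public static void main(String[] args) {
        String text = "one,two,,three";

        // gaps = true: split on the pattern (here ",").
        System.out.println(Arrays.toString(text.split(",")));
        // → [one, two, , three]  -- note the empty token produced by ",,"

        // gaps = false: collect every match of the pattern as a token.
        Matcher m = Pattern.compile("[a-z]+").matcher(text);
        while (m.find()) {
            System.out.println(m.group()); // prints one, two, three in turn
        }
    }
}
```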

setGaps

public RegexTokenizer setGaps(boolean value)

getGaps

public boolean getGaps()

pattern

public Param<String> pattern()
Regex pattern used to match delimiters if gaps is true or tokens if gaps is false. Default: "\\s+"

Returns:
the regex pattern param

setPattern

public RegexTokenizer setPattern(String value)

getPattern

public String getPattern()

copy

public RegexTokenizer copy(ParamMap extra)
Description copied from interface: Params
Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly.

Specified by:
copy in interface Params
Overrides:
copy in class UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>
Parameters:
extra - extra param values to apply to the copy
Returns:
a copy of this instance with the extra params applied
See Also:
defaultCopy()