Class RegexTokenizer

All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, Params, HasInputCol, HasOutputCol, DefaultParamsWritable, Identifiable, MLWritable

public class RegexTokenizer extends UnaryTransformer<String,scala.collection.immutable.Seq<String>,RegexTokenizer> implements DefaultParamsWritable
A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split the text (the default) or by repeatedly matching the pattern (if gaps is false). Optional parameters also allow filtering out tokens shorter than a minimum length. It returns an array of strings, which can be empty.
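The two extraction modes described above can be illustrated with a minimal plain-Java sketch built only on java.util.regex. It is not Spark's implementation: the class name TokenizerModesSketch and the helpers splitOnGaps and matchTokens are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerModesSketch {
    // Analogue of gaps = true: the regex describes the separators between tokens.
    public static List<String> splitOnGaps(String text, String regex) {
        return new ArrayList<>(Arrays.asList(Pattern.compile(regex).split(text)));
    }

    // Analogue of gaps = false: the regex describes the tokens themselves,
    // so every successive match becomes one token.
    public static List<String> matchTokens(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "spark ml, regex tokenizer";
        // Splitting on whitespace keeps the comma attached to "ml,".
        System.out.println(splitOnGaps(text, "\\s+"));
        // Matching word characters drops the comma instead.
        System.out.println(matchTokens(text, "\\w+"));
    }
}
```

The same input yields different tokens in each mode, so the pattern has to be written with the chosen mode in mind.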
  • Constructor Details

    • RegexTokenizer

      public RegexTokenizer(String uid)
    • RegexTokenizer

      public RegexTokenizer()
  • Method Details

    • load

      public static RegexTokenizer load(String path)
    • read

      public static MLReader<RegexTokenizer> read()
    • uid

      public String uid()
      Description copied from interface: Identifiable
      An immutable unique ID for the object and its derivatives.
      Specified by:
      uid in interface Identifiable
      Returns:
      (undocumented)
    • minTokenLength

      public IntParam minTokenLength()
      Minimum token length; must be greater than or equal to 0. Default: 1, to avoid returning empty strings.
      Returns:
      (undocumented)
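The reason a default of 1 avoids empty strings can be seen in a small plain-Java sketch. It assumes splitting behaves like java.util.regex.Pattern.split, which emits an empty leading token when the text starts with a delimiter. The class MinTokenLengthSketch and its tokenize helper are hypothetical and not part of Spark.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class MinTokenLengthSketch {
    // Split on the gap regex, then keep only tokens whose length is
    // at least minTokenLength.
    public static List<String> tokenize(String text, String gapRegex, int minTokenLength) {
        return Arrays.stream(Pattern.compile(gapRegex).split(text))
                .filter(token -> token.length() >= minTokenLength)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "  hello big world";
        // With 0, the empty token produced by the leading whitespace is kept.
        System.out.println(tokenize(text, "\\s+", 0).size()); // 4
        // With 1, the empty token is filtered out.
        System.out.println(tokenize(text, "\\s+", 1).size()); // 3
        // Larger values also drop short words such as "big".
        System.out.println(tokenize(text, "\\s+", 4).size()); // 2
    }
}
```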
    • setMinTokenLength

      public RegexTokenizer setMinTokenLength(int value)
    • getMinTokenLength

      public int getMinTokenLength()
    • gaps

      public BooleanParam gaps()
      Indicates whether the regex splits on gaps (true) or matches tokens (false). Default: true
      Returns:
      (undocumented)
    • setGaps

      public RegexTokenizer setGaps(boolean value)
    • getGaps

      public boolean getGaps()
    • pattern

      public Param<String> pattern()
      Regex pattern used to match delimiters if gaps() is true, or tokens if gaps() is false. Default: "\\s+"
      Returns:
      (undocumented)
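The default is shown as a Java string literal: "\\s+" denotes the three-character regex \s+, which matches any run of whitespace. The sketch below illustrates this with java.util.regex only. DefaultPatternSketch, DEFAULT_PATTERN and splitOnDefault are hypothetical names, not Spark API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class DefaultPatternSketch {
    // The doubled backslash is Java source escaping; the regex itself is \s+.
    public static final String DEFAULT_PATTERN = "\\s+";

    // Treat the default pattern as a delimiter, as with gaps = true.
    public static List<String> splitOnDefault(String text) {
        return Arrays.asList(Pattern.compile(DEFAULT_PATTERN).split(text));
    }

    public static void main(String[] args) {
        // The pattern string holds three characters: backslash, 's', '+'.
        System.out.println(DEFAULT_PATTERN.length());
        // Spaces, tabs and newlines all count as one delimiter run.
        System.out.println(splitOnDefault("a \t\n b\tc"));
    }
}
```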
    • setPattern

      public RegexTokenizer setPattern(String value)
    • getPattern

      public String getPattern()
    • toLowercase

      public final BooleanParam toLowercase()
      Indicates whether to convert all characters to lowercase before tokenizing. Default: true
      Returns:
      (undocumented)
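Because lowercasing happens before tokenizing, a case-sensitive pattern is applied to text that is already lowercase. The plain-Java sketch below illustrates that ordering. It is an approximation, not Spark's code: LowercaseOrderSketch and matchTokens are hypothetical names, and the use of Locale.ROOT is an assumption made so that the example is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LowercaseOrderSketch {
    // Optionally lowercase the input first, then collect every regex match
    // (the analogue of gaps = false).
    public static List<String> matchTokens(String text, String tokenRegex, boolean toLowercase) {
        String input = toLowercase ? text.toLowerCase(Locale.ROOT) : text;
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(tokenRegex).matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A pattern requiring an uppercase initial finds nothing once the
        // text has been lowercased.
        System.out.println(matchTokens("Hello World", "[A-Z][a-z]+", true));
        // With lowercasing disabled, the same pattern matches both words.
        System.out.println(matchTokens("Hello World", "[A-Z][a-z]+", false));
    }
}
```

A pattern that relies on uppercase characters therefore only makes sense together with setToLowercase(false).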
    • setToLowercase

      public RegexTokenizer setToLowercase(boolean value)
    • getToLowercase

      public boolean getToLowercase()
    • copy

      public RegexTokenizer copy(ParamMap extra)
      Description copied from interface: Params
      Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
      Specified by:
      copy in interface Params
      Overrides:
      copy in class UnaryTransformer<String,scala.collection.immutable.Seq<String>,RegexTokenizer>
      Parameters:
      extra - (undocumented)
      Returns:
      (undocumented)
    • toString

      public String toString()
      Specified by:
      toString in interface Identifiable
      Overrides:
      toString in class Object