Class PrefixSpan

All Implemented Interfaces:
Serializable, Params, Identifiable, scala.Serializable

public final class PrefixSpan extends Object implements Params
A parallel PrefixSpan algorithm to mine frequent sequential patterns. The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. This class is not yet an Estimator/Transformer; use the findFrequentSequentialPatterns method to run the PrefixSpan algorithm.

  • Constructor Details

    • PrefixSpan

      public PrefixSpan(String uid)
    • PrefixSpan

      public PrefixSpan()
  • Method Details

    • copy

      public PrefixSpan copy(ParamMap extra)
      Description copied from interface: Params
      Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
      Specified by:
      copy in interface Params
      extra - (undocumented)
    • findFrequentSequentialPatterns

      public Dataset<Row> findFrequentSequentialPatterns(Dataset<?> dataset)
      Finds the complete set of frequent sequential patterns in the input sequences of itemsets.

      dataset - A dataset or a dataframe containing a sequence column of type ArrayType(ArrayType(T)), where T is the item type of the input dataset.
      Returns:
      A DataFrame that contains the sequence and its corresponding frequency. Its schema is:
        - sequence: ArrayType(ArrayType(T)) (T is the item type)
        - freq: Long
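      To make the freq column concrete, the following is a minimal brute-force sketch of how the frequency of one candidate pattern could be counted over a small in-memory dataset. It is neither the PrefixSpan algorithm nor Spark code. The class PatternSupportSketch and its methods contains and freq are hypothetical names introduced for illustration. The sketch assumes the usual definition of containment: a pattern occurs in a sequence when each of its itemsets is a subset of a distinct itemset of the sequence, in the same order.

```java
import java.util.List;
import java.util.Set;

// Illustrative only: brute-force frequency counting, not Spark's PrefixSpan.
final class PatternSupportSketch {

    // True if each itemset of the pattern is a subset of some itemset of the
    // sequence, with the matched itemsets appearing in the same order.
    public static boolean contains(List<Set<Integer>> sequence,
                                   List<Set<Integer>> pattern) {
        int matched = 0;
        for (Set<Integer> itemset : sequence) {
            if (matched < pattern.size()
                    && itemset.containsAll(pattern.get(matched))) {
                matched++; // greedy: take the earliest itemset that fits
            }
        }
        return matched == pattern.size();
    }

    // Number of sequences in the dataset that contain the pattern.
    public static long freq(List<List<Set<Integer>>> dataset,
                            List<Set<Integer>> pattern) {
        return dataset.stream().filter(s -> contains(s, pattern)).count();
    }

    public static void main(String[] args) {
        // Four sequences of itemsets, made up for this example.
        List<List<Set<Integer>>> dataset = List.of(
            List.of(Set.of(1, 2), Set.of(3)),
            List.of(Set.of(1), Set.of(3, 2), Set.of(1, 2)),
            List.of(Set.of(1, 2), Set.of(5)),
            List.of(Set.of(6)));

        // Pattern <(1)(3)> occurs in the first two sequences.
        System.out.println(freq(dataset, List.of(Set.of(1), Set.of(3)))); // prints 2
        // Pattern <(1 2)> occurs in the first three sequences.
        System.out.println(freq(dataset, List.of(Set.of(1, 2))));         // prints 3
    }
}
```

      Each row of the returned DataFrame pairs a pattern in the sequence column with a count of this kind in freq. The real algorithm reaches those counts through prefix projection, not by testing candidates one by one.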
    • getMaxLocalProjDBSize

      public long getMaxLocalProjDBSize()
    • getMaxPatternLength

      public int getMaxPatternLength()
    • getMinSupport

      public double getMinSupport()
    • getSequenceCol

      public String getSequenceCol()
    • maxLocalProjDBSize

      public LongParam maxLocalProjDBSize()
      Param for the maximum number of items (including delimiters used in the internal storage format) allowed in a projected database before local processing (default: 32000000). If a projected database exceeds this size, another iteration of distributed prefix growth is run.
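      The phrase "items (including delimiters used in the internal storage format)" can be illustrated with a small sketch. The description does not specify the storage format, so the layout below is an assumption: each sequence is flattened into one list with a 0 written before the first itemset and after every itemset, and real items are positive integers. ProjectedSizeSketch, flatten, storedSize, and fitsLocally are hypothetical names that are not part of the Spark API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: an assumed zero-delimited layout, used to show how
// delimiters add to the item count that is compared with the size limit.
final class ProjectedSizeSketch {

    // Flattens <(1 2)(3)> into [0, 1, 2, 0, 3, 0].
    public static List<Integer> flatten(List<List<Integer>> sequence) {
        List<Integer> out = new ArrayList<>();
        out.add(0); // leading delimiter
        for (List<Integer> itemset : sequence) {
            out.addAll(itemset);
            out.add(0); // delimiter closing this itemset
        }
        return out;
    }

    // Total stored entries (items plus delimiters) over all sequences.
    public static long storedSize(List<List<List<Integer>>> sequences) {
        long total = 0;
        for (List<List<Integer>> sequence : sequences) {
            total += flatten(sequence).size();
        }
        return total;
    }

    // A database that does not exceed the limit can be processed locally.
    public static boolean fitsLocally(long storedSize, long maxLocalProjDBSize) {
        return storedSize <= maxLocalProjDBSize;
    }

    public static void main(String[] args) {
        List<List<List<Integer>>> db = List.of(
            List.of(List.of(1, 2), List.of(3)),   // 3 items + 3 delimiters
            List.of(List.of(1), List.of(2)));     // 2 items + 3 delimiters
        long size = storedSize(db);
        System.out.println(size);                          // prints 11
        System.out.println(fitsLocally(size, 32000000L));  // prints true
    }
}
```

      Under this assumed layout, a database of many short sequences reaches the limit with fewer real items than one of a few long sequences, because every itemset contributes a delimiter.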
    • maxPatternLength

      public IntParam maxPatternLength()
      Param for the maximal pattern length (default: 10).
    • minSupport

      public DoubleParam minSupport()
      Param for the minimal support level (default: 0.1). Sequential patterns that appear more than (minSupport * size-of-the-dataset) times are identified as frequent sequential patterns.
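      As a worked illustration of the threshold arithmetic, the sketch below evaluates minSupport * size-of-the-dataset for two inputs. MinSupportSketch and threshold are hypothetical names, not part of the Spark API. The sketch only computes the product and makes no claim about rounding or about how a count exactly equal to the threshold is treated.

```java
// Illustrative only: converts a relative minSupport into an occurrence count.
final class MinSupportSketch {

    // Occurrence threshold implied by a relative support level.
    public static double threshold(double minSupport, long numSequences) {
        return minSupport * numSequences;
    }

    public static void main(String[] args) {
        // Default minSupport of 0.1 over 25 sequences: the threshold is about
        // 2.5, so a pattern found in 3 or more sequences clears it.
        System.out.println(threshold(0.1, 25));

        // minSupport of 0.5 over 4 sequences: the threshold is 2.
        System.out.println(threshold(0.5, 4));
    }
}
```

      Because the threshold scales with the number of input sequences, the same minSupport value demands more occurrences on a larger dataset.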
    • params

      public Param<?>[] params()
      Description copied from interface: Params
      Returns all params sorted by their names. The default implementation uses Java reflection to list all public methods that have no arguments and return Param.

      Specified by:
      params in interface Params
    • sequenceCol

      public Param<String> sequenceCol()
      Param for the name of the sequence column in the dataset (default: "sequence"). Rows with nulls in this column are ignored.
    • setMaxLocalProjDBSize

      public PrefixSpan setMaxLocalProjDBSize(long value)
    • setMaxPatternLength

      public PrefixSpan setMaxPatternLength(int value)
    • setMinSupport

      public PrefixSpan setMinSupport(double value)
    • setSequenceCol

      public PrefixSpan setSequenceCol(String value)
    • uid

      public String uid()
      Description copied from interface: Identifiable
      An immutable unique ID for the object and its derivatives.
      Specified by:
      uid in interface Identifiable