Class PowerIterationClustering

Object
org.apache.spark.ml.clustering.PowerIterationClustering
All Implemented Interfaces:
Serializable, PowerIterationClusteringParams, Params, HasMaxIter, HasWeightCol, DefaultParamsWritable, Identifiable, MLWritable

public class PowerIterationClustering extends Object implements PowerIterationClusteringParams, DefaultParamsWritable
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
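The truncated power iteration mentioned in the abstract can be illustrated with a small, dependency-free sketch. This is not Spark's implementation: the class `PicSketch` and its methods are hypothetical names, the affinity matrix is dense, the loop runs a fixed number of iterations with no convergence test, and a midpoint split stands in for the k-means step that would normally cluster the embedding.

```java
// Illustrative sketch only; PicSketch is a hypothetical helper, not part of Spark.
final class PicSketch {

    /** Row-normalizes a nonnegative affinity matrix A into W = D^-1 A. */
    static double[][] rowNormalize(double[][] a) {
        int n = a.length;
        double[][] w = new double[n][n];
        for (int i = 0; i < n; i++) {
            double degree = 0.0;
            for (int j = 0; j < n; j++) {
                degree += a[i][j];
            }
            if (degree == 0.0) {
                continue; // isolated vertex: its row stays all zeros
            }
            for (int j = 0; j < n; j++) {
                w[i][j] = a[i][j] / degree;
            }
        }
        return w;
    }

    /** Applies v <- W v / ||W v||_1 at most maxIter times (no early stopping on convergence). */
    static double[] embed(double[][] a, double[] v0, int maxIter) {
        double[][] w = rowNormalize(a);
        int n = a.length;
        double[] v = v0.clone();
        for (int t = 0; t < maxIter; t++) {
            double[] next = new double[n];
            double norm = 0.0;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    next[i] += w[i][j] * v[j];
                }
                norm += Math.abs(next[i]);
            }
            if (norm == 0.0) {
                break; // degenerate input: keep the last embedding
            }
            for (int i = 0; i < n; i++) {
                next[i] /= norm;
            }
            v = next;
        }
        return v;
    }

    /** Splits a 1-D embedding at the midpoint of its range (assumed stand-in for k-means with k = 2). */
    static int[] splitInTwo(double[] v) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double x : v) {
            min = Math.min(min, x);
            max = Math.max(max, x);
        }
        double mid = (min + max) / 2.0;
        int[] labels = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            labels[i] = v[i] < mid ? 0 : 1;
        }
        return labels;
    }
}
```

Stopping early matters here: run long enough, the iteration converges to a constant vector that carries no cluster information. On a graph made of two tightly connected triangles joined by one weak edge, a handful of iterations should leave the vertices of each triangle with nearly equal embedding values while the two triangles remain clearly apart.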

This class is not yet an Estimator/Transformer; use the assignClusters method to run the PowerIterationClustering algorithm.

  • Constructor Details

    • PowerIterationClustering

      public PowerIterationClustering()
  • Method Details

    • load

      public static PowerIterationClustering load(String path)
    • read

      public static MLReader<PowerIterationClustering> read()
    • k

      public final IntParam k()
      Description copied from interface: PowerIterationClusteringParams
      The number of clusters to create (k). Must be > 1. Default: 2.
      Specified by:
      k in interface PowerIterationClusteringParams
      Returns:
      (undocumented)
    • initMode

      public final Param<String> initMode()
      Description copied from interface: PowerIterationClusteringParams
      Param for the initialization algorithm. This can be either "random" to use a random vector as vertex properties, or "degree" to use a normalized sum of similarities with other vertices. Default: random.
      Specified by:
      initMode in interface PowerIterationClusteringParams
      Returns:
      (undocumented)
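To make the "degree" option concrete, here is a minimal sketch of one way to read "a normalized sum of similarities with other vertices". It assumes a dense affinity matrix and L1 normalization (dividing each vertex's summed similarity by the total over all vertices); `DegreeInitSketch` is a hypothetical helper and is not taken from Spark's source.

```java
// Illustrative sketch only; DegreeInitSketch is a hypothetical helper, not part of Spark.
final class DegreeInitSketch {

    /** Returns v0 where v0[i] = (sum of a[i][j] for j != i) / (sum of all such row sums). */
    static double[] degreeInit(double[][] a) {
        int n = a.length;
        double[] v0 = new double[n];
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i != j) {
                    v0[i] += a[i][j]; // only similarities with other vertices count
                }
            }
            total += v0[i];
        }
        if (total == 0.0) {
            throw new IllegalArgumentException("affinity matrix has no nonzero off-diagonal entries");
        }
        for (int i = 0; i < n; i++) {
            v0[i] /= total; // assumed normalization: entries sum to 1
        }
        return v0;
    }
}
```

Unlike a random start, this initialization is deterministic for a given input, which can make runs easier to compare.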
    • srcCol

      public Param<String> srcCol()
      Description copied from interface: PowerIterationClusteringParams
      Param for the name of the input column for source vertex IDs. Default: "src"
      Specified by:
      srcCol in interface PowerIterationClusteringParams
      Returns:
      (undocumented)
    • dstCol

      public Param<String> dstCol()
      Description copied from interface: PowerIterationClusteringParams
      Name of the input column for destination vertex IDs. Default: "dst"
      Specified by:
      dstCol in interface PowerIterationClusteringParams
      Returns:
      (undocumented)
    • weightCol

      public final Param<String> weightCol()
      Description copied from interface: HasWeightCol
      Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.
      Specified by:
      weightCol in interface HasWeightCol
      Returns:
      (undocumented)
    • maxIter

      public final IntParam maxIter()
      Description copied from interface: HasMaxIter
      Param for maximum number of iterations (>= 0).
      Specified by:
      maxIter in interface HasMaxIter
      Returns:
      (undocumented)
    • params

      public Param<?>[] params()
      Description copied from interface: Params
      Returns all params sorted by their names. The default implementation uses Java reflection to list all public methods that have no arguments and return Param.

      Specified by:
      params in interface Params
      Returns:
      (undocumented)
    • uid

      public String uid()
      Description copied from interface: Identifiable
      An immutable unique ID for the object and its derivatives.
      Specified by:
      uid in interface Identifiable
      Returns:
      (undocumented)
    • setK

      public PowerIterationClustering setK(int value)
    • setInitMode

      public PowerIterationClustering setInitMode(String value)
    • setMaxIter

      public PowerIterationClustering setMaxIter(int value)
    • setSrcCol

      public PowerIterationClustering setSrcCol(String value)
    • setDstCol

      public PowerIterationClustering setDstCol(String value)
    • setWeightCol

      public PowerIterationClustering setWeightCol(String value)
    • assignClusters

      public Dataset<Row> assignClusters(Dataset<?> dataset)
      Runs the PIC algorithm and returns a cluster assignment for each input vertex.

      Parameters:
      dataset - A dataset with columns src, dst, weight representing the affinity matrix, which is the matrix A in the PIC paper. Suppose the src column value is i, the dst column value is j, and the weight column value is the similarity s_ij, which must be nonnegative. The matrix is symmetric, hence s_ij = s_ji. For any (i, j) with nonzero similarity, there should be either (i, j, s_ij) or (j, i, s_ji) in the input. Rows with i = j are ignored, because s_ij is assumed to be 0.0 when i = j.

      Returns:
      A dataset that contains columns of vertex id and the corresponding cluster for the id. Its schema will be:
        - id: Long
        - cluster: Int
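The input contract described for assignClusters (nonnegative weights, one row per undirected pair, self-loops ignored) can be sketched without Spark as a conversion from (src, dst, weight) triples to a dense symmetric matrix. `AffinityInputSketch` is a hypothetical helper; it assumes vertex ids are small ints in the range 0 to n-1 rather than Long ids, and it simply lets a later row overwrite an earlier one if the same pair appears twice, which the description above does not specify.

```java
// Illustrative sketch only; AffinityInputSketch is a hypothetical helper, not part of Spark.
final class AffinityInputSketch {

    /** Builds a symmetric n x n affinity matrix from parallel src/dst/weight arrays. */
    static double[][] toAffinityMatrix(int n, int[] src, int[] dst, double[] weight) {
        if (src.length != dst.length || src.length != weight.length) {
            throw new IllegalArgumentException("src, dst and weight must have the same length");
        }
        double[][] a = new double[n][n];
        for (int r = 0; r < src.length; r++) {
            int i = src[r];
            int j = dst[r];
            double s = weight[r];
            if (s < 0.0) {
                throw new IllegalArgumentException("similarity must be nonnegative (row " + r + ")");
            }
            if (i == j) {
                continue; // self-similarity rows are ignored
            }
            a[i][j] = s; // one row (i, j, s) stands for both directions
            a[j][i] = s;
        }
        return a;
    }
}
```

Because each pair is mirrored, callers only need to supply one of (i, j) or (j, i) for every nonzero similarity.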
    • copy

      public PowerIterationClustering copy(ParamMap extra)
      Description copied from interface: Params
      Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
      Specified by:
      copy in interface Params
      Parameters:
      extra - (undocumented)
      Returns:
      (undocumented)