Class Word2VecModel

All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, Word2VecBase, Params, HasInputCol, HasMaxIter, HasOutputCol, HasSeed, HasStepSize, Identifiable, MLWritable

public class Word2VecModel extends Model<Word2VecModel> implements Word2VecBase, MLWritable
Model fitted by Word2Vec.
  • Method Details

    • read

      public static MLReader<Word2VecModel> read()
    • load

      public static Word2VecModel load(String path)
    • vectorSize

      public final IntParam vectorSize()
      Description copied from interface: Word2VecBase
      The dimension of the vector (code) that each word is transformed into. Default: 100
      Specified by:
      vectorSize in interface Word2VecBase
      Returns:
      (undocumented)
    • windowSize

      public final IntParam windowSize()
      Description copied from interface: Word2VecBase
      The window size (context words from [-window, window]). Default: 5
      Specified by:
      windowSize in interface Word2VecBase
      Returns:
      (undocumented)
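To make the [-window, window] range concrete, the following is a minimal sketch in plain Java, not Spark's implementation. The class name ContextWindowSketch and the method contextWords are hypothetical, and the sketch always uses the full window, which a real trainer may not do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of the Spark API.
class ContextWindowSketch {
    // Collects the words at offsets -window..window around `center`,
    // clipped to the sentence bounds and excluding the center word itself.
    static List<String> contextWords(List<String> sentence, int center, int window) {
        List<String> context = new ArrayList<>();
        int from = Math.max(0, center - window);
        int to = Math.min(sentence.size() - 1, center + window);
        for (int i = from; i <= to; i++) {
            if (i != center) {
                context.add(sentence.get(i));
            }
        }
        return context;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("the", "quick", "brown", "fox", "jumps");
        // Center word "brown" (index 2) with a window of 1.
        System.out.println(contextWords(sentence, 2, 1)); // prints [quick, fox]
    }
}
```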
    • numPartitions

      public final IntParam numPartitions()
      Description copied from interface: Word2VecBase
      Number of partitions for sentences of words. Default: 1
      Specified by:
      numPartitions in interface Word2VecBase
      Returns:
      (undocumented)
    • minCount

      public final IntParam minCount()
      Description copied from interface: Word2VecBase
      The minimum number of times a token must appear to be included in the word2vec model's vocabulary. Default: 5
      Specified by:
      minCount in interface Word2VecBase
      Returns:
      (undocumented)
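The effect of minCount on the vocabulary can be sketched in plain Java as a frequency filter. This is an illustration under assumptions, not Spark's code: the class MinCountSketch and the method vocabulary are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a hypothetical helper, not part of the Spark API.
class MinCountSketch {
    // Counts every token, then drops tokens that appear fewer than minCount times.
    static Map<String, Integer> vocabulary(List<String> tokens, int minCount) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        counts.values().removeIf(count -> count < minCount);
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "a", "c", "b", "a");
        // "c" appears once, so it is excluded when minCount is 2.
        System.out.println(vocabulary(tokens, 2)); // prints {a=3, b=2}
    }
}
```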
    • maxSentenceLength

      public final IntParam maxSentenceLength()
      Description copied from interface: Word2VecBase
      Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to maxSentenceLength size. Default: 1000
      Specified by:
      maxSentenceLength in interface Word2VecBase
      Returns:
      (undocumented)
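The chunking behavior described for maxSentenceLength can be sketched as follows in plain Java. The class SentenceChunkSketch and the method chunk are hypothetical names chosen for illustration, and the sketch only shows the splitting rule, not how Spark applies it during training.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of the Spark API.
class SentenceChunkSketch {
    // Splits a sentence into consecutive chunks of at most maxSentenceLength words.
    static List<List<String>> chunk(List<String> sentence, int maxSentenceLength) {
        if (maxSentenceLength <= 0) {
            throw new IllegalArgumentException("maxSentenceLength must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < sentence.size(); start += maxSentenceLength) {
            int end = Math.min(sentence.size(), start + maxSentenceLength);
            chunks.add(new ArrayList<>(sentence.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("w1", "w2", "w3", "w4", "w5");
        System.out.println(chunk(sentence, 2)); // prints [[w1, w2], [w3, w4], [w5]]
    }
}
```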
    • seed

      public final LongParam seed()
      Description copied from interface: HasSeed
      Param for random seed.
      Specified by:
      seed in interface HasSeed
      Returns:
      (undocumented)
    • stepSize

      public DoubleParam stepSize()
      Description copied from interface: HasStepSize
      Param for step size to be used for each iteration of optimization (> 0).
      Specified by:
      stepSize in interface HasStepSize
      Returns:
      (undocumented)
    • maxIter

      public final IntParam maxIter()
      Description copied from interface: HasMaxIter
      Param for maximum number of iterations (>= 0).
      Specified by:
      maxIter in interface HasMaxIter
      Returns:
      (undocumented)
    • outputCol

      public final Param<String> outputCol()
      Description copied from interface: HasOutputCol
      Param for output column name.
      Specified by:
      outputCol in interface HasOutputCol
      Returns:
      (undocumented)
    • inputCol

      public final Param<String> inputCol()
      Description copied from interface: HasInputCol
      Param for input column name.
      Specified by:
      inputCol in interface HasInputCol
      Returns:
      (undocumented)
    • uid

      public String uid()
      Description copied from interface: Identifiable
      An immutable unique ID for the object and its derivatives.
      Specified by:
      uid in interface Identifiable
      Returns:
      (undocumented)
    • getVectors

      public Dataset<Row> getVectors()
    • findSynonyms

      public Dataset<Row> findSynonyms(String word, int num)
      Find "num" number of words closest in similarity to the given word, not including the word itself.
      Parameters:
      word - (undocumented)
      num - (undocumented)
      Returns:
      a dataframe with columns "word" and "similarity", holding each synonym and its cosine similarity to the given word.
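The ranking described above (cosine similarity to the query word, with the query word itself excluded) can be sketched in plain Java. This is a simplified illustration, not Spark's implementation: SynonymSketch, cosine, and the in-memory Map of vectors are assumptions, and the sketch assumes all vectors are nonzero and of equal length.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical helper, not part of the Spark API.
class SynonymSketch {
    // Cosine similarity of two equal-length, nonzero vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns up to `num` vocabulary words most similar to `word`, never `word` itself.
    static List<String> findSynonyms(Map<String, double[]> vectors, String word, int num) {
        double[] query = vectors.get(word);
        if (query == null) {
            throw new IllegalArgumentException("word not in vocabulary: " + word);
        }
        List<String> candidates = new ArrayList<>();
        for (String candidate : vectors.keySet()) {
            if (!candidate.equals(word)) {
                candidates.add(candidate);
            }
        }
        // Sort by descending similarity to the query vector.
        candidates.sort((x, y) ->
            Double.compare(cosine(vectors.get(y), query), cosine(vectors.get(x), query)));
        return new ArrayList<>(candidates.subList(0, Math.min(num, candidates.size())));
    }

    public static void main(String[] args) {
        Map<String, double[]> vectors = new LinkedHashMap<>();
        vectors.put("king", new double[] {1.0, 0.0});
        vectors.put("queen", new double[] {0.9, 0.1});
        vectors.put("apple", new double[] {0.0, 1.0});
        System.out.println(findSynonyms(vectors, "king", 1)); // prints [queen]
    }
}
```

The vector-based overloads differ in that no word is excluded, so a query vector that belongs to a vocabulary word returns that word among the results.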
    • findSynonyms

      public Dataset<Row> findSynonyms(Vector vec, int num)
      Find "num" number of words whose vector representation is most similar to the supplied vector. If the supplied vector is the vector representation of a word in the model's vocabulary, that word will be in the results.
      Parameters:
      vec - (undocumented)
      num - (undocumented)
      Returns:
      a dataframe with columns "word" and "similarity", holding each synonym and its cosine similarity to the given word vector.
    • findSynonymsArray

      public scala.Tuple2<String,Object>[] findSynonymsArray(Vector vec, int num)
      Find "num" number of words whose vector representation is most similar to the supplied vector. If the supplied vector is the vector representation of a word in the model's vocabulary, that word will be in the results.
      Parameters:
      vec - (undocumented)
      num - (undocumented)
      Returns:
      an array of the words and the cosine similarities between the synonyms and the given word vector.
    • findSynonymsArray

      public scala.Tuple2<String,Object>[] findSynonymsArray(String word, int num)
      Find "num" number of words closest in similarity to the given word, not including the word itself.
      Parameters:
      word - (undocumented)
      num - (undocumented)
      Returns:
      an array of the words and the cosine similarities between the synonyms and the given word.
    • setInputCol

      public Word2VecModel setInputCol(String value)
    • setOutputCol

      public Word2VecModel setOutputCol(String value)
    • transform

      public Dataset<Row> transform(Dataset<?> dataset)
      Transform a sentence column to a vector column to represent the whole sentence. The transform is performed by averaging all word vectors it contains.
      Specified by:
      transform in class Transformer
      Parameters:
      dataset - (undocumented)
      Returns:
      (undocumented)
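The averaging step described for transform can be sketched in plain Java as below. This is a simplified illustration, not Spark's implementation: SentenceAverageSketch and average are hypothetical names, and the handling of edge cases (words missing from the vocabulary contribute nothing, the sum is divided by the sentence length, and an empty sentence yields a zero vector) is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical helper, not part of the Spark API.
class SentenceAverageSketch {
    // Averages the vectors of the words in a sentence into one sentence vector.
    static double[] average(List<String> sentence, Map<String, double[]> vectors, int vectorSize) {
        double[] result = new double[vectorSize];
        if (sentence.isEmpty()) {
            return result; // assumption: empty sentence maps to the zero vector
        }
        for (String word : sentence) {
            double[] vector = vectors.get(word);
            if (vector != null) { // assumption: unknown words add nothing to the sum
                for (int i = 0; i < vectorSize; i++) {
                    result[i] += vector[i];
                }
            }
        }
        for (int i = 0; i < vectorSize; i++) {
            result[i] /= sentence.size();
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, double[]> vectors = new HashMap<>();
        vectors.put("a", new double[] {1.0, 2.0});
        vectors.put("b", new double[] {3.0, 4.0});
        double[] sentenceVector = average(Arrays.asList("a", "b"), vectors, 2);
        System.out.println(Arrays.toString(sentenceVector)); // prints [2.0, 3.0]
    }
}
```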
    • transformSchema

      public StructType transformSchema(StructType schema)
      Description copied from class: PipelineStage
      Check transform validity and derive the output schema from the input schema.

      We check validity for interactions between parameters during transformSchema and raise an exception if any parameter value is invalid. Parameter value checks which do not depend on other parameters are handled by Param.validate().

      Typical implementation should first conduct verification on schema change and parameter validity, including complex parameter interaction checks.

      Specified by:
      transformSchema in class PipelineStage
      Parameters:
      schema - (undocumented)
      Returns:
      (undocumented)
    • copy

      public Word2VecModel copy(ParamMap extra)
      Description copied from interface: Params
      Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
      Specified by:
      copy in interface Params
      Specified by:
      copy in class Model<Word2VecModel>
      Parameters:
      extra - (undocumented)
      Returns:
      (undocumented)
    • write

      public MLWriter write()
      Description copied from interface: MLWritable
      Returns an MLWriter instance for this ML instance.
      Specified by:
      write in interface MLWritable
      Returns:
      (undocumented)
    • toString

      public String toString()
      Specified by:
      toString in interface Identifiable
      Overrides:
      toString in class Object