All Implemented Interfaces: Serializable, org.apache.spark.internal.Logging, Word2VecBase, Params, HasInputCol, HasMaxIter, HasOutputCol, HasSeed, HasStepSize, Identifiable, MLWritable

public class Word2VecModel extends Model<Word2VecModel> implements Word2VecBase, MLWritable
Model fitted by Word2Vec.
    • read

      public static MLReader<Word2VecModel> read()
    • load

      public static Word2VecModel load(String path)
    • vectorSize

      public final IntParam vectorSize()
      The dimension of the vector that each word is transformed into. Default: 100
    • windowSize

      public final IntParam windowSize()
      The window size: context words are taken from positions [-window, window] relative to the current word. Default: 5
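
      The [-window, window] range can be pictured with a small stand-alone sketch. This is plain Java, not Spark code: the class ContextWindowSketch and the method contextWords are hypothetical names, and the sketch assumes the full window is always used, which an actual trainer may not do.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: lists the words that fall inside a symmetric context window.
final class ContextWindowSketch {

    // Returns the words at offsets [-window, window] around index `center`,
    // clipped to the sentence bounds and excluding the center word itself.
    static List<String> contextWords(List<String> sentence, int center, int window) {
        List<String> context = new ArrayList<>();
        int start = Math.max(0, center - window);
        int end = Math.min(sentence.size() - 1, center + window);
        for (int i = start; i <= end; i++) {
            if (i != center) {
                context.add(sentence.get(i));
            }
        }
        return context;
    }

    public static void main(String[] args) {
        List<String> sentence = List.of("the", "quick", "brown", "fox", "jumps");
        // Context of "brown" (index 2) with a window of 1.
        System.out.println(contextWords(sentence, 2, 1));
    }
}
```

      A larger window pulls in more distant words as context, at the cost of more training pairs per word.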
    • numPartitions

      public final IntParam numPartitions()
      Number of partitions into which the input sentences are split. Default: 1
    • minCount

      public final IntParam minCount()
      The minimum number of times a token must appear to be included in the word2vec model's vocabulary. Default: 5
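
      The effect of minCount can be sketched as a simple frequency filter. The code below is a stand-alone illustration in plain Java, not the Spark implementation; MinCountSketch and vocabulary are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: keeps the tokens that occur at least minCount times.
final class MinCountSketch {

    static Set<String> vocabulary(List<List<String>> sentences, int minCount) {
        // Count every token across all sentences.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> sentence : sentences) {
            for (String token : sentence) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        // Retain only tokens whose count reaches the threshold.
        Set<String> vocab = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minCount) {
                vocab.add(entry.getKey());
            }
        }
        return vocab;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = List.of(
                List.of("a", "b", "a"),
                List.of("a", "c", "b"));
        // "c" appears once, so it is dropped when minCount is 2.
        System.out.println(vocabulary(sentences, 2));
    }
}
```

      Tokens that fall below the threshold have no vector in the fitted model, so raising minCount shrinks the vocabulary.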
    • maxSentenceLength

      public final IntParam maxSentenceLength()
      Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to maxSentenceLength size. Default: 1000
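
      The chunking described above can be sketched as follows. This is a stand-alone illustration in plain Java under the assumption that chunks are consecutive and non-overlapping; SentenceChunkSketch and chunk are hypothetical names, not part of the Spark API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: splits a long sentence into consecutive chunks
// of at most maxSentenceLength words.
final class SentenceChunkSketch {

    static List<List<String>> chunk(List<String> sentence, int maxSentenceLength) {
        if (maxSentenceLength <= 0) {
            throw new IllegalArgumentException("maxSentenceLength must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < sentence.size(); start += maxSentenceLength) {
            int end = Math.min(sentence.size(), start + maxSentenceLength);
            // Copy the slice so each chunk is independent of the input list.
            chunks.add(new ArrayList<>(sentence.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> sentence = List.of("a", "b", "c", "d", "e");
        // Five words with a limit of two yield chunks of sizes 2, 2 and 1.
        System.out.println(chunk(sentence, 2));
    }
}
```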
    • seed

      public final LongParam seed()
      Param for random seed.
    • stepSize

      public DoubleParam stepSize()
      Param for the step size to be used for each iteration of optimization (> 0).
    • maxIter

      public final IntParam maxIter()
      Param for maximum number of iterations (>= 0).
    • outputCol

      public final Param<String> outputCol()
      Param for output column name.
    • inputCol

      public final Param<String> inputCol()
      Param for input column name.
    • uid

      public String uid()
      An immutable unique ID for the object and its derivatives.
    • getVectors

      public Dataset<Row> getVectors()
    • findSynonyms

      public Dataset<Row> findSynonyms(String word, int num)
      Finds the num words closest in similarity to the given word, not including the word itself.
      word - the word to find synonyms for
      num - the number of synonyms to return
      Returns a DataFrame with columns "word" and "similarity", holding each synonym and its cosine similarity to the given word.
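
      The ranking by cosine similarity can be sketched with an in-memory map of word vectors. This is a stand-alone illustration in plain Java, not the Spark implementation: SynonymSketch, cosine and findSynonyms are hypothetical names, all vectors are assumed to have the same length, and throwing on an unknown word is a choice made for this sketch only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: ranks vocabulary words by cosine similarity to a query word.
final class SynonymSketch {

    // Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns up to `num` (word, similarity) pairs, most similar first,
    // skipping the query word itself.
    static List<Map.Entry<String, Double>> findSynonyms(
            Map<String, double[]> vectors, String word, int num) {
        double[] query = vectors.get(word);
        if (query == null) {
            throw new IllegalArgumentException("word not in vocabulary: " + word);
        }
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, double[]> entry : vectors.entrySet()) {
            if (!entry.getKey().equals(word)) {
                scored.add(Map.entry(entry.getKey(), cosine(query, entry.getValue())));
            }
        }
        scored.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        int limit = Math.max(0, Math.min(num, scored.size()));
        return new ArrayList<>(scored.subList(0, limit));
    }

    public static void main(String[] args) {
        Map<String, double[]> vectors = Map.of(
                "king", new double[] {1.0, 0.0},
                "queen", new double[] {0.9, 0.1},
                "apple", new double[] {0.0, 1.0});
        // "queen" points in nearly the same direction as "king", so it ranks first.
        System.out.println(findSynonyms(vectors, "king", 2));
    }
}
```

      Because the query word is skipped, asking for num results over a vocabulary of n words can return at most n - 1 entries in this sketch.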
    • findSynonyms

      public Dataset<Row> findSynonyms(Vector vec, int num)
      Finds the num words whose vector representation is most similar to the supplied vector. If the supplied vector is the vector representation of a word in the model's vocabulary, that word will be in the results.
      vec - the vector to find synonyms for
      num - the number of synonyms to return
      Returns a DataFrame with columns "word" and "similarity", holding each synonym and its cosine similarity to the given vector.
    • findSynonymsArray

      public scala.Tuple2<String,Object>[] findSynonymsArray(Vector vec, int num)
      Finds the num words whose vector representation is most similar to the supplied vector. If the supplied vector is the vector representation of a word in the model's vocabulary, that word will be in the results.
      vec - the vector to find synonyms for
      num - the number of synonyms to return
      Returns an array of the synonyms paired with their cosine similarities to the given vector.
    • findSynonymsArray

      public scala.Tuple2<String,Object>[] findSynonymsArray(String word, int num)
      Finds the num words closest in similarity to the given word, not including the word itself.
      word - the word to find synonyms for
      num - the number of synonyms to return
      Returns an array of the synonyms paired with their cosine similarities to the given word.
    • setInputCol

      public Word2VecModel setInputCol(String value)
    • setOutputCol

      public Word2VecModel setOutputCol(String value)
    • transform

      public Dataset<Row> transform(Dataset<?> dataset)
      Transforms a sentence column into a vector column that represents each whole sentence. The vector for a sentence is the average of the vectors of all words it contains.
      dataset - the input dataset containing the sentence column
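
      The averaging step can be sketched for a single sentence as follows. This is a stand-alone illustration in plain Java, not the Spark implementation: SentenceAverageSketch and averageVector are hypothetical names, and the sketch assumes that words missing from the vocabulary contribute nothing while the divisor stays the full sentence length. Spark's exact handling of unknown words may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustrative only: represents a sentence as the average of its word vectors.
final class SentenceAverageSketch {

    static double[] averageVector(
            List<String> sentence, Map<String, double[]> vectors, int vectorSize) {
        double[] result = new double[vectorSize];
        if (sentence.isEmpty()) {
            return result; // an empty sentence maps to the zero vector in this sketch
        }
        for (String word : sentence) {
            double[] vector = vectors.get(word);
            if (vector != null) { // assumption: unknown words add nothing
                for (int i = 0; i < vectorSize; i++) {
                    result[i] += vector[i];
                }
            }
        }
        // Assumption: divide by the full sentence length, unknown words included.
        for (int i = 0; i < vectorSize; i++) {
            result[i] /= sentence.size();
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, double[]> vectors = Map.of(
                "a", new double[] {1.0, 3.0},
                "b", new double[] {3.0, 5.0});
        System.out.println(Arrays.toString(averageVector(List.of("a", "b"), vectors, 2)));
    }
}
```

      Averaging discards word order, so two sentences containing the same words in a different order get the same vector.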
    • transformSchema

      public StructType transformSchema(StructType schema)
      Check transform validity and derive the output schema from the input schema.

      We check validity for interactions between parameters during transformSchema and raise an exception if any parameter value is invalid. Parameter value checks which do not depend on other parameters are handled by Param.validate().

      A typical implementation should first verify the schema change and parameter validity, including complex parameter interaction checks.

      schema - the input schema
    • copy

      public Word2VecModel copy(ParamMap extra)
      Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
      copy in interface Params
      extra - the extra params to apply to the copy
    • write

      public MLWriter write()
      Returns an MLWriter instance for this ML instance.
    • toString

      public String toString()
      toString in class Object