org.apache.spark.mllib.feature
Class IDF

Object
  extended by org.apache.spark.mllib.feature.IDF

public class IDF
extends Object

:: Experimental :: Inverse document frequency (IDF). The standard formulation is used: idf = log((m + 1) / (d(t) + 1)), where m is the total number of documents and d(t) is the number of documents that contain term t.

This implementation supports filtering out terms that do not appear in a minimum number of documents (controlled by the variable minDocFreq). For terms that appear in fewer than minDocFreq documents, the IDF is set to 0, so their TF-IDF values are also 0.

param: minDocFreq minimum number of documents in which a term must appear to avoid being filtered out
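
The formula and the minDocFreq filter described above can be illustrated for a single term with a small plain-Java sketch. This is not the Spark implementation: the class and method names (IdfFormulaSketch, idf) are hypothetical, and the use of the natural logarithm is an assumption, since the description only says "log".

```java
// Hypothetical stand-alone sketch; it does not use or reproduce Spark's classes.
final class IdfFormulaSketch {

    /**
     * Returns log((m + 1) / (d(t) + 1)) for one term, or 0 when the term
     * appears in fewer than minDocFreq documents.
     *
     * @param numDocs    m, the total number of documents
     * @param docFreq    d(t), the number of documents that contain the term
     * @param minDocFreq minimum document count a term needs to keep its IDF
     */
    static double idf(long numDocs, long docFreq, int minDocFreq) {
        if (docFreq < minDocFreq) {
            return 0.0; // filtered term: IDF of 0, so its TF-IDF is 0 as well
        }
        // Assumption: natural logarithm.
        return Math.log((numDocs + 1.0) / (docFreq + 1.0));
    }

    public static void main(String[] args) {
        long m = 4;
        System.out.println(idf(m, 1, 0)); // log(5 / 2), no filtering applied
        System.out.println(idf(m, 1, 2)); // filtered: the term is in only 1 document
        System.out.println(idf(m, 4, 2)); // log(5 / 5): the term is in every document
    }
}
```

Note that a term present in every document also gets an IDF of 0 under this formula, independently of minDocFreq.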


Nested Class Summary
static class IDF.DocumentFrequencyAggregator
          Document frequency aggregator.
 
Constructor Summary
IDF()
           
IDF(int minDocFreq)
           
 
Method Summary
 IDFModel fit(JavaRDD<Vector> dataset)
          Computes the inverse document frequency.
 IDFModel fit(RDD<Vector> dataset)
          Computes the inverse document frequency.
 int minDocFreq()
           
 
Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

IDF

public IDF(int minDocFreq)

IDF

public IDF()
Method Detail

minDocFreq

public int minDocFreq()

fit

public IDFModel fit(RDD<Vector> dataset)
Computes the inverse document frequency.

Parameters:
dataset - an RDD of term frequency vectors
Returns:
an IDFModel holding the inverse document frequencies computed from dataset
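
As a rough illustration of what fitting involves, the sketch below derives an IDF value per term from a small in-memory collection of term frequency vectors. It is plain Java, not Spark code: dense double[] rows stand in for the RDD of vectors, the names IdfFitSketch and fit are hypothetical, and it assumes that a document "contains" term t when its frequency at index t is greater than 0 and that the logarithm is natural.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; dense arrays replace RDD<Vector>.
final class IdfFitSketch {

    /** Computes one IDF value per term index from dense term frequency rows. */
    static double[] fit(double[][] termFrequencies, int minDocFreq) {
        if (termFrequencies.length == 0) {
            throw new IllegalArgumentException("at least one document is required");
        }
        int numTerms = termFrequencies[0].length;
        long m = termFrequencies.length; // total number of documents

        // d(t): number of documents whose frequency for term t is positive (assumption).
        long[] docFreq = new long[numTerms];
        for (double[] doc : termFrequencies) {
            for (int t = 0; t < numTerms; t++) {
                if (doc[t] > 0.0) {
                    docFreq[t]++;
                }
            }
        }

        // idf = log((m + 1) / (d(t) + 1)); terms below minDocFreq keep the default 0.
        double[] idf = new double[numTerms];
        for (int t = 0; t < numTerms; t++) {
            if (docFreq[t] >= minDocFreq) {
                idf[t] = Math.log((m + 1.0) / (docFreq[t] + 1.0));
            }
        }
        return idf;
    }

    public static void main(String[] args) {
        double[][] docs = {
            {1.0, 0.0, 2.0},
            {0.0, 0.0, 3.0},
            {4.0, 0.0, 1.0},
        };
        // Term 0 is in 2 documents, term 1 in none, term 2 in all 3.
        System.out.println(Arrays.toString(fit(docs, 2)));
    }
}
```

With minDocFreq set to 2, term 1 is filtered to 0 because it appears in no document, and term 2 is also 0 because it appears in every document.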

fit

public IDFModel fit(JavaRDD<Vector> dataset)
Computes the inverse document frequency.

Parameters:
dataset - a JavaRDD of term frequency vectors
Returns:
an IDFModel holding the inverse document frequencies computed from dataset