Package org.apache.spark.mllib.feature
Class IDF
Object
org.apache.spark.mllib.feature.IDF
Inverse document frequency (IDF).
The standard formulation is used: idf = log((m + 1) / (d(t) + 1)), where m is the total number of documents and d(t) is the number of documents that contain term t.
This implementation supports filtering out terms that do not appear in a minimum number of documents (controlled by the variable minDocFreq). For terms that appear in fewer than minDocFreq documents, the IDF is set to 0, so their TF-IDF values are also 0. The document frequency is also reported as 0 for such terms.
param: minDocFreq minimum number of documents in which a term must appear to avoid being filtered out
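The formula and the minDocFreq filter described above can be sketched in plain Java, without any Spark dependency. This is only an illustration: the class IdfFormulaSketch and its idf method are hypothetical names that are not part of the Spark API, and the sketch assumes that log denotes the natural logarithm.

```java
// Minimal sketch of the smoothed IDF formula with minDocFreq filtering.
// IdfFormulaSketch and idf(...) are hypothetical names, not part of Spark.
final class IdfFormulaSketch {

    /**
     * @param numDocs    m, the total number of documents
     * @param docFreq    d(t), the number of documents that contain term t
     * @param minDocFreq minimum document frequency required to keep the term
     * @return the IDF of the term, or 0 if the term is filtered out
     */
    static double idf(long numDocs, long docFreq, int minDocFreq) {
        if (docFreq < minDocFreq) {
            // Filtered term: an IDF of 0 makes every TF-IDF for it 0 as well.
            return 0.0;
        }
        // Assumption: natural logarithm. The +1 terms keep the ratio finite
        // even when d(t) is 0.
        return Math.log((numDocs + 1.0) / (docFreq + 1.0));
    }

    public static void main(String[] args) {
        System.out.println(idf(9, 4, 0)); // log(10 / 5) = log(2)
        System.out.println(idf(9, 1, 2)); // d(t) < minDocFreq, so 0
        System.out.println(idf(9, 9, 0)); // term in every document: log(1) = 0
    }
}
```

Note that a term occurring in every document also gets an IDF of 0, so a zero weight alone does not indicate that a term was filtered by minDocFreq.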
Nested Class Summary
Modifier and Type: static class
Class: IDF.DocumentFrequencyAggregator
Description: Document frequency aggregator.
Constructor Details
IDF
public IDF(int minDocFreq)
IDF
public IDF()
Method Details
minDocFreq
public int minDocFreq()
fit
Computes the inverse document frequency.
Parameters:
dataset - an RDD of term frequency vectors
Returns:
(undocumented)
fit
Computes the inverse document frequency.
Parameters:
dataset - a JavaRDD of term frequency vectors
Returns:
(undocumented)
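As a rough picture of what computing the inverse document frequency from term frequency vectors involves, the following plain-Java sketch counts document frequencies per term and then applies the formula. It is not the Spark implementation: IdfFitSketch and fitIdf are hypothetical names, dense double[] rows stand in for the distributed term frequency vectors, a term is assumed to count as present when its frequency is positive, and log is assumed to be the natural logarithm.

```java
// Sketch of fitting IDF weights from in-memory term frequency vectors.
// IdfFitSketch and fitIdf(...) are hypothetical names, not part of Spark;
// each double[] row plays the role of one document's term frequency vector.
final class IdfFitSketch {

    static double[] fitIdf(double[][] termFrequencies, int minDocFreq) {
        int numDocs = termFrequencies.length;
        int numTerms = numDocs == 0 ? 0 : termFrequencies[0].length;

        // d(t): number of documents in which term t has a positive frequency.
        long[] docFreq = new long[numTerms];
        for (double[] doc : termFrequencies) {
            for (int t = 0; t < numTerms; t++) {
                if (doc[t] > 0.0) {
                    docFreq[t]++;
                }
            }
        }

        // Apply idf = log((m + 1) / (d(t) + 1)) to terms that pass the filter.
        double[] idf = new double[numTerms];
        for (int t = 0; t < numTerms; t++) {
            if (docFreq[t] >= minDocFreq) {
                idf[t] = Math.log((numDocs + 1.0) / (docFreq[t] + 1.0));
            }
            // Otherwise idf[t] keeps its default value of 0.0.
        }
        return idf;
    }

    public static void main(String[] args) {
        double[][] tf = {
            {1.0, 0.0, 2.0},
            {3.0, 0.0, 1.0},
            {0.0, 1.0, 1.0},
        };
        // Term 0 appears in 2 documents, term 1 in 1, term 2 in all 3.
        for (double weight : fitIdf(tf, 2)) {
            System.out.println(weight);
        }
    }
}
```

With minDocFreq set to 2 in this example, term 0 receives log(4 / 3), term 1 is filtered to 0, and term 2 receives log(4 / 4) = 0 because it occurs in every document.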