Package org.apache.spark.mllib.util

Class MLUtils

java.lang.Object
    org.apache.spark.mllib.util.MLUtils

Helper methods to load, save, and pre-process data used in MLlib.
Constructor Summary

    MLUtils()

Method Summary

    static Vector appendBias(Vector vector)
        Returns a new vector with 1.0 (bias) appended to the input vector.

    static Dataset<Row> convertMatrixColumnsFromML(Dataset<?> dataset, String... cols)
    static Dataset<Row> convertMatrixColumnsFromML(Dataset<?> dataset, scala.collection.immutable.Seq<String> cols)
    static Dataset<Row> convertMatrixColumnsToML(Dataset<?> dataset, String... cols)
    static Dataset<Row> convertMatrixColumnsToML(Dataset<?> dataset, scala.collection.immutable.Seq<String> cols)
    static Dataset<Row> convertVectorColumnsFromML(Dataset<?> dataset, String... cols)
    static Dataset<Row> convertVectorColumnsFromML(Dataset<?> dataset, scala.collection.immutable.Seq<String> cols)
    static Dataset<Row> convertVectorColumnsToML(Dataset<?> dataset, String... cols)
    static Dataset<Row> convertVectorColumnsToML(Dataset<?> dataset, scala.collection.immutable.Seq<String> cols)

    static <T> scala.Tuple2<RDD<T>,RDD<T>>[] kFold(RDD<T> rdd, int numFolds, int seed, scala.reflect.ClassTag<T> evidence$1)
        Returns a k-element array of pairs of RDDs; the first element of each pair is the training data (the complement of the validation data), and the second element is the validation data, a unique 1/kth of the input.
    static <T> scala.Tuple2<RDD<T>,RDD<T>>[] kFold(RDD<T> rdd, int numFolds, long seed, scala.reflect.ClassTag<T> evidence$2)
        Version of kFold() taking a Long seed.
    static scala.Tuple2<RDD<Row>,RDD<Row>>[] kFold(Dataset<Row> df, int numFolds, String foldColName)
        Version of kFold() taking a fold column name.

    static RDD<LabeledPoint> loadLabeledPoints(SparkContext sc, String dir)
        Loads labeled points saved using RDD[LabeledPoint].saveAsTextFile with the default number of partitions.
    static RDD<LabeledPoint> loadLabeledPoints(SparkContext sc, String path, int minPartitions)
        Loads labeled points saved using RDD[LabeledPoint].saveAsTextFile.

    static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, String path)
        Loads binary labeled data in the LIBSVM format into an RDD[LabeledPoint], with the number of features determined automatically and the default number of partitions.
    static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, String path, int numFeatures)
        Loads labeled data in the LIBSVM format into an RDD[LabeledPoint], with the default number of partitions.
    static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, String path, int numFeatures, int minPartitions)
        Loads labeled data in the LIBSVM format into an RDD[LabeledPoint].

    static RDD<Vector> loadVectors(SparkContext sc, String path)
        Loads vectors saved using RDD[Vector].saveAsTextFile with the default number of partitions.
    static RDD<Vector> loadVectors(SparkContext sc, String path, int minPartitions)
        Loads vectors saved using RDD[Vector].saveAsTextFile.

    static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)
    static void optimizerFailed(org.apache.spark.ml.util.Instrumentation instr, Class<?> optimizerClass)
    static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
    static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)

    static void saveAsLibSVMFile(RDD<LabeledPoint> data, String dir)
        Save labeled data in LIBSVM format.
Constructor Details

MLUtils

    public MLUtils()

Method Details
convertVectorColumnsToML

    public static Dataset<Row> convertVectorColumnsToML(Dataset<?> dataset, String... cols)

Converts vector columns in an input Dataset from the Vector type to the new Vector type under the spark.ml package.

Parameters:
    dataset - input dataset
    cols - a list of vector columns to be converted. New vector columns will be ignored. If unspecified, all old vector columns will be converted except nested ones.
Returns:
    the input DataFrame with old vector columns converted to the new vector type
 
convertVectorColumnsFromML

    public static Dataset<Row> convertVectorColumnsFromML(Dataset<?> dataset, String... cols)

Converts vector columns in an input Dataset from the new Vector type under the spark.ml package back to the old Vector type.

Parameters:
    dataset - input dataset
    cols - a list of vector columns to be converted. Old vector columns will be ignored. If unspecified, all new vector columns will be converted except nested ones.
Returns:
    the input DataFrame with new vector columns converted to the old vector type
 
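For illustration, a minimal Scala sketch of the round trip between the two vector types; the SparkSession setup and the column name "features" are placeholders, not part of the API:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("MLUtilsExamples").getOrCreate()

    // A DataFrame with an old-style mllib vector column named "features".
    val df = spark.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 2.0)),
      (1, Vectors.sparse(2, Array(0), Array(3.0)))
    )).toDF("id", "features")

    val mlDF = MLUtils.convertVectorColumnsToML(df, "features")        // mllib -> spark.ml
    val mllibDF = MLUtils.convertVectorColumnsFromML(mlDF, "features") // spark.ml -> mllib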
convertMatrixColumnsToML

    public static Dataset<Row> convertMatrixColumnsToML(Dataset<?> dataset, String... cols)

Converts matrix columns in an input Dataset from the Matrix type to the new Matrix type under the spark.ml package.

Parameters:
    dataset - input dataset
    cols - a list of matrix columns to be converted. New matrix columns will be ignored. If unspecified, all old matrix columns will be converted except nested ones.
Returns:
    the input DataFrame with old matrix columns converted to the new matrix type
 
convertMatrixColumnsFromML

    public static Dataset<Row> convertMatrixColumnsFromML(Dataset<?> dataset, String... cols)

Converts matrix columns in an input Dataset from the new Matrix type under the spark.ml package back to the old Matrix type.

Parameters:
    dataset - input dataset
    cols - a list of matrix columns to be converted. Old matrix columns will be ignored. If unspecified, all new matrix columns will be converted except nested ones.
Returns:
    the input DataFrame with new matrix columns converted to the old matrix type
 
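The matrix variants follow the same pattern. A sketch, reusing the SparkSession from the previous example; the column names are again illustrative:

    import org.apache.spark.mllib.linalg.Matrices
    import org.apache.spark.mllib.util.MLUtils

    val mdf = spark.createDataFrame(Seq(
      (0, Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0)))
    )).toDF("id", "m")

    val mlMdf = MLUtils.convertMatrixColumnsToML(mdf, "m")        // mllib -> spark.ml
    val mllibMdf = MLUtils.convertMatrixColumnsFromML(mlMdf, "m") // spark.ml -> mllib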
loadLibSVMFile

    public static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, String path, int numFeatures, int minPartitions)

Loads labeled data in the LIBSVM format into an RDD[LabeledPoint]. The LIBSVM format is a text-based format used by LIBSVM and LIBLINEAR. Each line represents a labeled sparse feature vector using the following format:

    label index1:value1 index2:value2 ...

where the indices are one-based and in ascending order. This method parses each line into an org.apache.spark.mllib.regression.LabeledPoint, converting the feature indices to zero-based.

Parameters:
    sc - Spark context
    path - file or directory path in any Hadoop-supported file system URI
    numFeatures - number of features, which will be determined from the input data if a nonpositive value is given. This is useful when the dataset is already split into multiple files and you want to load them separately, because some features may not be present in certain files, which leads to inconsistent feature dimensions.
    minPartitions - min number of partitions
Returns:
    labeled data stored as an RDD[LabeledPoint]
 
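A short Scala sketch; the SparkContext setup and the input path are placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.util.MLUtils

    val sc = new SparkContext("local[*]", "LibSVMExample")
    // With the two-argument overload, the number of features is inferred from the data.
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    println(s"loaded ${data.count()} points with ${data.first().features.size} features")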
loadLibSVMFile

    public static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, String path, int numFeatures)

Loads labeled data in the LIBSVM format into an RDD[LabeledPoint], with the default number of partitions.

Parameters:
    sc - Spark context
    path - file or directory path in any Hadoop-supported file system URI
    numFeatures - number of features
Returns:
    labeled data stored as an RDD[LabeledPoint]
 
loadLibSVMFile

    public static RDD<LabeledPoint> loadLibSVMFile(SparkContext sc, String path)

Loads binary labeled data in the LIBSVM format into an RDD[LabeledPoint], with the number of features determined automatically and the default number of partitions.

Parameters:
    sc - Spark context
    path - file or directory path in any Hadoop-supported file system URI
Returns:
    labeled data stored as an RDD[LabeledPoint]
 
saveAsLibSVMFile

    public static void saveAsLibSVMFile(RDD<LabeledPoint> data, String dir)

Save labeled data in LIBSVM format.

Parameters:
    data - an RDD of LabeledPoint to be saved
    dir - directory to save the data
See Also:
    org.apache.spark.mllib.util.MLUtils.loadLibSVMFile
 
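A round-trip sketch, assuming an existing SparkContext named sc; the output directory is a placeholder and must not already exist:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils

    val points = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.5, 1.5)),
      LabeledPoint(0.0, Vectors.sparse(2, Array(1), Array(2.0)))
    ))
    MLUtils.saveAsLibSVMFile(points, "/tmp/libsvm-out")
    val reloaded = MLUtils.loadLibSVMFile(sc, "/tmp/libsvm-out")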
 
loadVectors

    public static RDD<Vector> loadVectors(SparkContext sc, String path, int minPartitions)

Loads vectors saved using RDD[Vector].saveAsTextFile.

Parameters:
    sc - Spark context
    path - file or directory path in any Hadoop-supported file system URI
    minPartitions - min number of partitions
Returns:
    vectors stored as an RDD[Vector]
 
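A save-and-load sketch, again assuming an existing SparkContext sc and a placeholder path:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.util.MLUtils

    sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
      .saveAsTextFile("/tmp/vectors")
    val vectors = MLUtils.loadVectors(sc, "/tmp/vectors")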
loadVectors

    public static RDD<Vector> loadVectors(SparkContext sc, String path)

Loads vectors saved using RDD[Vector].saveAsTextFile with the default number of partitions.

Parameters:
    sc - Spark context
    path - file or directory path in any Hadoop-supported file system URI
Returns:
    vectors stored as an RDD[Vector]
 
loadLabeledPoints

    public static RDD<LabeledPoint> loadLabeledPoints(SparkContext sc, String path, int minPartitions)

Loads labeled points saved using RDD[LabeledPoint].saveAsTextFile.

Parameters:
    sc - Spark context
    path - file or directory path in any Hadoop-supported file system URI
    minPartitions - min number of partitions
Returns:
    labeled points stored as an RDD[LabeledPoint]
 
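The same round trip works for labeled points; a sketch with a placeholder path:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils

    sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(0.1, 0.2))))
      .saveAsTextFile("/tmp/points")
    val labeled = MLUtils.loadLabeledPoints(sc, "/tmp/points")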
loadLabeledPoints

    public static RDD<LabeledPoint> loadLabeledPoints(SparkContext sc, String dir)

Loads labeled points saved using RDD[LabeledPoint].saveAsTextFile with the default number of partitions.

Parameters:
    sc - Spark context
    dir - file or directory path in any Hadoop-supported file system URI
Returns:
    labeled points stored as an RDD[LabeledPoint]
 
kFold

    public static <T> scala.Tuple2<RDD<T>,RDD<T>>[] kFold(RDD<T> rdd, int numFolds, int seed, scala.reflect.ClassTag<T> evidence$1)

Returns a k-element array of pairs of RDDs, where k = numFolds. The first element of each pair contains the training data, the complement of the validation data; the second element contains the validation data, a unique 1/kth of the input.

Parameters:
    rdd - the input RDD to split
    numFolds - number of folds
    seed - random seed
    evidence$1 - implicit ClassTag for the element type T
Returns:
    an array of (training, validation) RDD pairs
 
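A cross-validation sketch in Scala; `data` stands for any RDD, for example the RDD[LabeledPoint] loaded above:

    import org.apache.spark.mllib.util.MLUtils

    val folds = MLUtils.kFold(data, numFolds = 3, seed = 42)
    folds.zipWithIndex.foreach { case ((training, validation), i) =>
      println(s"fold $i: train=${training.count()}, validation=${validation.count()}")
    }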
kFold

    public static <T> scala.Tuple2<RDD<T>,RDD<T>>[] kFold(RDD<T> rdd, int numFolds, long seed, scala.reflect.ClassTag<T> evidence$2)

Version of kFold() taking a Long seed.

Parameters:
    rdd - the input RDD to split
    numFolds - number of folds
    seed - random seed
    evidence$2 - implicit ClassTag for the element type T
Returns:
    an array of (training, validation) RDD pairs
 
kFold

    public static scala.Tuple2<RDD<Row>,RDD<Row>>[] kFold(Dataset<Row> df, int numFolds, String foldColName)

Version of kFold() taking a fold column name.

Parameters:
    df - the input DataFrame
    numFolds - number of folds
    foldColName - name of the column holding each row's fold assignment
Returns:
    an array of (training, validation) RDD pairs
 
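A sketch of the fold-column variant; it assumes a SparkSession named spark and that the "fold" column holds integer fold assignments in [0, numFolds), here derived from the row id purely for illustration:

    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.sql.functions.col

    val foldDf = spark.range(100).withColumn("fold", (col("id") % 3).cast("int"))
    val dfFolds = MLUtils.kFold(foldDf, 3, "fold")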
appendBias

    public static Vector appendBias(Vector vector)

Returns a new vector with 1.0 (bias) appended to the input vector.

Parameters:
    vector - the input vector
Returns:
    a new vector with 1.0 appended as the last element
 
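A one-line sketch:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.util.MLUtils

    val withBias = MLUtils.appendBias(Vectors.dense(1.0, 2.0))  // [1.0, 2.0, 1.0]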
convertVectorColumnsToML

    public static Dataset<Row> convertVectorColumnsToML(Dataset<?> dataset, scala.collection.immutable.Seq<String> cols)

Converts vector columns in an input Dataset from the Vector type to the new Vector type under the spark.ml package.

Parameters:
    dataset - input dataset
    cols - a list of vector columns to be converted. New vector columns will be ignored. If unspecified, all old vector columns will be converted except nested ones.
Returns:
    the input DataFrame with old vector columns converted to the new vector type
 
convertVectorColumnsFromML

    public static Dataset<Row> convertVectorColumnsFromML(Dataset<?> dataset, scala.collection.immutable.Seq<String> cols)

Converts vector columns in an input Dataset from the new Vector type under the spark.ml package back to the old Vector type.

Parameters:
    dataset - input dataset
    cols - a list of vector columns to be converted. Old vector columns will be ignored. If unspecified, all new vector columns will be converted except nested ones.
Returns:
    the input DataFrame with new vector columns converted to the old vector type
 
convertMatrixColumnsToML

    public static Dataset<Row> convertMatrixColumnsToML(Dataset<?> dataset, scala.collection.immutable.Seq<String> cols)

Converts matrix columns in an input Dataset from the Matrix type to the new Matrix type under the spark.ml package.

Parameters:
    dataset - input dataset
    cols - a list of matrix columns to be converted. New matrix columns will be ignored. If unspecified, all old matrix columns will be converted except nested ones.
Returns:
    the input DataFrame with old matrix columns converted to the new matrix type
 
convertMatrixColumnsFromML

    public static Dataset<Row> convertMatrixColumnsFromML(Dataset<?> dataset, scala.collection.immutable.Seq<String> cols)

Converts matrix columns in an input Dataset from the new Matrix type under the spark.ml package back to the old Matrix type.

Parameters:
    dataset - input dataset
    cols - a list of matrix columns to be converted. Old matrix columns will be ignored. If unspecified, all new matrix columns will be converted except nested ones.
Returns:
    the input DataFrame with new matrix columns converted to the old matrix type
 
optimizerFailed

    public static void optimizerFailed(org.apache.spark.ml.util.Instrumentation instr, Class<?> optimizerClass)

org$apache$spark$internal$Logging$$log_

    public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()

org$apache$spark$internal$Logging$$log__$eq

    public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)

LogStringContext

    public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)