org.apache.spark.mllib.util.LinearDataGenerator

public class LinearDataGenerator extends Object

Generates sample data for linear models. This class draws uniformly random values for every feature and adds Gaussian noise, scaled by eps, to the response variable Y.
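
The generation model described above can be sketched in plain Java. This is a minimal illustration, not Spark's implementation: the class `LinearModelSketch` and its `generate` method are hypothetical names, and the sketch assumes that features are uniform on [-1, 1) and that eps multiplies zero-mean Gaussian noise, in line with the "Epsilon scaling factor" parameter description.

```java
import java.util.Random;

/** Hypothetical helper (not part of Spark): y = intercept + w . x + eps * noise. */
final class LinearModelSketch {
    /** Returns nPoints rows; each row holds the features followed by the label in the last slot. */
    static double[][] generate(double intercept, double[] weights, int nPoints, long seed, double eps) {
        Random rnd = new Random(seed);
        int d = weights.length;
        double[][] rows = new double[nPoints][d + 1];
        for (int i = 0; i < nPoints; i++) {
            double y = intercept;
            for (int j = 0; j < d; j++) {
                double x = 2.0 * rnd.nextDouble() - 1.0; // assumed: uniform on [-1, 1)
                rows[i][j] = x;
                y += weights[j] * x;
            }
            // assumed: eps scales standard Gaussian noise added to the response
            rows[i][d] = y + eps * rnd.nextGaussian();
        }
        return rows;
    }
}
```

With eps set to 0.0 the label is exactly the linear combination of the features, which makes the sketch easy to check by hand; the same seed always reproduces the same rows.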

Constructor Summary

Constructors

Constructor

Description

LinearDataGenerator()
Method Summary

Modifier and Type

Method

Description

static scala.collection.Seq<LabeledPoint>

generateLinearInput(double intercept, double[] weights, double[] xMean, double[] xVariance, int nPoints, int seed, double eps)

static scala.collection.Seq<LabeledPoint>

generateLinearInput(double intercept, double[] weights, double[] xMean, double[] xVariance, int nPoints, int seed, double eps, double sparsity)

static scala.collection.Seq<LabeledPoint>

generateLinearInput(double intercept, double[] weights, int nPoints, int seed, double eps)

For compatibility, data generated without specifying the mean and variance has zero mean and a variance of 1.0/3.0: the original output range is [-1, 1] with a uniform distribution, and the variance of a uniform distribution on [a, b] is (b - a)^2 / 12, which equals 1.0/3.0.

static List<LabeledPoint>

generateLinearInputAsList(double intercept, double[] weights, int nPoints, int seed, double eps)

Return a Java List of synthetic data randomly generated according to a multi collinear model.

static RDD<LabeledPoint>

generateLinearRDD(SparkContext sc, int nexamples, int nfeatures, double eps, int nparts, double intercept)

Generate an RDD containing sample data for Linear Regression models - including Ridge, Lasso, and unregularized variants.

static void

main(String[] args)

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- LinearDataGenerator
  
  public LinearDataGenerator()
Method Details
- generateLinearInputAsList
  
  public static List<LabeledPoint> generateLinearInputAsList(double intercept, double[] weights, int nPoints, int seed, double eps)
  
  Return a Java List of synthetic data randomly generated according to a multi collinear model.
  
  Parameters:
  
  intercept - Data intercept
  
  weights - Weights to be applied.
  
  nPoints - Number of points in sample.
  
  seed - Random seed
  
  eps - Epsilon scaling factor.
  
  Returns:
  
  Java List of input.
- generateLinearInput
  
  public static scala.collection.Seq<LabeledPoint> generateLinearInput(double intercept, double[] weights, int nPoints, int seed, double eps)
  
  For compatibility, data generated without specifying the mean and variance has zero mean and a variance of 1.0/3.0: the original output range is [-1, 1] with a uniform distribution, and the variance of a uniform distribution on [a, b] is (b - a)^2 / 12, which equals 1.0/3.0.
  
  Parameters:
  
  intercept - Data intercept
  
  weights - Weights to be applied.
  
  nPoints - Number of points in sample.
  
  seed - Random seed
  
  eps - Epsilon scaling factor.
  
  Returns:
  
  Seq of input.
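
The default mean and variance quoted for this overload follow from the standard moments of a uniform distribution. As a short worked derivation, with X uniform on [a, b] and a = -1, b = 1:

```latex
\mathbb{E}[X] = \frac{a + b}{2} = \frac{-1 + 1}{2} = 0,
\qquad
\operatorname{Var}(X) = \frac{(b - a)^2}{12}
  = \frac{\bigl(1 - (-1)\bigr)^2}{12}
  = \frac{4}{12}
  = \frac{1}{3}.
```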
- generateLinearInput
  
  public static scala.collection.Seq<LabeledPoint> generateLinearInput(double intercept, double[] weights, double[] xMean, double[] xVariance, int nPoints, int seed, double eps)
  
  Parameters:
  
  intercept - Data intercept
  
  weights - Weights to be applied.
  
  xMean - the mean of the generated features. Often, if the features are not properly standardized, a poorly implemented algorithm will have difficulty converging.
  
  xVariance - the variance of the generated features.
  
  nPoints - Number of points in sample.
  
  seed - Random seed
  
  eps - Epsilon scaling factor.
  
  Returns:
  
  Seq of input.
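
One way to obtain uniformly distributed features with a requested mean and variance is to stretch and shift a draw from [0, 1). The sketch below is an assumption about how such scaling can be done, not a statement of Spark's implementation; `FeatureScalingSketch` and its methods are hypothetical names. Because a uniform draw on [0, 1) has variance 1/12, multiplying the centered draw by sqrt(12 * variance) yields the target variance.

```java
import java.util.Random;

/** Hypothetical helper (not part of Spark): uniform samples with a chosen mean and variance. */
final class FeatureScalingSketch {
    static double[] sample(double mean, double variance, int n, long seed) {
        Random rnd = new Random(seed);
        double width = Math.sqrt(12.0 * variance); // width of the uniform interval
        double[] xs = new double[n];
        for (int i = 0; i < n; i++) {
            // center the draw on 0, stretch to the target width, then shift to the target mean
            xs[i] = (rnd.nextDouble() - 0.5) * width + mean;
        }
        return xs;
    }

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Population variance of xs around the supplied mean m. */
    static double variance(double[] xs, double m) {
        double sum = 0.0;
        for (double x : xs) sum += (x - m) * (x - m);
        return sum / xs.length;
    }
}
```

For a large sample, `mean` and `variance` applied to the output of `sample(3.0, 4.0, ...)` should land close to 3.0 and 4.0 respectively.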
- generateLinearInput
  
  public static scala.collection.Seq<LabeledPoint> generateLinearInput(double intercept, double[] weights, double[] xMean, double[] xVariance, int nPoints, int seed, double eps, double sparsity)
  
  Parameters:
  
  intercept - Data intercept
  
  weights - Weights to be applied.
  
  xMean - the mean of the generated features. Often, if the features are not properly standardized, a poorly implemented algorithm will have difficulty converging.
  
  xVariance - the variance of the generated features.
  
  nPoints - Number of points in sample.
  
  seed - Random seed
  
  eps - Epsilon scaling factor.
  
  sparsity - The ratio of zero elements. If it is 0.0, LabeledPoints with DenseVector features are returned.
  
  Returns:
  
  Seq of input.
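
To illustrate what a "ratio of zero elements" means in practice, the sketch below zeroes each feature independently with probability equal to the sparsity value. This is an assumed interpretation for illustration only; `SparsitySketch`, `row`, and `zeroFraction` are hypothetical names and do not appear in Spark.

```java
import java.util.Random;

/** Hypothetical helper (not part of Spark): feature rows with an expected fraction of zeros. */
final class SparsitySketch {
    /** Each element is 0.0 with probability sparsity, otherwise uniform on [-1, 1). */
    static double[] row(int nFeatures, double sparsity, Random rnd) {
        double[] x = new double[nFeatures];
        for (int j = 0; j < nFeatures; j++) {
            x[j] = rnd.nextDouble() < sparsity ? 0.0 : 2.0 * rnd.nextDouble() - 1.0;
        }
        return x;
    }

    /** Fraction of elements in x that are exactly zero. */
    static double zeroFraction(double[] x) {
        int zeros = 0;
        for (double v : x) {
            if (v == 0.0) zeros++;
        }
        return (double) zeros / x.length;
    }
}
```

Under this reading, a sparsity of 0.0 leaves every element populated (the dense case), while a sparsity of 1.0 zeroes every element.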
- generateLinearRDD
  
  public static RDD<LabeledPoint> generateLinearRDD(SparkContext sc, int nexamples, int nfeatures, double eps, int nparts, double intercept)
  
  Generate an RDD containing sample data for Linear Regression models - including Ridge, Lasso, and unregularized variants.
  
  Parameters:
  
  sc - SparkContext to be used for generating the RDD.
  
  nexamples - Number of examples that will be contained in the RDD.
  
  nfeatures - Number of features to generate for each example.
  
  eps - Epsilon factor by which examples are scaled.
  
  nparts - Number of partitions in the RDD. Default value is 2.
  
  intercept - Data intercept
  
  Returns:
  
  RDD of LabeledPoint containing sample data.
- main
  
  public static void main(String[] args)
