package graph
- Alphabetic
- Public
- Protected
Type Members
- class AppendOnceFlow extends ResolvedFlow
A Flow that reads source[s] completely and appends data to the target, just once.
- class BatchTableWrite extends FlowExecution
A
FlowExecution
that writes a batchDataFrame
to aTable
. - case class CircularDependencyException(downstreamTable: TableIdentifier, upstreamDataset: TableIdentifier) extends AnalysisException with Product with Serializable
Raised when there's a circular dependency in the current pipeline.
Raised when there's a circular dependency in the current pipeline. That is, a downstream table is referenced while creating a upstream table.
- class CompleteFlow extends ResolvedFlow
A Flow that declares exactly what data should be in the target table.
- class CoreDataflowNodeProcessor extends AnyRef
Processor that is responsible for analyzing each flow and sort the nodes in topological order
- case class DataflowGraph(flows: Seq[Flow], tables: Seq[Table], views: Seq[View]) extends GraphOperations with GraphValidations with Product with Serializable
DataflowGraph represents the core graph structure for Spark declarative pipelines.
DataflowGraph represents the core graph structure for Spark declarative pipelines. It manages the relationships between logical flows, tables, and views, providing operations for graph traversal, validation, and transformation.
- class DataflowGraphTransformer extends AutoCloseable
Resolves the DataflowGraph by processing each node in the graph.
Resolves the DataflowGraph by processing each node in the graph. This class exposes visitor functionality to resolve/analyze graph nodes. We only expose simple visitor abilities to transform different entities of the graph. For advanced transformations we also expose a mechanism to walk the graph over entity by entity.
Assumptions: 1. Each output will have at-least 1 flow to it. 2. Each flow may or may not have a destination table. If a flow does not have a destination table, the destination is a temporary view.
The way graph is structured is that flows, tables and sinks all are graph elements or nodes. While we expose transformation functions for each of these entities, we also expose a way to process to walk over the graph.
Constructor is private as all usages should be via DataflowGraphTransformer.withDataflowGraphTransformer.
- sealed trait ExecutionResult extends AnyRef
A flow's execution may complete for two reasons: 1.
A flow's execution may complete for two reasons: 1. it may finish performing all of its necessary work, or 2. it may be interrupted by a request from a user to stop it.
We use this result to disambiguate these two cases, using 'ExecutionResult.FINISHED' for the former and 'ExecutionResult.STOPPED' for the latter.
- case class FailureStoppingFlow(flowIdentifiers: Seq[TableIdentifier]) extends FailureStoppingOperation with Product with Serializable
Indicates that there was a failure while stopping the flow.
- abstract class FailureStoppingOperation extends RunFailure
Abstract class used to identify failures related to failures stopping an operation/timeouts.
- trait Flow extends GraphElement with Logging
A Flow is a node of data transformation in a dataflow graph.
A Flow is a node of data transformation in a dataflow graph. It describes the movement of data into a particular dataset.
- trait FlowExecution extends AnyRef
A
FlowExecution
specifies how to execute a flow and manages its execution. - sealed trait FlowFilter extends GraphFilter[ResolvedFlow]
Specifies how we should filter Flows.
- trait FlowFunction extends Logging
A wrapper for the lambda function that defines a Flow.
- case class FlowFunctionResult(requestedInputs: Set[TableIdentifier], batchInputs: Set[ResolvedInput], streamingInputs: Set[ResolvedInput], usedExternalInputs: Set[TableIdentifier], dataFrame: Try[classic.DataFrame], sqlConf: Map[String, String], analysisWarnings: Seq[AnalysisWarning] = Nil) extends Product with Serializable
Holds the DataFrame returned by a FlowFunction along with the inputs used to construct it.
Holds the DataFrame returned by a FlowFunction along with the inputs used to construct it.
- batchInputs
the complete inputs read by the flow
- streamingInputs
the incremental inputs read by the flow
- usedExternalInputs
the identifiers of the external inputs read by the flow
- dataFrame
the DataFrame expression executed by the flow if the flow can be resolved
- case class FlowNode(identifier: TableIdentifier, inputs: Set[TableIdentifier], output: TableIdentifier) extends Product with Serializable
- identifier
The identifier of the flow.
- inputs
The identifiers of nodes used as inputs to this flow.
- output
The identifier of the output that this flow writes to.
- class FlowPlanner extends AnyRef
Plans execution of
Flow
s in aDataflowGraph
by convertingFlow
s into 'FlowExecution's. - case class FlowsForTables(selectedTables: Set[TableIdentifier]) extends FlowFilter with Product with Serializable
Used in partial graph updates to select flows that flow to "selectedTables".
- trait GraphElement extends AnyRef
An element in a DataflowGraph.
- abstract class GraphExecution extends Logging
- sealed trait GraphFilter[E] extends AnyRef
Specifies how we should filter Graph elements.
- trait GraphOperations extends AnyRef
- class GraphRegistrationContext extends AnyRef
A mutable context for registering tables, views, and flows in a dataflow graph.
- trait GraphValidations extends Logging
Validations performed on a
DataflowGraph
. - trait Input extends GraphElement
Specifies an input that can be referenced by another Dataset's query.
- case class LoadTableException(name: String, cause: Option[Throwable]) extends SparkException with Product with Serializable
Exception raised when a flow fails to read from a table defined within the pipeline
Exception raised when a flow fails to read from a table defined within the pipeline
- name
The name of the table
- cause
The cause of the failure
- sealed trait Output extends AnyRef
Represents a node in a DataflowGraph that can be written to by a Flow.
Represents a node in a DataflowGraph that can be written to by a Flow. Must be backed by a file source.
- case class PersistedView(identifier: TableIdentifier, properties: Map[String, String], comment: Option[String], origin: QueryOrigin) extends View with Product with Serializable
Representing a persisted View in a DataflowGraph.
Representing a persisted View in a DataflowGraph.
- identifier
The identifier of this view within the graph.
- properties
Properties of the view
- comment
when defining a view
- class PipelineExecution extends AnyRef
Executes a DataflowGraph by resolving the graph, materializing datasets, and running the flows.
- case class PipelineTableProperty[T](key: String, default: String, fromString: (String) => T) extends Product with Serializable
- trait PipelineUpdateContext extends AnyRef
- class PipelineUpdateContextImpl extends PipelineUpdateContext
An implementation of the PipelineUpdateContext trait used in production.
- case class QueryContext(currentCatalog: Option[String], currentDatabase: Option[String]) extends Product with Serializable
Contains the catalog and database context information for query execution.
- case class QueryExecutionFailure(flowName: String, maxRetries: Int, cause: Option[Throwable]) extends RunFailure with Product with Serializable
Indicates that run has failed due to a query execution failure.
- case class QueryOrigin(language: Option[Language] = None, filePath: Option[String] = None, sqlText: Option[String] = None, line: Option[Int] = None, startPosition: Option[Int] = None, objectType: Option[String] = None, objectName: Option[String] = None) extends Product with Serializable
Records information used to track the provenance of a given query to user code.
Records information used to track the provenance of a given query to user code.
- language
The language used by the user to define the query.
- filePath
Path to the file of the user code that defines the query.
- sqlText
The SQL text of the query.
- line
The line number of the query in the user code. Line numbers are 1-indexed.
- startPosition
The start position of the query in the user code.
- objectType
The type of the object that the query is associated with. (Table, View, etc)
- objectName
The name of the object that the query is associated with.
- trait ResolutionCompletedFlow extends Flow
A Flow whose flow function has been invoked, meaning either:
A Flow whose flow function has been invoked, meaning either:
- Its output schema and dependencies are known.
- It failed to resolve.
- class ResolutionFailedFlow extends ResolutionCompletedFlow
A Flow whose flow function has failed to resolve.
- trait ResolvedFlow extends ResolutionCompletedFlow with Input
A Flow whose flow function has successfully resolved.
- case class ResolvedInput(input: Input, aliasIdentifier: AliasIdentifier) extends Product with Serializable
A wrapper for a resolved internal input that includes the alias provided by the user.
- case class RunCompletion() extends RunTerminationReason with Product with Serializable
Indicates that a triggered run has successfully completed execution.
- sealed abstract class RunFailure extends RunTerminationReason
Indicates that an run entered the failed state..
- case class RunTerminationException(reason: RunTerminationReason) extends Exception with Product with Serializable
Helper exception class that indicates that a run has to be terminated and tracks the associated termination reason.
- sealed trait RunTerminationReason extends AnyRef
- case class SomeTables(selectedTables: Set[TableIdentifier]) extends TableFilter with Product with Serializable
Used in partial graph updates to select "selectedTables".
- case class SqlGraphElementRegistrationException(msg: String, queryOrigin: QueryOrigin) extends AnalysisException with Product with Serializable
- class SqlGraphRegistrationContext extends AnyRef
SQL statement processor context.
SQL statement processor context. At any instant, an instance of this class holds the "active" catalog/schema in use within this SQL statement processing context, and tables/views/flows that have been registered from SQL statements within this context.
- class SqlGraphRegistrationContextState extends AnyRef
Data class for all state that is accumulated while processing a particular SqlGraphRegistrationContext.
- class StreamingFlow extends ResolvedFlow
A Flow that represents stateful movement of data to some target.
- trait StreamingFlowExecution extends FlowExecution with Logging
A 'FlowExecution' that processes data statefully using Structured Streaming.
- class StreamingTableWrite extends StreamingFlowExecution
A
StreamingFlowExecution
that writes a streamingDataFrame
to aTable
. - case class Table(identifier: TableIdentifier, specifiedSchema: Option[StructType], partitionCols: Option[Seq[String]], normalizedPath: Option[String], properties: Map[String, String] = Map.empty, comment: Option[String], baseOrigin: QueryOrigin, isStreamingTable: Boolean, format: Option[String]) extends TableInput with Output with Product with Serializable
A table representing a materialized dataset in a DataflowGraph.
A table representing a materialized dataset in a DataflowGraph.
- identifier
The identifier of this table within the graph.
- specifiedSchema
The user-specified schema for this table.
- partitionCols
What columns the table should be partitioned by when materialized.
- normalizedPath
Normalized storage location for the table based on the user-specified table path (if not defined, we will normalize a managed storage path for it).
- properties
Table Properties to set in table metadata.
- comment
User-specified comment that can be placed on the table.
- isStreamingTable
if the table is a streaming table, as defined by the source code.
- sealed trait TableFilter extends GraphFilter[Table]
Specifies how we should filter Tables.
- sealed trait TableInput extends Input
A type of Input where data is loaded from a table.
- case class TemporaryView(identifier: TableIdentifier, properties: Map[String, String], comment: Option[String], origin: QueryOrigin) extends View with Product with Serializable
Representing a temporary View in a DataflowGraph.
Representing a temporary View in a DataflowGraph.
- identifier
The identifier of this view within the graph.
- properties
Properties of the view
- comment
when defining a view
- case class TriggeredFailureInfo(lastFailTimestamp: Long, numFailures: Int, lastException: Throwable, lastExceptionAction: FlowExecutionAction) extends Product with Serializable
- class TriggeredGraphExecution extends GraphExecution
Executes all of the flows in the given graph in topological order.
Executes all of the flows in the given graph in topological order. Each flow processes all available data before downstream flows are triggered.
- class UncaughtExceptionHandler extends java.lang.Thread.UncaughtExceptionHandler
Uncaught exception handler which first calls the delegate and then calls the OnFailure function with the uncaught exception.
- case class UnexpectedRunFailure() extends RunFailure with Product with Serializable
Run could not be associated with a proper root cause.
Run could not be associated with a proper root cause. This is not expected and likely indicates a bug.
- case class UnionFlowFilter(oneFilter: FlowFilter, otherFilter: FlowFilter) extends FlowFilter with Product with Serializable
Returns a flow filter that is a union of two flow filters
- case class UnresolvedDatasetException(identifier: TableIdentifier) extends AnalysisException with Product with Serializable
Exception raised when a flow tries to read from a dataset that exists but is unresolved
Exception raised when a flow tries to read from a dataset that exists but is unresolved
- identifier
The identifier of the dataset
- case class UnresolvedFlow(identifier: TableIdentifier, destinationIdentifier: TableIdentifier, func: FlowFunction, queryContext: QueryContext, sqlConf: Map[String, String], comment: Option[String] = None, once: Boolean, origin: QueryOrigin) extends Flow with Product with Serializable
A Flow whose output schema and dependencies aren't known.
- case class UnresolvedPipelineException(graph: DataflowGraph, directFailures: Map[TableIdentifier, Throwable], downstreamFailures: Map[TableIdentifier, Throwable], additionalHint: Option[String] = None) extends AnalysisException with Product with Serializable
Exception raised when a pipeline has one or more flows that cannot be resolved
Exception raised when a pipeline has one or more flows that cannot be resolved
- directFailures
Mapping between the name of flows that failed to resolve (due to an error in that flow) and the error that occurred when attempting to resolve them
- downstreamFailures
Mapping between the name of flows that failed to resolve (because they failed to read from other unresolved flows) and the error that occurred when attempting to resolve them
- trait View extends GraphElement
Representing a view in the DataflowGraph.
- case class VirtualTableInput(identifier: TableIdentifier, specifiedSchema: Option[StructType], incomingFlowIdentifiers: Set[TableIdentifier], availableFlows: Seq[ResolvedFlow] = Nil) extends TableInput with Logging with Product with Serializable
A type of TableInput that returns data from a specified schema or from the inferred Flows that write to the table.
Value Members
- case object AllFlows extends FlowFilter with Product with Serializable
Used in full graph update to select all flows.
- case object AllTables extends TableFilter with Product with Serializable
Used in full graph updates to select all tables.
- object DataflowGraph extends Serializable
- object DataflowGraphTransformer
- object DatasetManager extends Logging
DatasetManager
is responsible for materializing tables in the catalog based on the given graph.DatasetManager
is responsible for materializing tables in the catalog based on the given graph. For each table in the graph, it will create a table if none exists (or if this is a full refresh), or merge the schema of an existing table to match the new flows writing to it. - object ExecutionResult
- object FlowAnalysis
- object FlowExecution
- object GraphElementTypeUtils
- object GraphErrors
Collection of errors that can be thrown during graph resolution / analysis.
- object GraphExecution extends Logging
- object GraphIdentifierManager
Responsible for properly qualify the identifiers for datasets inside or referenced by the dataflow graph.
- object GraphRegistrationContext
- object IdentifierHelper
- case object NoFlows extends FlowFilter with Product with Serializable
Used to specify that no flows should be refreshed.
- case object NoTables extends TableFilter with Product with Serializable
Used to select no tables.
- object PartitionHelper
- object PipelinesErrors extends Logging
- object PipelinesTableProperties
Interface for validating and accessing Pipeline-specific table properties.
- object QueryOrigin extends Logging with Serializable
- object SqlGraphElementRegistrationException extends Serializable
- object SqlGraphRegistrationContext
- object TriggeredGraphExecution
- object UncaughtExceptionHandler
- object ViewHelpers