SparkPipelineFramework
PyPI: https://pypi.org/project/sparkpipelineframework/
Source Code: https://github.com/icanbwell/SparkPipelineFramework
Documentation: https://icanbwell.github.io/SparkPipelineFramework/index.html
This package makes it easy to create data pipelines from individual steps and provides a number of Spark Transformers for common operations. Much of the functionality is implemented as standard Spark Transformers, which can be combined into a Pipeline or run individually.
Here’s a short summary of DataFrames, Transformers, and Pipelines in Apache Spark (Databricks); refer to the Spark or Databricks documentation for detailed descriptions:
DataFrames
- DataFrames are distributed collections of data organized into named columns, similar to a table in a relational database or a data frame in R/Python.
- DataFrames support various data formats such as CSV, JSON, Parquet, and Avro, and allow operations such as filtering, grouping, and aggregation (see the example below).
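A minimal PySpark sketch, assuming a local file named patients.csv with age and state columns (the file and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Read a CSV file into a DataFrame, then filter, group, and aggregate.
df = spark.read.csv("patients.csv", header=True, inferSchema=True)
df.filter(df["age"] > 40).groupBy("state").count().show()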
Transformers
- A Transformer is an algorithm which can transform one DataFrame into another DataFrame.
- This can include aggregations, mapping data, reading/writing data, and more (a minimal custom Transformer is sketched below).
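As a sketch (not a transformer from this package), a custom Transformer in PySpark subclasses pyspark.ml.Transformer and implements _transform; the class and column names here are hypothetical:

from pyspark.ml import Transformer
from pyspark.sql import DataFrame
from pyspark.sql.functions import upper

class UppercaseNameTransformer(Transformer):
    # Illustrative Transformer that upper-cases a "name" column.
    def _transform(self, df: DataFrame) -> DataFrame:
        return df.withColumn("name", upper(df["name"]))

# df is any DataFrame that has a "name" column.
result_df = UppercaseNameTransformer().transform(df)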
Estimators
- An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
- Estimators are typically used for AI and machine learning: an estimator trains on data, and the resulting model can then transform new data, e.g., by producing predictions (see the example below).
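For example, Spark MLlib's LogisticRegression is an Estimator; training_df and test_df below are placeholder DataFrames with "features" and "label" columns:

from pyspark.ml.classification import LogisticRegression

# fit() trains on a DataFrame and returns a Transformer (the model).
lr = LogisticRegression(maxIter=10)
model = lr.fit(training_df)
# The fitted model transforms new data, adding a "prediction" column.
predictions = model.transform(test_df)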
Pipelines
- A Pipeline chains multiple Transformers and Estimators together to specify a data or AI workflow.
- A pipeline consists of a sequence of stages, where each stage is either a transformer or an estimator.
- By using a pipeline, we can stitch together various transformers into a complete workflow. The same transformers can be reused in other pipelines, enabling code sharing between pipelines. A minimal example appears below.
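A minimal sketch using pyspark.ml.Pipeline; the stages and training_df are illustrative:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Each stage is a Transformer or an Estimator; the Pipeline runs them in order.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
# fit() trains any Estimator stages and returns a PipelineModel (a Transformer).
model = pipeline.fit(training_df)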
We currently don’t open source our estimators, although we may do so in the future.
See the Spark documentation for details:
DataFrame: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html
Transformer: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Transformer.html
Estimator: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Estimator.html
Pipeline: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Pipeline.html
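For example, the FhirReceiver transformer (v2) from this package can be run on its own to retrieve FHIR resources into Spark views. In the snippet below, resource_type, fhir_server_url, temp_patients_path, the auth values, parameters, limit, schema, and df are placeholders you supply; the ProgressLogger import path is assumed from this package's source tree: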
from spark_pipeline_framework.progress_logger.progress_logger import ProgressLogger
from spark_pipeline_framework.transformers.fhir_receiver.v2.fhir_receiver import (
    FhirReceiver,
)

with ProgressLogger() as progress_logger:
    print("Running FhirReceiver")
    FhirReceiver(
        resource=resource_type,  # FHIR resource type to request, e.g., "Patient"
        server_url=fhir_server_url,  # base URL of the FHIR server
        file_path=temp_patients_path,  # where the received resources are written
        progress_logger=progress_logger,
        run_synchronously=False,
        parameters=parameters,
        auth_client_id=client_id,  # OAuth client credentials
        auth_client_secret=client_secret,
        auth_well_known_url=auth_well_known_url,
        auth_scopes=auth_scopes,
        view="results_view",  # view that will hold the received resources
        error_view="error_view",  # view that will hold any errors
        limit=limit,
        schema=schema,
        use_data_streaming=True,
        log_level="DEBUG",
    ).transform(df)
    print("Finished FhirReceiver")
