A data lake makes structured and unstructured data available at scale within a company. Spark connectors allow users to retrieve data from the data lake.
A data lake is a reliable collection of transformed data coming from, and destined for, various business units. But a data lake is worth nothing if users cannot query the data it holds. Users want to get the data, crunch it, and visualize it. That is why connectors from Spark to data visualization formats, such as Tableau's, are a necessary step in data engineering.
Connectors in Spark are interfaces used to write Resilient Distributed Datasets (aka RDDs, the distributed collections behind Spark dataframes) to external storage systems.
Spark comes with a lot of pre-packaged connectors within the Spark SQL API. For instance, the write connector makes it very easy to write a dataframe to CSV with a single line of code:
dataframe.write.csv('mycsv.csv')
Other supported standard formats are:
- text files
- JSON and CSV
- Parquet
- ORC
- Cassandra
- Elasticsearch
In some cases, however, you will be compelled to work with more exotic formats. Tableau, Alteryx, Microsoft: major data software companies have developed their own formats for big data. Tableau's corporate solution Tableau Server, for instance, uses either Tableau Data Extract (.tde) or, more recently, Hyper (.hyper) as the storage format for its data tables.
A connector divides into three parts: converting the Spark dataframe to the target format, exporting the resulting file to the cloud, and making the whole pipeline easy to use. Let's dive into each part of the connector!
1 — Convert a Spark dataframe to your target format
The proper way to convert a dataframe to your target format is to proceed partition by partition. An RDD is distributed in partitions, which are not directly accessible from the cluster's driver where the Python code is running.
For each partition of the RDD, collect the data within that partition to the driver. You can then go through the collected partition and insert it into your target file row by row.
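As an illustration, here is a minimal sketch of that loop, assuming a PySpark dataframe named dataframe, an already opened target_file, and a vendor-specific convert_and_insert helper (discussed just below):

```python
# Collect one partition at a time so that only that partition
# needs to fit in the driver's memory.
num_partitions = dataframe.rdd.getNumPartitions()

for i in range(num_partitions):
    # Keep only the rows of partition i; the other partitions stay on the executors.
    partition_rows = (
        dataframe.rdd
        .mapPartitionsWithIndex(lambda idx, rows, i=i: rows if idx == i else iter([]))
        .collect()
    )
    # Append the collected rows to the target file through the vendor API.
    for row in partition_rows:
        convert_and_insert(row, target_file)
```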
The function convert_and_insert relies on libraries provided by data vendors. For Tableau formats, for instance, you can refer to the Tableau SDK (for .tde) or the Extract API 2.0 (for .hyper), both of which offer C++, Java, and Python APIs.
This method supposes that you have full control:
- over the driver's memory (which can be set for the current Spark session through spark.driver.memory), because the partitions collected to the driver one after another must each fit in it, and
- over the partitioning of the source dataframe, because no partition should exceed the driver's memory, in order to avoid OutOfMemory errors.
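As a sketch of how to guard against both pitfalls (memory size, application name, source path, and partition count below are placeholder values, not recommendations):

```python
from pyspark.sql import SparkSession

# Driver memory must be set before the session (and its JVM) is created.
# Assumption: 8g is enough to hold the largest single partition on the driver.
spark = (
    SparkSession.builder
    .appName("spark-to-tableau-connector")   # hypothetical application name
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)

# Hypothetical source table; repartition so that no single partition
# comes close to the driver's memory limit.
dataframe = spark.read.parquet("s3://my-bucket/my-table")
dataframe = dataframe.repartition(200)
```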
2 — Export the source file to the cloud
Once the data is converted to the proper format, it can be exported to the cloud and made available to users. For instance, I use the Tableau REST API to publish Tableau files to Tableau Server. Every vendor provides developers with dedicated APIs to publish data to the cloud.
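For example, here is a minimal sketch using tableauserverclient, the Python wrapper around the Tableau REST API; the server URL, credentials, project id, and file name are placeholders:

```python
import tableauserverclient as TSC

# Placeholder credentials and server address.
tableau_auth = TSC.TableauAuth("my_user", "my_password", site_id="my_site")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(tableau_auth):
    # Publish (or overwrite) the converted extract as a datasource in a given project.
    datasource = TSC.DatasourceItem(project_id="my-project-id")
    datasource = server.datasources.publish(
        datasource,
        "my_table.hyper",
        TSC.Server.PublishMode.Overwrite,
    )
    print(f"Published datasource {datasource.id}")
```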
Not all APIs, however, are well documented, so my advice is to clone the project directly and dive into the code to check whether the feature you need is already implemented. If it is still missing, you will even be able to submit a pull request.
3 — Make the Spark connector easy to use for your users
I built a command-line interface in Python on top of my connector. The user chooses a source environment and a target environment (as represented in figure 1 above). These options are parsed, mapped to a configuration file, and the two services described in part 1 (convert) and part 2 (export) are triggered sequentially to publish the formatted data to the cloud.
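A rough, argparse-based sketch of such a CLI could look like this; the connector.convert and connector.export helpers, the config.yml file, and the environment names are all hypothetical:

```python
import argparse

import yaml  # assumption: environments are described in a YAML configuration file

# Hypothetical helpers wrapping part 1 (conversion) and part 2 (export) of the connector.
from connector import convert, export


def main():
    parser = argparse.ArgumentParser(
        description="Convert a Spark dataframe and publish it to Tableau Server."
    )
    parser.add_argument("--source-env", required=True, help="environment to read the dataframe from")
    parser.add_argument("--target-env", required=True, help="environment to publish the extract to")
    args = parser.parse_args()

    # Map the chosen environments to concrete settings (paths, server URLs, credentials).
    with open("config.yml") as f:
        config = yaml.safe_load(f)
    source_settings = config[args.source_env]
    target_settings = config[args.target_env]

    # Part 1: convert the source dataframe to the target format (e.g. .hyper).
    local_file = convert(source_settings)
    # Part 2: publish the converted file to the cloud (e.g. Tableau Server).
    export(local_file, target_settings)


if __name__ == "__main__":
    main()
```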
Thanks to Arnaud, Irina Stolbova, Florian Carra, and Nicolas Jean.
If you are looking for Data Engineering experts, don't hesitate to contact us!