JSON is typically schemaless: each object in a collection can differ in structure and data types. This poses a challenge for Parquet files, which require a fixed schema.
In most cases, provided the API or NoSQL database returns some data, a schema can be inferred automatically when the results are written to a Parquet file.
However, if the query matches no records, there is no data from which to infer a schema, so the Parquet file can't be written and a later step in your analysis may fail.
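To make the failure mode concrete, here is a minimal sketch using PyArrow (an assumption for illustration; the pipeline's actual writer isn't specified). A single record is enough to infer a schema, but an empty result set yields a table with no columns at all:

import pyarrow as pa

# One record is enough to infer column names and types.
records = [{"CustomerReference": "C001", "InvoiceTotal": 125.50}]
inferred = pa.Table.from_pylist(records)
print(inferred.schema)  # CustomerReference: string, InvoiceTotal: double

# An empty result set yields a table with zero columns: there is
# nothing to infer from, so a Parquet file written from it lacks
# the columns that downstream queries expect.
empty = pa.Table.from_pylist([])
print(empty.schema)  # no fields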
An example
Let's say that you retrieve a list of Customers from a Microsoft SQL Server database and store the result in customers.parquet.
You then create sales.parquet by querying a MongoDB instance for any sales invoices belonging to that set of customers.
Finally, you use a DuckDB query to join the two Parquet datasets and sum the total value of sales for each customer:
SELECT c.CustomerReference,
       COALESCE(SUM(s.InvoiceTotal), 0.00) AS SalesTotal
FROM 'customers.parquet' AS c
LEFT JOIN 'sales.parquet' AS s ON c.CustomerReference = s.CustomerReference
GROUP BY c.CustomerReference
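To run the aggregation from a script, a minimal sketch using DuckDB's Python client (the choice of client is an assumption; the CLI works equally well):

import duckdb

# Join the two Parquet files and sum sales per customer;
# .df() materialises the result as a pandas DataFrame.
totals = duckdb.sql("""
    SELECT c.CustomerReference,
           COALESCE(SUM(s.InvoiceTotal), 0.00) AS SalesTotal
    FROM 'customers.parquet' AS c
    LEFT JOIN 'sales.parquet' AS s ON c.CustomerReference = s.CustomerReference
    GROUP BY c.CustomerReference
""").df()
print(totals)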
Without a defined schema, the second Parquet file fails to be written when the MongoDB query matches no invoices, since there is nothing from which to infer one.
But failure isn't the correct outcome here: each customer should simply be reported with a SalesTotal of 0.00, which is exactly what the LEFT JOIN and COALESCE above produce once an empty sales.parquet exists.
Providing a schema lets us create that empty Parquet dataset when the query returns no data, allowing subsequent steps to complete.
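As a sketch of how an explicit schema solves this, again using PyArrow (the column names and types below are assumptions based on the example above): declare the schema up front, then fall back to an empty table that carries that schema when no rows come back.

import pyarrow as pa
import pyarrow.parquet as pq

# Assumed schema for the sales dataset. InvoiceTotal is declared as a
# decimal; rows are assumed to carry decimal.Decimal values.
sales_schema = pa.schema([
    ("CustomerReference", pa.string()),
    ("InvoiceTotal", pa.decimal128(18, 2)),
])

rows = []  # the (possibly empty) result of the MongoDB query

# With rows, build the table against the declared schema; without,
# write an empty table that still carries the full schema so the
# DuckDB join can read the expected columns.
table = (pa.Table.from_pylist(rows, schema=sales_schema)
         if rows else sales_schema.empty_table())
pq.write_table(table, "sales.parquet")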
Defining a schema in YAML
For any JSON-based data source, you can specify the columns for the output Parquet file and their data types: