Defining schemas for JSON
JSON is typically schemaless: each JSON object in a collection can differ in structure and data types. This poses a challenge for Parquet files, which require a fixed schema.
In most cases, provided the API or NoSQL database returns data, we can infer a schema automatically when we write the flattened JSON data to a Parquet file.
However, if the query matched no records, there is no data from which to infer a schema, so the Parquet file can't be written, and a later step in your analysis may then fail.
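The type names used later on this page mirror Apache Arrow's, so as a rough illustration of the problem (assuming PyArrow here, which is an assumption rather than a statement about the internals), inference needs at least one record to work from:

```python
# Illustrative sketch only; assumes PyArrow, not necessarily what's used internally.
import pyarrow as pa

records = [{"FirstName": "Ada", "Age": 36}]
table = pa.Table.from_pylist(records)   # schema inferred from the data itself
print(table.schema)                     # FirstName: string, Age: int64

empty = pa.Table.from_pylist([])        # no records, so there is nothing to infer
print(empty.schema)                     # an empty schema: no columns to write
```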
An example
Let's say that you retrieve a list of customers from a Microsoft SQL Server database and store the result in customers.parquet.
You then create sales.parquet by querying a MongoDB instance for any sales invoices for the set of customers in the first Parquet file.
Finally, you use a DuckDB query to join the two Parquet datasets and sum the total value of sales by customer:
SELECT c.CustomerReference,
       COALESCE(SUM(s.InvoiceTotal), 0.00) AS SalesTotal
FROM 'customers.parquet' AS c
LEFT JOIN 'sales.parquet' AS s ON c.CustomerReference = s.CustomerReference
GROUP BY c.CustomerReference
Without defining a schema, the second Parquet file can't be written when the query returns no results, because there is nothing from which to infer a schema.
But failure isn't the correct outcome here: what should happen is that each customer is given a SalesTotal value of 0.00.
Providing a schema allows us to create an empty Parquet dataset when your query returns no data, so subsequent steps can still complete.
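As a sketch of the idea (again assuming PyArrow, with a simplified two-column layout for the hypothetical sales.parquet), an explicit schema lets a zero-row file be written where inference would have nothing to go on:

```python
# Sketch only; assumes PyArrow and a simplified two-column sales layout.
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("CustomerReference", pa.string()),
    ("InvoiceTotal", pa.float64()),
])

# Zero rows, but the columns and types are known, so the file can be written
# and the DuckDB query above still runs, giving each customer a SalesTotal of 0.00.
pq.write_table(schema.empty_table(), "sales.parquet")
```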
Defining a schema in YAML
For any JSON-based datasource, you can specify the columns for the output Parquet file and their data types:
columns:
  - name: FirstName
    type: string
  - name: LastName
    type: string
  - name: Age
    type: int64
  - name: Salary
    type: double
  - name: Children
    type: bool
  - name: BirthDate
    type: timestamp[s]
All fields are considered nullable.
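Assuming the type names map directly onto Apache Arrow types (an assumption, not something the YAML itself states), the example above corresponds roughly to this Arrow schema:

```python
# Assumed Arrow equivalent of the YAML above, for illustration only.
import pyarrow as pa

schema = pa.schema([
    pa.field("FirstName", pa.string()),
    pa.field("LastName", pa.string()),
    pa.field("Age", pa.int64()),
    pa.field("Salary", pa.float64()),
    pa.field("Children", pa.bool_()),
    pa.field("BirthDate", pa.timestamp("s")),
])  # pa.field is nullable by default, matching the note above
```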
Creating a schema automatically
We can create a column schema for you!
Simply paste a handful of JSON records into our schema tool and we'll return the corresponding columns YAML.
Make sure the JSON you provide is already flat (flattened/transformed) before pasting it into the schema tool.
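If you prefer to script this, here is a rough sketch of the same idea using PyArrow's inference (hypothetical code, not the actual schema tool):

```python
# Rough sketch of turning sample JSON records into columns YAML (not the real tool).
import json
import pyarrow as pa

sample = '[{"FirstName": "Ada", "LastName": "Lovelace", "Age": 36, "Salary": 101000.0}]'
schema = pa.Table.from_pylist(json.loads(sample)).schema

print("columns:")
for field in schema:
    print(f"  - name: {field.name}")
    print(f"    type: {field.type}")
```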
All data types
The full list of available data types is as follows:
| Type | Description | Further Detail |
| --- | --- | --- |
| bool | Boolean | true or false |
| int8 | Char | -128 to 127 |
| int16 | Short | -32,768 to 32,767 |
| int32 | Int | -2,147,483,648 to 2,147,483,647 |
| int64 | Long | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
| uint8 | Unsigned Char | 0 to 255 |
| uint16 | Unsigned Short | 0 to 65,535 |
| uint32 | Unsigned Int | 0 to 4,294,967,295 |
| uint64 | Unsigned Long | 0 to 18,446,744,073,709,551,615 |
| float16 | Half-precision floating point | ~5.96e-8 (subnormal; ~6.10e-5 normal) to 65,504, with about 4 significant decimal digits of precision |
| float32 | Single-precision floating point | ~1.18e-38 to ~3.40e38, with 6 to 9 significant decimal digits of precision |
| float64 | Double-precision floating point | ~2.23e-308 to ~1.80e308, with 15 to 17 significant decimal digits of precision |
| time32(s) | Time of day in seconds | |
| time32(ms) | Time of day in milliseconds | |
| time64(us) | Time of day in microseconds | |
| time64(ns) | Time of day in nanoseconds | |
| timestamp(s) | Timestamp with second precision | |
| timestamp(ms) | Timestamp with millisecond precision | |
| timestamp(us) | Timestamp with microsecond precision | |
| timestamp(ns) | Timestamp with nanosecond precision | |
| date32 | Date (days) | Days since the Unix epoch (1 January 1970) |
| date64 | Date (ms) | Milliseconds since the Unix epoch (1 January 1970) |
| binary | Variable-length binary | |
| string | Variable-length string | UTF-8 encoding |
| large_binary | Large variable-length binary | |
| large_string | Large variable-length string | UTF-8 encoding |
| decimal128(p,s) | Decimal with precision and scale | decimal128(int precision, int scale=0) |
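For instance, decimal128(18,2) holds up to 18 significant digits with 2 of them after the decimal point; assuming the mapping onto Arrow types holds, that is roughly equivalent to:

```python
# Illustrative only; assumes the type names map onto Apache Arrow types.
import pyarrow as pa

invoice_total = pa.decimal128(18, 2)   # 18 significant digits, 2 after the decimal point
schema = pa.schema([("InvoiceTotal", invoice_total)])
```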