Defining schemas for JSON

JSON is typically schemaless: each object in a collection can differ in structure and data types. This poses a challenge for Parquet files, which require a fixed schema.
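For example, both of these objects could legitimately sit in the same collection, even though the second adds a field and changes a type (the data shown is purely illustrative):

```json
[
  { "FirstName": "Ada",   "Age": 36 },
  { "FirstName": "Grace", "Age": "unknown", "Children": true }
]
```

A Parquet file, by contrast, must commit to a single type for each column up front.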

In most cases, provided the API or NoSQL database returns data, we can automatically infer a schema when we write the flattened JSON data to a Parquet file.

However, if no data is returned because the query matched no records, there's no data from which to infer a schema, so the Parquet file can't be written. A later step in your analysis that depends on that file may then fail.

An example

Let's say that you retrieve a list of Customers from a Microsoft SQL Server database and store the result in customers.parquet.

You then create sales.parquet by querying a MongoDB instance for any sales invoices for the set of customers in the first Parquet file.

Finally, you use a DuckDB query to join the two Parquet datasets together and sum the total value of sales by customer:

SELECT    c.CustomerReference,
          SUM(s.InvoiceTotal) AS SalesTotal
FROM      'customers.parquet' AS c
JOIN      'sales.parquet' AS s ON c.CustomerReference = s.CustomerReference
GROUP BY  c.CustomerReference

If the MongoDB query matched no sales invoices, sales.parquet is never written, and the DuckDB query fails because the file doesn't exist. But failure isn't the correct outcome here: what should happen is that each customer is given a SalesTotal value of 0.00.
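Once an empty sales.parquet can always be written (by defining its schema, as described below), the query itself also needs a small adjustment so that customers without any invoices receive 0.00 rather than being dropped by the inner join. A sketch, assuming the same column names as above:

```sql
SELECT    c.CustomerReference,
          COALESCE(SUM(s.InvoiceTotal), 0.00) AS SalesTotal
FROM      'customers.parquet' AS c
LEFT JOIN 'sales.parquet' AS s ON c.CustomerReference = s.CustomerReference
GROUP BY  c.CustomerReference
```

The LEFT JOIN keeps every customer row even when no matching sales exist, and COALESCE converts the resulting NULL sum into 0.00.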

Defining a schema in YAML

For any JSON-based datasource, you can specify the columns for the output Parquet file and their data types:

columns:
- name: FirstName
  type: string
- name: LastName
  type: string
- name: Age
  type: int64
- name: Salary
  type: double
- name: Children
  type: bool
- name: BirthDate
  type: timestamp(s)

All fields defined this way are treated as nullable.

All data types

The full list of available data types is as follows:

| Type | Description | Further Detail |
| --- | --- | --- |
| bool | Boolean | true or false |
| int8 | Char | -128 to 127 |
| int16 | Short | -32,768 to 32,767 |
| int32 | Int | -2,147,483,648 to 2,147,483,647 |
| int64 | Long | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
| uint8 | Unsigned Char | 0 to 255 |
| uint16 | Unsigned Short | 0 to 65,535 |
| uint32 | Unsigned Int | 0 to 4,294,967,295 |
| uint64 | Unsigned Long | 0 to 18,446,744,073,709,551,615 |
| float16 | Half-precision floating point | ~5.96e-8 (subnormal; ~6.10e-5 normal) to 65,504, with ~4 significant decimal digits of precision |
| float32 | Single-precision floating point | ~1.18e-38 to ~3.40e38, with 6–9 significant decimal digits of precision |
| float64 | Double-precision floating point | ~2.23e-308 to ~1.80e308, with 15–17 significant decimal digits of precision |
| time32(s) | Time of day in seconds | |
| time32(ms) | Time of day in milliseconds | |
| time64(us) | Time of day in microseconds | |
| time64(ns) | Time of day in nanoseconds | |
| timestamp(s) | Timestamp with second precision | |
| timestamp(ms) | Timestamp with millisecond precision | |
| timestamp(us) | Timestamp with microsecond precision | |
| timestamp(ns) | Timestamp with nanosecond precision | |
| date32 | Date (days) | Days since the Unix epoch (1 January 1970) |
| date64 | Date (ms) | Milliseconds since the Unix epoch (1 January 1970) |
| binary | Variable-length binary | |
| string | Variable-length string | UTF-8 encoding |
| large_binary | Large variable-length binary | |
| large_string | Large variable-length string | UTF-8 encoding |
| decimal128(p,s) | Decimal with precision and scale | decimal128(int precision, int scale=0) |
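Returning to the earlier example, a schema definition for sales.parquet might look like the following (the field names and chosen types are assumptions for illustration):

```yaml
columns:
- name: CustomerReference
  type: string
- name: InvoiceTotal
  type: decimal128(10,2)
- name: InvoiceDate
  type: timestamp(s)
```

With this schema in place, an empty sales.parquet is still written when the MongoDB query matches no invoices, and the downstream DuckDB join can run successfully.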
