Exploring Complex Types in BigQuery
Introduction
Google’s BigQuery has support for complex types (arrays & structs) which are relatively new in analytical databases. While the ideas and of arrays and structs aren’t unique to BigQuery some of the syntax and capabilities are unique. In this post I’ll be going over what I’ve found to be the most useful patterns and tricks.
Arrays
Put plainly an array is a series of values of the same type stored within a single value.
You can create array literals via brackets []
as demonstrated by the following snippet.
select [1, 2, 3] as array_of_ints
You can also explicitly declare the type of an array as follows.
select
['2018-01-01', '2018-02-01', '2018-03-01'] as array_of_string,
array<date>['2018-01-01', '2018-02-01', '2018-03-01'] as array_of_date
Structs
A struct is a grouping of values that need not be of the same type and is very similar to the concept of tuples. They are commonly used to group related values together. You can create struct literals using the function struct
select struct(1 as id, 2 as value) as user_info
Purpose
Arrays and structs allow for a more compact organization of related data which makes writing and reading many queries easier. So while BigQuery is the engine that I’m covering today I expect these concepts to spread to other databases as knowledge of their utility spreads.
Basic Usage
Example DataSet
I’ll be using the following CTE to demonstrate various functions for arrays & structs.
with data_sample as (
select
1 as id_race,
date'2018-08-01' as date_race,
[3, 4] as id_participants,
[ struct(7.0 as distance, 1 as lap_number, [struct(3 as id_participant, 1 as position),
struct(4 as id_participant, 2 as position)] as finish_order),
struct(6.5 as distance, 2 as lap_number, [struct(3 as id_participant, 1 as position),
struct(4 as id_participant, 2 as position)] as finish_order),
struct(7.2 as distance, 3 as lap_number, [struct(4 as id_participant, 1 as position),
struct(3 as id_participant, 2 as position)] as finish_order)
] as race_laps
union all
select
2 as id_race,
date'2018-08-08' as date_race,
[3, 5] as id_participants,
[ struct(7.5 as distance, 1 as lap_number, [struct(5 as id_participant, 1 as position),
struct(3 as id_participant, 2 as position)] as finish_order),
struct(7.4 as distance, 2 as lap_number, [struct(5 as id_participant, 1 as position),
struct(3 as id_participant, 2 as position)] as finish_order),
struct(7.3 as distance, 3 as lap_number, [struct(5 as id_participant, 1 as position),
struct(3 as id_participant, 2 as position)] as finish_order)
] as race_laps
)
Access an individual element from an array
with data_sample as (
--See above
--Note: these lines will be omitted from subsequent examples
)
select
ds.id_participants[offset(0)] as first_participant, --zero based
ds.id_participants[ordinal(1)] as first_participant_also --one based
from data_sample as ds
Determine the length of an array
select
array_length(ds.id_participants) as number_of_participants
from data_sample as ds
Access values & structs within an array
There are a couple of different ways to interact with arrays in BigQuery. The following three examples show different ways to access the example data structure and calculate the total distance for each race.
by joining the lap array
select
ds.id_race,
sum(rl.distance) as race_distance
from data_sample as ds
join ds.race_laps as rl
group by 1
by unnesting the lap array
select
ds.id_race,
sum(rl.distance) as race_distance
from data_sample as ds,
unnest(ds.race_laps) as rl
group by 1
by using at inline query
select
ds.id_race,
(select sum(rl.distance) from unnest(ds.race_laps) as rl) as race_distance
from data_sample as ds
by joining multiple arrays
The following query returns a list of race ids & participants ids & and a comma separated string showing that participants place each lap of each race.
select
ds.id_race,
fo.id_participant,
string_agg(cast(fo.position as string), ', ' order by rl.lap_number) lap_positions
from data_sample as ds
join ds.race_laps as rl
join rl.finish_order as fo
group by 1, 2
Filtering by contents of an array
select
ds.id_race
from data_sample as ds
where 3 in unnest(ds.id_participants)