Evolving schema with Parquet files in hive

Different versions of parquet used in different tools (presto, spark, hive) may handle schema changes slightly
differently, causing a lot of headaches.

Parquet basically only supports the addition of new columns, but what if we have a change like the following :
– renaming of a column
– changing the type of a column, including from scalar to array
– not changing the type, but the meaning of a column
(for instance, switching from pounds to Kilos)

We want to support compatibility both ways, that is, forward and backward compatibility

That is, after we upgrade the schema, we should have –
 1. Backward compatibility: old queries would still run and be correct
 2. Forward compatibility : old data will work with new queries

Demonstration

— schema, v1
DROP TABLE IF EXISTS people_data_v1;
CREATE TABLE people_data_v1 (
name string,
weight double,
gender string
)
PARTITIONED BY (ingest_date string)
STORED AS PARQUET;

— some test values
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE people_data_v1 PARTITION(ingest_date=’2018-12-24′) VALUES
(‘Eddard Stark’, 180.0, “Male”),
(‘Rob Stark’, 126.0, “Male”);

INSERT INTO TABLE people_data_v1 PARTITION(ingest_date=’2018-12-25′) VALUES
(‘Arya Stark’, 60.0, “Female”),
(‘Sansa Stark’, 80.0, “Female”);

select * from people_data_v1 where ingest_date=’2018-12-24′;

We are going to create a new table for each schema, to make sure we don’t unintentionally mix data with
different schemata. We are going to call :
– the tables with the pattern people_data_v??
– the views as people_v??

with the implicit contract that under a view version, data would look like that version. This way we can
keep backward compatibility. We can keep an extra top view ( master view ) called people
that points to the most recent view

Create a Master always pointing to latest version

DROP VIEW IF EXISTS people;
CREATE VIEW IF NOT EXISTS people( name, weight, gender, ingest_date)
PARTITIONED ON (ingest_date) AS select * from people_data_v1;

–Check data
select * from people;

Note that the table is partitioned by date. It is really important for partition pruning in hive to work that the views are aware of the partitioning schema of the underlying tables. Hive will do the right thing, when querying using the partition, it will go through the views and use the partitioning information to limit the amount of data it will read from disk.

Schema alteration

Let’s suppose that a new column age is to be added

— schema, v2
DROP TABLE IF EXISTS people_data_v2;
CREATE TABLE people_data_v2 (
name string,
weight double,
gender string,
age int
)
PARTITIONED BY (ingest_date string)
STORED AS PARQUET;

— some test values
INSERT INTO TABLE people_data_v2 PARTITION(ingest_date=’2018-12-26′) VALUES
(‘Aegon Targaryen’, 200.0, “Male”,60),
(‘Daenerys Targaryen’, 90.0, “Female”,24),
(‘Jon Snow’, 135.0, “Male”,24);

INSERT INTO TABLE people_data_v2 PARTITION(ingest_date=’2018-12-27′) VALUES
(‘Tywin Lannister’, 160.0, “Male”,65),
(‘Jamie Lannister’, 140.0, “Male”,41),
(‘Tyrion Lannister’, 60.0, “Male”,38),
(‘Cersei Lannister’, 80.0, “Female”,40);

select * from people_data_v2 where ingest_date=’2018-12-27′;

Schema Evolution for backward and forward compatibility

Create a new partitioned view for forward compatibility

Objective is to make new data look like old. Old consumers will see the new data but will not see new columns

— create view for backward compatibility
— make new data look like old one
DROP VIEW IF EXISTS people_v1;
CREATE VIEW people_v1 (name, weight, gender, ingest_date)
PARTITIONED ON (ingest_date) AS
SELECT * from people_data_v1
UNION ALL
SELECT name, weight, gender, ingest_date from people_data_v2 ;

— Check
select * from people_v1;

This also satisfies our rule that under a view version, data would look like that version. So under view
people_v1 data looks like people_data_v1

Create a new partitioned view for backward compatibility

Objective is to make old data look like new. New consumers will see the new data along with old data
with newer field having default values

— forward compatibility
— old data will look like new data
DROP VIEW IF EXISTS people_v2;
CREATE VIEW people_v2 (name, weight, gender, age, ingest_date)
PARTITIONED ON (ingest_date) AS
SELECT name,weight, gender, NULL AS age, ingest_date from people_data_v1
UNION ALL
SELECT * from people_data_v2 ;

— Check
select * from people_v2;

This also satisfies our rule that under a view version, data would look like that version. So under view
people_v2 data looks like people_data_v2

Lastly, set master view to the most recent one

DROP VIEW IF EXISTS people;
CREATE VIEW IF NOT EXISTS people(name, weight, gender, age, ingest_date)
PARTITIONED ON (ingest_date) AS select * from people_v2;

–Check
Select * from people;

Using views to abstract schema changes is cheap and extremely easy to roll back in case your users are not ready to upgrade to the new schema

Summary

  • Create versioned data tables to reflect the point-in-time schema
  • Create sub-views to accommodate forward and backward compatibility
  • Create a master view, and update it whenever the schema changes

Reference

This post is just a modified version of : this post