PETL

petl is a Python package for Extracting, Transforming and Loading tables of data.

In this tutorial we will show how to use it to simulate a complete ETL pipeline.

Data structure

We will use the data present in https://opendata.ecdc.europa.eu/covid19/casedistribution/json/. The structure of the json file is the following:

{
  "records": [
    {
      "dateRep": "21/04/2020",
      "day": "21",
      "month": "4",
      "year": "2020",
      "cases": "35",
      "deaths": "2",
      "countriesAndTerritories": "Afghanistan",
      "geoId": "AF",
      "countryterritoryCode": "AFG",
      "popData2018": "37172386",
      "continentExp": "Asia"
    },
    {
    ...
    }
  ]
}

So it is not possible to load the JSON file directly into petl.

Instead we will use a Python list containing the "records" array from the JSON file.

In [1]:
import petl as etl
import requests
import json
import os.path # To check if a file already exists
import datetime # to create a new file every day
import matplotlib.pyplot as plt # only for the plotting examples
import pandas as pd # only for the pandas examples
import math
In [2]:
today = datetime.date.today()
filename = "covid19-%s.json" % datetime.date.strftime(today, "%Y%m%d")

Extract data from opendata.ecdc.europa.eu

In [3]:
# Check if today file exists
if not os.path.isfile(filename):
    # if not, download the data and write it to the file
    conn = requests.get("https://opendata.ecdc.europa.eu/covid19/casedistribution/json/")
    jsondata = conn.text
    with open(filename, "w") as f:
        f.write(jsondata)

Load json file

In [4]:
with open(filename) as f:
    dict_data = json.load(f)

Transform data into a petl table

In [5]:
table = etl.fromdicts(dict_data["records"]) # create a petl table from a dict array
table
Out[5]:
dateRep day month year cases deaths countriesAndTerritories geoId countryterritoryCode popData2018 continentExp
21/04/2020 21 4 2020 35 2 Afghanistan AF AFG 37172386 Asia
20/04/2020 20 4 2020 88 3 Afghanistan AF AFG 37172386 Asia
19/04/2020 19 4 2020 63 0 Afghanistan AF AFG 37172386 Asia
18/04/2020 18 4 2020 51 1 Afghanistan AF AFG 37172386 Asia
17/04/2020 17 4 2020 10 4 Afghanistan AF AFG 37172386 Asia

...

Problems with data

The data in the opendata.ecdc.europa.eu table have some problems:

  1. All values are stored as strings
  2. dateRep is in day/month/year format, which is not sortable (not ISO)
  3. Some columns are redundant (day, month, year, ...)
In [6]:
print(etl.typeset(table, "cases")) # cases column is in string format
print(table["dateRep"][0]) # bad format, impossible to sort
{'str'}
21/04/2020

Apply conversions

In [7]:
table = etl.convert(table, ["cases", "deaths", "day", "month", "year", "popData2018"], float)  # convert numeric columns to float
print(etl.typeset(table, "cases")) # now the cases column is float
{'float'}

Apply transformation: add a new calculated field date

In [8]:
### Create date as a transformation of dateRep (yyyy/mm/dd)
table = etl.addfield(table, "date", lambda row: row["dateRep"][6:]+row["dateRep"][2:6]+row["dateRep"][0:2])
print(table["date"][0]) # check the sortable yyyy/mm/dd form by printing the first element of the column
2020/04/21
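A more robust alternative to string slicing, assuming the dd/mm/yyyy layout, is to parse the date with the standard datetime module (to_iso is a hypothetical helper name, not part of the notebook):

```python
import datetime

def to_iso(daterep):
    # parse dd/mm/yyyy and reformat as sortable yyyy/mm/dd
    d = datetime.datetime.strptime(daterep, "%d/%m/%Y")
    return d.strftime("%Y/%m/%d")

print(to_iso("21/04/2020"))  # 2020/04/21
```

Parsing also catches malformed dates (strptime raises ValueError), which silent slicing would not.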

Sorting data by geoId and date

In [9]:
table = etl.sort(table, ['geoId', 'date'])

New transformation: add a ratio (cases/population) column

In this case we encounter a problem: in some records the population information is missing.

So we must check for it before computing the ratio.

We introduce here a lambda function (an inline function).

The function fun checks whether the popData2018 and cases fields are both floats; if so it computes the ratio, otherwise it sets the ratio to NaN.

In [10]:
fun = lambda row: row["cases"] / row["popData2018"] if isinstance(row["cases"], float) and isinstance(row["popData2018"], float) else math.nan
table = etl.addfield(table, "casesratio", fun)

Remove some columns

year, month and day are not necessary

In [11]:
table = etl.cutout(table, "day")
table = etl.cutout(table, "month")
table = etl.cutout(table, "year")

Extract some rows from the table

Select only European countries using a selection

Again we will use a lambda function (inline function)

In [12]:
europe = etl.select(table, lambda row: row["continentExp"] == "Europe")

Extract (and reorder) some columns from the table

In [13]:
europe = etl.cut(europe, ["date", "countriesAndTerritories","popData2018", "cases", "deaths", "casesratio"])

Load data into a CSV file

In [14]:
etl.tocsv(europe, 'europe.csv')

Use pandas to compute some statistical values

In [15]:
df = etl.todataframe(table)  # petl can also export a table into a pandas DataFrame
italy = df[df["geoId"] == "IT"]

pandas statistical functions

pandas can compute statistical functions directly on a Series

  1. cumsum
  2. shift
  3. diff
  4. ...
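For instance, on a toy Series the three functions behave as follows:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
print(s.cumsum().tolist())   # [1, 3, 6, 10] -- running total
print(s.diff().tolist())     # [nan, 1.0, 1.0, 1.0] -- difference from previous row
print(s.shift(1).tolist())   # [nan, 1.0, 2.0, 3.0] -- values moved down one row
```

shift with a negative argument moves values the other way, which is what the plot below uses to align the countries' curves in time.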
In [16]:
italy =   df[(df["geoId"] == "IT") & (df["date"] > '2020/03/01')]
france =  df[(df["geoId"] == "FR") & (df["date"] > '2020/03/01')]
germany = df[(df["geoId"] == "DE") & (df["date"] > '2020/03/01')]
uk =      df[(df["geoId"] == "UK") & (df["date"] > '2020/03/01')]
spain =   df[(df["geoId"] == "ES") & (df["date"] > '2020/03/01')]
plt.figure(figsize=(16, 9))
plt.plot(italy["date"], italy["cases"].cumsum(), label = 'Italy')
plt.plot(france["date"], france["cases"].cumsum().shift(-10), label = 'France')
plt.plot(germany["date"], germany["cases"].cumsum().shift(-8), label = 'Germany')
plt.plot(uk["date"], uk["cases"].cumsum().shift(-16), label = 'UK')
plt.plot(spain["date"], spain["cases"].cumsum().shift(-6), label = 'Spain')
plt.xticks(range(0, len(italy), 7), rotation = 45)
plt.legend()
plt.show()

Advanced

Using petl to add a cumsum column to a table

We want to add a column with cumulative sum using petl library

Start with the europe table:

In [17]:
europe
Out[17]:
date countriesAndTerritories popData2018 cases deaths casesratio
2020/03/03 Andorra 77006.0 1.0 0.0 1.2986001090824092e-05
2020/03/14 Andorra 77006.0 1.0 0.0 1.2986001090824092e-05
2020/03/16 Andorra 77006.0 3.0 0.0 3.8958003272472275e-05
2020/03/17 Andorra 77006.0 9.0 0.0 0.00011687400981741682
2020/03/18 Andorra 77006.0 0.0 0.0 0.0

...

Now:

  1. transform the table into an array of tuples
  2. create a vector v with the cumulative sums
  3. add a new column cumsum using vector v
  4. use the new field to draw a graph
In [18]:
arr = etl.toarray(europe)
cumsum = int(arr[0][3])  # first cases value
# remember the table is well ordered by country and date
v = [cumsum]  # first element of the new column
for i in range(1, len(arr)):
    if arr[i][1] != arr[i-1][1]:
        cumsum = int(arr[i][3])  # if the country changes, restart the cumulative sum
    else:
        cumsum = cumsum + int(arr[i][3])  # otherwise keep accumulating
    v.append(cumsum)
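As an aside, the same per-country cumulative sum can be computed in one line with a pandas groupby, if a DataFrame round trip is acceptable (a sketch with toy data, not the notebook's approach):

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["AD", "AD", "IT", "IT"],  # assumed already sorted by country and date
    "cases": [1, 3, 2, 5],
})
# cumsum restarts automatically within each country group
toy["cumsum"] = toy.groupby("country")["cases"].cumsum()
print(toy["cumsum"].tolist())  # [1, 4, 2, 7]
```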

Add a column and create a new table

In [19]:
t3 = etl.addcolumn(europe, 'cumsum', v)
In [20]:
t3
Out[20]:
date countriesAndTerritories popData2018 cases deaths casesratio cumsum
2020/03/03 Andorra 77006.0 1.0 0.0 1.2986001090824092e-05 1
2020/03/14 Andorra 77006.0 1.0 0.0 1.2986001090824092e-05 2
2020/03/16 Andorra 77006.0 3.0 0.0 3.8958003272472275e-05 5
2020/03/17 Andorra 77006.0 9.0 0.0 0.00011687400981741682 14
2020/03/18 Andorra 77006.0 0.0 0.0 0.0 14

...

Draw chart

Select Italy and France...

In [21]:
italy = etl.select(t3, lambda row: row["countriesAndTerritories"] == "Italy")
france = etl.select(t3, lambda row: row["countriesAndTerritories"] == "France")

Plot data

In [22]:
plt.figure(figsize=(16, 9))
dates = list(italy["date"])  # materialise the lazy petl columns for matplotlib
plt.plot(dates, list(italy["cumsum"]))
plt.plot(list(france["date"]), list(france["cumsum"]))
plt.xticks(range(0, len(dates), 7), rotation=45)
plt.show()