Post 1 (Interactive Plots)

Let’s make some interactive visualizations for the NOAA-GHCN (global temperature) dataset!

Here is the post’s general layout:

  1. Preparation
  2. Create Database Tables
  3. Before We Plot
  4. First Plot
  5. Second Plot
  6. Third Plot

Let’s dive into it!

1. Preparation

First, we import necessary packages for this project.

# packages for data manipulation
import pandas as pd
import sqlite3
import numpy as np

# packages for ploting and data analysis
from plotly import express as px
from plotly.io import write_html
from sklearn.linear_model import LinearRegression

2. Create Database Tables

Now we can create database tables with the sqlite3 package!

Before Creating Tables

We first connect to a temporary database.

conn = sqlite3.connect("temp.db")

Note: if temp.db did not exist before connection, it will be created.
This will be the database where we store the 3 tables we’re going to create!

a. temperatures Table

We first create a table for the temperatures data.

Before we jump in, we want to take a look at the dataset, and decide whether it needs cleaning.\ Since the dataset is huge, we will use this trick of chunksize, in which you read the data a chunk at a time.

# reading csv by chunksize of 100,000 rows at a time
df_iter = pd.read_csv("temps.csv", chunksize = 100000)

df_iter works as a iterator, thus it also has the __next__() function. We can use this to look at the dataframe chunk we read.

# looking at the dataframe chunk we read
df_iter.__next__()

temp_raw.png

We see that the each month’s data is in a separate column. But it is hard to plot with data like that. Thus, we can use a trick to first make it multi-index, then use the stack function to stack these columns into a single column, and finally reset the index.
Since we’ll be doing the cleaning for every chunk of data, we will enclose the cleaning steps in a function.

# write a funciton to clean the raw temperature data
def cleaning_temp(df):
    """
    This function takes in a piece of dataframe,
    and returns a cleaned version of the dataframe.
    
    It rearranges columns from horizontal (by month) to vertical with an extra month column, 
    put temperature in celcius, 
    and rename the columns to make them more make sense
    """
    # reorder columns
    df = df.set_index(keys=["ID", "Year"]).stack().reset_index()
    # rename columns
    df = df.rename(columns = {"level_2"  : "Month" , 0 : "Temperature"})
    # get months
    df["Month"] = df["Month"].str[5:].astype(int)
    # put temperature in celsius
    df["Temperature"]  = df["Temperature"] / 100
    return (df)

Let’s take a look at how the cleaned data would look like:

cleaning_temp(df_iter.__next__())

temp_cleaned


It looks like what we want! We can now create the table, with the chunksize trick, the cleaning function we just created, and the to_sql function in Pandas:

# read raw data by chunks
df_iter = pd.read_csv("temps.csv", chunksize = 100000)
for df in df_iter:
    # clean each chunk
    df = cleaning_temp(df)
    # append each chunk to the table
    df.to_sql("temperatures", conn, if_exists = "append", index = False)



b. stations Table

Since the stations dataset is small, we can just read it directly.

# read raw data
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv"
stations = pd.read_csv(url)

Likewise, let’s take a look at the raw data and see if we need to clean anything.

stations

stations_raw

Everything looks good to us, therefore we can create the table now, with the to_sql function we used earlier.

# create table
stations.to_sql("stations", conn, if_exists = "replace", index = False)



c. countries Table

Similarly, since the countries dataset is small, we can just read it directly.

# read raw data
url = "https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv"
countries = pd.read_csv(url)

And again let’s take a look at the raw data and see if we need to clean anything.

countries

countries_raw

We don’t need the “ISO 3166” column, and for the sake of sql commend, let’s get get a cleaner name for “FIPS 10-4”.

# clean the raw data
countries = countries.rename(columns = {"FIPS 10-4": "FIPS"})
countries = countries.drop(["ISO 3166"], axis=1)

# look at the cleaned data
countries

countries_cleaned

It looks good to us, thus we can create the table now.

# create table
countries.to_sql("countries", conn, if_exists = "replace", index = False)



After Creating Tables

It is important to close the connection to the database once we’re done.

# close the database connection
conn.close()



3. Before We Plot

We will be doing some data cleaning and analysis to the data we’re going to query. Thus, we first define some helper functions here for the steps that we’ll be doing multiple times. What each function does is explained in the documentation inside the functions.

A new function we used here is transform: this function works similar to apply, but instead of calculating one value for the datagroup, it generates the same value for each original row of the datagroup.

This helper function filters rows based on the number of observations.

def filtering(df, min_obs):
    """
    This function takes in a piece of dataframe,
    and returns a filtered version of the dataframe.
    """
    
    # create a count column based on the number of observations for a specific station at a specific month
    df["counts"] = df.groupby(["NAME", "Month"])["Year"].transform('count')
    # filters each row based on the count column
    df_filtered = df[df['counts'] >= min_obs]

    return df_filtered

This function calculates the coefficient from the regression of x and y in the datagroup.

def coef(data_group):
    """
    This function takes in a data group,
    and returns the coefficient from the regression of x and y.
    """
    x = data_group[["Year"]] # 2 brackets because X should be a df
    y = data_group["Temp"]   # 1 bracket because y should be a series
    LR = LinearRegression()
    LR.fit(x, y)	    # fit the regression
    return LR.coef_[0]     # returns the first coefficient from the regression

This function uses the coef function above, to calculate coefficient for each datagroup inside a dataframe, then create a new dataframe of coefficients.

def coef_col(df): 
    """
    This function takes in a piece of dataframe, 
    calculates coeffients for each datagroup,
    then returns a dataframe of the coefficients.
    """
    # calculating coefficient from linear regression
    coefs = df.groupby(["NAME", "Month"]).apply(coef)
    # rearrange the index 
    coefs = coefs.reset_index()
    # round each coefficient to 4 decimal
    coefs[0] = coefs[0].round(4)
    # rename the coefficient column
    coefs = coefs.rename(columns={0: "Estimated Yearly Increase (C°)"})
    return coefs

Now we can finally start plotting!
For each plot, we will be doing 2 steps:

  1. write a query function to query the information we need
  2. write a plotting function so that it’s easier for user to change parameters


4. First Plot

a. query function

We will be using pd.read_sql_query(). This function allows us to query using SQL command from a local database. It works like below:

# cmd: a string, consisting of SQL command
# conn: connection to database
df = pd.read_sql_query(cmd, conn)

Let’s also learn some SQL query basics:

SELECT: the colomns of a table you wish to select

FROM: the table you wish to get data from

LEFT JOIN and ON: joins two or more tables together, and on decides which column the two tables should agree on

WHERE: add conditions to the query

AS: gives your column/table a nickname.

Finally, the command should end with a semicolon.

Now we can write our query function! We wish to write a function that gives us

  • The station name.
  • The latitude of the station.
  • The longitude of the station.
  • The name of the country in which the station is located.
  • The year in which the reading was taken.
  • The month in which the reading was taken.
  • The average temperature at the specified station during the specified year and month.
    The function looks like the following:
    def query_climate_database(country, year_begin, year_end, month):
      """
      This function takes 4 arguments: 
      country: a string giving the name of a country for which data should be returned.
      year_begin and year_end: two integers giving the earliest and latest years for which should be returned.
      month: an integer giving the month of the year for which should be returned.
        
      It returns a Pandas dataframe of temperature readings for the specified country, 
      in the specified date range, in the specified month of the year.
      """
        
      # sql command for the query
      cmd = \
      f" SELECT S.NAME, S.LATITUDE, S.LONGITUDE, C.Name AS Country, T.Year, T.Month, T.Temperature AS Temp\
      FROM temperatures T \
      LEFT JOIN stations S \
      ON T.ID=S.ID \
      LEFT JOIN countries C \
      ON C.FIPS=SUBSTRING(T.ID, 1, 2)\
      WHERE Country='{country}' AND T.Year>={year_begin} AND T.Year<={year_end} AND T.Month={month};"
    
      # connect to the database created
      conn = sqlite3.connect("temp.db")
      # create a dataframe based on the information queried
      df = pd.read_sql_query(cmd, conn)
      # close the database connection
      conn.close()
        
      return df
    

Let’s look at whether this function works:

df1 = query_climate_database(country = "India", 
                       		year_begin = 1980, 
                       		year_end = 2020,
                       		month = 1)
df1

query1

It works as we desired! Now we can use this function to create our plotting function.

b. plotting function

We’ll be using plotly.express to make our interactive visualization. It looks similar to the matplotlib we’ve been using, but it starts by specifying a dataframe. Then, x, y, and several other arguments are interpreted as columns of the supplied dataframe.

To make a scatter plot on a map, we choose to use the scatter_mapbox function. As we will see below, we use it like this:

px.scatter_mapbox(final_pd, 
                  lat = "LATITUDE",
                  lon = "LONGITUDE", 
                  hover_name = "NAME", 
                  color = "Estimated Yearly Increase (C°)",
                  title=f"Estimates of Yearly Increase in Temperature <br> \
                  		in month {month}, stations in {country}, \
                  		year {year_begin}-{year_end}",
                  **kwargs)

lat: the column in the dataframe that represents latitude

lon: the column in the dataframe that represents longitude

hover_name: the message that shows up when we hover over a dot

color: the color of each dot depends on this column’s value

title: plot title

As we have created the query function, the filtering function, and function to calculate coefficient, we can write a clean plotting function using these helper functions. \

Steps we will take:

  1. use the query_climate_database function to get the dataframe we want, with the arguments passed in from the plotting function
  2. use the filtering function to select data only for stations with at least min_obs years worth of data
  3. use the coef_col function to calculate coefficient from linear regression
  4. use merge to add the longitude and latitude information to the coefs dataframe
  5. use plotly.express.scatter_mapbox to plot!
def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
    """
    This function takes 5 arguments and an undetermined number of keyword arguments.
    country: a string giving the name of a country for which data should be returned.
    year_begin and year_end: two integers giving the earliest and latest years for which should be returned.
    month: an integer giving the month of the year for which should be returned.
    min_obs: the minimum required number of years of data for any given station. 
    **kwargs, additional keyword arguments passed to px.scatter_mapbox(). 
    
    This function plots an interactive geographic scatterplot, 
    constructed using Plotly Express, with a point for each station, 
    such that the color of the point reflects an estimate of the yearly 
    change in temperature during the specified month and time period at that station.
    """

    # creating dataframe with the arguments passed in
    df = query_climate_database(country, year_begin, year_end, month)
    
    # select data only for stations with at least min_obs years worth of data
    df_filtered = filtering(df, min_obs)
    
    # calculating coefficient from linear regression
    coefs = coef_col(df_filtered)
    
    # add latitude and longitude to the coefficient dataframe
    final_pd = pd.merge(coefs, 
                        df_filtered[["NAME", "LATITUDE", "LONGITUDE"]].drop_duplicates(), 
                        on=["NAME"])
    
    # plot with plotly
    fig = px.scatter_mapbox(final_pd, 
                            lat = "LATITUDE",
                            lon = "LONGITUDE", 
                            hover_name = "NAME", 
                            color = "Estimated Yearly Increase (C°)",
                            title=f"Estimates of Yearly Increase in Temperature <br>in month {month}, stations in {country}, year {year_begin}-{year_end}",
                            **kwargs)
    
    return (fig)

Let’s look at the output:

color_map = px.colors.diverging.RdGy_r
fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map, 
                                   width=800)
fig.show()

It looks fantastic!

This step is optional. We do this in order to put the plot in our blog.

write_html(fig, "blog1-plot1.html")




5. Second Plot

a. query function

Similarly, we will be using pd.read_sql_query. We want to make a choropleth this time, thus we need every country’s data. And since we’ll have the geojson data, we don’t need the latitude and longitude data anymore. And we will average the temperatures of all stations in that country. We wish to write a function that gives us

  • The name of the country in which the station is located.
  • The year in which the reading was taken.
  • The month in which the reading was taken.
  • The average temperature among the stations in a specific country during the specified year and month.
    The function looks like the following:
def query_climate_database2(year_begin, year_end, month):
    """
    This function takes 3 arguments: 
    year_begin and year_end: two integers giving the earliest and latest years for which should be returned.
    month: an integer giving the month of the year for which should be returned.
    
    It returns a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year.
    """
    
    # sql command for the query
    cmd = \
    f" SELECT C.Name AS NAME, T.Year, T.Month, AVG(T.Temperature) AS Temp\
    FROM temperatures T \
    LEFT JOIN countries C \
    ON C.FIPS=SUBSTRING(T.ID, 1, 2)\
    WHERE NAME IS NOT NULL AND T.Year>={year_begin} AND T.Year<={year_end} AND T.Month={month} \
    GROUP BY NAME, T.Year, T.Month;"

    # connect to the database created
    conn = sqlite3.connect("temp.db")
    # create a dataframe based on the information queried
    df = pd.read_sql_query(cmd, conn)
    # close the database connection
    conn.close()
    
    return df

Let’s look at whether this function works:

df2 = query_climate_database2(year_begin = 1980, year_end = 2020, month = 1)
df2

query2

It works as we desired! Now we can use this function to create our plotting function.

b. plotting function

First, we need to import the geojson data.

from urllib.request import urlopen
import json

countries_gj_url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/countries.geojson"

with urlopen(countries_gj_url) as response:
    countries_gj = json.load(response)

Now we have the geojson data for each country in variable countries_gj.

To make a choropleth on a map, we choose to use the choropleth function. As we will see below, we use it like this:

px.choropleth(coefs, 
              geojson=countries_gj,
              locations = "NAME",
              locationmode = "country names",
              color = "Estimated Yearly Increase (C°)",
              title=f"Estimates of Yearly Increase in Temperature <br> \
              		for all countires in month {month}, \
              		year {year_begin}-{year_end}",
              **kwargs)

geojson: the geographic data for each country

locations: the column in the dataframe that represents country name

locationmode: how the geographic data would match up with the dataframe

color: the color of each dot depends on this column’s value

title: plot title

Steps we will take are similar to the first plot, except that we won’t be merging for the longitude and latitude data (since we don’t need them).

def temperature_coefficient_plot2(year_begin, year_end, month, min_obs, **kwargs):
    """
    This function takes 4 arguments and an undetermined number of keyword arguments.
    year_begin and year_end: two integers giving the earliest and latest years for which should be returned.
    month: an integer giving the month of the year for which should be returned.
    min_obs: the minimum required number of years of data for any given station. 
    **kwargs, additional keyword arguments passed to px.scatter_mapbox(). 
    
    This function plots an interactive geographic scatterplot, constructed using Plotly Express, with a point for each station, such that the color of the point reflects an estimate of the yearly change in temperature during the specified month and time period at that station.
    """
    
    # creating dataframe with the arguments passed in
    df = query_climate_database2(year_begin, year_end, month)
    
    # select data only for stations with at least min_obs years worth of data
    df_filtered = filtering(df, min_obs)
    
    # calculating coefficient from linear regression
    coefs = coef_col(df_filtered)
    
    fig = px.choropleth(coefs, 
                        geojson=countries_gj,
                        locations = "NAME",
                        locationmode = "country names",
                        color = "Estimated Yearly Increase (C°)",
                        title=f"Estimates of Yearly Increase in Temperature <br> \
                        		for all countires in month {month}, \
                        		year {year_begin}-{year_end}",
                        **kwargs)
    
    return (fig)

Let’s look at the output:

fig = temperature_coefficient_plot2(1980, 2020, 1, min_obs = 10, 
                                    color_continuous_scale=color_map, 
                                    width=800)
fig.show()

Now we can easily tell each country/region’s average temperature increase in a specific month!

6. Third Plot

a. query function

Finally, let’s make the third plot.

We want to make a histogram of each station’s average temperature increase in a specific month, and use facet to compare different months, thus we need every month’s data. And since we won’t be plotting on a map, we don’t need the latitude and longitude data.

Thus, we wish to write a function that gives us

  • The station’s name
  • The name of the country in which the station is located.
  • The year in which the reading was taken.
  • The month in which the reading was taken.
  • The average temperature in a specific station during the specified year and month.
    The function looks like the following:
def query_climate_database3(country, year_begin, year_end, month_begin, month_end):
    """
    This function takes 5 arguments: 
    country: a string giving the name of a country for which data should be returned.
    year_begin and year_end: two integers giving the earliest and latest years for which should be returned.
    
    It returns a Pandas dataframe of temperature readings for the specified country, in the specified date range.
    """
    
    # sql command for the query
    cmd = \
    f" SELECT S.NAME, C.Name AS Country, T.Year, T.Month, T.Temperature AS Temp\
    FROM temperatures T \
    LEFT JOIN stations S \
    ON T.ID=S.ID \
    LEFT JOIN countries C \
    ON C.FIPS=SUBSTRING(T.ID, 1, 2)\
    WHERE Country='{country}' AND T.Year>={year_begin} AND T.Year<={year_end} AND T.Month>={month_begin} AND T.Month<={month_end};"

    # connect to the database created
    conn = sqlite3.connect("temp.db")
    # create a dataframe based on the information queried
    df = pd.read_sql_query(cmd, conn)
    # close the database connection
    conn.close()
    
    return df

Let’s look at whether this function works:

df3 = query_climate_database3(country="India", year_begin = 1980, year_end = 2020, 
                             month_begin = 1, month_end = 3)
df3

query3

It works as we desired! Now we can use this function to create our plotting function.

b. plotting function

Here we choose to use the histogram function. As we will see below, we use it like this:

px.histogram(coefs, 
             x = "Estimated Yearly Increase (C°)",
             hover_name = "NAME",
             facet_row = "Month", 
             barmode='stack', 
             title=f"Counts of Stations for Average Temperature Increase <br> \
             		month {month_begin}-{month_end}, stations in {country}, \
             		year {year_begin}-{year_end}",
             **kwargs)

facet_row: what to separate each subplot by

barmode: how each bar of counts is generated

Steps we will take are similar to the second plot.

def temperature_coefficient_plot3(country, year_begin, year_end, 
                                  month_begin, month_end, min_obs, **kwargs):
    """
    This function takes 6 arguments and an undetermined number of keyword arguments.
    country: a string giving the name of a country for which data should be returned.
    year_begin and year_end: two integers giving the earliest and latest years for which should be returned.
    month_begin and month_end: two integers giving the earliest and latest month of the year for which should be returned.
    min_obs: the minimum required number of years of data for any given station. 
    **kwargs, additional keyword arguments passed to px.scatter_mapbox(). 
    
    This function plots an interactive geographic scatterplot, constructed using Plotly Express, with a point for each station, such that the color of the point reflects an estimate of the yearly change in temperature during the specified month and time period at that station.
    
    """
    df = query_climate_database3(country, year_begin, year_end, month_begin, month_end)
    
    # select data only for stations with at least min_obs years worth of data
    df_filtered = filtering(df, min_obs)
    
    # calculating coefficient from linear regression
    coefs = coef_col(df_filtered)
    
    fig = px.histogram(coefs, 
                       x = "Estimated Yearly Increase (C°)",
                       hover_name = "NAME",
                       facet_row = "Month", 
                       barmode='stack', 
                       title=f"Distribution of Stations for Average Temperature Increase <br>month {month_begin}-{month_end}, stations in {country}, year {year_begin}-{year_end}",
                       **kwargs)
    
    return (fig)

Let’s look at the output:

fig = temperature_coefficient_plot3("India", 1980, 2020, 1, 3, 
                                    min_obs = 10, 
                                    opacity = 0.5, 
                                    nbins = 30,
                                    width=500)
fig.show()

Now we can easily tell the distrubution of the average temperature increase in a specific country and in a specific month!



We’re done with the interactive visualization! Feel free to leave a comment below if you have any question/sugeestions.



Written on October 6, 2021