Exploring global weather data

Machine learning is being applied to many kinds of data, and it has been especially successful in data-rich areas such as vision. However, we see relatively little of it in meteorology, where big Numerical Weather Prediction (NWP) simulations run daily, trying to simulate the entire atmosphere up to weeks ahead. This takes supercomputers and lots of parameter tweaking to achieve good results.

Funnily enough, we know from The Bitter Lesson that data + compute always wins. Currently we take the approach of putting more and more human knowledge (physics, approximations) into these NWP models, and by and large this is a big success story.

The question, however, is not if but when we will hit the point where data + compute + more general ML methods will beat these amazing numerical simulations.

In this post we will do some exploratory analysis of the NOAA "Global Surface Summary of the Day" dataset, which is available as a BigQuery public dataset.

You can run this notebook in Colab! Simply download it here and upload it to Colab.

Basic imports

We set up the plotly and Google Cloud libraries.

In [0]:
PROJECT = '{YOUR GCLOUD PROJECT}'
In [0]:
# We want plotly >= 4.0 as it has much better rendering support
!pip install --upgrade "plotly>=4.0.0"
In [0]:
from datetime import datetime

import plotly.graph_objs as go

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial import Delaunay

from google.colab import auth
auth.authenticate_user()

from google.cloud.bigquery import magics

magics.context.project = PROJECT

from google.cloud import bigquery

bqc = bigquery.Client(project=PROJECT)

1. Retrieve and plot station positions

Here we retrieve the station positions, triangulate them (for nice mesh plotting), then plot them on a globe. We only use stations that were active during the entire nine-year period from 2010 to 2019.

In [0]:
# CONSTANTS
# Must be str
ds_begin = '20100101'
ds_end = '20190101'
In [5]:
# Use pandas directly here; we could also use the bigquery cell magic
df = pd.io.gbq.read_gbq(f'''
SELECT lon, lat, elev, begin, d.end
FROM `bigquery-public-data.noaa_gsod.stations` as d
WHERE begin < '{ds_begin}' AND d.end > '{ds_end}'
''', project_id=PROJECT, dialect='standard')
# Convert elevation to numeric
df.elev = pd.to_numeric(df.elev)
df.sample(3)
Out[5]:
lon lat elev begin end
1286 15.681 58.406 52.4 19951012 20190725
3166 32.733 35.167 20.0 20090701 20190725
947 152.500 -32.183 5.0 20010915 20190724
In [6]:
pts = df[['lon', 'lat']].values

lat = np.deg2rad(df['lat'].values)
x = np.cos(np.deg2rad(df['lon'].values)) * np.cos(lat)
y = np.sin(np.deg2rad(df['lon'].values)) * np.cos(lat)
z = np.sin(lat)
xyz = np.stack((x, y, z)).T

# Do the triangulation in lon/lat (spherical) coordinates; the 3D xyz coordinates are used for plotting the mesh later
tri = Delaunay(pts)

tri.simplices.shape
Out[6]:
(23569, 3)
In [7]:
layout = dict(
    title = f'Weather stations active during period {ds_begin} - {ds_end}<br>(Click and drag to rotate)',
    showlegend = False,
    width=1000,
    height=1000,
    geo = dict(
        showland = True,
        showlakes = True,
        showcountries = True,
        showocean = True,
        countrywidth = 0.5,
        landcolor = 'rgb(230, 145, 56)',
        lakecolor = 'rgb(0, 255, 255)',
        oceancolor = 'rgb(0, 255, 255)',
        projection = dict( 
            type = 'orthographic',
            rotation = dict(
                lon = -100,
                lat = 40,
                roll = 0
            )            
        ),
        lonaxis = dict( 
            showgrid = True,
            gridcolor = 'rgb(102, 102, 102)',
            gridwidth = 0.5
        ),
        lataxis = dict( 
            showgrid = True,
            gridcolor = 'rgb(102, 102, 102)',
            gridwidth = 0.5
        )
    )
)
    
go.Figure(data=[
    go.Scattergeo(lon=pts[:,0], lat=pts[:,1], mode='markers')
], layout=layout)

Station positions as a graph

So we have all these points on the globe, but how can we feed this to an ML algorithm? The grid is not regular, not even on the sphere. One idea could be to encode these positions as a graph, which lets us connect the stations to one another.
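For example, here is a minimal illustrative sketch (not part of the original analysis) of how the Delaunay triangulation from above could be turned into an undirected edge list, the usual input format for graph libraries. The helper simplices_to_edges is hypothetical and only for illustration:

# Illustrative sketch: turn the Delaunay triangles into an undirected edge list.
def simplices_to_edges(simplices):
    edges = set()
    for i, j, k in simplices:
        # Each triangle (i, j, k) contributes the edges (i, j), (j, k) and (i, k).
        for a, b in ((i, j), (j, k), (i, k)):
            edges.add((min(a, b), max(a, b)))  # store each undirected edge once
    return np.array(sorted(edges))

edges = simplices_to_edges(tri.simplices)
edges.shape  # (n_edges, 2) array of station index pairs

Before we go there, let's first plot the triangulation itself as a mesh.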

In [8]:
def plotly_trisurf(xyz, simplices):
    # xyz is an (n_points, 3) array of vertex coordinates.
    # simplices is an (n_triangles, 3) array of vertex indices
    # that defines the triangulation.
    lines = []

    for pt1, pt2, pt3 in simplices:
        lines.extend([xyz[pt1].tolist(),
                      xyz[pt2].tolist(),
                      xyz[pt3].tolist(),
                      xyz[pt1].tolist(),
                      [None, None, None]  # None separates consecutive triangles
                     ])

    lines = np.array(lines)

    return go.Scatter3d(x=lines[:,0].tolist(),
                        y=lines[:,1].tolist(),
                        z=lines[:,2].tolist(),
                        mode='lines',
#                         line=dict(color='rgb(50,50,50)', width=1.5)
                       )

data1 = plotly_trisurf(xyz, tri.simplices)

layout = go.Layout(
    width=1000, height=1000,
    title='Weather stations plotted as a mesh'
)

go.Figure(data=[data1], layout=layout)

2. Establishing baselines

When we start training crazy triple-loopback LSTM-distilled manifold-free machine learning models on this data later on, we need to know whether the performance we are getting is good or not. If we predict the temperature for tomorrow, is an MAE of 1.5°C good, or not?

To get a better "feel" for these numbers, we can use some (very) simple models and see how well they perform.

Model 1: Persistence

One of the simplest models is to just take yesterday's value. This is usually surprisingly good, as tomorrow's weather is highly correlated with today's. Let's see what MAE this gives us.
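To make the idea concrete, here is a tiny pandas sketch of the persistence baseline for a single station, using made-up numbers rather than the dataset; the BigQuery query below computes the same thing for all stations at once:

# Hypothetical daily mean temperatures for one station, in °C.
temps = pd.Series([12.1, 13.4, 11.8, 10.9, 12.3, 14.0, 13.2])
# Persistence forecast: predict each day with the previous day's value.
persistence = temps.shift(1)
# Mean absolute error of the persistence forecast.
mae = (temps - persistence).abs().mean()
mae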

In [0]:
%%bigquery df --params {"ds_begin": "20100101", "ds_end": "20190101"}
-- Select station identifier and location, and the average absolute temperature difference between days at several lags
SELECT stn, lon, lat, elev, COUNT(*) as cnt,
    AVG(ABS(temp_c - lag_1_temp_c)) as avg_temp_diff_lag_1,
    AVG(ABS(temp_c - lag_2_temp_c)) as avg_temp_diff_lag_2,
    AVG(ABS(temp_c - lag_3_temp_c)) as avg_temp_diff_lag_3,
    AVG(ABS(temp_c - lag_7_temp_c)) as avg_temp_diff_lag_7,
    AVG(ABS(temp_c - lag_14_temp_c)) as avg_temp_diff_lag_14
FROM (
  SELECT *,
    LAG(temp_c) OVER (PARTITION BY stn ORDER BY d) as lag_1_temp_c,
    LAG(temp_c, 2) OVER (PARTITION BY stn ORDER BY d) as lag_2_temp_c,
    LAG(temp_c, 3) OVER (PARTITION BY stn ORDER BY d) as lag_3_temp_c,
    LAG(temp_c, 7) OVER (PARTITION BY stn ORDER BY d) as lag_7_temp_c,
    LAG(temp_c, 14) OVER (PARTITION BY stn ORDER BY d) as lag_14_temp_c
  FROM (
    SELECT *,
      -- From F to C
      (g.temp-32)*5/9 as temp_c,
      -- They use a funny format where year, mo and da are strings
      CAST(CONCAT(g.year, '-', g.mo, '-', g.da) as DATE) as d
    FROM `bigquery-public-data.noaa_gsod.gsod2018` as g
    LEFT JOIN `bigquery-public-data.noaa_gsod.stations` as d
    ON g.stn = d.usaf
    WHERE d.begin < @ds_begin AND d.end > @ds_end
  )
)
-- Need to group by lon/lat/elev as we want to select them, but stn alone is enough really
GROUP BY stn, lon, lat, elev
-- Order for display purposes
ORDER BY avg_temp_diff_lag_1 DESC
In [10]:
df.head()
Out[10]:
stn lon lat elev cnt avg_temp_diff_lag_1 avg_temp_diff_lag_2 avg_temp_diff_lag_3 avg_temp_diff_lag_7 avg_temp_diff_lag_14
0 421480 79.467 29.033 +0233.0 2 17.833333 NaN NaN NaN NaN
1 111420 13.183 47.317 +0647.0 2 14.944444 NaN NaN NaN NaN
2 422370 72.367 27.133 +0234.0 2 12.555556 NaN NaN NaN NaN
3 412161 52.464 24.074 +0015.2 2 9.277778 NaN NaN NaN NaN
4 999999 -79.924 32.775 +0003.1 76403 8.809345 8.809345 8.809345 8.809345 8.809345

We immediately see that a handful of stations stand out, with Pantnagar Airport in India having extremely high day-to-day differences. Let's take a closer look and see what's up there.

In [11]:
%%bigquery
-- Select all 2018 readings for the outlier station 421480
SELECT CAST(CONCAT(g.year, '-', g.mo, '-', g.da) as DATE) as d,
       (g.temp-32)*5/9 as temp_c
FROM `bigquery-public-data.noaa_gsod.gsod2018` as g
WHERE g.stn="421480"
ORDER BY d
Out[11]:
d temp_c
0 2018-08-18 32.055556
1 2018-12-11 14.222222

Aha! It has just two readings in 2018. Let's see how many readings the other stations have. I secretly included that count (cnt) in the previous query, so we can graph it right away.

We exclude station "999999", as it seems to be some kind of "unknown station" code that accumulates lots of readings. (Check df.loc[df.cnt.idxmax()] to see for yourself.)

In [12]:
plt.semilogy()
plt.title("Number of rows per station")
df[df.stn!='999999'].cnt.hist(bins=100);

This is kind of strange: we would expect most stations to have 365 readings (one per day) and some to have fewer, which is clearly the case, but there is also a small number of stations with more than 365 readings. So we simply restrict ourselves to stations with readings on almost every day, which is the bulk of them anyway.

In [13]:
%%bigquery
SELECT COUNT(*) as n
FROM (
    SELECT stn, COUNT(*) as cnt, COUNT(DISTINCT dt) as cnt_days
    FROM (
        SELECT *, CAST(CONCAT(g.year, '-', g.mo, '-', g.da) as DATE) as dt
        FROM `bigquery-public-data.noaa_gsod.gsod2018` as g
        LEFT JOIN `bigquery-public-data.noaa_gsod.stations` as d
        ON g.stn = d.usaf
        WHERE d.begin < '20100101' AND d.end > '20190101' AND stn != '999999'
    )
    -- Stations with more rows than distinct days have duplicate (ambiguous) readings
    GROUP BY stn
    HAVING cnt != cnt_days
)
Out[13]:
n
0 147

We see that there are just 147 stations with more readings than distinct days, so we can safely ignore them in further analysis. At some point we could revisit this, disambiguate those readings and include them as well.

We have now essentially computed the MAE of this very simple model per station over a whole year, and we've done it for a few different lags to get a feel for how our persistence model does over the span of two weeks.

Let's graph the results.

In [14]:
# Choose stations with lots of samples (this does not decrease our dataset that much)
df_subset = df[(350 < df.cnt) & (df.cnt <= 365)]
plt.title("Average absolute temperature difference")
sns.violinplot(data=df_subset[[
    "avg_temp_diff_lag_1",
    "avg_temp_diff_lag_2",
    "avg_temp_diff_lag_3",
    "avg_temp_diff_lag_7",
    "avg_temp_diff_lag_14"
]]);

We can see that for one day ahead this is approximately 2°C on average, increasing to about 4°C if we were to forecast 14 days ahead using persistence. The lumps in the violin plot are probably due to discretization (and not at round numbers, since we converted from °F to °C).
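As a quick sanity check on the discretization idea (assuming the raw GSOD temperatures are reported in tenths of a degree Fahrenheit, which is an assumption here), converting a regular Fahrenheit grid to Celsius gives a spacing of roughly 0.056°C, which indeed does not land on round Celsius values:

# Tenth-of-a-degree Fahrenheit grid converted to Celsius.
f_values = np.arange(50.0, 51.01, 0.1)
c_values = (f_values - 32) * 5 / 9
np.diff(c_values)  # spacing of ~0.0556°C, not round numbers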

We are starting to get a "feel" for these numbers, and can now intuitively say that a model with an MAE of 1°C for a day-ahead forecast on the same subset is doing quite well compared to this "persistence" baseline model.

Let's have some more fun and find the station where the temperature tomorrow is most similar to today's, and the station where the temperature in two weeks is most similar to today's.

In [15]:
df_subset.loc[[
    df_subset.avg_temp_diff_lag_1.idxmin(),
    df_subset.avg_temp_diff_lag_14.idxmin()
]]
Out[15]:
stn lon lat elev cnt avg_temp_diff_lag_1 avg_temp_diff_lag_2 avg_temp_diff_lag_3 avg_temp_diff_lag_7 avg_temp_diff_lag_14
10728 895780 77.02 -76.420 +2830.0 358 0.094927 0.156211 0.226604 0.497942 0.949451
10709 914900 -157.35 1.986 +0001.5 358 0.343137 0.442260 0.462598 0.521051 0.541990

The first is a station in the Antarctic, the second is on Kiribati, an island in the Pacific near the equator. This again makes sense: the Antarctic probably has fewer weather systems and less energy influx to change the temperature drastically overnight, while an island on the equator surrounded by sea has a more stable long-term temperature.

So now we've done some good basic analysis on this dataset and are ready to dive deeper. The next post will be about exporting this dataset and doing some epic ML on it.