Data Cleansing and Records Calculations

Part 2 of 3

This notebook follows sequentially from NOAA-CO-OPS-data in which we downloaded the latest data for a particular NOAA CO-OPS weather and tide station. The data record and corresponding metadata were written to file. Here we use those data to calculate several daily and monthly statistics and records. This is done in two steps:

Filter the data: We do not perform any quality assurance or quality control checks, but we do remove from the records any days missing a specified amount of data and any months missing a specified number of days of data.
Calculate records:
- Daily and monthly averages
- Record high daily and monthly averages^*
- Record low daily and monthly averages^*
- Average daily and monthly high
- Lowest daily and monthly high^*
- Record daily and monthly high^*
- Average daily and monthly low
- Highest daily and monthly low^*
- Record daily and monthly low^*

Years are also noted for those records marked by an asterisk (*).

Packages and configurations

First we import the packages we need.

import pandas as pd
import xarray as xr
import numpy as np
import calendar
import yaml
import os

Functions

Next, we define a number of functions that will come in handy later:

camel: This will use the site location to create a directory name in camelCase (e.g., “virginiaKeyFl”) so that we do not have to do it manually
DOY: Determine day of year (DOY) out of 366 for each day in the data record
count_missing_hours: Return the number of hours of missing data for a given day
count_missing_days: Return the number of days of missing data for a given month
filter_hours: Filter out days with too many hours of missing data
filter_days: Filter out months with too many days of missing data
Functions for all of the highs, lows, and averages (see their docstrings for more information)

Helper functions

def camel(text):
    """Convert 'text' to camel case"""
    s = text.replace(',','').replace("-", " ").replace("_", " ")
    s = s.split()
    if len(text) == 0:
        return text
    return s[0].lower() + ''.join(i.capitalize() for i in s[1:])

def DOY(df):
    """Determine year day out of 366"""
    # Day of year as integer
    df['YearDay'] = df.index.day_of_year.astype(int)
    # Years that are NOT leap years
    leapInd = [not calendar.isleap(i) for i in df.index.year]
    mask = (leapInd) & (df.index.month > 2)
    # Advance by one day everything after February 28 
    df.loc[mask, 'YearDay'] += 1
    return df

Filtering data

def count_missing_hours(group, threshold=4):
    """Return True if the number of hours in a day with good data is 
    greater than or equal to 24-'threshold' (i.e., a 'good' day) and False 
    otherwise.
    """
    num_obs = (~group.resample('1h').mean().isna()).sum()
    good_threshold = 24 - threshold
    return num_obs >= good_threshold

def count_missing_days(group, threshold=2):
    """Return True if the number of days in a month with good data 
    is greater than or equal to the number of days in the month minus 'theshold' (i.e., a 'good' month) and False
    otherwise.
    """
    try:
        days_in_month = pd.Period(group.index[0].strftime(format='%Y-%m-%d')).days_in_month
        good_days = (~group.resample('1D').mean().isna()).sum()
        good_threshold = days_in_month - threshold
        missing_days_flag = good_days > good_threshold
        return good_days >= good_threshold
    except IndexError:
        pass

def filter_hours(data, hr_threshold=4):
    """Filter data to remove days with more than 'hr_threshold' missing
    hours of data.
    """
    # Filter out fillVals==31.8
    filtered = data.replace(31.8, np.nan)
    # Filter out days missing more than <hr_threshold> hours
    filtered = filtered.groupby(pd.Grouper(freq='1D')).filter(
        lambda x: count_missing_hours(group=x, threshold=hr_threshold))
    return filtered

def filter_days(data, hr_threshold=4, day_threshold=2):
    """Filter months with more than 'day_threshold' days of missing
    data by first filtering data to remove days with more than 
    'hr_threshold' missing hours of data.
    """
    # Filter out fillVals==31.8
    filtered = data.replace(31.8, np.nan)
    # Filter out days missing more than <hr_threshold> hours
    filtered = filter_hours(filtered, hr_threshold=hr_threshold)
    # Filter out months missing more than <day_threshold> days
    filtered = filtered.groupby(pd.Grouper(freq='1M')).filter(
        lambda x: count_missing_days(group=x, threshold=day_threshold))
    return filtered

Calculate records

def daily_highs(df):
    """Daily highs"""
    return df.groupby(pd.Grouper(freq='1D', closed='left', label='left', dropna=True)).max(numeric_only=True)
              
def daily_lows(df):
    """Daily lows"""
    return df.groupby(pd.Grouper(freq='1D', closed='left', label='left', dropna=True)).min(numeric_only=True)

def daily_avgs(df, true_average=False):
    """Daily averages by calendar day. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard).
    """
    if true_average:
        results = df.groupby(pd.Grouper(freq='1D', closed='left', label='left', dropna=True)).mean(numeric_only=True)
    else:
        dailyHighs = daily_highs(df)
        dailyLows = daily_lows(df)
        results = (dailyHighs + dailyLows) / 2
    return results

def mon_daily_highs(df):
    """Daily highs using data filtered by days"""
    return df.groupby(pd.Grouper(freq='1D', closed='left', label='left', dropna=True)).max(numeric_only=True)

def mon_daily_lows(df):
    """Daily lows using data filtered by days"""
    return df.groupby(pd.Grouper(freq='1D', closed='left', label='left', dropna=True)).min(numeric_only=True)

def mon_daily_avgs(df, true_average=False):
    """Daily averages by calendar day using data filtered by day. If
    'true_average' is True, all measurements from each 24-hour day will be
    used to calculate the average. Otherwise, only the maximum and minimum
    observations are used. Defaults to False (meteorological standard).
    """
    if true_average:
        return df.groupby(pd.Grouper(freq='1D', closed='left', label='left', dropna=True)).mean(numeric_only=True)
    else:
        dailyHighs = mon_daily_highs(df)
        dailyLows = mon_daily_lows(df)
        results = (dailyHighs + dailyLows) / 2
        return results

def daily_avg(df, true_average=False):
    """Daily averages. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the daily average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard).
    """
    dailyAvgs = daily_avgs(df, true_average=true_average)
    dailyAvg = dailyAvgs.groupby('YearDay').mean(numeric_only=True)
    dailyAvg.index = dailyAvg.index.astype(int)
    results = xr.DataArray(dailyAvg, dims=['yearday', 'variable'])
    results.name = 'Daily Average'
    return results

def monthly_highs(df, true_average=False):
    """Monthly highs. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the daily average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard).
    """
    dailyAvgs = daily_avgs(df, true_average=true_average)
    monthHighs = dailyAvgs.groupby(pd.Grouper(freq='M')).max(numeric_only=True)
    return monthHighs
  
def monthly_lows(df, true_average=False):
    """Monthly lows. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the daily average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard).
    """
    dailyAvgs = daily_avgs(df, true_average=true_average)
    monthLows = dailyAvgs.groupby(pd.Grouper(freq='M')).min(numeric_only=True)
    return monthLows
    
def monthly_avg(df, true_average=False):
    """Monthly averages. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the daily average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological
    standard).
    """
    dailyAvgs = mon_daily_avgs(df, true_average=true_average)
    monthlyMeans = dailyAvgs.groupby(pd.Grouper(freq='M')).mean(numeric_only=True)
    monthlyMeans.drop('YearDay', axis=1, inplace=True)
    monthlyAvg = monthlyMeans.groupby(monthlyMeans.index.month).mean(numeric_only=True)
    monthlyAvg.index = monthlyAvg.index.astype(int)
    results = xr.DataArray(monthlyAvg, dims=['month', 'variable'])
    results.name = 'Monthly Average'
    return results

def record_high_daily_avg(df, true_average=False):
    """Record high daily averages. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the daily average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard).
    """
    # Calculate the records
    dailyAvgs = daily_avgs(df=df, true_average=true_average)
    recordHighDailyAvg = dailyAvgs.groupby('YearDay').max(numeric_only=True)
    recordHighDailyAvg.index = recordHighDailyAvg.index.astype(int)
    # Record years
    recordHighDailyAvgYear = dailyAvgs.groupby('YearDay').apply(
            lambda x: x.sort_index().idxmax(numeric_only=True).dt.year)
    recordHighDailyAvgYear.drop('YearDay', axis=1, inplace=True)
    recordHighDailyAvgYear.index = recordHighDailyAvgYear.index.astype(int)
    recordHighDailyAvgYear.columns = [i+' Year' for i in recordHighDailyAvgYear.columns]
    # Create xarray
    results = pd.concat((recordHighDailyAvg, recordHighDailyAvgYear), axis=1)
    results = xr.DataArray(results, dims=['yearday', 'variable'])
    results.name = 'Record High Daily Average'
    return results
    
def record_high_monthly_avg(df, true_average=False):
    """Record high monthly averages. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the daily average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard).
    """
    # Calculate the records
    dailyAvgs = mon_daily_avgs(df, true_average=true_average)
    monthlyAvgs = dailyAvgs.groupby(pd.Grouper(freq='M')).mean(numeric_only=True)
    monthlyAvgs.drop('YearDay', axis=1, inplace=True)
    recordHighMonthlyAvg = monthlyAvgs.groupby(monthlyAvgs.index.month).max(numeric_only=True)
    recordHighMonthlyAvg.index = recordHighMonthlyAvg.index.astype(int)
    # Record years
    recordHighMonthlyAvgYear = monthlyAvgs.groupby(monthlyAvgs.index.month).apply(
                lambda x: x.sort_index().idxmax(numeric_only=True).dt.year)
    recordHighMonthlyAvgYear.index = recordHighMonthlyAvgYear.index.astype(int)
    recordHighMonthlyAvgYear.columns = [i+' Year' for i in recordHighMonthlyAvgYear.columns]
    # Create xarray
    results = pd.concat((recordHighMonthlyAvg, recordHighMonthlyAvgYear), axis=1)
    results = xr.DataArray(results, dims=['month', 'variable'])
    results.name = 'Record High Monthly Average'
    return results

def record_low_daily_avg(df, true_average=False):
    """Record low daily averages.  If 'true_average' True, all measurements from each 24-hour day will be used to calculate the average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard)."""
    # Calculate the records
    dailyAvgs = daily_avgs(df=df, true_average=true_average)
    recordLowDailyAvg = dailyAvgs.groupby('YearDay').min(numeric_only=True)
    recordLowDailyAvg.index = recordLowDailyAvg.index.astype(int)
    # Record years
    recordLowDailyAvgYear = dailyAvgs.groupby('YearDay').apply(
            lambda x: x.sort_index().idxmin(numeric_only=True).dt.year)
    recordLowDailyAvgYear.drop('YearDay', axis=1, inplace=True)
    recordLowDailyAvgYear.index = recordLowDailyAvgYear.index.astype(int)
    recordLowDailyAvgYear.columns = [i+' Year' for i in recordLowDailyAvgYear.columns]
    # Create xarray
    results = pd.concat((recordLowDailyAvg, recordLowDailyAvgYear), axis=1)
    results = xr.DataArray(results, dims=['yearday', 'variable'])
    results.name = 'Record Low Daily Average'
    return results

def record_low_monthly_avg(df, true_average=False):
    """Record low monthly averages. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the daily average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard).
    """
    # Calculate the records
    dailyAvgs = mon_daily_avgs(df, true_average=true_average)
    monthlyAvgs = dailyAvgs.groupby(pd.Grouper(freq='M')).mean(numeric_only=True)
    monthlyAvgs.drop('YearDay', axis=1, inplace=True)
    recordLowMonthlyAvg = monthlyAvgs.groupby(monthlyAvgs.index.month).min(numeric_only=True)
    recordLowMonthlyAvg.index = recordLowMonthlyAvg.index.astype(int)
    # Record years
    recordLowMonthlyAvgYear = \
            monthlyAvgs.groupby(monthlyAvgs.index.month).apply(
                lambda x: x.sort_index().idxmin(numeric_only=True).dt.year)
    recordLowMonthlyAvgYear.index = recordLowMonthlyAvgYear.index.astype(int)
    recordLowMonthlyAvgYear.columns = [i+' Year' for i in recordLowMonthlyAvgYear.columns]
    # Create xarray
    results = pd.concat((recordLowMonthlyAvg, recordLowMonthlyAvgYear), axis=1)
    results = xr.DataArray(results, dims=['month', 'variable'])
    results.name = 'Record Low Monthly Average'
    return results

def avg_daily_high(df):
    """Average daily highs."""        
    dailyHighs = daily_highs(df)
    results = dailyHighs.groupby('YearDay').mean(numeric_only=True)
    results = xr.DataArray(results, dims=['yearday', 'variable'])
    results.name = 'Average Daily High'
    return results

def avg_monthly_high(df, true_average=False):
    """Average monthly highs. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the daily average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard).
    """
    monthlyHighs = monthly_highs(df, true_average=true_average)
    monthlyHighs.drop('YearDay', axis=1, inplace=True)
    avgMonthlyHighs = monthlyHighs.groupby(monthlyHighs.index.month).mean(numeric_only=True)
    results = xr.DataArray(avgMonthlyHighs, dims=['month', 'variable'])
    results.name = 'Average Monthly High'
    return results

def lowest_daily_high(df):
    """Lowest daily highs."""
    # Calculate the record
    dailyHighs = daily_highs(df)
    lowestHigh = dailyHighs.groupby('YearDay').min(numeric_only=True)
    lowestHigh.index = lowestHigh.index.astype(int)
    # Record years
    lowestHighYear = dailyHighs.groupby('YearDay').apply(
            lambda x: x.sort_index().idxmin(numeric_only=True).dt.year)
    lowestHighYear.drop('YearDay', axis=1, inplace=True)
    lowestHighYear.index = lowestHighYear.index.astype(int)
    lowestHighYear.columns = [i+' Year' for i in lowestHighYear.columns]
    # Create xarray
    results = pd.concat((lowestHigh, lowestHighYear), axis=1)
    results = xr.DataArray(results, dims=['yearday', 'variable'])
    results.name = 'Lowest Daily High'
    return results
    
def lowest_monthly_high(df, true_average=False):
    """Lowest monthly highs. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the daily average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard).
    """
    # Calculate the record
    monthlyHighs = monthly_highs(df, true_average=true_average)
    monthlyHighs.drop('YearDay', axis=1, inplace=True)
    lowMonthlyHigh = monthlyHighs.groupby(monthlyHighs.index.month).min(numeric_only=True)
    lowMonthlyHigh.index = lowMonthlyHigh.index.astype(int)
    # Record years
    lowMonthlyHighYear = \
            monthlyHighs.groupby(monthlyHighs.index.month).apply(
                lambda x: x.sort_index().idxmin(numeric_only=True).dt.year)
    lowMonthlyHighYear.index = lowMonthlyHighYear.index.astype(int)
    lowMonthlyHighYear.columns = [i+' Year' for i in lowMonthlyHighYear.columns]
    # Create xarray
    results = pd.concat((lowMonthlyHigh, lowMonthlyHighYear), axis=1)
    results = xr.DataArray(results, dims=['month', 'variable'])
    results.name = 'Lowest Monthly High'
    return results

def record_daily_high(df):
    """Record daily highs."""
    # Calculate the record
    dailyHighs = daily_highs(df)
    recordHigh = dailyHighs.groupby('YearDay').max(numeric_only=True)
    recordHigh.index = recordHigh.index.astype(int)
    # Record years
    recordHighYear = dailyHighs.groupby('YearDay').apply(
            lambda x: x.sort_index().idxmax(numeric_only=True).dt.year)
    recordHighYear.drop('YearDay', axis=1, inplace=True)
    recordHighYear.index = recordHighYear.index.astype(int)
    recordHighYear.columns = [i+' Year' for i in recordHighYear.columns]
    # Create xarray
    results = pd.concat((recordHigh, recordHighYear), axis=1)
    results = xr.DataArray(results, dims=['yearday', 'variable'])
    results.name = 'Record Daily High'
    return results

def record_monthly_high(df, true_average=False):
    """Record monthly highs. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the daily average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard).
    """
    # Calculate the record
    monthlyHighs = monthly_highs(df, true_average=true_average)
    monthlyHighs.drop('YearDay', axis=1, inplace=True)
    recordMonthlyHigh = monthlyHighs.groupby(monthlyHighs.index.month).max(numeric_only=True)
    recordMonthlyHigh.index = recordMonthlyHigh.index.astype(int)
    # Record years
    recordMonthlyHighYear = \
            monthlyHighs.groupby(monthlyHighs.index.month).apply(
                lambda x: x.sort_index().idxmax(numeric_only=True).dt.year)
    recordMonthlyHighYear.index = recordMonthlyHighYear.index.astype(int)
    recordMonthlyHighYear.columns = [i+' Year' for i in recordMonthlyHighYear.columns]
    # Create xarray
    results = pd.concat((recordMonthlyHigh, recordMonthlyHighYear), axis=1)
    results = xr.DataArray(results, dims=['month', 'variable'])
    results.name = 'Record Monthly High'
    return results

def avg_daily_low(df):
    """Average daily lows."""        
    dailyLows = daily_lows(df)
    results = dailyLows.groupby('YearDay').mean(numeric_only=True)
    results = xr.DataArray(results, dims=['yearday', 'variable'])
    results.name = 'Average Daily Low'
    return results

def avg_monthly_low(df, true_average=False):
    """Average monthly lows. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the daily average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard).
    """
    monthlyLows = monthly_lows(df, true_average=true_average)
    monthlyLows.drop('YearDay', axis=1, inplace=True)
    avgMonthlyLows = monthlyLows.groupby(monthlyLows.index.month).mean(numeric_only=True)
    results = xr.DataArray(avgMonthlyLows, dims=['month', 'variable'])
    results.name = 'Average Monthly Low'
    return results

def highest_daily_low(df):
    """Highest daily lows."""
    # Calculate the record
    dailyLows = daily_lows(df)
    highestLow = dailyLows.groupby('YearDay').max(numeric_only=True)
    highestLow.index = highestLow.index.astype(int)
    # Record years
    highestLowYear = dailyLows.groupby('YearDay').apply(
            lambda x: x.sort_index().idxmax(numeric_only=True).dt.year)
    highestLowYear.drop('YearDay', axis=1, inplace=True)
    highestLowYear.index = highestLowYear.index.astype(int)
    highestLowYear.columns = [i+' Year' for i in highestLowYear.columns]
    # Create xarray
    results = pd.concat((highestLow, highestLowYear), axis=1)
    results = xr.DataArray(results, dims=['yearday', 'variable'])
    results.name = 'Highest Daily Low'
    return results
    
def highest_monthly_low(df, true_average=False):
    """Highest monthly lows. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the daily average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard).
    """
    # Calculate the record
    monthlyLows = monthly_lows(df, true_average=true_average)
    monthlyLows.drop('YearDay', axis=1, inplace=True)
    highestMonthlyLow = monthlyLows.groupby(monthlyLows.index.month).max(numeric_only=True)
    highestMonthlyLow.index = highestMonthlyLow.index.astype(int)
    # Record years
    highestMonthlyLowYear = \
            monthlyLows.groupby(monthlyLows.index.month).apply(
                lambda x: x.sort_index().idxmax(numeric_only=True).dt.year)
    highestMonthlyLowYear.index = highestMonthlyLowYear.index.astype(int)
    highestMonthlyLowYear.columns = [i+' Year' for i in highestMonthlyLowYear.columns]
    # Create xarray
    results = pd.concat((highestMonthlyLow, highestMonthlyLowYear), axis=1)
    results = xr.DataArray(results, dims=['month', 'variable'])
    results.name = 'Highest Monthly Low'
    return results

def record_daily_low(df):
    """Record daily lows."""
    # Calculate the record
    dailyLows = daily_lows(df)
    recordLow = dailyLows.groupby('YearDay').min(numeric_only=True)
    recordLow.index = recordLow.index.astype(int)
    # Record years
    recordLowYear = dailyLows.groupby('YearDay').apply(
            lambda x: x.sort_index().idxmin(numeric_only=True).dt.year)
    recordLowYear.drop('YearDay', axis=1, inplace=True)
    recordLowYear.index = recordLowYear.index.astype(int)
    recordLowYear.columns = [i+' Year' for i in recordLowYear.columns]
    # Create xarray
    results = pd.concat((recordLow, recordLowYear), axis=1)
    results = xr.DataArray(results, dims=['yearday', 'variable'])
    results.name = 'Record Daily Low'
    return results

def record_monthly_low(df, true_average=False):
    """Record monthly lows. If 'true_average' is True, all measurements from each 24-hour day will be used to calculate the daily average. Otherwise, only the maximum and minimum observations are used. Defaults to False (meteorological standard).
    """
    # Calculate the record
    monthlyLows = monthly_lows(df, true_average=true_average)
    monthlyLows.drop('YearDay', axis=1, inplace=True)
    recordMonthlyLow = monthlyLows.groupby(monthlyLows.index.month).min(numeric_only=True)
    recordMonthlyLow.index = recordMonthlyLow.index.astype(int)
    # Record years
    recordMonthlyLowYear = \
            monthlyLows.groupby(monthlyLows.index.month).apply(
                lambda x: x.sort_index().idxmin(numeric_only=True).dt.year)
    recordMonthlyLowYear.index = recordMonthlyLowYear.index.astype(int)
    recordMonthlyLowYear.columns = [i+' Year' for i in recordMonthlyLowYear.columns]
    # Create xarray
    results = pd.concat((recordMonthlyLow, recordMonthlyLowYear), axis=1)
    results = xr.DataArray(results, dims=['month', 'variable'])
    results.name = 'Record Monthly Low'
    return results

def number_of_years_byday(df):
    """Number of years in the historical data records by day of year."""
    numYears = pd.concat([df[[v, 'YearDay']].dropna().groupby('YearDay')\
                                        .apply(lambda x: len(x.index.year.unique())) \
                         for v in filtered_hours.columns if v != 'YearDay'], axis=1)
    numYears.columns = [v for v in df.columns if v != 'YearDay']
    results = xr.DataArray(numYears, dims=['yearday', 'variable'])
    results.name = 'Number of Years'
    return results

def number_of_years_bymonth(df):
    """Number of years in the historical data records by month."""
    numYears = pd.concat([df[v].dropna().groupby(df[v].dropna().index.month)\
                                        .apply(lambda x: len(x.index.year.unique())) \
                         for v in filtered_hours.columns if v != 'YearDay'], axis=1)
    numYears.columns = [v for v in df.columns if v != 'YearDay']
    results = xr.DataArray(numYears, dims=['month', 'variable'])
    results.name = 'Number of Years'
    return results

def generate_yeardays():
    return pd.date_range(start='2020-01-01',end='2020-12-31', freq='1D').strftime('%d-%b')

Data cleaning

First we need to load in the data and metadata for the desired station. As before, stationname is the custom human-readable “City, ST” string for the station. This will be used to determine the directory from which to load the data. Since we are not downloading data, we do not need the NOAA-COOPS station ID number.

stationname = 'Virginia Key, FL'

Derive the local directory name containing the data from the station name. This is the same way the directory was created when the data were downloaded.

dirname = camel(stationname)
outdir = f'../{dirname}'

print(f"Station folder: {dirname}")
print(f"Full directory: {outdir}")

Station folder: virginiaKeyFl
Full directory: ../virginiaKeyFl

Next, load the data and metadata.

# Metadata
with open(os.path.join(outdir, 'metadata.yml')) as m:
    meta = yaml.safe_load(m)

# Data types
dtypeDict = {k: float for k in meta['variables']}
dtypeDict['Water Level QC'] = str

# Observational data
data = pd.read_csv(os.path.join(outdir, 'observational_data_record.csv.gz'),
                   index_col=f'time_{meta["tz"]}', parse_dates=True,
                   compression='infer', dtype=dtypeDict)

Now we filter the data to remove days with more than 4 hours of missing data and months with more than 2 days of missing data. These thresholds are stored in meta and can easily be changed. We have to do this one variable at a time because this is sensor-dependent, so it takes a short while to run.

filtered_hours = pd.concat([filter_hours(data[var],
                                       hr_threshold=meta['hr_threshold'])
                                       for var in meta['variables']], axis=1)

filtered_days = pd.concat([filter_days(data[var],
                                       hr_threshold=meta['hr_threshold'],
                                       day_threshold=meta['day_threshold'])
                                       for var in meta['variables']], axis=1)

Confirm that the data were filtered:

print(data.shape)
print(filtered_hours.shape)
print(filtered_days.shape)

(2745205, 4)
(2720854, 3)
(2657974, 3)

Calculate records

Now we’re ready to determine the records using all of the functions above. We’ll store these in an xarray dataset and add the appropriate metadata for convenience. But first, we need to add a day of year (DOY) column so that we can calculate daily records. We’ve used a function to do this because accounting for leap years is not trivial.

filtered_hours = DOY(filtered_hours)
filtered_days = DOY(filtered_days)

daily_records = \
    xr.Dataset({'Daily Average': daily_avg(filtered_hours),
                'Record High Daily Average': record_high_daily_avg(filtered_hours),
                'Record Low Daily Average': record_low_daily_avg(filtered_hours),
                'Average High': avg_daily_high(filtered_hours),
                'Lowest High': lowest_daily_high(filtered_hours),
                'Record High': record_daily_high(filtered_hours),
                'Average Low': avg_daily_low(filtered_hours),
                'Highest Low': highest_daily_low(filtered_hours),
                'Record Low': record_daily_low(filtered_hours),
                'Years': number_of_years_byday(filtered_hours)},
               attrs={k:v for k, v in meta.items() if k not in ['outdir', 'variables', 'units']})

monthly_records = \
    xr.Dataset({'Monthly Average': monthly_avg(filtered_days),
                'Record High Monthly Average': record_high_monthly_avg(filtered_days),
                'Record Low Monthly Average': record_low_monthly_avg(filtered_days),
                'Average High': avg_monthly_high(filtered_days),
                'Lowest High': lowest_monthly_high(filtered_days),
                'Record High': record_monthly_high(filtered_days),
                'Average Low': avg_monthly_low(filtered_days),
                'Highest Low': highest_monthly_low(filtered_days),
                'Record Low': record_monthly_low(filtered_days),
                'Years': number_of_years_bymonth(filtered_days)},
               attrs={k:v for k, v in meta.items() if k not in ['outdir', 'variables', 'units']})

Add data units and time series ranges for each variable to the arrays as metadata attributes.

for k, v in meta['units'].items():
    daily_records.attrs[k+' units'] = v

for var in daily_records.coords['variable'].values:
    if 'Year' not in var:
        daily_records.attrs[var+' data range'] = \
            (filtered_hours[var].first_valid_index().strftime('%Y-%m-%d'),
             filtered_hours[var].last_valid_index().strftime('%Y-%m-%d'))

for k, v in meta['units'].items():
    monthly_records.attrs[k+' units'] = v

for var in monthly_records.coords['variable'].values:
    if 'Year' not in var:
        monthly_records.attrs[var+' data range'] = \
            (filtered_days[var].first_valid_index().strftime('%Y-%m-%d'),
             filtered_days[var].last_valid_index().strftime('%Y-%m-%d'))

What do we have now? Let’s take a look:

daily_records

monthly_records

How are these stored? Let’s consider the monthly stats. Each statistic is its own variable within the dataset. Take Record High for example:

monthly_records['Record High']

Here, the rows are months and the columns are the records or corresponding year. Let’s see what the variables are:

monthly_records['Record High'].coords['variable']

<xarray.DataArray 'variable' (variable: 6)> Size: 48B
array(['Air Temperature', 'Air Temperature Year', 'Water Level',
       'Water Level Year', 'Water Temperature', 'Water Temperature Year'],
      dtype=object)
Coordinates:
  * variable  (variable) object 48B 'Air Temperature' ... 'Water Temperature ...

Alternatively, we can select a specific variable and see all of its stats (converting to a dataframe makes it easier to see):

monthly_records.sel(variable='Air Temperature').to_dataframe().drop('variable', axis=1)

	Monthly Average	Record High Monthly Average	Record Low Monthly Average	Average High	Lowest High	Record High	Average Low	Highest Low	Record Low	Years
month
1	68.656295	72.570968	63.038710	76.050000	72.95	78.05	55.550000	63.50	48.30	23.0
2	70.892723	74.900000	65.465517	76.485417	74.20	78.55	59.362500	70.00	47.90	24.0
3	72.272770	77.616667	66.079032	78.490000	74.15	82.85	63.334000	72.00	55.10	25.0
4	75.633064	79.383333	72.813333	80.740000	77.30	85.80	68.398000	72.60	61.20	25.0
5	78.728491	81.930645	76.974138	82.480769	79.35	87.25	73.951923	77.10	67.95	26.0
6	81.538006	83.613333	79.840000	84.754762	82.75	87.65	77.635714	80.75	75.10	21.0
7	82.909982	84.996774	80.980645	85.790385	84.20	88.70	79.019231	82.30	76.10	26.0
8	83.298278	85.906452	81.820968	85.845833	84.05	88.50	79.510417	83.55	77.00	24.0
9	82.093823	83.740000	80.571667	85.166000	83.90	86.70	78.336000	81.60	74.30	25.0
10	79.634756	81.238710	77.551613	83.812500	80.95	86.75	72.883333	77.80	64.60	24.0
11	75.042799	78.623333	71.373333	79.841667	76.90	82.05	65.972917	74.45	57.35	24.0
12	71.425446	76.870968	62.096667	77.470000	72.50	79.65	59.406000	70.50	48.75	25.0

Reorganize

For the sake of convenience later, let’s rearrange these data arrays before saving them. It will be more useful to have records and years as data variables instead of a dimension, but we have to do some renaming in order to pull this off.

First, separate the records and years into smaller xarrays:

day_records = daily_records.sel(variable=[i for i in daily_records.coords['variable'].values if 'Year' not in i])
day_years = daily_records.sel(variable=[i for i in daily_records.coords['variable'].values if 'Year' in i])

mon_records = monthly_records.sel(variable=[i for i in monthly_records.coords['variable'].values if 'Year' not in i])
mon_years = monthly_records.sel(variable=[i for i in monthly_records.coords['variable'].values if 'Year' in i])

Next, add “Year” to all of the variable names and remove it from the coordinate name:

day_years = day_years.rename_vars({i:i+' Year' for i in day_years.data_vars})
day_years.coords['variable'] = [i.removesuffix(' Year') for i in day_years.coords['variable'].values]

mon_years = mon_years.rename_vars({i:i+' Year' for i in mon_years.data_vars})
mon_years.coords['variable'] = [i.removesuffix(' Year') for i in mon_years.coords['variable'].values]

Now we can merge these two xarrays together, rearrange the order of the variables, and drop those that do not contain a year, such as daily average.

daily_records = xr.merge([day_records, day_years])
daily_records = daily_records[[item for items in zip(day_records.data_vars, day_years.data_vars) for item in items]]
daily_records = daily_records.drop_vars([x for x in daily_records.data_vars if daily_records[x].isnull().all()])

monthly_records = xr.merge([mon_records, mon_years])
monthly_records = monthly_records[[item for items in zip(mon_records.data_vars, mon_years.data_vars) for item in items]]
monthly_records = monthly_records.drop_vars([x for x in monthly_records.data_vars if monthly_records[x].isnull().all()])

monthly_records

Finally, let’s convert years to integers since we do not need decimal years.

daily_records[[i for i in daily_records.data_vars if "Year" in i]] = \
    daily_records[[i for i in daily_records.data_vars if "Year" in i]].astype(int)

monthly_records[[i for i in monthly_records.data_vars if "Year" in i]] = \
    monthly_records[[i for i in monthly_records.data_vars if "Year" in i]].astype(int)

‘yearday’ is not intuitive, so we can change it to calendar day instead and rename the coordinate. Similarly, we can use month names instead of numbers for the sake of clarity.

daily_records.coords['yearday'] = pd.date_range(start='2020-01-01', end='2020-12-31', freq='1D').strftime('%d-%b')
daily_records = daily_records.rename({'yearday':'Date'})

monthly_records.coords['month'] = pd.date_range(start='2020-01-01', end='2020-12-31', freq='1m').strftime('%b')
monthly_records = monthly_records.rename({'month': 'Month'})

Now take a look at the final products

daily_records

monthly_records

monthly_records.coords['variable']

<xarray.DataArray 'variable' (variable: 3)> Size: 24B
array(['Air Temperature', 'Water Level', 'Water Temperature'], dtype=object)
Coordinates:
  * variable  (variable) object 24B 'Air Temperature' ... 'Water Temperature'

We can still choose one environmental variable at a time, but now we get all of the records and corresponding years:

monthly_records.sel(variable='Air Temperature').to_dataframe().drop('variable', axis=1)

	Monthly Average	Record High Monthly Average	Record High Monthly Average Year	Record Low Monthly Average	Record Low Monthly Average Year	Average High	Lowest High	Lowest High Year	Record High	Record High Year	Average Low	Highest Low	Highest Low Year	Record Low	Record Low Year	Years
Month
Jan	68.656295	72.570968	2013	63.038710	2001	76.050000	72.95	2011	78.05	2022	55.550000	63.50	2013	48.30	1997	23
Feb	70.892723	74.900000	2018	65.465517	1996	76.485417	74.20	2000	78.55	2021	59.362500	70.00	2018	47.90	1996	24
Mar	72.272770	77.616667	2003	66.079032	2010	78.490000	74.15	2010	82.85	2003	63.334000	72.00	1997	55.10	1996	25
Apr	75.633064	79.383333	2020	72.813333	2004	80.740000	77.30	2004	85.80	2020	68.398000	72.60	2015	61.20	2009	25
May	78.728491	81.930645	2024	76.974138	2007	82.480769	79.35	2007	87.25	2024	73.951923	77.10	2003	67.95	1999	26
Jun	81.538006	83.613333	2010	79.840000	2014	84.754762	82.75	2014	87.65	2009	77.635714	80.75	2004	75.10	1995	21
Jul	82.909982	84.996774	2023	80.980645	2013	85.790385	84.20	2012	88.70	2018	79.019231	82.30	2022	76.10	2013	26
Aug	83.298278	85.906452	2022	81.820968	1994	85.845833	84.05	2003	88.50	2022	79.510417	83.55	2022	77.00	1994	24
Sep	82.093823	83.740000	2024	80.571667	2001	85.166000	83.90	2000	86.70	2021	78.336000	81.60	2024	74.30	2001	25
Oct	79.634756	81.238710	2020	77.551613	2000	83.812500	80.95	2010	86.75	2023	72.883333	77.80	2020	64.60	2005	24
Nov	75.042799	78.623333	2015	71.373333	2012	79.841667	76.90	2012	82.05	2020	65.972917	74.45	2020	57.35	2006	24
Dec	71.425446	76.870968	2015	62.096667	2010	77.470000	72.50	2010	79.65	1994	59.406000	70.50	2015	48.75	2010	25

Finally, write these to file for safe keeping.

daily_records.to_netcdf(os.path.join(outdir, 'statistics-daily.nc'), mode='w')
monthly_records.to_netcdf(os.path.join(outdir, 'statistics-monthly.nc'), mode='w')

We will plot these results in Part 3.