from datetime import datetime as dt
from noaa_coops import Station
import pandas as pd
import numpy as np
import yaml
import os
Downloading NOAA CO-OPS Data
Part 1 of 3
This notebook demonstrates downloading atmospheric and water observations from the National Oceanic and Atmospheric Administration (NOAA) Center for Operational Oceanographic Products and Services (CO-OPS) data portal. The objective is to replicate the Climatology for Virginia Key, FL page created and maintained by Brian McNoldy at the University of Miami Rosenstiel School of Marine, Atmospheric, and Earth Science.
We will retrieve water level, water temperature, and air temperature data from Virginia Key, FL. The full set of variables available for retrieval, however, includes:
- Water level (i.e., tides)
- Water temperature
- Air temperature
- Barometric pressure
- Wind speed
In this notebook we will download the data, store the metadata, and write these to file. The second notebook, NOAA-CO-OPS-records, will filter these data and calculate a set of statistics and records. Part 3, NOAA-CO-OPS-plots, will plot and display the data.
Packages and configurations
First we import the packages we need.
By default, Python only displays warnings the first time they are thrown. Ideally, we want code that does not throw any warnings, but it sometimes takes some trial and error to resolve the issue being warned about. So, for diagnostic purposes, we’ll set the kernel to always display warnings.
import warnings
warnings.filterwarnings('always')
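As a minimal, self-contained illustration of what the 'always' filter changes (using a throwaway UserWarning that is not part of this notebook): with the default behavior, a warning repeated from the same location is reported once; with 'always', it is reported every time.

```python
import warnings

def noisy():
    # Emit the same warning from the same location on every call
    warnings.warn("demo warning", UserWarning)

# Record warnings so we can count them; with the 'always' filter,
# the repeated warning is captured on every call, not just the first
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    noisy()
    noisy()

print(len(caught))  # → 2
```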
Functions
Next, we define a number of functions that will come in handy later:
- camel: uses the site location to create a directory name in camelCase (e.g., “virginiaKeyFl”) so that we do not have to do it manually
- get_units: quick tool for retrieving the units for a desired variable
- format_date: formats a timestamp as YYYYMMDD for the noaa_coops utility
- load_atemp: fetches air temperature data
- load_water_temp: fetches water temperature data
- load_water_level: fetches water level (tide) data
- load_hourly_height: fetches hourly water level height data, the precursor to the 6-minute water level product
- download_data: wrapper function that downloads all of the desired data
Helper functions
def camel(text):
    """Convert 'text' to camel case"""
    s = text.replace(',', '').replace('.', '').replace("-", " ").replace("_", " ")
    s = s.split()
    if len(text) == 0:
        return text
    return s[0].lower() + ''.join(i.capitalize() for i in s[1:])
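As a quick sanity check (the helper is repeated here so the snippet runs on its own), the station name used later in this notebook converts as expected; a second, hypothetical station name is included for comparison.

```python
def camel(text):
    """Convert 'text' to camel case"""
    s = text.replace(',', '').replace('.', '').replace("-", " ").replace("_", " ")
    s = s.split()
    if len(text) == 0:
        return text
    return s[0].lower() + ''.join(i.capitalize() for i in s[1:])

print(camel('Virginia Key, FL'))  # → virginiaKeyFl
print(camel('Bar Harbor, ME'))    # → barHarborMe
```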
def get_units(variable, unit_system):
    """Return the desired units for 'variable'"""
    deg = u'\N{DEGREE SIGN}'
    unit_options = dict({
        'Air Temperature': {'metric': deg+'C', 'english': deg+'F'},
        'Barometric Pressure': {'metric': 'mb', 'english': 'mb'},
        'Wind Speed': {'metric': 'm/s', 'english': 'kn'},
        'Wind Gust': {'metric': 'm/s', 'english': 'kn'},
        'Wind Direction': {'metric': 'deg', 'english': 'deg'},
        'Water Temperature': {'metric': deg+'C', 'english': deg+'F'},
        'Water Level': {'metric': 'm', 'english': 'ft'}
    })
    return unit_options[variable][unit_system]
def format_date(datestr):
    """Format date strings into YYYYMMDD format"""
    dtdt = pd.to_datetime(datestr)
    return dt.strftime(dtdt, '%Y%m%d')
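The helper accepts anything pandas can parse and reduces it to YYYYMMDD; it is repeated here (with its imports) so the check is self-contained.

```python
from datetime import datetime as dt
import pandas as pd

def format_date(datestr):
    """Format date strings into YYYYMMDD format"""
    dtdt = pd.to_datetime(datestr)
    return dt.strftime(dtdt, '%Y%m%d')

print(format_date('2025-05-05'))         # → 20250505
print(format_date('May 5, 2025 12:00'))  # → 20250505
```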
Downloading data
def load_atemp(metadata, start_date, end_date, verbose=True):
    """Download air temperature data from NOAA CO-OPS between 'start_date'
    and 'end_date' for 'stationid', 'unit_system', and timezone 'tz'
    provided in 'metadata' dictionary.
    """
    if verbose:
        print('Retrieving air temperature data')
    station = Station(id=metadata['stationid'])
    if not start_date:
        start_date = format_date(station.data_inventory['Air Temperature']['start_date'])
    if not end_date:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    air_temp = station.get_data(
        begin_date=start_date,
        end_date=end_date,
        product='air_temperature',
        units=metadata['unit_system'],
        time_zone=metadata['tz'])
    air_temp.columns = ['atemp', 'atemp_flag']
    return air_temp
def load_water_temp(metadata, start_date, end_date, verbose=True):
    """Download water temperature data from NOAA CO-OPS between
    'start_date' and 'end_date' for 'stationid', 'unit_system', and
    timezone 'tz' provided in 'metadata' dictionary.
    """
    if verbose:
        print('Retrieving water temperature data')
    station = Station(id=metadata['stationid'])
    if not start_date:
        start_date = format_date(station.data_inventory['Water Temperature']['start_date'])
    if not end_date:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    water_temp = station.get_data(
        begin_date=start_date,
        end_date=end_date,
        product='water_temperature',
        units=metadata['unit_system'],
        time_zone=metadata['tz'])
    water_temp.columns = ['wtemp', 'wtemp_flag']
    return water_temp
def load_water_level(metadata, start_date, end_date, verbose=True):
    """Download water level data from NOAA CO-OPS between 'start_date' and
    'end_date' for 'stationid', 'unit_system', 'datum', and timezone 'tz'
    provided in 'metadata' dictionary.
    """
    if verbose:
        print('Retrieving water level tide data')
    station = Station(id=metadata['stationid'])
    if not start_date:
        start_date = format_date(station.data_inventory['Verified 6-Minute Water Level']['start_date'])
    if not end_date:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    water_levels = station.get_data(
        begin_date=start_date,
        end_date=end_date,
        product='water_level',
        datum=metadata['datum'],
        units=metadata['unit_system'],
        time_zone=metadata['tz'])
    water_levels.columns = ['wlevel', 's', 'wlevel_flag', 'wlevel_qc']
    return water_levels
def load_hourly_height(metadata, start_date, end_date, verbose=True):
    """Download verified hourly height data, the predecessor to the
    6-minute water level product, from NOAA CO-OPS from 'start_date'
    through 'end_date'.
    """
    if verbose:
        print('Retrieving hourly height data')
    station = Station(id=metadata['stationid'])
    hourly_heights = station.get_data(
        begin_date=start_date,
        end_date=end_date,
        product='hourly_height',
        datum=metadata['datum'],
        units=metadata['unit_system'],
        time_zone=metadata['tz'])
    hourly_heights.columns = ['wlevel', 's', 'wlevel_flag']
    # Add QC column for comparing to water level product
    hourly_heights['wlevel_qc'] = 'v'
    return hourly_heights
def download_data(metadata, start_date=None, end_date=None, verbose=True):
    """Download data from NOAA CO-OPS"""
    if verbose:
        print('Downloading data')

    # NOAA CO-OPS API
    station = Station(id=metadata['stationid'])

    # List of data variables to combine at the end
    datasets = []

    # If no 'end_date' is passed, download through end of current date
    if end_date is None:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    else:
        end_date = format_date(end_date)

    # Air temperature. Use a per-variable 'start' so 'start_date' stays
    # None when the caller wants a full download.
    if 'Air Temperature' in station.data_inventory:
        if start_date is None:
            start = format_date(station.data_inventory['Air Temperature']['start_date'])
        else:
            start = format_date(start_date)
        air_temp = load_atemp(metadata=metadata, start_date=start,
                              end_date=end_date, verbose=verbose)
        air_temp['atemp_flag'] = air_temp['atemp_flag'].str.split(',', expand=True).astype(int).sum(axis=1)
        air_temp.loc[air_temp['atemp_flag'] > 0, 'atemp'] = np.nan
        datasets.append(air_temp['atemp'])

    # Water temperature
    if 'Water Temperature' in station.data_inventory:
        if start_date is None:
            start = format_date(station.data_inventory['Water Temperature']['start_date'])
        else:
            start = format_date(start_date)
        water_temp = load_water_temp(metadata=metadata, start_date=start,
                                     end_date=end_date, verbose=verbose)
        water_temp['wtemp_flag'] = water_temp['wtemp_flag'].str.split(',', expand=True).astype(int).sum(axis=1)
        water_temp.loc[water_temp['wtemp_flag'] > 0, 'wtemp'] = np.nan
        datasets.append(water_temp['wtemp'])

    # Water level (tides)
    if 'Verified 6-Minute Water Level' in station.data_inventory:
        if start_date is None:
            start = format_date(station.data_inventory['Verified 6-Minute Water Level']['start_date'])
        else:
            start = format_date(start_date)
        water_levels = load_water_level(metadata=metadata, start_date=start,
                                        end_date=end_date, verbose=verbose)
        water_levels['wlevel_flag'] = water_levels['wlevel_flag'].str.split(',', expand=True).astype(int).sum(axis=1)
        water_levels.loc[water_levels['wlevel_flag'] > 0, 'wlevel'] = np.nan

        # Hourly water heights (historical product); only needed when
        # downloading the full record
        if start_date is None:
            if 'Verified Hourly Height Water Level' in station.data_inventory:
                start = format_date(station.data_inventory['Verified Hourly Height Water Level']['start_date'])
                end = format_date(water_levels.index[0] + pd.Timedelta(days=1))
                hourly_heights = load_hourly_height(metadata=metadata,
                                                    start_date=start, end_date=end, verbose=verbose)
                hourly_heights['wlevel_flag'] = \
                    hourly_heights['wlevel_flag'].str \
                    .split(',', expand=True).astype(int).sum(axis=1)
                hourly_heights.loc[hourly_heights['wlevel_flag'] > 0] = np.nan
                water_levels = pd.concat(
                    (hourly_heights[['wlevel', 'wlevel_flag', 'wlevel_qc']][:-1],
                     water_levels[['wlevel', 'wlevel_flag', 'wlevel_qc']]), axis=0
                )
                water_levels = water_levels[~water_levels.index.duplicated(keep='first')]
        datasets.append(water_levels[['wlevel', 'wlevel_qc']])

    # Merge into single dataframe and rename columns
    if verbose:
        print('Compiling data')
    newdata = pd.concat(datasets, axis=1)
    newdata.index.name = f'time_{metadata["tz"]}'
    newdata.columns = metadata['variables'] + ['Water Level QC']
    return newdata
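The QC logic inside download_data is the same for every variable: each observation carries a comma-separated string of flag digits, and any nonzero digit marks the value as suspect. A small, self-contained sketch with made-up values (not real CO-OPS output) shows the pattern:

```python
import numpy as np
import pandas as pd

# Toy data mimicking the CO-OPS output: a value column plus a
# comma-separated flag string ('0,0,0' means all checks passed)
df = pd.DataFrame({
    'atemp': [75.2, 76.1, 74.8],
    'atemp_flag': ['0,0,0', '0,1,0', '0,0,0'],
})

# Split the flag string into columns, convert to int, and sum;
# a sum > 0 means at least one QC check failed
df['atemp_flag'] = df['atemp_flag'].str.split(',', expand=True).astype(int).sum(axis=1)
df.loc[df['atemp_flag'] > 0, 'atemp'] = np.nan

print(df)  # the second row's temperature is now NaN
```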
Load / download data
Now it’s time to load the data. First, specify the station we want. This will be used to load saved data, or to download all data from a new station if we have not yet retrieved data from this particular station.
stationname is a custom human-readable “City, ST” string for the station, while stationid is the NOAA CO-OPS station ID number.
We also need to specify other settings such as the datum, timezone (tz), and unit system.
Finally, we indicate hr_threshold, the maximum number of hours of missing data allowed before a day is not counted in the records, and day_threshold, the maximum number of days of missing data allowed before a month is not counted.
stationname = 'Virginia Key, FL'
stationid = '8723214'
datum = 'MHHW'
tz = 'lst'
unit_system = 'english'
hr_threshold = 4
day_threshold = 2
Derive the name of the directory for this station’s data from the station name. This is where the data are or will be saved locally.
dirname = camel(stationname)
outdir = os.path.join(os.getcwd(), dirname)
outdir = f'../{dirname}'

print(f"Station folder: {dirname}")
print(f"Full directory: {outdir}")
Station folder: virginiaKeyFl
Full directory: ../virginiaKeyFl
Set a flag for printing status messages:
verbose = True
Let’s see if we already have data from this station saved locally. This will be true if a directory already exists for the station.
If the directory outdir
does not exist, then no data have been downloaded for this station, so we need to download everything through the present. This requires a few steps:
- Create outdir
- Download the data and record the timestamp of the last observation for each variable in the metadata. This will be used later when updating the data.
- Write the data and metadata to file.
On the other hand, if data already exist locally, we will load it from file and download new data we do not yet have:
- Load the data and metadata from file
- Retrieve new data
- Combine the new data with the existing data, update the ‘last_obs’ metadata entries, and write the data and metadata to file
The noaa-coops tool only accepts dates without times, so it is possible to download data we already have. We therefore have to check what we download against what we already have to avoid duplicating data.
The most likely (and perhaps only) scenario for encountering duplicated data is if the data we have for the most recent day are incomplete. For example, assume today is May 5, 2025 and we download data at noon. Also assume the start date is some earlier day, the last time we retrieved data; this will be automatically determined from the metadata. Specifying an end date of 2025-05-05 will retrieve all data available through noon on May 5. In this case, we do not yet have these data, so we concatenate what we do not have to what we do have. However, if we then run the download function again (say, for diagnostic purposes) with the new start date of 2025-05-01 and the end date 2025-05-05, it will again download the data through noon on May 5. But since we already have those data, we do not want to re-concatenate them.
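The concatenate-then-filter step described above can be sketched with two toy frames whose indices overlap (timestamps invented for illustration): only rows whose timestamps are not already present get appended.

```python
import pandas as pd

# Existing record ends at 10:12; the fresh download re-covers 10:06-10:18
data = pd.DataFrame({'wtemp': [88.0, 88.1, 88.2]},
                    index=pd.to_datetime(['2025-05-05 10:00',
                                          '2025-05-05 10:06',
                                          '2025-05-05 10:12']))
newdata = pd.DataFrame({'wtemp': [88.1, 88.2, 88.3]},
                       index=pd.to_datetime(['2025-05-05 10:06',
                                             '2025-05-05 10:12',
                                             '2025-05-05 10:18']))

# Keep only the timestamps we do not already have, then append
fresh = newdata[~newdata.index.isin(data.index)]
data = pd.concat([data, fresh], axis=0)

print(len(data))  # → 4, with no duplicated timestamps
```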
This cell may take several seconds or minutes to run, depending on how much data are downloaded.
if not os.path.exists(outdir):
    if verbose:
        print('Creating new directory for this station.')
    os.makedirs(outdir)

    # Build the metadata dictionary
    meta = dict({
        'stationname': stationname,
        'stationid': stationid,
        'dirname': dirname,
        'unit_system': unit_system,
        'tz': tz,
        'datum': datum,
        'hr_threshold': hr_threshold,
        'day_threshold': day_threshold,
        'variables': ['Air Temperature', 'Water Temperature', 'Water Level']})
    meta['units'] = [get_units(var, meta['unit_system']) for var in meta['variables']]

    # Download all data (set start and end date to None to get all data)
    if verbose:
        print('Downloading all data for this station.')
    data = download_data(metadata=meta, start_date=None, end_date=None)
    outFile = os.path.join(outdir, 'observational_data_record.csv.gz')
    data.to_csv(outFile, compression='infer')
    meta['last_obs'] = {i: data[i].last_valid_index().strftime('%Y-%m-%d %X')
                        for i in meta['variables']}

    # Save the metadata, including the 'last_obs' entries, to file
    with open(os.path.join(outdir, 'metadata.yml'), 'w') as fp:
        yaml.dump(meta, fp)
    print(f"Observational data written to file '{outFile}'.")
else:
    # Load the metadata
    if verbose:
        print('Loading metadata from file')
    with open(os.path.join(outdir, 'metadata.yml')) as m:
        meta = yaml.safe_load(m)

    # Load the historical data
    if verbose:
        print('Loading historical data from file')
    dataInFile = os.path.join(outdir, 'observational_data_record.csv.gz')
    dtypeDict = {k: float for k in meta['variables']}
    dtypeDict['Water Level QC'] = str
    data = pd.read_csv(dataInFile, index_col=f'time_{meta["tz"]}', parse_dates=True,
                       compression='infer', dtype=dtypeDict)
    start_date = format_date(data.index.max())

    # Retrieve new data
    newdata = download_data(metadata=meta, start_date=start_date)
    if sum(~newdata.index.isin(data.index)) == 0:
        print('No new data available.')
    else:
        newdata.index.name = f"time_{meta['tz']}"
        newdata.columns = data.columns
        data = pd.concat([data,
                          newdata[~newdata.index.isin(data.index)]], axis=0)
        data.to_csv(dataInFile, compression='infer')
        meta['last_obs'] = {i: data[i].last_valid_index().strftime('%Y-%m-%d %X')
                            for i in meta['variables']}

        # Update the metadata file with the new 'last_obs' entries
        with open(os.path.join(outdir, 'metadata.yml'), 'w') as fp:
            yaml.dump(meta, fp)
        print("Updated observational data written to file 'observational_data_record.csv.gz'.")
Loading metadata from file
Loading historical data from file
Downloading data
Retrieving air temperature data
Retrieving water temperature data
Retrieving water level tide data
Compiling data
Updated observational data written to file 'observational_data_record.csv.gz'.
Check the data and metadata for sanity:
data
 | Air Temperature | Water Temperature | Water Level | Water Level QC |
---|---|---|---|---|
time_lst | | | | |
1994-01-28 00:00:00 | NaN | NaN | NaN | NaN |
1994-01-28 00:06:00 | NaN | NaN | NaN | NaN |
1994-01-28 00:12:00 | NaN | NaN | NaN | NaN |
1994-01-28 00:18:00 | NaN | NaN | NaN | NaN |
1994-01-28 00:24:00 | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... |
2025-05-30 19:36:00 | 86.7 | 88.2 | -1.657 | p |
2025-05-30 19:42:00 | 86.5 | 88.2 | NaN | p |
2025-05-30 19:48:00 | 86.5 | 88.2 | -1.552 | p |
2025-05-30 19:54:00 | 86.4 | 88.2 | NaN | p |
2025-05-30 20:00:00 | 86.4 | 88.2 | NaN | p |
2745205 rows × 4 columns
meta
{'datum': 'MHHW',
'day_threshold': 2,
'dirname': 'virginiaKeyFl',
'hr_threshold': 4,
'stationid': '8723214',
'stationname': 'Virginia Key, FL',
'tz': 'lst',
'unit_system': 'english',
'units': {'Air Temperature': '°F',
'Water Level': 'ft',
'Water Temperature': '°F'},
'variables': ['Air Temperature', 'Water Temperature', 'Water Level'],
'last_obs': {'Air Temperature': '2025-05-30 20:00:00',
'Water Temperature': '2025-05-30 20:00:00',
'Water Level': '2025-05-30 19:48:00'}}
len(data.index.unique()) == data.shape[0]
True
The ‘last_obs’ metadata values match the last observations in the data record and correspond to the most recently available observations. Also, every observation time is unique, so there are no duplicated entries. Everything checks out.
In the next part, we will filter these data and calculate statistics and records.