from datetime import datetime as dt
from noaa_coops import Station
import pandas as pd
import numpy as np
import yaml
import os
Downloading NOAA CO-OPS Data
Part 1 of 3
This notebook demonstrates downloading atmospheric and water observations from the National Oceanic and Atmospheric Administration (NOAA) Center for Operational Oceanographic Products and Services (CO-OPS) data portal. The objective is to replicate the Climatology for Virginia Key, FL page created and maintained by Brian McNoldy at the University of Miami Rosenstiel School of Marine, Atmospheric, and Earth Science.
We will retrieve water level, water temperature, and air temperature data from Virginia Key, FL. The full set of variables available for retrieval, however, includes:
- Water level (i.e., tides)
- Water temperature
- Air temperature
- Barometric pressure
- Wind speed
In this notebook we will download the data, store the metadata, and write these to file. The second notebook, NOAA-CO-OPS-records, will filter these data and calculate a set of statistics and records. Part 3, NOAA-CO-OPS-plots, will plot and display the data.
Packages and configurations
First we import the packages we need.
By default, Python only displays warnings the first time they are thrown. Ideally, we want code that does not throw any warnings, but it sometimes takes some trial and error to resolve the issue being warned about. So, for diagnostic purposes, we’ll set the kernel to always display warnings.
import warnings
warnings.filterwarnings('always')
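As a minimal, self-contained illustration of what the 'always' filter changes (using a throwaway UserWarning that is not part of this notebook): with the default behavior, a warning repeated from the same location is reported once; with 'always', it is reported every time.

```python
import warnings

def noisy():
    # Emit the same warning from the same location on every call
    warnings.warn("demo warning", UserWarning)

# Record warnings so we can count them; with the 'always' filter,
# the repeated warning is captured on every call, not just the first
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    noisy()
    noisy()

print(len(caught))  # → 2
```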
Functions
Next, we define a number of functions that will come in handy later:
- camel: uses the site location to create a directory name in camelCase (e.g., “virginiaKeyFl”) so that we do not have to do it manually
- get_units: quick tool for retrieving the units for a desired variable
- format_date: formats a timestamp as YYYYMMDD for the noaa_coops utility
- load_atemp: fetches air temperature data
- load_water_temp: fetches water temperature data
- load_water_level: fetches water level (tide) data
- load_hourly_height: fetches hourly water level height data, the precursor to the 6-minute water level product
- download_data: wrapper function that downloads all of the desired data
Helper functions
def camel(text):
    """Convert 'text' to camel case"""
    s = text.replace(',', '').replace('.', '').replace("-", " ").replace("_", " ")
    s = s.split()
    if len(text) == 0:
        return text
    return s[0].lower() + ''.join(i.capitalize() for i in s[1:])
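As a quick sanity check (the helper is repeated here so the snippet runs on its own), the station name used later in this notebook converts as expected; a second, hypothetical station name is included for comparison.

```python
def camel(text):
    """Convert 'text' to camel case"""
    s = text.replace(',', '').replace('.', '').replace("-", " ").replace("_", " ")
    s = s.split()
    if len(text) == 0:
        return text
    return s[0].lower() + ''.join(i.capitalize() for i in s[1:])

print(camel('Virginia Key, FL'))  # → virginiaKeyFl
print(camel('Bar Harbor, ME'))    # → barHarborMe
```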
def get_units(variable, unit_system):
    """Return the desired units for 'variable'"""
    deg = u'\N{DEGREE SIGN}'
    unit_options = dict({
        'Air Temperature': {'metric': deg+'C', 'english': deg+'F'},
        'Barometric Pressure': {'metric': 'mb', 'english': 'mb'},
        'Wind Speed': {'metric': 'm/s', 'english': 'kn'},
        'Wind Gust': {'metric': 'm/s', 'english': 'kn'},
        'Wind Direction': {'metric': 'deg', 'english': 'deg'},
        'Water Temperature': {'metric': deg+'C', 'english': deg+'F'},
        'Water Level': {'metric': 'm', 'english': 'ft'}
    })
    return unit_options[variable][unit_system]
def format_date(datestr):
    """Format date strings into YYYYMMDD format"""
    dtdt = pd.to_datetime(datestr)
    return dt.strftime(dtdt, '%Y%m%d')
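The helper accepts anything pandas can parse and reduces it to YYYYMMDD; it is repeated here (with its imports) so the check is self-contained.

```python
from datetime import datetime as dt
import pandas as pd

def format_date(datestr):
    """Format date strings into YYYYMMDD format"""
    dtdt = pd.to_datetime(datestr)
    return dt.strftime(dtdt, '%Y%m%d')

print(format_date('2025-05-05'))         # → 20250505
print(format_date('May 5, 2025 12:00'))  # → 20250505
```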
Downloading data
def load_atemp(metadata, start_date, end_date, verbose=True):
    """Download air temperature data from NOAA CO-OPS between 'start_date'
    and 'end_date' for 'stationid', 'unit_system', and timezone 'tz'
    provided in 'metadata' dictionary.
    """
    if verbose:
        print('Retrieving air temperature data')
    station = Station(id=metadata['stationid'])
    if not start_date:
        start_date = format_date(station.data_inventory['Air Temperature']['start_date'])
    if not end_date:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    air_temp = station.get_data(
        begin_date=start_date,
        end_date=end_date,
        product='air_temperature',
        units=metadata['unit_system'],
        time_zone=metadata['tz'])
    air_temp.columns = ['atemp', 'atemp_flag']
    return air_temp
def load_water_temp(metadata, start_date, end_date, verbose=True):
    """Download water temperature data from NOAA CO-OPS between
    'start_date' and 'end_date' for 'stationid', 'unit_system', and
    timezone 'tz' provided in 'metadata' dictionary.
    """
    if verbose:
        print('Retrieving water temperature data')
    station = Station(id=metadata['stationid'])
    if not start_date:
        start_date = format_date(station.data_inventory['Water Temperature']['start_date'])
    if not end_date:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    water_temp = station.get_data(
        begin_date=start_date,
        end_date=end_date,
        product='water_temperature',
        units=metadata['unit_system'],
        time_zone=metadata['tz'])
    water_temp.columns = ['wtemp', 'wtemp_flag']
    return water_temp
def load_water_level(metadata, start_date, end_date, verbose=True):
    """Download water level data from NOAA CO-OPS between 'start_date' and
    'end_date' for 'stationid', 'unit_system', 'datum', and timezone 'tz'
    provided in 'metadata' dictionary.
    """
    if verbose:
        print('Retrieving water level tide data')
    station = Station(id=metadata['stationid'])
    if not start_date:
        start_date = format_date(station.data_inventory['Verified 6-Minute Water Level']['start_date'])
    if not end_date:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    water_levels = station.get_data(
        begin_date=start_date,
        end_date=end_date,
        product='water_level',
        datum=metadata['datum'],
        units=metadata['unit_system'],
        time_zone=metadata['tz'])
    water_levels.columns = ['wlevel', 's', 'wlevel_flag', 'wlevel_qc']
    return water_levels
def load_hourly_height(metadata, start_date, end_date, verbose=True):
    """Download verified hourly height data, the predecessor to the
    6-minute water level product, from NOAA CO-OPS from 'start_date'
    through 'end_date'.
    """
    if verbose:
        print('Retrieving hourly height data')
    station = Station(id=metadata['stationid'])
    hourly_heights = station.get_data(
        begin_date=start_date,
        end_date=end_date,
        product='hourly_height',
        datum=metadata['datum'],
        units=metadata['unit_system'],
        time_zone=metadata['tz'])
    hourly_heights.columns = ['wlevel', 's', 'wlevel_flag']
    # Add QC column for comparing to water level product
    hourly_heights['wlevel_qc'] = 'v'
    return hourly_heights
def download_data(metadata, start_date=None, end_date=None, verbose=True):
    """Download data from NOAA CO-OPS"""
    if verbose:
        print('Downloading data')

    # NOAA CO-OPS API
    station = Station(id=metadata['stationid'])

    # List of data variables to combine at the end
    datasets = []

    # If no 'end_date' is passed, download through end of current date
    if end_date is None:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    else:
        end_date = format_date(end_date)

    # Air temperature. Use a per-variable 'start' so 'start_date' stays
    # None when the caller wants a full download.
    if 'Air Temperature' in station.data_inventory:
        if start_date is None:
            start = format_date(station.data_inventory['Air Temperature']['start_date'])
        else:
            start = format_date(start_date)
        air_temp = load_atemp(metadata=metadata, start_date=start,
                              end_date=end_date, verbose=verbose)
        air_temp['atemp_flag'] = air_temp['atemp_flag'].str.split(',', expand=True).astype(int).sum(axis=1)
        air_temp.loc[air_temp['atemp_flag'] > 0, 'atemp'] = np.nan
        datasets.append(air_temp['atemp'])

    # Water temperature
    if 'Water Temperature' in station.data_inventory:
        if start_date is None:
            start = format_date(station.data_inventory['Water Temperature']['start_date'])
        else:
            start = format_date(start_date)
        water_temp = load_water_temp(metadata=metadata, start_date=start,
                                     end_date=end_date, verbose=verbose)
        water_temp['wtemp_flag'] = water_temp['wtemp_flag'].str.split(',', expand=True).astype(int).sum(axis=1)
        water_temp.loc[water_temp['wtemp_flag'] > 0, 'wtemp'] = np.nan
        datasets.append(water_temp['wtemp'])

    # Water level (tides)
    if 'Verified 6-Minute Water Level' in station.data_inventory:
        if start_date is None:
            start = format_date(station.data_inventory['Verified 6-Minute Water Level']['start_date'])
        else:
            start = format_date(start_date)
        water_levels = load_water_level(metadata=metadata, start_date=start,
                                        end_date=end_date, verbose=verbose)
        water_levels['wlevel_flag'] = water_levels['wlevel_flag'].str.split(',', expand=True).astype(int).sum(axis=1)
        water_levels.loc[water_levels['wlevel_flag'] > 0, 'wlevel'] = np.nan

        # Hourly water heights (historical product); only needed when
        # downloading the full record
        if start_date is None:
            if 'Verified Hourly Height Water Level' in station.data_inventory:
                start = format_date(station.data_inventory['Verified Hourly Height Water Level']['start_date'])
                end = format_date(water_levels.index[0] + pd.Timedelta(days=1))
                hourly_heights = load_hourly_height(metadata=metadata,
                                                    start_date=start, end_date=end, verbose=verbose)
                hourly_heights['wlevel_flag'] = \
                    hourly_heights['wlevel_flag'].str \
                    .split(',', expand=True).astype(int).sum(axis=1)
                hourly_heights.loc[hourly_heights['wlevel_flag'] > 0] = np.nan
                water_levels = pd.concat(
                    (hourly_heights[['wlevel', 'wlevel_flag', 'wlevel_qc']][:-1],
                     water_levels[['wlevel', 'wlevel_flag', 'wlevel_qc']]), axis=0
                )
                water_levels = water_levels[~water_levels.index.duplicated(keep='first')]
        datasets.append(water_levels[['wlevel', 'wlevel_qc']])

    # Merge into single dataframe and rename columns
    if verbose:
        print('Compiling data')
    newdata = pd.concat(datasets, axis=1)
    newdata.index.name = f'time_{metadata["tz"]}'
    newdata.columns = metadata['variables'] + ['Water Level QC']
    return newdata
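The QC logic inside download_data is the same for every variable: each observation carries a comma-separated string of flag digits, and any nonzero digit marks the value as suspect. A small, self-contained sketch with made-up values (not real CO-OPS output) shows the pattern:

```python
import numpy as np
import pandas as pd

# Toy data mimicking the CO-OPS output: a value column plus a
# comma-separated flag string ('0,0,0' means all checks passed)
df = pd.DataFrame({
    'atemp': [75.2, 76.1, 74.8],
    'atemp_flag': ['0,0,0', '0,1,0', '0,0,0'],
})

# Split the flag string into columns, convert to int, and sum;
# a sum > 0 means at least one QC check failed
df['atemp_flag'] = df['atemp_flag'].str.split(',', expand=True).astype(int).sum(axis=1)
df.loc[df['atemp_flag'] > 0, 'atemp'] = np.nan

print(df)  # the second row's temperature is now NaN
```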
Load / download data
Now it’s time to load the data. First, specify the station we want. This will be used to load saved data, or to download all data from a new station if we have not yet retrieved data from this particular station.
stationname is a custom human-readable “City, ST” string for the station, while stationid is the NOAA CO-OPS station ID number.
We also need to specify other settings such as the datum, timezone (tz), and unit system.
Finally, we indicate hr_threshold, the maximum number of hours of missing data allowed before a day is not counted in the records, and day_threshold, the maximum number of days of missing data allowed before a month is not counted.
stationname = 'Virginia Key, FL'
stationid = '8723214'
datum = 'MHHW'
tz = 'lst'
unit_system = 'english'
hr_threshold = 4
day_threshold = 2
Derive the name of the directory for this station’s data from the station name. This is where the data are or will be saved locally.
dirname = camel(stationname)
outdir = os.path.join(os.getcwd(), dirname)
outdir = f'../{dirname}'

print(f"Station folder: {dirname}")
print(f"Full directory: {outdir}")
Station folder: virginiaKeyFl
Full directory: ../virginiaKeyFl
Set a flag for printing status messages:
verbose = True
Let’s see if we already have data from this station saved locally. This will be true if a directory already exists for the station.
If the directory outdir
does not exist, then no data have been downloaded for this station, so we need to download everything through the present. This requires a few steps:
- Create outdir
- Download the data and record the timestamp of the last observation for each variable in the metadata. This will be used later when updating the data.
- Write the data and metadata to file.
On the other hand, if data already exist locally, we will load it from file and download new data we do not yet have:
- Load the data and metadata from file
- Retrieve new data
- Combine the new data with the existing data, update the ‘last_obs’ metadata entries, and write the data and metadata to file
The noaa-coops tool only accepts dates without times, so it is possible to download data we already have. We therefore have to check what we download against what we already have to avoid duplicating data.
The most likely (and perhaps only) scenario for encountering duplicated data is if the data we have for the most recent day are incomplete. For example, assume today is May 5, 2025 and we download data at noon. Also assume the start date is some earlier day, the last time we retrieved data; this will be automatically determined from the metadata. Specifying an end date of 2025-05-05 will retrieve all data available through noon on May 5. In this case, we do not yet have these data, so we concatenate what we do not have to what we do have. However, if we then run the download function again (say, for diagnostic purposes) with the new start date of 2025-05-01 and the end date 2025-05-05, it will again download the data through noon on May 5. But since we already have those data, we do not want to re-concatenate them.
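The concatenate-then-filter step described above can be sketched with two toy frames whose indices overlap (timestamps invented for illustration): only rows whose timestamps are not already present get appended.

```python
import pandas as pd

# Existing record ends at 10:12; the fresh download re-covers 10:06-10:18
data = pd.DataFrame({'wtemp': [88.0, 88.1, 88.2]},
                    index=pd.to_datetime(['2025-05-05 10:00',
                                          '2025-05-05 10:06',
                                          '2025-05-05 10:12']))
newdata = pd.DataFrame({'wtemp': [88.1, 88.2, 88.3]},
                       index=pd.to_datetime(['2025-05-05 10:06',
                                             '2025-05-05 10:12',
                                             '2025-05-05 10:18']))

# Keep only the timestamps we do not already have, then append
fresh = newdata[~newdata.index.isin(data.index)]
data = pd.concat([data, fresh], axis=0)

print(len(data))  # → 4, with no duplicated timestamps
```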
This cell may take several seconds or minutes to run, depending on how much data are downloaded.
if not os.path.exists(outdir):
    if verbose:
        print('Creating new directory for this station.')
    os.makedirs(outdir)

    # Build the metadata dictionary
    meta = dict({
        'stationname': stationname,
        'stationid': stationid,
        'dirname': dirname,
        'unit_system': unit_system,
        'tz': tz,
        'datum': datum,
        'hr_threshold': hr_threshold,
        'day_threshold': day_threshold,
        'variables': ['Air Temperature', 'Water Temperature', 'Water Level']})
    meta['units'] = [get_units(var, meta['unit_system']) for var in meta['variables']]

    # Download all data (set start and end date to None to get all data)
    if verbose:
        print('Downloading all data for this station.')
    data = download_data(metadata=meta, start_date=None, end_date=None)
    outFile = os.path.join(outdir, 'observational_data_record.csv.gz')
    data.to_csv(outFile, compression='infer')
    meta['last_obs'] = {i: data[i].last_valid_index().strftime('%Y-%m-%d %X')
                        for i in meta['variables']}

    # Save the metadata, including the 'last_obs' entries, to file
    with open(os.path.join(outdir, 'metadata.yml'), 'w') as fp:
        yaml.dump(meta, fp)
    print(f"Observational data written to file '{outFile}'.")
else:
    # Load the metadata
    if verbose:
        print('Loading metadata from file')
    with open(os.path.join(outdir, 'metadata.yml')) as m:
        meta = yaml.safe_load(m)

    # Load the historical data
    if verbose:
        print('Loading historical data from file')
    dataInFile = os.path.join(outdir, 'observational_data_record.csv.gz')
    dtypeDict = {k: float for k in meta['variables']}
    dtypeDict['Water Level QC'] = str
    data = pd.read_csv(dataInFile, index_col=f'time_{meta["tz"]}', parse_dates=True,
                       compression='infer', dtype=dtypeDict)
    start_date = format_date(data.index.max())

    # Retrieve new data
    newdata = download_data(metadata=meta, start_date=start_date)
    if sum(~newdata.index.isin(data.index)) == 0:
        print('No new data available.')
    else:
        newdata.index.name = f"time_{meta['tz']}"
        newdata.columns = data.columns
        data = pd.concat([data,
                          newdata[~newdata.index.isin(data.index)]], axis=0)
        data.to_csv(dataInFile, compression='infer')
        meta['last_obs'] = {i: data[i].last_valid_index().strftime('%Y-%m-%d %X')
                            for i in meta['variables']}

        # Update the metadata file with the new 'last_obs' entries
        with open(os.path.join(outdir, 'metadata.yml'), 'w') as fp:
            yaml.dump(meta, fp)
        print("Updated observational data written to file 'observational_data_record.csv.gz'.")
Loading metadata from file
Loading historical data from file
Downloading data
Retrieving air temperature data
Retrieving water temperature data
Retrieving water level tide data
Compiling data
Updated observational data written to file 'observational_data_record.csv.gz'.
Check the data and metadata for sanity:
data
 | Air Temperature | Water Temperature | Water Level | Water Level QC |
---|---|---|---|---|
time_lst | | | | |
1994-01-28 00:00:00 | NaN | NaN | NaN | NaN |
1994-01-28 00:06:00 | NaN | NaN | NaN | NaN |
1994-01-28 00:12:00 | NaN | NaN | NaN | NaN |
1994-01-28 00:18:00 | NaN | NaN | NaN | NaN |
1994-01-28 00:24:00 | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... |
2025-05-30 19:36:00 | 86.7 | 88.2 | -1.657 | p |
2025-05-30 19:42:00 | 86.5 | 88.2 | NaN | p |
2025-05-30 19:48:00 | 86.5 | 88.2 | -1.552 | p |
2025-05-30 19:54:00 | 86.4 | 88.2 | NaN | p |
2025-05-30 20:00:00 | 86.4 | 88.2 | NaN | p |
2745205 rows × 4 columns
meta
{'datum': 'MHHW',
'day_threshold': 2,
'dirname': 'virginiaKeyFl',
'hr_threshold': 4,
'stationid': '8723214',
'stationname': 'Virginia Key, FL',
'tz': 'lst',
'unit_system': 'english',
'units': {'Air Temperature': '°F',
'Water Level': 'ft',
'Water Temperature': '°F'},
'variables': ['Air Temperature', 'Water Temperature', 'Water Level'],
'last_obs': {'Air Temperature': '2025-05-30 20:00:00',
'Water Temperature': '2025-05-30 20:00:00',
'Water Level': '2025-05-30 19:48:00'}}
len(data.index.unique()) == data.shape[0]
True
The ‘last_obs’ metadata values match the last observations in the data record and correspond to the most recently available observations. Also, every observation time is unique, so there are no duplicated entries. Everything checks out.
In the next part, we will filter these data and calculate statistics and records.