included change requests from leifdenby:
 - removed linting dependencies
 - minor changes to test file
 - added notebook outlining generation of meps_example_reduced from meps_example
SimonKamuk committed May 28, 2024
1 parent 9352949 commit 4995de0
Showing 3 changed files with 252 additions and 15 deletions.
237 changes: 237 additions & 0 deletions DEVELOPING.ipynb
@@ -0,0 +1,237 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Creating meps_example_reduced\n",
"This notebook outlines how the small test dataset meps_example_reduced was created from the larger dataset meps_example. The zipped datasets are 263 MB and 2.6 GB, respectively.\n",
"\n",
"The dataset was reduced in size by reducing the number of grid points and variables.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Standard library\n",
"import os\n",
"\n",
"# Third-party\n",
"import numpy as np\n",
"import torch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"The number of grid points was reduced to 1/4 by halving the number of coordinates in both the x and y direction. This was done by removing a quarter of the grid points along each outer edge, so the center grid points would stay centered in the new set.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load existing grid\n",
"grid_xy = np.load('data/meps_example/static/nwp_xy.npy')\n",
"# Get slices in each dimension by cutting off a quarter along each edge\n",
"num_x, num_y = grid_xy.shape[1:]\n",
"x_slice = slice(num_x//4, 3*num_x//4)\n",
"y_slice = slice(num_y//4, 3*num_y//4)\n",
"# Index and save reduced grid\n",
"grid_xy_reduced = grid_xy[:, x_slice, y_slice]\n",
"np.save('data/meps_example_reduced/static/nwp_xy.npy', grid_xy_reduced)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"This cut out the border, so a new perimeter of 10 grid points was established as border (10 was also the border size in the original \"meps_example\").\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Outer 10 grid points are border\n",
"old_border_mask = np.load('data/meps_example/static/border_mask.npy')\n",
"assert np.all(old_border_mask[10:-10, 10:-10] == False)\n",
"assert np.all(old_border_mask[:10, :] == True)\n",
"assert np.all(old_border_mask[:, :10] == True)\n",
"assert np.all(old_border_mask[-10:,:] == True)\n",
"assert np.all(old_border_mask[:,-10:] == True)\n",
"\n",
"# Create new array with False everywhere but the outer 10 grid points\n",
"border_mask = np.zeros_like(grid_xy_reduced[0,:,:], dtype=bool)\n",
"border_mask[:10] = True\n",
"border_mask[:,:10] = True\n",
"border_mask[-10:] = True\n",
"border_mask[:,-10:] = True\n",
"np.save('data/meps_example_reduced/static/border_mask.npy', border_mask)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A few other files also needed to be copied, keeping only the values on the new reduced grid."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load surface_geopotential.npy, index only values from the reduced grid, and save to new file\n",
"surface_geopotential = np.load('data/meps_example/static/surface_geopotential.npy')\n",
"surface_geopotential_reduced = surface_geopotential[x_slice, y_slice]\n",
"np.save('data/meps_example_reduced/static/surface_geopotential.npy', surface_geopotential_reduced)\n",
"\n",
"# Load pytorch file grid_features.pt\n",
"grid_features = torch.load('data/meps_example/static/grid_features.pt')\n",
"# Index only values from the reduced grid. \n",
"# First reshape from (num_grid_points_total, 4) to (num_grid_points_x, num_grid_points_y, 4), \n",
"# then index, then reshape back to new total number of grid points\n",
"print(grid_features.shape)\n",
"grid_features_new = grid_features.reshape(num_x, num_y, 4)[x_slice,y_slice,:].reshape((-1, 4))\n",
"# Save to new file\n",
"torch.save(grid_features_new, 'data/meps_example_reduced/static/grid_features.pt')\n",
"\n",
"# flux_stats.pt is just a vector of length 2, so the grid shape and variable changes do not affect this file\n",
"torch.save(torch.load('data/meps_example/static/flux_stats.pt'), 'data/meps_example_reduced/static/flux_stats.pt')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"The number of variables was reduced by truncating the variable list to the first 8."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"num_vars = 8\n",
"\n",
"# Load parameter_weights.npy, truncate to first 8 variables, and save to new file\n",
"parameter_weights = np.load('data/meps_example/static/parameter_weights.npy')\n",
"parameter_weights_reduced = parameter_weights[:num_vars]\n",
"np.save('data/meps_example_reduced/static/parameter_weights.npy', parameter_weights_reduced)\n",
"\n",
"# Do the same for following 4 pytorch files\n",
"for file in ['diff_mean', 'diff_std', 'parameter_mean', 'parameter_std']:\n",
" old_file = torch.load(f'data/meps_example/static/{file}.pt')\n",
" new_file = old_file[:num_vars]\n",
" torch.save(new_file, f'data/meps_example_reduced/static/{file}.pt')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, the files in each of the directories train, test, and val have to be reduced. The folders all have the same structure with files of the following types:\n",
"```\n",
"nwp_YYYYMMDDHH_mbrXXX.npy\n",
"wtr_YYYYMMDDHH.npy\n",
"nwp_toa_downwelling_shortwave_flux_YYYYMMDDHH.npy\n",
"```\n",
"with ```YYYYMMDDHH``` being some date with hours, and ```XXX``` being some 3-digit integer.\n",
"\n",
"The first type of file has x and y in dimensions 1 and 2, and variable index in dimension 3. Dimension 0 is unchanged.\n",
"The second type has just x and y as its only 2 dimensions.\n",
"The last type has x and y in dimensions 1 and 2. Dimension 0 is unchanged.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(65, 268, 238, 18)\n",
"(65, 268, 238)\n"
]
}
],
"source": [
"print(np.load('data/meps_example/samples/train/nwp_2022040100_mbr000.npy').shape)\n",
"print(np.load('data/meps_example/samples/train/nwp_toa_downwelling_shortwave_flux_2022040112.npy').shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following loop goes through each file in each sample folder and indexes them according to the dimensions given by the file name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for sample in ['train', 'test', 'val']:\n",
" files = os.listdir(f'data/meps_example/samples/{sample}')\n",
"\n",
" for f in files:\n",
" data = np.load(f'data/meps_example/samples/{sample}/{f}')\n",
" if 'mbr' in f:\n",
" data = data[:,x_slice,y_slice,:num_vars]\n",
" elif 'wtr' in f:\n",
" data = data[x_slice, y_slice]\n",
" else:\n",
" data = data[:,x_slice,y_slice]\n",
" np.save(f'data/meps_example_reduced/samples/{sample}/{f}', data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, the file ```data_config.yaml``` is modified manually by truncating the variable units, long and short names, and setting the new grid shape. Also, unit descriptions containing ```^``` were automatically parsed as LaTeX; to avoid having to install LaTeX in the GitHub CI/CD pipeline, ```^``` was changed to ```**```."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
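The slicing logic in the notebook above can be checked without the data files. The following is a minimal sketch, not part of the commit: `center_half_slice` and `sample_index` are hypothetical helper names, and the grid shape (268, 238) is taken from the shapes printed in the notebook. It uses `range` objects in place of arrays, so it only verifies lengths and index tuples.

```python
def center_half_slice(n: int) -> slice:
    """Keep the middle half of an axis of length n (drop a quarter at each end)."""
    return slice(n // 4, 3 * n // 4)


def sample_index(filename: str, x_slice: slice, y_slice: slice, num_vars: int) -> tuple:
    """Index tuple for a sample array, chosen from the file name as in the notebook loop.

    Hypothetical helper: the notebook applies these indices inline.
    """
    if "mbr" in filename:
        # nwp_YYYYMMDDHH_mbrXXX.npy: (dim 0, x, y, variable)
        return (slice(None), x_slice, y_slice, slice(None, num_vars))
    if "wtr" in filename:
        # wtr_YYYYMMDDHH.npy: (x, y)
        return (x_slice, y_slice)
    # nwp_toa_downwelling_shortwave_flux_YYYYMMDDHH.npy: (dim 0, x, y)
    return (slice(None), x_slice, y_slice)


# Grid shape of meps_example, read off the shapes printed in the notebook
num_x, num_y = 268, 238
x_slice, y_slice = center_half_slice(num_x), center_half_slice(num_y)

# range(n)[s] has the same length as an array axis of length n sliced with s
reduced_shape = (len(range(num_x)[x_slice]), len(range(num_y)[y_slice]))
print(reduced_shape)
print(sample_index("nwp_2022040100_mbr000.npy", x_slice, y_slice, num_vars=8))
```

For the (268, 238) grid the reduced shape is (134, 119), which is exactly a quarter of the original number of grid points.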
5 changes: 0 additions & 5 deletions requirements.txt
@@ -18,10 +18,5 @@ dask>=2024.4.2

# for dev
pre-commit>=2.15.0
codespell>=2.0.0
black>=21.9b0
isort>=5.9.3
flake8>=4.0.1
pylint>=3.0.3
pytest>=8.1.1
pooch>=1.8.1
25 changes: 15 additions & 10 deletions tests/test_mllam_dataset.py
@@ -11,29 +11,34 @@
from neural_lam.weather_dataset import WeatherDataset
from train_model import main as train_model

# Disable weights and biases to avoid unnecessary logging
# and to avoid having to deal with authentication
os.environ["WANDB_DISABLED"] = "true"

# Initializing variables for the s3 client
S3_BUCKET_NAME = "mllam-testdata"
S3_ENDPOINT_URL = "https://object-store.os-api.cci1.ecmwf.int"
S3_FILE_PATH = "neural-lam/npy/meps_example_reduced.v0.1.0.zip"
S3_FULL_PATH = "/".join([S3_ENDPOINT_URL, S3_BUCKET_NAME, S3_FILE_PATH])
TEST_DATA_KNOWN_HASH = (
"98c7a2f442922de40c6891fe3e5d190346889d6e0e97550170a82a7ce58a72b7"
)

def test_retrieve_data_ewc():
# Initializing variables for the client
S3_BUCKET_NAME = "mllam-testdata"
S3_ENDPOINT_URL = "https://object-store.os-api.cci1.ecmwf.int"
S3_FILE_PATH = "neural-lam/npy/meps_example_reduced.v0.1.0.zip"
S3_FULL_PATH = "/".join([S3_ENDPOINT_URL, S3_BUCKET_NAME, S3_FILE_PATH])
known_hash = (
"98c7a2f442922de40c6891fe3e5d190346889d6e0e97550170a82a7ce58a72b7"
)

def test_retrieve_data_ewc():
# Download and unzip test data into data/meps_example_reduced
pooch.retrieve(
url=S3_FULL_PATH,
known_hash=known_hash,
known_hash=TEST_DATA_KNOWN_HASH,
processor=pooch.Unzip(extract_dir=""),
path="data",
fname="meps_example_reduced.zip",
)


def test_load_reduced_meps_dataset():
# The data_config.yaml file is downloaded and extracted in
# test_retrieve_data_ewc together with the dataset itself
data_config_file = "data/meps_example_reduced/data_config.yaml"
dataset_name = "meps_example_reduced"
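As a rough illustration of what the `known_hash` argument in the test protects against, the sketch below checks a file against a SHA-256 digest using only the standard library. It assumes the hash string is a bare SHA-256 hex digest (pooch's documented default when no algorithm prefix is given). `sha256_of_file` and `matches_known_hash` are hypothetical names, not pooch functions, and the demo uses a temporary file rather than the real zip archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_known_hash(path: Path, known_hash: str) -> bool:
    """True if the file's SHA-256 digest equals the expected hex string."""
    return sha256_of_file(path) == known_hash.lower()


# Demo on a throwaway file (a stand-in for the downloaded archive)
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"abc")
    demo_ok = matches_known_hash(demo, hashlib.sha256(b"abc").hexdigest())
    demo_bad = matches_known_hash(demo, "0" * 64)

print(demo_ok, demo_bad)
```

A mismatch indicates that the downloaded file differs from the one the hash was computed for, for example after the archive on the server has been replaced.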
