Added data conversion scripts. #272

Draft
wants to merge 17 commits into base: main
90 changes: 90 additions & 0 deletions docs/data.md
@@ -0,0 +1,90 @@
# Data Formats

Inside HydraGNN, we use [torch_geometric.Data](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.data.Data.html) to store all graph properties.

Node-, edge-, and graph-level attributes are split into
separate variables as follows,

    Data.x: node-level with shape [num_nodes, num_node_features]
    Data.y: graph-level with shape [num_graph_features]
    Data.edge_attr: edge-level with shape [num_edges, num_edge_features]

We also store node coordinates and graph connectivity,

    Data.pos: node-level coordinates with shape [num_nodes, num_dimensions]
    Data.edge_index: node indices for each edge with shape [2, num_edges]
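
As a minimal sketch (shapes and values below are made up; the conversion
scripts fill these tensors from the real datasets), a single graph could be
assembled as,

    import torch
    from torch_geometric.data import Data

    num_nodes, num_edges = 4, 6
    data = Data(
        x=torch.randn(num_nodes, 3),                              # node features
        pos=torch.randn(num_nodes, 3),                            # 3D coordinates
        edge_index=torch.randint(0, num_nodes, (2, num_edges)),   # edge connectivity
        edge_attr=torch.randn(num_edges, 1),                      # edge features
        y=torch.randn(2),                                         # graph-level targets
    )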

## Adios Storage Format

The Adios data file keeps these names and shapes,
but concatenates each field over all graphs.
The companion fields `Z.variable_count` and `Z.variable_offset`
(where `Z` is `x`, `pos`, `y`, or `edge_attr`) record the count and offset
needed to extract the data belonging to a specific graph.
For example (using 0-based indexing),
the starting node index for graph 0 is,

    dataset.x.variable_offset[0]

while the number of atoms in graph 0 is,

    dataset.x.variable_count[0]

To retrieve edge features for graph 20, we would use,

    off = dataset.edge_attr.variable_offset[20]
    count = dataset.edge_attr.variable_count[20]
    dataset.edge_attr[off:off+count]

Note: Since these offsets and counts index graphs,
it would be more descriptive to have called these
`graph_offset` and `graph_count`.
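
The same offset/count pattern applies to every concatenated field. As a
rough Python sketch only (the toy arrays below stand in for the real Adios
data, and the helper name is ours, not part of HydraGNN's API):

    import numpy as np

    def slice_graph(concat, offsets, counts, i):
        """Return the rows of a concatenated field that belong to graph i."""
        return concat[offsets[i] : offsets[i] + counts[i]]

    # Toy stand-ins for the concatenated arrays: 2 graphs with 3 and 2 nodes.
    x_all = np.arange(5 * 4).reshape(5, 4)   # 5 nodes total, 4 features each
    x_count = np.array([3, 2])
    x_offset = np.array([0, 3])

    x_graph1 = slice_graph(x_all, x_offset, x_count, 1)   # the 2 nodes of graph 1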

### Metadata

To describe the names and dimensions of
individual features, the Adios data file contains
supplementary fields `x_name`, `y_name`, and `edge_attr_name`.
Each of these also carries `feature_count` and `feature_offset`,
which index into the feature dimension of `x`, `y`, and `edge_attr`,
respectively.

So, for example, the *qm7x* dataset has:

    x_name = ["atomic_number", "pos", "forces", "charge", "dipole", "volume_ratio"]

    x_name.feature_count = [1, 3, 3, 1, 3, 1]
    x_name.feature_offset = [0, 1, 4, 7, 8, 11]
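
Given this metadata, an individual named feature can be sliced out of the
concatenated feature dimension. A minimal sketch using NumPy and the qm7x
metadata above (the toy `x` array and helper name are made up for
illustration):

    import numpy as np

    def get_feature(x, names, feature_offset, feature_count, name):
        """Slice the columns of x that belong to one named feature."""
        i = list(names).index(name)
        start = feature_offset[i]
        return x[:, start : start + feature_count[i]]

    # qm7x feature layout from above, with a toy x of 12 feature columns.
    x_name = ["atomic_number", "pos", "forces", "charge", "dipole", "volume_ratio"]
    feature_count = [1, 3, 3, 1, 3, 1]
    feature_offset = [0, 1, 4, 7, 8, 11]
    x = np.random.rand(10, 12)   # 10 nodes, 12 feature columns in total

    forces = get_feature(x, x_name, feature_offset, feature_count, "forces")  # shape (10, 3)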

## Selecting Features in the Model Configuration File

When specifying a model, we list the node-, edge-, and graph-level
features to use as inputs and to predict as outputs:

"inputs": {
"node": ["atomic_number", "pos"],
"edge": ["length"],
"graph": []
},

"outputs": [
{ "type": "graph",
"layer_dims": [100, 100],
"features": ["energy"],
"loss": "mse"
},
{ "type": "node",
"layer_dims": [128],
"features": ["charge", "volume_ratio"],
"loss": "mse"
}
]

"loss": {
"type": "sum",
"weights": [0.001, 1.0]
}

Note that each output head forms its own list element
and has its own loss function. The combined total loss
is specified by the "loss" element.
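
For illustration only (not HydraGNN's actual training code), the weighted
combination behaves roughly like this sketch, where the per-head losses
appear in the same order as the "outputs" list:

    def total_loss(head_losses, weights):
        """Weighted sum of the individual output-head losses."""
        return sum(w * l for w, l in zip(weights, head_losses))

    # Toy numbers in place of the real per-head MSE values:
    loss = total_loss([2.5, 0.04], weights=[0.001, 1.0])   # = 0.0425
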
6 changes: 5 additions & 1 deletion examples/LennardJones/LennardJones.py
@@ -268,7 +268,11 @@
os.environ["HYDRAGNN_AGGR_BACKEND"] = "mpi"
os.environ["HYDRAGNN_USE_ddstore"] = "1"

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)

6 changes: 5 additions & 1 deletion examples/alexandria/train.py
@@ -503,7 +503,11 @@ def get(self, idx):
os.environ["HYDRAGNN_AGGR_BACKEND"] = "mpi"
os.environ["HYDRAGNN_USE_ddstore"] = "1"

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)

6 changes: 5 additions & 1 deletion examples/ani1_x/train.py
@@ -385,7 +385,11 @@ def get(self, idx):
os.environ["HYDRAGNN_AGGR_BACKEND"] = "mpi"
os.environ["HYDRAGNN_USE_ddstore"] = "1"

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)

6 changes: 5 additions & 1 deletion examples/csce/train_gap.py
@@ -380,7 +380,11 @@ def __getitem__(self, idx):
os.environ["HYDRAGNN_AGGR_BACKEND"] = "mpi"
os.environ["HYDRAGNN_USE_ddstore"] = "1"

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)
comm.Barrier()
6 changes: 5 additions & 1 deletion examples/dftb_uv_spectrum/train_discrete_uv_spectrum.py
@@ -318,7 +318,11 @@ def get(self, idx):
os.environ["HYDRAGNN_AGGR_BACKEND"] = "mpi"
os.environ["HYDRAGNN_USE_ddstore"] = "1"

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)

6 changes: 5 additions & 1 deletion examples/dftb_uv_spectrum/train_smooth_uv_spectrum.py
@@ -318,7 +318,11 @@ def get(self, idx):
os.environ["HYDRAGNN_AGGR_BACKEND"] = "mpi"
os.environ["HYDRAGNN_USE_ddstore"] = "1"

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)

6 changes: 5 additions & 1 deletion examples/eam/eam.py
@@ -165,7 +165,11 @@ def info(*args, logtype="info", sep=" "):
% (len(trainset), len(valset), len(testset))
)

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)
timer.stop()
8 changes: 4 additions & 4 deletions examples/ising_model/create_configurations.py
@@ -40,7 +40,7 @@ def E_dimensionless(config, L, spin_function, scale_spin):
spin[x, y, z] = spin_function(config[x, y, z])

count_pos = 0
number_nodes = L ** 3
number_nodes = L**3
positions = np.zeros((number_nodes, 3))
atomic_features = np.zeros((number_nodes, 5))
for x in range(L):
@@ -79,15 +79,15 @@ def create_dataset(

count_config = 0

for num_downs in tqdm(range(0, L ** 3)):
for num_downs in tqdm(range(0, L**3)):

primal_configuration = np.ones((L ** 3,))
primal_configuration = np.ones((L**3,))
for down in range(0, num_downs):
primal_configuration[down] = -1.0

# If the current composition has a total number of possible configurations above
# the hard cutoff threshold, a random configurational subset is picked
if scipy.special.binom(L ** 3, num_downs) > histogram_cutoff:
if scipy.special.binom(L**3, num_downs) > histogram_cutoff:
for num_config in range(0, histogram_cutoff):
config = np.random.permutation(primal_configuration)
config = np.reshape(config, (L, L, L))
12 changes: 8 additions & 4 deletions examples/ising_model/train_ising.py
@@ -83,7 +83,7 @@ def create_dataset_mpi(
np.random.seed(seed)

count_config = 0
rx = list(nsplit(range(0, L ** 3), comm_size))[rank]
rx = list(nsplit(range(0, L**3), comm_size))[rank]
info("rx", rx.start, rx.stop)
for num_downs in range(rx.start, rx.stop):
subdir = os.path.join(dir, str(num_downs))
@@ -95,13 +95,13 @@
prefix = "output_%d_" % num_downs
subdir = os.path.join(dir, str(num_downs))

primal_configuration = np.ones((L ** 3,))
primal_configuration = np.ones((L**3,))
for down in range(0, num_downs):
primal_configuration[down] = -1.0

# If the current composition has a total number of possible configurations above
# the hard cutoff threshold, a random configurational subset is picked
if scipy.special.binom(L ** 3, num_downs) > histogram_cutoff:
if scipy.special.binom(L**3, num_downs) > histogram_cutoff:
for num_config in range(0, histogram_cutoff):
config = np.random.permutation(primal_configuration)
config = np.reshape(config, (L, L, L))
@@ -352,7 +352,11 @@ def info(*args, logtype="info", sep=" "):
os.environ["HYDRAGNN_AGGR_BACKEND"] = "mpi"
os.environ["HYDRAGNN_USE_ddstore"] = "1"

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)
timer.stop()
6 changes: 5 additions & 1 deletion examples/lsms/lsms.py
@@ -167,7 +167,11 @@ def info(*args, logtype="info", sep=" "):
% (len(trainset), len(valset), len(testset))
)

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)
timer.stop()
7 changes: 6 additions & 1 deletion examples/md17/md17.py
@@ -16,6 +16,7 @@

import hydragnn


# Update each sample prior to loading.
def md17_pre_transform(data):
# Set descriptor as element type.
@@ -72,7 +73,11 @@ def md17_pre_filter(data):
train, val, test = hydragnn.preprocess.split_dataset(
dataset, config["NeuralNetwork"]["Training"]["perc_train"], False
)
(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
train, val, test, config["NeuralNetwork"]["Training"]["batch_size"]
)

6 changes: 5 additions & 1 deletion examples/mptrj/train.py
@@ -400,7 +400,11 @@ def get(self, idx):
os.environ["HYDRAGNN_AGGR_BACKEND"] = "mpi"
os.environ["HYDRAGNN_USE_ddstore"] = "1"

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)

6 changes: 5 additions & 1 deletion examples/multidataset/train.py
@@ -338,7 +338,11 @@ def info(*args, logtype="info", sep=" "):
os.environ["HYDRAGNN_AGGR_BACKEND"] = "mpi"
os.environ["HYDRAGNN_USE_ddstore"] = "1"

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset,
valset,
testset,
6 changes: 5 additions & 1 deletion examples/multidataset_hpo/gfm.py
@@ -357,7 +357,11 @@ def main():
os.environ["HYDRAGNN_AGGR_BACKEND"] = "mpi"
os.environ["HYDRAGNN_USE_ddstore"] = "1"

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)

6 changes: 5 additions & 1 deletion examples/ogb/train_gap.py
@@ -441,7 +441,11 @@ def __getitem__(self, idx):
% (len(trainset), len(valset), len(testset))
)

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)

6 changes: 5 additions & 1 deletion examples/open_catalyst_2020/train.py
@@ -356,7 +356,11 @@ def get(self, idx):
os.environ["HYDRAGNN_AGGR_BACKEND"] = "mpi"
os.environ["HYDRAGNN_USE_ddstore"] = "1"

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)

6 changes: 5 additions & 1 deletion examples/open_catalyst_2022/train.py
@@ -416,7 +416,11 @@ def get(self, idx):
os.environ["HYDRAGNN_AGGR_BACKEND"] = "mpi"
os.environ["HYDRAGNN_USE_ddstore"] = "1"

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)

6 changes: 5 additions & 1 deletion examples/qm7x/train.py
@@ -447,7 +447,11 @@ def get(self, idx):
os.environ["HYDRAGNN_AGGR_BACKEND"] = "mpi"
os.environ["HYDRAGNN_USE_ddstore"] = "1"

(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
trainset, valset, testset, config["NeuralNetwork"]["Training"]["batch_size"]
)

7 changes: 6 additions & 1 deletion examples/qm9/qm9.py
@@ -16,6 +16,7 @@

import hydragnn


# Update each sample prior to loading.
def qm9_pre_transform(data):
# Set descriptor as element type.
@@ -63,7 +64,11 @@ def qm9_pre_filter(data):
train, val, test = hydragnn.preprocess.split_dataset(
dataset, config["NeuralNetwork"]["Training"]["perc_train"], False
)
(train_loader, val_loader, test_loader,) = hydragnn.preprocess.create_dataloaders(
(
train_loader,
val_loader,
test_loader,
) = hydragnn.preprocess.create_dataloaders(
train, val, test, config["NeuralNetwork"]["Training"]["batch_size"]
)
