Pretraining Foundation Models:
Unleashing the Power of Forgotten Spectra for Advanced Geological Applications

X-ray fluorescence (XRF) core scanning is renowned for its high-resolution, non-destructive, and user-friendly operation. Despite the extensive applications of XRF data, universally quantifying these data into specific geological proxies remains challenging due to their inherent non-linearity and project-scale limitations.

Our study aims to address these challenges by harnessing two interdisciplinary advancements:

  1. The vast amount of XRF spectra acquired from a series of scientific drilling programs
  2. More powerful training schemes and model architectures inspired by the success of large language models (LLMs)

We propose a pretraining-finetuning framework that leverages the vast amount of XRF spectra to pretrain a foundation model. Masked Spectrum Modeling (MSM), our pretraining objective, is adapted from BERT, ViT, and MAE. It is designed to let the foundation model learn the underlying patterns and relationships in XRF spectra, which can then be transferred to downstream tasks. Pretraining is followed by fine-tuning on specific geological proxies to adapt the model to the target tasks. Hence, downstream fine-tuning does not necessarily require a large amount of labeled data, in contrast to the conventional approach of training a model from scratch for each project.
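
The gist of MSM can be illustrated with a short sketch. This is a minimal, hypothetical example assuming an MAE-style setup (patchify the 1-D spectrum, mask a random subset of patches, reconstruct the masked counts); the patch size, mask ratio, and tiny encoder are illustrative choices, not the values used in this repository:

```python
import torch
import torch.nn as nn

PATCH = 16         # channels per patch (assumed)
MASK_RATIO = 0.75  # fraction of patches masked, following MAE (assumed)

class TinyMSM(nn.Module):
    """Toy masked-spectrum model: embed patches, mask some, reconstruct."""

    def __init__(self, n_channels: int = 2048, dim: int = 64):
        super().__init__()
        self.n_patches = n_channels // PATCH
        self.embed = nn.Linear(PATCH, dim)              # patch embedding
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # positional embeddings omitted for brevity
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, PATCH)               # reconstruct counts

    def forward(self, spectra: torch.Tensor) -> torch.Tensor:
        # spectra: (batch, n_channels) raw XRF counts
        x = spectra.view(spectra.size(0), self.n_patches, PATCH)
        tokens = self.embed(x)
        # replace a random subset of patch tokens with the mask token
        masked = torch.rand(x.size(0), self.n_patches,
                            device=x.device) < MASK_RATIO
        tokens = torch.where(masked.unsqueeze(-1),
                             self.mask_token.expand_as(tokens), tokens)
        recon = self.head(self.encoder(tokens))
        # reconstruction loss on masked patches only, as in MAE
        return ((recon - x) ** 2)[masked].mean()

loss = TinyMSM()(torch.rand(8, 2048))  # loss for one pretraining step
loss.backward()
```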

This work was presented as a poster at EGU General Assembly 2024:

Lee, A.-S., Lin, H.-T., and Liou, S. Y. H.: Pretraining Foundation Models: Unleashing the Power of Forgotten Spectra for Advanced Geological Applications, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-4956, https://doi.org/10.5194/egusphere-egu24-4956, 2024.

Environment setup

Docker container

We adopt the container template, cuda118, from https://github.com/dispink/docker-example.

Versions of the main packages

  • Python 3.11
  • CUDA 11.8
  • cuDNN 8.6.0
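
A quick way to confirm the container exposes the expected toolchain (this assumes PyTorch is installed in the image, which is not stated above; adapt to the project's actual framework):

```python
import sys
import torch

print(sys.version)                     # expect 3.11.x
print(torch.version.cuda)              # expect '11.8'
print(torch.backends.cudnn.version())  # expect 8600 for cuDNN 8.6.0
print(torch.cuda.is_available())       # True if the GPU is visible
```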

Folder structure

  1. .devcontainer: Contain the configuration files for the Docker container, which is compatible with the VS Code Dev Containers extension.

  2. data: Contain all the data used in the project. It is further divided into subfolders:

    • raw: Raw spectra in the Avaatech XRF Core Scanner format. Each subfolder contains the raw data for a core series.
    • legacy: Previously compiled and raw data (Lee et al., 2022).
    • pretrain: Data used for pre-training, built from the previously compiled spectra data spe_dataset_20220629.csv. It is further divided into the subfolder structure below.
        +- train
            +- spe
            +- info.csv
        +- test
            (same as in train)
    
    • fine-tune: Data used for fine-tuning. It is further divided into the subfolder structure below.
        +- CaCO3   
            +- train
                +- spe
                +- target
                +- info.csv
            +- test
                (same as in train)
        +- TOC
            (same as in CaCO3)
    

    There is no validation set in these folders because it is randomly sampled during training. The test set is composed of three cores ('PS75-056-1', 'LV28-44-3', 'SO264-69-2') isolated from the beginning and used in neither pre-training nor fine-tuning. The model should be tested only at the very last step of the project; testing earlier may introduce data leakage and over-estimate the model's generalization ability. The data are built by src/datas/build_data.py; a hypothetical loading sketch is given after this list.

  3. notebooks: Collect Jupyter notebooks for experimentation, analysis, and model development.

  4. configs: Store configuration files or parameters used in the project, such as hyperparameters, model configurations, or experiment settings.

  5. docs: Include any project-related documentation, such as data dictionaries, or project specifications.

  6. results: Store output files, reports, or visualizations.

  7. logs: Store log files generated during model training, evaluation, or other experiments.

  8. models: Store all the trained models. It is further divided into subfolders:

    • pre-trained: Pre-trained models.
    • fine-tuned: Fine-tuned models.
  9. src: Contain all the scripts used in the project. It is further divided into subfolders:

    • datas: Scripts for data preprocessing, and data loading.
    • models: Scripts for model architectures, loss functions, and evaluation metrics.
    • train: Scripts for training and related functions.
    • eval: Scripts for evaluation and related functions.
    • inference: Scripts for inference and prediction on the test or new data.
    • utils: Utility scripts for logging and other helper functions.
  10. archives: Store old or deprecated scripts, models, or data.

  11. pilot: Store pilot experiments before integrating into the main project.
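
As referenced in the data folder description above, here is a hypothetical loader for the fine-tune layout (e.g. data/fine-tune/CaCO3/train). The file formats, file names, and the 'id' column are assumptions for illustration; the actual logic lives in src/datas:

```python
from pathlib import Path

import numpy as np
import pandas as pd
from torch.utils.data import Dataset

class FineTuneDataset(Dataset):
    """Pairs each spectrum in spe/ with its proxy value in target/."""

    def __init__(self, root: str):
        self.root = Path(root)  # e.g. data/fine-tune/CaCO3/train
        self.info = pd.read_csv(self.root / "info.csv")

    def __len__(self) -> int:
        return len(self.info)

    def __getitem__(self, i: int):
        # assumes one file per measurement, named by a hypothetical 'id' column
        name = self.info.iloc[i]["id"]
        spectrum = np.loadtxt(self.root / "spe" / f"{name}.csv", delimiter=",")
        target = np.loadtxt(self.root / "target" / f"{name}.csv")
        return spectrum.astype(np.float32), np.float32(target)
```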
