This repository contains code to pre-process a collection of Simrad EK60/EK80 acoustic raw files and LSSS interpretation masks into xarray datasets using pyEcholab and the CRIMAC annotationtools (https://github.com/CRIMAC-WP4-Machine-learning/CRIMAC-annotationtools). The datasets are then stored as Zarr or NetCDF files on disk.
The processing is split into three separate steps. The steps need to be run in order, but later steps can be rerun independently of the earlier steps.
The first step is to generate an index-time file. The output from this step is a parquet file containing the individual input file names and the associated ping and time numbers. In cases where there are discontinuities in the time or distance variable, new time and distance variables are generated. These new variables are used when generating time and distance in the subsequent steps. The parquet file can be used to look up the original data.
The outputs of this step are the following parquet files:
- <OUTPUT_NAME>_pingdist.parquet : the uncorrected ping_time and distance values for the survey
- <OUTPUT_NAME>_pingdistcorrected.parquet : the corrected ping_time and distance values for the survey
The corrected parquet file contains the following three columns: "raw_file", "distance" and "ping_time". This correction file is automatically read in step 2.
Step 1 is run with the setting OUTPUT_TYPE=parquet.
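For a quick sanity check, the corrected index can be read back with pandas. This is a minimal sketch only; the file name assumes OUTPUT_NAME=S2020842, as used in the examples further down:

import pandas as pd

# Read the corrected ping_time/distance index produced by step 1
# (file name assumes OUTPUT_NAME=S2020842)
index = pd.read_parquet("S2020842_pingdistcorrected.parquet")

print(index.columns.tolist())   # expected: ['raw_file', 'distance', 'ping_time']
print(index.head())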
This step reads the .raw files and generates a gridded version of the data with the dimensions time, range and frequency. If the range resolution is similar between the channels, the data are simply stacked. In cases where the channels have different range resolutions, the data are regridded onto the grid of the main frequency (MAIN_FREQ).
Use the same <OUTPUT_NAME> as in step 1 if you want to use the corrected ping_time and distance from step 1. If the parquet file with the corrected values from step 1 is not found, step 2 uses the original ping_time and distance from the raw files.
The output of this step is the Zarr or NetCDF file: <OUTPUT_NAME>_sv.zarr or <OUTPUT_NAME>_sv.nc.
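The gridded output can be opened directly with xarray. A minimal sketch, assuming OUTPUT_TYPE=zarr and OUTPUT_NAME=S2020842; the exact variable and coordinate names depend on the preprocessor version:

import xarray as xr

# Open the gridded Sv dataset written by step 2
sv = xr.open_zarr("S2020842_sv.zarr")

# Inspect the time x range x frequency grid and the data variables
print(sv)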
This step first converts Marec LSSS work files into a parquet file containing the annotations, using the CRIMAC-annotationtools. These data are independent of the gridded data in step 2. Next, the annotations are overlaid on the grid from step 2, and a pixel-wise annotation that matches the grid from step 2 is generated.
To run step 3, use --env OUTPUT_TYPE=labels.zarr. In addition, you have to set --env shipID=842 and --env parselayers=0. See the step 3 example below for details.
The outputs of this step are the parquet file <OUTPUT_NAME>_labels.parquet and the Zarr file <OUTPUT_NAME>_labels.zarr.
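Both outputs can be read back in the same way. A minimal sketch, again assuming OUTPUT_NAME=S2020842:

import pandas as pd
import xarray as xr

# Annotations parsed from the LSSS .work files
labels_table = pd.read_parquet("S2020842_labels.parquet")
print(labels_table.head())

# Pixel-wise annotation mask on the same grid as the step 2 Sv data
labels = xr.open_zarr("S2020842_labels.zarr")
print(labels)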
- Automatic range re-gridding (by default it uses the main channel's range from the first raw file, see the MAX_RANGE_SRC option below).
- Sv processing and re-gridding of the channels are done in parallel (using Dask's delayed); a sketch of this pattern is given after this list.
- Automatic resuming from the last ping_time if the output file exists.
- Batch processing is done by appending directly to the output file, which should be memory efficient.
- The Docker image of this repository is available on Docker Hub (https://hub.docker.com/r/crimac/preprocessor).
- Processing annotations from .work files into a pandas dataframe object (using https://github.com/CRIMAC-WP4-Machine-learning/CRIMAC-annotationtools).
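To illustrate the parallel re-gridding pattern mentioned in the feature list above, the sketch below interpolates synthetic per-channel Sv arrays onto a common main-channel range grid using Dask's delayed. It only illustrates the idea with made-up data and function names; it is not the code used in this repository:

import numpy as np
from dask import delayed, compute

# Synthetic example data: three channels with different range resolutions
rng = np.random.default_rng(0)
target_range = np.arange(0, 500, 0.5)                              # main channel's range grid
channel_ranges = [np.arange(0, 500, step) for step in (0.5, 0.25, 1.0)]
channel_sv = [rng.normal(size=(100, r.size)) for r in channel_ranges]  # (ping, range) per channel

def regrid_channel(sv, source_range, target_range):
    # Interpolate one channel's Sv values onto the main channel's range grid, ping by ping
    return np.stack([np.interp(target_range, source_range, ping) for ping in sv])

# One delayed task per channel, so the channels are regridded in parallel
tasks = [delayed(regrid_channel)(sv, r, target_range)
         for sv, r in zip(channel_sv, channel_ranges)]
regridded = compute(*tasks)

# Stack into a single (frequency, ping_time, range) cube
cube = np.stack(regridded)
print(cube.shape)   # (3, 100, 1000)

Each channel becomes one delayed task, so Dask can regrid the channels concurrently before they are stacked along the frequency dimension.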
- Two directories need to be mounted (a third, /workin, is optional):
  - /datain should be mounted to the data directory where the .raw files are located.
  - /dataout should be mounted to the directory where the output is written.
  - /workin should be mounted to the directory where the .work files are located (optional, needed for step 3).
- Choose the frequency of the main channel:
  --env MAIN_FREQ=38000
- Choose the range determination type:
  # Set the maximum range to 500
  --env MAX_RANGE_SRC=500
  # or use the main channel's maximum range from all the files (for historical data)
  --env MAX_RANGE_SRC=auto
  # or use the main channel's maximum range from the first processed file
  --env MAX_RANGE_SRC=None
- Select the output type; zarr and NetCDF4 are supported:
  # for step 1: creates the file <OUTPUT_NAME>_pingdistcorrected.parquet that is used in step 2
  --env OUTPUT_TYPE=parquet
  # for step 2
  --env OUTPUT_TYPE=zarr
  --env OUTPUT_TYPE=netcdf4
  # for step 3
  --env OUTPUT_TYPE=labels.zarr
- Select the output file name (optional, defaults to out.<zarr/nc>):
  # use the same <OUTPUT_NAME> in both step 1 and 2
  --env OUTPUT_NAME=S2020842
- Set whether to write a visual overview of the Sv data (as a PNG image):
  --env WRITE_PNG=1 # enable, or 0 to disable
- Optional attribute to process only one selected file when there are many raw files in the raw folder:
  --env RAW_FILE=2019847-D20190509-T014326.raw
- Optional attribute for logging: LOGGING=1 (on), LOGGING=0 (off). The default is logging on when the attribute is not set.
  --env LOGGING=1 # enable, or 0 to disable
- Optional attribute for debugging (detailed stderr output): DEBUG=1 (on), DEBUG=0 (off). The default is debug off when the attribute is not set. DEBUG=1 will often exit on errors.
  --env DEBUG=1 # enable, or 0 to disable
- For step 3: shipID is used in labels.zarr to annotate the objects; parselayers=0 skips parsing layers, while parselayers=1 parses layers.
  --env shipID=842
  --env parselayers=0
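# Example step 1: generate the <OUTPUT_NAME>_pingdistcorrected.parquet index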
docker run -it \
-v /data/cruise_data/2020/S2020842_PHELMERHANSSEN_1173/ACOUSTIC/EK60/EK60_RAWDATA:/datain \
-v /data/cruise_data/2020/S2020842_PHELMERHANSSEN_1173/ACOUSTIC/LSSS/WORK:/workin \
-v /localscratch/ibrahim-echo/out:/dataout \
--security-opt label=disable \
--env OUTPUT_TYPE=parquet \
--env MAIN_FREQ=38000 \
--env MAX_RANGE_SRC=500 \
--env OUTPUT_NAME=S2020842 \
--env WRITE_PNG=0 \
crimac/preprocessor
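# Example step 2: grid the Sv data and write <OUTPUT_NAME>_sv.zarr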
docker run -it \
-v /data/cruise_data/2020/S2020842_PHELMERHANSSEN_1173/ACOUSTIC/EK60/EK60_RAWDATA:/datain \
-v /data/cruise_data/2020/S2020842_PHELMERHANSSEN_1173/ACOUSTIC/LSSS/WORK:/workin \
-v /localscratch/ibrahim-echo/out:/dataout \
--security-opt label=disable \
--env OUTPUT_TYPE=zarr \
--env MAIN_FREQ=38000 \
--env MAX_RANGE_SRC=500 \
--env OUTPUT_NAME=S2020842 \
--env WRITE_PNG=0 \
crimac/preprocessor
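# Example step 3: grid the annotations and write <OUTPUT_NAME>_labels.zarr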
docker run -it \
-v /data/cruise_data/2020/S2020842_PHELMERHANSSEN_1173/ACOUSTIC/EK60/EK60_RAWDATA:/datain \
-v /data/cruise_data/2020/S2020842_PHELMERHANSSEN_1173/ACOUSTIC/LSSS/WORK:/workin \
-v /localscratch/ibrahim-echo/out:/dataout \
--security-opt label=disable \
--env OUTPUT_TYPE=labels.zarr \
--env shipID=842 \
--env parselayers=0 \
--env OUTPUT_NAME=S2020842 \
crimac/preprocessor
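# Build the Docker image locally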
git clone https://github.com/CRIMAC-WP4-Machine-learning/CRIMAC-preprocessing.git
cd CRIMAC-preprocessing/
docker build --build-arg=commit_sha=$(git rev-parse HEAD) --no-cache --tag crimac-preprocessor .