Bioacoustics Datasets

A collection of tools and 1,000,000+ unified annotations for bioacoustics datasets.

Dataset	Species	# of annotated calls	Dataset size (GB)	Duration (hh:mm:ss)	License
Animal Sounds	Birds, cats, chickens, cows, dogs, donkeys, frogs, lions, monkeys, sheep	809	0.13	0:57:47	-
AnuraSet	Anurans	16089	18.6	27:00:00	cc-by-4.0
BIRDeep	38 avian species	3749	13.41	8:50:00	MIT
BirdVox	25 avian species	35402	0.79	7:23:00	cc-by-4.0
Domestic Canary	Canary	23308	0.856	3:00:00	cc-by-4.0
Colombia/Costa Rica Coffee Farms	89 avian species	6952	3.8	34:00:00	cc-by-4.0
Darpa	Humans	1718	0.67	4:00:00	No license specified, the work may be protected by copyright
Avian Dawn	58 avian species, 1 amphibian species	41183	20.3	131:15:00	cc-by-4.0
DCASE	Birds	7206	32.00	17:25:00	cc-by-4.0
Egyptian fruit bat	Egyptian fruit bat	90000	91.00	37:45:00	cc-by-4.0
ENABirds	Birds	16052	1.40	6:20:00	cc-by-1.0
Female Rook	Rook birds	3417	54.37	10:45:36	cc-by-nc-nd-4.0
The Vocal Repertoire of Adult and Neonate Otters	Otter	441	0.57	0:06:23	cc
Hainan Gibbons	Hainan Gibbons	1233	13.39	104:00:00	cc-by-4.0
Hawaii Birds	27 avian species	59583	5.8	51:00:00	cc-by-4.0
HICEAS	Whales, Dolphins	796	3.10	12:40:00	"Public dataset hosted in Google Cloud Storage"
Distributed acoustic cues for caller identity in macaque vocalization	Macaque monkeys	7285	0.15	0:45:00	cc-by-1.0
InfantMarmosetVox	Marmoset monkeys	169318	21.2	58:20:00	cc-by-4.0
Northeast US Sounds	81 avian species	50760	27.8	285:00:00	cc-by-4.0
Orcas Classifications	Orca whales	398	0.26	0:26:30	-
Pigs	Pig	6887	0.2	0:40:26	cc-by-4.0
Rainforest	Birds, frogs	1216	13.05	20:16:00	"Free for personal or academic purposes"
Rodents	Rodents: mouse, gerbil	4576	1.36	0:48:34	cc-by-4.0
Rook	Rook birds	17662	23.49	17:21:17	cc-by-4.0
Sierra Nevada	21 avian species	10976	3.57	16:40:00	cc-by-4.0
Southwest Amazon	132 avian species	16482	4.51	21:00:00	cc-by-4.0
Watkins Marine Animal Sounds	21 dolphin, 13 seal, 32 whale species	15152	9.61	29:10:15	"Sound files are free to download for personal or academic use"
Western US	56 avian species	20147	7.08	33:00:00	cc-by-4.0

Downloading all of the data requires about TOTAL_GB of free space. You can also pick and choose which datasets to download.
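If you only want a subset, a quick way to estimate the space it needs is to add up the sizes from the table. The sketch below is not part of the repository: the helper names are hypothetical, the sizes are copied from the "Dataset size (GB)" column for four example datasets, and it assumes decimal gigabytes and that no extra scratch space is needed during download.

```python
import shutil

# Sizes in GB, copied from the "Dataset size (GB)" column of the table above.
# Extend this dictionary with any other rows you plan to download.
DATASET_SIZES_GB = {
    "BirdVox": 0.79,
    "Pigs": 0.2,
    "Rodents": 1.36,
    "Hawaii Birds": 5.8,
}


def required_gb(names, sizes=DATASET_SIZES_GB):
    """Return the combined size in GB of the chosen datasets."""
    return sum(sizes[name] for name in names)


def has_enough_space(names, path=".", sizes=DATASET_SIZES_GB):
    """Check whether the filesystem holding `path` has room for the datasets."""
    free_gb = shutil.disk_usage(path).free / 1e9  # bytes -> decimal GB
    return free_gb >= required_gb(names, sizes)


if __name__ == "__main__":
    chosen = ["BirdVox", "Pigs"]
    print(f"Need about {required_gb(chosen):.2f} GB")
    print("Enough free space:", has_enough_space(chosen))
```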

Installation Instructions

Run ./scripts/download_data.sh

After running the download script, your datasets folder should look like this:

└── datasets/
    ├── annotations.pkl
    ├── dataset1/
    │   ├── audio/
    │   │   ├── audio1.wav
    │   │   └── audio2.wav
    │   ├── annotations.pkl
    │   └── stats.txt
    └── dataset2/
        ├── audio/
        ├── annotations.pkl
        └── stats.txt
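
To confirm that a download finished with the layout shown above, a small check like the following may help. It is a sketch, not a script shipped with the repository; check_layout is a hypothetical helper, and it assumes every subfolder of datasets/ is a dataset that should contain audio/, annotations.pkl, and stats.txt.

```python
from pathlib import Path


def check_layout(datasets_root):
    """Return a list of problems found under datasets_root (empty if none)."""
    root = Path(datasets_root)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    # The master annotations file sits directly in the datasets folder.
    if not (root / "annotations.pkl").is_file():
        problems.append("missing master annotations.pkl")

    # Each dataset folder should hold its own annotations, stats, and audio.
    for dataset_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for name in ("annotations.pkl", "stats.txt"):
            if not (dataset_dir / name).is_file():
                problems.append(f"{dataset_dir.name}: missing {name}")
        if not (dataset_dir / "audio").is_dir():
            problems.append(f"{dataset_dir.name}: missing audio/ folder")
    return problems


if __name__ == "__main__":
    for problem in check_layout("datasets"):
        print(problem)
```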

Each dataset has its own annotation file, and one master annotations file is located directly in the datasets folder. Each annotations.pkl is a pickled Python dictionary that maps a wav file path to a list of annotated events, structured as follows:

{
   wav_file_path: [
                   {'start_time': 0, 'end_time': 1.7, 'species': 'bird', 'sub-species': 'serinus canaria'},
                   {'start_time': 2.3, 'end_time': 2.48, ...},
                   ...
                  ]
   ...
}
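
As an illustration of working with this structure, the sketch below loads an annotations file and tallies calls and annotated seconds per species. The function names are hypothetical and not part of the repository; it assumes every event has start_time and end_time, and treats a missing species key as "unknown". Only unpickle files you trust, since pickle.load can execute arbitrary code.

```python
import pickle
from collections import Counter


def load_annotations(pkl_path):
    """Load the {wav_file_path: [event, ...]} dictionary from a pickle file."""
    with open(pkl_path, "rb") as f:
        return pickle.load(f)


def summarize(annotations):
    """Return (calls, seconds): per-species call counts and annotated duration."""
    calls = Counter()
    seconds = Counter()
    for wav_path, events in annotations.items():
        for event in events:
            species = event.get("species", "unknown")
            calls[species] += 1
            seconds[species] += event["end_time"] - event["start_time"]
    return calls, seconds


if __name__ == "__main__":
    annotations = load_annotations("datasets/annotations.pkl")
    calls, seconds = summarize(annotations)
    for species, count in calls.most_common():
        print(f"{species}: {count} calls, {seconds[species]:.1f} s")
```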

Tools

./scripts/generate_spectrograms.sh - Generates 100 high-quality mel spectrograms for each dataset and places them in the visualizations folder.
scripts/moving_spectrogram.py example_audio.wav output.wav - Takes a wav file (here example_audio.wav), generates a moving spectrogram with audio, and saves it to visualizations/output.wav.
training/data_engine.py - Takes in datasets and produces a PyTorch Dataset. The __getitem__() method returns [audio, is_vocalization, species, speaker]: audio is a tensor of numbers, is_vocalization is a boolean, species is the species of the vocalization, and speaker is the speaker of the vocalization. species and speaker are both "Noise" for a non-vocalization event, and speaker is 'no-speaker' if there is no speaker data. Dataset has three required parameters: datasets_path, which should be the datasets folder; save_path, where the train and val splits will be stored; and datasets, a list of all the datasets you would like to use. After the data is loaded into the data_engine, call data_engine.get_annotated_dataset(dataset_names=[]) to get the PyTorch dataset described above.
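
The labeling convention described for data_engine.py can be illustrated without PyTorch. The sketch below is not the repository's implementation: labels_for_window is a hypothetical helper, the idea of labeling fixed time windows is an assumption, and so is the 'speaker' key on an annotation event. It only shows how the "Noise" and 'no-speaker' fallbacks fit together.

```python
NOISE = "Noise"            # species and speaker label for non-vocalization windows
NO_SPEAKER = "no-speaker"  # speaker label when an event has no speaker data


def labels_for_window(events, start, end):
    """Return (is_vocalization, species, speaker) for the window [start, end).

    `events` is the list of annotation dictionaries for one wav file.
    """
    for event in events:
        # An event overlaps the window if it starts before the window ends
        # and ends after the window starts.
        if event["start_time"] < end and event["end_time"] > start:
            return True, event["species"], event.get("speaker", NO_SPEAKER)
    # No overlapping event: treat the window as background noise.
    return False, NOISE, NOISE


if __name__ == "__main__":
    events = [
        {"start_time": 0, "end_time": 1.7, "species": "bird"},
        {"start_time": 2.3, "end_time": 2.48, "species": "bird", "speaker": "A"},
    ]
    print(labels_for_window(events, 0.0, 1.0))
    print(labels_for_window(events, 1.8, 2.2))
    print(labels_for_window(events, 2.0, 2.4))
```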

zacbakerr/bioacoustics-datasets
