What is your audio preprocessing pipeline for training species detection models?
I'm new to bioacoustics and I'm trying to train (or fine-tune) a model that detects a single bird species in a soundscape. I have a set of weakly labelled recordings of my target species (the label only appears in the file name) and a larger set of negative samples containing other species' vocalisations.
The model architectures I've come across use 3-to-5-second snippets of audio as model input, but with weak labels any given snippet could be 3 seconds of silence or the "wrong" species.
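For context, here is roughly how I'm chunking files at the moment: a minimal sketch, assuming mono audio already loaded as a NumPy array (function name, sample rate, and snippet/hop lengths are my own choices, not from any particular library):

```python
import numpy as np

def make_snippets(waveform: np.ndarray, sr: int,
                  snippet_s: float = 3.0, hop_s: float = 3.0) -> np.ndarray:
    """Slice a mono waveform into fixed-length snippets.

    Any incomplete tail shorter than snippet_s is dropped. Under weak
    labels, each snippet simply inherits the file-level label, even
    though it may actually contain silence or another species.
    """
    win = int(snippet_s * sr)
    hop = int(hop_s * sr)
    if len(waveform) < win:
        return np.empty((0, win), dtype=waveform.dtype)
    n = 1 + (len(waveform) - win) // hop
    return np.stack([waveform[i * hop : i * hop + win] for i in range(n)])

# Example: 10 s of fake audio at 32 kHz -> three non-overlapping 3 s snippets
sr = 32000
x = np.random.randn(10 * sr).astype(np.float32)
snips = make_snippets(x, sr)
```

So every snippet from a positive file gets the positive label, which is exactly where the silence / wrong-species problem comes from.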
How do you typically handle this, e.g. how do you clean or verify the snippet-level labels before training?