本專案將使用Pytorch,實作一個簡單的的音頻信號分類器,可應用于機械信號分類識別,鳥叫聲信號識別等應用場景,
專案使用librosa進行音頻信號處理,backbone使用mobilenet_v2,在Urbansound8K資料上,最終收斂的準確率在訓練集99%,測驗集82%,如果想進一步提高識別準確率可以使用更重的backbone和更多的資料增強方法,
完整的專案代碼:https://download.csdn.net/download/guyuealian/30306697
尊重原創,轉載請注明出處:https://panjinquan.blog.csdn.net/article/details/120601437
目錄
1. 專案結構
2. 環境配置
3.資料處理
(1)資料集Urbansound8K
(2)自定義資料集
(3)音頻特征提取:
4.訓練Pipeline
5.預測demo.py
1. 專案結構

2. 環境配置
使用pip命令安裝libsora和pyaudio,pydub等庫
pip install librosa
pip install pyaudio
pip install pydub
3.資料處理
(1)資料集Urbansound8K
Urbansound8K是目前應用較為廣泛的用于自動城市環境聲分類研究的公共資料集,
包含10個分類:空調聲、汽車鳴笛聲、兒童玩耍聲、狗叫聲、鉆孔聲、引擎空轉聲、槍聲、手提鉆、警笛聲和街道音樂聲,

資料集下載:https://zenodo.org/record/1203745/files/UrbanSound8K.tar.gz
(2)自定義資料集
可以自己錄制音頻信號,制作自己的資料集,參考[audio/dataloader/record_audio.py]
每個檔案夾存放一個類別的音頻資料,每條音頻資料長度在3秒左右,建議每類的音頻資料均衡
生產train和test資料串列:參考[audio/dataloader/create_data.py]
(3)音頻特征提取:
音頻信號是一維的語音信號,不能直接用于模型訓練,需要使用librosa將音頻轉為梅爾頻譜(Mel Spectrogram),
librosa提供python介面,在音頻、樂音信號的分析中經常用到
wav, sr = librosa.load(data_path, sr=16000)
# 使用librosa獲得音頻的梅爾頻譜
spec_image = librosa.feature.melspectrogram(y=wav, sr=sr, hop_length=256)
關于librosa的使用方法,請參考:
- 音頻特征提取——librosa工具包使用
- 梅爾頻譜(mel spectrogram)原理與使用
4.訓練Pipeline
(1)構建訓練和測驗資料
def build_dataset(self, cfg):
"""構建訓練資料和測驗資料"""
input_shape = eval(cfg.input_shape)
# 獲取資料
train_dataset = AudioDataset(cfg.train_data, data_dir=cfg.data_dir, mode='train', spec_len=input_shape[3])
train_loader = DataLoader(dataset=train_dataset, batch_size=cfg.batch_size, shuffle=True,
num_workers=cfg.num_workers)
test_dataset = AudioDataset(cfg.test_data, data_dir=cfg.data_dir, mode='test', spec_len=input_shape[3])
test_loader = DataLoader(dataset=test_dataset, batch_size=cfg.batch_size, shuffle=False,
num_workers=cfg.num_workers)
print("train nums:{}".format(len(train_dataset)))
print("test nums:{}".format(len(test_dataset)))
return train_loader, test_loader
由于librosa.load加載音頻資料特別慢,建議使用cache先進行快取,方便加速
def load_audio(audio_file, cache=False):
"""
加載并預處理音頻
:param audio_file:
:param cache: librosa.load加載音頻資料特別慢,建議使用進行快取進行加速
:return:
"""
# 讀取音頻資料
cache_path = audio_file + ".pk"
# t = librosa.get_duration(filename=audio_file)
if cache and os.path.exists(cache_path):
tmp = open(cache_path, 'rb')
wav, sr = pickle.load(tmp)
else:
wav, sr = librosa.load(audio_file, sr=16000)
if cache:
f = open(cache_path, 'wb')
pickle.dump([wav, sr], f)
f.close()
# Compute a mel-scaled spectrogram: 梅爾頻譜圖
spec_image = librosa.feature.melspectrogram(y=wav, sr=sr, hop_length=256)
return spec_image
(2)構建backbone模型
backbone是一個基于CNN+FC的網路結構,與影像CNN分類模型不同的是,影像CNN分類模型的輸入維度(batch,3,H,W)輸入資料depth=3,而音頻信號的梅爾頻譜圖是深度為depth=1,可以認為是灰度圖,輸入維度(batch,1,H,W),因此實際使用中,只需要將傳統的CNN影像分類的backbone的第一層卷積層的in_channels=1即可,需要注意的是,由于維度不一致,導致不能使用imagenet的pretrained模型,
當然可以將梅爾頻譜圖(灰度圖)是轉為3通道RGB圖,這樣就跟普通的RGB影像沒有什么區別了,也可以imagenet的pretrained模型,如
# 將梅爾頻譜圖(灰度圖)是轉為為3通道RGB圖 spec_image = cv2.cvtColor(spec_image, cv2.COLOR_GRAY2RGB)
def build_model(self, cfg):
if cfg.net_type == "mbv2":
model = mobilenet_v2.mobilenet_v2(num_classes=cfg.num_classes)
elif cfg.net_type == "resnet34":
model = resnet.resnet34(num_classes=args.num_classes)
elif cfg.net_type == "resnet18":
model = resnet.resnet18(num_classes=args.num_classes)
else:
raise Exception("Error:{}".format(cfg.net_type))
model.to(self.device)
return model
(3)訓練引數配置
相關的命令列引數,可參考:
def get_parser():
data_dir = "/media/pan/新加卷/dataset/UrbanSound8K"
# data_dir = "E:/dataset/UrbanSound8K"
train_data = 'data/UrbanSound8K/train.txt'
test_data = 'data/UrbanSound8K/test.txt'
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument('--batch_size', type=int, default=32, help='訓練的批量大小')
parser.add_argument('--num_workers', type=int, default=4, help='讀取資料的執行緒數量')
parser.add_argument('--num_epoch', type=int, default=100, help='訓練的輪數')
parser.add_argument('--num_classes', type=int, default=10, help='分類的類別數量')
parser.add_argument('--learning_rate', type=float, default=1e-3, help='初始學習率的大小')
parser.add_argument('--input_shape', type=str, default='(None, 1, 128, 128)', help='資料輸入的形狀')
parser.add_argument('--gpu_id', type=int, default=0, help='GPU ID')
parser.add_argument('--net_type', type=str, default="mbv2", help='backbone')
parser.add_argument('--data_dir', type=str, default=data_dir, help='資料路徑')
parser.add_argument('--train_data', type=str, default=train_data, help='訓練資料的資料串列路徑')
parser.add_argument('--test_data', type=str, default=test_data, help='測驗資料的資料串列路徑')
parser.add_argument('--work_dir', type=str, default='work_space/', help='模型保存的路徑')
return parser
配置好資料路徑,其他引數默認設定,即可以開始訓練了:
python train.py

訓練完成,使用mobilenet_v2,最終訓練集準確率99%左右,測驗集81%左右,看起來有點過擬合了,
如果想進一步提高識別準確率可以使用更重的backbone,如resnet34,采用更多的資料增強方法,提高模型的泛發性,
![]() | ![]() |
完整的訓練代碼train.py:
# -*-coding: utf-8 -*-
"""
@Author : panjq
@E-mail : pan_jinquan@163.com
@Date : 2021-07-28 09:09:32
"""
import argparse
import os
import numpy as np
import torch
import tensorboardX as tensorboard
from datetime import datetime
from tqdm import tqdm
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import StepLR, MultiStepLR
from audio.dataloader.audio_dataset import AudioDataset
from audio.utils.utility import print_arguments
from audio.utils import file_utils
from audio.models import mobilenet_v2, resnet
class Train(object):
"""Training Pipeline"""
def __init__(self, cfg):
self.device = "cuda:{}".format(cfg.gpu_id) if torch.cuda.is_available() else "cpu"
self.num_epoch = cfg.num_epoch
self.net_type = cfg.net_type
self.work_dir = os.path.join(cfg.work_dir, self.net_type)
self.model_dir = os.path.join(self.work_dir, "model")
self.log_dir = os.path.join(self.work_dir, "log")
file_utils.create_dir(self.model_dir)
file_utils.create_dir(self.log_dir)
self.tensorboard = tensorboard.SummaryWriter(self.log_dir)
self.train_loader, self.test_loader = self.build_dataset(cfg)
# 獲取模型
self.model = self.build_model(cfg)
# 獲取優化方法
self.optimizer = torch.optim.Adam(params=self.model.parameters(),
lr=cfg.learning_rate,
weight_decay=5e-4)
# 獲取學習率衰減函式
self.scheduler = MultiStepLR(self.optimizer, milestones=[50, 80], gamma=0.1)
# 獲取損失函式
self.losses = torch.nn.CrossEntropyLoss()
def build_dataset(self, cfg):
"""構建訓練資料和測驗資料"""
input_shape = eval(cfg.input_shape)
# 獲取資料
train_dataset = AudioDataset(cfg.train_data, data_dir=cfg.data_dir, mode='train', spec_len=input_shape[3])
train_loader = DataLoader(dataset=train_dataset, batch_size=cfg.batch_size, shuffle=True,
num_workers=cfg.num_workers)
test_dataset = AudioDataset(cfg.test_data, data_dir=cfg.data_dir, mode='test', spec_len=input_shape[3])
test_loader = DataLoader(dataset=test_dataset, batch_size=cfg.batch_size, shuffle=False,
num_workers=cfg.num_workers)
print("train nums:{}".format(len(train_dataset)))
print("test nums:{}".format(len(test_dataset)))
return train_loader, test_loader
def build_model(self, cfg):
"""構建模型"""
if cfg.net_type == "mbv2":
model = mobilenet_v2.mobilenet_v2(num_classes=cfg.num_classes)
elif cfg.net_type == "resnet34":
model = resnet.resnet34(num_classes=args.num_classes)
elif cfg.net_type == "resnet18":
model = resnet.resnet18(num_classes=args.num_classes)
else:
raise Exception("Error:{}".format(cfg.net_type))
model.to(self.device)
return model
def epoch_test(self, epoch):
"""模型測驗"""
loss_sum = []
accuracies = []
self.model.eval()
with torch.no_grad():
for step, (inputs, labels) in enumerate(tqdm(self.test_loader)):
inputs = inputs.to(self.device)
labels = labels.to(self.device).long()
output = self.model(inputs)
# 計算損失值
loss = self.losses(output, labels)
# 計算準確率
output = torch.nn.functional.softmax(output, dim=1)
output = output.data.cpu().numpy()
output = np.argmax(output, axis=1)
labels = labels.data.cpu().numpy()
acc = np.mean((output == labels).astype(int))
accuracies.append(acc)
loss_sum.append(loss)
acc = sum(accuracies) / len(accuracies)
loss = sum(loss_sum) / len(loss_sum)
print("Test epoch:{:3.3f},Acc:{:3.3f},loss:{:3.3f}".format(epoch, acc, loss))
print('=' * 70)
return acc, loss
def epoch_train(self, epoch):
"""模型訓練"""
loss_sum = []
accuracies = []
self.model.train()
for step, (inputs, labels) in enumerate(tqdm(self.train_loader)):
inputs = inputs.to(self.device)
labels = labels.to(self.device).long()
output = self.model(inputs)
# 計算損失值
loss = self.losses(output, labels)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
# 計算準確率
output = torch.nn.functional.softmax(output, dim=1)
output = output.data.cpu().numpy()
output = np.argmax(output, axis=1)
labels = labels.data.cpu().numpy()
acc = np.mean((output == labels).astype(int))
accuracies.append(acc)
loss_sum.append(loss)
if step % 50 == 0:
lr = self.optimizer.state_dict()['param_groups'][0]['lr']
print('[%s] Train epoch %d, batch: %d/%d, loss: %f, accuracy: %f,lr:%f' % (
datetime.now(), epoch, step, len(self.train_loader), sum(loss_sum) / len(loss_sum),
sum(accuracies) / len(accuracies), lr))
acc = sum(accuracies) / len(accuracies)
loss = sum(loss_sum) / len(loss_sum)
print("Train epoch:{:3.3f},Acc:{:3.3f},loss:{:3.3f}".format(epoch, acc, loss))
print('=' * 70)
return acc, loss
def run(self):
# 開始訓練
for epoch in range(self.num_epoch):
train_acc, train_loss = self.epoch_train(epoch)
test_acc, test_loss = self.epoch_test(epoch)
self.tensorboard.add_scalar("train_acc", train_acc, epoch)
self.tensorboard.add_scalar("train_loss", train_loss, epoch)
self.tensorboard.add_scalar("test_acc", test_acc, epoch)
self.tensorboard.add_scalar("test_loss", test_loss, epoch)
self.scheduler.step()
self.save_model(epoch, test_acc)
def save_model(self, epoch, acc):
"""保持模型"""
model_path = os.path.join(self.model_dir, 'model_{:0=3d}_{:.3f}.pth'.format(epoch, acc))
if not os.path.exists(os.path.dirname(model_path)):
os.makedirs(os.path.dirname(model_path))
torch.jit.save(torch.jit.script(self.model), model_path)
def get_parser():
data_dir = "/media/pan/新加卷/dataset/UrbanSound8K"
# data_dir = "E:/dataset/UrbanSound8K"
train_data = 'data/UrbanSound8K/train.txt'
test_data = 'data/UrbanSound8K/test.txt'
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument('--batch_size', type=int, default=32, help='訓練的批量大小')
parser.add_argument('--num_workers', type=int, default=4, help='讀取資料的執行緒數量')
parser.add_argument('--num_epoch', type=int, default=100, help='訓練的輪數')
parser.add_argument('--num_classes', type=int, default=10, help='分類的類別數量')
parser.add_argument('--learning_rate', type=float, default=1e-3, help='初始學習率的大小')
parser.add_argument('--input_shape', type=str, default='(None, 1, 128, 128)', help='資料輸入的形狀')
parser.add_argument('--gpu_id', type=int, default=0, help='GPU ID')
parser.add_argument('--net_type', type=str, default="mbv2", help='backbone')
parser.add_argument('--data_dir', type=str, default=data_dir, help='資料路徑')
parser.add_argument('--train_data', type=str, default=train_data, help='訓練資料的資料串列路徑')
parser.add_argument('--test_data', type=str, default=test_data, help='測驗資料的資料串列路徑')
parser.add_argument('--work_dir', type=str, default='work_space/', help='模型保存的路徑')
return parser
if __name__ == '__main__':
parser = get_parser()
args = parser.parse_args()
print_arguments(args)
t = Train(args)
t.run()
5.預測demo.py
# -*-coding: utf-8 -*-
"""
@Author : panjq
@E-mail : pan_jinquan@163.com
@Date : 2021-07-28 09:09:32
"""
import os
import cv2
import argparse
import librosa
import torch
import numpy as np
from audio.dataloader.audio_dataset import load_audio, normalization
from audio.dataloader.record_audio import record_audio
from audio.utils import file_utils, image_utils
class Predictor(object):
def __init__(self, cfg):
# self.device = "cuda:{}".format(cfg.gpu_id) if torch.cuda.is_available() else "cpu"
self.device = "cpu"
self.input_shape = eval(cfg.input_shape)
self.spec_len = self.input_shape[3]
self.model = self.build_model(cfg.model_file)
def build_model(self, model_file):
# 加載模型
model = torch.jit.load(model_file, map_location="cpu")
model.to(self.device)
model.eval()
return model
def inference(self, input_tensors):
with torch.no_grad():
input_tensors = input_tensors.to(self.device)
output = self.model(input_tensors)
return output
def pre_process(self, spec_image):
"""音頻資料預處理"""
if spec_image.shape[1] > self.spec_len:
input = spec_image[:, 0:self.spec_len]
else:
input = np.zeros(shape=(self.spec_len, self.spec_len), dtype=np.float32)
input[:, 0:spec_image.shape[1]] = spec_image
input = normalization(input)
input = input[np.newaxis, np.newaxis, :]
input_tensors = np.concatenate([input])
input_tensors = torch.tensor(input_tensors, dtype=torch.float32)
return input_tensors
def post_process(self, output):
"""輸出結果后處理"""
scores = torch.nn.functional.softmax(output, dim=1)
scores = scores.data.cpu().numpy()
# 顯示圖片并輸出結果最大的label
label = np.argmax(scores, axis=1)
score = scores[:, label]
return label, score
def detect(self, audio_file):
"""
:param audio_file: 音頻檔案
:return: label:預測音頻的label
score: 預測音頻的置信度
"""
spec_image = load_audio(audio_file)
input_tensors = self.pre_process(spec_image)
# 執行預測
output = self.inference(input_tensors)
label, score = self.post_process(output)
return label, score
def detect_file_dir(self, file_dir):
"""
:param file_dir: 音頻檔案目錄
:return:
"""
file_list = file_utils.get_files_lists(file_dir, postfix=["*.wav"])
for file in file_list:
print(file)
label, score = self.detect(file)
print(label, score)
def detect_record_audio(self, audio_dir):
"""
:param audio_dir: 錄制音頻并進行識別
:return:
"""
time = file_utils.get_time()
file = os.path.join(audio_dir, time + ".wav")
record_audio(file)
label, score = self.detect(file)
print(file)
print(label, score)
def get_parser():
model_file = 'data/pretrained/model_060_0.827.pth'
file_dir = 'data/audio'
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument('--num_classes', type=int, default=10, help='分類的類別數量')
parser.add_argument('--input_shape', type=str, default='(None, 1, 128, 128)', help='資料輸入的形狀')
parser.add_argument('--net_type', type=str, default="mbv2", help='backbone')
parser.add_argument('--gpu_id', type=int, default=0, help='GPU ID')
parser.add_argument('--model_file', type=str, default=model_file, help='模型檔案')
parser.add_argument('--file_dir', type=str, default=file_dir, help='音頻檔案的目錄')
return parser
if __name__ == '__main__':
parser = get_parser()
args = parser.parse_args()
p = Predictor(args)
p.detect_file_dir(file_dir=args.file_dir)
# audio_dir = 'data/record_audio'
# p.detect_record_audio(audio_dir=audio_dir)
完整的專案代碼:https://download.csdn.net/download/guyuealian/30306697
更多AI博客,請參考:
人體關鍵點檢測需要用到人體檢測,請查看鄙人另一篇博客:2D Pose人體關鍵點實時檢測(Python/Android /C++ Demo)_pan_jinquan的博客-CSDN博客


轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/325859.html
標籤:其他


