The Robocall Audio Dataset is a collection of over one thousand audio recordings of automated or semi-automated phone calls, commonly called robocalls. The FTC made these recordings available through its Project Point of No Entry (PPoNE) initiative (FTC link, FTC News, Web Archive link). The recordings were used in real-world calls: most are suspected to be illegal, and malicious actors used a majority of them to defraud people. The dataset also includes the cease and desist letters that the FTC sent to the suspected Point of Entry carrier (also called the gateway provider) responsible for routing each call into the North American phone network.
Each audio recording was collected using the links embedded in the cease and desist letters sent by the FTC to the suspected call-originating entity (a telephone carrier or the robocaller). The webpage and the PDF files published on the PPoNE website were collected using automated crawlers. Links embedded in the PDFs were extracted using pdfgrep, and the linked files were then downloaded using wget.
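The link-extraction step above can be sketched in Python. This is not the authors' pipeline (they used pdfgrep and wget); it is a minimal illustration that assumes the PDF text has already been extracted to a string. The URL pattern and the function names `extract_links` and `download` are hypothetical.

```python
import re
import urllib.request
from pathlib import Path

# Hypothetical pattern: an http(s) URL running up to whitespace or a closing bracket/quote.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_links(text):
    """Return the unique URLs found in text, in order of first appearance."""
    seen = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen


def download(url, out_dir):
    """Save url into out_dir, naming the file after the last URL path segment."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / (url.rstrip("/").rsplit("/", 1)[-1] or "download")
    urllib.request.urlretrieve(url, target)
    return target
```

Deduplicating before downloading avoids fetching the same recording twice when a letter repeats a link.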
Although this dataset does not contain granular information about where or how these audio examples were collected, most such robocall recordings are collected using telephony honeypots, voicemails, or reports from phone users who may have recorded the call on their own devices. These calls were likely generated by a robocalling system, and the audio traversed the phone network (over a logical channel) before being recorded by the recipient.
Since these recordings are sourced from various honeypots and voicemails, the original audio formats included wav, amr, and mp3. Some recordings were in stereo and others in mono.
The recordings were converted to WAV (pcm_s16le) and resampled to 16 kHz using ffmpeg. When the source audio was in stereo, it was split into two mono streams (filenames ending in _left.wav and _right.wav). The _left.wav file contains the audio stream originated by the remote party (the robocaller), and the _right.wav file contains the audio stream originated by the local party (the honeypot or voicemail). Only the _left.wav files were transcribed. The corresponding _right.wav audio files are nevertheless included in the dataset for completeness.
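As an illustration of the conversion described above, the sketch below builds ffmpeg command lines that produce 16 kHz pcm_s16le WAV output and split stereo input into left and right mono files with ffmpeg's channelsplit filter. The exact commands the authors ran are not documented here, so the flags, the output naming, and the helper names `build_ffmpeg_command` and `convert` are assumptions.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, out_dir, stereo):
    """Build an ffmpeg command that writes 16 kHz, 16-bit PCM WAV output."""
    stem = Path(src).stem
    out_dir = Path(out_dir)
    encode = ["-ar", "16000", "-c:a", "pcm_s16le"]  # resample and set the codec
    if not stereo:
        # Mono source: a single output file with one channel.
        return ["ffmpeg", "-i", str(src), "-ac", "1", *encode,
                str(out_dir / f"{stem}.wav")]
    # Stereo source: split the two channels into separate mono outputs.
    return [
        "ffmpeg", "-i", str(src),
        "-filter_complex", "[0:a]channelsplit=channel_layout=stereo[left][right]",
        "-map", "[left]", *encode, str(out_dir / f"{stem}_left.wav"),
        "-map", "[right]", *encode, str(out_dir / f"{stem}_right.wav"),
    ]


def convert(src, out_dir, stereo):
    """Run the conversion; requires ffmpeg to be installed and on PATH."""
    subprocess.run(build_ffmpeg_command(src, out_dir, stereo), check=True)
```

Separating command construction from execution makes it possible to inspect or log the command before running it on a large batch.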
The metadata.csv file contains the filename and the transcription of each audio recording. It also includes the language used in the call, which was detected automatically using Whisper. The dataset consists of 1432 calls, of which 96.2% (1378) are in English and 3.8% (54) are in Mandarin/Chinese. The medium (multilingual) Whisper model was used to transcribe the audio. The specific cease and desist letter or warning letter is also included for each audio recording.
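A minimal sketch of how one row of metadata could be produced with Whisper follows. It assumes the openai-whisper Python package, whose `transcribe` method returns a dictionary containing `"text"` and `"language"`. The helper name `transcribe_file` is hypothetical, and the decoding options the authors used are not known.

```python
def transcribe_file(model, path):
    """Transcribe one recording and return a metadata-style row.

    `model` is any object with a Whisper-like transcribe(path) method that
    returns a dict containing "text" and "language".
    """
    result = model.transcribe(path)
    return {
        "file_name": path,
        "language": result["language"],  # language code detected by the model
        "transcript": result["text"],
    }


# Usage, assuming the openai-whisper package is installed:
#   import whisper
#   model = whisper.load_model("medium")  # the multilingual medium checkpoint
#   row = transcribe_file(model, "audio-wav-16khz/1112259_normalized.wav")
```

Passing the model in as an argument means it is loaded once and reused across all recordings.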
The cease and desist letters and warning letters are included in PDF format in the pdf_files directory. The case_pdf column in the metadata.csv file contains the path to the specific letter for each audio recording.
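Because each recording points to its letter through the case_pdf column, recordings can be grouped by the letter that cites them. The sketch below uses only the standard library. The sample rows are made up for illustration, and the helper name `recordings_by_letter` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def recordings_by_letter(csv_file):
    """Map each case_pdf path to the list of file_name values that cite it."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["case_pdf"]].append(row["file_name"])
    return dict(groups)


# Tiny in-memory stand-in for metadata.csv (made-up rows, for illustration only).
sample = io.StringIO(
    "file_name,language,transcript,case_details,case_pdf\n"
    "a.wav,en,hello,,pdf_files/letter1.pdf\n"
    "b.wav,en,hi,,pdf_files/letter1.pdf\n"
    "c.wav,zh,ni hao,,pdf_files/letter2.pdf\n"
)
groups = recordings_by_letter(sample)
```

With the real file, the same function can be called on `open("metadata.csv", newline="", encoding="utf-8")`.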
The dataset is hosted on GitHub and can be loaded with pandas or the Hugging Face datasets library.
import pandas as pd
df = pd.read_csv('metadata.csv')
df.columns
#Output: Index(['file_name', 'language', 'transcript', 'case_details', 'case_pdf'], dtype='object')
df.head()
The dataset can also be loaded using Hugging Face's datasets library.
from datasets import Dataset, Audio
import pandas as pd
df = pd.read_csv('metadata.csv')
# Build a dataset from the metadata columns and decode the audio at 16 kHz
audio_dataset = Dataset.from_dict({
    "audio": df['file_name'].to_list(),
    "transcript": df['transcript'].to_list(),
    "language": df['language'].to_list(),
    "case_pdf": df['case_pdf'].to_list(),
}).cast_column("audio", Audio(sampling_rate=16000))
# audio_dataset
# Output:
# Dataset({
#     features: ['audio', 'transcript', 'language', 'case_pdf'],
#     num_rows: 1432
# })
Inspect individual audio entries
audio_dataset[0]
'''
#Output
{'audio': {'path': 'audio-wav-16khz/1112259_normalized.wav',
'array': array([0.03210449, 0.03390503, 0.03796387, ..., 0.00616455, 0.00695801,
0.0072937 ]),
'sampling_rate': 16000},
'transcript': 'We would like to inform you that there is an order placed for Apple iPhone 11 Pro using your Amazon account. If you do not authorize this order, press 1 or press 2 to authorize this order. ',
'language': 'en',
'case_pdf': 'pdf_files/pointofnoentry-every1telecomceasedesistletterfinaljms.pdf'}
'''
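As a small example of working with an entry, the clip length follows from the number of samples divided by the sampling rate. The sketch assumes the entry has the dictionary shape shown in the output above. Other versions of the datasets library may return a different audio object, and the helper name `duration_seconds` is hypothetical.

```python
def duration_seconds(entry):
    """Return the clip length in seconds for a decoded audio entry."""
    audio = entry["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Usage with the dataset built above:
#   duration_seconds(audio_dataset[0])
```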
This document describing the data is released under the Creative Commons BY-ND [1] license. The data itself is in the public domain. If you find this structured data useful, we would appreciate (but do not require) an acknowledgement in any publications:
@techreport{robocallDatasetTechReport,
author = {Sathvik Prasad and Bradley Reaves},
title = {{Robocall Audio from the FTC's Project Point of No Entry}},
institution = {{North Carolina State University}},
year = {2023},
month = {Nov},
number = {TR-2023-1},
url = {}
}
Pull Requests are welcome!
- Extract the caller ID information from the PDF complaints for each call
- Extract the time of the call from the PDF complaints for each call