Tutorials|June 9, 2026|9 min read

Speaker Identification in Sonicribe: Multi-Person Transcription

Learn how Sonicribe's speaker identification feature detects and labels different speakers in meetings, interviews, and group recordings.

Sonicribe Team

Product Team

Know Who Said What

When you transcribe a meeting with four people, a plain transcript is a wall of text. You cannot tell who proposed the new feature, who raised the objection, or who committed to the deadline. The words are there, but the attribution is lost.

Speaker identification solves this. Sonicribe analyzes the audio, detects when different people are speaking, and labels each segment with a speaker identifier. The result is a structured conversation where every statement is attributed to its speaker.

This guide explains how speaker identification works in Sonicribe, how to get the best results, and how to use speaker-labeled transcripts in your workflow.

How Speaker Identification Works

Speaker identification (also called speaker diarization) is the process of determining "who spoke when" in an audio recording. It operates in several stages:

Stage 1: Voice Activity Detection

The first step is determining when someone is speaking versus when there is silence or background noise. This separates speech segments from non-speech segments.

Stage 2: Speaker Embedding Extraction

For each speech segment, the model extracts a voice embedding, which is a mathematical representation of the speaker's voice characteristics. These embeddings capture features like pitch, timbre, speaking rate, and vocal quality that distinguish one voice from another.

Stage 3: Clustering

The embeddings are grouped (clustered) so that segments from the same speaker are associated together. If three people are in a meeting, the algorithm identifies three distinct clusters of voice characteristics.

Stage 4: Labeling

Each cluster is assigned a speaker label (Speaker 1, Speaker 2, Speaker 3, etc.). These labels are then applied to the corresponding segments in the transcript.
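Sonicribe's internal pipeline is not exposed, but the four stages above can be sketched with toy data. In this illustration, the 2-D "embeddings" and the nearest-centroid clustering are stand-ins for the real neural models, chosen only to make the flow visible:

```python
# Toy sketch of the four-stage diarization pipeline described above.
# Real systems extract embeddings from audio with a neural model; here
# each "segment" carries a hand-made 2-D embedding so the flow is visible.

def cluster(embeddings, n_speakers):
    """Greedy nearest-centroid clustering (stand-in for real clustering)."""
    centroids = embeddings[:n_speakers]  # seed with the first segments
    labels = []
    for e in embeddings:
        dists = [sum((a - b) ** 2 for a, b in zip(e, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

# Stage 1 (voice activity detection) has already trimmed silence:
# only speech segments remain.
segments = [
    {"text": "We need to finalize the Q3 roadmap.",   "emb": (0.1, 0.9)},
    {"text": "Notifications should be top priority.", "emb": (0.8, 0.2)},
    {"text": "Can we do a quick analysis?",           "emb": (0.15, 0.85)},
]

# Stages 2-3: embeddings are extracted, then clustered into speaker groups.
labels = cluster([s["emb"] for s in segments], n_speakers=2)

# Stage 4: each cluster becomes a generic speaker label.
for seg, lab in zip(segments, labels):
    print(f"Speaker {lab + 1}: {seg['text']}")
```

Segments one and three land in the same cluster because their embeddings sit close together, so both get the same speaker label.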

The Result

The raw Whisper transcript transforms from:

"We need to finalize the Q3 roadmap by Friday. I think the notification system should be our top priority. The customer data suggests analytics is more urgent. Can we do a quick analysis to settle this?"

Read more: Speaker Identification in Transcription: How It Works

Into a labeled conversation:

Speaker 1: We need to finalize the Q3 roadmap by Friday.
Speaker 2: I think the notification system should be our top priority.
Speaker 3: The customer data suggests analytics is more urgent.
Speaker 1: Can we do a quick analysis to settle this?
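A labeled transcript in this shape is easy to turn into structured data. A minimal sketch, assuming the "Speaker N: text" line format shown above:

```python
import re

labeled = """\
Speaker 1: We need to finalize the Q3 roadmap by Friday.
Speaker 2: I think the notification system should be our top priority.
Speaker 3: The customer data suggests analytics is more urgent.
Speaker 1: Can we do a quick analysis to settle this?"""

# Parse "Speaker N: text" lines into (speaker, text) records.
pattern = re.compile(r"^(Speaker \d+): (.+)$")
turns = [pattern.match(line).groups() for line in labeled.splitlines()]

speakers = {who for who, _ in turns}
print(f"{len(turns)} turns from {len(speakers)} speakers")
```

Once the transcript is a list of (speaker, text) pairs, counting turns, filtering by person, or feeding the conversation into other tools becomes straightforward.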

Enabling Speaker Identification in Sonicribe

Step 1: Open Sonicribe Settings

Navigate to Sonicribe's settings panel. Under the transcription options, you will find the speaker identification toggle.

Step 2: Enable Identify Speakers

Turn on the "Identify Speakers" option. You can optionally specify the expected number of speakers, which helps the algorithm produce more accurate results. If you do not specify a number, Sonicribe will estimate it automatically.

Step 3: Record or Import Audio

Start recording (via microphone or system audio capture) or import an audio/video file. Speaker identification works with all audio input methods.

Step 4: Review the Labeled Transcript

After transcription, the output will include speaker labels. Review the transcript and rename the generic labels (Speaker 1, Speaker 2) to actual names if desired.

Read more: Getting Started with Sonicribe: Your Complete Guide

Tips for Best Speaker Identification Results

Audio Quality Matters

Speaker identification accuracy depends heavily on audio quality. The algorithm needs to distinguish between voices, which requires clear, separate audio for each speaker.

Do:
  • Use a central microphone for in-person meetings
  • Ensure all speakers are at similar distances from the microphone
  • Record in a quiet environment
  • Use headphones during virtual meetings (reduces echo)
Avoid:
  • Recording in noisy environments
  • Placing the microphone next to one speaker (their voice dominates)
  • Letting speakers talk over each other frequently
  • Using low-quality Bluetooth microphones

Minimize Crosstalk

Crosstalk (two or more people speaking simultaneously) is the biggest challenge for speaker identification. When voices overlap, the algorithm cannot reliably separate them, leading to segments attributed to the wrong speaker or merged into a single segment.

Strategies to reduce crosstalk:

  • In in-person meetings, establish speaking norms (one person at a time)
  • In virtual meetings, use the mute button when not speaking
  • For interviews, pause briefly between questions and answers
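If your transcription tool exposes per-segment timestamps (Whisper-based pipelines typically do), crosstalk shows up as overlapping time ranges. A small sketch with hypothetical (speaker, start, end) tuples:

```python
# Hypothetical timed segments: (speaker, start_sec, end_sec).
segments = [
    ("Speaker 1", 0.0, 4.2),
    ("Speaker 2", 4.0, 7.5),   # overlaps the tail of Speaker 1's turn
    ("Speaker 3", 8.0, 11.0),
]

def find_crosstalk(segments):
    """Return (speaker_a, speaker_b, overlap_sec) for overlapping segments."""
    overlaps = []
    ordered = sorted(segments, key=lambda s: s[1])
    for a, b in zip(ordered, ordered[1:]):
        if b[1] < a[2]:  # next segment starts before this one ends
            overlaps.append((a[0], b[0], round(a[2] - b[1], 2)))
    return overlaps

print(find_crosstalk(segments))  # [('Speaker 1', 'Speaker 2', 0.2)]
```

Flagging these overlapping stretches tells you exactly where to double-check attribution during review.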

Specify the Number of Speakers

If you know how many people will be in the recording, tell Sonicribe. The expected speaker count helps the clustering algorithm produce cleaner results.

Scenario | Expected Speakers
1-on-1 meeting | 2
Small team standup | 3-5
Interview | 2-3
Panel discussion | 4-6
Lecture with Q&A | 2-10 (lecturer + questioners)
Unknown | Leave unspecified (auto-detect)
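Conceptually, the setting works like an override: a user-supplied count pins the number of clusters, and leaving it empty falls back to the algorithm's own estimate. The option names below (`identify_speakers`, `expected_speakers`) are illustrative, not Sonicribe's actual API:

```python
# Hypothetical configuration sketch; the option names are illustrative.
config = {
    "identify_speakers": True,
    "expected_speakers": None,   # None = auto-detect
}

def resolve_speaker_count(config, estimated):
    """Use the user-specified count when given, else the algorithm's estimate."""
    return config["expected_speakers"] or estimated

print(resolve_speaker_count(config, estimated=4))   # auto-detect -> 4
config["expected_speakers"] = 2
print(resolve_speaker_count(config, estimated=4))   # user override -> 2
```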

Consistent Speaking Positions

For in-person recordings, speaker identification works better when each person speaks from a consistent position relative to the microphone. If people move around the room or switch seats, the voice characteristics captured at different positions may vary enough to confuse the algorithm.

Use Cases for Speaker Identification

Meeting Transcription

The most common use case. Speaker identification transforms meeting transcripts from undifferentiated text into structured conversations. This makes it easy to:

  • Identify who committed to action items
  • Attribute decisions to specific people
  • Review what a particular person said across the meeting
  • Create accurate meeting minutes

Interviews

Journalists, researchers, and hiring managers conduct interviews where attributing quotes to the correct speaker is essential. Speaker identification ensures that interviewee statements are clearly separated from interviewer questions.

Legal Transcription

Legal transcription requires accurate attribution. Speaker identification labels each participant's statements, though for formal legal transcription, manual verification is always recommended.

Podcast Transcription

Multi-host podcasts and interview-format shows benefit from speaker labels. The transcript becomes a readable dialogue rather than a monologue.

Read more: How to Switch from Dragon to Sonicribe: Modern Alternative

Medical Consultations

When transcribing patient-provider conversations, speaker identification separates the provider's clinical observations from the patient's reported symptoms. This aids in creating structured clinical notes.

Focus Groups and Group Discussions

Research focus groups involve multiple participants. Speaker identification helps researchers track which participant made specific comments, which is essential for qualitative analysis.

Working with Speaker-Labeled Transcripts

Renaming Speakers

After transcription, Sonicribe's speaker labels are generic (Speaker 1, Speaker 2, etc.). You can rename these to actual participant names:

1. Open the transcript

2. Identify which speaker label corresponds to which person (listen to the first few segments if needed)

3. Rename the labels

Once renamed, every instance of that speaker label updates throughout the transcript.
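In a plain-text export, this rename is effectively a global substitution. A sketch with hypothetical names (keeping the trailing colon in the search string so "Speaker 1" never matches inside "Speaker 10"):

```python
# Renaming generic labels across a transcript, as a plain string substitution.
transcript = (
    "Speaker 1: We need to finalize the Q3 roadmap by Friday.\n"
    "Speaker 2: I think notifications should be our top priority.\n"
    "Speaker 1: Can we do a quick analysis to settle this?"
)

names = {"Speaker 1": "Alice", "Speaker 2": "Ben"}  # hypothetical participants
for generic, name in names.items():
    transcript = transcript.replace(generic + ":", name + ":")

print(transcript)
```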

Filtering by Speaker

Once speakers are labeled, you can filter the transcript, by eye or programmatically, to see only what a specific person said. This is useful for:

  • Extracting one person's action items
  • Reviewing a specific participant's contributions
  • Pulling quotes from a particular speaker
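Programmatically, the filter is a one-liner over parsed (speaker, text) pairs. A minimal sketch with hypothetical names:

```python
# Filter a parsed transcript down to a single speaker's turns.
turns = [
    ("Alice", "We need to finalize the Q3 roadmap by Friday."),
    ("Ben", "Notifications should be our top priority."),
    ("Alice", "Can we do a quick analysis to settle this?"),
]

def lines_for(speaker, turns):
    """Everything a single speaker said, in order."""
    return [text for who, text in turns if who == speaker]

print(lines_for("Alice", turns))
```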

Exporting Speaker-Labeled Transcripts

Export the transcript in your preferred format. Speaker labels are preserved in the export:

  • Plain text: Speaker names appear as prefixes before each segment
  • SRT/VTT subtitles: Speaker labels appear in subtitle entries
  • Copy to clipboard: Full labeled transcript ready to paste
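To make the SRT case concrete, here is a sketch of how labeled, timed segments map onto subtitle entries. The segments are hypothetical; the timestamp format (`HH:MM:SS,mmm`) and entry numbering follow the SubRip convention:

```python
# Building SRT entries from labeled, timed segments.
segments = [
    ("Speaker 1", 0.0, 3.5, "We need to finalize the Q3 roadmap."),
    ("Speaker 2", 3.8, 6.9, "Notifications should be top priority."),
]

def fmt(t):
    """Seconds -> SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(t * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

entries = []
for i, (who, start, end, text) in enumerate(segments, 1):
    entries.append(f"{i}\n{fmt(start)} --> {fmt(end)}\n{who}: {text}")

srt = "\n\n".join(entries)
print(srt)
```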

Accuracy and Limitations

What Speaker Identification Handles Well

  • Distinct voices: Speakers with clearly different vocal characteristics (e.g., male and female, different pitch ranges)
  • Turn-taking conversations: Discussions where one person speaks at a time
  • Good audio quality: Clean recordings with minimal background noise
  • Known speaker count: When you specify the number of speakers in advance

Where Speaker Identification Struggles

  • Similar voices: Two speakers with very similar vocal characteristics may be confused
  • Heavy crosstalk: Overlapping speech degrades accuracy
  • Background noise: Noise can be misidentified as a speaker or mask voice characteristics
  • Very short utterances: Single-word responses ("yes," "agreed") may be attributed to the wrong speaker
  • Many speakers (8+): Accuracy decreases as the number of speakers increases, especially with similar voice types

Accuracy Expectations

Scenario | Expected Accuracy
2 speakers, clear audio | 95%+
3-4 speakers, clear audio | 90-95%
5-6 speakers, clear audio | 85-90%
2 speakers, noisy audio | 85-90%
Heavy crosstalk | 70-80%
8+ speakers | 75-85%

These numbers represent segment-level accuracy: the percentage of speech segments attributed to the correct speaker. Even at 85% accuracy, the transcript is far more useful than one without any speaker labels.
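Segment-level accuracy is simply the fraction of segments whose predicted speaker matches a hand-checked reference. A toy sketch with made-up labels:

```python
# Segment-level accuracy: fraction of segments attributed to the right speaker.
predicted = ["Speaker 1", "Speaker 2", "Speaker 1", "Speaker 3", "Speaker 1"]
reference = ["Speaker 1", "Speaker 2", "Speaker 2", "Speaker 3", "Speaker 1"]

correct = sum(p == r for p, r in zip(predicted, reference))
accuracy = correct / len(reference)
print(f"{accuracy:.0%}")  # 4 of 5 segments correct -> 80%
```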

Speaker Identification vs Speaker Verification

It is worth clarifying the distinction between two related but different capabilities:

Read more: How to Switch from Otter.ai to Sonicribe: Migration Guide
Speaker identification (what Sonicribe does): Determines how many distinct speakers are in a recording and labels their segments. It does not know who the speakers are by name. It assigns generic labels that you rename manually.

Speaker verification: Compares a voice against a known voice profile to verify identity. This requires pre-registered voice profiles and is used in security contexts (voice-based authentication). Sonicribe does not do speaker verification.

Privacy and Speaker Identification

Speaker identification in Sonicribe processes entirely on your Mac. No voice data is sent to a server, and no voice profiles are stored in a cloud database. The speaker analysis happens as part of the local transcription pipeline and produces labels that exist only in your transcript file.

This is a meaningful privacy advantage over cloud-based speaker identification services, which analyze your participants' voice characteristics on external servers. With Sonicribe, your meeting participants' voice data never leaves the room (or your device, in the case of virtual meetings).

Combining Speaker Identification with Other Features

Speaker identification works alongside Sonicribe's other features:

  • System audio capture + speaker identification: Transcribe virtual meetings with labeled speakers
  • Custom vocabulary + speaker identification: Technical terms are recognized correctly while speakers are labeled
  • Formatting modes + speaker identification: Output structured, labeled conversations in your preferred format

The combination of system audio capture and speaker identification is particularly powerful for meeting transcription. You capture the meeting audio silently, transcribe it locally, and get a structured, speaker-labeled transcript without any cloud service involvement.

Get Started with Speaker Identification

Download Sonicribe and enable speaker identification for your next multi-person recording. The free tier gives you 10,000 words per week to test the feature with real meetings, interviews, or group discussions.

Turn on "Identify Speakers" in settings, record your next meeting, and see the difference between a wall of text and a conversation with names attached. You will never go back to unlabeled transcripts.


Ready to transform your workflow?

Join thousands of professionals using Sonicribe for fast, private, offline transcription.