As part of my ongoing Java explorations, I recently tackled a practical project that hit the sweet spot between business needs and cutting-edge tech. The challenge? Building a system to transcribe audio files across multiple languages with specific formatting requirements.
This wasn’t just about converting speech to text. The system needed to identify different speakers, handle six different languages, format numbers and dates properly, and flag sections that might need human review. Given these requirements, I decided to leverage Google Cloud’s Vertex AI with Gemini 1.5, all orchestrated through a Spring Boot application.
The Challenge: Beyond Simple Speech-to-Text
Most off-the-shelf transcription tools give you a wall of text and call it a day. Our requirements were much more nuanced:
- Multilingual Support: The system had to handle English plus five other languages without missing a beat.
- Speaker Diarization: We needed clear labels showing who was talking when (e.g., “Speaker A:”, “Speaker B:”), which is surprisingly hard to get right.
- Specific Formatting: Each transcript needed timestamps at speaker changes and language-appropriate number/date formats.
- Handling Imperfections: Real-world audio has issues - background noise, people talking over each other, mumbling. The system needed to handle these gracefully.
- Quality Assessment: We needed some way to flag transcriptions that might need human review.
Architecture: A Spring Boot & Google Cloud Symphony
I built a workflow that looks something like this:

- Audio Ingestion & Preparation: The backend receives audio files and converts them to FLAC format. I chose FLAC because it preserves audio quality while keeping file sizes manageable.
- Cloud Storage: These FLAC files get uploaded to Google Cloud Storage. This step is crucial - it lets Gemini access potentially large audio files without timing out or hitting memory limits.
- Vertex AI & Gemini 1.5: Our Spring Boot app calls Vertex AI, pointing Gemini to the audio file’s location. The magic happens in the prompt we send along with this request (more on that in a bit).
- Processing & Storage: Once Gemini does its thing, we parse the response, add our own confidence scoring, and store everything in our database.
- Notification/Feedback: For user-triggered transcriptions, we send back success/failure notifications.
Using GCS as the middleman was a bit of extra work, but it paid off by making the system more robust when handling larger files.
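To make the GCS step concrete, here’s a minimal sketch of the upload using the google-cloud-storage client. The bucket name, object naming scheme, and the assumption that the converted FLAC bytes are already in memory are all illustrative, not the exact production code.

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class AudioUploader {

    private final Storage storage = StorageOptions.getDefaultInstance().getService();

    /**
     * Uploads a converted FLAC file to GCS and returns its gs:// URI,
     * which is what we later hand to Gemini instead of raw audio bytes.
     * Bucket and object names here are placeholders.
     */
    public String uploadFlac(byte[] flacBytes, String fileName) {
        BlobId blobId = BlobId.of("my-transcription-audio", "audio/" + fileName);
        BlobInfo blobInfo = BlobInfo.newBuilder(blobId)
                .setContentType("audio/flac")
                .build();
        storage.create(blobInfo, flacBytes);
        return String.format("gs://%s/%s", blobId.getBucket(), blobId.getName());
    }
}
```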
Harnessing Gemini 1.5
Gemini 1.5’s massive context window and multimodal capabilities made it perfect for handling audio files directly via GCS URIs.
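As a rough sketch of what that call looks like with the Vertex AI Java SDK (project, region, and model name are placeholders, and the exact helper methods may differ slightly between SDK versions):

```java
import com.google.cloud.vertexai.VertexAI;
import com.google.cloud.vertexai.api.GenerateContentResponse;
import com.google.cloud.vertexai.generativeai.ContentMaker;
import com.google.cloud.vertexai.generativeai.GenerativeModel;
import com.google.cloud.vertexai.generativeai.PartMaker;
import com.google.cloud.vertexai.generativeai.ResponseHandler;

public class TranscriptionClient {

    public String transcribe(String gcsUri, String prompt) throws Exception {
        // try-with-resources closes the underlying gRPC channel when done
        try (VertexAI vertexAi = new VertexAI("my-gcp-project", "us-central1")) {
            GenerativeModel model = new GenerativeModel("gemini-1.5-pro", vertexAi);

            // Point Gemini at the FLAC file in GCS instead of sending raw audio bytes
            GenerateContentResponse response = model.generateContent(
                    ContentMaker.fromMultiModalData(
                            PartMaker.fromMimeTypeAndData("audio/flac", gcsUri),
                            prompt));

            return ResponseHandler.getText(response);
        }
    }
}
```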
Prompt Engineering for Precision
This was the trickiest part of the whole project. Just saying “hey, transcribe this” wasn’t going to cut it. After many iterations, I landed on this prompt structure:
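The exact wording evolved over many iterations, but a simplified sketch of the structure looked roughly like this. The {language} placeholder is filled in at runtime, and the specific conventions shown (timestamp format, the [inaudible] marker) are illustrative rather than a verbatim copy of the production prompt:

```
Transcribe the audio file into {language}.

Formatting rules:
- Label each speaker as "Speaker A:", "Speaker B:", etc., and keep labels consistent.
- Insert a timestamp at every speaker change.
- Write numbers and dates using the conventions of {language}.

Handling imperfections:
- If a word or phrase is unclear, mark it as [inaudible] rather than guessing.
- If speakers overlap, transcribe the dominant speaker and note the overlap.

Return only the formatted transcript, with no additional commentary.
```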
This prompt spells out exactly what we need:
- The target language (dynamically inserted at runtime)
- How to format timestamps, speaker labels, numbers, and dates
- Clear instructions for handling tricky situations like unclear speech
Navigating Safety Settings
I hit an unexpected roadblock with Vertex AI’s default safety settings. They’re designed to prevent harmful content generation, but in our case, they sometimes flagged legitimate business conversations containing certain keywords.
After some trial and error, I identified which safety categories we could safely disable for our specific use case. This required careful consideration - we needed the full, unfiltered transcription, but we were working with known audio sources in a controlled environment.
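As a hedged example, configuring this in the Java SDK looks roughly like the following. Which categories you relax, and how far, should come out of your own review; the categories shown here are illustrative, and the builder API may vary slightly across SDK versions.

```java
import java.util.List;

import com.google.cloud.vertexai.VertexAI;
import com.google.cloud.vertexai.api.HarmCategory;
import com.google.cloud.vertexai.api.SafetySetting;
import com.google.cloud.vertexai.generativeai.GenerativeModel;

public class ModelFactory {

    public GenerativeModel transcriptionModel(VertexAI vertexAi) {
        // Relax only the categories that produced false positives on
        // legitimate business audio; leave everything else at the defaults.
        List<SafetySetting> safetySettings = List.of(
                SafetySetting.newBuilder()
                        .setCategory(HarmCategory.HARM_CATEGORY_HARASSMENT)
                        .setThreshold(SafetySetting.HarmBlockThreshold.BLOCK_NONE)
                        .build(),
                SafetySetting.newBuilder()
                        .setCategory(HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT)
                        .setThreshold(SafetySetting.HarmBlockThreshold.BLOCK_NONE)
                        .build());

        return new GenerativeModel.Builder()
                .setModelName("gemini-1.5-pro")
                .setVertexAi(vertexAi)
                .setSafetySettings(safetySettings)
                .build();
    }
}
```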
Measuring Confidence: A Pragmatic Approach
One challenge with LLMs is they don’t give you a simple “I’m 87% confident in this transcription” score. However, Gemini does return log probabilities (avg_logprobs) for the generated tokens, which provide some insight into the model’s internal confidence.
I cobbled together a basic confidence scoring system:
- Extract the avg_logprobs from the API response
- Set a threshold based on experimentation (around -0.4)
- If the average log probability falls below this threshold, flag that transcript for human review
It’s not perfect, but it gave us a practical way to identify potentially problematic transcriptions that might need a second pair of eyes.
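A stripped-down sketch of that check is below. The threshold is the value we landed on through experimentation, and it assumes the SDK version in use exposes the candidate’s avg_logprobs field; if it doesn’t, the same value can be read from the raw JSON response.

```java
import com.google.cloud.vertexai.api.GenerateContentResponse;

public class ConfidenceScorer {

    // Chosen through experimentation; transcripts scoring below this go to review
    private static final double LOGPROB_THRESHOLD = -0.4;

    /**
     * Returns true when the transcript should be flagged for human review.
     * Assumes the response candidate exposes its average log probability.
     */
    public boolean needsHumanReview(GenerateContentResponse response) {
        if (response.getCandidatesCount() == 0) {
            return true; // nothing came back at all, definitely review
        }
        double avgLogprob = response.getCandidates(0).getAvgLogprobs();
        return avgLogprob < LOGPROB_THRESHOLD;
    }
}
```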
Implementation: Async vs. Batch Processing
To handle different use cases, I implemented two processing modes:
- Asynchronous Processing: For user-triggered transcriptions, we process things in the background using Spring’s @Async annotation. The user gets an immediate acknowledgment, and we notify them when the job completes.
- Batch Processing: For processing backlogs of audio files, I built a scheduled job using Spring’s @Scheduled annotation. This runs during off-hours, picking up unprocessed files and working through them methodically.
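A condensed sketch of the two entry points follows. The collaborators (TranscriptionService, AudioFileRepository, NotificationService, AudioFile) and the 2 AM cron schedule are invented for illustration, and @EnableAsync/@EnableScheduling still need to be turned on in a configuration class.

```java
import java.util.List;

import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

@Service
public class TranscriptionJobService {

    private final TranscriptionService transcriptionService;   // wraps the Vertex AI call
    private final AudioFileRepository audioFileRepository;      // tracks unprocessed files
    private final NotificationService notificationService;      // notifies the requesting user

    public TranscriptionJobService(TranscriptionService transcriptionService,
                                   AudioFileRepository audioFileRepository,
                                   NotificationService notificationService) {
        this.transcriptionService = transcriptionService;
        this.audioFileRepository = audioFileRepository;
        this.notificationService = notificationService;
    }

    // User-triggered path: returns immediately, work continues on a background thread
    @Async
    public void transcribeAsync(String gcsUri, String language, String userId) {
        try {
            transcriptionService.transcribe(gcsUri, language);
            notificationService.notifySuccess(userId, gcsUri);
        } catch (Exception e) {
            notificationService.notifyFailure(userId, gcsUri, e.getMessage());
        }
    }

    // Batch path: runs nightly and drains the backlog of unprocessed files
    @Scheduled(cron = "0 0 2 * * *")
    public void transcribeBacklog() {
        List<AudioFile> pending = audioFileRepository.findUnprocessed();
        for (AudioFile file : pending) {
            transcriptionService.transcribe(file.getGcsUri(), file.getLanguage());
            audioFileRepository.markProcessed(file.getId());
        }
    }
}
```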
Learnings and Final Thoughts
This project taught me a ton about applying LLMs to practical business problems. My key takeaways:
- Gemini 1.5 is a beast for these kinds of tasks, especially when working with audio through GCS integration.
- Prompt engineering makes or breaks you. The difference between a useless transcript and a perfect one often comes down to how clearly you communicate your requirements.
- Safety settings matter. Understanding what they do and how to configure them for your specific use case is crucial.
- Imperfect signals can still be useful. Even though avg_logprobs isn’t a perfect confidence metric, it gave us a practical way to implement human-in-the-loop review.
- Spring Boot’s flexibility shines when implementing both async and batch processing patterns to handle different use cases.
Wrangling all these technologies together was challenging but incredibly satisfying. There’s something deeply rewarding about seeing a complex system like this come together to solve a real business problem.