<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Servin Osmanov</title>
    <description>The latest articles on Forem by Servin Osmanov (@servin_osmanov).</description>
    <link>https://forem.com/servin_osmanov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3583111%2Fb3feb8b3-8b2e-4b44-96f8-41d587423dbf.jpg</url>
      <title>Forem: Servin Osmanov</title>
      <link>https://forem.com/servin_osmanov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/servin_osmanov"/>
    <language>en</language>
    <item>
      <title>My Journey Improving a TTS Model for the Crimean Tatar Language</title>
      <dc:creator>Servin Osmanov</dc:creator>
      <pubDate>Fri, 07 Nov 2025 19:42:08 +0000</pubDate>
      <link>https://forem.com/servin_osmanov/my-journey-improving-a-tts-model-for-the-crimean-tatar-language-53f7</link>
      <guid>https://forem.com/servin_osmanov/my-journey-improving-a-tts-model-for-the-crimean-tatar-language-53f7</guid>
      <description>&lt;p&gt;When you work with machine learning, success often hides behind hours of frustration, countless errors, and broken pipelines. This project — improving the &lt;strong&gt;Crimean Tatar TTS (Text-to-Speech)&lt;/strong&gt; model — was exactly that kind of journey. What started as a small experiment to fine-tune an existing model turned into a full-scale debugging adventure that taught me more about data integrity, audio processing, and patience than any tutorial could.&lt;/p&gt;

&lt;p&gt;In my previous article, &lt;a href="https://dev.to/servin_osmanov/why-language-tech-matters-developing-ai-tools-for-small-languages-583h"&gt;Why Language Tech Matters: Developing AI Tools for Small Languages&lt;/a&gt;, I explored how AI can empower low-resource languages. This piece continues that journey with a hands-on look at improving TTS models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4vpl5k2r345s9eueqho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4vpl5k2r345s9eueqho.png" alt="Sevil tts model improvement" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Starting Point: A Model That Worked — but Only Partially
&lt;/h2&gt;

&lt;p&gt;My goal was simple: improve the voice model “&lt;strong&gt;Sevil&lt;/strong&gt;” for the Crimean Tatar language. I had already worked with similar voices — “&lt;strong&gt;Arslan&lt;/strong&gt;” and “&lt;strong&gt;Abibullah&lt;/strong&gt;” — using Hugging Face datasets like &lt;code&gt;speech-uk/tts-crh-arslan&lt;/code&gt; and &lt;code&gt;speech-uk/tts-crh-abibullah&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The first training attempts went well — &lt;strong&gt;loss around 0.283&lt;/strong&gt;, results acceptable. But something didn’t add up. The dataset had &lt;strong&gt;1,566 audio files&lt;/strong&gt;, yet the training logs showed only &lt;strong&gt;415 were being used&lt;/strong&gt; — about &lt;strong&gt;26.5%&lt;/strong&gt; of the total.&lt;br&gt;&lt;br&gt;
That meant &lt;strong&gt;almost three-quarters of my data was silently ignored&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At first, I thought it was a fluke. Then I realized it was a &lt;strong&gt;systemic problem&lt;/strong&gt; in the Hugging Face &lt;code&gt;datasets&lt;/code&gt; API when loading compressed audio from Parquet files.  &lt;/p&gt;
&lt;h2&gt;
  
  
  Diagnosing the Problem
&lt;/h2&gt;

&lt;p&gt;When I loaded the dataset through &lt;code&gt;datasets.load_dataset()&lt;/code&gt;, most files threw errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: "Error while decoding audio"
Error: "Audio file appears to be empty"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That didn’t make sense — the audio bytes were clearly present in the Parquet files.&lt;br&gt;&lt;br&gt;
After checking manually with &lt;code&gt;pandas.read_parquet()&lt;/code&gt;, I confirmed the data was there. The problem wasn’t in the files — it was in how the decoder handled them.&lt;/p&gt;

&lt;p&gt;That’s when I realized: the &lt;strong&gt;datasets API couldn’t decode raw audio bytes correctly&lt;/strong&gt;. The data was fine, but the pipeline was broken.&lt;/p&gt;
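&lt;p&gt;In hindsight the failure mode is easy to demonstrate: the Parquet rows held headerless PCM samples, so audio decoders had no container signature to sniff. A minimal sketch of that idea (the &lt;code&gt;sniff_container&lt;/code&gt; helper is mine, not part of any library):&lt;/p&gt;

```python
# Illustrates why generic audio decoders rejected the bytes:
# container formats announce themselves with a magic number,
# while raw PCM has no header at all.
def sniff_container(data: bytes) -> str:
    """Best-effort guess of the container wrapping an audio blob."""
    if data.startswith(b"RIFF"):
        return "wav"
    if data.startswith(b"fLaC"):
        return "flac"
    if data.startswith(b"OggS"):
        return "ogg"
    return "raw/unknown"
```

&lt;p&gt;Running this over the extracted bytes makes the "empty / undecodable" errors unsurprising: every blob falls through to the raw/unknown case.&lt;/p&gt;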
&lt;h2&gt;
  
  
  Turning Bytes into Sound
&lt;/h2&gt;

&lt;p&gt;At this point, I tried everything.&lt;br&gt;&lt;br&gt;
I extracted the bytes manually, saved them as &lt;code&gt;.raw&lt;/code&gt; files, and tried to load them with &lt;code&gt;librosa&lt;/code&gt; and &lt;code&gt;soundfile&lt;/code&gt;. Nothing worked.&lt;br&gt;&lt;br&gt;
Without proper &lt;strong&gt;metadata&lt;/strong&gt; (sample rate, channels, encoding), the files were unreadable.&lt;/p&gt;

&lt;p&gt;Eventually, I discovered the solution: &lt;strong&gt;FFmpeg&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By using known dataset parameters —&lt;br&gt;&lt;br&gt;
&lt;code&gt;sample rate: 16000 Hz&lt;/code&gt;, &lt;code&gt;channels: mono&lt;/code&gt;, &lt;code&gt;format: PCM 16-bit little-endian&lt;/code&gt; —&lt;br&gt;&lt;br&gt;
I could convert all &lt;code&gt;.raw&lt;/code&gt; files into clean &lt;code&gt;.wav&lt;/code&gt; audio.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ffmpeg &lt;span class="nt"&gt;-f&lt;/span&gt; s16le &lt;span class="nt"&gt;-ar&lt;/span&gt; 16000 &lt;span class="nt"&gt;-ac&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;-i&lt;/span&gt; sevil_0000.raw &lt;span class="se"&gt;\&lt;/span&gt;
       sevil_0000.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
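&lt;p&gt;For the same fixed parameters, the conversion can also be done without FFmpeg using Python’s standard-library &lt;code&gt;wave&lt;/code&gt; module. A minimal sketch, assuming the dataset’s 16 kHz mono s16le layout (&lt;code&gt;raw_to_wav&lt;/code&gt; is a name I made up, not project code):&lt;/p&gt;

```python
import wave

def raw_to_wav(raw_path: str, wav_path: str,
               sample_rate: int = 16000, channels: int = 1) -> None:
    """Wrap headerless s16le PCM bytes in a WAV container."""
    with open(raw_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(channels)     # mono
        w.setsampwidth(2)            # 2 bytes per sample = 16-bit
        w.setframerate(sample_rate)  # 16 000 Hz
        w.writeframes(pcm)
```

&lt;p&gt;Unlike FFmpeg this does no resampling or format conversion — it only prepends a correct RIFF header — but that is exactly what headerless PCM needs.&lt;/p&gt;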



&lt;p&gt;And just like that — &lt;strong&gt;1,566 files successfully converted&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
No corruption. No decoding errors. 100% validation success.&lt;/p&gt;
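&lt;p&gt;The validation pass itself is worth automating. A sketch of the kind of check I mean, using only the standard library (&lt;code&gt;validate_wavs&lt;/code&gt; is a hypothetical helper, not the project’s actual script):&lt;/p&gt;

```python
import wave
from pathlib import Path

def validate_wavs(folder: str, expected_rate: int = 16000) -> list:
    """Return paths of WAV files that are empty, unreadable, or off-spec."""
    bad = []
    for path in sorted(Path(folder).glob("*.wav")):
        try:
            with wave.open(str(path), "rb") as w:
                ok = (w.getnframes() > 0
                      and w.getframerate() == expected_rate)
        except wave.Error:
            ok = False  # not a valid RIFF/WAV file
        if not ok:
            bad.append(str(path))
    return bad
```

&lt;p&gt;An empty return list is the "100% validation success" condition; anything else pinpoints exactly which files need another look.&lt;/p&gt;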

&lt;p&gt;It was the moment of breakthrough — the kind that makes you sit back, smile, and realize you’ve just solved a problem that haunted you for two days straight.&lt;/p&gt;
&lt;h2&gt;
  
  
  Training the Model — Again
&lt;/h2&gt;

&lt;p&gt;With clean audio finally ready, I retrained the Sevil model from scratch, this time using &lt;strong&gt;all 1,566 recordings&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Training setup (based on my previous configs for “Arslan”):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;num_train_epochs = 500
batch_size = 4
learning_rate = 1e-4
fp16 = True
warmup_steps = 2000
save_steps = 2000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The progress was promising:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loss dropped from &lt;strong&gt;1.14 → 0.80 → 0.50 → 0.27&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The voice quality improved with every epoch
&lt;/li&gt;
&lt;li&gt;And then… it crashed at &lt;strong&gt;78%&lt;/strong&gt; completion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The culprit? A familiar one for SpeechT5 users —&lt;br&gt;&lt;br&gt;
&lt;code&gt;RuntimeError: torch.cat(): expected a non-empty list of Tensors&lt;/code&gt;&lt;br&gt;&lt;br&gt;
It turned out to be a bug in &lt;strong&gt;guided_attention_loss&lt;/strong&gt;, a component that sometimes fails with uneven sequence lengths.&lt;/p&gt;
&lt;h2&gt;
  
  
  Fixing the Crash
&lt;/h2&gt;

&lt;p&gt;Instead of starting over, I resumed training from the last checkpoint (step 16,000) and simply &lt;strong&gt;disabled guided attention&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model.config.use_guided_attention_loss = False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That one line saved the project.&lt;br&gt;&lt;br&gt;
Training resumed, completed &lt;strong&gt;98% of the full cycle&lt;/strong&gt;, and stabilized with a final &lt;strong&gt;loss of 0.267&lt;/strong&gt; — a small numerical improvement, but a big qualitative one.&lt;br&gt;&lt;br&gt;
The model became more consistent and robust across new data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Data Used&lt;/th&gt;
&lt;th&gt;Loss&lt;/th&gt;
&lt;th&gt;Test success&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sevil v1&lt;/td&gt;
&lt;td&gt;415 files (26.5%)&lt;/td&gt;
&lt;td&gt;0.276&lt;/td&gt;
&lt;td&gt;47%&lt;/td&gt;
&lt;td&gt;Trained on partial data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sevil v2&lt;/td&gt;
&lt;td&gt;1,566 files (100%)&lt;/td&gt;
&lt;td&gt;0.267&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;Fully trained, stable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The difference wasn’t just in metrics — it was in &lt;strong&gt;confidence&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Sevil v2 generalized better, produced smoother intonation, and maintained pronunciation consistency even on unseen words.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Never trust good metrics without checking data coverage.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
My “good” baseline was trained on just 26% of the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FFmpeg is a lifesaver.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
It solved what specialized libraries couldn’t.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Validate every single file.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Automation saved hours of manual checking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Guided attention loss is optional.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Sometimes stability matters more than theoretical accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document everything.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
By keeping track of every attempt — successful or not — I could understand the full story, not just the happy ending.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why This Project Matters
&lt;/h2&gt;

&lt;p&gt;For me, this wasn’t just about fixing one dataset. It was about enabling a &lt;strong&gt;low-resource language&lt;/strong&gt; — Crimean Tatar — to have a better voice in the digital world.&lt;br&gt;&lt;br&gt;
Improving the Sevil model means clearer pronunciation, smoother prosody, and better accessibility for learners and native speakers alike.&lt;/p&gt;

&lt;p&gt;And for anyone working with custom TTS datasets:&lt;br&gt;&lt;br&gt;
Check your files, validate your data, and don’t give up when your model crashes at 78%.&lt;br&gt;&lt;br&gt;
That crash might be the best teacher you’ll ever have.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; &lt;em&gt;Servin Osmanov&lt;/em&gt;&lt;br&gt;&lt;br&gt;
Lead Fullstack Python / ReactJS Engineer&lt;br&gt;&lt;br&gt;
AI researcher and TTS developer for low-resource languages&lt;br&gt;&lt;br&gt;
Project: &lt;code&gt;servinosmanov/tts-crh-sevil-fixed&lt;/code&gt; on Hugging Face&lt;/p&gt;

</description>
      <category>ai</category>
      <category>huggingface</category>
      <category>programming</category>
      <category>nlp</category>
    </item>
    <item>
      <title>My Journey as a Judge at CBIT Hacktoberfest 2025 — Lessons from the Other Side of the Table</title>
      <dc:creator>Servin Osmanov</dc:creator>
      <pubDate>Thu, 30 Oct 2025 18:29:37 +0000</pubDate>
      <link>https://forem.com/servin_osmanov/my-journey-as-a-judge-at-cbit-hacktoberfest-2025-lessons-from-the-other-side-of-the-table-78e</link>
      <guid>https://forem.com/servin_osmanov/my-journey-as-a-judge-at-cbit-hacktoberfest-2025-lessons-from-the-other-side-of-the-table-78e</guid>
      <description>&lt;p&gt;When I first started attending hackathons, I was always the one coding, building, and pitching. This year, for the first time, I got to experience the other side — as a &lt;strong&gt;judge&lt;/strong&gt; at the &lt;a href="https://cbit-hacktoberfest25.devpost.com/?ref_feature=challenge&amp;amp;ref_medium=discover#prizes" rel="noopener noreferrer"&gt;CBIT Hacktoberfest Hackathon 2025&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It was an incredible 24-hour online event organized by the &lt;strong&gt;CBIT Open Source Community&lt;/strong&gt;, celebrating open-source culture and collaboration as part of the global &lt;strong&gt;Hacktoberfest&lt;/strong&gt;. The event gathered hundreds of students from universities across the world — developers, designers, and dreamers ready to learn, code, and share.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Event Was About
&lt;/h2&gt;

&lt;p&gt;Hacktoberfest is all about &lt;strong&gt;celebrating open source&lt;/strong&gt;, and this hackathon perfectly captured that spirit. The CBIT edition — now in its 8th year — encouraged participants to collaborate on innovative solutions using modern technologies while contributing to the open-source ecosystem.&lt;/p&gt;

&lt;p&gt;Teams of &lt;strong&gt;3–5 members&lt;/strong&gt; worked virtually through &lt;strong&gt;Discord&lt;/strong&gt;, tackling real-world challenges in just 24 hours. Despite the distance and time zones, the sense of connection and creativity was palpable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Judging Framework
&lt;/h2&gt;

&lt;p&gt;Every project was evaluated based on a clear, balanced set of criteria (total: 50 points):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Points&lt;/th&gt;
&lt;th&gt;What It Measured&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Innovation &amp;amp; Creativity&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Original ideas and new approaches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Collaboration&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Teamwork, communication, and shared contribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Implementation&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;Technical soundness, scalability, and efficiency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Usability and user experience&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Presentation&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Clarity, storytelling, and delivery&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As a judge, I was looking for that spark — projects that combined &lt;strong&gt;solid implementation with a clear purpose&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Saw and Learned
&lt;/h2&gt;

&lt;p&gt;Some projects were technically ambitious, pushing the boundaries of what’s possible in 24 hours. Others were beautifully simple, focusing on accessibility or social impact. A few used &lt;strong&gt;AI and automation&lt;/strong&gt; in creative ways that genuinely surprised me.&lt;/p&gt;

&lt;p&gt;But beyond the technology, what stood out was the &lt;strong&gt;teamwork&lt;/strong&gt;. Many participants were strangers before the event — yet they managed to code, design, and present together like long-time collaborators. That’s the magic of hackathons.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Judging Experience
&lt;/h2&gt;

&lt;p&gt;Our panel included engineers and leaders from &lt;strong&gt;Electronic Arts, JP Morgan Chase, Oracle, Deliveroo, ServiceNow, EY&lt;/strong&gt;, and more. Being part of such a diverse group gave every discussion depth. Each judge viewed “innovation” slightly differently — and that variety made our evaluations richer.&lt;/p&gt;

&lt;p&gt;I appreciated the effort teams put into their presentations. Even short demos told full stories — from the problem statement to the final prototype. That’s where creativity met clarity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Innovation isn’t always about technology.&lt;/strong&gt; Sometimes it’s about empathy — understanding who you’re helping and why.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Good presentation matters.&lt;/strong&gt; The best projects told their stories with confidence and focus.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community is everything.&lt;/strong&gt; Open-source hackathons like this show how technology connects us beyond borders.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a developer, I’ve always believed in continuous learning. Judging this event reaffirmed that growth happens in many forms — whether you’re writing code or evaluating it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Hackathons Matter More Than Ever
&lt;/h2&gt;

&lt;p&gt;In a world where remote collaboration is the new normal, hackathons remain one of the most human ways to innovate. They combine &lt;strong&gt;competition, creativity, and community&lt;/strong&gt; into one shared experience. For students and professionals alike, they’re the perfect place to experiment and grow.&lt;/p&gt;

&lt;p&gt;I’m grateful to have played a small role in this journey — to witness ideas take shape, to learn from participants, and to see firsthand how open source continues to inspire a new generation of builders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; &lt;br&gt;
&lt;em&gt;Servin Osmanov&lt;/em&gt;&lt;br&gt;&lt;br&gt;
Lead Software Engineer @ Anvaya Solutions Inc.&lt;br&gt;&lt;br&gt;
Judge at &lt;a href="https://cbit-hacktoberfest25.devpost.com/?ref_feature=challenge&amp;amp;ref_medium=discover#prizes" rel="noopener noreferrer"&gt;CBIT Hacktoberfest Hackathon 2025&lt;/a&gt;&lt;/p&gt;

</description>
      <category>hackathon</category>
      <category>webdev</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why Language Tech Matters: Developing AI Tools for Small Languages</title>
      <dc:creator>Servin Osmanov</dc:creator>
      <pubDate>Sat, 25 Oct 2025 15:45:27 +0000</pubDate>
      <link>https://forem.com/servin_osmanov/why-language-tech-matters-developing-ai-tools-for-small-languages-583h</link>
      <guid>https://forem.com/servin_osmanov/why-language-tech-matters-developing-ai-tools-for-small-languages-583h</guid>
      <description>&lt;p&gt;In a world where artificial intelligence is transforming how we communicate, the survival of small languages depends not just on cultural passion — but on technology.&lt;br&gt;&lt;br&gt;
While English, Chinese, or Spanish dominate the digital space, thousands of smaller languages remain digitally invisible. Without online presence, data, or digital tools, these languages risk extinction in the 21st century.  &lt;/p&gt;

&lt;p&gt;As a software engineer and cultural advocate, I’ve spent years developing digital tools for the &lt;strong&gt;Crimean Tatar&lt;/strong&gt; language — a Turkic minority language spoken by fewer than 300,000 people worldwide. Through this work, I’ve learned that the intersection of &lt;strong&gt;AI and linguistics&lt;/strong&gt; isn’t just a research topic; it’s a lifeline for cultural identity.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Challenge: The Technology Gap for Small Languages&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Mainstream AI models—from GPT-based chatbots to translation engines—are trained primarily on high-resource languages. This creates a serious imbalance: while global communication becomes easier for some, others are left behind.&lt;/p&gt;

&lt;p&gt;For small linguistic communities, the lack of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;digital corpora,&lt;/li&gt;
&lt;li&gt;standardized spelling systems,&lt;/li&gt;
&lt;li&gt;and high-quality training datasets
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;makes it nearly impossible to integrate their languages into modern tools like voice assistants, translation services, or educational apps.&lt;/p&gt;

&lt;p&gt;When a language isn’t “machine-readable,” it risks becoming irrelevant in the digital world — even if it’s still spoken at home.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Our Journey: Building Tools for the Crimean Tatar Language&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In 2018, our team launched a series of educational and linguistic tools under the &lt;strong&gt;Qırımtatar Lugatı&lt;/strong&gt; (Crimean Tatar Dictionary) project:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://play.google.com/store/apps/details?id=com.anaurt.lugat" rel="noopener noreferrer"&gt;Android app on Google Play&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://apps.apple.com/us/app/q%C4%B1r%C4%B1mtatar-lu%C4%9Fat%C4%B1/id1457493656" rel="noopener noreferrer"&gt;iOS app on the App Store&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our mission was to give people an accessible and modern way to learn, use, and preserve the Crimean Tatar language.  &lt;/p&gt;

&lt;p&gt;The apps quickly grew into more than just dictionaries — they became &lt;strong&gt;living platforms&lt;/strong&gt; connecting speakers, learners, and educators. Yet, users wanted more: context-based search, on-the-fly translation, and natural voice pronunciation.&lt;br&gt;&lt;br&gt;
This inspired us to integrate &lt;strong&gt;artificial intelligence&lt;/strong&gt; to make our tools more flexible and intelligent.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Next Step: Bringing AI to Minority Languages&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We are now integrating &lt;strong&gt;AI and machine learning&lt;/strong&gt; into the Crimean Tatar dictionary project to expand its capabilities:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🧠 &lt;strong&gt;AI-assisted translation:&lt;/strong&gt; dynamic, real-time translation between Crimean Tatar, Turkish, English, and Ukrainian.
&lt;/li&gt;
&lt;li&gt;🔊 &lt;strong&gt;Speech recognition and synthesis:&lt;/strong&gt; allowing users to hear natural pronunciation and practice correct intonation.
&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;Adaptive learning:&lt;/strong&gt; using AI to personalize vocabulary lessons based on user behavior and progress.
&lt;/li&gt;
&lt;li&gt;🪶 &lt;strong&gt;Data-driven NLP foundation:&lt;/strong&gt; building scalable open datasets that will support future chatbots, voice assistants, and translation systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To achieve this, we’ve adopted and fine-tuned &lt;strong&gt;open-source speech synthesis models&lt;/strong&gt; such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://huggingface.co/facebook/mms-tts-crh" rel="noopener noreferrer"&gt;facebook/mms-tts-crh&lt;/a&gt;&lt;/strong&gt; — part of Meta’s &lt;em&gt;Massively Multilingual Speech&lt;/em&gt; project, providing a solid baseline for Crimean Tatar TTS.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://huggingface.co/spaces/robinhad/qirimtatar-tts" rel="noopener noreferrer"&gt;robinhad/qirimtatar-tts&lt;/a&gt;&lt;/strong&gt; — a community-driven model that helps us generate high-quality speech for educational and cultural content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By leveraging these models, we’re building an &lt;strong&gt;AI-powered TTS system&lt;/strong&gt; that brings Crimean Tatar audio resources to learners, teachers, and content creators for the first time.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Language is more than communication — it’s collective memory.&lt;br&gt;&lt;br&gt;
When a language disappears, so does a unique worldview and cultural identity.  &lt;/p&gt;

&lt;p&gt;AI gives us a chance to reverse that process. With the right tools, small languages can become visible online, connect their communities, and survive in the digital era.&lt;/p&gt;

&lt;p&gt;Projects like &lt;strong&gt;Qırımtatar Lugatı&lt;/strong&gt; show that you don’t need a giant corporation to make an impact.&lt;br&gt;&lt;br&gt;
A small, passionate team with the right technical vision can give a digital voice — literally — to those who’ve been silent for too long.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;How Developers and Researchers Can Help&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you’re a developer, linguist, or AI researcher, here are a few ways to make a difference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contribute to &lt;strong&gt;open-source datasets&lt;/strong&gt; for small or low-resource languages.
&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;multilingual NLP and TTS&lt;/strong&gt; initiatives on Hugging Face or GitHub.
&lt;/li&gt;
&lt;li&gt;Collaborate with communities and educators who are digitizing endangered languages.
&lt;/li&gt;
&lt;li&gt;Help localize educational and cultural apps into underrepresented languages.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every dataset, every model, and every contribution brings us closer to a more linguistically inclusive AI ecosystem.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Looking Ahead&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Our roadmap for 2025 includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full integration of &lt;strong&gt;AI-powered translation&lt;/strong&gt; within the Qırımtatar Lugatı apps.
&lt;/li&gt;
&lt;li&gt;Implementation of &lt;strong&gt;voice features&lt;/strong&gt; using our fine-tuned TTS models.
&lt;/li&gt;
&lt;li&gt;Launching an &lt;strong&gt;open API&lt;/strong&gt; for developers who want to build tools for Crimean Tatar or similar minority languages.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Technology alone won’t preserve culture — but it can &lt;strong&gt;amplify&lt;/strong&gt; it.&lt;br&gt;&lt;br&gt;
For small languages like Crimean Tatar, &lt;strong&gt;AI isn’t just a tool — it’s hope.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Servin Osmanov&lt;/em&gt; is a Senior Full-Stack Engineer and founder of the &lt;a href="https://qirim.online/" rel="noopener noreferrer"&gt;&lt;strong&gt;Qırım.Online&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://ana-yurt.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Ana-Yurt.Com&lt;/strong&gt;&lt;/a&gt; projects — initiatives focused on preserving Crimean Tatar culture and language through modern technology.&lt;br&gt;&lt;br&gt;
LinkedIn: &lt;a href="https://www.linkedin.com/in/servin-osmanov/" rel="noopener noreferrer"&gt;linkedin.com/in/servin-osmanov&lt;/a&gt;&lt;br&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/MrSerWin" rel="noopener noreferrer"&gt;github.com/MrSerWin&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>programming</category>
      <category>whisper</category>
    </item>
  </channel>
</rss>
