From DNA to DSA : How I Went From Studying Genomes To Understanding Algorithms!

Alfiya Fatima — Mon, 30 Dec 2024 19:33:26 +0000

Introduction

I am Alfiya Fatima, a third year engineering student and a Full-Stack Developer from Bengaluru.
Let's Connect on LinkedIn.

alfiyafatima09

Let's Begin :)

In 12th grade, I recall my lecturer telling, "The DNA in your body stretches 120 billion miles—twice the solar system's diameter!" Her words sparked my curiosity about decoding DNA, though I never imagined I’d someday use algorithms to analyze genomes.

How did I find myself here, exploring the world of Data Structures and Algorithms?

I was fortunate to attend the ACM India Winter School 2024 on Data Structures and Algorithms for Strings at Amrita Vishwa Vidyapeetham, Coimbatore. It was a remarkable experience, surrounded by brilliant peers and guided by top professors:

Prof. Venkatesh Raman (IMSc Chennai)
Prof. Sharma Thankachan (North Carolina State University)
Prof. Chirag Jain (IISc Bangalore).

With my friend Akash, I joined students from diverse backgrounds-undergraduates to PhDs-representing institutes across India. The professors’ ability to simplify complex concepts and spark curiosity turned each session into an exciting journey through the string algorithms.

This opportunity started with a suggestion from my senior, Ashutosh Pandey , who encouraged us to be a part of this.

What Did We Learn?

String Matching Algorithms
FFT-Based Algorithms
Suffix Trees and Arrays
Burrows-Wheeler Transform & FM Index
Document Retrieval & Range Minimum Queries
Applications in Computational Biology

Glimpse about the classes!

Have you ever encountered pattern matching?

Pattern matching is the process of finding a specific sequence or pattern within a larger set of data, like searching for a word in a text or a pattern in a dataset.

It’s a fundamental concept with wide applications, from powering search engines to solving complex problems in computational biology. Whether it’s finding relevant information on the web or identifying gene sequences in vast genomic datasets, pattern matching is key.

But what makes this so intriguing? How can we efficiently solve various challenges, such as finding all occurrences of a pattern in a text, identifying repeated patterns for genome assembly, enabling data compression, or detecting similarities in large datasets?

To answer these broader questions, let us first narrow down the problem: How do we find a specific pattern in a given text efficiently?

Naive Approach

The simplest way to find a pattern in a text is to compare it with every possible substring of the text. This is exactly what the Naive Algorithm does.

How It Works:

The algorithm compares the pattern with each substring of the text.
If it finds a match, great! If not, it moves on to the next one.

Example:
Text: "banana"
Pattern: "ana"

The algorithm compares the pattern "ana" with each possible substring of the text:
Compare "ana" with "ban" → No match.
Compare "ana" with "ana" → Match found!
Compare "ana" with "nan" → No match.
Compare "ana" with "ana" → Match found!

Is it efficient?

The time complexity of the Naive Algorithm is O(n × m), where:
n is the length of the text,
m is the length of the pattern.

Now, imagine you have a genomic sequence with 10 million base pairs (n = 10^7) and a pattern of 10 bases (m = 10). How many comparisons would the Naive Algorithm perform? The answer: 100 million comparisons.

What If There’s a Smarter Way to Match Patterns?

Have you ever wondered if there's a way to avoid checking positions that you already know won’t work? This is where the Knuth-Morris-Pratt (KMP) Algorithm comes in.

How does KMP help?
KMP preprocesses the pattern to create a prefix table, which stores information about how much of the pattern matches at different points. This helps skip unnecessary comparisons.

Example:
Text: "banana"
Pattern: "ana"
Prefix Table: [0, 0, 1]

The table helps KMP skip checks and match the pattern more efficiently.

Does it improve performance?
Yes! The time complexity of KMP is O(n + m), much faster than the Naive approach (O(n × m)).

But is it enough for large genomic data?

Is There Even More Efficient Way?

Ok read this:

Every Substring Is a Prefix of Some Suffix of a Text T

At first, this may seem like a simple idea, but in pattern matching algorithms, it's a powerful insight. When searching for a pattern in a text, every substring is a prefix of some suffix. Recognizing this allows us to optimize pattern matching, skipping unnecessary parts of the data and improving efficiency.

What is a Suffix Tree?

A Suffix Tree is a compressed tree structure that represents all the suffixes of a string.

For example, for the string "banana", the suffixes would be:

"banana"
"anana"
"nana"
"ana"
"na"
"a"

Each of these suffixes is represented as a node in the tree, and the tree is compressed, meaning that common prefixes among the suffixes are shared to save space.

Why Is This Important?

Efficient Pattern Matching: Once the suffix tree is built in O(n) time, searching for a pattern like “ana” takes just O(m) time, making it much faster than methods like the Naive or KMP algorithms.

Multiple Patterns: The tree allows you to search for multiple patterns at once and find the longest repeated substring in O(n) time.

How Does It Scale for Large Genomes?
For large genomes, like one with 10 million base pairs, building the tree takes O(n) time, and searching for a 10-base pattern takes only O(m) time, ensuring highly efficient searches.

Didn’t you find it fascinating how we managed to reduce the complexity?

It was one of my favorite parts in the class! Seeing how optimizing algorithms leads to massive performance improvements was truly rewarding and made the experience even more enjoyable.

Conclusion:

While some concepts were challenging, I embraced the uncertainty, trusting that with time, everything would fall into place. The journey through these algorithms wasn’t always easy, but each step forward deepened my understanding. I’m confident that as I continue to explore, the connections will become even clearer.

Heartfelt Gratitude

I am truly thankful to ACM India, Amrita Vishwa Vidyapeetham, and the amazing mentors who made this program possible. Their commitment to fostering an environment of learning, collaboration, and innovation has been a great source of inspiration.

A special thank you to Dr. K. Somasundaram for his humility and constant support throughout the winter school. He made sure we were comfortable and made the most of the experience.

The connections I made here is something I will always cherish. Collaborating, sharing ideas, and learning from each other made this journey even more memorable.

As the program came to a close, we received certificates and shared a final group photo to mark the end of this enriching experience.

Forem: Alfiya Fatima