Advice on using a header in text to re-organize a large dataset

Hannah Frank — Tue, 13 Aug 2019 22:15:46 +0000

Hi Dev Community,

I have over 100 files in the following structure (the FASTA format, which will be very familiar to those of you who spend time looking at genetic data):

Original File 1:

>Gene1_id1
GATCGATCCGA
ATGCAGTCCAG

>Gene2_id1
ATGCATGCAGC
ACTAGGCCACG
CCGTAGCGGAC

>Gene1_id2
TAGCTAGCAGT
TAGCTAGCCGA

Each of these ~100 files contains ~20,000 of these genes. The problem is that my files are organized such that Gene1 IDs are interleaved with Gene2 IDs.

For my analysis, I need all of my Gene1 IDs organized in one place. I will ideally end up with one file for each GeneX, like so:

Desired Final Gene1 File:

>Gene1_id1
GATCGATCCGA
ATGCAGTCCAG

>Gene1_id2
TAGCTAGCAGT
TAGCTAGCCGA

Sequence length varies between genes and between individuals within a gene, so every line after a header line and before the next header line needs to stay associated with that header.
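As a minimal sketch of that grouping rule, assuming Python and assuming that a header is any line starting with `>` (the function name `read_fasta_records` is made up for illustration, not taken from any library):

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta_records(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (header, sequence_lines) pairs from FASTA-style text.

    Assumption: a header is any line that starts with ">".
    Every non-blank line up to the next header belongs to that header.
    """
    header = None
    seq_lines: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank separator lines carry no data
        if line.startswith(">"):
            if header is not None:
                yield header, seq_lines  # the previous record is complete
            header = line[1:]  # drop the leading ">"
            seq_lines = []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        yield header, seq_lines  # the last record has no header after it
```

Because this is a generator, it can be fed an open file handle and will hold only one record in memory at a time, however many sequence lines each record has.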

My current solution has been to take each file and create a new file for each header line. So the first file creates three new files: one for >Gene1_id1, one for >Gene2_id1, and one for >Gene1_id2. From there, I was planning on re-organizing to suit my needs.

The problem with the above approach is that it has created ~800,000 similarly named files that are killing my computer. There must be a better way.
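One possible alternative, sketched in Python under some loud assumptions: the gene name is taken to be the part of the header before the first underscore (so `>Gene1_id1` belongs to `Gene1`), which will not hold if real gene names contain underscores, and the names `gene_name`, `split_by_gene`, and the `.fasta` output naming are all hypothetical. Each record is appended straight to its gene's file, so the number of output files equals the number of genes instead of the number of records.

```python
import os
from typing import Iterable


def gene_name(header_line: str) -> str:
    # Assumption: headers look like ">Gene1_id1" and the gene is the
    # text between ">" and the first "_". Adjust for real header formats.
    return header_line[1:].split("_", 1)[0]


def split_by_gene(input_paths: Iterable[str], out_dir: str) -> None:
    """Append every record to out_dir/<gene>.fasta, one file per gene."""
    os.makedirs(out_dir, exist_ok=True)
    for path in input_paths:
        out = None           # output file for the record being copied
        current_gene = None
        with open(path) as src:
            try:
                for raw in src:
                    line = raw.strip()
                    if not line:
                        continue  # blank separator lines are dropped
                    if line.startswith(">"):
                        gene = gene_name(line)
                        if gene != current_gene:
                            if out is not None:
                                out.close()
                            # Append mode: records from earlier input
                            # files with the same gene are kept.
                            out = open(os.path.join(out_dir, gene + ".fasta"), "a")
                            current_gene = gene
                        out.write(line + "\n")
                    elif out is not None:
                        out.write(line + "\n")
            finally:
                if out is not None:
                    out.close()


# Example call (paths are placeholders):
# import glob
# split_by_gene(sorted(glob.glob("input/*.fasta")), "by_gene")
```

Only one output file is open at any moment, which sidesteps operating-system limits on open file handles at the cost of reopening files often. Because files are opened in append mode, running this twice into the same directory would duplicate every record, so it should start from an empty output directory.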

Any advice on how to proceed? Thanks!!!

-Hannah
