MSA Files

This part shows how to parse, refine, filter, slice, and write MSA files.

In [1]: from prody import *

In [2]: from matplotlib.pylab import *

In [3]: ion()  # turn interactive mode on

Let’s get the Pfam MSA file for the protein family that contains PIWI_ARCFU:

Parsing MSA files

This shows how to use MSAFile or parseMSA() to read the MSA file.

Reading using MSAFile yields an MSAFile object. Iterating over the object will yield an object of Sequence from which labels, sequence can be obtained. Like any file object, you can over iterate over a MSAFile object once.

In [4]: msafile = 'PF02171_seed.sth'

In [5]: msafobj = MSAFile(msafile)

In [6]: msafobj
Out[6]: <MSAFile: PF02171_seed (Stockholm; mode 'rt')>

In [7]: msa_seq_list = list(msafobj)

In [8]: msa_seq_list[0]
Out[8]: <Sequence: TAG76_CAEEL (length 395; 307 residues and 88 gaps)>

parseMSA() returns an MSA object. We can parse compressed files, but reading uncompressed files is much faster.

In [9]: fetchPfamMSA('PF02171', compressed=True)
Out[9]: 'PF02171_full.sth.gz'

In [10]: parseMSA('PF02171_full.sth.gz')
Out[10]: <MSA: PF02171_full (2067 sequences, 1392 residues)>

In [11]: fetchPfamMSA('PF02171', format='fasta')
Out[11]: 'PF02171_full.fasta.gz'

In [12]: parseMSA('PF02171_full.fasta.gz')
Out[12]: <MSA: PF02171_full (2067 sequences, 1392 residues)>

Iterating over a file will yield sequence id, sequence, residue start and end indices:

In [13]: msa = MSAFile('PF02171_seed.sth')

In [14]: seq_list = [seq for seq in msa]

In [15]: seq_list
[<Sequence: TAG76_CAEEL (length 395; 307 residues and 88 gaps)>,
 <Sequence: O16720_CAEEL (length 395; 302 residues and 93 gaps)>,
 <Sequence: AGO10_ARATH (length 395; 322 residues and 73 gaps)>,
 <Sequence: AGO1_SCHPO (length 395; 300 residues and 95 gaps)>,
 <Sequence: AGO6_ARATH (length 395; 311 residues and 84 gaps)>,
 <Sequence: AGO4_ARATH (length 395; 309 residues and 86 gaps)>,
 <Sequence: YQ53_CAEEL (length 395; 328 residues and 67 gaps)>,
 <Sequence: Q21691_CAEEL (length 395; 329 residues and 66 gaps)>,
 <Sequence: O62275_CAEEL (length 395; 331 residues and 64 gaps)>,
 <Sequence: O16386_CAEEL (length 395; 300 residues and 95 gaps)>,
 <Sequence: Q21495_CAEEL (length 395; 285 residues and 110 gaps)>,
 <Sequence: O76922_DROME (length 395; 298 residues and 97 gaps)>,
 <Sequence: PIWI_DROME (length 395; 292 residues and 103 gaps)>,
 <Sequence: Q17567_CAEEL (length 395; 312 residues and 83 gaps)>,
 <Sequence: PIWL1_HUMAN (length 395; 293 residues and 102 gaps)>,
 <Sequence: PIWI_ARCFU (length 395; 297 residues and 98 gaps)>,
 <Sequence: Y1321_METJA (length 395; 274 residues and 121 gaps)>,
 <Sequence: O67434_AQUAE (length 395; 276 residues and 119 gaps)>]

Filtering and Slicing

This shows how to use the MSAFile object or MSA object to refine MSA using filters and slices.


Any function that takes label and sequence arguments and returns a boolean value can be used for filtering the sequences. A sequence will be yielded if the function returns True. In the following example, sequences from organism ARATH are filtered:

In [16]: msafobj = MSAFile(msafile, filter=lambda lbl, seq: 'ARATH' in lbl)

In [17]: seq_list2 = [seq for seq in msafobj]

In [18]: seq_list2
[<Sequence: AGO10_ARATH (length 395; 322 residues and 73 gaps)>,
 <Sequence: AGO6_ARATH (length 395; 311 residues and 84 gaps)>,
 <Sequence: AGO4_ARATH (length 395; 309 residues and 86 gaps)>]


A list of integers can be used to slice sequences as follows. This enables selective parsing of the MSA file.

In [19]: msafobj = MSAFile(msafile, slice=list(range(10)) + list(range(374,384)))

In [20]: list(msafobj)[0]
Out[20]: <Sequence: TAG76_CAEEL (length 20; 19 residues and 1 gaps)>

Slicing can also be done using MSA. The MSA object offers other functionalities like querying, indexing, slicing row and columns and refinement.

MSA objects


Retrieving a sequence at a given index or by label will give an object of Sequence. Here’s an example using an index.

In [21]: msa = parseMSA(msafile)

In [22]: seq = msa[0]

In [23]: seq
Out[23]: <Sequence: TAG76_CAEEL (PF02171_seed[0]; length 395; 307 residues and 88 gaps)>

In [24]: str(seq)

Here we retrieve a sequence by UniProt ID:

In [25]: msa['YQ53_CAEEL']
Out[25]: <Sequence: YQ53_CAEEL (PF02171_seed[6]; length 395; 328 residues and 67 gaps)>


You can query whether a sequence in contained in the instance using the UniProt identifier of the sequence as follows:

In [26]: 'YQ53_CAEEL' in msa
Out[26]: True


Slice an MSA instance to give a new MSA. object :

In [27]: new_msa = msa[:2]

In [28]: new_msa
Out[28]: <MSA: PF02171_seed (2 sequences, 395 residues)>

Slice using a list of UniProt IDs:

In [29]: msa[['TAG76_CAEEL', 'O16720_CAEEL']]
Out[29]: <MSA: PF02171_seed (2 sequences, 395 residues)>

Retrieve a character or a slice of a sequence:

In [30]: msa[0,0]
Out[30]: <Sequence: TAG76_CAEEL (length 1; 1 residues and 0 gaps)>

In [31]: msa[0,0:10]
Out[31]: <Sequence: TAG76_CAEEL (length 10; 9 residues and 1 gaps)>

Slice MSA rows and columns:

In [32]: msa[:10,20:40]
Out[32]: <MSA: PF02171_seed (10 sequences, 20 residues)>

Merging MSAs

mergeMSA() can be used to merge two or more MSAs. Based on their labels only those sequences that appear in both MSAs are retained, and concatenated horizontally to give a joint or merged MSA. This can be useful while evaluating covariance patterns for proteins with multiple domains or protein-protein interactions. The example shows merging for the multi-domain receptor 3KG2 containing pfam domains PF01094 and PF00497.

Let’s parse and merge the two files:

In [33]: msa1 = parseMSA('PF01094_full.sth')

In [34]: msa1
Out[34]: <MSA: PF01094_full (11803 sequences, 2730 residues)>

In [35]: msa2 = parseMSA('PF00497_full.sth')

In [36]: msa2
Out[36]: <MSA: PF00497_full (24877 sequences, 2159 residues)>

In [37]: merged = mergeMSA(msa1, msa2)

In [38]: merged
Out[38]: <MSA: PF01094_full + PF00497_full (1174 sequences, 4889 residues)>

The merged MSA contains 4889 sequences with common labels.

Writing MSAs

writeMSA() can be used to write MSA. It takes filename as input which should contain appropriate extension that can be ".slx" or ".sth" or ".fasta" or format should be specified as "SELEX", "Stockholm" or "FASTA". Input MSA should be MSAFile or MSA object. Filename can contain ".gz" extension, in which case a compressed file will be written.

In [39]: writeMSA('sliced_MSA.gz', msa, format='SELEX')
Out[39]: 'sliced_MSA.gz'

In [40]: writeMSA('sliced_MSA.fasta', msafobj)
Out[40]: 'sliced_MSA.fasta'

writeMSA() returns the name of the MSA file that is written.