Recommand · October 14, 2021

How to tackle this dictionaries exercise in Python

Parsing the GO file and getting the terms

The next step is to get the GO terms out of the GO file. The function get_go_terms(mapping_list, go_file) takes two arguments: the list of dictionaries computed by get_mapping, and the name of a GO file. It should return a dictionary keyed by Ensembl protein ID whose values are sets of GO terms.

Let’s look at an example. Suppose go_file contains the following.

UniProtKB P10070 GLI2 GO:0045944 PMID:16553965 
UniProtKB P10070 GLI2 GO:0045944 PMID:12165851 
UniProtKB P10070 GLI2 GO:0009913 PMID:12165851

Column 2 contains the protein ID (here a UniProt ID) and column 5 the GO term. In this example, UniProt protein ID “P10070” corresponds to Ensembl ID “ENSP00000354586” in mapping_list, so the returned dictionary should contain the key “ENSP00000354586” with the set of GO terms “0045944” and “0009913” as its value.
The skeleton script contains a start to this function. Finish the script so that the result is a dictionary that maps each Ensembl ID to the set of its GO terms.
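
To make the expected result concrete, here is a small sketch built from the IDs in the example above. The one-entry mapping_list and the annotations list are stand-ins made up for illustration (the real mapping comes from get_mapping), and the GO terms are written without the "GO:" prefix, as in the exercise text.

```python
# Illustrative stand-in for the output of get_mapping: a single dictionary
# mapping a UniProt ID to its Ensembl ID (values taken from the example).
mapping_list = [{"P10070": "ENSP00000354586"}]

# (protein ID, GO term) pairs as they appear on the three example lines.
annotations = [
    ("P10070", "0045944"),
    ("P10070", "0045944"),
    ("P10070", "0009913"),
]

go_dict = {}
for protein_id, go_term in annotations:
    ensembl_id = mapping_list[0][protein_id]
    # A set stores each GO term once, so the repeated term is not duplicated.
    go_dict.setdefault(ensembl_id, set()).add(go_term)

print(go_dict)
```

The three lines collapse to one key with two GO terms, which is why the exercise asks for sets rather than lists.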

Hints:

  • Don’t forget about split(); entries are separated by a tab character.
  • Skip lines that start with "!" as they are just comments.
  • Protein IDs are in column 2 and GO terms in column 5. Remember the way Python list numbering works!
  • Given a protein ID (from the GO file), you can look up the corresponding Ensembl ID by checking all the dictionaries in mapping_list with a for loop.
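
The first three hints can be tried out on a couple of lines before touching the real file. This is a minimal sketch, assuming each data line has an empty fourth column between the symbol and the GO term (which is what the "column 5" numbering suggests); the two sample strings are made up for illustration.

```python
# Hypothetical lines: a comment, and a data line modeled on the example.
# The "\t\t" marks the assumed empty column 4.
lines = [
    "!gaf-version: 2.0\n",
    "UniProtKB\tP10070\tGLI2\t\tGO:0045944\tPMID:16553965\n",
]

parsed = []
for line in lines:
    # Comment lines start with "!" and carry no annotation.
    if line.startswith("!"):
        continue
    fields = line.rstrip("\n").split("\t")
    # Column 2 -> index 1, column 5 -> index 4 (Python counts from 0).
    parsed.append((fields[1], fields[4]))

# split() with no argument collapses runs of whitespace, so the empty
# column disappears and the GO term ends up at index 3 instead of 4.
collapsed = lines[1].split()

print(parsed)
print(collapsed[3])
```

Splitting explicitly on "\t" keeps empty columns in place, so the column numbers from the exercise stay valid.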

import sys


def get_mapping(map_file):
    # Result is a list of dictionaries, one per column after the first.
    mapping_list = []
    with open(map_file, "r") as f:
        # Skip the header on the first line; use it to count the columns.
        header = f.readline().split()
        for _ in range(len(header) - 1):
            mapping_list.append({})

        for line in f:
            fields = line.strip("\n").split("\t")
            # Column 0 is stored as the value; every other non-empty
            # column becomes a key in its own dictionary.
            for i in range(len(mapping_list)):
                if fields[i + 1] != "":
                    mapping_list[i][fields[i + 1]] = fields[0]
    return mapping_list

def get_go_terms(mapping_list, go_file):
    # Open the file.
    f = open(go_file, "r")

    # This will be the dictionary that this function returns.
    # Entries will have as a key an Ensembl ID and the value will
    # be a set of GO terms.
    go_dict = dict()
   
    for line in f:
        # Loop body still to be written; this is the part the exercise asks for.
        pass

    return go_dict

rno.go sample

!CVS Version: Revision: 1.272 $
!GOC Validation Date: 06/04/2011 $
!Submission Date: 6/4/2011
!
! The above "Submission Date" is when the annotation project provided
! this file to the Gene Ontology Consortium (GOC).  The "GOC Validation
! Date" indicates when this file was last changed as a result of a GOC
! validation and filtering process.  The "CVS Version" above is the
! GOC version of this file.
!
! Note: The contents of this file may differ from that submitted to the
! GOC. The identifiers and syntax of the file have been checked, rows of
! data not meeting the standards set by the GOC have been removed. This
! file may also have annotations removed because the annotations for the
! listed Taxonomy identifier are only allowed in a file provided by
! another annotation project.  The original submitted file is available from:
!  http://www.geneontology.org/gene-associations/submission/
!
! For information on which taxon are allowed in which files please see:
!  http://www.geneontology.org/GO.annotation.shtml#script
!
!gaf-version: 2.0
!Project_name: Rat Genome Database (RGD)
!URL: http://rgd.mcw.edu/
!Contact Email: simont@hmgc.mcw.edu
!Funding: NHLBI at US NIH, grant number 2R01HL064541
RGD 1302933 Mcpt1l3     GO:0003824  RGD:1600115 IEA InterPro:IPR009003  F   mast cell protease 1-like 3     gene    taxon:10116 20110430    InterPro        
RGD 1302933 Mcpt1l3     GO:0004252  RGD:1600115 IEA InterPro:IPR001254  F   mast cell protease 1-like 3     gene    taxon:10116 20110430    InterPro        
RGD 1302933 Mcpt1l3     GO:0006508  RGD:1600115 IEA InterPro:IPR001254  P   mast cell protease 1-like 3     gene    taxon:10116 20110430    InterPro        
RGD 1302933 Mcpt1l3     GO:0008233  RGD:1600115 IEA SP_KW:KW-0645   F   mast cell protease 1-like 3     gene    taxon:10116 20110430    UniProtKB       
RGD 1302934 St8sia5     GO:0003828  RGD:1624291 ISO RGD:1549984 F   ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 5        gene    taxon:10116 20110602    RGD     
RGD 1302934 St8sia5     GO:0005794  RGD:1600115 IEA SP_KW:KW-0333   C   ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 5        gene    taxon:10116 20110430    UniProtKB       
RGD 1302934 St8sia5     GO:0006486  RGD:1600115 IEA InterPro:IPR001675  P   ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 5        gene    taxon:10116 20110430    InterPro        
RGD 1302934 St8sia5     GO:0008152  RGD:1624291 ISO RGD:1549984 P   ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 5        gene    taxon:10116 20110602    RGD     
RGD 1302934 St8sia5     GO:0008373  RGD:1600115 IEA InterPro:IPR001675  F   ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 5        gene    taxon:10116 20110430    InterPro        
RGD 1302934 St8sia5     GO:0016020  RGD:1600115 IEA SP_KW:KW-0472   C   ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 5        gene    taxon:10116 20110430    UniProtKB       
RGD 1302934 St8sia5     GO:0016021  RGD:1600115 IEA SP_KW:KW-0812   C   ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 5        gene    taxon:10116 20110430    UniProtKB       
RGD 1302934 St8sia5     GO:0030173  RGD:1600115 IEA InterPro:IPR001675  C   ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 5        gene    taxon:10116 20110430    InterPro        
RGD 1302935 Nudt19      GO:0005739  RGD:1600115 IEA SP_SL:SL-0173   C   nudix (nucleoside diphosphate linked moiety X)-type motif 19        gene    taxon:10116 20110430    UniProtKB       
RGD 1302935 Nudt19      GO:0005739  RGD:1624291 ISO RGD:1551965 C   nudix (nucleoside diphosphate linked moiety X)-type motif 19        gene    taxon:10116 20110602    RGD     
RGD 1302935 Nudt19      GO:0005777  RGD:1600115 IEA SP_SL:SL-0204   C   nudix (nucleoside diphosphate linked moiety X)-type motif 19        gene    taxon:10116 20110430    UniProtKB       
RGD 1302935 Nudt19      GO:0016787  RGD:1600115 IEA SP_KW:KW-0378   F   nudix (nucleoside diphosphate linked moiety X)-type motif 19        gene    taxon:10116 20110430    UniProtKB       
RGD 1302935 Nudt19      GO:0046872  RGD:1600115 IEA SP_KW:KW-0479   F   nudix (nucleoside diphosphate linked moiety X)-type motif 19        gene    taxon:10116 20110430    UniProtKB       
RGD 1302936 C1rl        GO:0004252  RGD:1600115 IEA InterPro:IPR001254  F   complement component 1, r subcomponent-like     gene    taxon:10116 20110430    InterPro        
RGD 1302936 C1rl        GO:0005576  RGD:1600115 IEA SP_SL:SL-0243   C   complement component 1, r subcomponent-like     gene    taxon:10116 20110430    UniProtKB       
RGD 1302936 C1rl        GO:0005615  RGD:1600115 IEA Ensembl:ENSMUSP00000042883  C   complement component 1, r subcomponent-like     gene    taxon:10116 20110430    ENSEMBL     
RGD 1302936 C1rl        GO:0005615  RGD:1624291 ISO RGD:1549998 C   complement component 1, r subcomponent-like     gene    taxon:10116 20110602    RGD     
RGD 1302936 C1rl        GO:0006508  RGD:1600115 IEA InterPro:IPR001254  P   complement component 1, r subcomponent-like     gene    taxon:10116 20110430    InterPro        
RGD 1302936 C1rl        GO:0006958  RGD:1600115 IEA SP_KW:KW-0180   P   complement component 1, r subcomponent-like     gene    taxon:10116 20110430    UniProtKB       
RGD 1302936 C1rl        GO:0008233  RGD:1600115 IEA SP_KW:KW-0645   F   complement component 1, r subcomponent-like     gene    taxon:10116 20110430    UniProtKB       
RGD 1302936 C1rl        GO:0045087  RGD:1600115 IEA SP_KW:KW-0399   P   complement component 1, r subcomponent-like     gene    taxon:10116 20110430    UniProtKB       
RGD 1302937 Krt13       GO:0003674  RGD:1598407 ND      F   keratin 13      gene    taxon:10116 20070301    RGD     

get_mapping works correctly, but I don’t know how to write get_go_terms() so that it works as the exercise describes.
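
One possible way to finish get_go_terms, following the hints, is sketched below. It is not the official solution. It assumes the GO file is tab-separated with the protein ID at index 1 and the GO term at index 4, and it keeps the GO term exactly as it appears in the file (including the "GO:" prefix). The temporary file and the one-entry mapping list at the bottom are made up only to exercise the function.

```python
import os
import tempfile


def get_go_terms(mapping_list, go_file):
    # Keys are Ensembl IDs, values are sets of GO terms.
    go_dict = {}
    with open(go_file, "r") as f:
        for line in f:
            # Comment lines start with "!".
            if line.startswith("!"):
                continue
            fields = line.rstrip("\n").split("\t")
            # Skip blank or malformed lines that lack a fifth column.
            if len(fields) < 5:
                continue
            protein_id = fields[1]  # column 2
            go_term = fields[4]     # column 5
            # Check every mapping dictionary for this protein ID.
            for mapping in mapping_list:
                if protein_id in mapping:
                    ensembl_id = mapping[protein_id]
                    go_dict.setdefault(ensembl_id, set()).add(go_term)
    return go_dict


# Small self-check with made-up data modeled on the exercise example.
sample = (
    "!gaf-version: 2.0\n"
    "UniProtKB\tP10070\tGLI2\t\tGO:0045944\tPMID:16553965\n"
    "UniProtKB\tP10070\tGLI2\t\tGO:0045944\tPMID:12165851\n"
    "UniProtKB\tP10070\tGLI2\t\tGO:0009913\tPMID:12165851\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".go", delete=False) as tmp:
    tmp.write(sample)

result = get_go_terms([{"P10070": "ENSP00000354586"}], tmp.name)
os.remove(tmp.name)
print(result)
```

If the exercise expects bare numbers such as “0045944”, the prefix can be dropped with go_term[3:] before adding the term to the set. Protein IDs that appear in none of the mapping dictionaries are simply ignored.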