
Thursday, November 10th, 2022
  12:00-4:00 PM CST
Agenda: https://www.rcsb.org/news/feature/63288c8c831916e52206a1f1
Quick links to sections in this document. Times are in CST.
  12:05-12:20pm - Section 1: Review of Crash Course  Objectives and Arabidopsis Case Study 
  Questions from the participants: 
  12:20-1pm - Section 2:  Protein Candidates from Function Queries in KBase 
  Questions from the participants: 
  1-1:30pm - Section 3:  Accessing Experimental Structures from the PDB 
  Questions from the participants: 
  1:30-2pm - Section 4:  Accessing Computed Structure Models generated using AlphaFold2 or RoseTTAFold 
  Questions from the participants: 
  2-2:15pm - BREAK 
  2:15-2:45pm - Section 5:  Introduction to Mol* Molecular Graphics System 
  Questions from the participants: 
  2:45-3:15pm - Section 6:  KBase Apps for Protein Structure Data Communication and Integration with RCSB  PDB 
  Questions from the participants: 
  3:15-3:45pm - Section 7:  Reviewing and expanding on the capabilities offered by RCSB tools in KBase 
  Questions from the participants: 
  3:45-4pm Section 8: Future  Goals for KBase/RCSB PDB Collaboration 
  Questions and comments from the participants: 
Speaker: Christopher S. Henry, Ph.D. - KBase, Argonne National Laboratory
  Synopsis: Dr. Henry will start by introducing the overall Crash Course objectives, reviewing the tools to be presented in KBase and how these will connect to data and tools within PDB. Next, he will present a simple case study in KBase where KBase tools will be used to identify a gap in the metabolic annotations of Arabidopsis, the RibD step of riboflavin biosynthesis. Then, the KBase-PDB query tool will reveal two potential gene candidates to fill this gap: AT4G20960 and AT3G47390. Dr. Henry will explore the output of this tool, showing how the PDB data provides evidence for these new annotations. In session 4 of the workshop, we will show how PDB tools can be applied to investigate these annotations further.
KBase Narrative: https://narrative.kbase.us/narrative/130357
AT4G20960
  >sp|Q8GWP5.1|RIBD_ARATH RecName: Full=Riboflavin biosynthesis protein PYRD, chloroplastic; Includes: RecName: Full=Diaminohydroxyphosphoribosylaminopyrimidine deaminase; Short=DRAP deaminase; AltName: Full=Riboflavin-specific deaminase; Includes: RecName: Full=Inactive 5-amino-6-(5-phosphoribosylamino)uracil reductase; AltName: Full=HTP reductase; Flags: Precursor 
  MQISCLPISIPSITPRTSIPLLPSLSSNPRRIFNLTSLQSPNHCFFKRLHKSQTGFSNPVLAAMRREEDV 
  EVDDSFYMRKCVELAKRAIGCTSPNPMVGCVIVKDGDIVGQGFHPKAGQPHAEVFALRDAGELAENATAY 
  VSLEPCNHYGRTPPCTEALIKAKVRRVVIGMVDPNPIVFSSGISRLKDAGIDVTVSVEEELCKKMNEGFI 
  HRMLTGKPFLALRYSMSVNGCLLDKIGQGASDSGGYYSKLLQEYDAIILSSSLSDELSSISSQEAINVSI 
  QPIQIIVASNAQQSHILASSHTVEESGPKVVVFTAKESVAESGISSSGVETVVLEKINLDSILDYCYNRG 
  LCSVLLDLRGNVKDLEVLLRDGFEQKLLQKVIIEVLPEWSTKDERQIASMKWLESKHVKDLQSKQLGGSV 
  LLEGYF 
AT3G47390
  >sp|Q9STY4.1|RIBRX_ARATH RecName: Full=Riboflavin biosynthesis protein PYRR, chloroplastic; Includes: RecName: Full=Inactive diaminohydroxyphosphoribosylaminopyrimidine deaminase; Short=DRAP deaminase; AltName: Full=Riboflavin-specific deaminase; Includes: RecName: Full=5-amino-6-(5-phosphoribosylamino)uracil reductase; AltName: Full=HTP reductase; Includes: RecName: Full=Riboflavin biosynthesis intermediates N-glycosidase; Flags: Precursor 
  MALSFRISSSSPLICRATLSNGDNSRNYHTTDAAFIRRAADLSEMSAGLTSPHPNFGCVIATSSGKVAGE 
  GYLYAQGTKPAEALAVEAAGEFSRGATAYLNMEPGDCHGDHTAVSALVQAGIERVVVGIRHPLQHLRGSA 
  IRELRSHGIEVNVLGEDFESKVLEDARKSCLLVNAPLIHRACSRVPFSVLKYAMTLDGKIAASSGHAAWI 
  SSKLSRTRVFELRGGSDAVIVGGNTVRQDDPRLTARHGQGHTPTRIVMTQSLDLPEKANLWDVSEVSTIV 
  VTQRGARKSFQKLLASKGVEVVEFDMLNPREVMEYFHLRGYLSILWECGGTLAASAISSSVIHKVVAFVA 
  PKIIGGSKAPSPVGDLGMVEMTQALNLIDVCYEQVGPDMLVSGFLQPIPDLLPVIPSEDATVEIDPSVDP 
  FEPSIIFFYKTWDLYGMWNITIRYHTTVHVKWYLALSKKHNLLILHPKTLKANKFVGVENPKAYDCVEKI 
  RTARSPEEAALIGRSTLRQKPELVRNDWEDVKIEVMYKALKCKFSTYPHLKSMLLSTIGTVLVEASPHDL 
  FWGGGREGEGLNYLGRLLMQLRSEYLGESSVSAEKTSSA 
The crystal structure of RibD from Escherichia coli in complex with the oxidised NADP+ cofactor in the active site of the reductase domain:
  https://www.rcsb.org/structure/2O7P
NMR structure of the E.coli protein YbiA, Northeast Structural Genomics target ET24:
  https://www.rcsb.org/structure/2B3W
Computed structure model of Arabidopsis thaliana’s riboflavin biosynthesis protein PYRD, chloroplastic: PYRD
Computed structure model of Arabidopsis thaliana’s riboflavin biosynthesis protein PYRR, chloroplastic: PYRR
Q (Herman-UnivTexas): Is it possible to integrate cryo-EM structures (low resolution or tomography) into KBase for structure prediction? In my case, for a virus-like self-assembled structure from humans.
A (name): You can load and share the structure, which Q. Zhang will demonstrate later. Eventually we plan to support structure alignment search against experimental structures in PDB (courtesy of work done by Ada Sedova’s team at ORNL). You can also query PDB for hits based on gene sequence and bring those structures into KBase for comparison. We don’t yet offer any folding capabilities in KBase. Also, you can import your structures directly into PDB for alignment today, which Brinda will demonstrate in her talk. You would need to know good structures to compare to, but the KBase and PDB tools . I don’t think that’s exactly what you were looking for, but hopefully this helps.
Thanks for the response. Right now I'm trying to get a better structure prediction of the monomer protein (400 aa), then later I will focus on the macrostructure. I’ll check the talks to learn the workflows.
Speaker: Janaka N. Edirisinghe, Ph.D. - KBase, Argonne National Laboratory
  Synopsis: In this presentation, Dr. Janaka Edirisinghe demonstrates the KBase discovery pipeline for identifying potential gene candidates for a novel pyridine degradation pathway in Micrococcus luteus. Here (i) we use cheminformatics analysis to propose new biochemistry, then (ii) we use metabolic modeling and omics data to identify potential gene candidates, and (iii) we query the PDB to fetch metadata/annotations for experimentally resolved structures corresponding to the gene candidates. For these selected structures, later in this workshop, the PDB team will demonstrate (iv) deriving co-crystallized structures with the substrates of interest, which bolster confidence in the identified gene candidates for this novel degradation pathway.
Relevant material:
  KBase Narrative HTML version of the workflow - https://kbase.us/n/127880/210/    (No KBase account needed to access this link)
  Live KBase Narrative: https://narrative.kbase.us/narrative/127880 
PAAZ, AN ENZYME OF A BACTERIAL PYRIDINE DEGRADATION PATHWAY
  >MLuteus_masurca_RAST.CDS.3483_CDS_PaaZ
  MTTTATAEAAVNTVETVPSFVQDSWWTPDAGSAASAVPVRDASTGEVLAKVSADGLDLAAVVEYGRTTGQAELGKLTFHQRALKLKELAQYLNARREHFYTFSAQTGATKVDSMIDIDGGIGVLFTFGSKGRRELPNSQVVVDGPMEVLSKDGSFAGEHIYTRIPGVAVQINAFNFPVWGMLEKLAPAFIAGVPTIVKPATPTGYVAAAVVKAIIESNILPAGSLQLISGSVRGLLDVLDYRDLVAFTGSASTALTLKSHRNVVEGGVRFTSETDSLNAAILGTDAVEGTPEFDAFIKSVVTEMTVKAGQKC
  TSIRRAIVPEGLVPAVIAAVGKRIQERVVLGDPRAEGVTMGALASVEQLADVRAAVQSMIDAGGELAYGTLDSPSVTAADGTTGVVAEGAFMSPVVLGWNDPEAEAIHSLEAFGPVASVIGYKDLPDAVRLAARGGGSLVATVCTNDPAVARELVTGIAAHHGRVLMLNREDARSSTGHGSPVPHLVHGGPGRAGGGEELGGIRSVMHHMQRTAIQGSPNMLTAVTGVWHTGADRNFTADTEGTHPFRKSLSTLHIGDAIRSELREVTLEEITKFANSTGDTFYAHTNQEAAEANPFFPGIVAHGYLLLAWGAGLFVEPAPGPVLANYGLENLRFITPVAAGDSIRVTLTAKKITPRETDEYGEVAWDAVLTNQKDEIVATYDVLTLVEKG
Q (Gregg Crichlow): What does the biochemical number - the number after the partial EC number - represent? How can that be used in the type of analysis you presented?
  A (name): The enzymatic reaction rules used in the cheminformatics tool in KBase take the form: <3rd level EC number>_<reaction rule index>. Reactions generated by these rules are then named with this format: <3rd level EC number>_<reaction rule index>_<specific reaction index>. So the most useful portion of the ID, as demonstrated by Janaka, is the 3rd level EC number, which can be used to find enzyme candidates for the reaction by looking for enzymes that have been annotated with an EC number that STARTS with the same 3 digits. This is based on the assumption that enzymes in the same 3rd level EC class are likely to perform other functions in the same 3rd level EC class either through promiscuity (as is likely the case in our pyridine example) or because their current annotation is actually erroneous. To get at your question though, the next number in the ID, <reaction rule index>, is not nearly as useful, but does still have some basic utility. Essentially, reactions generated by the same reaction rule are cheminformatically equivalent… meaning they involve breaking and creating the same chemical bonds in similar chemical functional groups. Hypothetically, you should see greater structural homology among enzymes conforming to the same reaction rule than among enzymes conforming to different reaction rules (even if they are in the same EC class). So I’m suggesting the reaction rule index can almost be treated as a finer resolution chemical sub-class of the 3rd level of the EC tree. This only applies to our rule set though… and there are numerous rulesets in existence in cheminformatics. We are working on importing new rulesets into our tool in KBase so you can use them there… and hopefully these approaches standardize better over the coming years.
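The ID convention described in this answer can be sketched in a few lines of Python. This is only an illustration: the function names, the example ID `1.2.1_12_3`, and the gene annotations are hypothetical, and the sketch assumes the 3rd level EC number is written in dotted form (for example `1.2.1`). Only the `<3rd level EC number>_<reaction rule index>_<specific reaction index>` layout comes from the answer above; none of this is a KBase API.

```python
from typing import Dict, List, NamedTuple, Optional


class ReactionId(NamedTuple):
    ec_prefix: str                 # 3rd level EC number, assumed dotted, e.g. "1.2.1"
    rule_index: str                # reaction rule index
    reaction_index: Optional[str]  # None when the ID names a rule, not a reaction


def parse_id(identifier: str) -> ReactionId:
    """Split a rule ID (2 fields) or reaction ID (3 fields) on underscores."""
    parts = identifier.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 2 or 3 underscore-separated fields: {identifier!r}")
    if len(parts[0].split(".")) != 3:
        raise ValueError(f"expected a 3-level EC number, got {parts[0]!r}")
    reaction_index = parts[2] if len(parts) == 3 else None
    return ReactionId(parts[0], parts[1], reaction_index)


def candidate_genes(identifier: str, annotations: Dict[str, str]) -> List[str]:
    """Return genes whose EC annotation starts with the ID's 3rd level EC number."""
    # The trailing dot stops "1.2.1" from also matching "1.2.10.x".
    prefix = parse_id(identifier).ec_prefix + "."
    return sorted(gene for gene, ec in annotations.items() if ec.startswith(prefix))


# Hypothetical annotations, for illustration only.
example_annotations = {"geneA": "1.2.1.3", "geneB": "1.2.10.1", "geneC": "3.5.4.26"}
print(candidate_genes("1.2.1_12_3", example_annotations))
```

Grouping candidates by `rule_index` as well would give the finer "chemical sub-class" view suggested in the answer, but only within a single rule set.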
Q (Jameel): what is the source of the differential expression data?
  A (Chris): We cultured the cells in minimal media with glucose and pyridine and ran RNAseq. We then uploaded the reads to KBase and ran the RNAseq pipeline in a separate narrative, which Janaka links to from his narrative. That table comes specifically from the DEseq app.
Q (Mariana Fernandes): Are there equivalents of these tools for fungal genomes/genes, or are these tools appropriate?
  A (Janaka):… In terms of Fungi, we do have tools for construction of models (Build Fungal Model) based on annotated genomes. Once the models are constructed, you can use the same pipeline based on the models (any metabolic model for that matter).
Speaker: Dennis Piehl, Ph.D. - RCSB Protein Data Bank, Rutgers University
  Synopsis: We will provide an overview and walkthrough of the PDB Archive and the tools available at RCSB.org to search, visualize, and download data from the PDB Archive. We will use the story of the Arabidopsis thaliana resistosome as an example for the tutorial/walkthrough. 
Relevant material:
  Resources and Documentation:
Q (Patrik D’haeseleer): Can you search computational structures by protein accession, e.g. Swissprot or genbank accession number?
A (Chris): The query system should support this. You can definitely search experimental structures this way using both the interface in KBase and the query tool in PDB. The query tool in KBase does not YET support querying computational structures, but we hope to add this soon. Of course, the PDB supports both structure types.
Q (Sangwoo): Could we also use EC numbers to search for proteins, especially enzymes, in RCSB?
A (QZ): Yes, you can. Both the KBase query app and the PDB query interface and API support this.
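For readers who want to try the two searches discussed above (by UniProt accession and by EC number) against the PDB API, the request bodies might look like the sketch below. The endpoint, the attribute names, and the `results_content_type` option are assumptions based on the RCSB Search API schema and should be checked against the current RCSB documentation. The helper names are hypothetical. The code only builds the JSON payloads and does not contact the server.

```python
import json
from typing import Any, Dict

# Assumed endpoint of the RCSB Search API (v2).
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"

# Assumed attribute names from the RCSB search schema.
EC_ATTR = "rcsb_polymer_entity.rcsb_ec_lineage.id"
REF_SEQ = "rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers"


def _terminal(attribute: str, value: str) -> Dict[str, Any]:
    """One exact-match text condition."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": "exact_match", "value": value},
    }


def ec_query(ec_number: str) -> Dict[str, Any]:
    """Polymer entities annotated with the given (full or partial) EC number."""
    return {"query": _terminal(EC_ATTR, ec_number), "return_type": "polymer_entity"}


def uniprot_query(accession: str, include_computed: bool = False) -> Dict[str, Any]:
    """Polymer entities mapped to a UniProt accession, optionally including CSMs."""
    payload: Dict[str, Any] = {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [
                _terminal(REF_SEQ + ".database_accession", accession),
                _terminal(REF_SEQ + ".database_name", "UniProt"),
            ],
        },
        "return_type": "polymer_entity",
    }
    if include_computed:
        # Assumed option for returning computed structure models as well.
        payload["request_options"] = {"results_content_type": ["computed", "experimental"]}
    return payload


print(json.dumps(ec_query("1.2.1"), indent=2))
```

Either payload would then be sent as the JSON body of a POST request to `SEARCH_URL` using any HTTP client.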
Speaker: Brinda Vallat, Ph.D., RCSB Protein Data Bank, Rutgers University
  Synopsis: This session will focus on finding computed structure models (CSMs) on RCSB.org and applying the various tools available on RCSB.org to search, analyze and visualize CSMs alongside experimentally-determined PDB structures. The tutorial will provide an overview of CSM-specific search attributes, CSM structure summary page, sequence and structure similarity searches, and the pairwise structure alignment tool. Two case studies will be used to highlight the different tools and functionalities available for studying CSMs on RCSB.org: Disease resistance RPP13-like protein 4 from Arabidopsis thaliana resistosome and Micrococcus luteus aldehyde dehydrogenase. 
Relevant material:
Case study 1: Disease resistance RPP13-like protein 4
Case study 2: Micrococcus luteus aldehyde dehydrogenase
  MTTTATAEAAVNTVETVPSFVQDSWWTPDAGSAASAVPVRDASTGEVLAKVSADGLDLAAVVEYGRTTGQAELGKLTFHQ
  RALKLKELAQYLNARREHFYTFSAQTGATKVDSMIDIDGGIGVLFTFGSKGRRELPNSQVVVDGPMEVLSKDGSFAGEHI
  YTRIPGVAVQINAFNFPVWGMLEKLAPAFIAGVPTIVKPATPTGYVAAAVVKAIIESNILPAGSLQLISGSVRGLLDVLD
  YRDLVAFTGSASTALTLKSHRNVVEGGVRFTSETDSLNAAILGTDAVEGTPEFDAFIKSVVTEMTVKAGQKCTSIRRAIV
  PEGLVPAVIAAVGKRIQERVVLGDPRAEGVTMGALASVEQLADVRAAVQSMIDAGGELAYGTLDSPSVTAADGTTGVVAE
  GAFMSPVVLGWNDPEAEAIHSLEAFGPVASVIGYKDLPDAVRLAARGGGSLVATVCTNDPAVARELVTGIAAHHGRVLML
  NREDARSSTGHGSPVPHLVHGGPGRAGGGEELGGIRSVMHHMQRTAIQGSPNMLTAVTGVWHTGADRNFTADTEGTHPFR
  KSLSTLHIGDAIRSELREVTLEEITKFANSTGDTFYAHTNQEAAEANPFFPGIVAHGYLLLAWGAGLFVEPAPGPVLANY
  GLENLRFITPVAAGDSIRVTLTAKKITPRETDEYGEVAWDAVLTNQKDEIVATYDVLTLVEKG
Pairwise alignment tool
  https://www.rcsb.org/alignment 
  https://www.rcsb.org/docs/tools/pairwise-structure-alignment
Additional information 
  https://www.rcsb.org/docs/general-help/computed-structure-models-and-rcsborg
  https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/computed-structure-models
  https://github.com/ihmwg/ModelCIF
Q (Dylan Chivian): are the computed structure models always full-length of the chain or sometimes just a domain or two?
A (Dennis): For the case of AlphaFold2 and RoseTTAFold models (which are integrated with the RCSB PDB), yes, they are always the full-length chain. This is because (at least in the case of AlphaFold) the input sequences to fold are the full UniProt sequences. Thus, the output is the full-length structure.
Q (Patrik D’haeseleer): One caution to keep in mind regarding the AlphaFold models is that they include any signal peptides! If the N-terminal is internal, this can distort the protein structure, compared to what you would expect for the mature protein minus signal peptide
Q (Dylan Chivian): are the AlphaFold2 computed structure models always a single chain?
  A (Dennis): Yes, to date AlphaFold DB only predicts the structure of a single chain at a time. They have not yet performed this prediction for protein complexes. (The current set of RoseTTAFold models—from David Baker’s group—are of yeast binary complexes, so complexed predictions do exist. These all start with “MA_” in the RCSB PDB, e.g., https://www.rcsb.org/structure/MA_MABAKCEPC0006)
  Chris: Other groups… I think Baker lab for sure and groups at ORNL are adding tools to compute complex structures.       
  Patrik: several of the tools available on colab now support modeling heteromers - see the table at https://github.com/sokrypton/ColabFold
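The identifier prefixes mentioned in this thread can be used to tell entry types apart in a script. The sketch below is an illustration under stated assumptions: the `MA_` prefix comes from the answer above, while the `AF_AF<UniProt accession>F<n>` layout is inferred from example IDs in these notes (such as `AF_AFQ38834F1`). The function names and category labels are hypothetical.

```python
import re
from typing import Optional

# Classic PDB IDs: four characters, the first one a digit.
PDB_ID = re.compile(r"[0-9][A-Z0-9]{3}")
# Layout inferred from example IDs in these notes: AF_AF<UniProt accession>F<n>.
AF_ID = re.compile(r"AF_AF([A-Z0-9]+)F(\d+)")


def classify_entry_id(entry_id: str) -> str:
    """Return a coarse label for an RCSB.org entry identifier."""
    entry_id = entry_id.strip().upper()
    if entry_id.startswith("AF_"):
        return "alphafold_csm"
    if entry_id.startswith("MA_"):
        return "modelarchive_csm"
    if PDB_ID.fullmatch(entry_id):
        return "experimental"
    return "unknown"


def uniprot_from_af_id(entry_id: str) -> Optional[str]:
    """Extract the UniProt accession from an AF_ identifier, if it fits the layout."""
    match = AF_ID.fullmatch(entry_id.strip().upper())
    return match.group(1) if match else None


for example in ("2O7P", "AF_AFQ38834F1", "MA_MABAKCEPC0006"):
    print(example, classify_entry_id(example), uniprot_from_af_id(example))
```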
Q (Herman-Univ Texas): Can you perform a comparison of more than 2 structures? In my case, a human cell receptor with multiple domains, some of them crystallized, compared with the AlphaFold prediction (full structure, 700 aa).
  A (Shuchi):…Yes, you can add rows to the structure alignment tool to upload more IDs/files
  Q. Any tutorial on how to do this?
  A (Shuchi): perhaps the documentation https://www.rcsb.org/docs/tools/pairwise-structure-alignment can help
  Thanks Shuchi
Q (Yujun): How can you know whether the alignment of the protein is good or not?
  A (Shuchi):…The RMSD and TM-scores can help - low RMSD and high TM-score 
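To make the two scores concrete, here is a minimal sketch of how RMSD and a TM-score-style value are computed from paired C-alpha coordinates. It assumes the two structures are already superposed and that residues are already matched one to one. Real alignment tools also optimize the superposition and the residue pairing, so their reported values will differ. The d0 formula is the standard TM-score length normalization, and the 0.5 Å floor for short chains is an assumption following common practice. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def _pair_distances(a: Sequence[Point], b: Sequence[Point]) -> List[float]:
    """Distances between matched atoms of two pre-superposed structures."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    return [math.dist(p, q) for p, q in zip(a, b)]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root-mean-square deviation over matched atoms (lower is better)."""
    distances = _pair_distances(a, b)
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def tm_score(a: Sequence[Point], b: Sequence[Point], target_length: int) -> float:
    """TM-score-style value for a fixed superposition (higher is better, max 1)."""
    distances = _pair_distances(a, b)
    if target_length > 15:
        d0 = max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)
    else:
        d0 = 0.5  # assumed floor for very short chains
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Toy coordinates, for illustration only.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
shifted = [(x, y, z + 1.0) for x, y, z in reference]
print(rmsd(reference, shifted), tm_score(reference, shifted, target_length=3))
```

Because the TM-score is normalized by chain length and down-weights large deviations, it is less sensitive than RMSD to a few badly placed regions. A TM-score above roughly 0.5 is commonly taken to suggest the same overall fold.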
  Q (Vanessa):...Can you build multiple models from a set of assemblies from KBase and compare their structures?
  A (name): KBase doesn’t support folding yet, but you can export the sequences and fold them in Google Colab, and QZ can share information on that if you’d like. You can upload and view those structures in KBase, and also submit the structures to PDB for structure homology search or structure alignment against a hit from a PDB query performed in KBase or on the PDB website. I hope we can support folding and structural homology search against PDB directly in KBase soon.
Q (Y): can you import the sequence for modeling as a txt file? (pls I mean kbase upload)
  A (Chris): So I’m assuming you mean genome sequence, and by modeling, I assume you mean a metabolic model. So, yes, you can import a DNA sequence in FASTA format for metabolic model reconstruction. Currently only full microbial genome sequences or metagenome sequences are supported for de novo reconstruction. If you have a euk, we cannot do euk gene calling in KBase yet, so you would need to import existing gene calls or select from the many reference genomes we already have in KBase (we have most plants and fungi). For microbes, we can call genes. For plants, microbes, and fungi, we can build models from protein sequences. Soon I think we will support uploading protein sequence sets and annotating and building models from those, but we don’t support them yet as this is still a relatively new datatype in KBase (free standing protein sequences that are outside a genome).
Speaker: Shuchismita Dutta, Ph.D., RCSB Protein Data Bank, Rutgers University
  Synopsis: This session will introduce participants to Mol*’s interface and some key functions in molecular visualization. The tool’s functions will be examined when presented in the context of 3D structures available from RCSB.org, as a stand-alone tool, and also when it is used to visualize query results, annotations integrated from various bioinformatics resources. The story of the Arabidopsis thaliana resistosome (introduced in an earlier talk) will be revisited for the tutorial/walkthrough.
Relevant material:
Q (Dylan Chivian): Does Mol* offer a scripting command language (e.g. select cartoon for chain A, hide all non chain A, ball-and-stick for hetero atoms, turn off waters, color acidic and basic residues, show surface/grasp view, etc.)?
A (Shuchi): The scripting options exist but are not necessarily exposed for our tool view. However, if you have specific needs, feel free to write to us using the Contact us button from the home page.
Q (Chris): Can we inject additional custom sequences (e.g. custom sequences from KBase) into that 1-D-3-D multi-sequence alignment view you were showing?
A (Shuchi):… You can group additional models into groups. The 1D-3D view for Groups is currently available only for UniProt-based groups of structures on RCSB.org. More options to come (I hope). If you have a specific request, feel free to write to us using the contact us button on the home page.
Q (Dylan Chivian): the active-site geometry search is fantastic! can you run it interactively from Mol*? can you include hetero atoms? can you include waters?
A (Shuchi):…Yes you can include ligands in the query too - see https://www.rcsb.org/docs/search-and-browse/advanced-search/structure-motif-search
Q (Patrik D’haeseleer): does 1D-3D view work for structural alignments from CSM structures? 
  A (Dennis): Yes, for example: https://www.rcsb.org/3d-sequence/AF_AFQ38834F1?assemblyId=deposited . You can navigate there from any Structure Summary Page of a CSM structure, by clicking on “1D-3D View” (underneath the image), e.g.: https://www.rcsb.org/structure/AF_AFQ38834F1. Did that answer your question? 
Q: (Dennis) Does this only work for UniProt groups, or also for %sequence identity clusters? (For AF structures, the UniProt group will typically only contain the one structure!)
A: If you’re referring to the groups 1D-3D alignment tool (e.g., https://www.rcsb.org/groups/3d-sequence/polymer_entity/Q38834), then yes, that only works for UniProt groups for now. We are considering adding more extensive functionality later…so keep an eye out! (Also feel free to ask about new features via our “contact us” button—the more feedback we receive, the more we can prioritize requests). And yes, if the AF structure doesn’t have a corresponding experimental structure from the same UniProt sequence, then it will be alone in the groups alignment. 
Q (Janaka): Would it be possible to align multiple experimental structures to a potential predicted structure (e.g., a hypothetical gene with no clear function), and based on the experimental structure motifs and their mapped specific function, suggest the potential function of the hypothetical protein?
A (Shuchi): yes you can compare multiple structures from RCSB.org and your computed model. Please review https://www.rcsb.org/docs/tools/pairwise-structure-alignment and if you have more questions write to us and we will be happy to help you.
Speaker: Qizhi Zhang, Ph.D. - KBase, Argonne National Laboratory
  Synopsis: In this section, I will introduce four (4) apps in KBase that can be used for querying RCSB data and integrating with existing KBase genomic data: 
The purpose of each app will be described. These apps will be demonstrated in a KBase narrative, with examples of the required input and of the output data/reports.
Relevant material:
  KBase Narrative: https://narrative.kbase.us/narrative/130799
Data files used for this demo:
  https://github.com/kbaseapps/ProteinStructureUtils/tree/bcc51c1c57f8a286102254e80999fb2a6910d813/test/KBase_RCSB_crash_course_sample_files
Speaker: Christopher Henry, Ph.D. - KBase, Argonne National Laboratory
  Synopsis: In this session, Dr. Henry will further expand on the capabilities of the RCSB tools available in KBase. This includes various insights offered by publications, cocrystallization data, and protein complex information. Dr. Henry will discuss some of the future plans to improve connections between these tools and other tools and data in the KBase platform. Finally, questions and feedback will be collected. 
Relevant material:
Q (Mark): Where is cocrystallization data found in PDB? Can you show it in this example structure page: https://www.rcsb.org/structure/5OC0
A (Brinda):… @Mark Fisher, you can scroll down to the small molecule section of the structure summary page of https://www.rcsb.org/structure/5OC0
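The same small-molecule information might also be reached programmatically. The sketch below assumes the RCSB Data API REST layout (`/core/entry/<ID>` and `/core/nonpolymer_entity/<ID>/<entity>`) and the `non_polymer_entity_ids` field name. Both are assumptions to verify against the RCSB Data API documentation. The helper names are hypothetical, and the code only builds URLs and reads an already-downloaded entry record.

```python
from typing import Any, Dict, List

# Assumed base URL of the RCSB Data API (REST).
DATA_API = "https://data.rcsb.org/rest/v1/core"


def entry_url(entry_id: str) -> str:
    """URL of the entry-level record, e.g. for 5OC0."""
    return f"{DATA_API}/entry/{entry_id.upper()}"


def nonpolymer_entity_url(entry_id: str, entity_id: str) -> str:
    """URL of one small-molecule (non-polymer) entity record."""
    return f"{DATA_API}/nonpolymer_entity/{entry_id.upper()}/{entity_id}"


def small_molecule_urls(entry_json: Dict[str, Any]) -> List[str]:
    """List the non-polymer entity URLs named in a parsed entry record."""
    # Field names below are assumed from the Data API entry schema.
    identifiers = entry_json.get("rcsb_entry_container_identifiers", {})
    entry_id = identifiers.get("entry_id", "")
    entity_ids = identifiers.get("non_polymer_entity_ids") or []
    return [nonpolymer_entity_url(entry_id, entity_id) for entity_id in entity_ids]


print(entry_url("5OC0"))
```

After fetching and parsing the JSON at `entry_url("5OC0")` with any HTTP client, passing the result to `small_molecule_urls` would give the records for the co-crystallized ligands, if the entry has any.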
Speaker: Stephen K. Burley, M.D., D.Phil. - RCSB Protein Data Bank, Rutgers
Comment (John Hasty): This would help more people contribute and develop advances in their own “pipeline” and then share them.