This module defines functions for parsing header data from PDB files.
A data structure for storing information on chemical components (or heterogens) in PDB structures.
A Chemical instance has the following attributes:
Attribute | Type | Description (RECORD TYPE) |
---|---|---|
resname | str | residue name (or chemical component identifier) (HET) |
name | str | chemical name (HETNAM) |
chain | str | chain identifier (HET) |
resnum | int | residue (or sequence) number (HET) |
icode | str | insertion code (HET) |
natoms | int | number of atoms present in the structure (HET) |
description | str | description of the chemical component (HET) |
synonyms | list | synonyms (HETSYN) |
formula | str | chemical formula (FORMUL) |
pdbentry | str | PDB entry that chemical data is extracted from |
Chemical class instances can be obtained as follows:
In [1]: from prody import *
In [2]: chemical = parsePDBHeader('1zz2', 'chemicals')[0]
In [3]: chemical
Out[3]: <Chemical: B11 (1ZZ2_A_362)>
In [4]: chemical.name
Out[4]: 'N-[3-(4-FLUOROPHENOXY)PHENYL]-4-[(2-HYDROXYBENZYL) AMINO]PIPERIDINE-1-SULFONAMIDE'
In [5]: chemical.natoms
Out[5]: 33
In [6]: len(chemical)
Out[6]: 33
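For orientation, the HET-derived attributes in the table above correspond to fixed-column fields of a HET record. The following is a minimal sketch, not ProDy's own parser: `parse_het_line` is a hypothetical helper, the column ranges are assumed from the PDB file format specification, and the sample line is hand-built from the values shown in the session above rather than copied from the 1ZZ2 file.

```python
def parse_het_line(line):
    """Split one HET record into fields named after the Chemical attributes.

    Hypothetical helper for illustration only; 1-based column ranges from
    the PDB format specification are noted in the comments.
    """
    line = line.ljust(70)  # pad so that slicing a short line is safe
    return {
        "resname": line[7:10].strip(),       # columns 8-10
        "chain": line[12].strip(),           # column 13
        "resnum": int(line[13:17]),          # columns 14-17
        "icode": line[17].strip(),           # column 18
        "natoms": int(line[20:25]),          # columns 21-25
        "description": line[30:70].strip(),  # columns 31-70
    }


# Hand-built HET line (assumption: values taken from the session above).
het_line = f"{'HET':<6} {'B11':>3}  {'A'}{362:>4}{' '}  {33:>5}"
fields = parse_het_line(het_line)
print(fields["resname"], fields["chain"], fields["resnum"], fields["natoms"])
```

Fields that are blank in the record, such as the insertion code here, come back as empty strings.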
A data structure for storing information on polymer components (protein or nucleic) of PDB structures.
A Polymer instance has the following attributes:
Attribute | Type | Description (RECORD TYPE) |
---|---|---|
chid | str | chain identifier |
name | str | name of the polymer (macromolecule) (COMPND) |
fragment | str | specifies a domain or region of the molecule (COMPND) |
synonyms | list | synonyms for the polymer (COMPND) |
ec | list | associated Enzyme Commission numbers (COMPND) |
engineered | bool | indicates that the polymer was produced using recombinant technology or by purely chemical synthesis (COMPND) |
mutation | bool | indicates presence of a mutation (COMPND) |
comments | str | additional comments |
sequence | str | polymer chain sequence (SEQRES) |
dbrefs | list | sequence database records (DBREF[1\|2] and SEQADV), see DBRef |
modified | list | modified residues (MODRES); when modified residues are present, each is represented as a tuple: (resname, resnum, icode, stdname, comment) |
pdbentry | str | PDB entry that polymer data is extracted from |
Polymer class instances can be obtained as follows:
In [7]: polymer = parsePDBHeader('2k39', 'polymers')[0]
In [8]: polymer
Out[8]: <Polymer: UBIQUITIN (2K39_A)>
In [9]: polymer.pdbentry
Out[9]: '2K39'
In [10]: polymer.chid
Out[10]: 'A'
In [11]: polymer.name
Out[11]: 'UBIQUITIN'
In [12]: polymer.sequence
Out[12]: 'MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG'
In [13]: len(polymer.sequence)
Out[13]: 76
In [14]: len(polymer)
Out[14]: 76
In [15]: dbref = polymer.dbrefs[0]
In [16]: dbref.database
Out[16]: 'UniProt'
In [17]: dbref.accession
Out[17]: 'P62972'
In [18]: dbref.idcode
Out[18]: 'UBIQ_XENLA'
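The `sequence` attribute is derived from SEQRES records, which list residues by three-letter name. As a rough sketch of that conversion (not ProDy's implementation), the hypothetical helper below reads the chain identifier from column 12 and the residue names from column 20 onward; the two sample lines are hand-built from the first 26 residues of the sequence shown above, and unknown residue names are mapped to `X` as an assumption of this sketch.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def sequences_from_seqres(lines):
    """Collect a one-letter sequence per chain from SEQRES records."""
    sequences = {}
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        chid = line[11]            # column 12: chain identifier
        names = line[19:].split()  # columns 20 onward: residue names
        letters = "".join(THREE_TO_ONE.get(name, "X") for name in names)
        sequences[chid] = sequences.get(chid, "") + letters
    return sequences


# Hand-built records covering only the first 26 of the 76 residues.
seqres = [
    "SEQRES   1 A   76  MET GLN ILE PHE VAL LYS THR LEU THR GLY LYS THR ILE",
    "SEQRES   2 A   76  THR LEU GLU VAL GLU PRO SER ASP THR ILE GLU ASN VAL",
]
print(sequences_from_seqres(seqres))
# {'A': 'MQIFVKTLTGKTITLEVEPSDTIENV'}
```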
A data structure for storing references to sequence databases for polymer components in PDB structures. Information is parsed from DBREF[1|2] and SEQADV records in the PDB header.
A DBRef instance has the following attributes:
Attribute | Description |
---|---|
accession | database accession code |
database | sequence database, one of UniProt, GenBank, Norine, UNIMES, or PDB |
dbabbr | database abbreviation, one of UNP, GB, NORINE, UNIMES, or PDB |
diff | list of differences between PDB and database sequences, each as (resname, resnum, icode, dbResname, dbResnum, comment) |
first | initial residue numbers, (resnum, icode, dbnum) |
idcode | database identification code, e.g. the entry name in UniProt |
last | ending residue numbers, (resnum, icode, dbnum) |
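To make the shape of this data concrete, here is a hedged sketch of how a single DBREF record could be split into such fields. It is not ProDy's parser: `parse_dbref_line` and the dictionary keys are labels chosen for this sketch, the column ranges are assumed from the PDB file format specification, and the sample line is hand-built (the 1-76 residue ranges in particular are illustrative, not read from the 2K39 file).

```python
# Abbreviation-to-name pairs as listed in the descriptions above.
DATABASES = {
    "UNP": "UniProt", "GB": "GenBank", "NORINE": "Norine",
    "UNIMES": "UNIMES", "PDB": "PDB",
}


def parse_dbref_line(line):
    """Split one DBREF record into a dictionary (illustrative only)."""
    line = line.ljust(68)  # pad so that slicing a short line is safe
    abbr = line[26:32].strip()             # columns 27-32
    return {
        "dbabbr": abbr,
        "database": DATABASES.get(abbr, abbr),
        "accession": line[33:41].strip(),  # columns 34-41
        "idcode": line[42:54].strip(),     # columns 43-54
        # (resnum, icode, dbnum) for the first and last aligned residues
        "first": (int(line[14:18]), line[18].strip(), int(line[55:60])),
        "last": (int(line[20:24]), line[24].strip(), int(line[62:67])),
    }


# Hand-built DBREF line; the residue ranges are assumed for illustration.
dbref_line = (
    f"{'DBREF':<6} {'2K39':<4} {'A'} {1:>4}{' '} {76:>4}{' '} "
    f"{'UNP':<6} {'P62972':<8} {'UBIQ_XENLA':<12} {1:>5}{' '} {76:>5}{' '}"
)
record = parse_dbref_line(dbref_line)
print(record["database"], record["accession"], record["idcode"])
```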
Return the header data dictionary for pdb. This function is equivalent to parsePDB(pdb, header=True, model=0, meta=False); as with parsePDB(), pdb may be a PDB identifier or a filename.
List of header records that are parsed.
Record type | Dictionary key(s) | Description |
---|---|---|
HEADER | classification, deposition_date, identifier | molecule classification, deposition date, PDB identifier |
TITLE | title | title for the experiment or analysis |
SPLIT | split | list of PDB entries that make up the whole structure when combined with this one |
COMPND | polymers | see Polymer |
EXPDTA | experiment | information about the experiment |
NUMMDL | n_models | number of models |
MDLTYP | model_type | additional structural annotation |
AUTHOR | authors | list of contributors |
JRNL | reference | literature citation for the entry |
DBREF[1\|2] | polymers | see Polymer and DBRef |
SEQADV | polymers | see Polymer |
SEQRES | polymers | see Polymer |
MODRES | polymers | see Polymer |
HELIX | polymers | see Polymer |
SHEET | polymers | see Polymer |
HET | chemicals | see Chemical |
HETNAM | chemicals | see Chemical |
HETSYN | chemicals | see Chemical |
FORMUL | chemicals | see Chemical |
REMARK 2 | resolution | resolution of structures, when applicable |
REMARK 4 | version | PDB file version |
REMARK 350 | biomoltrans | biomolecular transformation lines (unprocessed) |
Header records that are not parsed are: OBSLTE, CAVEAT, SOURCE, KEYWDS, REVDAT, SPRSDE, SSBOND, LINK, CISPEP, CRYST1, ORIGX1, ORIGX2, ORIGX3, MTRIX1, MTRIX2, MTRIX3, and any REMARK records not listed above.
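The record-to-key table above can be read as a lookup: the record name in the first six columns of a line decides which dictionary key, if any, the line contributes to. The sketch below encodes that table directly. It is only an illustration of the mapping, not how ProDy dispatches records internally; `keys_for_line` is a hypothetical helper and the sample lines in the usage are made up.

```python
POLYMER_RECORDS = ("COMPND", "DBREF", "DBREF1", "DBREF2", "SEQADV",
                   "SEQRES", "MODRES", "HELIX", "SHEET")
CHEMICAL_RECORDS = ("HET", "HETNAM", "HETSYN", "FORMUL")

# Record name -> dictionary keys, copied from the table above.
RECORD_KEYS = {
    "HEADER": ("classification", "deposition_date", "identifier"),
    "TITLE": ("title",),
    "SPLIT": ("split",),
    "EXPDTA": ("experiment",),
    "NUMMDL": ("n_models",),
    "MDLTYP": ("model_type",),
    "AUTHOR": ("authors",),
    "JRNL": ("reference",),
}
RECORD_KEYS.update({name: ("polymers",) for name in POLYMER_RECORDS})
RECORD_KEYS.update({name: ("chemicals",) for name in CHEMICAL_RECORDS})

# REMARK number -> dictionary key, also from the table above.
REMARK_KEYS = {2: "resolution", 4: "version", 350: "biomoltrans"}


def keys_for_line(line):
    """Return the header dictionary keys a line would feed, or ()."""
    record = line[:6].strip()
    if record == "REMARK":
        try:
            number = int(line[6:10])  # remark number, columns 8-10
        except ValueError:
            return ()
        key = REMARK_KEYS.get(number)
        return (key,) if key else ()
    return RECORD_KEYS.get(record, ())


for sample in ("TITLE     EXAMPLE TITLE",
               "REMARK   2 RESOLUTION.    2.00 ANGSTROMS.",
               "CRYST1   50.000   60.000   70.000"):
    print(sample[:6].strip(), "->", keys_for_line(sample))
```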
Assign secondary structure from the header dictionary to atoms. header must be a dictionary parsed using parsePDB(). atoms may be an instance of AtomGroup, Selection, Chain or Residue. ProDy can be configured to parse and assign secondary structure information automatically using confProDy(auto_secondary=True). See also the confProDy() function.
Single-letter codes of the Dictionary of Protein Secondary Structure (DSSP) type are used:
- G = 3-turn helix (3-10 helix). Min length 3 residues.
- H = 4-turn helix (alpha helix). Min length 4 residues.
- I = 5-turn helix (pi helix). Min length 5 residues.
- T = hydrogen bonded turn (3, 4 or 5 turn)
- E = extended strand in parallel and/or anti-parallel beta-sheet conformation. Min length 2 residues.
- B = residue in isolated beta-bridge (single pair beta-sheet hydrogen bond formation)
- S = bend (the only non-hydrogen-bond based assignment).
- C = residues not in one of above conformations.
See http://en.wikipedia.org/wiki/Protein_secondary_structure#The_DSSP_code for more details.
The following PDB helix classes are omitted (class number in parentheses):
- Right-handed omega (2)
- Right-handed gamma (4)
- Left-handed alpha (6)
- Left-handed omega (7)
- Left-handed gamma (8)
- 2-7 ribbon/helix (9)
- Polyproline (10)
Secondary structures are assigned to all atoms in a residue. Amino acid residues without any secondary structure assignments in the header section will be assigned coil (C) conformation. This can be prevented by passing coil=False argument.
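The rules above can be summarized in a small per-residue sketch. This is an illustration under stated assumptions, not ProDy's implementation: `assign_codes` is a hypothetical helper working on one chain with plain residue numbers, and the mapping of the retained helix classes (1, 3, 5) to H, I, and G is assumed from the PDB helix class definitions (right-handed alpha, pi, and 3-10 helices, respectively).

```python
# Assumed mapping for the helix classes that are not omitted.
HELIX_CLASS_TO_CODE = {1: "H", 3: "I", 5: "G"}


def assign_codes(resnums, helices, strands, coil=True):
    """Return {resnum: code} for a single chain.

    helices: iterable of (start, end, helix_class)
    strands: iterable of (start, end)
    coil:    when True, residues without an assignment get "C";
             when False, they are left as an empty string.
    """
    codes = {resnum: ("C" if coil else "") for resnum in resnums}
    for start, end, helix_class in helices:
        code = HELIX_CLASS_TO_CODE.get(helix_class)
        if code is None:
            continue  # omitted helix class, e.g. polyproline (10)
        for resnum in range(start, end + 1):
            if resnum in codes:
                codes[resnum] = code
    for start, end in strands:
        for resnum in range(start, end + 1):
            if resnum in codes:
                codes[resnum] = "E"
    return codes


resnums = list(range(1, 13))
helices = [(2, 5, 1), (7, 8, 10)]  # an alpha helix and a polyproline helix
strands = [(10, 11)]
codes = assign_codes(resnums, helices, strands)
print("".join(codes[r] for r in resnums))
# CHHHHCCCCEEC
```

In this example, residues 7 and 8 fall in an omitted helix class, so they receive coil like any other unassigned residue.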
Return atoms after applying biomolecular transformations from header dictionary. Biomolecular transformations are applied to all coordinate sets in the molecule.
Some PDB files contain transformations for more than one biomolecule. A specific set of transformations can be chosen using the biomol argument. Transformation sets are identified by numbers, e.g. "1", "2", ...
If multiple biomolecular transformations are provided in the header dictionary, biomolecules will be returned as AtomGroup instances in a list.
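As a sketch of what a biomolecular transformation involves, the code below groups REMARK 350 BIOMT rows by biomolecule number and applies one operator (a 3x3 rotation plus a translation) to a list of coordinates. It is a simplified illustration, not ProDy's implementation: `parse_biomt` and `apply_transform` are hypothetical helpers, the sample lines are made up, chain selection lines are ignored, and whitespace splitting assumes the numeric fields do not run together.

```python
def parse_biomt(lines):
    """Return {biomol: {operator: [row1, row2, row3]}}.

    Each row is (r1, r2, r3, t): three rotation terms and a translation.
    """
    transforms = {}
    biomol = None
    for line in lines:
        if not line.startswith("REMARK 350"):
            continue
        text = line[10:].strip()
        if text.startswith("BIOMOLECULE:"):
            biomol = text.split(":")[1].strip()
        elif text.startswith("BIOMT"):
            tokens = text.split()
            operator = tokens[1]
            row = tuple(float(value) for value in tokens[2:6])
            transforms.setdefault(biomol, {}).setdefault(operator, []).append(row)
    return transforms


def apply_transform(rows, coords):
    """Apply x' = R x + t to every (x, y, z) in coords."""
    return [
        tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in rows)
        for x, y, z in coords
    ]


remarks = [
    "REMARK 350 BIOMOLECULE: 1",
    "REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000",
    "REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000",
    "REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000",
    "REMARK 350   BIOMT1   2 -1.000000  0.000000  0.000000       10.00000",
    "REMARK 350   BIOMT2   2  0.000000 -1.000000  0.000000        0.00000",
    "REMARK 350   BIOMT3   2  0.000000  0.000000  1.000000        0.00000",
]
transforms = parse_biomt(remarks)
moved = apply_transform(transforms["1"]["2"], [(1.0, 2.0, 3.0)])
print(moved)
# [(9.0, -2.0, 3.0)]
```

Biomolecule "1" here consists of the identity operator plus a two-fold rotation about the z axis with a shift along x; a full assembly would be the union of the coordinates produced by each operator.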
If the resulting biomolecule has more than 26 chains, the molecular assembly will be split into multiple AtomGroup instances each containing at most 26 chains. These AtomGroup instances will be returned in a tuple.
Note that atoms in biomolecules are ordered according to chain identifiers.
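The 26-chain limit described above follows from single-letter chain identifiers. A minimal sketch of the splitting step is shown here; `split_chains` is a hypothetical helper, and relabeling each group with the letters A to Z is an assumption of this sketch rather than documented behavior.

```python
from string import ascii_uppercase


def split_chains(chains, limit=26):
    """Split chain copies into groups of at most `limit` members.

    Returns a tuple of groups; each group is a list of
    (new_chain_id, chain) pairs labeled A, B, C, ...
    """
    if not 1 <= limit <= len(ascii_uppercase):
        raise ValueError("limit must be between 1 and 26")
    groups = []
    for start in range(0, len(chains), limit):
        chunk = chains[start:start + limit]
        groups.append(list(zip(ascii_uppercase, chunk)))
    return tuple(groups)


# Placeholder chain copies, e.g. 60 copies produced by 60 operators.
copies = [f"copy_{index}" for index in range(60)]
groups = split_chains(copies)
print([len(group) for group in groups])
# [26, 26, 8]
```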