A Python 3 script to clean up PDB files
Most of the time, PDB files are complicated and contain a lot of redundant information, as shown below.
- ANISOU (data copied from 1lk2.pdb)
ATOM 1 N GLY A 1 66.440 45.780 5.177 1.00 14.10 N ANISOU 1 N GLY A 1 1908 1789 1659 99 -37 -3 N ATOM 2 CA GLY A 1 65.947 45.284 3.863 1.00 12.08 C ANISOU 2 CA GLY A 1 1484 1486 1620 47 -39 75 C ATOM 3 C GLY A 1 64.961 46.275 3.303 1.00 10.99 C ANISOU 3 C GLY A 1 1471 1204 1500 36 50 108 C ATOM 4 O GLY A 1 64.683 47.291 3.943 1.00 11.91 O ANISOU 4 O GLY A 1 1390 1542 1593 -61 88 -19 O
The simplest way is to delete the ANISOU lines.
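A minimal sketch of this step, assuming the file has already been read into a list of lines. The function name `drop_anisou` is hypothetical and not taken from the script; it relies only on the fact that the record name occupies the first six columns of a PDB line.

```python
def drop_anisou(lines):
    """Return the records whose record name (columns 1-6) is not ANISOU."""
    return [line for line in lines if line[:6].strip() != "ANISOU"]


# Three records taken from the 1lk2.pdb excerpt above.
records = [
    "ATOM      1  N   GLY A   1      66.440  45.780   5.177  1.00 14.10           N",
    "ANISOU    1  N   GLY A   1     1908   1789   1659     99    -37     -3       N",
    "ATOM      2  CA  GLY A   1      65.947  45.284   3.863  1.00 12.08           C",
]
kept = drop_anisou(records)
```

Only the two ATOM records remain in `kept`; nothing else in the lines is modified.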
- HETATM
  - non-standard amino acid residues (data copied from 2o2x.pdb)
ATOM 821 OD1 ASP A 112 25.580 11.019 35.906 1.00 12.28 O ATOM 822 OD2 ASP A 112 24.586 9.016 35.848 1.00 11.81 O HETATM 823 N MSE A 113 25.018 10.050 30.641 1.00 9.26 N HETATM 824 CA MSE A 113 25.494 10.026 29.262 1.00 9.59 C HETATM 825 C MSE A 113 24.291 9.758 28.359 1.00 8.63 C HETATM 826 O MSE A 113 23.362 9.026 28.750 1.00 9.51 O HETATM 827 CB MSE A 113 26.563 8.959 29.078 1.00 8.81 C HETATM 828 CG MSE A 113 27.157 8.896 27.700 1.00 8.23 C HETATM 829 SE MSE A 113 28.681 7.732 27.499 0.75 12.65 SE HETATM 830 CE MSE A 113 30.013 8.895 28.258 1.00 18.85 C ATOM 831 N VAL A 114 24.306 10.362 27.178 1.00 7.56 N ATOM 832 CA VAL A 114 23.308 10.072 26.129 1.00 7.93 C
Since PDBSlicer cannot deal with non-standard residues, the simplest way is to delete them.
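One way to express this filter is to keep only residues whose name is one of the 20 standard amino acids. The sketch below is dependency-free and uses plain dictionaries; the keys `ResName`, `Seq_Num` and `AtomTyp` mirror column names that appear elsewhere in this document, while `Record`, `STANDARD_RESIDUES` and `drop_nonstandard` are hypothetical names introduced for illustration.

```python
STANDARD_RESIDUES = frozenset({
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
})


def drop_nonstandard(atoms):
    """Keep only atoms that belong to one of the 20 standard residues."""
    return [atom for atom in atoms if atom["ResName"] in STANDARD_RESIDUES]


# A few atoms around the MSE residue in the 2o2x.pdb excerpt above.
atoms = [
    {"Record": "ATOM", "AtomTyp": "OD2", "ResName": "ASP", "Seq_Num": 112},
    {"Record": "HETATM", "AtomTyp": "N", "ResName": "MSE", "Seq_Num": 113},
    {"Record": "HETATM", "AtomTyp": "SE", "ResName": "MSE", "Seq_Num": 113},
    {"Record": "ATOM", "AtomTyp": "N", "ResName": "VAL", "Seq_Num": 114},
]
cleaned = drop_nonstandard(atoms)
```

After this filter, residue 113 is gone, so the sequence numbers jump from 112 to 114.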
- Missing Residue(s) or so-called Sequence Gap(s) (data copied from 1nzj.pdb)
ATOM 1756 O LEU A 222 48.274 3.534 34.949 1.00 27.98 O ANISOU 1756 O LEU A 222 3531 3513 3584 47 -26 6 O ATOM 1757 CB LEU A 222 45.906 2.133 33.476 1.00 26.02 C ANISOU 1757 CB LEU A 222 3274 3295 3315 29 58 -5 C ATOM 1758 N ASN A 223 47.050 5.216 34.027 1.00 28.96 N ANISOU 1758 N ASN A 223 3698 3595 3707 53 7 21 N ATOM 1759 CA ASN A 223 47.326 6.262 35.028 1.00 29.62 C ANISOU 1759 CA ASN A 223 3782 3735 3737 11 12 -25 C ATOM 1760 C ASN A 223 48.230 5.851 36.192 1.00 30.19 C ANISOU 1760 C ASN A 223 3871 3833 3767 45 -3 -6 C ATOM 1761 O ASN A 223 47.951 6.165 37.354 1.00 31.23 O ANISOU 1761 O ASN A 223 4074 3965 3824 76 26 -70 O ATOM 1762 CB ASN A 223 46.003 6.831 35.561 1.00 29.98 C ANISOU 1762 CB ASN A 223 3798 3785 3804 36 15 8 C ATOM 1763 N ALA A 237 50.141 13.856 28.172 1.00 30.51 N ANISOU 1763 N ALA A 237 3895 3875 3821 32 28 -17 N ATOM 1764 CA ALA A 237 50.857 13.904 26.900 1.00 30.22 C ANISOU 1764 CA ALA A 237 3816 3837 3827 7 2 7 C ATOM 1765 C ALA A 237 52.347 13.656 27.124 1.00 30.06 C ANISOU 1765 C ALA A 237 3809 3808 3803 26 11 -16 C ATOM 1766 O ALA A 237 52.869 13.962 28.189 1.00 30.54 O ANISOU 1766 O ALA A 237 3901 3866 3834 34 -52 -8 O ATOM 1767 CB ALA A 237 50.648 15.254 26.254 1.00 30.32 C ANISOU 1767 CB ALA A 237 3814 3832 3871 20 0 0 C ATOM 1768 N LEU A 238 53.035 13.117 26.121 1.00 29.79 N ANISOU 1768 N LEU A 238 3760 3773 3785 9 -4 -12 N ATOM 1769 CA LEU A 238 54.470 12.845 26.250 1.00 29.52 C ANISOU 1769 CA LEU A 238 3743 3740 3733 9 9 -18 C
In the data above, the sequence numbers are discontinuous (223 ...empty... 237) because residues are missing. We call this case a sequence gap. It is a serious problem because the Ramachandran subunit is defined by three adjacent residues, so it is impossible to directly choose the residue number series (222, 223, 237) or (223, 237, 238) as the members of a Ramachandran subunit. The solution is to treat the peptide as segments, e.g. from the beginning to residue number 223, then from residue number 237 to the end. If the PDB file has more than one gap, it is divided into several segments based on the locations of the gaps. Note: a discontinuity in sequence numbers between different chains is also treated as a 'gap', simply because this is easier to program.
Improvement (Nov. 16, 2017) In the printed output and the report, the chain ID is now shown alongside the sequence number, e.g. ('A:223', 'A:237'). Previously, only the sequence numbers on either side of each gap were shown.
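The segment idea can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the residues are given as an ordered list of `(chain_id, seq_num)` pairs, insertion codes have already been removed, and the function names `split_into_segments` and `triplets` are hypothetical. Residues 221 and 239 are added to the toy list only so that each segment is long enough to form a triplet.

```python
def split_into_segments(residues):
    """Split an ordered list of (chain_id, seq_num) pairs at every gap.

    A new segment starts whenever the chain changes or the sequence
    number is not exactly one more than the previous one.
    """
    segments = []
    for chain_id, seq_num in residues:
        if segments:
            prev_chain, prev_num = segments[-1][-1]
            if chain_id == prev_chain and seq_num == prev_num + 1:
                segments[-1].append((chain_id, seq_num))
                continue
        segments.append([(chain_id, seq_num)])
    return segments


def triplets(segment):
    """Return every run of three adjacent residues within one segment."""
    return [tuple(segment[i:i + 3]) for i in range(len(segment) - 2)]


# Toy residue list around the gap in the 1nzj.pdb excerpt.
residues = [("A", 221), ("A", 222), ("A", 223),
            ("A", 237), ("A", 238), ("A", 239)]
segments = split_into_segments(residues)

# Report each gap in the 'chain:number' style used in the report.
gaps = [(f"{a[-1][0]}:{a[-1][1]}", f"{b[0][0]}:{b[0][1]}")
        for a, b in zip(segments, segments[1:])]
```

Because triplets are built inside each segment only, a combination such as (222, 223, 237) can never be produced.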
- alternate locations (data copied from 3ife.pdb)
ATOM 21 CE1 PHE A -4 40.991 47.856 19.364 1.00 27.65 C ATOM 22 CE2 PHE A -4 41.948 49.936 20.068 1.00 28.56 C ATOM 23 CZ PHE A -4 40.841 49.190 19.686 1.00 28.08 C ATOM 24 N AGLN A -3 46.967 45.549 21.004 0.50 23.13 N ATOM 25 N BGLN A -3 46.998 45.555 20.982 0.50 22.90 N ATOM 26 CA AGLN A -3 48.373 45.164 21.046 0.50 23.16 C ATOM 27 CA BGLN A -3 48.400 45.139 20.949 0.50 22.72 C ATOM 28 C AGLN A -3 48.567 43.661 20.812 0.50 22.63 C ATOM 29 C BGLN A -3 48.554 43.631 20.764 0.50 22.37 C ATOM 30 O AGLN A -3 49.384 43.259 19.986 0.50 21.47 O ATOM 31 O BGLN A -3 49.344 43.191 19.930 0.50 21.20 O ATOM 32 CB AGLN A -3 49.002 45.601 22.384 0.50 23.61 C ATOM 33 CB BGLN A -3 49.160 45.591 22.201 0.50 23.04 C ATOM 34 CG AGLN A -3 48.488 44.854 23.614 0.50 25.25 C ATOM 35 CG BGLN A -3 50.631 45.172 22.185 0.50 23.58 C ATOM 36 CD AGLN A -3 46.975 44.863 23.719 0.50 25.86 C ATOM 37 CD BGLN A -3 51.390 45.619 23.424 0.50 24.16 C ATOM 38 OE1AGLN A -3 46.364 43.846 24.056 0.50 23.20 O ATOM 39 OE1BGLN A -3 50.935 46.485 24.167 0.50 26.65 O ATOM 40 NE2AGLN A -3 46.361 45.990 23.375 0.50 25.35 N ATOM 41 NE2BGLN A -3 52.563 45.035 23.640 0.50 27.37 N ATOM 42 N SER A -2 47.792 42.842 21.521 1.00 21.72 N ATOM 43 CA SER A -2 47.888 41.386 21.401 1.00 22.23 C ATOM 44 C SER A -2 47.402 40.921 20.036 1.00 19.65 C ATOM 45 O SER A -2 48.008 40.034 19.456 1.00 20.72 O
- special cases in alternate locations (data copied from 5DXX.pdb)
ATOM 448 N MET A 61 48.127 9.414 21.012 1.00 8.02 N ANISOU 448 N MET A 61 952 878 1219 -50 501 95 N ATOM 449 CA AMET A 61 47.494 8.918 22.231 0.58 8.39 C ANISOU 449 CA AMET A 61 1091 827 1271 24 428 219 C ATOM 450 CA BMET A 61 47.420 8.922 22.202 0.42 8.88 C ANISOU 450 CA BMET A 61 1144 895 1334 -61 457 185 C ATOM 451 C MET A 61 47.346 7.404 22.267 1.00 8.78 C ANISOU 451 C MET A 61 1223 782 1330 59 378 169 C ATOM 452 O MET A 61 46.991 6.766 21.272 1.00 10.08 O ANISOU 452 O MET A 61 1398 943 1491 14 75 122 O ATOM 453 CB AMET A 61 46.118 9.546 22.410 0.58 8.06 C ANISOU 453 CB AMET A 61 903 838 1320 372 380 212 C ATOM 454 CB BMET A 61 45.980 9.455 22.241 0.42 8.97 C ANISOU 454 CB BMET A 61 930 991 1486 2 458 168 C ATOM 455 CG AMET A 61 46.138 11.063 22.501 0.58 8.72 C ANISOU 455 CG AMET A 61 1253 809 1251 307 330 165 C ATOM 456 CG BMET A 61 45.805 10.973 22.171 0.42 9.66 C ANISOU 456 CG BMET A 61 1110 1045 1516 57 274 111 C ATOM 457 SD AMET A 61 44.516 11.746 22.852 0.58 9.87 S ANISOU 457 SD AMET A 61 1393 1136 1221 329 227 121 S ATOM 458 SD BMET A 61 44.071 11.452 21.925 0.42 11.25 S ANISOU 458 SD BMET A 61 1357 1344 1573 -97 206 -11 S ATOM 459 CE AMET A 61 43.632 11.262 21.374 0.58 8.79 C ANISOU 459 CE AMET A 61 912 1125 1304 448 193 62 C ATOM 460 CE BMET A 61 43.308 10.818 23.419 0.42 10.70 C ANISOU 460 CE BMET A 61 1120 1371 1573 26 296 -15 C ... 
ATOM 2041 N ARG A 268 68.983 -6.030 20.233 1.00 12.62 N ANISOU 2041 N ARG A 268 1676 819 2299 101 -35 523 N ATOM 2042 CA BARG A 268 68.988 -4.603 20.530 0.60 12.88 C ANISOU 2042 CA BARG A 268 1398 984 2513 141 107 402 C ATOM 2043 CA CARG A 268 68.989 -4.603 20.527 0.40 12.83 C ANISOU 2043 CA CARG A 268 1483 920 2471 82 157 473 C ATOM 2044 C ARG A 268 67.641 -3.953 20.247 1.00 11.56 C ANISOU 2044 C ARG A 268 1170 935 2286 -23 70 342 C ATOM 2045 O ARG A 268 66.930 -4.345 19.316 1.00 12.73 O ANISOU 2045 O ARG A 268 1496 1160 2181 -23 37 354 O ATOM 2046 CB BARG A 268 70.061 -3.890 19.701 0.60 15.09 C ANISOU 2046 CB BARG A 268 1382 1451 2901 308 108 405 C ATOM 2047 CB CARG A 268 70.065 -3.894 19.699 0.40 14.76 C ANISOU 2047 CB CARG A 268 1631 1189 2787 164 317 597 C ATOM 2048 CG BARG A 268 71.428 -4.538 19.755 0.60 20.70 C ANISOU 2048 CG BARG A 268 2380 2183 3300 506 138 227 C ATOM 2049 CG CARG A 268 71.458 -4.466 19.860 0.40 18.82 C ANISOU 2049 CG CARG A 268 2367 1677 3108 350 438 578 C ATOM 2050 CD BARG A 268 72.280 -3.968 20.869 0.60 24.96 C ANISOU 2050 CD BARG A 268 3408 2535 3540 600 301 -84 C ATOM 2051 CD CARG A 268 72.378 -3.477 20.542 0.40 22.04 C ANISOU 2051 CD CARG A 268 3150 1893 3329 467 727 531 C ATOM 2052 NE BARG A 268 73.616 -4.559 20.871 0.60 27.23 N ANISOU 2052 NE BARG A 268 3846 2843 3658 816 402 -233 N ATOM 2053 NE CARG A 268 73.461 -3.031 19.670 0.40 25.28 N ANISOU 2053 NE CARG A 268 3900 2201 3505 545 882 478 N ATOM 2054 CZ BARG A 268 74.606 -4.169 20.074 0.60 29.73 C ANISOU 2054 CZ BARG A 268 4396 3111 3790 1084 535 -418 C ATOM 2055 CZ CARG A 268 74.657 -3.607 19.612 0.40 28.07 C ANISOU 2055 CZ CARG A 268 4528 2513 3625 423 993 395 C ATOM 2056 NH1BARG A 268 74.412 -3.180 19.206 0.60 30.54 N ANISOU 2056 NH1BARG A 268 4601 3217 3787 1268 624 -504 N ATOM 2057 NH1CARG A 268 74.925 -4.665 20.369 0.40 29.81 N ANISOU 2057 NH1CARG A 268 4898 2709 3720 472 964 299 N ATOM 2058 NH2BARG A 268 75.794 -4.766 20.144 0.60 30.36 N ANISOU 2058 NH2BARG A 268 4511 
3196 3828 1150 633 -562 N ATOM 2059 NH2CARG A 268 75.586 -3.125 18.795 0.40 28.14 N ANISOU 2059 NH2CARG A 268 4497 2583 3613 248 1151 380 N
In this case (5DXX.pdb), there are three different alternate location labels, A, B, and C (the character immediately before the residue name, column 17 of the record). However, they are distributed irregularly. For instance, in residue 61, A and B appear, whereas in residue 268, B and C appear. As a result, it is impossible to simply use `pdb_info[(pdb_info.Alt_Loc == ' ') | (pdb_info.Alt_Loc == 'A')]`, because that would delete all B- and C-labeled atoms in residue 268!
Improvement or Debug (Sep. 04, 2017) By using pandas df.groupby() on the ['Seq_Num', 'ChainID'] columns, we can focus on each specific residue and keep the first alternate location, no matter whether the first one is 'A', 'B', or 'C'. The code is shown below.
#### delete the redundant alternate locations, only keep the first appearance
if altloc:
    groups = pdb_info.groupby(['Seq_Num', 'ChainID'], sort=False)
    # within each residue that has two or more Alt_Loc labels,
    # keep only the first record of every atom type
    pdb_info = groups.apply(lambda x:
                            x.drop_duplicates(subset=["AtomTyp"],
                                              keep='first')
                            if x['Alt_Loc'].nunique() >= 2 else x)
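The same "keep the first appearance" rule can be illustrated without pandas. The sketch below is a dependency-free stand-in for the groupby logic, not the script's code: it assumes atoms are dictionaries in file order, that insertion codes have already been removed, and the function name `keep_first_altloc` is hypothetical.

```python
def keep_first_altloc(atoms):
    """Keep, for each residue, only the first record of each atom name."""
    seen = set()
    kept = []
    for atom in atoms:
        key = (atom["ChainID"], atom["Seq_Num"], atom["AtomTyp"])
        if key not in seen:
            seen.add(key)
            kept.append(atom)
    return kept


# Reduced version of the 5DXX.pdb excerpt: residue 61 uses A/B,
# residue 268 uses B/C.
atoms = [
    {"ChainID": "A", "Seq_Num": 61, "AtomTyp": "N", "Alt_Loc": " "},
    {"ChainID": "A", "Seq_Num": 61, "AtomTyp": "CA", "Alt_Loc": "A"},
    {"ChainID": "A", "Seq_Num": 61, "AtomTyp": "CA", "Alt_Loc": "B"},
    {"ChainID": "A", "Seq_Num": 268, "AtomTyp": "N", "Alt_Loc": " "},
    {"ChainID": "A", "Seq_Num": 268, "AtomTyp": "CA", "Alt_Loc": "B"},
    {"ChainID": "A", "Seq_Num": 268, "AtomTyp": "CA", "Alt_Loc": "C"},
]
kept = keep_first_altloc(atoms)

# The naive filter on ' ' or 'A' loses the CA atom of residue 268.
naive = [atom for atom in atoms if atom["Alt_Loc"] in (" ", "A")]
```

Here `kept` retains one CA per residue (label A for residue 61, label B for residue 268), while `naive` leaves residue 268 without a CA atom.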
- insertion codes
ATOM 1258 CD1 ILE A 185 4.002 11.557 18.921 1.00 19.47 C ANISOU 1258 CD1 ILE A 185 2567 2632 2200 -66 -252 125 C ATOM 1259 N PRO A 186 6.584 15.226 16.396 1.00 16.95 N ANISOU 1259 N PRO A 186 2324 2351 1766 -93 -218 271 N ATOM 1260 CA PRO A 186 6.984 16.463 15.718 1.00 17.27 C ANISOU 1260 CA PRO A 186 2382 2394 1786 -103 -219 330 C ATOM 1261 C PRO A 186 6.139 17.642 16.167 1.00 19.26 C ANISOU 1261 C PRO A 186 2626 2598 2094 -86 -245 374 C ATOM 1262 O PRO A 186 4.907 17.532 16.301 1.00 18.40 O ANISOU 1262 O PRO A 186 2500 2480 2011 -67 -280 374 O ATOM 1263 CB PRO A 186 6.742 16.159 14.234 1.00 20.29 C ANISOU 1263 CB PRO A 186 2785 2831 2092 -115 -244 345 C ATOM 1264 CG PRO A 186 6.728 14.695 14.124 1.00 25.31 C ANISOU 1264 CG PRO A 186 3421 3497 2701 -115 -240 282 C ATOM 1265 CD PRO A 186 6.252 14.151 15.432 1.00 19.86 C ANISOU 1265 CD PRO A 186 2702 2765 2078 -100 -238 244 C ATOM 1266 N ASP A 186A 6.812 18.768 16.413 1.00 16.88 N ANISOU 1266 N ASP A 186A 2335 2266 1814 -93 -227 410 N ATOM 1267 CA ASP A 186A 6.193 20.046 16.803 1.00 18.33 C ANISOU 1267 CA ASP A 186A 2517 2396 2051 -76 -248 453 C ATOM 1268 C ASP A 186A 5.389 19.957 18.110 1.00 21.71 C ANISOU 1268 C ASP A 186A 2920 2782 2548 -46 -251 420 C ATOM 1269 O ASP A 186A 4.477 20.754 18.337 1.00 23.99 O ANISOU 1269 O ASP A 186A 3201 3034 2879 -21 -276 447 O ATOM 1270 CB ASP A 186A 5.342 20.626 15.640 1.00 20.86 C ANISOU 1270 CB ASP A 186A 2848 2731 2345 -71 -295 510 C ATOM 1271 CG ASP A 186A 6.138 20.870 14.377 1.00 27.21 C ANISOU 1271 CG ASP A 186A 3681 3578 3078 -102 -290 551 C ATOM 1272 OD1 ASP A 186A 7.316 21.272 14.485 1.00 27.14 O ANISOU 1272 OD1 ASP A 186A 3686 3563 3064 -125 -254 561 O ATOM 1273 OD2 ASP A 186A 5.578 20.677 13.277 1.00 34.63 O ANISOU 1273 OD2 ASP A 186A 4630 4560 3967 -104 -324 575 O ATOM 1274 N SER A 186B 5.742 18.999 18.983 1.00 16.28 N ANISOU 1274 N SER A 186B 2218 2098 1871 -47 -223 364 N ATOM 1275 CA SER A 186B 5.050 18.813 20.239 1.00 16.16 C ANISOU 1275 CA SER A 186B 2178 2050 
1911 -22 -220 332 C ATOM 1276 C SER A 186B 6.014 18.876 21.407 1.00 16.84 C ANISOU 1276 C SER A 186B 2267 2109 2024 -28 -181 302 C ATOM 1277 O SER A 186B 7.167 18.490 21.277 1.00 17.17 O ANISOU 1277 O SER A 186B 2317 2170 2035 -52 -156 289 O ATOM 1278 CB SER A 186B 4.378 17.452 20.244 1.00 17.47 C ANISOU 1278 CB SER A 186B 2323 2250 2066 -18 -229 294 C ATOM 1279 OG SER A 186B 3.785 17.181 21.503 1.00 16.37 O ANISOU 1279 OG SER A 186B 2158 2085 1978 2 -220 264 O ATOM 1280 N LYS A 187 5.518 19.323 22.546 1.00 14.62 N ANISOU 1280 N LYS A 187 1974 1786 1795 -5 -177 290 N
As shown above, the same sequence number (186) carries two insertion codes (A and B), yet the inserted records belong to two different residues, ASP and SER! The simplest way is to delete the residues labeled with insertion codes.
- When saving the cleaned results, I found another alignment issue (data copied from 1BTY.pdb)
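A minimal sketch of this deletion on raw lines, assuming fixed-column PDB records in which the insertion code sits in column 27 (Python index 26). The function name `drop_insertion_codes` is hypothetical.

```python
def drop_insertion_codes(lines):
    """Drop ATOM/HETATM records whose insertion code (column 27) is set."""
    kept = []
    for line in lines:
        is_atom = line.startswith(("ATOM", "HETATM"))
        if is_atom and len(line) > 26 and line[26] != " ":
            continue
        kept.append(line)
    return kept


# Backbone N atoms of residues 186, 186A, 186B and 187 from the excerpt above.
records = [
    "ATOM   1259  N   PRO A 186       6.584  15.226  16.396  1.00 16.95           N",
    "ATOM   1266  N   ASP A 186A      6.812  18.768  16.413  1.00 16.88           N",
    "ATOM   1274  N   SER A 186B      5.742  18.999  18.983  1.00 16.28           N",
    "ATOM   1280  N   LYS A 187       5.518  19.323  22.546  1.00 14.62           N",
]
kept = drop_insertion_codes(records)
```

Only PRO 186 and LYS 187 survive, so the sequence numbers stay consecutive and no new gap is introduced in this example.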
ATOM 1 N ILE A 16 35.700 19.589 20.234 1.00 10.94 N ATOM 2 CA ILE A 16 35.550 20.497 19.066 1.00 10.97 C ATOM 3 C ILE A 16 36.807 20.237 18.234 1.00 9.79 C ATOM 4 O ILE A 16 37.894 20.256 18.772 1.00 10.26 O ATOM 5 CB ILE A 16 35.544 21.989 19.514 1.00 11.47 C ATOM 6 CG1 ILE A 16 34.399 22.321 20.484 1.00 12.32 C ATOM 7 CG2 ILE A 16 35.560 22.968 18.278 1.00 12.30 C ATOM 8 CD1 ILE A 16 33.034 22.335 19.785 1.00 13.18 C ATOM 9 HA ILE A 16 34.673 20.230 18.499 1.00 10.47 H ATOM 10 HB ILE A 16 36.473 22.161 20.042 1.00 11.57 H ATOM 11 HG12 ILE A 16 34.396 21.655 21.334 1.00 11.90 H ATOM 12 HG13 ILE A 16 34.579 23.313 20.881 1.00 12.02 H ATOM 13 HG21 ILE A 16 34.717 22.818 17.621 1.00 11.98 H ATOM 14 HG22 ILE A 16 35.548 23.994 18.620 1.00 12.00 H ATOM 15 HG23 ILE A 16 36.462 22.839 17.694 1.00 11.56 H ATOM 16 HD11 ILE A 16 32.786 21.397 19.326 1.00 12.90 H ATOM 17 HD12 ILE A 16 32.266 22.577 20.509 1.00 12.70 H ATOM 18 HD13 ILE A 16 33.010 23.114 19.032 1.00 12.55 H ATOM 19 N VAL A 17 36.640 20.021 16.964 1.00 11.69 N ATOM 20 CA VAL A 17 37.785 19.760 16.052 1.00 10.93 C ATOM 21 C VAL A 17 37.896 21.020 15.170 1.00 9.18 C ATOM 22 O VAL A 17 36.905 21.499 14.639 1.00 11.67 O ATOM 23 CB VAL A 17 37.466 18.517 15.170 1.00 12.02 C ATOM 24 CG1 VAL A 17 38.603 18.296 14.156 1.00 11.39 C ATOM 25 CG2 VAL A 17 37.202 17.225 16.050 1.00 13.94 C ATOM 26 H VAL A 17 35.748 20.036 16.564 1.00 11.21 H ATOM 27 HA VAL A 17 38.694 19.634 16.621 1.00 10.27 H ATOM 28 HB VAL A 17 36.577 18.735 14.593 1.00 11.73 H ATOM 29 HG11 VAL A 17 39.545 18.156 14.663 1.00 11.37 H ATOM 30 HG12 VAL A 17 38.402 17.438 13.536 1.00 11.87 H ATOM 31 HG13 VAL A 17 38.686 19.156 13.503 1.00 11.42 H ATOM 32 HG21 VAL A 17 38.046 16.986 16.679 1.00 12.88 H ATOM 33 HG22 VAL A 17 36.338 17.370 16.683 1.00 13.55 H ATOM 34 HG23 VAL A 17 36.989 16.368 15.427 1.00 13.73 H ATOM 35 N GLY A 18 39.101 21.479 15.085 1.00 10.03 N ATOM 36 CA GLY A 18 39.440 22.677 14.271 1.00 12.83 C ATOM 37 C GLY A 18 38.928 24.015 14.824 
1.00 14.65 C ATOM 38 O GLY A 18 38.710 24.947 14.072 1.00 14.74 O ATOM 39 H GLY A 18 39.816 21.025 15.573 1.00 10.61 H ATOM 40 HA2 GLY A 18 40.513 22.729 14.176 1.00 11.84 H ATOM 41 HA3 GLY A 18 39.023 22.532 13.283 1.00 11.64 H
where "HG12", "HG13", "HG21", "HG22", "HG23", "HD11", "HD12", and "HD13" are shifted one character to the left compared with the preceding lines.
Improvement or Debug (Jul. 31, 2017) Implemented two printing formats to deal with this issue.
- Usually, PDB files do not contain hydrogen atoms (possibly because hydrogen atoms are highly mobile, or because of the limited resolution of X-ray crystallography). However, some PDB files, e.g. 5JRY.pdb, do record hydrogen atoms.
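The shift follows from the PDB convention for the four-character atom-name field (columns 13-16): names with a one-letter element symbol and up to three characters start in column 14, whereas four-character names such as HG12 start in column 13. The helper below is a sketch of that convention as I understand it; `format_atom_name` is a hypothetical name, and the script's own two formats may be implemented differently.

```python
def format_atom_name(name, element):
    """Place an atom name in the 4-character atom-name field (columns 13-16)."""
    if len(name) == 4 or len(element) == 2:
        # Four-character names and two-letter elements start in column 13.
        return f"{name:<4}"
    # Otherwise the name starts in column 14, leaving column 13 blank.
    return f" {name:<3}"


examples = {
    ("CA", "C"): format_atom_name("CA", "C"),        # alpha carbon
    ("HG12", "H"): format_atom_name("HG12", "H"),    # four-character hydrogen
    ("SE", "SE"): format_atom_name("SE", "SE"),      # selenium in MSE
}
```

Every returned string is exactly four characters wide, so the columns to the right of the atom name stay aligned regardless of the name length.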
ATOM 1 N MET A 1 3.164 22.103 135.939 1.00 28.43 N ANISOU 1 N MET A 1 3558 4003 3241 245 418 -535 N ATOM 2 CA MET A 1 3.182 20.676 135.533 1.00 27.33 C ANISOU 2 CA MET A 1 3398 3863 3124 254 483 -543 C ATOM 3 C MET A 1 3.889 20.519 134.187 1.00 26.06 C ANISOU 3 C MET A 1 3199 3710 2993 137 508 -477 C ATOM 4 O MET A 1 3.671 21.292 133.254 1.00 26.86 O ANISOU 4 O MET A 1 3353 3787 3064 176 487 -391 O ATOM 5 CB MET A 1 1.755 20.132 135.441 1.00 27.72 C ANISOU 5 CB MET A 1 3472 3915 3143 245 541 -550 C ATOM 6 H MET A 1 2.770 22.181 136.733 1.00 34.12 H ATOM 7 HA MET A 1 3.659 20.162 136.203 1.00 32.80 H ATOM 8 N LEU A 2 4.740 19.510 134.085 1.00 23.71 N ANISOU 8 N LEU A 2 2783 3443 2781 -63 570 -535 N ATOM 9 CA LEU A 2 5.389 19.232 132.817 1.00 21.69 C ANISOU 9 CA LEU A 2 2418 3187 2638 -244 572 -551 C ATOM 10 C LEU A 2 4.358 18.763 131.793 1.00 20.93 C ANISOU 10 C LEU A 2 2144 3167 2643 -261 488 -592 C ATOM 11 O LEU A 2 3.268 18.294 132.137 1.00 21.65 O ANISOU 11 O LEU A 2 2148 3333 2746 -381 548 -733 O ATOM 12 CB LEU A 2 6.449 18.148 133.007 1.00 20.95 C ANISOU 12 CB LEU A 2 2388 3018 2555 -362 577 -509 C ATOM 13 CG LEU A 2 7.526 18.465 134.041 1.00 20.97 C ANISOU 13 CG LEU A 2 2473 2948 2547 -376 504 -471 C ATOM 14 CD1 LEU A 2 8.523 17.324 134.149 1.00 21.51 C ANISOU 14 CD1 LEU A 2 2624 2963 2586 -440 390 -438 C ATOM 15 CD2 LEU A 2 8.236 19.754 133.671 1.00 21.33 C ANISOU 15 CD2 LEU A 2 2560 2944 2601 -381 439 -472 C ATOM 16 H LEU A 2 4.955 18.978 134.726 1.00 28.45 H ATOM 17 HA LEU A 2 5.819 20.035 132.484 1.00 26.03 H ATOM 18 HB2 LEU A 2 6.007 17.331 133.287 1.00 25.14 H ATOM 19 HB3 LEU A 2 6.894 18.000 132.158 1.00 25.14 H ATOM 20 HG LEU A 2 7.110 18.586 134.909 1.00 25.16 H ATOM 21 HD11 LEU A 2 9.193 17.553 134.812 1.00 25.81 H ATOM 22 HD12 LEU A 2 8.053 16.519 134.418 1.00 25.81 H ATOM 23 HD13 LEU A 2 8.943 17.189 133.285 1.00 25.81 H ATOM 24 HD21 LEU A 2 8.916 19.941 134.336 1.00 25.60 H ATOM 25 HD22 LEU A 2 8.646 19.648 132.798 1.00 25.60 H ATOM 26 HD23 LEU A 
2 7.588 20.475 133.648 1.00 25.60 H
Improvement (Nov. 16, 2017) For this case, I added a new option to the PDB_cleaner script that lets users choose whether or not to remove all hydrogen atoms.
Improvement (Dec. 13, 2017) Modified the script to use the 'Element' field (columns 77-78 of the PDB record, i.e. the Python slice 76:78) as the condition (previously, I used `pdb_info.ResName.str.startswith("H")`, which is slower).
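A sketch of the element-based test on raw lines, assuming well-formed fixed-column records whose element symbol is right-justified in columns 77-78. The function name `drop_hydrogens` is hypothetical and the code is an illustration, not the script's implementation.

```python
def drop_hydrogens(lines):
    """Drop ATOM/HETATM records whose element symbol (slice 76:78) is H."""
    kept = []
    for line in lines:
        is_atom = line.startswith(("ATOM", "HETATM"))
        if is_atom and line[76:78].strip().upper() == "H":
            continue
        kept.append(line)
    return kept


# Three records of MET 1 from the 5JRY.pdb excerpt above.
records = [
    "ATOM      5  CB  MET A   1       1.755  20.132 135.441  1.00 27.72           C",
    "ATOM      6  H   MET A   1       2.770  22.181 136.733  1.00 34.12           H",
    "ATOM      7  HA  MET A   1       3.659  20.162 136.203  1.00 32.80           H",
]
kept = drop_hydrogens(records)
```

Testing the element field rather than a name prefix avoids matching non-hydrogen entries whose names merely begin with "H".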
- ligands/solvents
Some PDB files contain ligands/solvents (e.g. 1HQ2.pdb, which contains MG, CL, ACT, APC, PH2, HOH). This information is listed below the last `TER` line of the protein chains. Here, only part of it is shown as an example.
TER 1288 TRP A 158 HETATM 1289 MG MG A 161 -2.797 1.884 19.740 1.00 6.70 MG ANISOU 1289 MG MG A 161 1097 971 478 15 338 -47 MG HETATM 1290 MG MG A 162 -5.869 3.399 19.011 1.00 7.19 MG ANISOU 1290 MG MG A 162 855 1269 610 165 324 -165 MG HETATM 1291 CL CL A 163 -16.840 -10.191 19.213 1.00 15.49 CL ANISOU 1291 CL CL A 163 2248 1922 1713 -61 398 8 CL HETATM 1292 C ACT A 164 -6.064 -1.027 24.199 1.00 37.58 C ANISOU 1292 C ACT A 164 7931 1868 4482 143 -3066 -290 C HETATM 1293 O ACT A 164 -6.343 -1.714 23.182 1.00 14.54 O ANISOU 1293 O ACT A 164 1249 1159 3116 -81 364 -9 O HETATM 1294 OXT ACT A 164 -6.052 0.230 24.235 1.00 21.94 O ANISOU 1294 OXT ACT A 164 2886 1823 3627 17 -396 -183 O HETATM 1295 CH3 ACT A 164 -5.715 -1.844 25.481 1.00 22.07 C ANISOU 1295 CH3 ACT A 164 2913 2265 3206 1466 -77 -427 C ... HETATM 1308 PG APC A 171 -7.079 2.750 21.870 1.00 6.88 P ANISOU 1308 PG APC A 171 911 790 915 71 401 -193 P HETATM 1309 O1G APC A 171 -6.616 1.344 22.152 1.00 9.60 O ANISOU 1309 O1G APC A 171 871 835 1940 100 305 -71 O HETATM 1310 O2G APC A 171 -8.226 3.192 22.715 1.00 6.03 O ANISOU 1310 O2G APC A 171 938 744 609 31 592 49 O HETATM 1311 O3G APC A 171 -7.294 3.042 20.400 1.00 7.08 O ANISOU 1311 O3G APC A 171 758 1522 408 29 337 -202 O HETATM 1312 PB APC A 171 -4.370 3.764 21.854 1.00 6.31 P ANISOU 1312 PB APC A 171 909 1010 479 80 206 -23 P HETATM 1313 O1B APC A 171 -4.334 3.108 20.508 1.00 6.48 O ANISOU 1313 O1B APC A 171 1088 961 411 197 230 187 O HETATM 1314 O2B APC A 171 -3.797 5.194 21.941 1.00 8.43 O ANISOU 1314 O2B APC A 171 1093 812 1296 -168 237 134 O HETATM 1315 O3B APC A 171 -5.859 3.729 22.378 1.00 6.62 O ANISOU 1315 O3B APC A 171 943 913 661 23 285 -98 O HETATM 1316 PA APC A 171 -1.763 2.605 22.706 1.00 6.54 P ANISOU 1316 PA APC A 171 934 931 619 -41 125 -54 P HETATM 1317 O1A APC A 171 -1.570 2.627 21.218 1.00 6.39 O ANISOU 1317 O1A APC A 171 1032 863 534 -110 85 176 O HETATM 1318 O2A APC A 171 -1.020 3.644 23.495 1.00 7.30 O ANISOU 1318 O2A APC A 171 1040 
900 833 -144 100 -132 O HETATM 1319 C3A APC A 171 -3.506 2.689 22.980 1.00 6.45 C ANISOU 1319 C3A APC A 171 1048 908 494 -10 50 -250 C HETATM 1320 O5' APC A 171 -1.282 1.138 23.187 1.00 7.04 O ANISOU 1320 O5' APC A 171 1051 899 726 99 216 199 O HETATM 1321 C5' APC A 171 -1.316 0.838 24.562 1.00 6.44 C ANISOU 1321 C5' APC A 171 1106 788 551 17 188 29 C HETATM 1322 C4' APC A 171 -1.315 -0.661 24.737 1.00 6.29 C ANISOU 1322 C4' APC A 171 1058 1000 331 -50 125 -50 C HETATM 1323 O4' APC A 171 -2.428 -1.236 24.248 1.00 6.72 O ANISOU 1323 O4' APC A 171 1097 794 660 -168 83 -206 O HETATM 1324 C3' APC A 171 -0.144 -1.406 24.035 1.00 5.09 C ANISOU 1324 C3' APC A 171 840 635 461 114 -260 152 C HETATM 1325 O3' APC A 171 1.112 -1.294 24.673 1.00 8.26 O ANISOU 1325 O3' APC A 171 951 1232 954 -47 -388 51 O HETATM 1326 C2' APC A 171 -0.589 -2.790 23.968 1.00 6.71 C ANISOU 1326 C2' APC A 171 899 800 850 -297 285 216 C HETATM 1327 O2' APC A 171 -0.030 -3.702 24.840 1.00 7.41 O ANISOU 1327 O2' APC A 171 1086 962 766 50 -20 387 O HETATM 1328 C1' APC A 171 -2.056 -2.689 24.271 1.00 6.78 C ANISOU 1328 C1' APC A 171 966 1005 607 69 -51 -28 C HETATM 1329 N9 APC A 171 -3.025 -3.347 23.426 1.00 6.15 N ANISOU 1329 N9 APC A 171 878 1013 448 -50 4 190 N HETATM 1330 C8 APC A 171 -4.109 -4.250 23.860 1.00 5.38 C ANISOU 1330 C8 APC A 171 963 639 442 128 139 -256 C HETATM 1331 N7 APC A 171 -4.834 -4.707 22.964 1.00 6.53 N ANISOU 1331 N7 APC A 171 1035 948 498 64 114 -303 N HETATM 1332 C5 APC A 171 -4.263 -4.110 21.750 1.00 6.10 C ANISOU 1332 C5 APC A 171 827 920 570 95 98 -94 C HETATM 1333 C6 APC A 171 -4.711 -4.298 20.433 1.00 6.59 C ANISOU 1333 C6 APC A 171 929 875 698 -65 341 -235 C HETATM 1334 N6 APC A 171 -5.694 -5.022 20.065 1.00 5.61 N ANISOU 1334 N6 APC A 171 1039 448 646 -77 172 -3 N HETATM 1335 N1 APC A 171 -3.967 -3.599 19.483 1.00 5.96 N ANISOU 1335 N1 APC A 171 967 811 486 -27 349 -21 N HETATM 1336 C2 APC A 171 -2.985 -2.876 19.852 1.00 6.29 C ANISOU 1336 C2 APC A 171 770 937 683 62 
-16 219 C HETATM 1337 N3 APC A 171 -2.475 -2.631 21.110 1.00 7.13 N ANISOU 1337 N3 APC A 171 1440 715 552 -17 159 88 N HETATM 1338 C4 APC A 171 -3.227 -3.343 22.105 1.00 5.85 C ANISOU 1338 C4 APC A 171 888 820 514 95 -85 145 C HETATM 1339 N1 PH2 A 181 -7.610 6.951 18.003 1.00 6.05 N ANISOU 1339 N1 PH2 A 181 1052 952 296 -94 78 -172 N HETATM 1340 C2 PH2 A 181 -7.491 7.106 19.276 1.00 6.24 C ANISOU 1340 C2 PH2 A 181 751 1399 220 93 132 13 C HETATM 1341 C3 PH2 A 181 -8.350 8.309 19.918 1.00 12.59 C ANISOU 1341 C3 PH2 A 181 2656 1496 632 1012 -233 -450 C HETATM 1342 N4 PH2 A 181 -9.107 9.037 19.073 1.00 8.47 N ANISOU 1342 N4 PH2 A 181 1999 603 614 77 272 74 N HETATM 1343 N5 PH2 A 181 -9.913 9.531 16.981 1.00 6.33 N ANISOU 1343 N5 PH2 A 181 859 710 836 -209 210 247 N HETATM 1344 C6 PH2 A 181 -10.042 9.367 15.609 1.00 6.17 C ANISOU 1344 C6 PH2 A 181 1209 401 734 -75 256 195 C HETATM 1345 N6 PH2 A 181 -10.760 10.085 14.925 1.00 7.94 N ANISOU 1345 N6 PH2 A 181 1288 739 992 607 231 -6 N HETATM 1346 N7 PH2 A 181 -9.294 8.335 15.116 1.00 6.20 N ANISOU 1346 N7 PH2 A 181 1142 360 855 36 21 218 N HETATM 1347 C8 PH2 A 181 -8.506 7.520 15.753 1.00 5.42 C ANISOU 1347 C8 PH2 A 181 958 690 411 48 430 -24 C HETATM 1348 O8 PH2 A 181 -7.860 6.607 15.236 1.00 6.00 O ANISOU 1348 O8 PH2 A 181 592 1089 600 260 34 -140 O HETATM 1349 C9 PH2 A 181 -8.445 7.773 17.143 1.00 6.49 C ANISOU 1349 C9 PH2 A 181 857 1146 461 217 -8 -87 C HETATM 1350 C10 PH2 A 181 -9.183 8.808 17.707 1.00 6.03 C ANISOU 1350 C10 PH2 A 181 1024 777 489 36 384 5 C HETATM 1351 C11 PH2 A 181 -6.690 6.344 20.210 1.00 7.41 C ANISOU 1351 C11 PH2 A 181 1201 1148 469 155 -176 -66 C HETATM 1352 O4 PH2 A 181 -5.798 5.476 19.528 1.00 8.32 O ANISOU 1352 O4 PH2 A 181 1571 989 601 322 507 -127 O HETATM 1353 O HOH A 201 -1.216 0.760 18.974 1.00 7.10 O ANISOU 1353 O HOH A 201 1048 1158 492 106 60 -57 O HETATM 1354 O HOH A 202 -10.274 -11.677 16.318 1.00 6.62 O ANISOU 1354 O HOH A 202 865 817 835 -50 295 115 O HETATM 1355 O HOH A 203 
-7.887 3.653 14.884 1.00 8.89 O ANISOU 1355 O HOH A 203 1091 1226 1061 -110 356 -28 O
Improvement (Dec. 13, 2017) PDB_cleaner is able to report them in the final report.txt file.
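One possible way to gather this information for a report is to count HETATM atoms per residue name (columns 18-20). This is a sketch under that assumption; `summarize_hetero` is a hypothetical name, and the actual layout of report.txt is not reproduced here. Note that this simple count does not distinguish ligands and solvents from modified residues such as MSE, which are also stored as HETATM.

```python
from collections import Counter


def summarize_hetero(lines):
    """Count HETATM atoms per residue name (columns 18-20)."""
    return Counter(line[17:20].strip()
                   for line in lines if line.startswith("HETATM"))


# A few records from the 1HQ2.pdb excerpt above.
records = [
    "TER    1288      TRP A 158",
    "HETATM 1289 MG    MG A 161      -2.797   1.884  19.740  1.00  6.70          MG",
    "HETATM 1291 CL    CL A 163     -16.840 -10.191  19.213  1.00 15.49          CL",
    "HETATM 1353  O   HOH A 201      -1.216   0.760  18.974  1.00  7.10           O",
    "HETATM 1354  O   HOH A 202     -10.274 -11.677  16.318  1.00  6.62           O",
]
summary = summarize_hetero(records)
```

The resulting counter maps each hetero residue name to its number of atoms, which can then be written out in whatever report format is preferred.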
- NumPy (version 1.9.1 or above)
- pandas (version 0.19.2 or above)
This script can be run on both Linux and Windows systems. The command is shown below:
$ python pdb_cleaner.py
Then the program will ask you to specify the directory where the PDB files are located and how to deal with multiple chains (keep all the chains or just one of them).
If you choose "one", the program will choose the longest chain in the PDB file (if all chains have the same length, the first chain will be kept).
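The "one" rule can be sketched as follows, assuming chain length is measured as the number of distinct sequence numbers per chain. The function name `longest_chain` and the input layout are hypothetical.

```python
def longest_chain(residues):
    """Return the chain ID with the most residues; ties go to the first chain."""
    counts = {}
    for chain_id, seq_num in residues:
        counts.setdefault(chain_id, set()).add(seq_num)
    # Dicts preserve insertion order and max() returns the first maximal
    # item, so the earliest chain wins when lengths are equal.
    return max(counts, key=lambda chain_id: len(counts[chain_id]))


residues = [("A", 1), ("A", 2), ("B", 1), ("B", 2), ("B", 3), ("C", 1)]
chosen = longest_chain(residues)
tie = longest_chain([("A", 1), ("A", 2), ("B", 1), ("B", 2)])
```

In the first call chain B is selected because it has three residues; in the tie case chain A is returned because it appears first.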
- Workflow:
  1. Collect all the PDB files in the given directory;
  2. In each PDB file, check the following items:
     2.1. ligands;
     2.2. alternate locations;
     2.3. non-standard amino acid residues;
     2.4. negative sequence numbers (less important);
     2.5. sequence gaps;
     2.6. insertion codes;
     2.7. multiple chains;
     2.8. hydrogen atoms;
     2.9. **to do: missing atoms;**
     2.10. **to do: keep ligands/solvents or not. Currently, all ligands/solvents are removed.**
  3. Clean the PDB files if any of the aforementioned items exist, with the following options (the chain options apply if the protein has multiple chains):
     3.1. remove hydrogens, if the user specified "y";
     3.2. keep all chains, if the user specified "all";
     3.3. keep the longest chain (or the 1st chain, if all chains have the same length), if the user specified "one".
  4. Save the cleaned PDB files one by one;
  5. Save the summary report.
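The first step of the workflow, collecting the PDB files, might look like the sketch below. It assumes files are recognized by a `.pdb` extension (case-insensitive) and that subdirectories are not searched; `collect_pdb_files` is a hypothetical name, and the file names in the demo are created empty just for illustration.

```python
import tempfile
from pathlib import Path


def collect_pdb_files(directory):
    """Return the sorted paths of all *.pdb files directly inside directory."""
    return sorted(path for path in Path(directory).iterdir()
                  if path.is_file() and path.suffix.lower() == ".pdb")


# Demo in a throwaway directory with two PDB files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1lk2.pdb", "5DXX.PDB", "notes.txt"):
        Path(tmp, name).touch()
    found = [path.name for path in collect_pdb_files(tmp)]
```

Each path returned this way can then be passed through the checks and cleaning steps listed above, one file at a time.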