---
title: 'WorldFAIR Chemistry: Protocol Services'
author: "WorldFAIR Chemistry"
date: "2023-10-25"
output:
  pdf_document: default
  html_document: default
urlcolor: blue
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r IUPAC graphic, out.width='25%', fig.align='center', echo=FALSE}
knitr::include_graphics('book/images/iupac_and_wf.jpg')
```
## Interactive Demonstration
This notebook is intended as an interactive demonstration of the services
being proposed by the IUPAC WorldFAIR Chemistry D3.3 project team.
A complete description of the project is available at
<https://iupac.github.io/WFChemProtocols/intro.html>.
This notebook is an RMarkdown version of the original Jupyter Notebook,
which is available at
<https://github.com/IUPAC/WFChemProtocols/blob/main/IUPACProtocolsDemo.ipynb>.
## Resolver Summary
While more detail is provided in the documentation linked above, in short,
what is described here is a web service called a "resolver" that performs
two main functions:

1. Check for the presence of a chemical record in the hosting organization's database.
2. Validate the machine-readable chemical structure according to the hosting organization's rules.

## Resolver Base URL
The service being proposed in this project is a regular HTTP web service,
using standard CGI URL syntax, and a well-defined data model for the
information returned. This demonstration uses a prototype service hosted
by PubChem, using JSON as the response format (although in principle it
could be XML or any other structured data format).
One key point of this proposal is that the base URL for the resolver CGI
would vary from one institution to another, but the inputs (CGI arguments)
and outputs (JSON data) would be standard, the same for any organization
implementing the service. So simply by switching the base URL, one can run
the same query on multiple different sites, without otherwise needing to
change any code.
In R, the setup could look like this:
```{r preparation}
library(httr2)
resolver_base_url <- "https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi"
```
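As a minimal sketch of the base-URL idea, the helper below builds the same lookup URL for two different base URLs. The helper name `build_lookup_url` and the second base URL (`resolver.example.org`) are hypothetical and used only for illustration; the PubChem prototype is the only service named in this document. Base R's `utils::URLencode()` stands in here for the encoding that `httr2` performs.

```r
# Hypothetical helper: attach one CGI argument to any resolver base URL.
build_lookup_url <- function(base_url, key, value) {
  # reserved = TRUE percent-encodes characters such as "=", "/" and "["
  paste0(base_url, "?", key, "=", utils::URLencode(value, reserved = TRUE))
}

base_urls <- c(
  pubchem = "https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi",
  example = "https://resolver.example.org/resolver.cgi"  # assumption, not a real service
)

# The query stays the same; only the base URL changes.
lookup_urls <- vapply(base_urls, build_lookup_url, character(1),
                      key = "smiles", value = "CCCC")
print(lookup_urls)
```

Because the inputs and outputs are meant to be standard, nothing but the base URL differs between the two generated URLs.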
When called without any arguments, the resolver will return some information
about what inputs and outputs it can handle.
```{r example with no input}
req <- request(resolver_base_url)
res <- req |> req_perform()
```
```{r display results}
# display the input and results
req$url
# display raw data
res |> resp_body_string() |> cat()
# # display as R object - suppressed for readability
# res |> resp_body_json()
```
## Chemical Lookup
The resolver service can check to see whether a given chemical is present in
the host organization's database. Examples are below, but note that in this
interactive document, one can edit the inputs to query whatever
chemical is desired.
First, to look up by SMILES string:
```{r request SMILES query}
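# start from the base URL again so this chunk does not depend on earlier state
req <- request(resolver_base_url) |>
  req_url_query(smiles = "CCCC")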
# perform request
res <- req |> req_perform()
```
```{r display SMILES results}
# display URL
req$url
# display raw data
res |> resp_body_string() |> cat()
# display as R object
res |> resp_body_json()
```
In this example code, the query is added to the request with `req_url_query`,
which also URL-encodes it; the request is then sent with `req_perform` and the
response body is read with the `resp_body_*` functions from the `httr2` package.
The resulting data indicates that there is indeed a matching record in the
host's database, and provides various record fields that allow the user to get
more information directly from the hosting site. The resolver is not intended
for full record retrieval; it gives a simplified response that says whether the
chemical is found and where to go for more detail. In this case the user can
follow the link to the full PubChem record:
<https://pubchem.ncbi.nlm.nih.gov/compound/7843>
Or see an image of the chemical structure
(although not terribly interesting in this case!):
<!-- If html, renders live -->
<!-- ![https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?t=l&cid=7843](https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?t=l&cid=7843) -->
<!-- If PDF, use pre-downloaded file -->
```{r butane img, out.width='25%', fig.align='center', echo=FALSE}
knitr::include_graphics('book/images/imgsrv_7843.png')
```
<https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?t=l&cid=7843>
If the chemical is not in the database, the response would be something like
this, where an empty result means nothing was found (this could also potentially
be indicated by an HTTP 404 response, but is not done that way in this sample
implementation):
```{r non-existing SMILES}
req <- request(resolver_base_url) |>
  req_url_query(smiles = "CCCC(Br)CC(F)(Cl)CCC")
# perform request
res <- req |> req_perform()
# display
req$url
# display raw data
res |> resp_body_string() |> cat()
```
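A client may therefore want to cope with both conventions. The sketch below is one possible way to do so, assuming the parsed JSON arrives as an R list (as `resp_body_json()` returns) and that either an empty list or an HTTP 404 status means "not found". The helper name `record_found` is hypothetical, and nothing is assumed about the fields of a non-empty response.

```r
# Hypothetical helper: decide whether a lookup found a record.
# `status` is the HTTP status code, `parsed` the parsed JSON body (an R list).
record_found <- function(status, parsed) {
  if (status == 404L) return(FALSE)   # "not found" signalled by status code
  if (status >= 400L) stop("resolver returned HTTP status ", status)
  length(parsed) > 0                  # "not found" signalled by an empty result
}

record_found(200L, list())                 # FALSE
record_found(200L, list(placeholder = 1))  # TRUE
record_found(404L, NULL)                   # FALSE
```

With `httr2`, `req_perform()` raises an R error for 4xx and 5xx responses by default, so a 404-based resolver would additionally need `req_error()` to let the response through before `resp_status()` could be inspected.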
The resolver can handle multiple input formats for the chemical structure,
as listed in the previous section. So all of these would return the same
result, which can be verified by (un)commenting various query lines below:
```{r multiformat lookup}
# # SMILES
# req <- request(resolver_base_url) |>
#   req_url_query(smiles = "CCCC")
# InChI
req <- request(resolver_base_url) |>
  req_url_query(inchi = "InChI=1S/C4H10/c1-3-4-2/h3-4H2,1-2H3")
# # InChIKey
# req <- request(resolver_base_url) |>
#   req_url_query(inchikey = "IJDNQMDRQITEOD-UHFFFAOYSA-N")
# # Name
# req <- request(resolver_base_url) |>
#   req_url_query(name = "butane")
## Request and display results
# perform request
res <- req |> req_perform()
# display
req$url
# display raw data
res |> resp_body_string() |> cat()
```
Note that the full InChI string needs to be URL-encoded to be passed as an
argument to the CGI, as do some SMILES strings with special characters.
The `httr2` functions in these examples handle this automatically.
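To make the encoding visible, the following sketch percent-encodes two inputs with base R's `utils::URLencode()`. This is only an illustration of what URL-encoding does to these strings; the exact set of characters that `httr2` escapes may differ slightly.

```r
inchi <- "InChI=1S/C4H10/c1-3-4-2/h3-4H2,1-2H3"

# reserved = TRUE also encodes reserved characters such as "=", "/" and ","
encoded_inchi <- utils::URLencode(inchi, reserved = TRUE)
print(encoded_inchi)
# "InChI%3D1S%2FC4H10%2Fc1-3-4-2%2Fh3-4H2%2C1-2H3"

# SMILES brackets are encoded in the same way
encoded_smiles <- utils::URLencode("C[5H]", reserved = TRUE)
print(encoded_smiles)
# "C%5B5H%5D"

# Decoding recovers the original string
identical(utils::URLdecode(encoded_inchi), inchi)
```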
## Chemical Structure Validation
The second major function of the resolver is to check the validity of chemical
structures. That is, when a user inputs a SMILES string or an SDF file (for
example, as exported from some chemical drawing package or ELN), does the host
organization confirm that the structure is valid? Does it have the right number
of defined stereocenters, isotopes, etc.? Sometimes chemists draw complex
structures in a way where stereochemistry is implied by the drawing, but may
not be interpreted as such by a machine. This tool will allow the chemist to
verify that the structure is perceived by the chemical software in the same
way as by the chemist themselves.
When called with this special action argument, the resolver returns some basic
statistics about what it sees in the structure. Note this may vary somewhat from
organization to organization, especially for edge cases where different
chemical software packages produce slightly different results. This is
expected, and part of the idea here is to ask "What does PubChem think of
this structure?" vs. "What does EPA think of this structure?"
```{r validate butane}
req <- request(resolver_base_url) |>
  req_url_query(smiles = "CCCC") |>
  req_url_query(action = "validate_structure")
# perform request
res <- req |> req_perform()
# display
req$url
# display raw data
res |> resp_body_string() |> cat()
```
If there is a problem with the input structure, there should be some
human-readable message that indicates what the error is. Again, this will
vary by organization, and the message itself is not part of this standard,
but basic things like valence checks on organic structures will presumably
be handled similarly.
```{r invalid structure}
req <- request(resolver_base_url) |>
  req_url_query(smiles = "CC(C)(C)(C)C") |>
  req_url_query(action = "validate_structure")
# perform request
res <- req |> req_perform()
# display
req$url
# display raw data
res |> resp_body_string() |> cat()
```
Here is an example where the organization's specific rules come into play.
PubChem, which is designed mainly for drug-like chemicals, rejects isotopes
with half-life less than 1 millisecond. This may not be the case for other
databases with different purposes and goals. So even though 5H exists
(at least in a laboratory), it's not considered valid in PubChem.
```{r 5H invalid in PubChem}
req <- request(resolver_base_url) |>
  req_url_query(smiles = "C[5H]") |>
  req_url_query(action = "validate_structure")
# perform request
res <- req |> req_perform()
# display
req$url
# display raw data
res |> resp_body_string() |> cat()
```
Here is a more complex example, a larger structure (prostaglandin D~2~) with
multiple stereocenters, both sp3 and sp2. Note the response data indicates
how many defined vs. undefined stereocenters are present, which may assist
the user in matching their expectations to the machine result.
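Before sending the request, one can form a rough expectation by counting the chirality markers in the SMILES string. The sketch below is a simple text heuristic, not a chemical perception algorithm: the helper name `count_chiral_markers` is hypothetical, it only counts `@`/`@@` markers on atoms, and it ignores double-bond (`/`, `\`) stereo entirely.

```r
# Count "@" or "@@" chirality markers; each run marks one atom with
# explicitly specified stereochemistry in the SMILES string.
count_chiral_markers <- function(smiles) {
  hits <- gregexpr("@+", smiles)[[1]]
  if (hits[1] == -1L) 0L else length(hits)
}

pgd2 <- r"(CCCCC[C@@H](/C=C/[C@@H]1[C@H]([C@H](CC1=O)O)C/C=C\CCCC(=O)O)O)"
count_chiral_markers(pgd2)    # 4
count_chiral_markers("CCCC")  # 0
```

The count obtained this way can then be compared by eye with the number of defined stereocenters that the resolver reports.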
```{r Prostaglandin}
## input_smiles <- "CCCCC[C@@H](/C=C/[C@@H]1[C@H]([C@H](CC1=O)O)C/C=C\CCCC(=O)O)O"
## Since the SMILES contains an escape character, need to use a raw string r"()"
## wrapper, see https://r4ds.hadley.nz/strings#sec-raw-strings
input_smiles <- r"(CCCCC[C@@H](/C=C/[C@@H]1[C@H]([C@H](CC1=O)O)C/C=C\CCCC(=O)O)O)"
req <- request(resolver_base_url) |>
  req_url_query(smiles = input_smiles) |>
  req_url_query(action = "validate_structure")
# perform request
res <- req |> req_perform()
# display
req$url
# display raw data
res |> resp_body_string() |> cat()
```
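The raw-string syntax used in the chunk above can be checked in isolation. This small sketch uses a short SMILES fragment chosen for illustration and shows that the raw string and the conventionally escaped string are the same value; raw strings require R 4.0.0 or later.

```r
raw_form     <- r"(C/C=C\C)"   # raw string: the backslash is kept as typed
escaped_form <- "C/C=C\\C"     # ordinary string: the backslash must be doubled

identical(raw_form, escaped_form)  # TRUE
nchar(raw_form)                    # 7
```

Either form can be passed to `req_url_query()`; the raw string simply lets the SMILES be pasted unchanged.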
Finally, it may be helpful to chemists, who are trained to interpret chemical
structures visually, to see a computer-generated image of their input, again
to see if it matches what the chemist thinks should be there. So the resolver
can also return an image file, with an appropriate output format request.
Note, the image display uses the resolver URL directly.
```{r display Prostaglandin}
input_smiles <- r"(CCCCC[C@@H](/C=C/[C@@H]1[C@H]([C@H](CC1=O)O)C/C=C\CCCC(=O)O)O)"
req <- request(resolver_base_url) |>
  req_url_query(smiles = input_smiles) |>
  req_url_query(action = "validate_structure") |>
  req_url_query(format = "png")
# perform request
res <- req |> req_perform()
# display
req$url
# # display png
# knitr::include_graphics(req$url)
# for PDF display, have to use a saved image:
knitr::include_graphics('book/images/resolver_prostaglandin.png')
```
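For PDF output, the saved image has to be produced somehow. One possible way, sketched below, is to let `httr2` write the response body straight to a file through the `path` argument of `req_perform()`. The helper name `save_structure_png` is hypothetical, and the sketch assumes the resolver returns PNG bytes for `format = "png"` as described above.

```r
# Hypothetical helper: fetch the resolver's PNG depiction and write it to `path`.
save_structure_png <- function(base_url, smiles, path) {
  httr2::request(base_url) |>
    httr2::req_url_query(
      smiles = smiles,
      action = "validate_structure",
      format = "png"
    ) |>
    httr2::req_perform(path = path)  # response body is written to `path`
  invisible(path)
}

# Example call (performs a network request, so it is left commented out):
# save_structure_png(resolver_base_url, input_smiles,
#                    "book/images/resolver_prostaglandin.png")
```

The saved file can then be displayed with `knitr::include_graphics()` in both HTML and PDF output.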
## Conclusion
It is our hope that this notebook provides a clear overview of the expected
functionality of the resolver being proposed by this IUPAC project.
These working examples should give the user a chance to see how to submit
these web service requests, without having to know any programming, and to
be able to change the inputs with their own SMILES strings etc. in order to
see how the resolver responds to their unique cases.
This R Markdown version was produced by [schymane](https://github.com/schymane/)
based on the Jupyter Notebook committed by
[PaulThiessen](https://github.com/IUPAC/WFChemProtocols/commits?author=PaulThiessen) on [6 Oct. 2023](https://github.com/IUPAC/WFChemProtocols/commit/d5d01c131c87e41703af19f3b28756595c7e92ee).
We would be happy to receive feedback; please see
<https://iupac.github.io/WFChemProtocols/demo.html> for details. Thank you!