Add more info to README, add csv files.
davidlmobley committed Dec 12, 2017
1 parent 57c2426 commit 3630095
Showing 3 changed files with 55,589 additions and 0 deletions.
21 changes: 21 additions & 0 deletions README.md
@@ -37,15 +37,36 @@ Our hope is that the community will get involved with curation of the dataset pr
Suggested improvements should come in via pull requests, where each pull request provides proposed modifications (including potentially supporting tools/scripts, data, references, or links to the same) and a clear explanation of these changes.
Thus, over time the current, curated database is expected to move away from simply reflecting the contents of the Excel spreadsheet and become more valuable.

Some specific points of curation which will be needed include:
- Separation of different types of data; for example, the main tab in the database Excel spreadsheet (and the data in `guthrie_database.csv`) contains not just hydration free energies but other properties with other units, e.g. the entries for phenol include values reported in mg/L, g/m^3, etc.
- Unit handling: values are present in both kJ/mol and kcal/mol
- Checking of molecule names against SMILES and stereochemistry; I (DLM) previously gave Peter some tools to help with this, but I do not know whether he has used them
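
As an illustration of the unit-handling point above, here is a minimal sketch of normalizing energies to kcal/mol. It is not part of the repository: the helper name `to_kcal_per_mol` and the sample pairs are hypothetical, and the spelling of the unit strings in the actual database is an assumption that would need to be checked. The only fixed fact used is the thermochemical conversion 1 kcal = 4.184 kJ.

```python
# Thermochemical calorie: 1 kcal = 4.184 kJ exactly.
KJ_PER_KCAL = 4.184


def to_kcal_per_mol(value, units):
    """Return value in kcal/mol, or None if units is not a recognized energy unit.

    The unit spellings below are assumptions; the database may use others.
    """
    normalized = units.strip().lower()
    if normalized == "kcal/mol":
        return value
    if normalized == "kj/mol":
        return value / KJ_PER_KCAL
    # Anything else (e.g. mg/L, g/m^3) is a different property; do not convert.
    return None


# Made-up (value, units) pairs for illustration only, not taken from the database.
examples = [(1.0, "kcal/mol"), (4.184, "kJ/mol"), (100.0, "mg/L")]
converted = [to_kcal_per_mol(value, units) for value, units in examples]
```

Returning `None` for non-energy units keeps solubility-type entries from being silently mixed in with hydration free energies, which also helps with the data-separation point.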

## Manifest
- `GuthrieDatabase_April14.zip`: Guthrie database (Excel spreadsheet) as it was provided
- `guthrie_database.csv`: Exported csv file of main tab of Excel spreadsheet
- `guthrie_references_and_status.csv`: Additional tab of Excel spreadsheet which provides definitions of the references and reports on Peter's progress in extracting data from those references; may highlight other areas where more data is still available

There is also data/curation work in an additional tab of the spreadsheet, Sheet 2, which may be useful but is not present here as a separate file yet.

## Using the dataset

The data set can be loaded easily in Python using `pandas`, for example as:
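```python
import pandas

# Read the exported main tab; the file is decoded as latin1
db = pandas.read_csv('guthrie_database.csv', encoding='latin1')
# Keep only the rows whose Name column is exactly 'phenol'
data = db[db.Name == 'phenol']
```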
to load the database and extract all rows for the molecule named phenol.
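
If `pandas` is not available, the same selection can be sketched with the standard library `csv` module. This is only a rough equivalent, not something shipped with the repository: `rows_named` and `load_rows_named` are hypothetical helper names, and the only details taken from the example above are the `Name` column and the `latin1` encoding.

```python
import csv


def rows_named(csv_file, name):
    """Return every row (as a dict keyed by column header) whose Name equals name."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["Name"] == name]


def load_rows_named(path, name):
    """Open a CSV file with latin1 encoding and select rows by exact Name match."""
    with open(path, encoding="latin1", newline="") as handle:
        return rows_named(handle, name)
```

For example, `load_rows_named('guthrie_database.csv', 'phenol')` would return a list of dictionaries, one per matching row. All values come back as strings, so numeric columns need explicit conversion.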

## Authors
### Primary author
- J. Peter Guthrie (University of Western Ontario)

### Other contributors
- David L. Mobley, UC Irvine, who maintains this repository
- Probably students and others who worked with Dr. Guthrie over the years, but I (DLM) do not have their information

## Acknowledgments
- James Guthrie, who made this data available and gave permission to post it publicly; he does not want any credit for this, but he should certainly be acknowledged.