Merge pull request #18 from hachmannlab/wrapper_aatish
Updated documentation, tutorials, regression metrics
aditya1707 committed Nov 9, 2021
2 parents 2e0643e + a33f050 commit 214c2ad
Showing 31 changed files with 110,165 additions and 272 deletions.
27 changes: 18 additions & 9 deletions README.md
@@ -22,10 +22,7 @@ Please check the [ChemML website](https://hachmannlab.github.io/chemml) for more
ChemML is developed in the Python 3 programming language and makes use of a host of data analysis and ML libraries (accessible through the Anaconda distribution), as well as domain-specific libraries.
The development follows a strictly modular and object-oriented design to make the overall code as flexible and versatile as possible.

The library's format is similar to that of well-known libraries such as Scikit-learn. ChemML will soon be available
via a graphical user interface provided by [ChemEco](https://github.com/hachmannlab/chemeco).
ChemEco is a general-purpose framework for data mining without coding. It also interfaces with many of the libraries that supply methods for the
representation, preprocessing, analysis, mining, and modeling of large-scale chemical data sets.
The library's format is similar to that of well-known libraries such as Scikit-learn.


## Latest Version:
@@ -44,12 +41,14 @@ Here is a list of external libraries that will be installed with chemml:
- matplotlib
- seaborn
- lxml
- openpyxl
- ipywidgets

Since conda installation is not available for ChemML yet, we recommend installing rdkit and openbabel (please install openbabel 2.x not openbabel 3.x) in a conda virtual environment prior to installing ChemML. For doing so, you need to follow the conda installer:
We strongly recommend installing ChemML in an Anaconda environment. The instructions to create the environment, install ChemML’s dependencies, and subsequently install ChemML from the Python Package Index (PyPI) via pip are as follows:

conda create --name my_chemml_env python=3.6
source activate my_chemml_env
conda install -c conda-forge rdkit openbabel
conda create --name chemml_env python=3.8
source activate chemml_env
conda install -c conda-forge openbabel rdkit nb_conda_kernels python-graphviz
pip install chemml

## Citation:
@@ -93,6 +92,13 @@ Please cite the use of ChemML as:
year = {2018}
}

@article{vishwakarma2019towards,
title={Towards autonomous machine learning in chemistry via evolutionary algorithms},
author={Vishwakarma, Gaurav and Haghighatlari, Mojtaba and Hachmann, Johannes},
journal={ChemRxiv preprint},
year={2019}
}

## License:
ChemML is copyright (C) 2014-2018 Johannes Hachmann and Mojtaba Haghighatlari, all rights reserved.
ChemML is distributed under 3-Clause BSD License (https://opensource.org/licenses/BSD-3-Clause).
@@ -102,17 +108,20 @@ ChemML is distributed under 3-Clause BSD License (https://opensource.org/license
### Maintainers:
- Johannes Hachmann, hachmann@buffalo.edu
- Mojtaba Haghighatlari
- Aditya Sonpal
- Aditya Sonpal, adityaso@buffalo.edu
- Aatish Pradhan, aatishpr@buffalo.edu
University at Buffalo - The State University of New York (UB)

### Contributors:
- Doaa Altarawy (MolSSI): scientific advice and software mentor
- Gaurav Vishwakarma (UB): automated model optimization
- Ramachandran Subramanian (UB): Magpie descriptor library port
- Bhargava Urala Kota (UB): library database
- Aditya Sonpal (UB): graph convolution NNs
- Srirangaraj Setlur (UB): scientific advice
- Venugopal Govindaraju (UB): scientific advice
- Krishna Rajan (UB): scientific advice
- Aatish Pradhan (UB): Jupyter GUI developer

- We welcome contributions and feedback. Feel free to fork the repository and open a pull request against the "development" branch.

4 changes: 2 additions & 2 deletions chemml/__init__.py
@@ -1,6 +1,6 @@
# __name__ = "chemml"
__version__ = "0.8"
__author__ = ["Mojtaba Haghighatlari (mojtabah@buffalo.edu)", "Johannes Hachmann (hachmann@buffalo.edu)"]
__version__ = "1.0"
__author__ = ["Aditya Sonpal (adityaso@buffalo.edu)", "Gaurav Vishwakarma (gvishwak@buffalo.edu)", "Aatish Pradhan (aatishpr@buffalo.edu)", "Mojtaba Haghighatlari (mojtabah@buffalo.edu)", "Johannes Hachmann (hachmann@buffalo.edu)"]


# import sys
11 changes: 11 additions & 0 deletions chemml/datasets/GA_files/error_metric.txt
@@ -0,0 +1,11 @@
def error_metric(y_true,y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    ndata = len(y_true)
    y_mean = np.mean(y_true)
    e = y_true - y_pred
    ae = np.absolute(e)
    se = np.square(e)
    var = np.mean(np.square(y_true - y_mean))
    MAE = np.mean(ae)
    return MAE
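
The GA wrapper loads `error_metric` from this text file and later calls it with two arguments, `(y_true, y_pred)`, so a replacement metric would need to keep that signature and return a single number. The following is a minimal sketch of such a replacement, not part of the commit: a hypothetical RMSE variant, with the `numpy` import that the file itself leaves to the calling namespace written out.

```python
import numpy as np


def error_metric(y_true, y_pred):
    """Root-mean-square error; a hypothetical alternative to the MAE above."""
    # Flatten both inputs so column vectors and 1-D arrays are treated alike.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean(np.square(y_true - y_pred))))
```

Whether a lower or higher value counts as better depends on the `fitness` parameter passed to the wrapper; for an error such as RMSE or MAE it would presumably be a minimisation.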
27 changes: 27 additions & 0 deletions chemml/datasets/GA_files/ga_eval.txt
@@ -0,0 +1,27 @@
def ga_eval(indi):

    layers = [indi[i] for i in range(2,5) if indi[i] != 0]
    #print(np.exp(indi[0]))

    #count iterations of GA
    count=open("tmp.txt", "a")
    count.write("GA search iteration in process... \n")
    count.close()
    file = open("tmp.txt","r")
    Counter = 0
    # Reading number of lines from file
    Content = file.read()
    CoList = Content.split("\n")
    for i in CoList:
        if i:
            Counter += 1
    print("GA search iteration in process... ",Counter)
    mlp = MLPRegressor(alpha=np.exp(indi[0]), activation=indi[1], hidden_layer_sizes=tuple(layers),learning_rate='invscaling', max_iter=10,early_stopping=True)
    ga_search = single_obj(mlp=mlp, x=X.values, y=Y.values,n_splits=n_splits)
    #print("GA search iteration in process...")
    f=open("GA.txt", "a")
    f.write("%f %s %d %d %d %f \n" %(float(np.exp(indi[0])), str(indi[1]), int(indi[2]), int(indi[3]), int(indi[4]),float(ga_search)))
    f.close()
    #gui_return ={"ga_search": ga_search}
    #print(gui_return)
    return ga_search
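
`ga_eval` tracks progress by appending a line to `tmp.txt` and re-reading the whole file on every call. The same running count can be kept in memory by wrapping the evaluation function. The sketch below is an illustration only; `counted` and `toy_eval` are hypothetical names that do not appear in ChemML.

```python
import functools


def counted(evaluate):
    """Wrap an evaluation function so that every call is numbered and reported."""
    @functools.wraps(evaluate)
    def wrapper(individual):
        wrapper.calls += 1
        print("GA search iteration in process... ", wrapper.calls)
        return evaluate(individual)

    wrapper.calls = 0  # the counter lives on the wrapper, not in a temp file
    return wrapper


@counted
def toy_eval(individual):
    # Stand-in objective: the sum of the numeric genes.
    return sum(individual)


scores = [toy_eval(ind) for ind in ([1, 2], [3, 4], [5, 6])]
```

A counter held in memory also avoids leaving stale `tmp.txt` files behind if a run is interrupted before the cleanup step.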
12 changes: 12 additions & 0 deletions chemml/datasets/GA_files/single_obj.txt
@@ -0,0 +1,12 @@
def single_obj(mlp, x, y, n_splits=n_splits):
    n_splits=n_splits
    kf = KFold(n_splits) # cross validation based on Kfold (creates 5 validation train-test sets)
    accuracy_kfold = []
    for training, testing in kf.split(x):
        mlp.fit(x[training], y[training])
        y_pred = mlp.predict(x[testing])
        y_pred, y_act =y_pred.reshape(-1,1), y[testing].reshape(-1,1)
        model_accuracy=mae(y_act,y_pred) # evaluation metric: mae
        accuracy_kfold.append(model_accuracy) # creates list of accuracies for each fold
    #print("def single_obj - completed")
    return np.mean(accuracy_kfold)
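
`single_obj` scores one candidate model by fitting it on each training fold, computing the error on the held-out fold, and averaging the per-fold errors. The sketch below reproduces that loop without any third-party dependency. It rests on assumptions: `kfold_indices` is a hand-written stand-in for unshuffled `KFold`, and `MeanPredictor` is a hypothetical model used in place of `MLPRegressor` so the snippet runs on its own.

```python
from statistics import fmean


def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample.
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


class MeanPredictor:
    """Hypothetical stand-in model: always predicts the training mean."""

    def fit(self, x, y):
        self.mean_ = fmean(y)
        return self

    def predict(self, x):
        return [self.mean_] * len(x)


def mae(y_true, y_pred):
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def cross_validated_error(model, x, y, n_splits):
    """Average the held-out MAE over all folds."""
    fold_errors = []
    for train, test in kfold_indices(len(x), n_splits):
        model.fit([x[i] for i in train], [y[i] for i in train])
        y_pred = model.predict([x[i] for i in test])
        fold_errors.append(mae([y[i] for i in test], y_pred))
    return fmean(fold_errors)


x = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 1.0, 3.0, 3.0]
score = cross_validated_error(MeanPredictor(), x, y, n_splits=2)
```

Note that the quantity collected in `accuracy_kfold` in the commit is an error (MAE by default), so smaller values indicate a better candidate.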
1 change: 1 addition & 0 deletions chemml/datasets/GA_files/space.txt
@@ -0,0 +1 @@
space = ({'alpha': {'uniform': [np.log(0.0001), np.log(0.1)], 'mutation': [0, 1]}},{'activation': {'choice': ['identity', 'logistic', 'tanh', 'relu']}},{'neurons1': {'choice': range(0,220,20)}},{'neurons2': {'choice': range(0,220,20)}},{'neurons3': {'choice': range(0,220,20)}})
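
The search space samples `alpha` uniformly between `np.log(0.0001)` and `np.log(0.1)`, so the gene holds a log-value that `ga_eval` exponentiates before passing it to `MLPRegressor`. The three neuron genes are drawn from `range(0,220,20)`, and a value of 0 drops that hidden layer. A small sketch of this decoding follows; `decode_individual` is a hypothetical helper, not a function in the commit.

```python
import math


def decode_individual(indi):
    """Turn a GA individual [log_alpha, activation, n1, n2, n3] into MLP settings."""
    log_alpha, activation, *neurons = indi
    return {
        "alpha": math.exp(log_alpha),  # undo the log-scale sampling
        "activation": activation,
        "hidden_layer_sizes": tuple(n for n in neurons if n != 0),  # 0 = no layer
    }


params = decode_individual([math.log(0.001), "relu", 40, 0, 20])
```

Sampling in log-space gives each order of magnitude of `alpha` roughly equal weight, which a uniform draw on the raw interval would not.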
6 changes: 6 additions & 0 deletions chemml/datasets/GA_files/test_hyperparameters.txt
@@ -0,0 +1,6 @@
def test_hyp(mlp, x, y, xtest, ytest):
    mlp.fit(x, y)
    ypred = mlp.predict(xtest)
    acc=mae(ytest,ypred)
    # print(" test_hyp completed ")
    return np.mean(acc)
1 change: 1 addition & 0 deletions chemml/models/keras/mlp.py
@@ -130,6 +130,7 @@ def save(self, path, filename):
        obj_dict['path_to_file'] = path +'/'+ filename+'.h5'
        obj_df = pd.DataFrame.from_dict(obj_dict,orient='index')
        obj_df.to_csv(path+'/'+filename+'_chemml_model.csv')
        print("File saved as "+path+"/"+filename+"_chemml_model.csv")

    def load(self, path_to_model):
        """
2 changes: 1 addition & 1 deletion chemml/utils/utilities.py
@@ -384,7 +384,7 @@ def regression_metrics(y_true, y_predicted, nfeatures = None):
    metrics_dict['AE'] = [list(ae)]
    metrics_dict['SE'] = [list(se)]

    var = np.mean(np.square(y_predicted - y_mean))
    var = np.mean(np.square(y_true - y_mean))

    metrics_dict['ME'] = np.mean(e)
    # mean absolute error
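
The one-line change in this hunk computes `var` from `y_true` instead of `y_predicted`. If `var` serves as the denominator of a coefficient of determination (an assumption, since the lines that use it fall outside the hunk), the standard definition does call for the variance of the observed values about their mean. A dependency-free sketch of that definition, with `r_squared` as a hypothetical helper name:

```python
from statistics import fmean


def r_squared(y_true, y_pred):
    """R^2 = 1 - MSE / Var(y_true), with the variance taken about the mean of y_true."""
    y_mean = fmean(y_true)
    mse = fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))
    var = fmean((t - y_mean) ** 2 for t in y_true)
    return 1.0 - mse / var


perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Under the same assumption, the earlier expression would make `var` zero for a model that always predicts the mean of `y_true`, whereas the corrected one depends only on the observed data.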
2 changes: 1 addition & 1 deletion chemml/wrapper/base.py
@@ -339,7 +339,7 @@ def references(self,host,function):
# ref_p = "@misc{chollet2015keras,title={Keras},author={Chollet, Fran\c{c}ois and others},year={2015},publisher={GitHub},howpublished={\url{https://github.com/keras-team/keras}},}"
ref_p = "ABCD"
self.refs['keras'] = {'url': ref_g, 'paper': ref_p}
elif function in ['GA_DEAP']:
elif function in ['GA']:
ref_g = "https://github.com/deap/deap"
ref_p = """@article{DEAP_JMLR2012,
author = " F\'elix-Antoine Fortin and Fran\c{c}ois-Michel {De Rainville} and Marc-Andr\'e Gardner and Marc Parizeau and Christian Gagn\'e ",
226 changes: 226 additions & 0 deletions chemml/wrapper/chemml_cml/chemml_wrapper.py
@@ -1868,3 +1868,229 @@ def fit(self):
        # step7: delete all inputs from memory
        del self.inputs

class GA(BASE):
    def fit(self):
        self.paramFROMinput()
        # txt_files = list(self.parameters.keys())[:4] #keys = evaluate, space, error_metric, objective
        # ga_eval = self.parameters[txt_files[0]]
        # space = self.parameters[txt_files[1]]
        # error_metric = self.parameters[txt_files[2]]
        # single_obj = self.parameters[txt_files[3]]

        # for key in self.parameters:
        #     print(key," : ", self.parameters[key])


        try:
            from chemml.optimization import GeneticAlgorithm
            from sklearn.neural_network import MLPRegressor
            from sklearn.model_selection import train_test_split
            from datetime import date, datetime


            import numpy as np

            # for files in list(self.parameters.keys())[:4]: #keys = evaluate, space, error_metric, objective
            #     with open(self.parameters[files],'r') as f:
            #         contents = f.read()
            #         # print("files: ", files, "contents: ", contents)
            #     files = compile(contents, "<string>", "exec")

            for key in list(self.parameters.keys()):
                if key == 'fitness':
                    final_fit=[]
                    fitness = str(self.parameters[key])[-6:-3]
                    fitness = fitness[0]+fitness[1]+fitness[2]
                    final_fit.append(fitness)
                    final_fit = tuple(final_fit)
                    # print("fitness: ",final_fit)
                    # print("type(fitness): ", type(final_fit))
                elif key == 'pop_size':
                    pop_size = self.parameters[key]
                elif key == 'crossover_size':
                    crossover_size = self.parameters[key]
                elif key == 'mutation_size':
                    mutation_size = self.parameters[key]
                elif key == 'n_splits':
                    global n_splits
                    n_splits = self.parameters[key]
                elif key == 'crossover_type':
                    crossover_type = self.parameters[key]
                elif key == 'mutation_prob':
                    mutation_prob = self.parameters[key]
                elif key == 'initial_population':
                    initial_population = self.parameters[key]
                elif key == 'n_generations':
                    n_generations = self.parameters[key]
                elif key == 'early_stopping':
                    early_stopping = self.parameters[key]
                elif key == 'init_ratio':
                    init_ratio = self.parameters[key]
                elif key == 'crossover_ratio':
                    crossover_ratio = self.parameters[key]
                elif key == 'algorithm':
                    global algorithm
                    algorithm = self.parameters[key]

            #default in chemml.optimization.geneticalgorithm
            if 'early_stopping' not in list(self.parameters.keys())[4:]:
                early_stopping = 10

            with open(self.parameters['error_metric'],'r') as f:
                contents = f.read()
                # print("files: ", files, "contents: ", contents)
            code = compile(contents, "<string>", "exec")
            loc = {}
            try:
                exec(code,globals(), loc)
                global mae
                mae = loc['error_metric']
            except:
                print("Something wrong with the code...")
                print("error_metric: ", mae)
                print("type(error_metric): ",type(mae))

            with open(self.parameters['space'],'r') as f:
                contents = f.read()
            code = compile(contents, "<string>", "exec")
            loc = {}
            try:
                exec(code,globals(), loc)
                space = loc['space']
            except:
                print("Something wrong with the code...")
                print("Space: ", space)
                print("type(space): ",type(space))


            with open(self.parameters['single_obj'],'r') as f:
                contents = f.read()
                # print("files: ", files, "contents: ", contents)
            code = compile(contents, "<string>", "exec")
            loc = {}
            try:
                exec(code,globals(), loc)
                global single_obj
                single_obj = loc['single_obj']
            except:
                print("Something wrong with the code...")
                print("single_obj: ", single_obj)
                print("type(single_obj): ",type(single_obj))

            # print("single_obj: ", single_obj)
            # print("type(single_obj): ",type(single_obj))

            with open(self.parameters['evaluate'],'r') as f:
                contents = f.read()
                # print("files: ", files, "contents: ", contents)
            code = compile(contents, "<string>", "exec")
            loc = {}
            try:
                exec(code,globals(), loc)
                ga_eval = loc['ga_eval']
            except:
                print("Something wrong with the code...")
                print("ga_eval: ", ga_eval)
                print("type(ga_eval): ",type(ga_eval))
            # print("ga_eval: ", ga_eval)
            # print("type(ga_eval): ",type(ga_eval))

            with open(self.parameters['test_hyperparameters'],'r') as f:
                contents = f.read()
                # print("files: ", files, "contents: ", contents)
            code = compile(contents, "<string>", "exec")
            loc = {}
            try:
                exec(code,globals(), loc)
                test_hyp = loc['test_hyp']
            except:
                print("Something wrong with the code...")
                print("test_hyperparameters: ", test_hyperparameters)
                print("type(test_hyperparameters): ",type(test_hyperparameters))


            ##### GA happening here#########
            def ga_mlpregressor(x_train, y_train, x_test, y_test, al=algorithm,n_splits=n_splits,n_generations=n_generations,early_stopping=early_stopping):
                global X
                global Y
                X=x_train
                Y=y_train
                print("Hyperparameter optimization is a time consuming process - do not shutdown Kernel....\n")
                print('Total GA search iterations = ', n_generations*pop_size)
                gann = GeneticAlgorithm(evaluate=ga_eval, space=space, fitness=final_fit, pop_size = pop_size, crossover_size=crossover_size, mutation_size=mutation_size, algorithm=al)
                global MLPRegressor
                from sklearn.neural_network import MLPRegressor
                global KFold
                from sklearn.model_selection import KFold
                import warnings
                warnings.filterwarnings("ignore")
                best_ind_df, best_individual = gann.search(n_generations=n_generations, early_stopping=early_stopping) # set pop_size<30, n_generations*pop_size = no. of times GA runs
                print("GeneticAlgorithm - complete!")

                all_items = list(gann.fitness_dict.items())
                all_items_df = pd.DataFrame(all_items, columns=['hyperparameters', 'Accuracy_score'])
                print("\n\ngenetic algorithm results for each generation: \n", best_ind_df, "\n\nbest particle: ", best_individual, "\n")
                print("Calculating accuracy on test data....")
                l = [best_individual['neurons1'], best_individual['neurons2'], best_individual['neurons3']]
                layers = [i for i in l if i != 0]
                ga_mlp = MLPRegressor(alpha=np.exp(best_individual['alpha']), activation=best_individual['activation'], hidden_layer_sizes=tuple(layers), learning_rate='invscaling', max_iter=20, early_stopping=True)
                ga_accuracy_test = test_hyp(mlp=ga_mlp, x=X, y=Y, xtest=x_test, ytest=y_test)
                print("\n\nTest set error_metric (default = MAE) for the best GA hyperparameter: ", ga_accuracy_test, "\n")
                return all_items_df , best_ind_df


            #Read data here
            self.required('dfx_train', req=True)
            dfx_train= self.inputs['dfx_train'].value
            self.required('dfy_train', req=True)
            dfy_train= self.inputs['dfy_train'].value
            self.required('dfx_test', req=True)
            dfx_test= self.inputs['dfx_test'].value
            self.required('dfy_test', req=True)
            dfy_test= self.inputs['dfy_test'].value


            # dfx_train = self.inputs['dfx_train'].value
            # dfy_train = self.inputs['dfy_train'].value
            # dfx_test = self.inputs['dfx_test'].value
            # dfy_test = self.inputs['dfy_test'].value

            # type of ML model defined here
            for key in list(self.parameters.keys()):
                if key == 'ml_model':
                    ml_model = self.parameters[key]
                    if ml_model == 'MLPRegressor':

                        best_ind_df, best_individual = ga_mlpregressor(x_train=dfx_train, y_train=dfy_train, x_test=dfx_test,y_test=dfy_test, al = algorithm, n_splits=n_splits, n_generations=n_generations, early_stopping=early_stopping)
                        # print(all_items_df)


            os.remove("tmp.txt") #remove tmp file to count number of GA iterations
            os.remove("GA.txt") #remove file with all GA iterations

            # now = datetime.now() #to save with current date and time
            # dt_string = now.strftime("%m-%d-%Y %H-%M-%S")
            # all_items_df.to_csv('best_ind_df' + str(dt_string) + '.csv')
            # best_ind_df.to_csv('best_individual' + str(dt_string) + '.csv')
            # print("GA DONEE!!!")

        except Exception as err:
            msg = '@Task #%i(%s): '%(self.iblock+1, self.Task) + type(err).__name__ + ': '+ str(err)
            raise TypeError(msg)

        order = [edge[1] for edge in self.Base.graph if edge[0] == self.iblock]
        for token in set(order):
            if token not in self.outputs:
                msg = "@Task #%i(%s): not a valid output token '%s'" % (self.iblock + 1, self.Task, token)
                raise NameError(msg)
            elif token == 'best_ind_df':
                self.set_value(token, best_ind_df)
                self.outputs[token].count = order.count(token)
                self.Base.send[(self.iblock, token)] = self.outputs[token]
            elif token == 'best_individual':
                self.set_value(token, best_individual)
                self.outputs[token].count = order.count(token)
                self.Base.send[(self.iblock, token)] = self.outputs[token]

        # step7: delete all inputs from memory
        del self.inputs
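
`GA.fit` repeats the same read, `compile`, `exec` sequence for each of the five user-supplied text files. As far as the flattened diff shows, the bare `except` branches also print a name that is never assigned when `exec` fails, which would raise a second error and hide the first. The sketch below shows one way to factor the sequence into a single helper that fails with a clear message. It is an illustration under assumptions: `load_object` is a hypothetical name, and a single shared namespace is used instead of the separate `globals()` and `loc` dictionaries in the commit.

```python
import os
import tempfile


def load_object(path, name, namespace=None):
    """Execute the Python source at `path` and return the object bound to `name`."""
    with open(path, "r") as f:
        source = f.read()
    # Copy the caller's namespace so injected names (e.g. np) are visible to the file.
    scope = dict(namespace or {})
    exec(compile(source, path, "exec"), scope)
    if name not in scope:
        raise NameError("%r is not defined in %s" % (name, path))
    return scope[name]


# Usage: write a tiny metric file, then load the function it defines.
with tempfile.TemporaryDirectory() as tmp:
    metric_path = os.path.join(tmp, "error_metric.txt")
    with open(metric_path, "w") as f:
        f.write(
            "def error_metric(y_true, y_pred):\n"
            "    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)\n"
        )
    error_metric = load_object(metric_path, "error_metric")

result = error_metric([1.0, 2.0], [2.0, 4.0])
```

Because `exec` runs whatever the file contains, this pattern is only appropriate for files the user wrote or trusts.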