This is a project focused on providing capabilities for analyzing EEG data stored on the Hadoop Distributed File System (HDFS) using the Apache Spark library.
The data must be stored on HDFS.
Each experiment must have three files: a ".vmrk", a ".vhdr" and a ".eeg" file.
Because the ".vmrk", ".vhdr" and ".eeg" files usually share the same name and differ only in their suffix, pointing the application to just the ".eeg" file will suffice.
If either of the related ".vmrk" or ".vhdr" files is missing, the application will fail.
Another source type is an "info.txt" file, which lists the locations of ".eeg" files together with their target numbers, e.g.:
- Stankov/Stankov_2014_01.eeg 2
- Stankov/Stankov_2014_02.eeg 2
- <location of a .eeg file> <target number>
So the input must follow this rule:
- Required input, one of:
  - <location of a .eeg file> <target number> *<optional values>
    - the application will extract the .vhdr and .vmrk files by substituting the .eeg suffix with the appropriate one (see the sketch below)
  OR
  - <location of an info.txt file>
    - the application will loop over all of the listed files, extracting the .vhdr and .vmrk files in the same way, and also store the guessed number
    - each line has the format: <location of a .eeg file> <target number> *<optional values>
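A minimal sketch of the suffix substitution described above (variable names are illustrative, not taken from the application):

```java
// Derive the companion header (.vhdr) and marker (.vmrk) paths from a
// .eeg path by swapping the suffix.
String eegPath = "Stankov/Stankov_2014_01.eeg";
String base = eegPath.substring(0, eegPath.length() - ".eeg".length());
String vhdrPath = base + ".vhdr"; // Stankov/Stankov_2014_01.vhdr
String vmrkPath = base + ".vmrk"; // Stankov/Stankov_2014_01.vmrk
```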
The preprocessing steps are:
- Channel selection -> selecting the Fz, Cz and Pz channels
- Epoch extraction -> the raw EEG signal is split into fixed-size segments, commonly referred to as epochs; the epoch length was set to 1000 ms
- Baseline correction -> to compensate for possible shifts in the baseline, the average of the 100 ms before each epoch was subtracted from the whole epoch
The low-level logic is implemented in the OffLineDataProvider class, specifically in the processEEGFiles() method.
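A minimal sketch of the baseline-correction step, assuming epochs are stored as double[channels][samples] arrays (illustrative only, not the actual processEEGFiles() code):

```java
// Subtract the mean of the 100 ms pre-epoch window from every sample of
// the epoch, channel by channel.
public static double[][] baselineCorrect(double[][] epoch, double[][] preEpoch) {
    double[][] corrected = new double[epoch.length][];
    for (int ch = 0; ch < epoch.length; ch++) {
        double mean = 0.0;
        for (double sample : preEpoch[ch]) mean += sample;
        mean /= preEpoch[ch].length;
        corrected[ch] = new double[epoch[ch].length];
        for (int i = 0; i < epoch[ch].length; i++) {
            corrected[ch][i] = epoch[ch][i] - mean;
        }
    }
    return corrected;
}
```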
The feature extraction steps are:
- Interval selection -> in each epoch, only the 512 ms time interval starting 175 ms after the beginning of the epoch was used
- Discrete wavelet transform -> for each EEG channel, a 5-level DWT using the Daubechies 8 mother wavelet was performed. Based on the results, the 16 approximation coefficients of level 5 were used from each EEG channel. Finally, all three DWT results were concatenated to form a feature vector of dimension 48.
- Vector normalization -> the feature vectors were normalized so that all values lie between -1 and 1
The low-level implementation can be found in the WaveletTransform class, specifically in the extractFeatures(double[][] epoch) method.
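As an illustration of the normalization step (a sketch, not the actual WaveletTransform code):

```java
// Scale a feature vector so that every value lies within [-1, 1] by
// dividing by the maximum absolute value.
public static double[] normalize(double[] features) {
    double maxAbs = 0.0;
    for (double v : features) maxAbs = Math.max(maxAbs, Math.abs(v));
    if (maxAbs == 0.0) return features.clone();
    double[] normalized = new double[features.length];
    for (int i = 0; i < features.length; i++) {
        normalized[i] = features[i] / maxAbs;
    }
    return normalized;
}
```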
Various classifiers were implemented, each implementing the central IClassifier interface, which defines the following methods:
- void setFeatureExtraction(IFeatureExtraction fe); -> to set the feature extraction method
- void train(List<double[][]> epochs, List targets, IFeatureExtraction fe); -> to train the classifier
- ClassificationStatistics test(List<double[][]> epochs, List targets); -> to get the performance of a classifier
- void save(String file); -> to save a trained classifier
- void load(String file); -> to load a saved classifier
The various implementations are:
- Logistic Regression Classifier
- Support Vector Machine Classifier
- Decision Tree Classifier
- Random Forest Classifier
- Neural Network Classifier
The machine learning classifier implementations can be found in the Classification package.
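A hedged sketch of how these pieces fit together; only the IClassifier methods listed above are documented, so the RandomForestClassifier class name and the exact wiring here are assumptions:

```java
import java.util.List;

// Train, evaluate and persist a classifier through the IClassifier contract.
void runPipeline(List<double[][]> trainEpochs, List trainTargets,
                 List<double[][]> testEpochs, List testTargets) {
    IFeatureExtraction fe = new WaveletTransform();  // dwt-8 feature extraction
    IClassifier clf = new RandomForestClassifier();  // assumed class name
    clf.setFeatureExtraction(fe);
    clf.train(trainEpochs, trainTargets, fe);
    ClassificationStatistics stats = clf.test(testEpochs, testTargets);
    System.out.println(stats);                       // inspect performance
    clf.save("src/main/resources/Classifiers/rf_example");
}
```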
This library was developed in a modular manner so that different components can be combined to form the whole pipeline. The options the user has for specifying the workflow are the following:
- Id (REQUIRED - the id of the job)
  - id={*some_integer}
- Input file (REQUIRED - choose what type of input you want to provide)
  - info_file={*path_to_file}
  OR
  - eeg_file={*path_to_file} AND guessed_num={*number}
- Feature extraction (REQUIRED - choose the method used for feature extraction)
  - fe={dwt-8}
- Classification
  - a) (REQUIRED - choose whether to train or load a classifier)
    - train_clf={svm,dt,logreg,rf}
    OR
    - load_clf={svm,dt,logreg,rf} AND load_name={src/main/resources/Classifiers/(name)}
  - b) (OPTIONAL - choose whether to save the classifier and give it a name)
    - save_clf={true,false} AND save_name={*name}
  - c) (REQUIRED - the specific configuration for each classifier)
    - config_clf_param -> cannot be condensed here, as it is very specific to each classifier; see the per-classifier options below
- Saving the results (REQUIRED - choose where to save the classifier performance)
- result_path={*path_to_file}
The configuration must be passed in a special format of query parameters, i.e. param1=value1&param2=value2&...
For example the query parameters might be:
info_file=/user/digitalAssistanceSystem/data/numbers/info.txt&
fe=dwt-8&
train_clf=rf&
save_clf=true&
save_name=rf_random_forest_test_dorian&
config_num_trees=100&
config_max_depth=5&
config_min_instances_per_node=1&
config_max_bins=32&
config_feature_subset=auto&
config_impurity=gini&
result_path=/Users/dorianbeganovic/spark_server/results/1000329970.txt
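Parsing such a query string is straightforward; here is a minimal sketch (not the application's actual parser):

```java
import java.util.HashMap;
import java.util.Map;

// Split "param1=value1&param2=value2&..." into a key -> value map.
public static Map<String, String> parseConfig(String query) {
    Map<String, String> params = new HashMap<>();
    for (String pair : query.split("&")) {
        int eq = pair.indexOf('=');
        if (eq > 0) params.put(pair.substring(0, eq), pair.substring(eq + 1));
    }
    return params;
}
```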
The classifier-specific configuration parameters and their defaults are:
- Logistic Regression Classifier
- config_step_size -> default=1.0
- config_num_iterations -> default=100
- config_mini_batch_fraction -> default = 1.0
- Support Vector Machine Classifier
- config_step_size -> default=1.0
- config_num_iterations -> default=100
- config_mini_batch_fraction -> default = 1.0
- config_reg_param -> default = 0.01
- Decision Tree Classifier
- config_max_depth -> default=5
- config_max_bins -> default=32
- config_min_instances_per_node -> default = 1
- config_impurity -> default = gini ; options={gini,entropy}
- Random Forest Classifier
- config_max_depth -> default=5
- config_max_bins -> default=32
- config_min_instances_per_node -> default = 1
- config_impurity -> default = gini ; options={gini,entropy}
- config_feature_subset -> default=auto ; options={auto,all,sqrt,log2}
- config_num_trees -> default=100
- Neural Network Classifier
  - General classifier options
    - config_seed -> default=12345
    - config_num_iterations -> default=1000
    - config_learning_rate -> default=0.1
    - config_momentum -> default=0.5
    - config_weight_init -> default=xavier ; options={xavier, zero, sigmoid, uniform, relu}
    - config_updater -> default=nesterovs ; options={sgd, adam, nesterovs, adagrad, rmsprop} (good information: https://deeplearning4j.org/updater)
    - config_optimization_algo -> default=conjugate_gradient ; options={line_gradient_descent, lbfgs, conjugate_gradient, stochastic_gradient_descent}
    - config_loss_function -> default=xent ; options={mse, xent, squared_loss, negativeloglikelihood}
    - config_pretrain -> default=false
    - config_backprop -> default=true
  - Per-layer classifier options
    - config_layer(i)_layer_type -> default=output ; options={output, dense, auto_encoder, rbm, graves_lstm}
    - config_layer(i)_drop_out -> default=0.0
    - config_layer(i)_activation_function -> default=sigmoid ; options={sigmoid, softmax, relu, tanh, identity, softplus, elu}
    - config_layer(i)_n_out -> default=2
If you have trouble understanding what abbreviations like "sgd" mean, try googling a query like "dl4j sgd neural network meaning".
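As an illustration, a small two-layer network could be configured with parameters like the following; note that the concrete layer-index spelling (config_layer0_...) is an assumption based on the (i) placeholder above:
config_seed=12345&
config_learning_rate=0.1&
config_updater=nesterovs&
config_layer0_layer_type=dense&
config_layer0_activation_function=relu&
config_layer0_n_out=16&
config_layer1_layer_type=output&
config_layer1_activation_function=softmax&
config_layer1_n_out=2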
The application uses Apache Maven for dependency management and packaging.
To build the fat jar file, please use the command:
- mvn package
To skip the tests during the build, use the command:
- mvn package -DskipTests
The project is compatible with Apache Spark 1.6.2 and Apache Hadoop 2.6.0.
There are two common ways in which you can use this application:
- Development on a local machine
  - you need to have a Hadoop file system running on your machine
  - you also need to install and configure Apache Spark
  - you can run the code easily with IntelliJ IDEA
- Production use on a server
  - you can use spark-submit or a workflow manager like Oozie to submit the packaged JAR as a Spark job
This is an example of a script that submits a job to Apache Spark. Note the specific --conf parameters which are passed (you should make sure to pass them as well):
spark-submit \
--class cz.zcu.kiv.Main \
--master local[*] \
--conf spark.driver.host=localhost \
--conf spark.executor.extraJavaOptions=-Dlogfile.name=/Users/dorianbeganovic/spark_server/logs/1721658280 \
--conf spark.driver.extraJavaOptions=-Dlogfile.name=/Users/dorianbeganovic/spark_server/logs/1721658280 \
/Users/dorianbeganovic/gsoc/Spark_EEG_Analysis/target/spark_eeg-1.2-jar-with-dependencies.jar \
"info_file=/user/digitalAssistanceSystem/data/numbers/info.txt&fe=dwt-8&load_clf=logreg&load_name=logreg_0000&config_reg_param=0.01&config_mini_batch_fraction=1.0&config_num_iterations=100&config_step_size=1.0&result_path=/Users/dorianbeganovic/spark_server/results/1721658280.txt"