A mining model is created by applying an algorithm to data, but it is more than an algorithm or a metadata container: it is a set of data, statistics, and patterns that can be applied to new data to generate predictions and make inferences about relationships.


Data Modeling



Predictive modeling is the process by which a model is created to predict an outcome. If the outcome is categorical it is called classification and if the outcome is numerical it is called regression. Descriptive modeling or clustering is the assignment of observations into clusters so that observations in the same cluster are similar. Finally, association rules can find interesting associations amongst observations.


Install Data Modeling with pip:


    pip install data-model-patterns
    pip install


Classification

Classification is the data mining task of predicting the value of a categorical variable (target or class) by building a model based on one or more numerical and/or categorical variables (predictors or attributes).

    # import classifier model
    from DataModeling import classifier


The classification examples use a sample dataset (the nominal weather dataset).

    # import pandas library
    import pandas as pd
    # sample data (weather nominal dataset)
    data = pd.DataFrame({
        'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy'],
        'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
        'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
        'Windy': ['False', 'True', 'False', 'False', 'False', 'True', 'True', 'False', 'False', 'False', 'True', 'True', 'False', 'True'],
        'Play Golf': ['N', 'N', 'Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'N']
    })


ZeroR

ZeroR is the simplest classification method: it relies only on the target and ignores all predictors. The ZeroR classifier simply predicts the majority category (class). Although ZeroR has no predictive power, it is useful for establishing a baseline performance against which other classification methods can be benchmarked.


python code

    Perform Zero-R Classifier
    Parameters ->
     - data (DataFrame): the dataset containing the predictor variables and target column.
     - target_column (str): the name of the target column in the dataset.

zero-r classifier:

    # initialize the zero-r classifier
    zero_r = classifier.ZeroR()
    # fit the classifier (assumes a fit(data, target_column) method)
    zero_r.fit(data, 'Play Golf')


    # predict the most frequent value
    #  calculate the model accuracy
    zero_r.score(data, 'Play Golf')
    # get the data summary
    zero_r.summary(data, 'Play Golf')
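
To make the baseline idea concrete, here is a minimal standalone sketch of the ZeroR rule in plain Python. It does not use the library; the function name `zero_r_baseline` is hypothetical and only the 'Play Golf' column from the sample data is needed.

```python
from collections import Counter

# 'Play Golf' column of the weather dataset above
play_golf = ['N', 'N', 'Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'N']

def zero_r_baseline(target):
    """Return the majority class and the accuracy of always predicting it."""
    majority_class, majority_count = Counter(target).most_common(1)[0]
    return majority_class, majority_count / len(target)

majority_class, baseline_accuracy = zero_r_baseline(play_golf)
print(majority_class, round(baseline_accuracy, 3))  # Y 0.643
```

Any other classifier trained on this data should beat 9/14 (about 64%) accuracy to be considered useful.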


OneR

OneR, short for "One Rule", is a simple yet accurate classification algorithm that generates one rule for each predictor in the data, then selects the rule with the smallest total error as its "one rule". To create a rule for a predictor, we construct a frequency table of that predictor against the target. It has been shown that OneR produces rules only slightly less accurate than state-of-the-art classification algorithms while producing rules that are simple for humans to interpret.


python code

    Perform One-R Classifier
    Parameters ->
     - data (DataFrame): the dataset containing the predictor variables and target column.
     - target_column (str): the name of the target column in the dataset.

one-r classifier:

    # initialize the one-r classifier
    one_r = classifier.OneR()
    # fit the classifier (assumes a fit(data, target_column) method)
    one_r.fit(data, 'Play Golf')


    # data prediction returns as list
    # show data best predictor and accuracy
    attribute, rule, accuracy = one_r.best_predictor()
    # get the data summary
    one_r.summary(data, 'Play Golf')
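
The frequency-table procedure described above can be sketched in a few lines of plain Python. This is an illustration of the OneR idea, not the library's implementation; the helper name `one_rule` is hypothetical, and ties between predictors are broken by column order here.

```python
from collections import Counter, defaultdict

# the weather dataset above, one list per column
columns = {
    'Outlook': 'Sunny Sunny Overcast Rainy Rainy Rainy Overcast Sunny Sunny Rainy Sunny Overcast Overcast Rainy'.split(),
    'Temperature': 'Hot Hot Hot Mild Cool Cool Cool Mild Cool Mild Mild Mild Hot Mild'.split(),
    'Humidity': 'High High High High Normal Normal Normal High Normal Normal Normal High Normal High'.split(),
    'Windy': 'False True False False False True True False False False True True False True'.split(),
}
target = 'N N Y Y Y N Y N Y Y Y Y Y N'.split()

def one_rule(values, labels):
    """Build the rule for one predictor and count its training errors."""
    table = defaultdict(Counter)  # frequency table: predictor value -> class counts
    for value, label in zip(values, labels):
        table[value][label] += 1
    # each predictor value predicts its most frequent class
    rule = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
    # every row outside the majority class of its value is an error
    errors = sum(sum(counts.values()) - max(counts.values()) for counts in table.values())
    return rule, errors

results = {name: one_rule(values, target) for name, values in columns.items()}
best_attribute = min(results, key=lambda name: results[name][1])
best_rule, best_errors = results[best_attribute]
print(best_attribute, best_rule, 1 - best_errors / len(target))
```

On this data 'Outlook' and 'Humidity' both make 4 errors out of 14, so the choice between them depends on the tie-breaking rule.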

Naive Bayesian

The Naive Bayesian classifier is based on Bayes’ theorem with independence assumptions between predictors. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly well and is widely used because it often outperforms more sophisticated classification methods.


python code

    Perform naive bayes classifier
    Parameters ->
     - X (matrix): the feature column in dataset.
     - y (array): the target column in dataset.

naive bayesian classifier:

    # initialize the naive bayes classifier
    nb_classifier = classifier.NaiveBayesian()
    #  - before training the model, convert categorical values to numerical
    #  - skip this step if the data already contains only numerical values
    data = pd.get_dummies(data, columns=['Outlook', 'Temperature', 'Humidity', 'Windy'])
    # separate feature and target values
    X = data.drop('Play Golf', axis=1)
    y = data['Play Golf']
    # fit the model with the dataset (assumes a fit(X, y) method)
    nb_classifier.fit(X, y)


    # get model accuracy score
    # get model confusion matrix
    # get classification report
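
As a rough illustration of how Bayes' theorem and the independence assumption combine, the sketch below scores a new day directly from category counts. It is a standalone categorical Naive Bayes with add-one (Laplace) smoothing, which is an assumption of this example; the library may work on one-hot encoded features and use a different variant. The function names are hypothetical.

```python
from collections import Counter, defaultdict

# rows of the weather dataset: (Outlook, Temperature, Humidity, Windy)
rows = list(zip(
    'Sunny Sunny Overcast Rainy Rainy Rainy Overcast Sunny Sunny Rainy Sunny Overcast Overcast Rainy'.split(),
    'Hot Hot Hot Mild Cool Cool Cool Mild Cool Mild Mild Mild Hot Mild'.split(),
    'High High High High Normal Normal Normal High Normal Normal Normal High Normal High'.split(),
    'False True False False False True True False False False True True False True'.split(),
))
labels = 'N N Y Y Y N Y N Y Y Y Y Y N'.split()

def fit_naive_bayes(rows, labels):
    """Count classes, and feature values per class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            feature_values[j].add(value)
    return class_counts, value_counts, feature_values

def predict_naive_bayes(model, row, alpha=1.0):
    """Return the class with the highest (unnormalised) posterior score."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for j, value in enumerate(row):
            # smoothed likelihood P(value | class), multiplied in independently
            score *= (value_counts[(j, label)][value] + alpha) / (count + alpha * len(feature_values[j]))
        scores[label] = score
    return max(scores, key=scores.get), scores

model = fit_naive_bayes(rows, labels)
prediction, scores = predict_naive_bayes(model, ('Sunny', 'Cool', 'High', 'True'))
print(prediction, scores)
```

For a sunny, cool, humid and windy day the 'N' score is the larger one, so the sketch predicts that golf is not played.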

Decision Tree - Classification

Decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.


python code

    Perform decision tree
    Parameters ->
     - X (matrix): the feature column in dataset.
     - y (array): the target column in dataset.
     - feature_name (str): the feature column names to apply rules.

decision tree:

    # initialize the decision tree classifier
    decisiontree = classifier.DecisionTree()
    #  - before training the model, convert categorical values to numerical
    #  - skip this step if the data already contains only numerical values
    data = pd.get_dummies(data, columns=['Outlook', 'Temperature', 'Humidity', 'Windy'])
    # separate feature and target values
    X = data.drop('Play Golf', axis=1)
    y = data['Play Golf']
    # fit the model with the dataset (assumes a fit(X, y) method)
    decisiontree.fit(X, y)


    # get model accuracy score
    # get model confusion matrix
    # get classification report
    # visualize tree rules
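
How a tree decides which predictor to split on first can be illustrated with entropy and information gain, the criterion used by ID3-style trees. Whether the library uses this criterion or another one (such as Gini impurity) is not stated here, so treat the sketch below as one common choice rather than a description of `classifier.DecisionTree`.

```python
from collections import Counter
from math import log2

columns = {
    'Outlook': 'Sunny Sunny Overcast Rainy Rainy Rainy Overcast Sunny Sunny Rainy Sunny Overcast Overcast Rainy'.split(),
    'Temperature': 'Hot Hot Hot Mild Cool Cool Cool Mild Cool Mild Mild Mild Hot Mild'.split(),
    'Humidity': 'High High High High Normal Normal Normal High Normal Normal Normal High Normal High'.split(),
    'Windy': 'False True False False False True True False False False True True False True'.split(),
}
target = 'N N Y Y Y N Y N Y Y Y Y Y N'.split()

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the target minus the weighted entropy after splitting on `values`."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [label for v, label in zip(values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

gains = {name: information_gain(values, target) for name, values in columns.items()}
root = max(gains, key=gains.get)
print(root, {name: round(gain, 3) for name, gain in gains.items()})
```

The predictor with the largest gain becomes the root decision node, and the same calculation is repeated on each resulting subset until the leaves are pure or no predictors remain.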


Regression

Regression is the data science task of predicting the value of a numerical target variable by building a model based on one or more predictors (numerical and/or categorical variables).

    # import regression model
    from DataModeling import regression


We create the sample data with the Dataset class from the data exploration analysis module.

Full documentation and sample: Github

install library

    pip install data-exploration-analysis

generate dataset:

    # import data exploration library
    from DataExploration import analysis
    # initialize the dataset class
    dataset = analysis.Dataset()
    # create a sample data
    sample_data = {
        'Feature1': ('float', 1, 10),
        'Feature2': ('float', 18, 23),
        'Target': ('float', 1, 3)
    }
    # generate the sample data
    data = dataset.make_dataset(sample_data, n_instance=10)

Decision Tree - Regression

Decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.


python code

    Perform decision tree
    Parameters ->
     - X (matrix): the feature column in dataset.
     - y (array): the target column in dataset.
     - feature_name (str): the feature column names to apply rules.

decision tree:

    # initialize the decision tree regressor
    decisiontree = regression.DecisionTree()
    # separate feature and target values
    X = data.drop('Target', axis=1)
    y = data['Target']
    # fit the model with the dataset (assumes a fit(X, y) method)
    decisiontree.fit(X, y)


    # get model evaluation metrics
    # visualize data tree rules

Multiple Linear Regression

Multiple linear regression (MLR) is a method used to model the linear relationship between a dependent variable (target) and one or more independent variables (predictors).


python code

    Perform multi linear regression
    Parameters ->
     - X (matrix): the feature column in dataset.
     - y (array): the target column in dataset.


    # initialize the mlr model 
    mlr = regression.MultipleLinearRegression()
    # fit the model with the dataset (assumes a fit(X, y) method)
    mlr.fit(X, y)


    # get model evaluation metrics
    # get coefficient and intercept of the model
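
For readers who want to see where the coefficients and intercept come from, here is a small dependency-free sketch of ordinary least squares solved through the normal equations. It is an illustration under the assumption that the model is fitted by least squares; the function name `fit_linear_regression` and the toy data are hypothetical, and the library's solver may differ.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    rows = [[1.0] + [float(v) for v in row] for row in X]  # prepend intercept column
    p = len(rows[0])
    # augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    solution = [A[i][p] / A[i][i] for i in range(p)]
    return solution[0], solution[1:]  # intercept, coefficients

# toy data generated from y = 1 + 2*x1 + 3*x2 (no noise)
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]

intercept, coefficients = fit_linear_regression(X, y)
print(round(intercept, 6), [round(c, 6) for c in coefficients])
```

Because the toy targets are an exact linear function of the two predictors, the fit recovers an intercept of about 1 and coefficients of about 2 and 3.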

K-Nearest Neighbor

K-nearest neighbors (KNN) is a simple algorithm that stores all available cases and predicts the numerical target based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the early 1970s.


python code

    Perform KNN
    Parameters ->
     - X (matrix): the feature column in dataset.
     - y (array): the target column in dataset.


    # initialize knn model
    knn = regression.KNearestNeighbor()
    # fit the model with the dataset (assumes a fit(X, y) method)
    knn.fit(X, y)


    # get model evaluation metrics
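
The "store all cases, then predict from the most similar ones" idea can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and an unweighted mean of the neighbours' targets; the function name `knn_predict` and the toy data are hypothetical.

```python
from math import dist

def knn_predict(X_train, y_train, query, k=3):
    """Average the targets of the k training points closest to `query`."""
    nearest = sorted(zip(X_train, y_train), key=lambda pair: dist(pair[0], query))[:k]
    return sum(target for _, target in nearest) / k

# toy training data: two predictors and a numerical target
X_train = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (8.0, 8.0), (9.0, 9.0)]
y_train = [10.0, 20.0, 30.0, 80.0, 90.0]

print(knn_predict(X_train, y_train, (2.5, 2.5), k=2))  # 25.0
print(knn_predict(X_train, y_train, (2.5, 2.5), k=3))  # 20.0
```

There is no training step beyond storing the data, so all of the work happens at prediction time.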


Clustering

A cluster is a subset of data points that are similar to one another. Clustering (a form of unsupervised learning) is the process of dividing a dataset into groups such that the members of each group are as similar (close) as possible to one another, and different groups are as dissimilar (far) as possible from one another. Clustering can uncover previously undetected relationships in a dataset.

    # import cluster model
    from DataModeling import clustering


We create the sample data with the Dataset class from the data exploration analysis module.

Full documentation and sample: Github

install library

    pip install data-exploration-analysis

generate dataset:

    # import data exploration library
    from DataExploration import analysis
    # initialize the dataset class
    dataset = analysis.Dataset()
    # create a sample data representing pixel height/width
    sample_data = {
        'px_height': ('float', 0.0, 1000.0),
        'px_width': ('float', 0.0, 1000.0)
    }
    # generate the sample data
    data = dataset.make_dataset(sample_data, n_instance=10)

Hierarchical Clustering

Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom.


python code

    Perform hierarchical clustering
    Parameters ->
     - X (DataFrame): the selected columns in the dataset to be clustered.

Hierarchical clustering:

    # initialize the hierarchical clustering model
    heirarchical = clustering.HeirarchicalClustering()
    # fit the model with dataset


    # visualize cluster dendrogram
    # get cluster labels
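
The cluster hierarchy that a dendrogram draws can be built with a short loop. The sketch below is an agglomerative (bottom-up) variant with single linkage and Euclidean distance; these are assumptions of this example, and the linkage the library uses is not stated here. The function name `agglomerative` and the points are hypothetical.

```python
from math import dist

def agglomerative(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point
    merges = []                                   # merge history, as drawn in a dendrogram
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(dist(points[i], points[j]) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
clusters, merges = agglomerative(points, n_clusters=3)
print(clusters)
for left, right, distance in merges:
    print(left, '+', right, 'at distance', distance)
```

Cutting the merge history at a different level yields a different number of clusters without rerunning the algorithm, which is the practical benefit of the hierarchy.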

K-Means Clustering

K-Means clustering aims to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of the greatest possible distinction. The best number of clusters k, the one leading to the greatest separation (distance), is not known a priori and must be computed from the data.


python code

    Perform K-Means
    Parameters ->
     - X (DataFrame): the selected columns/features in the dataset to be clustered.
     - n_clusters (int): the number of clusters.
     - feature_names (list): the column names as a list.
     - range_length (int): the length of the cluster range for which scores are shown.


    # initialize the kmeans clustering model
    # default number of cluster -> 3
    kmeans = clustering.KMeansClustering(n_clusters=3)
    # - before training the model
    # - extract the feature column names to a list
    col_names = data.columns.tolist()
    # - convert the dataframe to a numpy array
    X = data.values
    # fit the model with the dataset converted to array


    # get cluster centers
    # get kmeans model inertia
    # get cluster labels
    # visualize the cluster data points
    kmeans.plot_clusters(X, xlabel='px_height', ylabel='px_width')
    # get model silhouette score default length -> 10
    kmeans.score(X, range_length=10) 
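
The "nearest mean" idea corresponds to alternating two steps, often called Lloyd's algorithm. The sketch below is a minimal plain-Python version with a deliberately naive initialisation (the first k points); the function name `kmeans` and the points are hypothetical, and the library presumably relies on a more robust implementation.

```python
from math import dist

def kmeans(points, k, n_iterations=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = list(points[:k])  # naive initialisation: the first k points
    labels = []
    for _ in range(n_iterations):
        # assignment step: each point joins the cluster with the nearest center
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # update step: each center moves to the mean of its members
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    # inertia: sum of squared distances from each point to its assigned center
    inertia = sum(dist(p, centers[label]) ** 2 for p, label in zip(points, labels))
    return centers, labels, inertia

points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centers, labels, inertia = kmeans(points, k=2)
print(centers, labels, round(inertia, 3))
```

Running the sketch for several values of k and comparing inertia (or a silhouette score) is one way to choose k from the data.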

Association Rules

Association rule mining finds all sets of items (itemsets) whose support is greater than the minimum support, then uses those frequent (large) itemsets to generate the desired rules, whose confidence is greater than the minimum confidence. The lift of a rule X -> Y is the ratio of its observed support to the support expected if X and Y were independent.

    from DataModeling import association


The examples use a sample transaction dataset. Each column represents an item and each row is a transaction, encoded in binary (1 if the item is present, 0 if not).

    data = pd.DataFrame({
        'Milk': [1, 1, 1, 0, 0],
        'Bread': [1, 0, 1, 0, 1],
        'Butter': [0, 1, 0, 1, 1],
        'Beer': [0, 1, 1, 1, 1],
        'Eggs': [1, 0, 0, 1, 0]
    })
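
Support, confidence and lift are easy to compute by hand on data this small. The sketch below rewrites the five sample transactions as Python sets and evaluates the rule Butter -> Beer; the helper names are hypothetical and independent of the library.

```python
# the five rows of the sample transaction dataset, as sets of items
transactions = [
    {'Milk', 'Bread', 'Eggs'},
    {'Milk', 'Butter', 'Beer'},
    {'Milk', 'Bread', 'Beer'},
    {'Butter', 'Beer', 'Eggs'},
    {'Bread', 'Butter', 'Beer'},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent), estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline frequency."""
    return confidence(antecedent, consequent) / support(consequent)

butter, beer = {'Butter'}, {'Beer'}
print(support(butter | beer), confidence(butter, beer), lift(butter, beer))
```

Every transaction that contains Butter also contains Beer, so the confidence is 1.0, and because Beer appears in 4 of 5 transactions the lift is 1.25: a value above 1 indicates the items occur together more often than independence would predict.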

Association Rules - Apriori

The Apriori algorithm takes advantage of the fact that any subset of a frequent itemset is also a frequent itemset. The algorithm can therefore reduce the number of candidates being considered by exploring only the itemsets whose support count is greater than the minimum support count. Any candidate itemset can be pruned if it has an infrequent subset.


python code

    Perform Association Rule
    Parameters ->
     - data (DataFrame): the dataset being used.
     - min_support (float): minimum support threshold for an itemset. (to be considered frequent)
     - min_confidence (float): minimum confidence for a rule. (to be considered strong)
     - min_lift (float): minimum lift for a rule. (to be considered interesting)

Apriori - Association Rule

    # initialize association rule model
    ar_model = association.AssociationRules()
    # fit the model with dataset


    # get frequent itemsets
    # get association rules
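
The level-wise search and pruning described above can be sketched as follows. This is a simplified illustration, not the library's code: the function name `apriori` is hypothetical, and the sketch treats an itemset as frequent when its support is at least `min_support`, which is an assumption about how the threshold is applied.

```python
from itertools import combinations

transactions = [
    {'Milk', 'Bread', 'Eggs'},
    {'Milk', 'Butter', 'Beer'},
    {'Milk', 'Bread', 'Beer'},
    {'Butter', 'Beer', 'Eggs'},
    {'Bread', 'Butter', 'Beer'},
]

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items if support(frozenset([item])) >= min_support]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # prune step: drop candidates that have an infrequent (k-1)-subset,
        # then count support only for the survivors
        previous = set(current)
        current = [c for c in candidates
                   if all(frozenset(sub) in previous for sub in combinations(c, k - 1))
                   and support(c) >= min_support]
    return frequent

frequent_itemsets = apriori(transactions, min_support=0.5)
for itemset, value in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

With a minimum support of 0.5, 'Eggs' is dropped at the first level, so no candidate containing it is ever counted; the only frequent pair is Beer and Butter.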

Plot Configuration

Argument  Value
style     ggplot, bmh, dark_background, fivethirtyeight, grayscale
xlabel    label name for the X-axis
ylabel    label name for the Y-axis



Documentation Reference

Data Mining

Data Modeling

Feel free to contribute to this library by submitting issues or pull requests to the repository.


