A mining model is created by applying an algorithm to data, but it is more than an algorithm or a metadata container: it is a set of data, statistics, and patterns that can be applied to new data to generate predictions and make inferences about relationships.
Predictive modeling is the process by which a model is created to predict an outcome. If the outcome is categorical, it is called classification; if the outcome is numerical, it is called regression. Descriptive modeling, or clustering, is the assignment of observations to clusters so that observations in the same cluster are similar. Finally, association rule mining finds interesting associations among items or attributes.
Install the data modeling library with pip.
CLI:
pip install data-model-patterns
pip install https://github.com/christiangarcia0311/data-exploration-analysis/raw/main/dist/data_model_patterns-3.1.0.tar.gz
Classification is a data mining task of predicting the value of a categorical variable (target or class) by building a model based on one or more numerical and/or categorical variables (predictors or attributes).
# import classifier model
from DataModeling import classifier
Using sample data (the weather nominal dataset).
# import pandas library
import pandas as pd
# sample data (weather nominal dataset)
data = pd.DataFrame({
'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy'],
'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
'Windy': ['False', 'True', 'False', 'False', 'False', 'True', 'True', 'False', 'False', 'False', 'True', 'True', 'False', 'True'],
'Play Golf': ['N', 'N', 'Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'N']
})
ZeroR is the simplest classification method: it relies only on the target and ignores all predictors. The ZeroR classifier simply predicts the majority category (class). Although ZeroR has no predictive power, it is useful for establishing a baseline performance as a benchmark for other classification methods.
Sample/Usage:
python code
"""
Perform Zero-R Classifier
Parameters ->
- data (DataFrame): the dataset containing the predictor variables and target column.
- target_column (str): the name of the target column in the dataset.
"""
zero-r classifier:
# initialize the zero-r classifier
zero_r = classifier.ZeroR()
# fit the classifier
zero_r.fit(data, 'Play Golf')
results/output:
# predict the most frequent value
zero_r.predict(data)
# calculate the model accuracy
zero_r.score(data, 'Play Golf')
# get the data summary
zero_r.summary(data, 'Play Golf')
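For comparison, the ZeroR baseline can be reproduced with pandas alone: the prediction is simply the most frequent target value, and the baseline accuracy is that value's relative frequency. A minimal sketch using the data defined above (the variable names here are illustrative, not part of the library):
# minimal ZeroR sketch using pandas only
majority_class = data['Play Golf'].mode()[0]
baseline_accuracy = (data['Play Golf'] == majority_class).mean()
print(majority_class, round(baseline_accuracy, 2))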
OneR, short for "One Rule", is a simple yet accurate classification algorithm that generates one rule for each predictor in the data, then selects the rule with the smallest total error as its "one rule". To create a rule for a predictor, we construct a frequency table of that predictor against the target. OneR has been shown to produce rules only slightly less accurate than state-of-the-art classification algorithms while producing rules that are simple for humans to interpret.
Sample/Usage:
python code
"""
Perform One-R Classifier
Parameters ->
- data (DataFrame): the dataset containing the predictor variables and target column.
- target_column (str): the name of the target column in the dataset.
"""
one-r classifier:
# initialize the one-r classifier
one_r = classifier.OneR()
# fit the classifier
one_r.fit(data, 'Play Golf')
results/output:
# predict values (returned as a list)
one_r.predict(data)
# get the best predictor, its rule, and its accuracy
attribute, rule, accuracy = one_r.best_predictor()
# get the data summary
one_r.summary(data, 'Play Golf')
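The rule-building step itself can be illustrated with plain pandas: for each predictor, build a frequency table against the target, take the majority class for each predictor value, and keep the predictor whose rule makes the fewest errors. A minimal sketch on the data above (variable names are illustrative):
# minimal OneR sketch: one candidate rule per predictor, keep the most accurate one
best_predictor, best_rule, best_accuracy = None, None, 0.0
for col in ['Outlook', 'Temperature', 'Humidity', 'Windy']:
    # majority class of the target for each value of this predictor
    rule = data.groupby(col)['Play Golf'].agg(lambda s: s.mode()[0])
    accuracy = (data[col].map(rule) == data['Play Golf']).mean()
    if accuracy > best_accuracy:
        best_predictor, best_rule, best_accuracy = col, rule.to_dict(), accuracy
print(best_predictor, best_rule, best_accuracy)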
The Naive Bayesian classifier is based on Bayes’ theorem with independence assumptions between predictors. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly well and is widely used because it often outperforms more sophisticated classification methods.
Sample/Usage:
python code
"""
Perform naive bayes classifier
Parameters ->
- X (matrix): the feature column in dataset.
- y (array): the target column in dataset.
"""
naive bayesian classifier:
# initialize the naive bayes classifier
nb_classifier = classifier.NaiveBayesian()
# - before training the model,
# - convert categorical values to numerical
# - skip this step if the data is already numerical
data = pd.get_dummies(data, columns=['Outlook', 'Temperature', 'Humidity', 'Windy'])
# separate feature and target values
X = data.drop('Play Golf', axis=1)
y = data['Play Golf']
# fit the model with the dataset
nb_classifier.fit(X, y)
results/output:
# get model accuracy score
nb_classifier.score()
# get model confusion matrix
nb_classifier.confusionmatrix()
# get classification report
print(nb_classifier.report())
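For intuition, the posterior can be worked out by hand with Bayes’ theorem: each class score is the class prior multiplied by the conditional probability of every observed predictor value given that class, and the highest-scoring class wins. A minimal sketch, assuming raw holds the categorical weather DataFrame from the start of this section (before the get_dummies conversion); no smoothing is applied:
# minimal Naive Bayes sketch on the categorical weather data (no smoothing)
instance = {'Outlook': 'Sunny', 'Temperature': 'Cool', 'Humidity': 'High', 'Windy': 'True'}
posteriors = {}
for cls, subset in raw.groupby('Play Golf'):
    score = len(subset) / len(raw)              # class prior P(class)
    for col, value in instance.items():
        score *= (subset[col] == value).mean()  # P(value | class), independence assumed
    posteriors[cls] = score
print(max(posteriors, key=posteriors.get), posteriors)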
Decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
Sample/Usage:
python code
"""
Perform decision tree
Parameters ->
- X (matrix): the feature column in dataset.
- y (array): the target column in dataset.
- feature_name (str): the feature column names to apply rules.
"""
decision tree:
# initialize the decision tree classifier
decisiontree = classifier.DecisionTree()
# - before training the model,
# - convert categorical values to numerical
# - skip this step if the data is already numerical
data = pd.get_dummies(data, columns=['Outlook', 'Temperature', 'Humidity', 'Windy'])
# separate feature and target values
X = data.drop('Play Golf', axis=1)
y = data['Play Golf']
# fit the model with the dataset
decisiontree.fit(X, y)
results/output:
# get model accuracy score
decisiontree.score()
# get model confusion matrix
decisiontree.confusionmatrix()
# get classification report
decisiontree.report()
# visualize tree rules
decisiontree.tree_rules(feature_name=X.columns)
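As a cross-check, the same kind of tree can be fit with scikit-learn, which is a separate dependency and not part of this library. A minimal sketch using the dummy-encoded X and y prepared above:
# cross-check with scikit-learn's decision tree (separate dependency, not this library's API)
from sklearn.tree import DecisionTreeClassifier, export_text
sk_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
sk_tree.fit(X, y)
print(sk_tree.score(X, y))                                   # training accuracy
print(export_text(sk_tree, feature_names=list(X.columns)))   # human-readable tree rules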
Regression is a data science task of predicting the value of a numerical target variable by building a model based on one or more predictors (numerical and/or categorical variables).
# import regression model
from DataModeling import regression
We create sample data to use via the Dataset class from the data-exploration-analysis module.
Full documentation and sample: Github
install library
pip install data-exploration-analysis
generate dataset:
# import data exploration library
from DataExploration import analysis
# initialize the dataset class
dataset = analysis.Dataset()
# create a sample data
sample_data = {
'Feature1': ('float', 1, 10),
'Feature2': ('float', 18, 23),
'Target': ('float', 1, 3)
}
# generate the sample data
data = dataset.make_dataset(sample_data, n_instance=10)
Decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
Sample/Usage:
python code
"""
Perform decision tree
Parameters ->
- X (matrix): the feature column in dataset.
- y (array): the target column in dataset.
- feature_name (str): the feature column names to apply rules.
"""
decision tree:
# initialize the decision tree regressor
decisiontree = regression.DecisionTree()
# separate feature and target values
X = data.drop('Target', axis=1)
y = data['Target']
# fit the model with the dataset
decisiontree.fit(X, y)
results/output:
# get model evaluation metrics
decisiontree.evaluation_metrics()
# visualize data tree rules
decisiontree.tree_rules(feature_name=X.columns)
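Which metrics evaluation_metrics() reports depends on the library, but standard regression metrics such as MAE, MSE, and R squared can be computed directly from a model's predictions. A minimal illustrative sketch, here using a scikit-learn regression tree (a separate dependency) only to produce predictions; max_depth is an arbitrary choice:
# illustrative regression metrics computed with numpy
import numpy as np
from sklearn.tree import DecisionTreeRegressor

y_true = np.asarray(y, dtype=float)
y_pred = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y_true).predict(X)

mae = np.mean(np.abs(y_true - y_pred))                                            # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)                                             # mean squared error
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)   # R squared
print(mae, mse, r2)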
Multiple linear regression (MLR) is a method used to model the linear relationship between a dependent variable (target) and one or more independent variables (predictors).
Sample/Usage:
python code
"""
Perform multi linear regression
Parameters ->
- X (matrix): the feature column in dataset.
- y (array): the target column in dataset.
"""
MLR
# initialize the mlr model
mlr = regression.MultipleLinearRegression()
# fit the model with the dataset
mlr.fit(X, y)
results/output:
# get model evaluation metrics
mlr.evaluation_metrics()
# get coefficient and intercept of the model
mlr.coef_intercept()
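Under the hood, MLR finds an intercept and one coefficient per predictor such that y is approximately intercept + b1*x1 + ... + bn*xn, usually by least squares. A minimal sketch with numpy's least-squares solver, independent of the library above:
# minimal least-squares fit with numpy (independent of the library above)
import numpy as np

X_mat = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])  # prepend a column of ones for the intercept
coeffs, *_ = np.linalg.lstsq(X_mat, np.asarray(y, dtype=float), rcond=None)
intercept, slopes = coeffs[0], coeffs[1:]
print(intercept, slopes)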
K nearest neighbors (KNN) is a simple algorithm that stores all available cases and predicts the numerical target based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition since the early 1970s as a non-parametric technique.
Sample/Usage:
python code
"""
Perform KNN
Parameters ->
- X (matrix): the feature column in dataset.
- y (array): the target column in dataset.
"""
KNN
# initialize knn model
knn = regression.KNearestNeighbor()
# fit the model with the dataset
knn.fit(X, y)
results/output:
# get model evaluation metrics
knn.evaluation_metrics()
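The prediction rule is short enough to write out directly: measure the distance from a query point to every training point, take the k nearest, and average their target values. A minimal numpy sketch (the query point and k are illustrative):
# minimal k-nearest-neighbor regression sketch with numpy
import numpy as np

X_train = np.asarray(X, dtype=float)
y_train = np.asarray(y, dtype=float)
query, k = X_train[0], 3                                      # illustrative query point and neighbor count

distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))    # Euclidean distance to every training point
nearest = np.argsort(distances)[:k]                           # indices of the k closest points
print(y_train[nearest].mean())                                # predicted value = mean target of the neighbors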
A cluster is a subset of data points that are similar to one another. Clustering (also called unsupervised learning) is the process of dividing a dataset into groups such that the members of each group are as similar (close) to one another as possible, and different groups are as dissimilar (far) from one another as possible. Clustering can uncover previously undetected relationships in a dataset.
# import cluster model
from DataModeling import clustering
We create sample data to use via the Dataset class from the data-exploration-analysis module.
Full documentation and sample: Github
install library
pip install data-exploration-analysis
generate dataset:
# import data exploration library
from DataExploration import analysis
# initialize the dataset class
dataset = analysis.Dataset()
# create a sample data representing pixel height/width
sample_data = {
'px_height': ('float', 0.0, 1000.0),
'px_width': ('float', 0.0, 1000.0)
}
# generate the sample data
data = dataset.make_dataset(sample_data, n_instance=10)
Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom.
Sample/Usage:
python code
"""
Perform hierarchical clustering
Parameters ->
- X (DataFrame): the selected columns in the dataset to be clustered.
"""
Hierarchical clustering:
# initialize the hierarchical clustering model
heirarchical = clustering.HeirarchicalClustering()
# fit the model with dataset
heirarchical.fit(X)
results/output:
# visualize cluster dendrogram
heirarchical.dendrogram()
# get cluster labels
heirarchical.labels()
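For reference, agglomerative hierarchical clustering can also be run with SciPy, which exposes the linkage matrix behind the dendrogram; SciPy and matplotlib are separate dependencies, not part of this library. A minimal sketch on the generated pixel data:
# reference run with SciPy's agglomerative clustering (separate dependencies)
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

Z = linkage(data.values, method='ward')          # bottom-up merge history (linkage matrix)
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 flat clusters
dendrogram(Z)
plt.show()
print(labels)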
K-Means clustering aims to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of greatest possible distinction. The best number of clusters k, leading to the greatest separation (distance), is not known a priori and must be computed from the data.
Sample/Usage:
python code
"""
Perform Kmeans
Parameters ->
- X (DataFrame): the selected column/feature in dataset to be cluster.
- n_clusters (int): initialize the number of clusters.
- feature_names (list): the feature column names as a list.
- range_length (int): the range of cluster counts for which to show scores.
"""
KMeans
# initialize the kmeans clustering model
# default number of clusters -> 3
kmeans = clustering.KMeansClustering(n_clusters=3)
# - before train the model
# - extract the feature column names to list
col_names = data.columns.tolist()
# - convert the dataframe to numpy array
X = data.values
# fit the model with the dataset converted to array
kmeans.fit(X)
results/output:
# get cluster centers
kmeans.centers(col_names)
# get kmeans model inertia
kmeans.inertia()
# get cluster labels
kmeans.labels()
# visualize the cluster data points
kmeans.plot_clusters(X, xlabel='px_height', ylabel='px_width')
# get model silhouette scores (default range_length -> 10)
kmeans.score(X, range_length=10)
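Because the best k is not known in advance, a common check is to fit K-Means over a range of k values and compare inertia (the elbow method) and silhouette scores. A minimal sketch with scikit-learn (a separate dependency, not this library's API):
# comparing inertia and silhouette score for several k with scikit-learn (separate dependency)
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))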
Association rule mining finds all sets of items (itemsets) whose support is greater than the minimum support, and then uses those large itemsets to generate the desired rules whose confidence is greater than the minimum confidence. The lift of a rule X -> Y is the ratio of its observed support to the support expected if X and Y were independent.
from DataModeling import association
Using sample data (a transaction dataset): each column represents an item, and each row represents a transaction as binary values (1 if the item is present, 0 if not).
data = pd.DataFrame({
'Milk': [1, 1, 1, 0, 0],
'Bread': [1, 0, 1, 0, 1],
'Butter': [0, 1, 0, 1, 1],
'Beer': [0, 1, 1, 1, 1],
'Eggs': [1, 0, 0, 1, 0]
})
The Apriori algorithm takes advantage of the fact that any subset of a frequent itemset is also a frequent itemset. The algorithm can therefore reduce the number of candidates being considered by exploring only those itemsets whose support count is greater than the minimum support count; any itemset with an infrequent subset can be pruned.
Sample/Usage:
python code
"""
Perform Association Rule
Parameters ->
- data (DataFrame): the dataset being used.
- min_support (float): minimum support threshold for an itemset. (to be considered frequent)
- min_confidence (float): minimum confidence for a rule. (to be considered strong)
- min_lift (float): minimum lift for a rule. (to be considered interesting)
"""
Apriori - Association Rule
# initialize association rule model
ar_model = association.AssociationRules()
# fit the model with dataset
ar_model.fit(data)
results/output:
# get frequent itemsets
print(ar_model.frequent())
# get association rules
print(ar_model.associationrules())
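To make the measures concrete, support, confidence, and lift for a single candidate rule, for example {Milk} -> {Bread}, can be computed directly from the binary transaction table above. A minimal pandas sketch (the chosen rule is illustrative):
# hand computation of support, confidence and lift for the rule {Milk} -> {Bread}
support_milk = (data['Milk'] == 1).mean()                           # P(Milk)
support_bread = (data['Bread'] == 1).mean()                         # P(Bread)
support_both = ((data['Milk'] == 1) & (data['Bread'] == 1)).mean()  # P(Milk and Bread) = rule support
confidence = support_both / support_milk                            # P(Bread | Milk)
lift = confidence / support_bread                                   # > 1 suggests a positive association
print(support_both, confidence, lift)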
| Arguments | Value |
| --- | --- |
| style | ggplot, bmh, dark_background, fivethirtyeight, grayscale |
| xlabel | label name in X-axis |
| ylabel | label name in Y-axis |
Feel free to contribute to this library by submitting issues or pull requests to the repository.