Skip to content

daqcri/PFD_Demo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PFD_Demo

Patterns (or regex-like rules) are widely used to discover meta-knowledge in a given domain, e.g. a Year column should contain only four digits, and thus a value like "1980-" would be erroneous. In addition, data dependencies across columns, e.g. Postal Code uniquely determines City is an important type of integrity constraints (ICs), which have been extensively studied. A promising, but not explored, direction is to leverage patterns to model the dependencies
(or meta-knowledge) between partial values across columns. For instance, in an employee ID "F-9-107", "F" determines the finance department.

We propose a novel class of ICs, called pattern functional dependencies (PFDs), to model fine-grained data dependencies gleaned from partial attribute values. These dependencies cannot be modeled using traditional ICs, such as (conditional) functional dependencies, which work on entire attribute values. We also present a set of axioms for the inference of PFDs, similar to Armstrong's axioms for FDs, as well as the analysis for the consistency and implication of a set of PFDs.
Moreover, we devise an effective algorithm to automatically discover PFDs even in the presence of dirty data.

Docker Installation

You need to install Docker first, then proceed to the following instructions.

Get the code

# clone using https
git clone https://github.com/daqcri/PFD_Demo.git

cd PFD_Demo

Build and run:

docker build -t pfd_demo .
docker run -it -p 8050:8050 pfd_demo
Open a web browser and write http://0.0.0.0:8050/ in the address bar.

Run from docker image:

docker run -it -p 8050:8050 qahtanaa/pfd_discovery

Required Parameters

Min support (K): the minimum number of records in which a pattern should appear to consider as a candidate for a PFD (better to use K > 3)
Allowed violations(𝜹): the max ratio of patterns that are different from the main pattern to report PFD (𝜹 = 1 is a good choice).
Min Coverage (𝜸): 
    coverage of a PFD is the number of records that contain its patterns. 
    A dependency between A and B is reported only if the coverage of the PFDs accumulate a coverage that is larger than a given threshold
    (from our empirical studies, using 𝜸 > 10 reduces the chance to report meaningless PFDs).

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published