PFD_Demo

Patterns (or regex-like rules) are widely used to discover meta-knowledge in a given domain, e.g. a Year column should contain only four digits, and thus a value like "1980-" would be erroneous. In addition, data dependencies across columns, e.g. Postal Code uniquely determines City, are an important type of integrity constraint (IC) and have been extensively studied. A promising but unexplored direction is to leverage patterns to model the dependencies (or meta-knowledge) between partial values across columns. For instance, in an employee ID "F-9-107", the prefix "F" determines the finance department.
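
The two examples above can be illustrated with a minimal Python sketch. It is not taken from this repository's code: the function names and the prefix-to-department table are hypothetical, and only the "F" to finance mapping comes from the text.

```python
import re

# Whole-value pattern: a Year must consist of exactly four digits.
YEAR_PATTERN = re.compile(r"[0-9]{4}")


def is_valid_year(value: str) -> bool:
    """Return True if the whole value matches the four-digit pattern."""
    return YEAR_PATTERN.fullmatch(value) is not None


# Hypothetical lookup table for illustration; only "F" -> finance is from the text.
PREFIX_TO_DEPARTMENT = {"F": "Finance"}


def department_from_id(employee_id: str):
    """Return the department implied by the ID prefix, or None if unknown."""
    prefix = employee_id.split("-", 1)[0]
    return PREFIX_TO_DEPARTMENT.get(prefix)


def violates_dependency(employee_id: str, department: str) -> bool:
    """Flag a record whose department disagrees with the one its ID prefix implies."""
    expected = department_from_id(employee_id)
    return expected is not None and expected != department


print(is_valid_year("1980"), is_valid_year("1980-"))
print(department_from_id("F-9-107"))
print(violates_dependency("F-9-107", "Sales"))
```

The second check depends on only part of the ID value (the prefix), which is the kind of dependency that a constraint over entire attribute values cannot express.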

We propose a novel class of ICs, called pattern functional dependencies (PFDs), to model fine-grained data dependencies gleaned from partial attribute values. These dependencies cannot be modeled using traditional ICs, such as (conditional) functional dependencies, which work on entire attribute values. We also present a set of axioms for the inference of PFDs, similar to Armstrong's axioms for FDs, as well as an analysis of the consistency and implication of a set of PFDs.
Moreover, we devise an effective algorithm that automatically discovers PFDs even in the presence of dirty data.

Docker Installation

Install Docker first, then follow the instructions below.

Get the code

# clone using https
git clone https://github.com/daqcri/PFD_Demo.git

cd PFD_Demo

Build and run:

docker build -t pfd_demo .
docker run -it -p 8050:8050 pfd_demo
Open a web browser and enter http://0.0.0.0:8050/ in the address bar. If that address does not load on your platform, try http://localhost:8050/ instead.

Alternatively, run from the Docker image:

docker run -it -p 8050:8050 qahtanaa/pfd_discovery

Required Parameters

Min support (K): the minimum number of records in which a pattern must appear to be considered as a candidate for a PFD (K > 3 is recommended).
Allowed violations (𝜹): the maximum ratio of patterns that differ from the main pattern for a PFD to still be reported (𝜹 = 1 is a good choice).
Min coverage (𝜸):
    The coverage of a PFD is the number of records that contain its patterns.
    A dependency between columns A and B is reported only if the PFDs between them accumulate a coverage larger than this threshold
    (in our empirical studies, using 𝜸 > 10 reduces the chance of reporting meaningless PFDs).
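
To make the interplay of the three parameters concrete, here is a rough Python sketch on toy data. It is not the algorithm implemented in pfd.py. It assumes that the only pattern considered is the ID prefix before the first "-", that violations are counted per record, and that 𝜹 is a percentage (the unit is not stated above). The function name and the records are hypothetical.

```python
from collections import Counter, defaultdict


def discover_prefix_rules(records, min_support, max_violation_pct, min_coverage):
    """Toy discovery of prefix -> value rules (an illustration, not the demo's algorithm).

    records: iterable of (employee_id, department) pairs.
    min_support: K, minimum number of records sharing a prefix.
    max_violation_pct: delta, assumed here to be a percentage of records.
    min_coverage: gamma, accumulated record count that must be exceeded.
    """
    # Group the right-hand-side values by the left-hand-side prefix.
    groups = defaultdict(Counter)
    for employee_id, department in records:
        prefix = employee_id.split("-", 1)[0]
        groups[prefix][department] += 1

    rules = {}
    coverage = 0
    for prefix, counts in groups.items():
        support = sum(counts.values())
        if support < min_support:
            continue  # too few records to be a candidate
        main_value, main_count = counts.most_common(1)[0]
        violation_pct = 100.0 * (support - main_count) / support
        if violation_pct <= max_violation_pct:
            rules[prefix] = main_value
            coverage += support  # records covered by this rule

    # Report the dependency only if the accumulated coverage exceeds gamma.
    if coverage > min_coverage:
        return rules, coverage
    return {}, coverage


# Hypothetical toy data: one dirty record ("F-9-107" labelled Sales).
records = (
    [("F-1-100", "Finance")] * 6
    + [("H-2-200", "HR")] * 5
    + [("F-9-107", "Sales")]
    + [("X-0-001", "Legal")]
)

print(discover_prefix_rules(records, min_support=4, max_violation_pct=20, min_coverage=10))
print(discover_prefix_rules(records, min_support=4, max_violation_pct=1, min_coverage=10))
```

In this toy setting, a looser violation threshold lets the "F" rule survive the single dirty record, while a strict one rejects it, and the remaining coverage then falls below the coverage threshold so that nothing is reported.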
