-
Notifications
You must be signed in to change notification settings - Fork 0
/
trusted.html
173 lines (152 loc) · 10 KB
/
trusted.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
<!DOCTYPE HTML>
<!--
Monochromed by TEMPLATED
templated.co @templatedco
Released for free under the Creative Commons Attribution 3.0 license (templated.co/license)
-->
<html>
<head>
<title>Trusted Data Cleaning</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta name="description" content="" />
<meta name="keywords" content="" />
<link href='http://fonts.googleapis.com/css?family=Oxygen:400,300,700' rel='stylesheet' type='text/css'>
<!--[if lte IE 8]><script src="js/html5shiv.js"></script><![endif]-->
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
<script src="js/skel.min.js"></script>
<script src="js/skel-panels.min.js"></script>
<script src="js/init.js"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-18293748-5', 'auto');
ga('send', 'pageview');
</script>
<noscript>
<link rel="stylesheet" href="css/skel-noscript.css" />
<link rel="stylesheet" href="css/style.css" />
</noscript>
<!--[if lte IE 8]><link rel="stylesheet" href="css/ie/v8.css" /><![endif]-->
<!--[if lte IE 9]><link rel="stylesheet" href="css/ie/v9.css" /><![endif]-->
</head>
<body>
<!-- Header -->
<div id="header">
<div class="container">
<!-- Logo -->
<div id="logo">
<h1><a href="#">Trusted Data Cleaning</a></h1>
<span style="color:black">Automatically and Conservatively Cleaning Diryt Data</span>
</div>
<!-- Nav -->
<nav id="nav">
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="systems.html">Systems</a></li>
<li class="active"><a href="trusted.html">Trusted Data Cleaning</a></li>
<li><a href="userguided.html">User Guided</a></li>
<li><a href="others.html">Others</a></li>
<li><a href="resources.html">Resources</a></li>
</ul>
</nav>
</div>
</div>
<!-- Header -->
<!-- Main -->
<div id="main">
<div class="container">
<div class="row">
<!-- Content -->
<div id="content" class="12u skel-cell-important">
<section>
<header>
<h2>Outline</h2>
</header>
<p>One notoriously hard data cleaning problem is, given a database, how to precisely capture which value is correct (i.e., proof positive) or wrong (i.e., proof negative). Although integrity constraints have been widely studied to capture data errors as violations, the accuracy of data cleaning using integrity constraints has long been controversial. Overall they deem one fundamental problem: Given a set of data values that together forms a violation, there is no evidence of which value is proof positive or negative. Hence, it is known that integrity constraints themselves cannot guide dependable data cleaning. In this project, we study rule-based data cleaning methods for trusted data cleaning.</p>
</section>
<section>
<header>
<h2>Fixing Rules</h2>
</header>
<p>We first discuss fixing rules, a special case of Sherlock rules.</p>
<p>Consider a database D of travel records for a research institute, specified by the schema: Travel (name, country, capital, city, conf), where a Travel tuple specifies a person, identified by name, who has traveled to conference (conf), held at the city of the country with capital. One Travel instance is shown in the figure below. All errors are highlighted and their correct values are given between brackets. For instance, r2[capital] = Shanghai is wrong, whose correct value is Beijing.</p>
<p><img src="images/srule1.jpg" width="600" alt=""></p>
</section>
</div>
<div id="sidebar" class="12u">
<section>
<header>
<h2>Key observation</h2>
</header>
<p>Data cleaning is not magic; it cannot guess something from nothing. What it does is to make decisions from evidence. Certain data patterns of semantically related values can provide evidence to precisely capture and rectify data errors. For example, when values (China, Shanghai) for attributes (country,capital) appear in a tuple, it suffices to judge that the tuple is about China, and Shanghai should be Beijing, the capital of China. In contrast, the values (China, Tokyo) are not enough to decide which value is wrong.</p>
<header>
<h2>Fixing rules</h2>
</header>
<p>A fixing rule contains an evidence pattern, a set of negative patterns, and a fact value. Given a tuple, the evidence pattern and the negative patterns of a fixing rule are combined to precisely tell which attribute is wrong, and the fact indicates how to correct it. The figure below shows two fixing rules. For the first fixing rule φ1, its evidence pattern, negative patterns and the fact are China, {Shanghai, Hongkong}, and Beijing, respectively. It states that for a tuple t, if its country is China and its capital is either Shanghai or Hongkong, capital should be updated to Beijing. For instance, consider the database D. Rule φ1 detects that r2 [capital] is wrong, since r2 [country] is China, but r2 [capital] is Shanghai. It will then update r2[capital] to Beijing. </p>
<p>Similarly, the second fixing rule φ2 states that for a tuple t, if its country is Canada, but its capital is Toronto, then its capital is wrong and should be Ottawa. It detects that r4[capital] is wrong, and then will correct it to Ottawa. After applying φ1 and φ2 , two errors, r2 [capital] and r4 [capital], can be repaired.</p>
<p><img src="images/srule2.jpg" width="1000" alt=""></p>
</section>
</div>
<div id="content" class="12u skel-cell-important">
<section>
<header>
<h2>Sherlock Rules</h2>
</header>
<p>Consider the database DEMP of Fig. 1 containing employee records, specified by the following schema: EMP (name,dept,nation,capital,bornat,officePhn), where each EMP tuple specifies an employee, identified by his/her name, department (dept), nationality (nation) with its capital, the city he/she was born at (bornat), and office phone number (officePhn). All errors are marked, e.g., t2[capital] = Shanghai is wrong, whose correct value is Beijing. </p>
<p>Consider the employee table in Fig. 1, the capital table in Fig. 2, and the phone table in Fig. 3.</p>
<p><img src="images/srule3.jpg" width="600" alt=""></p>
<p>We next discuss the three cases depicted in Fig. 4 which make use of three Sherlock rules φ1–φ3.</p>
<p>Case (i) Proof positive, proof negative and correction.
<br>Rule φ1 states that for a tuple t in DEMP, if its name matches the name of an r tuple in MPHN, and t[officePhn] matches r[mobile], then φ1 validates that t[name] is correct (proof positive), and t[officePhn] is wrong (proof negative). Moreover, it will rectify t[officePhn] to r[officePhn]. </p>
<p>Consider t3 in DEMP and r3 in MPHN, φ1 works as follows (see Fig. 4 case (i)). Firstly, t3[name] is matched with r3[name], and t3[officePhn] with r3[mobile]. It then detects that t3 is about Ian, but someone messed up his office number with his mobile number. Consequently, t3[name] is marked as correct and t3[officePhn] as wrong. Since the office number of Ian is available in r3[officePhn], φ1 will update t3[officePhn] to r3[officePhn], which is 27364928. </p>
<p>Case (ii) Proof positive and proof negative.
<br>Often times, not all evidences are available. Assume that the column officePhn is missing in PHN, i.e., we consider a revised schema PHN′ (name, mobile). Rule φ2 states that given a tuple t in DEMP, if its name matches the name of a tuple r in MPH′, and t[officePhn] matches r[mobile], then φ2 validates that t[name] is correct and t[officePhn] is wrong. </p>
<p>Again, consider t3 in DEMP and r3 in MPHN′, φ2 works similar to φ1 (see Fig. 4 case (ii)), which validates that t3[name] is correct and t3[officePhn] is wrong. However, due to the missing column officePhn in PHN′, φ2 cannot update t3 [officePhn]. </p>
<p>Case (iii) Proof positive.
<br>Rule φ3 states that for a tuple t in DEMP, if t[country,capital] matches s[country,capital] of an s tuple in MCAP, it will mark t[country,capital] as correct. </p>
<p>Consider t1 in DEMP and s1 in MCAP. Since both country (i.e., China) and capital (i.e., Beijing) match, φ3 will mark t1 [country, capital] as correct.</p>
<p><img src="images/srule4.jpg" width="1000" alt=""></p>
</section>
<section>
<header>
<h2>Publications</h2>
</header>
<p>
<br><b><i>Proof Positive and Negative in Data Cleaning</i></b>
<br><font color="blue">The 31st International Conference on Data Engineering (ICDE), Seoul, Korea, 2015</font>
<br><i>Matteo Interlandi and Nan Tang</i>
<br>
<br><b><i>Towards Dependable Data Repairing with Fixing Rules</i></b>
<br><font color="blue">ACM SIGMOD Conference on Management of Data (SIGMOD), Snowbird, USA, 2014</font>
<br><i>Jiannan Wang and Nan Tang</i>
<br>
</p>
</section>
</div>
<!-- /Content -->
</div>
</div>
</div>
<!-- Main -->
<!-- Footer -->
<div id="footer">
<div class="container">
<div class="row">
<div class="12u">
<section>
<header>
<h2>Future Work</h2>
</header>
<p>Rule-based data cleaning is more desirable in industries than integrity constraint-based algorithms. We plan to extend our idea to more diversified data formats.</p>
<p>
</p>
</section>
</div>
</div>
</div>
</div>
<!-- Footer -->
</body>
</html>