-
Notifications
You must be signed in to change notification settings - Fork 0
/
userguided.html
175 lines (146 loc) · 12.6 KB
/
userguided.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
<!DOCTYPE HTML>
<!--
Monochromed by TEMPLATED
templated.co @templatedco
Released for free under the Creative Commons Attribution 3.0 license (templated.co/license)
-->
<html>
<head>
<title>FALCON</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta name="description" content="" />
<meta name="keywords" content="" />
<link href='http://fonts.googleapis.com/css?family=Oxygen:400,300,700' rel='stylesheet' type='text/css'>
<!--[if lte IE 8]><script src="js/html5shiv.js"></script><![endif]-->
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
<script src="js/skel.min.js"></script>
<script src="js/skel-panels.min.js"></script>
<script src="js/init.js"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-18293748-5', 'auto');
ga('send', 'pageview');
</script>
<noscript>
<link rel="stylesheet" href="css/skel-noscript.css" />
<link rel="stylesheet" href="css/style.css" />
</noscript>
<!--[if lte IE 8]><link rel="stylesheet" href="css/ie/v8.css" /><![endif]-->
<!--[if lte IE 9]><link rel="stylesheet" href="css/ie/v9.css" /><![endif]-->
</head>
<body>
<!-- Header -->
<div id="header">
<div class="container">
<!-- Logo -->
<div id="logo">
<h1><a href="#">User Guided Data Cleaning</a></h1>
<span style="color:black">It is hard to clean data without human in the loop</span>
</div>
<!-- Nav -->
<nav id="nav">
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="systems.html">Systems</a></li>
<li><a href="trusted.html">Trusted Data Cleaning</a></li>
<li class="active"><a href="userguided.html">User Guided</a></li>
<li><a href="others.html">Others</a></li>
<li><a href="resources.html">Resources</a></li>
</ul>
</nav>
</div>
</div>
<!-- Header -->
<!-- Main -->
<div id="main">
<div class="container">
<div class="row">
<!-- Content -->
<div id="content" class="12u skel-cell-important">
<section>
<header>
<h2>Abstract</h2>
</header>
<p>
Data cleaning is in general a hard task. Arguably, in practice, human will be heavily involved in different parts of data cleaning.
</p>
</section>
<section>
<header>
<h2>Falcon</h2>
</header>
<p>Falcon is an interactive, deterministic, and declarative data cleaning system, which uses SQL update (SQLU) queries as the language to repair data. Falcon does not rely on the existence of a set of pre-defined data quality rules. On the contrary, it encourages users to explore the data, identify possible problems, and make updates to fix them. Bootstrapped by one user update, Falcon guesses a set of possible SQL update queries that can be used to repair the data. The main technical challenge addressed in this work consists in finding a set of SQL update queries that is minimal in size and at the same time fixes the largest number of errors in the data. We formalize this problem as a search in a lattice-shaped space. To guarantee that the chosen updates are semantically correct, FALCON navigates the lattice by interacting with users to gradually validate the set of SQL update queries. Besides using traditional one-hop based traverse algorithms (e.g., BFS or DFS), we describe novel multi-hop search algorithms such that Falcon can dive over the lattice and conduct the search efficiently. Our novel search strategy is coupled with a number of optimization techniques to further prune the search space and efficiently maintain the lattice. </p>
</section>
<section>
<p>More examples will be provided later. At this moment please find more details from the paper.</p>
</section>
<section>
<header>
<h2>KATARA</h2>
</header>
<p>Classical approaches to clean data have relied on using integrity constraints, statistics, or machine learning. These approaches are known to be limited in the cleaning accuracy, which can usually be improved by consulting master data and involving experts to resolve ambiguity. The advent of knowledge bases (KBs), both general-purpose and within enterprises, and crowdsourcing marketplaces are providing yet more opportunities to achieve higher accuracy at a larger scale. KATARA is the first knowledge base and crowd powered data cleaning system that, given a table, a KB, and a crowd, interprets table semantics to align it with the KB, identifies correct and incorrect data, and generates top-k possible repairs for incorrect data.</p>
<p>KATARA consists of three modules: pattern discovery, pattern validation, and data annotation. The pattern discovery module discovers table patterns between a table and a KB. The pattern validation module uses crowdsourcing to select one table pattern. Using the selected table pattern, the data annotation module interacts with the KB and the crowd to annotate data. It also generates possible repairs for erroneous tuples. Moreover, new facts verified by crowd will be used to enrich KBs.</p>
<p>Consider a table T for soccer players in Figure 1. T has no table header, thus its semantics is completely unknown. We assume that we have access to a KB K (e.g., Yago) containing information related to T. KATARA works in the following steps.</p>
<p><img src="images/katara1.png" width="600" alt=""></p>
<li> <b>Pattern discovery.</b> It first discovers table patterns that contain the types of the columns and the relationships between them. A table pattern is represented as a labelled graph Figure 2(a) where a node represents an attribute and its associated type, e.g., "C (capital)" means that the type of attribute C in KB K is capital. A directed edge between two nodes represents the relationship between two attributes, e.g., "B hasCapital C" means that the relationship from B to C in K is hasCapital. A column could have multiple candidate types, e.g., C could also be of type city. However, knowing that the relationship from B to C is hasCapital indicates that capital is a better choice. Since KBs are often incomplete, the discovered patterns may not cover all attributes of a table, e.g., attribute G of table T is not captured by the pattern in Figure 2(a).</li>
<li> <b>Pattern validation.</b> Consider a case where pattern discovery finds two similar patterns: the one in Figure 2(a), and its variant with type location for column C. To select the best table pattern, we send the crowd the question “Which type (capital or location) is more accurate for values (Rome, Pretoria and Madrid)?”. Crowd answers will help choose the right pattern.</li>
<li> <b>Data annotation.</b> Given the pattern in Figure 2(a), KATARA annotates a tuple with the following three labels:</li>
(1) <i>Validated by the KB</i>. By mapping tuple t1 in table T to K, KATARA finds a full match, shown in Figure 2(b) indicating that Rossi (resp. Italy) is in K as a person (resp. country), and the relationship from Rossi to Italy is nationality. Similarly, all other values in t1 w.r.t. attributes A-F are found in K. We consider t1 to be correct w.r.t. the pattern in Figure 2(a) and only to attributes A-F.
<br>
(2) <i>Jointly validated by the KB and the crowd.</i> Consider t2 about Klate, whose explanation is depicted in Figure 2(c). In K, Katara finds that S. Africa is a country, and Pretoria is a capital. However, the relationship from S. Africa to Pretoria is missing. A positive answer from the crowd to the question “Does S. Africa hasCapital Pretoria?” completes the missing mapping. We consider t2 correct and generate a new fact “S. Africa hasCapital Pretoria”.
<br>
(3) <i>Erroneous tuple.</i> For tuple t3, there is also no link from Italy to Madrid in K (Figure 2(d)). A negative answer from the crowd to the question “Does Italy hasCapital Madrid?” confirms that there is an error in t3, At this point, however, we cannot decide which value in t3 is wrong, Italy or Madrid. Katara will then extract related evidences from K, such as Italy hasCapital Rome and Spain hasCapital Madrid, and use these evidences to generate a set of possible repairs for this tuple.
</p>
<p><img src="images/katara2.png" width="1200" alt=""></p>
</section>
<section>
<header>
<h2>Editing Rules</h2>
</header>
<p>Consider an input tuple t1 given in Fig. 1 (a). It specifies a supplier in the uk: name (fn, ln), phone number (area code AC and phone phn), address (street str, city, zip code) and items supplied. Here phn is either home phone or mobile phone, indicated by type (1 or 2, respectively).</p>
<p>A master relation Dm is shown in Fig. 1(b). Each tuple in Dm specifies a person in the uk in terms of the name, home phone (Hphn), mobile phone (Mphn), address, date of birth (DOB) and gender. Some example editing rules are: </p>
<p> ➡ eR1: for an input tuple t, if there exists a master tuple s in Dm with s[zip] = t[zip], then t should be updated by t[AC, str, city] := s[AC, str, city], provided that t[zip] is certain, i.e., it is assured correct by the user. </p>
<p>This rule makes corrections to attributes t[AC] and t[str], by taking values from master data s1. </p>
<p> ➡ eR2: if t[type] = 2 (indicating mobile phone) and if there is a master tuple s with s[Mphn] = t[phn], then t[FN, LN] := s[FN, LN], as long as t[phn, type] is certain. This standardizes t1[FN] by changing Bob to Robert. </p>
<p>As another example, consider input tuple t2 given in Fig. 1(a), in which attributes t2[str,zip] are missing, and t2[AC] and t2[city] are inconsistent. Consider an editing rule </p>
<p> ➡ eR3: if t[type] = 1 (indicating home phone) and if there exists a master tuple s in Dm such that s[AC, phn] = t[AC, Hphn], then t[str, city, zip] := s[str, city, zip], provided that t[type, AC, phn] is certain. </p>
<p>This helps us fix t2[city] and enrich t[str,zip] by taking the corresponding values from the master tuple s1. </p>
<p><a href="#" class="image full"><img src="images/erules.jpg" alt=""></a></p>
</section>
<section>
<header>
<h2>Publications</h2>
</header>
<p>
<br><b><i>Towards Certain Fixes with Editing Rules and Master Data</i></b>
<br><font color="blue">The 36th International Conference on Very Large Data Bases (VLDB), Singapore, 2010</font><font color="red"> (the best paper award)</font>
<br><i>Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Wenyuan Yu </i>
<br>
<br><b><i>Towards Certain Fixes with Editing Rules and Master Data</i></b>
<br><font color="blue">The VLDB Journal (VLDBJ) 21(2): 213-238, 2012</font><font color="red"> (Special issue: Best Papers of VLDB 2010, invited)</font>
<br><i>Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Wenyuan Yu </i>
<br>
<br><b><i>KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing </i></b>
<br><font color="blue">ACM SIGMOD Conference on Management of Data (SIGMOD), Melbourne, Australia, 2015 </font>
<br><i>Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang and Yin Ye</i>
<br>
<br><b><i>KATARA: Reliable Data Cleaning with Knowledge Bases and Crowdsourcing</i></b>
<br><font color="blue">The 41st International Conference on Very Large Data Bases (VLDB) (Demo), Kohala Coast, Hawai'i, 2015 </font>
<br><i>Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang and Yin Ye</i>
<br>
<br><b><i>Interactive and Deterministic Data Cleaning: A Tossed Stone Raises a Thousand Ripples </i></b>
<br><font color="blue">ACM SIGMOD Conference on Management of Data (SIGMOD), San Francisco, USA, 2016 </font>
<br><i>Jian He, Enzo Veltri, Donatello Santoro, Guoliang Li, Giansalvatore Mecca, Paolo Papotti, and Nan Tang</i>
</p>
</section>
</div>
<!-- /Content -->
</div>
</div>
</div>
<!-- Main -->
</body>
</html>