\chapter{Fundamental concepts}
\label{chap:fundamental}
\glsresetall
\chapterprecishere{The simple believes everything,
\par\raggedleft but the prudent gives thought to his steps.
\par\raggedleft--- \textup{Proverbs 14:15} (ESV)}
A useful starting point for someone studying data science is a definition of the term itself.
In this chapter, I discuss some definitions from the literature and provide a definition of my
own. As discussed in \cref{chap:history}, there is no consensus on the definition of data
science. However, the existing definitions agree that data science is cross-disciplinary and a very
important field of study.
Another important discussion is the evidence that data science is actually a new science.
I argue that a ``new science'' is not a subject whose basis is built from the ground
up\footnote{That would be as unproductive as creating a ``new math'' for each new
application. All ``sciences'' rely on each other in some way.}, but a subject that has a
particular object of study and that meets some criteria.
Once we establish that data science is a new science, we need to understand one core
concept: data. In this book, I focus on structured data, which are data that are organized
in a tabular format. I discuss the importance of understanding the nature of the data we
are working with and how we represent them.
Finally, I discuss two important concepts in data science: database normalization and tidy
data. Database normalization is mainly focused on data storage. Tidy data is mainly
focused on the requirements of data for analysis. Both concepts interact with each other
and have their mathematical foundations. I bridge the gap between the two concepts by
discussing their common mathematical foundations.
\begin{mainbox}{Chapter remarks}
\boxsubtitle{Contents}
\startcontents[chapters]
\printcontents[chapters]{}{1}{}
\vspace{1em}
\boxsubtitle{Context}
\begin{itemize}
\itemsep0em
\item There is no consensus on the definition of data science.
\item Understanding the nature of data is important to extract knowledge from them.
\item Structured data are data that are organized in a tabular format.
\end{itemize}
\boxsubtitle{Objectives}
\begin{itemize}
\itemsep0em
\item Define data science.
\item Present the main concepts about data theory.
\end{itemize}
\boxsubtitle{Takeaways}
\begin{itemize}
\itemsep0em
\item Data science is a new science that studies the knowledge extraction from
measurable phenomena using computational methods.
\item Database normalization and tidy data are complementary concepts that interact
with each other.
\end{itemize}
\end{mainbox}
{}
\clearpage
\section{Data science definition}
In literature, we can find many definitions and descriptions of data science.
For \textcite{Zumel2019}\footfullcite{Zumel2019}, \emph{``data science is a cross-disciplinary practice that draws
on methods from data engineering, descriptive statistics, data mining, machine learning,
and predictive analytics.''} They compare the area with operations research, stating
that data science focuses on implementing data-driven decisions and managing their
consequences.
\textcite{Wickham2023}\footfullcite{Wickham2023} state that \emph{``data science is an exciting discipline that
allows you to transform raw data into understanding, insight, and knowledge.''}
\textcite{Hayashi1998}\footfullcite{Hayashi1998} says that data science ``is not only a
synthetic concept to unify statistics, data analysis and their related methods, but also
comprises its results'' and that it ``intends to analyze and understand actual phenomena
with `data.'{}''
I find the first definition too restrictive, since new methods and techniques are always
under development. We never know when new ``data-related'' methods will become obsolete
or a trend. Also, \citeauthor{Zumel2019}'s view gives the impression that data science is an
operations research subfield. Although I do not try to prove otherwise, I think it
is much more useful to see it as an independent field of study. Obviously, there are
many intersections between both areas (and many other areas as well). Because of such
intersections, I try my best to keep definitions and
terms standardized throughout chapters, sometimes avoiding popular terms that may generate
ambiguities or confusion.
The second one is not really a definition. However, it states clearly \emph{what} data
science enables us to do. The terms ``understanding,'' ``insight,'' and ``knowledge'' are
very important in the context of data science. They are the goals of a data science
project.
The third definition brings an important aspect behind the data: the phenomena from which
they come. Data science is not only about data, but about understanding the phenomena
they represent.
Note that these definitions do not contradict each other. However, none of them attempts to
emphasize the ``science'' aspect of the field. From these thoughts, let us define the term.
\begin{defbox}{Data science}{ds}
Data science is the study of knowledge extraction from
measurable phenomena using computational methods.
\end{defbox}
I want to highlight the meaning of some terms in this definition. \emph{Computational methods} means
that data science methods use computers to handle data and perform the calculations.
\emph{Knowledge} means information that humans can easily understand and/or apply to solve
problems. \emph{Measurable phenomena} are events or processes where raw data can be
quantified in some way\footnote{%
Non-measurable phenomena are related to metaphysics and are not the object of study in
data science. They may be the object of study in other sciences, such as
philosophy, theology, etc. However, many concepts from metaphysics are borrowed to
explain data science concepts.%
}. \emph{Raw data} are data collected directly from some source and
that have not been subject to any other transformation by a software program or a human
expert. \emph{Data} is any piece of information that can be digitally stored.
\textcite{Kelleher2018} summarize very well the challenges data science takes up:
``extracting non-obvious and useful patterns from large data sets [\dots]; capturing,
cleaning, and transforming [\dots] data; [storing and processing] big [\dots] data sets;
and questions related to data ethics and regulation.''
The naming of data science contrasts with that of conventional sciences. Usually, a ``science'' is named after
its object of study. Biology is the study of life, Earth science studies the planet
Earth, and so on. I argue that data science does not study data themselves, but how we can
use them to understand a certain phenomenon.
One similar example is ``computer science.'' Computer science is not the study of
computers themselves, but the study of computing and computer systems. Similarly, one
could state that data science studies knowledge extraction\footnote{Related to data
analysis, see \cref{sub:time-analysis}.} and data systems\footnote{Related to data
handling, see \cref{sub:time-handling}.}.
Moreover, the conventional scientific paradigm is
essentially model-driven: we observe a phenomenon related to the object of study, we
reason about the possible explanation (the model or hypothesis), and we validate our hypothesis
(most of the time using data, though). In data science, however, we extract the knowledge
directly and primarily from the data. The expert knowledge and reasoning may be taken
into account, but we give data the opportunity to surprise us.
Thus, while the objects of study in conventional sciences are the phenomena themselves
and the models that can explain them, the objects of study in data
science are the means (computational methods and processes) that can extract reliable and ethical
knowledge from data acquired from any measurable phenomenon --- and, of course, their
consequences.
\def\verrids{(0,0) circle (20mm)}
\def\verrist{(-2.5,0) circle (15mm)}
\def\verride {(2.5,0) circle (15mm)}
\def\verrics {(0,-2.5) circle (15mm)}
\begin{figurebox}[label=fig:myview]{My view of data science.}
\centering
\begin{tikzpicture}
\begin{scope}
\clip \verrids;
\fill[filled] \verrist;
\fill[filled] \verride;
\fill[filled] \verrics;
\end{scope}
\draw[outline] \verrids node(ds) {};
\draw[outline] \verrist node {Statistics};
\draw[outline, text width=27mm, text centered] \verride node {Philosophy / domain expertise};
\draw[outline] \verrics node {Computer science};
\node[anchor=north,above] at (0, 1) {Data science};
\end{tikzpicture}
\tcblower
Data science is an entirely new science. Being a new science
does not mean that its basis is built from the ground up. Most of the subjects in
data science come from other sciences, but its object of study (computational methods
to extract knowledge from measurable phenomena) is particular enough to unfold
new scientific questions -- such as data ethics, data collection, etc.
Note that I emphasize philosophy over domain expertise because, in terms
of scientific knowledge, the former is more general than the latter.
\end{figurebox}
\Cref{fig:myview} shows my view of data science. Data science is an entirely new science
that incorporates concepts from other sciences. In the next section, I present the reasons
to understand data science as a new science.
\section{The data science continuum}
In the previous section, I argued that data science is a new science by defining its object
of study. This is just the first step in establishing a new science, especially because the
object of study in data science is not new. Computer science, statistics, and other
sciences have been studying methods to process data for a long time.
One key aspect of the establishment of a new science is the social demand and the
importance of the object of study in our society. Many say that ``data is the new oil.''
This is because the generation, storage, and processing of data have increased exponentially
in the last decades. As a consequence, whoever holds the data and can effectively extract
knowledge from them has a competitive advantage.
As a consequence of this demand, a set of methods is developed and then experiments are
designed to assess their effectiveness. If the methods are effective, they gain
credibility, are widely accepted, and become the foundation of a new scientific
discipline.
Usually, a practical consequence of academic recognition is the creation of new courses
and programs in universities. This is the case of data science. Many universities have
created data science programs in recent years.
Once efforts to develop the subject increase, it is natural that methodologies evolve and
that questions arise that are not particularly related to any other science. This effect produces what I
call the ``data science continuum.''
At the beginning of the continuum, the subject is not yet a new science. It is a set of methods and
techniques borrowed from other sciences. However, some principles emerge that are
connected with more than one already established science. (For instance, a traditional
computational method adapted to assume statistical properties of the data.) With time,
the premises and hypotheses of new methods become distinctive. The particular properties
of the methods lead to the inception of methodologies to validate them. While validating
the methods, new questions arise.
\begin{figurebox}[label=fig:continuum]{The data science continuum.}
\centering
\begin{tikzpicture}[node distance=10mm and 3mm]
% Base Layer: Established Sciences
\node (stats) [block] {Statistics};
\node (cs) [block, right=of stats] {Computer science};
\node (ds) [block, right=of cs] {Philosophy and others};
% A box around the Base Layer
\node (basebox) [draw, dashed, inner sep=0.5cm, fit={(stats) (ds)}, label=above:{Established sciences}] {};
% Middle Layer: Emergence of Principles
\node (principles) [darkblock, below=of basebox, minimum width=6cm, text width=5cm] {Emergence of principles};
% Top Layer: Unique Methods and Validation
\node (methods) [block, below=of principles, minimum width=6cm, text width=5cm] {Unique methods};
\node (validation) [darkblock, below=of methods, minimum width=6cm, text width=5cm] {Validation and new challenges};
\node (science) [block, below=of validation, minimum width=6cm, text width=5cm] {Data science};
% Arrows
\draw[-{Stealth}] (basebox) -- (principles);
\draw[-{Stealth}] (principles) -- (methods);
\draw[-{Stealth}] (methods) -- (validation);
\draw[-{Stealth}] (validation) -- (science);
\end{tikzpicture}
\tcblower
The data science continuum is the process of development of data science as a new
science. It began by borrowing methods and techniques from established sciences. Over
time, distinct principles emerged that spanned multiple disciplines. As these
principles developed, new methods and their premises became unique. This uniqueness
led to the creation of specific methodologies for validating these methods. During the
validation process, new questions and challenges arose, further distinguishing data
science from its parent disciplines.
\end{figurebox}
The data science continuum is an instance of this process; see \cref{fig:continuum}. At
first glance, data science seems like just a combination of computer science, statistics,
linear algebra, etc. However, the principles and priorities of data science are not the
same as the ones in those disciplines. Similarly, the accepted methodologies in data
science differ from, and keep evolving beyond, the ones in the other sciences. New questions
arise, such as:
\begin{itemize}
\itemsep0em
\item How can we guarantee that the data we are using are reliable?
\item How can we collect data in a way that does not bias our conclusions?
\item How can we guarantee that the data we are using are ethical?
\item How can we present our results in a way that is understandable to non-experts?
\end{itemize}
\section{Fundamental data theory}
As I stated, data science is not an isolated science. It incorporates several concepts
from other fields and sciences. In this section, I explain the basis of each component of
\cref{def:ds}.
\subsection{Phenomena}
\label{sub:phenomena}
Phenomenon is a term used to describe any observable event or process. Phenomena are the
sources we use to understand the world around us. In general, we use our senses to
perceive phenomena. To make sense of them, we use our knowledge and reasoning.
Philosophy is the study of knowledge and reasoning. It is a very broad field of study
that has been divided into many subfields. One possible starting point is \gls{ontology},
which is the study of being, existence, and reality. Ontology studies what exists and how
we can classify it. In particular, ontology describes the nature of categories,
properties, and relations.
Aristotle (384 -- 322 BC) was one of the first philosophers to study ontology. In
Κατηγορίαι\footnote{For Portuguese readers, I suggest \fullcite{CategoriesUnesp}.}, he
proposed a classification of the world into ten categories. Substance, or οὐσία,
is the most important one. It is the category of being. The other categories
are quantity, quality, relation, place, time, position, state, action, and affection.
Although rudimentary\footnote{Most historians agree that Categories was written before
Aristotle's other works. Many concepts are further developed in his later works.},
Aristotle's categories served as a basis for the development of logical reasoning and
scientific classification, especially in the Western world. The categories are still
used in many applications, including computer systems and data systems.
Aristotle marked a rupture with many previous philosophers. While Heraclitus (\nth{6}
century -- \nth{5} century BC) defended that everything is in a constant state of flux and
Plato (c. 427 -- 348 BC) defended that only the perfect can be known, Aristotle focused on
the world we can perceive and understand. His practical view also opposed Antisthenes' (c.
446 -- 366 BC) view that the predicate determines the object, which leads to the
impossibility of negation and, consequently, of contradiction.
What is the importance of ontology for data science? Describing, which is basically
reducing the complexity of the world to simple, small pieces, is the first step to
understanding any phenomenon. Drawing a simplistic parallel, phenomena are like the
substance category, and the data we collect are like the other categories, which describe
the properties, relations, and states of the substance. A person who can easily organize
their thoughts to identify the entities and their properties in a problem is more likely
to collect relevant data. Also, the understanding of logical and grammatical limitations
--- such as univocal and equivocal terms --- is important to avoid errors in data
science applications\footnote{It is very common to see data scientists reducing the
meaning of the columns in a dataset to a single word. Or, even worse, assuming
that the same word in different columns has the same meaning. This is a common source
of errors in data science projects.}.
Another important field in Philosophy is epistemology, which is the study of knowledge.
Epistemology elaborates on how we can acquire knowledge and how we can distinguish between
knowledge and opinion. In particular, epistemology studies the nature of knowledge,
justification, and the rationality of belief.
Finally, logic is the study of reasoning. It studies the nature of reasoning and
argumentation. In particular, logic studies the nature of inference, validity, and
fallacies.
I further discuss knowledge and reasoning in \cref{sub:knowledge}.
In the context of a data science project, we usually focus on phenomena from a particular domain of
expertise. For example, we may be interested in phenomena related to the stock
market, the weather, or human health.
Thus, we need to understand the nature of the phenomena we are studying.
Fully understanding the phenomena we are tackling requires both a general knowledge
of epistemology, ontology, and logic, and a particular knowledge of the domain of
expertise.
Observe as well that we do not restrict ourselves to the intellectual understanding of
philosophy. There are several computational methods that implement the concepts of
epistemology, ontology, and logic. For example, we can use a computer to perform
deductive reasoning, to classify objects, or to validate an argument. Also, we have
strong mathematical foundations and computational tools to organize categories, relations, and
properties.
The reason we need to understand the nature of the phenomena we are studying is that we
need to guarantee that the data we are collecting are relevant to the problem we are
trying to solve. An incorrect perception of the phenomena may lead to incorrect data
collection, which certainly leads to incorrect conclusions.
\subsection{Measurements}
Similarly to Aristotle's work, data scientists focus on the world we can perceive with our
senses (or using external sensors). In a more restrictive way, we focus on the world we
can measure\footnote{Some phenomena might be knowable but not measurable. For example,
the existence of God is a knowable phenomenon, but it is not measurable.}. Measurable
phenomena are
those that we can quantify in some way. For example, the temperature of a room is a
measurable phenomenon because we can measure it using a thermometer. The number of
people in a room is also a measurable phenomenon because we can count them.
When we quantify a phenomenon, we perform data collection. Data collection is the process
of gathering data on a targeted phenomenon in an established, systematic way.
Systematic means that we have a plan to collect the data and we understand the
consequences of the plan, including the sampling bias. Sampling bias is the influence
that the method of collecting the data has on the conclusions we can draw from them.
Once we have collected the data, we need to store them. Data storage is the process of
storing data in a computer.
To perform those tasks, we need to understand the nature of data. Data are any piece of
information that can be digitally stored. Data can be stored in many different formats.
For example, we can store data in a spreadsheet, in a database, or in a text file. We can
also represent data using many different types. For example, we can store data as numbers,
strings, or dates.
In data science, studying data types is important because they need to correctly reflect
the nature of the source phenomenon and be compatible with the computational methods we
are using. Data types also restrict the operations we can perform on the data.
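
To make this concrete, the short Python sketch below (with made-up values; the particular
language is incidental) shows how the chosen type restricts the operations available on
the same piece of data.
\begin{verbatim}
from datetime import date

# The same measurement stored under two different types.
as_string = "2024-03-01"       # a string: only character operations apply
as_date = date(2024, 3, 1)     # a date: calendar arithmetic applies

# Valid for the date representation only:
elapsed = date(2024, 3, 15) - as_date
print(elapsed.days)            # 14

# The string representation would first need to be parsed into a date.
\end{verbatim}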
The foundation and tools to understand data types come from computer science. Among the
subfields, I highlight:
\begin{itemize}
\itemsep0em
\item Algorithms and data structures: the study of data types and the computational
methods to manipulate them.
\item Databases: the study of storing and retrieving data.
\end{itemize}
The basic concepts are the same regardless of the programming language, hardware
architecture, or the \gls{rdbms} we are using. As a consequence, in this book, I focus on
the concepts and not on the tools.
\subsection{Knowledge extraction}
\label{sub:knowledge}
As discussed before, knowledge and reasoning are important aspects of data science.
Philosophical and mathematical foundations from epistemology and logic provide us ways
to obtain knowledge from a set of premises and known (and accepted) facts\footnote{In
mathematics, we call the premises and accepted facts axioms. Corollaries,
lemmas, and theorems are the results of the reasoning process.}.
Deductive reasoning is the process of deriving a conclusion (or new knowledge) from
a set of previously established knowledge. Deductive reasoning, thus, enables us to infer
new generalization rules from generalization rules already known.
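For instance, a classical syllogistic pattern lets us deduce a new general rule from two
general rules we already accept:
\[
  \forall x\,\bigl(A(x) \rightarrow B(x)\bigr), \qquad
  \forall x\,\bigl(B(x) \rightarrow C(x)\bigr)
  \;\vdash\;
  \forall x\,\bigl(A(x) \rightarrow C(x)\bigr).
\]
If every measurement is a datum and every datum can be digitally stored, we may conclude
that every measurement can be digitally stored, without observing a single measurement.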
Important figures that bridged the gap between philosophy and mathematics are
René Descartes (1596 -- 1650) and Gottfried Wilhelm Leibniz (1646 -- 1716). Descartes
was the first to use algebra to solve knowledge problems, effectively creating
methods to mechanize reasoning. Leibniz, after Descartes, envisioned a universal
algebraic language that would encompass logical principles and methods. Their work
influenced the development of calculus, Boolean algebra, and many other fields.
Once we have collected and stored the data, we need to extract knowledge from them.
Knowledge extraction is the process of obtaining knowledge from data. The reasoning
principle here is inductive reasoning. Inductive reasoning is the process of deriving
generalization rules from specific observations. Inductive reasoning and data analysis
are closely related. Refer to \cref{sub:time-analysis} for a timeline of the development
of data analysis.
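
A minimal computational analogue, sketched below in Python with made-up observations, is
fitting a line to a few measured pairs: from specific observations we induce a general
rule that can then be applied to unseen cases.
\begin{verbatim}
import numpy as np

# Specific observations (made-up values): hours of study and exam score.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 58.0, 66.0, 71.0, 79.0])

# Induce a general rule: a linear model fitted to the observations.
slope, intercept = np.polyfit(hours, scores, deg=1)

# Apply the induced rule to an unseen case.
print(slope * 6.0 + intercept)
\end{verbatim}
Unlike the deductive example above, the induced rule is not guaranteed to hold; it is only
supported by the observations, which is one reason validation is so central in data
science.
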
In data science, we use computational methods to extract knowledge from data. These
computational methods may come from many different fields. In particular, I highlight:
\begin{itemize}
\itemsep0em
\item Statistics: the study of data collection, organization, analysis, interpretation,
and presentation.
\item Machine learning: the study of computational methods that can automatically learn from data.
It is a branch of artificial intelligence.
\item Operations research: the study of computational methods to optimize decisions.
\end{itemize}
Also, many other fields contribute to the development of domain-specific computational
methods to extract knowledge from data. For example, in the field of biology, we have
bioinformatics, which is the study of computational methods to analyze biological data.
Earth sciences have geoinformatics, which is the study of computational methods to
analyze geographical data. And so on.
Each method has its own assumptions and limitations. Thus, we need to understand the
nature of the methods we are using. In particular, we need to understand their
expected input and output. Whenever the available data do not match the
requirements of the method, we may perform data preprocessing\footnote{%
It is important to highlight that it is expected that some of the methods' assumptions
are not fully met. These methods are usually robust enough to extract valuable
knowledge even when data contain imperfections, errors and noise. However, it is still
useful to perform data preprocessing to adjust data as much as possible.%
}.
Data preprocessing mainly includes data cleaning, data transformation, and data
enhancement. Data cleaning is the process of detecting and correcting (or removing)
corrupt or inaccurate pieces of data. Data transformation is the process of converting
data from one format or type to another. Data enhancement is the process of adding
additional information to the data, usually by integrating
data from different sources into a single, unified view.
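
As a rough sketch of these three steps (in Python with the pandas library, using
hypothetical column names and values):
\begin{verbatim}
import pandas as pd

# Hypothetical raw measurements; one temperature reading is corrupt.
raw = pd.DataFrame({"sensor_id": [1, 2, 3],
                    "temp_c": ["21.5", "err", "19.0"]})

# Cleaning: coerce corrupt values to missing and drop them.
raw["temp_c"] = pd.to_numeric(raw["temp_c"], errors="coerce")
clean = raw.dropna(subset=["temp_c"])

# Transformation: derive the measurement in another unit.
clean = clean.assign(temp_f=clean["temp_c"] * 9 / 5 + 32)

# Enhancement: integrate information from another (hypothetical) source.
rooms = pd.DataFrame({"sensor_id": [1, 2, 3],
                      "room": ["lab", "office", "storage"]})
print(clean.merge(rooms, on="sensor_id", how="left"))
\end{verbatim}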
% vim: set spell spelllang=en: