USPA-Technology/IntroDataScience
This is a draft of the book Practitioner's Guide to Data Science (previously titled "Introduction to Data Science").

Please note that this work is written under the Contributor Code of Conduct and released under the CC-BY-NC-SA license. By participating in this project (for example, by submitting a pull request with suggestions or edits), you agree to abide by its terms.

Goal of the Book

This book focuses on data science with an emphasis on industrial experience. It covers a cross-disciplinary subject, combining hands-on experience and problem-solving in a business context. Most introductory books on data science discuss modeling techniques and implementation using R or Python but lack the industrial context. This book seeks to fill that gap by exploring the art of data science in practice.

Some key features of this book are as follows:

  • It covers both technical and soft skills.

  • It has a chapter dedicated to the big data cloud environment; in industry, data science is often practiced in such an environment.

  • It is hands-on. We provide the data and reproducible R and Python code in notebooks. Readers can repeat the analyses in the book using the data and code provided. We also encourage readers to modify the notebooks to apply the analyses to their own data and problems whenever possible. The best way to learn data science is to do it!

  • It focuses on the skills needed to solve real-world industrial problems rather than on academic theory.

Notebooks

| Chapter | R | Python |
| --- | --- | --- |
| Ch4: Big Data Cloud Platform | html, rmd | Create Spark Data, pyspark Notebook |
| Ch5: Data Preprocessing | html, rmd | Notebook |
| Ch6: Data Wrangling | html, rmd | Notebook |
| Ch7: Model Tuning Strategy | html, rmd | Notebook |
| Ch8: Measuring Performance | html, rmd | Notebook |
| Ch9: Regression Models | html, rmd | Notebook |
| Ch10: Regularization Methods | html, rmd | Notebook |
| Ch11: Tree-Based Methods | html, rmd | Notebook |
| Ch12: Deep Learning | html (DNN, CNN, RNN), rmd (DNN, CNN, RNN) | DNN, CNN, RNN, Tokenizing and Padding, MNIST with one hidden layer: step by step |
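
For a flavor of the Chapter 12 material, below is a minimal sketch of an MNIST classifier with one hidden layer. It assumes TensorFlow/Keras; the chapter's notebooks may use a different framework, so treat this as illustrative rather than the book's exact code.

```python
# Minimal sketch: MNIST with one hidden layer (assumes TensorFlow/Keras).
import tensorflow as tf

# Load MNIST and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# One hidden layer of 128 ReLU units, then a 10-way softmax output.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```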

How to run R and Python code

Use R code. You should be able to run the R code in your local R console or RStudio for all chapters except Chapter 4. The code in each chapter is self-sufficient: you don't need to run the code from previous chapters first. Within a chapter, however, you do need to run the code from the beginning. Each chapter starts with a code block that installs and loads all required packages. We also provide the .rmd notebooks containing the code to make it easier to reproduce.
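
A Python notebook can be made self-sufficient in the same way. Below is a minimal sketch of such a setup cell; the package list is illustrative, not the book's exact set.

```python
# Illustrative setup cell: install any missing packages, then import them.
# (The package names here are examples, not the book's exact list.)
import importlib
import subprocess
import sys

for pkg, module in [("pandas", "pandas"), ("numpy", "numpy"), ("scikit-learn", "sklearn")]:
    try:
        importlib.import_module(module)
    except ImportError:
        # Install into the current interpreter's environment.
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])

import pandas as pd
import numpy as np
```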

To repeat the code on big data and cloud platforms, you need to use Databricks, a cloud data platform. We use Databricks because:

  • It provides a user-friendly web-based notebook environment that can create a Spark cluster on the fly to run R/Python/Scala/SQL scripts
  • It has a free community edition that is convenient for teaching purposes

Follow the instructions in Section 4.3 to set up and use the Spark environment.

Use Python code. We provide Python notebooks for all chapters on GitHub. As with the R notebooks, you should be able to run all of them on your local machine except for Chapter 4, for the reasons stated above. An easy way to run a notebook is to import it into Google Colab. To use Colab, you only need to be logged in to your Google account in the Chrome browser. To load a notebook into Colab, do either of the following:

  • Click the "Open in Colab" icon at the top of each linked notebook in the Chrome browser. This loads and opens the notebook in Colab.

  • In Colab, choose File -> Upload notebook -> GitHub, paste the notebook's link into the box, search, and select the notebook to load it (Colab also accepts direct links of the pattern shown below).
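
Both options use Colab's GitHub integration, which also accepts direct links of the following form (the branch and notebook path here are placeholders, not exact paths from the repository):

```
https://colab.research.google.com/github/USPA-Technology/IntroDataScience/blob/<branch>/<path-to-notebook>.ipynb
```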

To run the big data code, as with the R notebooks, you need to set up Spark in Databricks. Follow the instructions in Section 4.3 to set up and use the Spark environment. Then run the "Create Spark Data" notebook to create the Spark data frames. After that, you can run the pyspark notebook to learn how to use PySpark.
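
Once the cluster is up, a short PySpark sketch like the one below is a quick sanity check that the environment works. The file path and column names here are hypothetical; the actual data frames are defined in the "Create Spark Data" notebook.

```python
# Minimal PySpark sketch (file path and column names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook, `spark` is predefined; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# Read a CSV into a Spark DataFrame, inferring column types from the data.
df = spark.read.csv("/FileStore/tables/example.csv", header=True, inferSchema=True)

# A typical first look: schema, row count, and a grouped summary.
df.printSchema()
print(df.count())
df.groupBy("segment").agg(F.avg("income").alias("avg_income")).show()
```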

