USPA-Technology/IntroDataScience
This is a draft of the book Practitioner's Guide to Data Science (previously titled "Introduction to Data Science").

Please note that this work is written under the Contributor Code of Conduct and released under the CC-BY-NC-SA license. By participating in this project (for example, by submitting a pull request with suggestions or edits), you agree to abide by its terms.

Goal of the Book

This book focuses on data science with an emphasis on industrial experience. It covers a cross-disciplinary subject, combining hands-on experience and problem-solving in a business context. Most introductory books on data science discuss modeling techniques and implementation using R or Python but lack the industrial context. This book seeks to fill that gap by exploring the art of data science in practice.

Some key features of this book are as follows:

  • It covers both technical and soft skills.

  • It has a chapter dedicated to the big data cloud environment; in industry, data science is often practiced in such an environment.

  • It is hands-on. We provide the data and reproducible R and Python code in notebooks. Readers can repeat the analyses in the book using the data and code provided. We also encourage readers to modify the notebooks to apply the analyses to their own data and problems whenever possible. The best way to learn data science is to do it!

  • It focuses on the skills needed to solve real-world industrial problems rather than on academic theory.

Notebooks

| Chapter | R | Python |
| --- | --- | --- |
| Ch4: Big Data Cloud Platform | html, rmd | Create Spark Data, pyspark Notebook |
| Ch5: Data Preprocessing | html, rmd | Notebook |
| Ch6: Data Wrangling | html, rmd | Notebook |
| Ch7: Model Tuning Strategy | html, rmd | Notebook |
| Ch8: Measuring Performance | html, rmd | Notebook |
| Ch9: Regression Models | html, rmd | Notebook |
| Ch10: Regularization Methods | html, rmd | Notebook |
| Ch11: Tree-Based Methods | html, rmd | Notebook |
| Ch12: Deep Learning | html (DNN, CNN, RNN), rmd (DNN, CNN, RNN) | DNN, CNN, RNN, Tokenizing and Padding, MNIST with one hidden layer: step by step |
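
For a flavor of the Chapter 12 material, below is a minimal sketch of an MNIST classifier with one hidden layer. It assumes TensorFlow/Keras; the chapter's notebooks may use a different framework, so treat this as illustrative rather than the book's exact code.

```python
# Minimal sketch: MNIST with one hidden layer (assumes TensorFlow/Keras).
import tensorflow as tf

# Load MNIST and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# One hidden layer of 128 ReLU units, then a 10-way softmax output.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```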

How to run R and Python code

Use R code. You should be able to run the R code in your local R console or RStudio for all chapters except Chapter 4. The code in each chapter is self-sufficient: you don't need to run the code from previous chapters first. Within a chapter, however, you do need to run the code from the beginning. Each chapter starts with a code block that installs and loads all required packages. We also provide the .rmd notebooks containing the code to make it easier to reproduce.
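
A Python notebook can be made self-sufficient in the same way. Below is a minimal sketch of such a setup cell; the package list is illustrative, not the book's exact set.

```python
# Illustrative setup cell: install any missing packages, then import them.
# (The package names here are examples, not the book's exact list.)
import importlib
import subprocess
import sys

for pkg, module in [("pandas", "pandas"), ("numpy", "numpy"), ("scikit-learn", "sklearn")]:
    try:
        importlib.import_module(module)
    except ImportError:
        # Install into the current interpreter's environment.
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])

import pandas as pd
import numpy as np
```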

To repeat the code on big data and cloud platforms, you need to use Databricks, a cloud data platform. We use Databricks because:

  • It provides a user-friendly web-based notebook environment that can create a Spark cluster on the fly to run R/Python/Scala/SQL scripts
  • It has a free community edition that is convenient for teaching purposes

Follow the instructions in Section 4.3 to set up and use the Spark environment.

Use Python code. We provide Python notebooks for all chapters on GitHub. As with the R notebooks, you should be able to run all of them on your local machine except for Chapter 4, for the reasons stated above. An easy way to run a notebook is to import it into Google Colab. To use Colab, you only need to be logged in to your Google account in the Chrome browser. To load a notebook into Colab, do either of the following:

  • Click the "Open in Colab" icon at the top of each linked notebook in the Chrome browser. This loads and opens the notebook in Colab.

  • In Colab, choose File -> Upload notebook -> GitHub, paste the notebook's link into the box, search, and select the notebook to load it (Colab also accepts direct links of the pattern shown below).
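
Both options use Colab's GitHub integration, which also accepts direct links of the following form (the branch and notebook path here are placeholders, not exact paths from the repository):

```
https://colab.research.google.com/github/USPA-Technology/IntroDataScience/blob/<branch>/<path-to-notebook>.ipynb
```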

To run the big data code, as with the R notebooks, you need to set up Spark in Databricks. Follow the instructions in Section 4.3 to set up and use the Spark environment. Then run the "Create Spark Data" notebook to create the Spark data frames. After that, you can run the pyspark notebook to learn how to use PySpark.
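
Once the cluster is up, a short PySpark sketch like the one below is a quick sanity check that the environment works. The file path and column names here are hypothetical; the actual data frames are defined in the "Create Spark Data" notebook.

```python
# Minimal PySpark sketch (file path and column names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook, `spark` is predefined; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# Read a CSV into a Spark DataFrame, inferring column types from the data.
df = spark.read.csv("/FileStore/tables/example.csv", header=True, inferSchema=True)

# A typical first look: schema, row count, and a grouped summary.
df.printSchema()
print(df.count())
df.groupBy("segment").agg(F.avg("income").alias("avg_income")).show()
```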

