Predictive Data Analytics with Python

In this course, trainees learn how to read, clean, visualize, and analyze data effectively using Python and its powerful free libraries Pandas, Seaborn, Scipy, Numpy, Matplotlib, and Statsmodels. They also learn how to interpret the results and use them to make predictions.

Note: This course is only available as part of NCLab’s Data Analyst Career Training program.

Course Overview

The course focuses on using Python libraries to solve practical problems rather than on the underlying mathematics. The tutorials teach enough statistical and analytical concepts and procedures for trainees to use these libraries effectively. This foundation is valuable whether trainees continue to use free Python libraries for analysis and visualization in their own work or move on to a commercial analytics or visualization product specific to their industry. Using Pandas, Numpy, Scipy, Matplotlib, Seaborn, and Statsmodels, trainees read data from files, clean it, present it in visual form, analyze it qualitatively and quantitatively, interpret the results, and make predictions.

Prerequisites

This course has Introduction to Python for Data Science as a prerequisite.

Equipment Requirements

Internet access

Email

One of the following browsers:

  • Google Chrome
  • Mozilla Firefox
  • Microsoft Edge
  • Safari

Course Structure and Length

The course is divided into four Units. Each Unit consists of seven instructional/practice levels, a quiz, and a master (proficiency) level. Trainees can return to any level or quiz for review. The course is self-paced, and trainees practice each skill or concept as they go. Automatic feedback is built into both the practice levels and the quizzes.

While learning the skills in Predictive Data Analytics with Python, trainees can practice skills and create portfolio artifacts with NCLab’s Python apps. They can use a project idea from NCLab or create their own. This independent practice develops their fluency and confidence as data analysts and programmers.

The time to complete this course is approximately 80 hours. Since the course is self-paced, the amount of time required to complete the course will vary from person to person. Trainees are responsible for learning both the tutorial content and the skills acquired through practice.

At the end of this course, trainees will complete a Capstone Project under the supervision of an NCLab senior Data Analytics instructor in order to graduate and obtain a career certificate.

 

Course Syllabus

Unit 1

  • Import the Pandas, Seaborn and Matplotlib libraries.
  • Understand that Pandas DataFrames behave much like Python dictionaries that map column names to columns of values.
  • Enter data into DataFrames using dictionaries.
  • Visualize pairwise data using scatterplots.
  • Enter data into DataFrames by zipping lists of column values.
  • Enter data into DataFrames by using a list of pairs (or tuples) representing rows.
  • Understand that adding a new column to a DataFrame uses the same syntax as adding a new item to a dictionary.
  • Know the equation of the line and the meaning of the y-intercept and the slope.
  • Use Seaborn to display the regression plot.
  • Use the regression line to make predictions.
  • Understand confidence intervals.
  • Use Seaborn to display the residual plot.
  • Display the regression and residual plots either in the same figure or in separate figures.
  • Adjust the horizontal limits of the regression and residual plots.
  • Discuss the residual plot as part of every regression analysis.
  • Know that for the regression analysis to be acceptable, the residuals must add up to (approximately) zero.
  • Understand that the residuals must look like random numbers, without showing any non-random patterns such as lines or curves.
  • Understand time series, and generate sequences of integers to work with time series data.
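
The Unit 1 skills can be sketched in a few lines of Python. This is only an illustration, not course material: the data and the column names (hours, score, week) are invented, NumPy's polyfit stands in for the line fit, and the helper show_plots is a hypothetical name that is defined but not called, so the rest runs without Seaborn installed.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration: hours studied vs. test score.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 73]

# Build the same DataFrame from a dictionary and from zipped lists.
df = pd.DataFrame({"hours": hours, "score": score})
df_zip = pd.DataFrame(list(zip(hours, score)), columns=["hours", "score"])

# Adding a column uses the same syntax as adding an item to a dictionary.
# Here the new column is an integer sequence, as used for time series.
df["week"] = range(1, len(df) + 1)

# Least-squares line: score = intercept + slope * hours.
slope, intercept = np.polyfit(df["hours"], df["score"], 1)

# Residuals are the observed values minus the values on the line.
residuals = df["score"] - (intercept + slope * df["hours"])

# Use the line to predict the score for 7 hours of study.
predicted = intercept + slope * 7


def show_plots(data):
    """Draw regression and residual plots side by side (needs Seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    sns.regplot(data=data, x="hours", y="score", ax=ax1)
    sns.residplot(data=data, x="hours", y="score", ax=ax2)
    # Adjust the horizontal limits of both plots.
    ax1.set_xlim(0, 8)
    ax2.set_xlim(0, 8)
    plt.show()
```

Calling show_plots(df) would display both plots in one figure; the shaded band that regplot draws around the line is its confidence interval.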

Unit 2

  • Know what types of data files are supported by Pandas.
  • Read files from the hard disk or from a URL.
  • Distinguish between continuous, discrete and categorical random variables.
  • Identify CSV files; understand their structure and how to read all or selected columns.
  • Check for delimiters, quotes and empty spaces when reading CSV files.
  • Specify which values in the CSV file should be treated as missing values.
  • Obtain the number of rows and number of columns of a DataFrame.
  • Obtain the list of all column names of a DataFrame.
  • Display the beginning and the end of a DataFrame.
  • Use the Titanic data set to practice analytics.
  • Add new data columns to a DataFrame.
  • Iterate over columns in a DataFrame.
  • Access columns using integer indices.
  • Append new rows.
  • Access rows using integer indices.
  • Understand the difference between global and local row indices.
  • Delete selected rows and/or columns from a DataFrame.
  • Identify missing values in DataFrames using Boolean masks.
  • Sum row and column values in DataFrames.
  • Elegantly count missing values in the rows and columns of a DataFrame.
  • Drop rows and/or columns with missing values.
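
Several of the Unit 2 operations fit into one short sketch. The file contents below are invented (this is not the Titanic data set), the "?" missing-value marker and the semicolon delimiter are assumptions chosen for illustration, and reading "global and local row indices" as index labels versus positions is an interpretation rather than the course's own definition.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited file contents; "?" marks a missing value.
csv_text = '''name;age;fare;cabin
"Ann";29;7.25;?
"Bob";?;71.28;C85
"Cy";35;8.05;?
'''

# Read from a file-like object; a path or URL string works the same way.
df = pd.read_csv(io.StringIO(csv_text), sep=";", quotechar='"', na_values=["?"])

# Append a row by concatenating a one-row DataFrame.
new_row = pd.DataFrame([{"name": "Di", "age": 41, "fare": 13.0, "cabin": "B20"}])
df = pd.concat([df, new_row], ignore_index=True)

# Size, column names, and the beginning and end of the DataFrame.
n_rows, n_cols = df.shape
column_names = list(df.columns)
first_rows, last_rows = df.head(2), df.tail(2)

# Integer (positional) access to a row and to a column.
first_row = df.iloc[0]
second_column = df.iloc[:, 1]

# After filtering, index labels (loc) and positions (iloc) differ:
# label 1 and position 0 refer to the same row of the subset.
subset = df[df["fare"] > 8]
same_passenger = subset.loc[1, "name"] == subset.iloc[0]["name"]

# Boolean mask of missing values, then counts per column and per row.
missing = df.isna()
missing_per_column = missing.sum()
missing_per_row = missing.sum(axis=1)

# Delete a chosen column, or drop rows/columns that contain missing values.
without_cabin = df.drop(columns=["cabin"])
complete_rows = df.dropna()
complete_cols = df.dropna(axis=1)
```

Summing a Boolean mask works because True counts as 1, which is why isna followed by sum gives the number of missing values.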

Unit 3

  • Calculate the minimum, maximum, sum and mean of column values.
  • Calculate variance and standard deviation.
  • Understand why standard deviation is used more often in practice.
  • Plot histograms with Pandas.
  • Use the 68–95–99.7 rule in predictive analysis.
  • Filter rows (that is, select rows with a given property) in DataFrames.
  • Calculate the median and mode with Pandas.
  • Understand how the median differs from the mean.
  • Know the five values which define a boxplot.
  • Display boxplots with Seaborn and interpret them.
  • Confirm information read from a boxplot by calling the method describe.
  • Understand how a pair of random variables can be correlated.
  • Calculate the Pearson coefficient of correlation R (R-value).
  • Use Pandas to quickly see which quantities in a DataFrame are correlated.
  • Visualize correlation graphically via heatmaps.
  • Annotate heatmaps, set limits for the values, and change color maps.
  • Understand the purpose of using the R-squared value, and its advantages over the R-value.
  • Calculate the R-squared value by squaring all values in the correlation matrix.
  • Search and replace in DataFrames using the method replace.
  • Modify values in entire DataFrames and in individual columns using functions.
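
As a rough sketch of the Unit 3 topics, the following uses a small invented table (the columns height, weight, and sex are assumptions, not course data). The helper show_heatmap is a hypothetical name that is defined but not called, so the statistics run without Seaborn installed.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration.
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 77, 94],
    "sex": ["f", "m", "f", "m", "m"],
})

# Basic statistics of one column (Pandas uses the sample variance by default).
mean_height = df["height"].mean()
var_height = df["height"].var()
std_height = df["height"].std()
median_height = df["height"].median()

# 68-95-99.7 rule: for roughly normal data, about 95% of values are
# expected within two standard deviations of the mean.
low, high = mean_height - 2 * std_height, mean_height + 2 * std_height

# Filter rows: keep only those with a given property.
tall = df[df["height"] > 165]

# describe() reports min, 25%, 50%, 75%, and max, the values a boxplot shows.
summary = df["weight"].describe()

# Search and replace, the mode, and a function applied to a column.
df["sex"] = df["sex"].replace({"f": "female", "m": "male"})
most_common = df["sex"].mode()[0]
df["height_m"] = df["height"].apply(lambda cm: cm / 100)

# Pearson correlation matrix (R-values) and its elementwise square (R-squared).
r = df[["height", "weight"]].corr()
r2 = r ** 2


def show_heatmap(matrix):
    """Draw an annotated heatmap of a correlation matrix (needs Seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(matrix, annot=True, vmin=0, vmax=1, cmap="viridis")
    plt.show()
```

Calling show_heatmap(r2) would display the R-squared values with annotations, fixed limits, and a chosen color map.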

Unit 4

  • Understand why it is important to quantify the results of linear regression analysis.
  • Understand statistical hypothesis testing and the null hypothesis.
  • Understand the P-value and the standard error of the estimate.
  • Use the function linregress of the Scipy Stats module.
  • Use Matplotlib to plot Scipy Stats results.
  • Use the calculated y-intercept and slope for predictions.
  • Perform simple linear regression with Statsmodels.
  • Understand the difference between the training and testing datasets.
  • Watch for two “gotchas” when using Statsmodels.
  • Use simple linear regression results to make predictions.
  • Understand multiple linear regression and perform it with Statsmodels.
  • Use multiple linear regression results to make predictions.
  • Plot the results of multiple linear regression and interpret the plots.
  • Understand multicollinearity of independent variables in multiple linear regression.
  • Know the main limitations of linear regression models.
  • Understand that linear regression does not work for categorical dependent variables.
  • Understand logistic regression and the types of models it is used for.
  • Recognize when the results of logistic regression are wrong.
  • Perform logistic regression with Seaborn and Statsmodels.
  • Finally, use the testing Titanic dataset to make a (simplified) prediction of the survival of Titanic passengers.