Predictive Data Analytics with Python

In this course, trainees learn how to read, clean, visualize and analyze data effectively using Python and its powerful free libraries Pandas, Seaborn, Scipy, Numpy, Matplotlib, and Statsmodels. They also learn how to interpret the results and use them to make predictions.

Note: This course is only available as part of NCLab’s Data Analyst Career Training program.

Course Overview

The course focuses on using Python libraries to solve practical problems rather than on the underlying mathematics. The tutorials teach enough statistical and analytical concepts and procedures for trainees to use these libraries effectively. This foundation is valuable whether trainees continue to use free Python libraries for analysis and visualization in their own work or move on to a commercial analytics/visualization product specific to their industry. Using Pandas, Numpy, Scipy, Matplotlib, Seaborn, and Statsmodels, trainees read data from files, clean it, present it in visual form, perform qualitative and quantitative analysis, interpret the results, and make predictions.

Prerequisites

This course has Python Fundamentals as a prerequisite.

Student Learning Outcomes (SLO)

  • Import data from CSV files and clean data.
  • Enter data into Pandas DataFrames and manipulate them.
  • Calculate variance and standard deviation.
  • Visualize data using scatterplots.
  • Use Seaborn to display the regression and residual plots.
  • Discuss the residual plot as part of regression analysis.
  • Make predictions based on the results of simple linear regression analysis.
  • Display boxplots with Seaborn and interpret them.
  • Plot histograms with Pandas.
  • Calculate the Pearson coefficient of correlation R (R-value).
  • Visualize correlation graphically via heatmaps.
  • Describe the purpose of using the R-squared value, and its advantages over the R-value.
  • Distinguish between continuous, discrete and categorical random variables.
  • Calculate the P-value and the standard error of the estimate using Scipy and interpret the results.
  • Interpret data using statistical hypothesis testing and the null hypothesis.
  • Use training and testing datasets to make predictions.
  • Perform multiple linear regression using Statsmodels.
  • Make predictions based on multiple linear regression.
  • Evaluate variable independence in multiple linear regression based on multicollinearity.
  • Perform logistic regression with Seaborn and Statsmodels.
  • Recognize when the results of logistic regression are wrong.

Equipment Requirements

Computer, laptop or tablet with Internet access, email, and one of the following browsers:

  • Google Chrome
  • Mozilla Firefox
  • Microsoft Edge
  • Safari

Course Structure and Length

The course is divided into four Units. Each Unit consists of seven instructional/practice levels, a quiz, and a master (proficiency) level. Trainees can return to any level or quiz for review. The course is self-paced, and trainees will practice each skill or concept as they go. Automatic feedback is built into the course for both practices and quizzes.

While learning the skills in Predictive Data Analytics with Python, trainees can practice skills and create portfolio artifacts with NCLab’s Python apps. They can use a project idea from NCLab or create their own. This independent practice develops their fluency and confidence as data analysts and programmers.

The time to complete this course is approximately 80 hours. Since the course is self-paced, the amount of time required to complete the course will vary from person to person. Trainees are responsible for learning both the tutorial content and the skills acquired through practice.

At the end of this course, trainees will complete a Capstone Project under the supervision of an NCLab senior Data Analytics instructor in order to graduate and obtain a career certificate.

Course Syllabus

Unit 1

  • Importing the Pandas, Seaborn and Matplotlib libraries.
  • Understanding that Pandas DataFrames behave much like Python dictionaries whose keys are column names.
  • Entering data into DataFrames using dictionaries.
  • Visualizing pairwise data using scatterplots.
  • Entering data into DataFrames by zipping lists of column values.
  • Entering data into DataFrames by using a list of pairs (or tuples) representing rows.
  • Understanding that adding a new column to a DataFrame uses the same syntax as adding a new item to a dictionary.
  • Knowing the equation of the line and the meaning of the y-intercept and the slope.
  • Using Seaborn to display the regression plot.
  • Using the regression line to make predictions.
  • Understanding confidence intervals.
  • Using Seaborn to display the residual plot.
  • Displaying the regression and residual plots either in the same figure or in separate figures.
  • Adjusting the horizontal limits of the regression and residual plots.
  • Discussing the residual plot as part of every regression analysis.
  • Knowing that for the regression analysis to be acceptable, the residuals must add up to zero.
  • Understanding that the residuals must look like random numbers, without showing any non-random patterns such as lines or curves.
  • Understanding time series, and generating sequences of integers to work with time series data.
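
Several of the Unit 1 topics can be sketched in a few lines. The example below is only an illustration under stated assumptions, not course material: the column names `hours` and `score` and all values are made up, and the line is fitted with `numpy.polyfit` instead of Seaborn's plotting functions so that the sketch runs without a display.

```python
import numpy as np
import pandas as pd

# Made-up sample data (hypothetical, for illustration only).
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

# 1) From a dictionary: keys become column names.
df_dict = pd.DataFrame({"hours": hours, "score": score})

# 2) By zipping lists of column values into rows.
df_zip = pd.DataFrame(list(zip(hours, score)), columns=["hours", "score"])

# 3) From a list of tuples, each representing one row.
rows = [(1, 52), (2, 55), (3, 61), (4, 64), (5, 70)]
df_rows = pd.DataFrame(rows, columns=["hours", "score"])

# All three approaches produce the same DataFrame.
same = df_dict.equals(df_zip) and df_zip.equals(df_rows)

# Adding a column uses the same syntax as adding an item to a dictionary.
df = df_dict.copy()
df["week"] = list(range(1, len(df) + 1))  # integer sequence, e.g. a time index

# Fit the line y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(df["hours"], df["score"], 1)

# Residuals are observed values minus values predicted by the line.
df["residual"] = df["score"] - (slope * df["hours"] + intercept)

# Use the fitted line to predict the score for 6 hours.
predicted = slope * 6 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted:.2f}")
print(f"sum of residuals = {df['residual'].sum():.6f}")
```

For a least-squares line with an intercept, the residuals sum to zero up to rounding error, so the last line mainly checks the arithmetic. Whether the residuals look random still has to be judged from the residual plot.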

Unit 2

  • Knowing what types of data files are supported by Pandas.
  • Reading files from the hard disk or from a URL.
  • Distinguishing between continuous, discrete and categorical random variables.
  • Identifying CSV files, understanding their structure, and reading all or selected columns.
  • Checking for delimiters, quotes and empty spaces when reading CSV files.
  • Specifying which values in the CSV file should be treated as missing values.
  • Obtaining the number of rows and number of columns of a DataFrame.
  • Obtaining the list of all column names of a DataFrame.
  • Displaying the beginning and the end of a DataFrame.
  • Using the Titanic data set to practice analytics.
  • Adding new data columns to a DataFrame.
  • Iterating over columns in a DataFrame.
  • Accessing columns using integer indices.
  • Appending new rows.
  • Accessing rows using integer indices.
  • Understanding the difference between global and local row indices.
  • Deleting selected rows and/or columns from a DataFrame.
  • Identifying missing values in DataFrames using Boolean masks.
  • Summing up row and column values in DataFrames.
  • Counting missing values in the rows and columns of a DataFrame.
  • Dropping rows and/or columns with missing values.
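
The CSV and missing-value topics above might look like this in practice. This is a minimal sketch with assumptions: the data is made up and held in memory with `io.StringIO` instead of being read from a file or URL, and the semicolon delimiter and the `n/a` marker are chosen only to show the relevant `read_csv` parameters.

```python
import io
import pandas as pd

# A small made-up CSV held in memory; a file path or URL could be passed
# to read_csv in the same way.
csv_text = """name;age;city
Ann;34;Reno
Bob;n/a;Elko
Cy;51;
Dee;28;Reno
"""

# Semicolon delimiter; treat "n/a" as a missing value (empty fields are
# treated as missing by default).
df = pd.read_csv(io.StringIO(csv_text), sep=";", na_values=["n/a"])

n_rows, n_cols = df.shape          # number of rows and columns
column_names = list(df.columns)    # list of all column names
print(df.head(2))                  # beginning of the DataFrame
print(df.tail(2))                  # end of the DataFrame

# Access by integer position: the first row and the second column.
first_row = df.iloc[0]
age_column = df.iloc[:, 1]

# Boolean mask of missing values, then counts per column and per row.
mask = df.isna()
missing_per_column = mask.sum()
missing_per_row = mask.sum(axis=1)

# Drop rows that contain any missing value.
clean = df.dropna()

# The surviving rows keep their original index labels (0 and 3),
# while their positions are renumbered from zero.
print(clean.index.tolist())
print(clean.iloc[1]["name"], clean.loc[3, "name"])
```

The label/position distinction in the last lines is one possible reading of the "global and local row indices" item; the course may define those terms differently.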

Unit 3

  • Calculating the minimum, maximum, sum and mean of column values.
  • Calculating variance and standard deviation.
  • Understanding why standard deviation is used more often in practice.
  • Plotting histograms with Pandas.
  • Using the 68–95–99.7 rule in predictive analysis.
  • Filtering rows (selecting rows with a given property) in DataFrames.
  • Calculating the median and mode with Pandas.
  • Understanding how the median differs from the mean.
  • Knowing the five values which define a boxplot.
  • Displaying boxplots with Seaborn and interpreting them.
  • Confirming information read from a boxplot by calling the method describe().
  • Understanding how a pair of random variables can be correlated.
  • Calculating the Pearson coefficient of correlation R (R-value).
  • Using Pandas to quickly see which quantities in a DataFrame are correlated.
  • Visualizing correlation graphically via heatmaps.
  • Annotating heatmaps, setting limits for the values, and changing color maps.
  • Understanding the purpose of using the R-squared value, and its advantages over the R-value.
  • Calculating the R-squared value by squaring all values in the correlation matrix.
  • Searching and replacing in DataFrames using the method replace().
  • Modifying values in entire DataFrames and in individual columns using functions.
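
The descriptive statistics and correlation topics of Unit 3 can be illustrated with Pandas alone. The sketch below is hedged accordingly: the column names and values are made up, and the plots (histograms, boxplots, heatmaps) are left out so that the example runs without a display.

```python
import pandas as pd

# Made-up measurements (hypothetical values for illustration).
df = pd.DataFrame({
    "height_cm": [150, 160, 165, 170, 180, 175],
    "weight_kg": [52, 58, 63, 68, 80, 74],
    "shoe_size": [36, 38, 39, 41, 44, 42],
})

col = df["height_cm"]
print(col.min(), col.max(), col.sum(), col.mean())

# Pandas uses the sample (n - 1) formulas by default.
variance = col.var()
std_dev = col.std()   # square root of the variance, in the units of the data

# Median and mode; mode() returns a Series because ties are possible.
median = col.median()
modes = col.mode()

# describe() reports count, mean, std, min, the quartiles and max,
# which covers the five-number summary (min, Q1, median, Q3, max).
summary = col.describe()

# Filter rows with a Boolean condition.
tall = df[df["height_cm"] > 165]

# Pearson correlation matrix (R-values) and its elementwise square.
r = df.corr()
r_squared = r ** 2

# Search and replace; here a made-up correction of one shoe size.
fixed = df.replace({"shoe_size": {44: 43}})

# Apply a function to one column: convert centimetres to metres.
df["height_m"] = df["height_cm"].apply(lambda cm: cm / 100)
```

Because the standard deviation is the square root of the variance, it is expressed in the same units as the data (here centimetres), which is one common reason it is easier to interpret.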

Unit 4

  • Understanding why it is important to quantify the results of linear regression analysis.
  • Understanding statistical hypothesis testing and the null hypothesis.
  • Understanding the P-value and the standard error of the estimate.
  • Using the function linregress of the Scipy Stats module.
  • Using Matplotlib to plot Scipy Stats results.
  • Using the calculated y-intercept and slope for predictions.
  • Performing simple linear regression with Statsmodels.
  • Understanding the difference between the training and testing datasets.
  • Knowing two “gotchas” that one needs to pay attention to when using Statsmodels.
  • Using simple linear regression results to make predictions.
  • Understanding multiple linear regression, and performing it with Statsmodels.
  • Using multiple linear regression results to make predictions.
  • Plotting the results of multiple linear regression (and understanding the plots).
  • Understanding multicollinearity of independent variables in multiple linear regression.
  • Knowing the main limitations of linear regression models.
  • Understanding that linear regression does not work for categorical dependent variables.
  • Understanding logistic regression, and what types of models it is used for.
  • Recognizing when the results of logistic regression are wrong.
  • Performing logistic regression with Seaborn and Statsmodels.
  • Finally, trainees will use the testing Titanic dataset to make a (simplified) prediction of the survival of Titanic passengers.
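
As a rough illustration of the first part of Unit 4, the sketch below runs `linregress` from the Scipy Stats module on a training set and checks the predictions on a testing set. It rests on assumptions: the data is synthetic, the split (every fourth point held out) is arbitrary, and the 0.05 threshold is only a common convention. Multiple and logistic regression with Statsmodels are not shown here.

```python
import numpy as np
from scipy import stats

# Synthetic data: a noisy line y = 2x + 1 (values are made up).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.size)

# Hold out every fourth point as a testing set; train on the rest.
test_mask = np.arange(x.size) % 4 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

# Simple linear regression on the training set.
result = stats.linregress(x_train, y_train)
print(f"slope={result.slope:.3f}, intercept={result.intercept:.3f}")
print(f"R={result.rvalue:.3f}, R^2={result.rvalue ** 2:.3f}")
# In linregress, stderr is the standard error of the estimated slope.
print(f"p-value={result.pvalue:.3g}, stderr={result.stderr:.3f}")

# A small p-value is evidence against the null hypothesis of zero slope.
reject_null = result.pvalue < 0.05

# Predict on the testing set and measure the error.
y_pred = result.intercept + result.slope * x_test
rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
print(f"test RMSE={rmse:.3f}")
```

Evaluating on points that were not used for fitting gives a more honest picture of how well the line predicts new data than the fit statistics alone.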

Capstone Project

Trainees will complete a Capstone Project under the supervision of an NCLab senior Data Analytics instructor in order to graduate and obtain a career certificate.