Predictive Data Analytics with Python Syllabus

Section 1

Importing the Pandas, Seaborn and Matplotlib libraries.
Understanding that Pandas DataFrames behave much like Python dictionaries that map column names to columns of values.
Entering data into DataFrames using dictionaries.
Visualizing pairwise data using scatterplots.
Entering data into DataFrames by zipping lists of column values.
Entering data into DataFrames by using a list of pairs (or tuples) representing rows.
Understanding that adding a new column to a DataFrame is the same as adding a new item to a dictionary.
Knowing the equation of the line and the meaning of the y-intercept and the slope.
Using Seaborn to display the regression plot.
Using the regression line to make predictions.
Understanding confidence intervals.
Using Seaborn to display the residual plot.
Displaying the regression and residual plots either in the same figure or in separate figures.
Adjusting the horizontal limits of the regression and residual plots.
Discussing the residual plot as part of every regression analysis.
Knowing that for the regression analysis to be acceptable, the residuals must add up to zero.
Understanding that the residuals must look like random numbers, without showing any non-random patterns such as lines or curves.
Understanding time series, and generating sequences of integers to work with time series data.
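
The DataFrame-construction topics in this section can be sketched roughly as follows. The sketch builds the same small table in three ways and then adds columns with dictionary-style assignment. The column names and values (height, weight, bmi, t) are invented for illustration and are not taken from the course materials.

```python
import pandas as pd

# Hypothetical sample data, made up for this sketch.
heights = [150, 160, 170, 180]
weights = [52, 58, 69, 77]

# 1) From a dictionary that maps column names to lists of values.
df_dict = pd.DataFrame({"height": heights, "weight": weights})

# 2) From zipped lists of column values.
df_zip = pd.DataFrame(list(zip(heights, weights)), columns=["height", "weight"])

# 3) From a list of tuples, each tuple representing one row.
rows = [(150, 52), (160, 58), (170, 69), (180, 77)]
df_rows = pd.DataFrame(rows, columns=["height", "weight"])

# Adding a column uses the same syntax as adding an item to a dictionary.
df_dict["bmi"] = df_dict["weight"] / (df_dict["height"] / 100) ** 2

# A sequence of integers that can serve as a time axis for time series data.
df_dict["t"] = list(range(len(df_dict)))
```

All three constructions produce the same two-column table; which one is most convenient depends on whether the data arrives column by column or row by row.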

Section 2

Knowing what types of data files are supported by Pandas.
Reading files from the hard disk or from a URL.
Distinguishing between continuous, discrete and categorical random variables.
Identifying CSV files, understanding their structure, and reading all or selected columns.
Checking for delimiters, quotes and empty spaces when reading CSV files.
Specifying which values in the CSV file should be treated as missing values.
Obtaining the number of rows and number of columns of a DataFrame.
Obtaining the list of all column names of a DataFrame.
Displaying the beginning and the end of a DataFrame.
Using the Titanic dataset to practice analytics.
Adding new data columns to a DataFrame.
Iterating over columns in a DataFrame.
Accessing columns using integer indices.
Appending new rows.
Accessing rows using integer indices.
Understanding the difference between global and local row indices.
Deleting selected rows and/or columns from a DataFrame.
Identifying missing values in DataFrames using Boolean masks.
Summing up row and column values in DataFrames.
Counting missing values in the rows and columns of a DataFrame.
Dropping rows and/or columns with missing values.
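
A minimal sketch of the CSV and missing-value topics above, assuming a tiny in-memory CSV in place of a real file or URL. The CSV content and the choice of "?" as the missing-value marker are hypothetical; the Titanic data used in the course is not reproduced here.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk or at a URL.
csv_text = """name,age,fare
Alice,29,71.3
Bob,?,8.05
Carol,41,?
"""

# na_values lists the strings that should be treated as missing values.
df = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

n_rows, n_cols = df.shape            # number of rows and number of columns
column_names = list(df.columns)      # list of all column names

mask = df.isna()                     # Boolean mask: True where a value is missing
missing_per_column = mask.sum()      # missing values counted in each column
missing_per_row = mask.sum(axis=1)   # missing values counted in each row

complete_rows = df.dropna()          # drop rows containing any missing value

# Read only selected columns.
names_and_ages = pd.read_csv(
    io.StringIO(csv_text), usecols=["name", "age"], na_values=["?"]
)
```

Summing a Boolean mask works because True counts as 1 and False as 0, which is why the same `sum` call covers both "summing values" and "counting missing values".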

Section 3

Calculating the minimum, maximum, sum and mean of column values.
Calculating variance and standard deviation.
Understanding why standard deviation is used more often in practice.
Plotting histograms with Pandas.
Using the 68–95–99.7 rule in predictive analysis.
Filtering rows (selecting rows with a given property) in DataFrames.
Calculating the median and mode with Pandas.
Understanding how the median differs from the mean.
Knowing the five values which define a boxplot.
Displaying boxplots with Seaborn and interpreting them.
Confirming information read from a boxplot by calling the method describe().
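
The descriptive statistics listed in this section can be sketched as follows. The ages are invented for illustration, and the value 80 is a deliberate outlier chosen to show how the median differs from the mean.

```python
import pandas as pd

# Hypothetical ages, made up for this sketch; 80 is a deliberate outlier.
df = pd.DataFrame({"age": [22, 25, 25, 30, 80]})

mean_age = df["age"].mean()      # pulled upward by the outlier
median_age = df["age"].median()  # middle value, barely affected by the outlier
mode_age = df["age"].mode()[0]   # most frequent value
variance = df["age"].var()       # sample variance, in squared units
std_dev = df["age"].std()        # sample standard deviation, in the data's units

# Filtering: keep only the rows that satisfy a condition.
under_30 = df[df["age"] < 30]

# describe() reports count, mean, std, min, the quartiles and max.
summary = df["age"].describe()
```

Here the mean is 36.4 while the median is 25, so a single extreme value shifts the mean but not the median. The `min`, `25%`, `50%`, `75%` and `max` entries of `summary` are the quantities one would compare against a boxplot of the same column.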

Section 4

Understanding how a pair of random variables can be correlated.
Calculating the Pearson coefficient of correlation R (R-value).
Using Pandas to quickly see which quantities in a DataFrame are correlated.
Visualizing correlation graphically via heatmaps.
Annotating heatmaps, setting limits for the values, and changing color maps.
Understanding the purpose of using the R-squared value, and its advantages over the R-value.
Calculating the R-squared value by squaring all values in the correlation matrix.
Searching and replacing in DataFrames using the method replace().
Modifying values in entire DataFrames and in individual columns using functions.
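
A rough sketch of the correlation and value-modification topics above. The data and the column names (x, y, z, grade, x_doubled) are hypothetical and chosen so that the correlations are easy to check by hand.

```python
import pandas as pd

# Hypothetical data, made up for this sketch.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # exactly 2 * x, so perfectly correlated with x
    "z": [5, 3, 4, 1, 2],    # tends to decrease as x increases
    "grade": ["A", "B", "A", "C", "B"],
})

# Pearson correlation matrix of the numeric columns (R-values).
r = df[["x", "y", "z"]].corr()

# Squaring every entry gives the matrix of R-squared values.
r_squared = r ** 2

# replace() searches for values and substitutes new ones.
df["grade"] = df["grade"].replace({"A": "high", "B": "medium", "C": "low"})

# apply() modifies a column using a function.
df["x_doubled"] = df["x"].apply(lambda v: 2 * v)
```

For this data the R-value between x and z is -0.8, and the corresponding R-squared value is 0.64. A matrix such as `r` is the kind of input that a Seaborn heatmap would visualize.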

Section 5

Why it is important to quantify the results of linear regression analysis.
What statistical hypothesis testing and the null hypothesis are.
What the P-value and the standard error of the estimate are.
How to use the function linregress of the SciPy stats module.
How to use Matplotlib to plot Scipy Stats results.
How to use the calculated y-intercept and slope for predictions.
How to perform simple linear regression with Statsmodels.
About the difference between the training and testing datasets.
About two “gotchas” that one needs to pay attention to when using Statsmodels.
How to use simple linear regression results to make predictions.
What is multiple linear regression, and how to do it with Statsmodels.
How to use multiple linear regression results to make predictions.
How to plot the results of multiple linear regression (and understand the plots).
What is multicollinearity of independent variables in multiple linear regression.
What are the main limitations of linear regression models.
That linear regression does not work for categorical dependent variables.
What is logistic regression, and what types of models it is used for.
How to recognize when the results of logistic regression are wrong.
How to perform logistic regression with Seaborn and Statsmodels.
Finally, trainees will use the Titanic testing dataset to make a (simplified) prediction of the survival of Titanic passengers.
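
As a minimal sketch of simple linear regression with `linregress`, the example below fits a line and uses the resulting slope and y-intercept for a prediction. The data points and the helper name `predict` are hypothetical; the course may well use Statsmodels and the Titanic data for the corresponding exercises.

```python
from scipy import stats

# Hypothetical data lying close to the line y = 2x + 1.
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

result = stats.linregress(x, y)

slope = result.slope             # estimated slope of the regression line
intercept = result.intercept     # estimated y-intercept
r_squared = result.rvalue ** 2   # R-squared from the Pearson R-value
p_value = result.pvalue          # tests the null hypothesis that the slope is zero
std_err = result.stderr          # standard error of the estimated slope


def predict(x_new):
    """Hypothetical helper: evaluate the fitted line at x_new."""
    return intercept + slope * x_new


y_at_6 = predict(6)
```

A small P-value suggests rejecting the null hypothesis of zero slope. Predictions far outside the range of the observed x values should still be treated with caution.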

Capstone Project

Trainees will complete a Capstone Project under the supervision of an NCLab senior Data Analytics instructor in order to graduate and obtain a career certificate.
