  • Import the Data
  • Clean the Data
  • Split the Data into Training/Test Sets (80% training/20% testing)
  • Create a Model - select an algorithm 
  • Train the Model
  • Make Predictions
  • Evaluate and Improve
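The "Clean the Data" step is not demonstrated in the examples on this page, so here is a minimal sketch of what it can look like with pandas. The small inline table and its column names are made up for illustration; real cleaning depends on the dataset.

```python
import pandas as pd

# Hypothetical raw data: one duplicated row and one row with a missing age
raw = pd.DataFrame({
    "age": [20, 20, 25, None, 31],
    "gender": [1, 1, 0, 1, 0],
    "genre": ["HipHop", "HipHop", "Dance", "Jazz", "Classical"],
})

# Remove exact duplicate rows, then rows that contain any missing value
clean = raw.drop_duplicates().dropna().reset_index(drop=True)

print(raw.shape)    # dimensions before cleaning
print(clean.shape)  # dimensions after cleaning
```

Whether dropping rows is appropriate, rather than filling in missing values, is a judgment call that depends on how much data is available.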

Libraries and Tools

NumPy: Multi-dimensional arrays. NumPy offers comprehensive mathematical functions, random number generators, linear algebra routines, Fourier transforms, and more.
Pandas: Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.
Matplotlib: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
scikit-learn: Simple and efficient tools for predictive data analysis. Accessible to everybody, and reusable in various contexts.
Jupyter: The Jupyter Notebook App is a server-client application that allows editing and running notebook documents via a web browser.
Anaconda: Anaconda is a distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment.

Getting Started

Install Anaconda

Start a Jupyter notebook

Code Block
$ jupyter notebook

Create a new Python3 notebook


Import a Dataset

We can get some sample datasets from -

From our Jupyter notebook, we are going to import a downloaded CSV.

Code Block
import pandas as pd
df = pd.read_csv('vgsales.csv')

The function returns a DataFrame object
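To try `read_csv` without downloading anything, the sketch below reads CSV text from memory instead of a file. The CSV content and column names are invented for illustration and are not taken from `vgsales.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a downloaded file
csv_text = "name,year,sales\nGame A,2006,82.7\nGame B,1985,40.2\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))   # the DataFrame class
print(df.head())  # first rows of the table
```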


DataFrame Functions:

Image Added

Interesting DataFrame functions:

shape       returns dimensions of dataset               (16598, 11)
describe()  returns useful statistics about our data    (see above image)
values      returns your data
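A short sketch of these three members on a tiny DataFrame. The data is made up for illustration. Note that `shape` and `values` are attributes, while `describe()` is a method call.

```python
import pandas as pd

# Hypothetical data, just large enough to have meaningful statistics
df = pd.DataFrame({"age": [20, 25, 30, 35], "gender": [1, 0, 1, 0]})

print(df.shape)       # attribute: (rows, columns)
print(df.describe())  # method: count, mean, std, min, quartiles, max per numeric column
print(df.values)      # attribute: the underlying data as a NumPy array
```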

Jupyter Shortcuts

Action                             Mode          Keys             Notes
Add Cell Above                     Command       a
Add Cell Below                     Command       b
Delete Current Cell                Command       dd
Run Current Cell and Stay in Cell  Command/Edit  <CTRL> <ENTER>   Run commands in the cell without adding a cell below
Autocompletion                     Edit          <TAB>            Get methods for an object
Method Documentation               Edit          <SHIFT> <TAB>    Get information on a method
Make Comment                       Edit          <CMD> /          Comment/uncomment

Real Example

Import the data

Code Block
import pandas as pd
df = pd.read_csv('music.csv')

Split the Data

Create input and output data sets. X = input, y = output.

Since we want to predict the type of music based on age and sex, we create our input data as X and our output as y.

Code Block
import pandas as pd
df = pd.read_csv('music.csv')
X = df.drop(columns="genre")
y = df["genre"]
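The same split can be tried on a few inline rows to see what each variable holds. The rows and the column names `age` and `gender` below are assumptions about how `music.csv` is laid out; only the `genre` column name comes from the code above.

```python
import pandas as pd

# Hypothetical rows in the layout the tutorial describes: two inputs, one label
df = pd.DataFrame({
    "age": [20, 23, 26, 31],
    "gender": [1, 1, 0, 0],
    "genre": ["HipHop", "HipHop", "Dance", "Acoustic"],
})

X = df.drop(columns="genre")  # returns a new DataFrame; df itself is unchanged
y = df["genre"]               # a Series holding the labels

print(list(X.columns))   # input columns only
print(list(df.columns))  # original still has all three columns
print(y.tolist())        # labels in row order
```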

Train and Do a Prediction

Code Block
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('music.csv')
X = df.drop(columns="genre")
y = df["genre"]

model = DecisionTreeClassifier()

# train model
model.fit(X, y)

# predict
# 21 year old male and 22 year old female
predictions = model.predict([[21,1],[22,0]])

In the above example, we used 100% of the data for training and 0% for testing our model.
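Why this matters: a decision tree that is scored on the same rows it was trained on can simply memorize them, so the score says little about new data. The sketch below illustrates this with made-up rows (the values and column names are assumptions, not the contents of `music.csv`).

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data; every row has a distinct (age, gender) pair
X = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37],
    "gender": [1, 1, 1, 0, 0, 0, 1, 1, 0],
})
y = pd.Series(["HipHop", "HipHop", "HipHop", "Dance", "Dance",
               "Dance", "Jazz", "Jazz", "Classical"])

model = DecisionTreeClassifier()
model.fit(X, y)

# Score the model on the rows it has already seen
train_score = accuracy_score(y, model.predict(X))
print(train_score)
```

With default settings the tree keeps splitting until every training row is classified correctly, so this score is perfect regardless of how well the model would do on unseen people.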

Testing our Model

Code Block
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('music.csv')
X = df.drop(columns="genre")
y = df["genre"]

#split our data into train and test DataFrames (20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

model = DecisionTreeClassifier()

# train model
model.fit(X_train, y_train)

# run predict using test data
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
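Because `train_test_split` shuffles the rows randomly, the accuracy score can change from run to run. One way to get repeatable results while experimenting is the `random_state` parameter. This is a sketch on made-up data; the seed value 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, so a 20% test split holds 2 rows
X = pd.DataFrame({"age": list(range(20, 30)), "gender": [0, 1] * 5})
y = pd.Series(["A", "B"] * 5)

# The same seed produces the same shuffle, and therefore the same split
split_1 = train_test_split(X, y, test_size=0.2, random_state=42)
split_2 = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_test, y_train, y_test = split_1
print(len(X_train), len(X_test))
print(X_test.equals(split_2[1]))  # identical test rows in both splits
```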


Model Persistence

Saving a Trained Model

Code Block
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib

df = pd.read_csv('music.csv')
X = df.drop(columns="genre")
y = df["genre"]

#split our data into train and test DataFrames (20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

model = DecisionTreeClassifier()

# train model
model.fit(X_train, y_train)

# run predict using test data
# predictions = model.predict(X_test)
# score = accuracy_score(y_test, predictions)

# save our model
joblib.dump(model, "music-recomender.joblib")

Predictions from a Saved Model

Code Block
import joblib

#load our model
model = joblib.load("music-recomender.joblib")

# predict for new inputs:
# 20 year old male and 21 year old female
predictions = model.predict([[20,1],[21,0]])
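To check that saving and loading round-trips correctly, the sketch below trains a small model, writes it to a temporary file, loads it back, and compares predictions. The training rows and the file name `model.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [age, gender] -> genre
X = [[20, 1], [25, 1], [27, 0], [31, 0]]
y = ["HipHop", "HipHop", "Dance", "Acoustic"]

model = DecisionTreeClassifier().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)      # write the fitted model to disk
    restored = joblib.load(path)  # read it back as a new object

sample = [[21, 1], [30, 0]]
same = list(model.predict(sample)) == list(restored.predict(sample))
print(same)  # the restored model should predict exactly like the original
```

A saved model should generally be loaded with the same scikit-learn version that created it, since the file format is not guaranteed to be compatible across versions.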


Visualizing Decision Trees

Code Block
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

df = pd.read_csv('music.csv')
X = df.drop(columns="genre")
y = df["genre"]

model = DecisionTreeClassifier()

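# train model
model.fit(X, y)

# export the trained tree in DOT format
# (the export call is missing from the source; the output filename is an assumption)
tree.export_graphviz(model, out_file="music-recommender.dot",
                     feature_names=list(X.columns),
                     class_names=sorted(y.unique()),
                     label="all", rounded=True, filled=True)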

This writes our .dot file. To visualize it, open the file in VS Code with a DOT (Graphviz) extension installed.
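If no DOT viewer is at hand, scikit-learn can also render a trained tree as plain text with `export_text`. This is an alternative to the .dot export, not the approach used above, and the training rows and feature names below are made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [age, gender] -> genre
X = [[20, 1], [25, 1], [27, 0], [31, 0]]
y = ["HipHop", "HipHop", "Dance", "Dance"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Render the decision rules as indented text, one line per node
rules = export_text(model, feature_names=["age", "gender"])
print(rules)
```

Each indented line is a split condition, and each leaf line names the predicted class.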



Python Machine Learning Tutorial (Data Science)
