High Level Steps

  • Import the Data
  • Clean the Data
  • Split the Data into Training/Test Sets (80% training/20% testing)
  • Create a Model - select an algorithm 
  • Train the Model
  • Make Predictions
  • Evaluate and Improve
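The steps above can be sketched end to end in a few lines. This is a minimal illustration, not a definitive pipeline: the small table is made-up data standing in for an imported CSV, and the column names (age, gender, genre) and the fixed random_state values are assumptions chosen so the run is repeatable.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Import the data (hypothetical toy rows instead of a real CSV file)
df = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 20, 21, 25, 26, 27, 30],
    "gender": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "genre":  ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz", "Jazz",
               "Dance", "Dance", "Dance", "Acoustic", "Acoustic", "Acoustic"],
})

# 2. Clean the data: drop rows with missing values and exact duplicates
df = df.dropna().drop_duplicates()

# 3. Split into input (X) and output (y), then into 80% training / 20% testing
X = df.drop(columns="genre")
y = df["genre"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 4. Create a model (a decision tree is the algorithm assumed here)
model = DecisionTreeClassifier(random_state=0)

# 5. Train the model on the training rows only
model.fit(X_train, y_train)

# 6. Make predictions for the held-out rows
predictions = model.predict(X_test)

# 7. Evaluate: fraction of held-out rows predicted correctly
score = accuracy_score(y_test, predictions)
print(f"accuracy on the held-out rows: {score:.2f}")
```

With only a dozen rows the score is not meaningful. The point is the order of the steps.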

Libraries and Tools

Library        Purpose
NumPy          NumPy offers comprehensive mathematical functions, random number generators, linear algebra routines, Fourier transforms, and more.
pandas         pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.
Matplotlib     Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
scikit-learn   Simple and efficient tools for predictive data analysis. Accessible to everybody, and reusable in various contexts.
Jupyter        The Jupyter Notebook App is a server-client application that allows editing and running notebook documents via a web browser.
Anaconda       Anaconda is a distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment.


Getting Started

Install Anaconda

https://www.anaconda.com/products/individual


Start a Jupyter notebook

$ jupyter notebook


Create a new Python 3 notebook
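A quick first cell can confirm that the libraries used on this page are importable in the new notebook. This is only a sketch: the list of import names is an assumption based on the libraries named above (sklearn is the import name for scikit-learn, and joblib is used later for saving models).

```python
import importlib

# Import names to check; adjust the list to match your own environment
names = ("numpy", "pandas", "matplotlib", "sklearn", "joblib")

versions = {}
for name in names:
    try:
        module = importlib.import_module(name)
        # Most packages expose __version__; fall back to "unknown" if not
        versions[name] = str(getattr(module, "__version__", "unknown"))
    except ImportError:
        versions[name] = "NOT INSTALLED"

for name, version in versions.items():
    print(f"{name:<12} {version}")
```

Anything reported as NOT INSTALLED can usually be added from the Anaconda environment before continuing.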

Import a Dataset

We can get sample datasets from Kaggle: https://www.kaggle.com/


From our Jupyter notebook, we are going to import a downloaded CSV.

import pandas as pd
df = pd.read_csv('vgsales.csv') 
df

The pd.read_csv function returns a DataFrame object.


DataFrame Attributes and Methods

Some useful DataFrame attributes and methods:

Name        Description                                           Example
shape       returns the dimensions of the dataset (rows, columns)  df.shape returns (16598, 11)
describe    returns summary statistics for the numeric columns     df.describe()
values      returns the data as a NumPy array                      df.values
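To see what each of these returns without downloading a file, the sketch below builds a tiny DataFrame by hand. The column names and numbers are made up for illustration and are not taken from vgsales.csv.

```python
import pandas as pd

# Hypothetical three-row table used only to demonstrate the calls
df = pd.DataFrame({
    "rank": [1, 2, 3],
    "sales": [82.5, 40.25, 35.75],
})

# shape is an attribute (no parentheses): a (rows, columns) tuple
print(df.shape)

# describe() is a method: count, mean, std, min, quartiles and max per numeric column
summary = df.describe()
print(summary)

# values is an attribute: the underlying data as a two-dimensional NumPy array
data = df.values
print(data)
```

Note that shape and values are accessed without parentheses, while describe is called like a function.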

Jupyter Shortcuts

Shortcut                           Mode          Key              Description
Add Cell Above                     Command       a
Add Cell Below                     Command       b
Delete Current Cell                Command       dd
Run Current Cell and Stay in Cell  Command/Edit  <CTRL><ENTER>    Run commands in cell without adding a cell below.
Autocompletion                     Edit          <TAB>            Get methods for object
Method Documentation               Edit          <SHIFT> <TAB>    Get information on method
Make Comment                       Edit          <CMD> /          Comment/Uncomment


Real Example

Import the data

import pandas as pd
df = pd.read_csv('music.csv')
df


Split the Data

Create input and output data sets. X = input, y = output.

Since we want to predict the type of music based on age and sex, we create our input data as X and our output as y.

import pandas as pd
df = pd.read_csv('music.csv')
X = df.drop(columns="genre")
y = df["genre"]
y
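The same split can be tried on a hand-built table to check what ends up in X and y. This is a sketch with made-up rows, and it assumes the column names age, gender and genre. It also shows that drop returns a new DataFrame and leaves df unchanged.

```python
import pandas as pd

# Hypothetical rows standing in for music.csv
df = pd.DataFrame({
    "age": [20, 23, 25, 26],
    "gender": [1, 1, 0, 0],
    "genre": ["HipHop", "HipHop", "Dance", "Acoustic"],
})

# X keeps every column except the one we want to predict
X = df.drop(columns="genre")

# y is the single column we want to predict (a pandas Series)
y = df["genre"]

print(list(X.columns))   # input columns
print(y.name)            # output column
print(list(df.columns))  # the original DataFrame still has all columns
```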


Train and Do a Prediction

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('music.csv')
X = df.drop(columns="genre")
y = df["genre"]

model = DecisionTreeClassifier()

# train model
model.fit(X,y)

# predict
# 21 year old male and 22 year old female
predictions = model.predict([[21,1],[22,0]])
predictions

In the above example, we used 100% of the data for training and 0% for testing, so we have no measure of how accurate the model is.
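Scoring a model on the same rows it was trained on says little about how it will handle new data. The sketch below uses made-up rows (the column names are assumptions) and shows that a decision tree can simply memorize its training set when no two rows share the same inputs.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows; every (age, gender) pair appears only once
df = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 20, 21, 25, 26, 27, 30],
    "gender": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "genre":  ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz", "Jazz",
               "Dance", "Dance", "Dance", "Acoustic", "Acoustic", "Acoustic"],
})
X = df.drop(columns="genre")
y = df["genre"]

# Train on 100% of the rows
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Score on the very same rows: an unrestricted tree fits them perfectly,
# which tells us nothing about rows it has not seen
train_score = accuracy_score(y, model.predict(X))
print(f"accuracy on the training rows: {train_score:.2f}")
```

A perfect score here is expected and is not evidence of a good model, which is why some rows should be held back for testing.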


Testing our Model

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('music.csv')
X = df.drop(columns="genre")
y = df["genre"]

#split our data into train and test DataFrames (20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

model = DecisionTreeClassifier()

# train model
model.fit(X_train,y_train)

# run predict using test data
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
score
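Because train_test_split shuffles the rows randomly, the score can change every time the cell is run, especially on a small dataset. One way to see this is to repeat the split with several fixed seeds and compare. The data below is made up, and the use of random_state is an addition that does not appear in the code above.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows standing in for music.csv
df = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 20, 21, 25, 26, 27, 30],
    "gender": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "genre":  ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz", "Jazz",
               "Dance", "Dance", "Dance", "Acoustic", "Acoustic", "Acoustic"],
})
X = df.drop(columns="genre")
y = df["genre"]

scores = []
for seed in range(5):
    # A fixed random_state makes each individual split repeatable
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

# The spread across seeds shows how much a single score depends on the split
print("scores per split:", scores)
print("mean score:", sum(scores) / len(scores))
```

Averaging over several splits gives a steadier estimate than any single run.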


Model Persistence


Saving a Trained Model

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib

df = pd.read_csv('music.csv')
X = df.drop(columns="genre")
y = df["genre"]

#split our data into train and test DataFrames (20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

model = DecisionTreeClassifier()

# train model
model.fit(X_train,y_train)

# evaluation is skipped here; uncomment to score the model on the test data
# predictions = model.predict(X_test)
# score = accuracy_score(y_test, predictions)

#save our model
joblib.dump(model,"music-recomender.joblib")


Predictions from a Saved Model

import joblib

#load our model
model = joblib.load("music-recomender.joblib")

# predict for a 20 year old male and a 21 year old female
predictions = model.predict([[20,1],[21,0]])

predictions
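A save and load round trip can be checked in one cell. The sketch below trains on a few made-up rows, writes the model to a temporary directory (the file name toy-model.joblib is hypothetical) and confirms that the restored model gives the same predictions as the original.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training rows: [age, gender] -> genre
X = [[20, 1], [25, 1], [30, 1], [20, 0], [25, 0], [30, 0]]
y = ["HipHop", "HipHop", "Jazz", "Dance", "Dance", "Acoustic"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Save to a throwaway directory so nothing is left behind afterwards
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy-model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should behave exactly like the one we saved
samples = [[21, 1], [22, 0]]
same = list(model.predict(samples)) == list(restored.predict(samples))
print("restored model matches original:", same)
```

joblib files are pickle-based, so only load model files that come from a source you trust.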


Visualizing Decision Trees

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

df = pd.read_csv('music.csv')
X = df.drop(columns="genre")
y = df["genre"]

model = DecisionTreeClassifier()

# train model
model.fit(X,y)

# export graph of data in dot format
tree.export_graphviz(model,out_file='music_recomender.dot', 
                    feature_names=['age','gender'], 
                    class_names=sorted(y.unique()),
                    label='all',
                    rounded=True,
                    filled=True)


This writes the tree to music_recomender.dot in Graphviz DOT format. To visualize it, open the file in VSCode with a Graphviz (dot) preview extension installed.
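If no DOT viewer is at hand, scikit-learn can also print the tree as indented text with export_text, and export_graphviz returns the DOT source as a string when out_file=None. The sketch below uses made-up training rows. It is an alternative way to inspect a tree, not a replacement for the graph above.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

# Hypothetical training rows: [age, gender] -> genre
X = [[20, 1], [25, 1], [30, 1], [20, 0], [25, 0], [30, 0]]
y = ["HipHop", "HipHop", "Jazz", "Dance", "Dance", "Acoustic"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Plain-text view of the decision rules, one line per node
text = export_text(model, feature_names=["age", "gender"])
print(text)

# DOT source as a string instead of a file on disk
dot_source = export_graphviz(
    model,
    out_file=None,
    feature_names=["age", "gender"],
    class_names=[str(c) for c in model.classes_],
    rounded=True,
    filled=True,
)
print(dot_source.splitlines()[0])
```

With Graphviz installed, a .dot file can also be rendered from the command line, for example with dot -Tpng music_recomender.dot -o music_recomender.png.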


References

Reference                                          URL
Python Machine Learning Tutorial (Data Science)    https://www.youtube.com/watch?v=7eh4d6sabA0