Beginner's Guide to Pandas Basics

Quick tutorial to get started with pandas

Madhura Ganguly
5 min read · Jul 21, 2019

This is the first article in a series on learning practical applications of the pandas package without getting overwhelmed by too many functions. It is meant for those who are new to pandas and want to learn how to perform the most common data analysis tasks quickly. Hence, this series covers the details needed to launch your pandas learning without going over everything that's possible.

The Jupyter notebook can be accessed from GitHub and is also available at the end of this tutorial. I would encourage you to download the notebook and follow along by running the script for maximum retention.

This tutorial covers:

  1. Creating dataframes from numpy arrays
  2. Creating dataframes by reading files (.csv, .txt, …)
  3. Creating dataframes from scratch using dictionaries
  4. Adding a new column to a dataframe
  5. Recoding an existing column in a dataframe
  6. Dropping a column from a dataframe

Import Packages

First, we import the pandas library and the datasets module from the sklearn package to get the data set for this tutorial.

import pandas as pd
from sklearn import datasets

Load Data Set

Load the iris data set from the datasets module that we just imported. Iris is a famous data set in the world of statistics and has been used in numerous tutorials.

If you run print(iris) you will see that it returns a dictionary-like object. The key "data" maps to a 150 x 4 numpy array of measurements, "target" to a list of integer class labels, and "feature_names" to a list of column names. We will use these components (along with "target_names" a little later) to build the pandas dataframe used in the rest of this tutorial. A quick way to verify this structure is shown after the code below.

# import some data to play with
iris = datasets.load_iris()
#print(iris)
print("Iris data shape :", iris['data'].shape)
print('\n')
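
If you want to verify this structure, a quick optional check is to list the keys and peek at the pieces we will use. This is just an exploratory sketch; the exact set of keys depends on your scikit-learn version, but 'data', 'target', 'feature_names' and 'target_names' are present in the versions commonly used.

# Optional: inspect the loaded object
print(list(iris.keys()))        # should include 'data', 'target', 'feature_names', 'target_names'
print(iris['feature_names'])    # names of the four measurements
print(iris['target_names'])     # the three species names
print(iris['target'][:5])       # first few integer class labels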

Create DataFrame

DataFrames are among the most useful data structures in Python. They are 2-dimensional tabular data structures with column names, row labels (the index), and data. We use the DataFrame() constructor to convert the numpy array into a pandas dataframe. This is the most direct way to create a dataframe, though pandas offers other constructors as well; a small sketch of one alternative follows the output below. We can check the columns created with the .columns attribute.

# Create data frame
df=pd.DataFrame(iris.data,columns=iris.feature_names)
print(df.columns)
Column names
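
As a side note, pd.DataFrame() is not the only constructor; pandas also provides class methods such as pd.DataFrame.from_records() and pd.DataFrame.from_dict(). Here is a minimal sketch of from_records() on the first three rows of the iris array (the variable names rows and df_alt are just for this example and are not used later):

# Alternative constructor: from_records() on a list of row tuples
rows = [tuple(row) for row in iris.data[:3]]
df_alt = pd.DataFrame.from_records(rows, columns=iris.feature_names)
print(df_alt)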

Add new column

Next we see how a new column, "target", can be added to an existing dataframe. A new column can be created simply by assigning a list or array to a new column name. This is the most common of several ways to do it; a couple of alternatives are sketched after the output below.

# Let's add another column to the data set
df['target']=iris['target']
print("Checking top records: ")
print(df.head())
print('\n')
Column target is added to dataframe
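
For completeness, two other common ways to add a column are .assign(), which returns a new dataframe instead of modifying the existing one, and .insert(), which lets you pick the column position. A small sketch; the column names target_copy and target_front are made up for this example and are not used later:

# .assign() returns a new dataframe with the extra column
df_assigned = df.assign(target_copy=iris['target'])
# .insert() adds a column in place at a chosen position (here position 0)
df_assigned.insert(0, 'target_front', iris['target'])
print(df_assigned.head())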

Re-code column

Next we recode the numeric "target" column into species names, first with the .map() method and then with .replace(). Both accept a dictionary that maps old values to new ones; the difference between the two is illustrated after the code below.

# Create species mapping dictionary
species_mapping={0:iris['target_names'][0],1:iris['target_names'][1],2:iris['target_names'][2]}
print('Checking species mapping: ')
print(species_mapping)
print('\n')
# Use map to recode column
df['Species'] = df['target'].map(species_mapping)
print("Checking if Species is created : ")
print(df.head())
print('\n')
# Use .replace to recode column
df['Species_2'] = df['target'].replace(species_mapping)
print("Checking if Species_2 is created : ")
print(df.head())
print('\n')
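
The practical difference between the two methods shows up when a value is missing from the mapping dictionary: .map() turns it into NaN, while .replace() leaves it unchanged. A tiny illustration on a throwaway series (demo and partial_mapping exist only for this example):

# Value 2 is missing from this partial mapping
demo = pd.Series([0, 1, 2])
partial_mapping = {0: 'setosa', 1: 'versicolor'}
print(demo.map(partial_mapping))       # 2 becomes NaN
print(demo.replace(partial_mapping))   # 2 stays 2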

Drop columns

Columns can be dropped with the .drop() method, where axis=1 specifies that we want to drop a column (a newer, equivalent spelling is shown after the output below). Checking the column names again shows that "Species_2" has been dropped. We use the .tolist() method to convert the column names from an Index to a plain list; this is not necessary, just nicer to look at.

# Drop Species_2
df.drop(['Species_2'],axis=1,inplace=True)
print("Checking if Species_2 is dropped : ")
print(df.columns.tolist())
print('\n')
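
In recent pandas versions the same drop can be written with the columns keyword, which reads a little more clearly than axis=1. A minimal equivalent sketch, run on a copy so the tutorial dataframe is left untouched:

# Equivalent drop using the columns= keyword
df_copy = df.copy()
df_copy = df_copy.drop(columns=['Species'])
print(df_copy.columns.tolist())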

Read Data Set

Here we have loaded the iris data set from the datasets module of the sklearn package so that anyone can run the script. But in an actual work scenario you would need to read the data set either from a server or from a local folder. Keeping that in mind, let's see how it would work if you had the same file stored in a local folder. To do this we will first write out the dataframe created above and then read it back, so the example is reproducible for everyone.

Notice that when we write out the dataframe with .to_csv() we specify index=False; if this is not done, an unnamed column containing the row index will be added to the written file. You can try writing the file without this option to see the difference.

The most common way to read files with pandas is the read_csv() function. This function has many parameters that control how the data is read. For example, by default the first row of the file is treated as the header and used to create the column names; if the file does not have a header row, specify header=None, and unless you also pass column names via the names parameter the columns will get default integer names (a small sketch of this follows the code below). Take a look at the pandas documentation to learn more about the available options when reading in data.

# Write out file
df.to_csv('Iris.csv',index=False)
# Read back file
iris_df = pd.read_csv('Iris.csv')
print(iris_df.head())
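
To see the header-related options from the paragraph above in action, here is a small sketch. The file iris_no_header.csv and the column names passed via names are made up for this example:

# Write a version of the file without a header row (for illustration only)
df.to_csv('iris_no_header.csv', index=False, header=False)
# header=None tells pandas the first row is data; names supplies the column names
no_header_df = pd.read_csv('iris_no_header.csv', header=None,
                           names=['sepal_length', 'sepal_width', 'petal_length',
                                  'petal_width', 'target', 'species'])
print(no_header_df.head())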

Create DataFrame from scratch

Though not needed that often, it is useful to know how to create a pandas dataframe from scratch. We can create a dataframe from a dictionary of lists, where the dictionary keys become the column names and the lists become the column values. There are also options to specify data types and index values (a small sketch follows the code below). Take a look at the pandas documentation to learn more.

# create dictionary
dat_dict = {'Name':['El','Mike','Dustin','Lucas','Max','Will'],'Gender':['Girl','Boy','Boy','Boy','Girl','Boy']}
# convert to dataframe
dat_df = pd.DataFrame(dat_dict)
# check output
print(dat_df.head())
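
As mentioned above, the constructor also accepts index and dtype arguments. Here is a small sketch reusing the same dictionary with an explicit row index; the labels 'a' through 'f' are arbitrary, and dtype works the same way when the values are numeric:

# Same dictionary, but with custom row labels instead of the default 0-5
dat_df_indexed = pd.DataFrame(dat_dict, index=['a', 'b', 'c', 'd', 'e', 'f'])
print(dat_df_indexed)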

Inspecting a DataFrame

Now that we have learnt how to create a pandas dataframe, whether from an existing loaded data set, by reading an external file, or from scratch, let's inspect some properties of a dataframe. The type() function reports the class of dat_df as a pandas DataFrame and that of the column "Name" as a pandas Series. We have not introduced Series so far: a Series is another pandas data structure, a one-dimensional labelled array. Each column of a dataframe is a Series, as you can see by running print(type(dat_df['Name'])). You can create a Series from a numpy array in the same way as a dataframe from a dictionary, just by replacing the DataFrame() constructor with Series(). Finally, we can insert a Series as a new column into a dataframe; a note on index alignment follows the code below.

# Check class type
print('Dataframe class :', type(dat_df))
print('Dataframe column class :', type(dat_df['Name']))
# import numpy library
import numpy as np
# create numpy array
Sibling = np.array([None,'Nancy',None,'Erika','Billy','Jonathan'])
# convert to pandas series
Sibling = pd.Series(Sibling)
print('Series :',Sibling)
# insert series as new column into dataframe
dat_df['Sibling']=Sibling
print("New dataframe :")
print(dat_df.head())
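
One thing to be aware of when inserting a Series: the assignment aligns on the index, not on position. A small sketch on a copy of the dataframe; the column Friend and the values in misaligned are made up for this example:

# A series whose index (10, 11) does not match the dataframe's index (0-5)
demo_df = dat_df.copy()
misaligned = pd.Series(['Holly', 'Erica'], index=[10, 11])
demo_df['Friend'] = misaligned
print(demo_df[['Name', 'Friend']])   # Friend is all NaN: no index labels matched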

Here is the Jupyter notebook.

That's all for this tutorial. To summarize, we have covered how to read and write data; create a pandas dataframe from a .csv file, a numpy array, and a dictionary; add a new column to a dataframe using lists and series; recode an existing column; and drop a column from a dataframe. The next tutorial will cover more details of data manipulation using pandas. Feedback is welcome :).
