Introduction To Pandas In Python & Hands-On Exercise (Data Analysis Using Pandas)

Data Analysis Using Pandas in Python

In this blog, we are going to cover a brief introduction to Pandas in Python and a small demo to analyze the Titanic dataset using the Pandas library in Python.

A Quick Glance At The Pandas Library

Pandas is a software library written for the Python programming language for data manipulation and analysis. It provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Pandas in Python is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.


Pandas Data Structure

The two primary data structures of pandas:

  • Series (1-dimensional)
  • DataFrame (2-dimensional)

handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.
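For instance, a Series is a one-dimensional labeled array. Below is a minimal sketch (the values and index labels are made up for illustration):

import pandas as pd

# A Series is a 1-dimensional labeled array; label-based access uses the index
fares = pd.Series([7.25, 71.28, 7.92], index=["p1", "p2", "p3"], name="Fare")
print(fares)        # prints the values together with their index labels
print(fares["p2"])  # label-based lookup returns 71.28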


Pandas is suitable for:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Flexible reshaping and pivoting of data sets
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format
  • Time series data
  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational/statistical data sets. The data need not be labeled at all to be placed into a pandas data structure

Pandas Dataframe

The Pandas DataFrame is a structure that contains two-dimensional data and its corresponding labels. DataFrames are widely used in data science, machine learning, scientific computing, and many other data-intensive fields.

It is similar to SQL tables or the spreadsheets that you work with in Excel or Calc. In many cases, DataFrames are faster, easier to use, and more powerful than tables or spreadsheets because they’re an integral part of the Python and NumPy ecosystems.
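For example, a small DataFrame can be built from a plain Python dictionary, where each key becomes a labeled column (the column names and values below are made up for illustration):

import pandas as pd

# Each dictionary key becomes a column label, similar to a column in an SQL table
df_demo = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Fare": [7.25, 71.28, 7.92],
})
print(df_demo)
print(df_demo.dtypes)  # columns can hold different types, here int64 and float64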


Step-By-Step Guide To Data Analysis With Pandas In Python

Let’s look at a demo of how we can use Pandas to analyze data, deal with missing values, change data types, filter, sort, select specific column(s), deal with duplicate values, drop and add rows and columns, count values, and count unique values.

Note: The demo is performed in Jupyter Notebook. If you do not have Jupyter Notebook set up on your system, please check our blog on how to install Anaconda and get started with Jupyter Notebook.

1.) Install the package

Before starting with the demo, you will have to install the package. Run the below command to install the Pandas package.

!pip install pandas


2.) Importing the Pandas package

Step 2: In your Jupyter notebook, run the following command to import the package.

import pandas as pd

Short Trick: The common alias for Pandas is pd, so instead of writing “pandas.” you can write “pd.”; the dot after “pd” is used to call a method from the Pandas library.
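A quick way to confirm that the import worked is to print the installed version using the pd alias:

import pandas as pd

# If the import succeeded, this prints the installed Pandas version (e.g. 1.5.3)
print(pd.__version__)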

3.) Import the dataset with read_csv

Now that the package is successfully installed, we will import the dataset.

Note: For this demo, we are using the Titanic Dataset (available on Kaggle)

Step 3: To read a dataset, we are going to use read_csv.

Here, df stands for DataFrame (a Pandas DataFrame).

df = pd.read_csv("//YOUR FILE Path")
Step 4: To view the contents of the dataset, we are going to use df.head().
Note: If the () is left blank, the first 5 rows of the table will be displayed. If you want a specific number of rows, you can give the command as df.head(20) to print the first 20 rows.
Step 5: To view the last rows of the table, run the following command: df.tail()
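Putting Steps 4 and 5 together, here is a minimal sketch, assuming the Titanic train.csv has already been loaded into df as shown above:

df.head()    # first 5 rows by default
df.head(20)  # first 20 rows
df.tail()    # last 5 rows
df.tail(3)   # last 3 rows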
Step 6: Now, if you want to read only some specific columns from the dataset, you can use the usecols argument to specify the column names you want to work with. Let’s work with just the PassengerId, Survived, and Pclass columns.
df = pd.read_csv("\\Your File Path\\train.csv", usecols= ["PassengerId", "Survived", "Pclass"]) 
df.head()

Step 7: Run the describe() command to get a summary of the numeric values in your dataset.
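A minimal sketch of Step 7, assuming df is the Titanic DataFrame loaded earlier:

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
df.describe()

# Optionally include non-numeric columns such as Name and Sex as well
df.describe(include="all")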


4.) Sort rows based on specific criteria

In this section, we will sort the rows of the DataFrame by the values in a column (numeric, string, etc.).
Step 8: Run the below command to view the 8 lowest ticket prices (Fare column).
df.sort_values("Fare").head(8)


Step 9: To view the highest-paying passengers, run the below command.
df.sort_values("Fare", ascending = False).head(8)


5.) Count the occurrences of variables

Step 10: Use .value_counts() to count the occurrences of each value in a column.
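For example, to count how many passengers fall into each passenger class, and how many distinct classes there are (assuming df is the Titanic DataFrame loaded earlier):

# Number of passengers per class, sorted by count
df["Pclass"].value_counts()

# Number of unique values in the column
df["Pclass"].nunique()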

6.) Data Filtering

Step 11: Run the below command to display those passengers who are female and whose fare is less than 100.
df_fare_mask = df["Fare"] < 100
df_sex_mask = df["Sex"] == "female"
df[df_fare_mask & df_sex_mask]


7.) Null values (NaN)

One of the most common problems in data science is missing values. To detect them, there is a beautiful method which is called .isnull(). With this method, we can get a boolean series (True or False).
Step 12: Run the below command to show the passengers whose cabin is unknown (NaN).
null_mask = df["Cabin"].isnull()
df[null_mask]
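A related sketch: combining .isnull() with .sum() counts the missing values per column, which is often the first check in an analysis, and .notnull() gives the opposite mask (assuming df is the Titanic DataFrame):

# Count missing values in every column (Cabin and Age have the most in this dataset)
df.isnull().sum()

# Keep only the passengers whose cabin IS known
df[df["Cabin"].notnull()]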


Similarly, we can perform many more such operations on Pandas DataFrames to analyze other parameters in the dataset.


Next Task For You…

Data science is a rapidly growing field, and the growth of data in recent years has been staggering.
As data becomes increasingly important in driving business value, the demand for data science professionals is expected to continue to grow in the coming years.

Begin your journey toward becoming a Data Science Expert. Join our FREE CLASS on How to Build a Career in Data Science

