Most wanted pandas functions (Part 02)


Here, I'm going to extend the last article I have written some time ago.
You guys can read it in here Most wanted pandas functions (Part 01)

1. Applying a function.

When dealing with the data, we often have to use functions. There are few ways to use a function over a DataFrame. We should avoid iterating approaches since it is slow going through the panda's rows.
We can use .apply() in this case.

pandas_funcs.ipynb
# Add $ to values of the Fare column.
def add_dolor_mark(money_str):
    money_str = '$ '+str(money_str)
    return money_str

df['Fare'] = df['Fare'].apply(add_dolor_mark)
df.head() 

2. Renaming columns.

Sometimes it is necessary to rename columns of our DataFrame.
We can pass dictionary {‘column_1’ : ’new_column_1’, ‘column_2’ : ’new_column_2’} to .rename()
Let's try to rename Embarked into Port and Fare into Price.

pandas_funcs.ipynb
df.rename(columns = {'Embarked':'Port', 'Fare':'Price'}, inplace=True)
df.head()

3. Adding prefixes or suffixes to column names.

In this case we can use add_prefix() and add_suffix()

pandas_funcs.ipynb
df = df.add_prefix('X_')
df.head()

4. Conditional replacing.

“Where” can be used to replace the values in rows or columns based on a condition. The way where is works a bit different. It is crucial to identify the way “where” works. "Where" select values based on the condition, then the remaining values are replaced with the specified value.
So "where" is behaving like a masking operation.

pandas_funcs.ipynb
# we need to replace survied people with True
df['X_Survived'] = df['X_Survived'].where(df['X_Survived'] == 0 , True)
df.head()

5. Melting

This is a bit infamous function. But very useful. We can use melt to convert wide DataFrame to narrow ones.
Let's assume we have measured the temperature of 5 cities and for a week. And the data is looking like this.

(we can create a sample DataFrame as follows)

pandas_funcs.ipynb
import pandas as pd

temp = {'City': ['Ofuna','Tokyo','Yokohama','Okayama', 'Tsukuba'],
        'Day_1': [22,23,21,20,19],
        'Day_2': [23,22,24,29,18],
        'Day_3': [25,24,23,22,23],
        'Day_4': [21,20,25,21,20],
        'Day_5': [20,19,25,22,21],
        }

df = pd.DataFrame(temp, columns = ['City', 'Day_1', 'Day_2', 'Day_3', 'Day_4', 'Day_5’])
pandas_funcs.ipynb
df = df.melt(id_vars=['City'])
df

6. Save as csv

In some situations, we have to save processed DataFrame. In that case, we can use the to_csv function.

pandas_funcs.ipynb
df.to_csv('myfile.csv')

7. Query

Among the several filtering methods query is another filtering method that we can quickly and easily access to the subsections of the DataFrame.

pandas_funcs.ipynb
#letes assume that we want to get list of age of 10 and 15
seek_age = [10,15]
df.query("X_Age in @seek_age")

8. Memory usage

How much memory is used by a pandas DataFrame is really beneficial when dealing with large DataFrames. So we can avoid trouble with dead kernel due to out of memory

pandas_funcs.ipynb
df.memory_usage()

For total Memory usage we can use,

pandas_funcs.ipynb
df.memory_usage().sum()

9. Profile report

It generates reports from a pandas DataFrame. The pandas df.describe() function is handy but a little basic for good exploratory data analysis. pandas_profiling extends the capability for quick data analysis.

pandas_funcs.ipynb
from pandas_profiling import ProfileReport
profile = ProfileReport(df, title='Pandas Profiling Report', style={'full_width':True}, correlations={'kendall': False})
profile

10. Explode

Let's think that our data set holds multiple entries in a single row. But for our analysis, we need to analyze them on separate rows. In that case, we can use the explode function.
(First lets create example DataFrame)

pandas_funcs.ipynb
df_ = pd.DataFrame({'id':['a','b','c','d'],
                  'humidity %':[62,55,[29,39,81],77],
                  'day':[1,2,3,4]})
df_

pandas_funcs.ipynb
df_.explode('humidity %')

11. Crosstab

It is used to group 2 or more variables and make a summary table. It is used to get insight of columns. Please note that crosstab() works with categorical data.

For example, let's think that we need to get how many people survived/not-survived according to their ports.

pandas_funcs.ipynb
#First lets replace True with 1 in X_Survived column.
df['X_Survived'] = df['X_Survived'].where(df['X_Survived'] == 0 , 1)

port_survived = pd.crosstab(df["X_Port"],df["X_Survived"])
port_survived

There are more many useful functions for analyzing data. I will introduce few more functions in the next article.

Hope this will help you....

*本記事は @qualitia_cdevの中の一人、@nuwanさんが書いてくれました。
*This article is written by @nuwan a member of @qualitia_cdev.