Most wanted pandas functions (Part 01)


Hello again.

Among data scientists who use Python, pandas is one of the most popular libraries for data analysis and manipulation. It provides many functions and methods for data preprocessing and analysis.
Here, I will cover a number of functions that are useful in many projects. For the demonstrations, I will use the well-known Titanic dataset.

First, let's install pandas.
pandas_funcs.ipynb
!pip install pandas

1. Import CSV file.

We can import a CSV file by using the read_csv function. It has many options for advanced usage; please refer to the documentation for details.

pandas_funcs.ipynb
import pandas as pd
df = pd.read_csv("titanic.csv")
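
As a rough sketch of what some read_csv options look like, the example below reads only selected columns and rows. The inline CSV text is made-up sample data standing in for titanic.csv (an assumption, so that the snippet runs on its own); usecols, nrows and dtype are existing read_csv parameters.

```python
import io
import pandas as pd

# Made-up sample rows standing in for titanic.csv (assumption, not the real data)
csv_text = """PassengerId,Survived,Sex,Age,Fare
1,0,male,22,7.25
2,1,female,38,71.2833
3,1,female,26,7.925
4,1,female,35,53.1
"""

# Read only three columns, only the first 3 data rows, and force Fare to float
small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["PassengerId", "Sex", "Fare"],
    nrows=3,
    dtype={"Fare": "float64"},
)
print(small.shape)  # (3, 3)
```

Limiting columns and rows this way can be handy for a quick look at a large file before loading all of it.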

2. View first N rows.

pandas_funcs.ipynb
df.head(n=7)

3. View last N rows.

pandas_funcs.ipynb
df.tail(n=7)

4. View shape of the DataFrame.

pandas_funcs.ipynb
df.shape

# Output >>
(891, 12)

5. Get column names.

pandas_funcs.ipynb
df.columns

# Output >>
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

6. Drop columns.

We can use the drop function to drop columns or rows. Pass it a list of the column or row names to delete.

pandas_funcs.ipynb
df.drop(['Name','SibSp','Pclass'], axis=1, inplace=True)

df.columns
# Output >>
Index(['PassengerId', 'Survived', 'Sex', 'Age', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked'],
      dtype='object')

Note that axis is set to 1 here, which means the function operates on columns. To delete rows, use 0.
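
The axis argument can also be written with the columns and index keywords of drop, which some readers find easier to follow. The snippet below is a minimal sketch on a small made-up frame (the name toy and its values are assumptions, not part of the Titanic data); it also shows that drop returns a new DataFrame when inplace is not set.

```python
import pandas as pd

# Small made-up frame (assumption) so the snippet runs without titanic.csv
toy = pd.DataFrame(
    {"Name": ["A", "B", "C"], "Age": [22, 38, 26], "Fare": [7.25, 71.28, 7.92]}
)

# Without inplace=True, drop returns a new DataFrame and leaves toy unchanged
no_name = toy.drop(columns=["Name"])   # same as toy.drop(["Name"], axis=1)
no_first_row = toy.drop(index=[0])     # same as toy.drop([0], axis=0)

print(list(no_name.columns))      # ['Age', 'Fare']
print(list(no_first_row.index))   # [1, 2]
print(toy.shape)                  # (3, 3)
```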

7. Sampling.

We can easily take a random sample, either by number of rows (n) or by fraction of rows (frac).

pandas_funcs.ipynb
df.shape
# Output >>
(891, 9)

df_sample_1 = df.sample(n=100)
df_sample_1.shape
# Output >>
(100, 9)

df_sample_2 = df.sample(frac=0.75)
df_sample_2.shape
# Output >>
(668, 9)
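
Because sample picks rows at random, the selected rows change on every run. If a repeatable sample is needed, the random_state parameter can be set. A minimal sketch on a made-up frame (toy is an assumption):

```python
import pandas as pd

# Made-up frame with 20 rows (assumption)
toy = pd.DataFrame({"value": range(20)})

# The same random_state selects the same rows on every run
first = toy.sample(n=5, random_state=42)
second = toy.sample(n=5, random_state=42)
print(first.index.equals(second.index))  # True

# frac takes a share of the rows: 0.75 * 20 = 15 rows
three_quarters = toy.sample(frac=0.75, random_state=42)
print(three_quarters.shape)  # (15, 1)
```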

8. Find missing values.

It is important to deal with missing values in a dataset. pandas lets us count the missing values in each column easily:

pandas_funcs.ipynb
df.isna().sum()
# Output >>
PassengerId      0
Survived         0
Sex              0
Age            177
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
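
Raw counts are easier to judge next to the share of rows they represent. As a small sketch, taking the mean of isna() instead of the sum gives the fraction of missing values per column. The frame below is made-up sample data (an assumption), not the Titanic file.

```python
import numpy as np
import pandas as pd

# Made-up frame with some missing values (assumption)
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, np.nan],
        "Cabin": [np.nan, "C85", np.nan, np.nan],
        "Fare": [7.25, 71.28, 7.92, 53.10],
    }
)

# isna() gives True/False per cell; the mean of booleans is the share of True
missing_share = toy.isna().mean()
print(missing_share)
```

Here Age is missing in half of the rows, Cabin in three quarters, and Fare in none.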

9. Fill missing values.

pandas provides a fillna function to fill missing values. It is common to use the mean or the median for this. Note that we can also fill missing values based on the previous or next value (through the method parameter, or the ffill and bfill functions in newer pandas versions). This is useful when we deal with time series data.
For example, 177 values are missing from the Age column. I will use the average age of the other passengers to fill them.

pandas_funcs.ipynb
avg = df['Age'].mean()
df['Age'] = df['Age'].fillna(value=avg)  # assign back; inplace on a selected column is unreliable in newer pandas
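
Filling from the previous or next value can be sketched with the ffill and bfill methods. The series below is a made-up set of daily readings (an assumption, unrelated to the Titanic data), used only to show the behaviour on time series.

```python
import numpy as np
import pandas as pd

# Made-up daily readings with gaps (assumption)
readings = pd.Series(
    [10.0, np.nan, np.nan, 13.0, np.nan],
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)

forward = readings.ffill()   # carry the previous known value forward
backward = readings.bfill()  # pull the next known value backward

print(forward.tolist())   # [10.0, 10.0, 10.0, 13.0, 13.0]
print(backward.tolist())  # [10.0, 13.0, 13.0, 13.0, nan]
```

Note that the last value stays missing after bfill, because there is no later value to copy from.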

10. Drop missing values.

For some projects, we have to delete entire rows that contain missing values. In that case, we can use the dropna function.
In this dataset, 2 values are missing from the “Embarked” column. We can remove the corresponding rows with:

pandas_funcs.ipynb
print(df.shape)
# Output >>
(891, 9)

df.dropna(subset=['Embarked'], axis=0, how='any', inplace=True)
df.shape
# Output >>
(889, 9)

Note that axis is set to 0 here, so rows that contain missing values are dropped. To drop columns that contain missing values instead, use 1.
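
The how parameter used above also accepts 'all', and dropna has a thresh parameter as an alternative. The sketch below compares them on a made-up frame (toy and its values are assumptions).

```python
import numpy as np
import pandas as pd

# Made-up frame: row 0 is complete, row 3 is entirely missing (assumption)
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, np.nan, np.nan],
        "Cabin": ["C85", "E46", np.nan, np.nan],
        "Embarked": ["S", "C", "Q", np.nan],
    }
)

any_missing = toy.dropna(how="any")   # drop rows with at least one missing value
all_missing = toy.dropna(how="all")   # drop rows only if every value is missing
at_least_two = toy.dropna(thresh=2)   # keep rows with at least 2 non-missing values

print(len(any_missing), len(all_missing), len(at_least_two))  # 1 3 2
```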

11. Conditional Selections.

We often need to select rows based on certain conditions. The code below selects the passengers who are male and more than 30 years old.

pandas_funcs.ipynb
male_30_plus = df[(df.Sex == 'male') & (df.Age > 30)]
male_30_plus
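
Conditions can be combined in other ways too. The following is a minimal sketch on made-up data (toy is an assumption) showing the "or" and "not" operators, plus the query function as an alternative way to write the same kind of filter.

```python
import pandas as pd

# Made-up passengers (assumption)
toy = pd.DataFrame(
    {"Sex": ["male", "female", "male", "female"], "Age": [45, 31, 12, 8]}
)

# & is "and", | is "or", ~ is "not"; each condition needs its own parentheses
male_or_child = toy[(toy["Sex"] == "male") | (toy["Age"] < 18)]
not_male = toy[~(toy["Sex"] == "male")]

# query expresses the same kind of filter as a string
male_30_plus_q = toy.query("Sex == 'male' and Age > 30")

print(len(male_or_child), len(not_male), len(male_30_plus_q))  # 3 2 1
```

The parentheses around each condition matter, because & and | bind more tightly than == and > in Python.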

12. Conditional Selections with isin

If we need to check against several values, we can use the isin function. Used as a filter, it returns a dataframe that contains only the rows whose column value is in the given list.

Suppose that we need the passengers whose age is 12 or 15:

pandas_funcs.ipynb
df[df['Age'].isin([12,15])]
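
isin works for text columns as well, and it can be negated with ~ to exclude values. A small sketch on made-up data (toy is an assumption):

```python
import pandas as pd

# Made-up passengers (assumption)
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"], "Age": [12, 15, 40, 15]})

in_ports = toy[toy["Embarked"].isin(["C", "Q"])]   # Embarked is C or Q
not_in_ages = toy[~toy["Age"].isin([12, 15])]      # Age is neither 12 nor 15

print(list(in_ports.index))     # [1, 2]
print(list(not_in_ages.index))  # [2]
```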

13. Insert new column.

We can easily add a new column to a dataframe like this.

pandas_funcs.ipynb
import numpy as np
group = np.random.randint(10, size=len(df))  # one random integer from 0 to 9 per row
df['new_group_name'] = group
df

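If you need to add a column at a specific position, the insert function can be used. Note that insert raises a ValueError if the column name already exists, so the column added above is dropped first.

pandas_funcs.ipynb
df.drop(columns='new_group_name', inplace=True)
df.insert(2, 'new_group_name', group)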
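
To make the difference between plain assignment and insert concrete, here is a minimal sketch on a made-up two-column frame (toy and the column names are assumptions). It also shows what happens when the column name is already taken.

```python
import pandas as pd

# Made-up frame (assumption)
toy = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

toy["C"] = [5, 6]            # plain assignment appends the column at the end
toy.insert(1, "D", [7, 8])   # insert places the column at position 1 (0-based)
print(list(toy.columns))     # ['A', 'D', 'B', 'C']

# Inserting a name that already exists raises ValueError by default
try:
    toy.insert(0, "A", [9, 9])
    insert_failed = False
except ValueError as err:
    insert_failed = True
    print("insert failed:", err)
```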

14. Ranking

We can rank rows by the values in a column. Let's rank the passengers according to the ticket price.

pandas_funcs.ipynb
df['rank'] = df['Fare'].rank(method='first', ascending=False).astype('int')
df
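
The method parameter decides how tied values are ranked. The sketch below compares three of the available methods on a made-up series of fares (an assumption) in which two values are equal.

```python
import pandas as pd

# Made-up fares with a tie between the two 50.0 values (assumption)
fares = pd.Series([50.0, 20.0, 50.0, 10.0])

# ascending=False gives rank 1 to the highest fare
first = fares.rank(method="first", ascending=False)  # ties broken by order of appearance
lowest = fares.rank(method="min", ascending=False)   # tied values share the lowest rank
dense = fares.rank(method="dense", ascending=False)  # like min, but with no gaps after ties

print(first.tolist())   # [1.0, 3.0, 2.0, 4.0]
print(lowest.tolist())  # [1.0, 3.0, 1.0, 4.0]
print(dense.tolist())   # [1.0, 2.0, 1.0, 3.0]
```

With method='first', as used above, every row gets a distinct rank, which is why the result can be converted to integers safely when there are no missing fares.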

15. Groupby

This is one of the most frequently used functions in pandas. It makes it easy to get an overview of the data and of the relationships among the variables.

pandas_funcs.ipynb
df.groupby(['Embarked','Sex'])['Survived'].count()
# Output >>
Embarked  Sex   
C         female     73
          male       95
Q         female     36
          male       41
S         female    203
          male      441
Name: Survived, dtype: int64

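In this example, we get the number of passengers for each combination of boarding port (3 ports) and gender. Note that count() counts the rows in each group, not the survivors; since Survived is 0 or 1, sum() gives the number of survivors instead.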
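
The groupby call above uses count(), which counts the rows in each group. As a hedged sketch of how several statistics can be computed at once, the example below applies agg with named results to a small made-up frame (toy and the result names passengers, survivors and survival_rate are assumptions).

```python
import pandas as pd

# Made-up passengers (assumption), with Survived coded as 0 or 1
toy = pd.DataFrame(
    {
        "Embarked": ["S", "S", "S", "C", "C"],
        "Sex": ["male", "male", "female", "female", "female"],
        "Survived": [0, 1, 1, 1, 0],
    }
)

summary = toy.groupby(["Embarked", "Sex"])["Survived"].agg(
    passengers="count",     # number of rows in the group
    survivors="sum",        # Survived is 0/1, so the sum counts survivors
    survival_rate="mean",   # share of survivors in the group
)
print(summary)
```

Each keyword becomes a column of the result, so the number of passengers and the number of survivors can be read side by side.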

Since this article is getting long, I will continue in Part 2 at a later date.

*This article was written by @nuwan, a member of @qualitia_cdev.