Pandas Fundamentals



The Japanese version of this article is here ⇒ Pandas入門

Introduction

The purpose of this post is to explain the fundamentals of using the Pandas library for Python.

Overview

Pandas is a Python library that is used to manipulate tabular data (similar to the data in a spreadsheet or database).

It is often used for data analysis and for reading and manipulating datasets used by Machine Learning libraries.

Installation

The latest version can be installed using pip:

pip install pandas

Concepts

To begin with, let's understand the concepts and terminology used by the library.

Series

A Series is a one-dimensional labeled array that can hold any type of data.
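As a minimal sketch of this idea, the snippet below builds a Series and reads values from it by label and by position. The values and the labels 'a', 'b', 'c' are made up for illustration.

```python
import pandas as pd

# A Series of three integers with illustrative string labels.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Look up a value by its label.
by_label = s['b']

# Look up a value by its position using iloc.
by_position = s.iloc[0]

print(by_label, by_position)
```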

DataFrame

A DataFrame is a two-dimensional table-like data structure with multiple rows and columns.

You can think of it as a collection of Series that share the same index, similar in concept to a spreadsheet or database table.

In this post, we will focus most of our attention on how to use DataFrames.

Creating a DataFrame

Often, you will create a DataFrame by loading existing data from a file or database. For example, to load a CSV (comma-separated values) file:

import pandas as pd 
df = pd.read_csv('example.csv')
>>> df
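If you do not have a CSV file at hand, the following sketch may help. It assumes some made-up CSV text and passes it to read_csv through io.StringIO, which works because read_csv accepts file-like objects as well as paths.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would live in a file such as 'example.csv'.
csv_text = "A,B,C\n1,X,0.1\n2,Y,0.2\n3,Z,0.3\n"

# StringIO stands in for the file on disk.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names taken from the header line
```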

You can also create DataFrames from Python dictionary objects, where each key in the dictionary corresponds to a column name.

import pandas as pd 

data = {'A': [1, 2, 3],
        'B': ['X', 'Y', 'Z'],
        'C': [0.1, 0.2, 0.3]}

df = pd.DataFrame(data)

>>> df
   A  B    C
0  1  X  0.1
1  2  Y  0.2
2  3  Z  0.3

Accessing Columns

Now, let's look at how we access data within a DataFrame.

The examples in this section assume a DataFrame, df, such as the one created above.

There are two main ways to access all data in a single column:
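As a sketch, assuming df is the three-column DataFrame built from the dictionary earlier in this post, the two forms are bracket notation and attribute notation:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['X', 'Y', 'Z'],
                   'C': [0.1, 0.2, 0.3]})

# 1. Bracket notation: works for any column name.
col_bracket = df['A']

# 2. Attribute notation: works only when the name is a valid Python
#    identifier and does not clash with a DataFrame attribute or method.
col_attr = df.A

# Both forms return the same Series.
print(col_bracket.equals(col_attr))
```

Bracket notation is the safer default, since it also works for column names that contain spaces or match method names such as count.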

You can also select data in multiple columns:
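A minimal sketch, again assuming the three-column DataFrame from earlier in this post: passing a list of column names inside the brackets returns a new DataFrame containing only those columns.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['X', 'Y', 'Z'],
                   'C': [0.1, 0.2, 0.3]})

# Note the double brackets: the inner pair is a Python list of column names.
subset = df[['A', 'C']]

print(list(subset.columns))
print(type(subset).__name__)  # the result is a DataFrame, not a Series
```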

Index

Next, let's understand what the 'index' is and how we can use it.

The index is used to select rows or columns in a DataFrame or Series.

There are three ways to select data using the index:

loc gets rows (or columns) with particular labels from the index.
iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
ix usually tries to behave like loc but falls back to behaving like iloc if a label is not present in the index. It was deprecated in pandas 0.20.0 and removed in pandas 1.0.0, so use loc or iloc instead.

Let's have a look at some examples, starting with loc:
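The following sketch assumes the three-column DataFrame from earlier in this post, whose default index labels are 0, 1 and 2. It shows a few common loc selections; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['X', 'Y', 'Z'],
                   'C': [0.1, 0.2, 0.3]})

# A single row by label, returned as a Series.
row = df.loc[0]

# A label slice: unlike normal Python slicing, both ends are included.
rows = df.loc[0:1]

# A single cell by row label and column label.
cell = df.loc[1, 'B']

# All rows, but only the named columns.
cols = df.loc[:, ['A', 'B']]

print(len(rows), cell, list(cols.columns))
```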

Now let's select the same data using iloc:
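As a sketch under the same assumption (the three-column DataFrame from earlier in this post), the selections below use integer positions rather than labels. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['X', 'Y', 'Z'],
                   'C': [0.1, 0.2, 0.3]})

# The row at position 0, returned as a Series.
row = df.iloc[0]

# A position slice: the end is excluded, so this covers positions 0 and 1.
rows = df.iloc[0:2]

# A single cell by row position and column position.
cell = df.iloc[1, 1]

# All rows, but only the columns at positions 0 and 1.
cols = df.iloc[:, [0, 1]]

print(len(rows), cell, list(cols.columns))
```

Note the difference in slicing: a label slice with loc includes its end point, while a position slice with iloc excludes it.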

Often numeric labels are used in the index so it’s possible that iloc and loc return the same results when selecting rows.

However, if the order of the index has changed, as seen in the example below, the results may not be the same.

Given the following DataFrame, df:

data = {'Col1': ['A', 'B', 'C', 'D'],
        'Col2': [1, 2, 3, 4]}

df = pd.DataFrame(data)

>>> df
  Col1  Col2
0    A     1
1    B     2
2    C     3
3    D     4

After sorting df by Col2 in descending order, the positions of the rows change, but each row keeps its original index label:

df.sort_values('Col2', ascending=False)
  Col1  Col2
3    D     4
2    C     3
1    B     2
0    A     1
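To make the difference concrete, the sketch below runs loc and iloc against the sorted result. The variable name sorted_df is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'B', 'C', 'D'],
                   'Col2': [1, 2, 3, 4]})

sorted_df = df.sort_values('Col2', ascending=False)

# loc looks up the label 0, which still belongs to the 'A' row.
by_label = sorted_df.loc[0, 'Col1']

# iloc looks up the first position, which is now the 'D' row.
by_position = sorted_df.iloc[0, 0]

print(by_label, by_position)
```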

Conditions

You can apply one or more conditions to select a subset of the DataFrame. As you can see below there are a few ways of doing this:
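As a sketch, here are three equivalent ways to select the rows where Col2 is greater than 2. The DataFrame matches the Col1/Col2 example used in this post, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'B', 'C', 'D'],
                   'Col2': [1, 2, 3, 4]})

# 1. Build a boolean mask and pass it inside the brackets.
mask = df['Col2'] > 2
by_mask = df[mask]

# 2. Pass the condition to loc.
by_loc = df.loc[df['Col2'] > 2]

# 3. Write the condition as a string for query().
by_query = df.query('Col2 > 2')

print(by_mask['Col1'].tolist())
```

The loc form is the one to prefer when you also want to pick columns or assign values, because it accepts a column selector as a second argument.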

Note that if you are using multiple conditions, you need to enclose each condition in parentheses.

For example, do this:

df[(df.Col1 == 'A') | (df.Col2 == 2)]
  Col1  Col2
0    A     1
1    B     2

Instead of this, where | binds more tightly than == and is therefore evaluated first:

df[df.Col1 == 'A' | df.Col2 == 2]

TypeError: cannot compare a dtyped [int64] array with a scalar of type [bool]

Arithmetic Operations and Assignment

You can modify the contents of a DataFrame by performing assignment.

You can also perform arithmetic operations on the DataFrame, as we can see in the examples below.

Note that arithmetic operations return a new object and leave the original DataFrame unchanged unless you assign the result back.

Given the following DataFrame, df:

data = {'Col1': ['A', 'B', 'C', 'D'],
        'Col2': [1, 2, 3, 4],
        'Col3': [1, 2, 3, 4]}

df = pd.DataFrame(data)
>>> df

  Col1  Col2  Col3
0    A     1     1
1    B     2     2
2    C     3     3
3    D     4     4

First, let's multiply the value of a single cell by 2.

df.loc[0, 'Col2'] * 2
2

Next, let's subtract 1 from all values in Col3.

df['Col3'] - 1
0    0
1    1
2    2
3    3
Name: Col3, dtype: int64

Next, let's add 1 to all values in Col3 and assign the result back to the column.

df['Col3'] += 1

>>> df

  Col1  Col2  Col3
0    A     1     2
1    B     2     3
2    C     3     4
3    D     4     5

Finally, let's use a condition to set Col2 to 0 in every row where Col3 is greater than 3.

df.loc[df.Col3 > 3, 'Col2'] = 0

>>> df

  Col1  Col2  Col3
0    A     1     2
1    B     2     3
2    C     0     4
3    D     0     5

Common Functions

In addition to arithmetic operations, Pandas also provides a large number of functions to simplify your DataFrame processing.

Here are some of the most common operations.

Operation	Purpose	Example
df.dropna()	Removes rows with NaN values	df = df.dropna()
df.fillna()	Fills NaN values with the specified value	df = df.fillna(1)
df.rename()	Renames one or more columns	df.rename(columns={'old': 'new'})
df.sort_values()	Sorts the data by one or more columns	df.sort_values(by=['col1'])
df.describe()	Provides a summary of the data in the column (mean, max, min, etc.)	df.col1.describe()
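The sketch below runs each of these functions once on a small made-up DataFrame. The column names 'old' and 'col1' follow the table, while the values and result variable names are assumptions for illustration.

```python
import pandas as pd

# Illustrative data with one missing value in the 'old' column.
df = pd.DataFrame({'old': [3.0, float('nan'), 1.0],
                   'col1': [30, 10, 20]})

dropped = df.dropna()                         # rows containing NaN are removed
filled = df.fillna(1)                         # NaN is replaced with 1
renamed = df.rename(columns={'old': 'new'})   # 'old' becomes 'new'
ordered = df.sort_values(by=['col1'])         # rows sorted by col1, ascending
summary = df.col1.describe()                  # count, mean, std, min, quartiles, max

print(len(dropped))
print(list(renamed.columns))
print(ordered.index.tolist())
print(summary['mean'])
```

Each call returns a new object, so df itself still contains the NaN value and the original column names afterwards.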