Pandas Fundamentals
A Japanese version of this article is available here ⇒ Pandas入門
Introduction
The purpose of this post is to explain the fundamentals of using the Pandas library for Python.
Overview
Pandas is a Python library that is used to manipulate tabular data (similar to the data in a spreadsheet or database).
It is often used for data analysis and for reading and manipulating datasets used by Machine Learning libraries.
Installation
The latest version can be installed using pip:
pip install pandas
Concepts
To begin with, let's understand the concepts and terminology used by the library.
Series
A Series is a one-dimensional labeled array that is able to hold any type of data.
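As a minimal sketch of the definition above, a Series can be built from a list and given explicit labels. The values and the labels 'a', 'b', 'c' are illustrative assumptions, not taken from the original post.

```python
import pandas as pd

# A Series built from a list; the index labels are illustrative.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Values can be looked up by their label.
first = s['a']

print(s)
print(first)
```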
DataFrame
A DataFrame is a two-dimensional table-like data structure with multiple rows and columns.
- You can consider it as a collection of Series.
- Similar in concept to a spreadsheet or database table.
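To illustrate the idea that a DataFrame is a collection of Series, the sketch below pulls one column out of a small DataFrame and checks its type. The column names and values are assumptions chosen for illustration.

```python
import pandas as pd

# A small two-column DataFrame (illustrative data).
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['X', 'Y', 'Z']})

# Selecting a single column yields a Series named after that column.
col = df['A']

print(type(col).__name__)
print(col.name)
```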
In this post, we will focus most of our attention on how to use DataFrames.
Creating a DataFrame
Often, you will create a DataFrame by loading existing data from a file or database. For example, to load a CSV (comma-separated values) file:
import pandas as pd
df = pd.read_csv('example.csv')
>>> df
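If you do not have a CSV file to hand, the following sketch reads the same kind of data from an in-memory string instead. The CSV contents are an assumption made up for this example; `pd.read_csv` accepts a file-like object as well as a path.

```python
import io

import pandas as pd

# Illustrative CSV text standing in for the contents of a file on disk.
csv_text = "A,B,C\n1,X,0.1\n2,Y,0.2\n3,Z,0.3\n"

# read_csv accepts any file-like object, so StringIO works in place of a path.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```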
You can also create DataFrames from Python dictionary objects, where each key in the dictionary corresponds to a column name.
import pandas as pd
data = {'A': [1, 2, 3],
'B': ['X', 'Y', 'Z'],
'C': [0.1, 0.2, 0.3]}
df = pd.DataFrame(data)
>>> df
A B C
0 1 X 0.1
1 2 Y 0.2
2 3 Z 0.3
Accessing Columns
Now, let's look at how we access data within a DataFrame.
Given the following DataFrame, df:
There are two main ways to access all data in a single column:
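The two approaches are bracket notation and attribute notation. The sketch below assumes a DataFrame with columns A, B and C like the one built from a dictionary earlier in this post.

```python
import pandas as pd

# Assumed example data, mirroring the dictionary-based DataFrame above.
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['X', 'Y', 'Z'],
                   'C': [0.1, 0.2, 0.3]})

# Bracket notation works for any column name.
by_key = df['A']

# Attribute notation only works when the name is a valid Python
# identifier and does not clash with an existing DataFrame attribute.
by_attr = df.A

print(by_key)
```

Both expressions return the same Series, so the choice is mostly a matter of style and of how the column is named.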
You can also select data in multiple columns:
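A minimal sketch of multi-column selection, again assuming a DataFrame with columns A, B and C: passing a list of column names returns a new DataFrame containing only those columns.

```python
import pandas as pd

# Assumed example data with three columns.
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['X', 'Y', 'Z'],
                   'C': [0.1, 0.2, 0.3]})

# Note the double brackets: the inner list names the columns to keep.
subset = df[['A', 'C']]

print(subset)
```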
Index
Next, let's understand what the 'index' is and how we can use it.
The index is used to select rows or columns in a DataFrame or Series.
There are three main ways to access the index:
- loc gets rows (or columns) with particular labels from the index.
- iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
- ix usually tries to behave like loc but falls back to behaving like iloc if a label is not present in the index. (Deprecated and removed in pandas 1.0; use loc or iloc instead.)
Let's have a look at some examples, starting with loc:
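The following is a sketch of label-based selection, assuming a small DataFrame with columns Col1 and Col2 and the default integer index labels 0 to 3.

```python
import pandas as pd

# Assumed example data with the default index labels 0, 1, 2, 3.
df = pd.DataFrame({'Col1': ['A', 'B', 'C', 'D'],
                   'Col2': [1, 2, 3, 4]})

# The row whose index label is 0, returned as a Series.
row = df.loc[0]

# A single cell: row label 1, column label 'Col2'.
cell = df.loc[1, 'Col2']

# Label slices include both endpoints, so this returns labels 1 and 2.
rows = df.loc[1:2, ['Col1']]

print(row)
print(cell)
print(rows)
```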
Now let's select the same data using iloc:
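The following is a sketch of position-based selection, assuming a small DataFrame with columns Col1 and Col2. It picks out the first row, one cell, and a two-row slice purely by integer position.

```python
import pandas as pd

# Assumed example data.
df = pd.DataFrame({'Col1': ['A', 'B', 'C', 'D'],
                   'Col2': [1, 2, 3, 4]})

# The row at position 0, returned as a Series.
row = df.iloc[0]

# A single cell: second row, second column.
cell = df.iloc[1, 1]

# Position slices exclude the end, so 1:3 returns positions 1 and 2.
rows = df.iloc[1:3, [0]]

print(row)
print(cell)
print(rows)
```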
Often numeric labels are used in the index, so it's possible that iloc and loc return the same results when selecting rows.
However, if the order of the index has changed, as seen in the example below, the results may not be the same.
Given the following DataFrame, df:
data = {'Col1': ['A', 'B', 'C', 'D'],
'Col2': [1, 2, 3, 4]}
df = pd.DataFrame(data)
>>> df
Col1 Col2
0 A 1
1 B 2
2 C 3
3 D 4
After sorting the DataFrame df by Col2 in descending order, the position of the data will change:
df.sort_values('Col2', ascending=False)
Col1 Col2
3 D 4
2 C 3
1 B 2
0 A 1
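To make the difference concrete, the sketch below sorts the DataFrame and then asks for "0" with both accessors. The variable names sorted_df, by_label and by_position are illustrative choices, not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'B', 'C', 'D'],
                   'Col2': [1, 2, 3, 4]})

# Sorting reorders the rows but keeps their original index labels.
sorted_df = df.sort_values('Col2', ascending=False)

# loc looks up the row whose label is 0, which is now the last row.
by_label = sorted_df.loc[0]

# iloc looks up the row at position 0, which is now the row labeled 3.
by_position = sorted_df.iloc[0]

print(by_label['Col1'])
print(by_position['Col1'])
```

Here loc returns the row containing A, while iloc returns the row containing D.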
Conditions
You can apply one or more conditions to select a subset of the DataFrame. As you can see below, there are a few ways of doing this:
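A sketch of three common ways to filter rows, assuming a DataFrame with columns Col1 and Col2. The first two build a boolean mask and pass it to the DataFrame; the third uses `DataFrame.query` with the condition written as a string.

```python
import pandas as pd

# Assumed example data.
df = pd.DataFrame({'Col1': ['A', 'B', 'C', 'D'],
                   'Col2': [1, 2, 3, 4]})

# Boolean mask built with attribute access to the column.
attr_mask = df[df.Col2 > 2]

# Boolean mask built with bracket access to the column.
bracket_mask = df[df['Col1'] == 'A']

# The same kind of condition expressed as a query string.
queried = df.query('Col2 > 2')

print(attr_mask)
print(bracket_mask)
print(queried)
```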
Note that if you are using multiple conditions, you need to enclose each condition in parentheses, because the | and & operators bind more tightly than comparison operators such as ==.
For example, do this:
df[(df.Col1 == 'A') | (df.Col2 == 2)]
Col1 Col2
0 A 1
1 B 2
Instead of this:
df[df.Col1 == 'A' | df.Col2 == 2]
TypeError: cannot compare a dtyped [int64] array with a scalar of type [bool]
Arithmetic Operations and Assignment
You can modify the contents of a DataFrame by performing assignment.
You can also perform arithmetic operations on the DataFrame, as we can see in the examples below.
Note that arithmetic operations return a new object and leave the original DataFrame unchanged unless you assign the result back.
Given the following DataFrame, df:
data = {'Col1': ['A', 'B', 'C', 'D'],
'Col2': [1, 2, 3, 4],
'Col3': [1, 2, 3, 4]}
df = pd.DataFrame(data)
>>> df
Col1 Col2 Col3
0 A 1 1
1 B 2 2
2 C 3 3
3 D 4 4
First, let's multiply the value of a single cell by 2.
df.loc[0, 'Col2'] * 2
2
Next, let's subtract 1 from all values in Col3.
df['Col3'] - 1
0 0
1 1
2 2
3 3
Name: Col3, dtype: int64
Next, let's add 1 to all values in Col3 and assign the result to the original DataFrame.
df['Col3'] += 1
>>> df
Col1 Col2 Col3
0 A 1 2
1 B 2 3
2 C 3 4
3 D 4 5
Finally, let's use a condition to assign a value of 0 to Col2 if the value in Col3 is greater than 3.
df.loc[df.Col3 > 3, 'Col2'] = 0
>>> df
Col1 Col2 Col3
0 A 1 2
1 B 2 3
2 C 0 4
3 D 0 5
Common Functions
In addition to arithmetic operations, Pandas also provides a large number of functions to simplify your DataFrame processing.
Here are some of the most common operations.
Operation | Purpose | Example |
---|---|---|
df.dropna() | Removes rows with NaN values. | df = df.dropna() |
df.fillna() | Fills NaN values with the specified value. | df = df.fillna(1) |
df.rename() | Renames one or more columns. | df.rename(columns={'old': 'new'}) |
df.sort_values() | Sorts the data by one or more columns. | df.sort_values(by=['col1']) |
df.describe() | Provides a summary of the data in the column (mean, max, min, etc.). | df.col1.describe() |
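The functions in the table can be chained together in practice. The sketch below runs each of them on a small DataFrame containing missing values. The data and the intermediate variable names are assumptions for illustration, with the column names 'old' and 'col1' borrowed from the table's examples.

```python
import pandas as pd

nan = float('nan')

# Assumed example data with one missing value in each column.
df = pd.DataFrame({'old': [3.0, nan, 1.0],
                   'col1': [2.0, 5.0, nan]})

# Keep only the rows that contain no NaN values.
dropped = df.dropna()

# Alternatively, replace every NaN with 1.
filled = df.fillna(1)

# Rename the column 'old' to 'new'; a new DataFrame is returned.
renamed = filled.rename(columns={'old': 'new'})

# Sort the rows by col1 in ascending order.
ordered = renamed.sort_values(by=['col1'])

# Summary statistics (count, mean, std, min, quartiles, max) for col1.
summary = ordered.col1.describe()

print(dropped)
print(ordered)
print(summary)
```

Each call returns a new object, so the original df still contains its NaN values at the end.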
Author And Source
For more information on this topic (Pandas Fundamentals), see https://qiita.com/ps_adszew/items/f7cd1caf1d373b65ac78. Author attribution: the original author's information is included in the original URL. Copyright belongs to the original author.