TIL: Python Pandas Basics & Installation


청렴결백한 만능 재주꾼 2020. 4. 26. 18:06

What is pandas?

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way towards this goal.

Rich relational tool, built on top of NumPy
Highly efficient; often described as performing much better than R
Easy to use, foundation for data analysis in Python

Main Features

Here are just a few of the things that pandas does well:

Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
Intuitive merging and joining data sets
Flexible reshaping and pivoting of data sets
Hierarchical labeling of axes (possible to have multiple labels per tick)
Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format
Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging.
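To make a few of the features above concrete, here is a minimal sketch of automatic label alignment, NaN for missing data, and a split-apply-combine group by. The sample values and the names s1, s2, sales, and per_team are made up for illustration.

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up sample data).
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Automatic alignment: values are matched by label, not by position.
# Labels that exist on only one side ("a" and "d") come out as NaN.
total = s1 + s2
print(total)

# Split-apply-combine: split rows by "team", sum "amount" within each group.
sales = pd.DataFrame({
    "team": ["x", "y", "x", "y"],
    "amount": [100, 200, 300, 400],
})
per_team = sales.groupby("team")["amount"].sum()
print(per_team)
```

Note that the result of s1 + s2 is a float Series, because NaN is a floating-point value.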

Installation

Terminal: pip3 install pandas --user

Without --user at the end, the install failed with a permission error.
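A quick way to check that the install worked, assuming the same Python interpreter that pip3 installed into, is to import the package and print its version:

```python
import pandas as pd

# If this import succeeds, pandas is installed for this interpreter.
# The version string printed depends on what pip installed.
print(pd.__version__)
```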

Pandas Data type

Data Type	Dim (dimensions)	Description	Analogy
Series	1	1D labeled array	Python list, NumPy 1D array
DataFrame	2	General 2D labeled tabular structure	Python list of lists (or tuples), NumPy 2D array
Panel	3	General 3D labeled array (deprecated in pandas 0.20 and removed in 0.25)	NumPy 3D array
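The difference between the 1D Series and the 2D DataFrame can be checked directly. This is a small sketch with made-up values; the variable names are only for illustration.

```python
import pandas as pd

# A Series is a 1D labeled array.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is a 2D labeled table; each column is itself a Series.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])

print(s.ndim, s.shape)    # 1 (3,)
print(df.ndim, df.shape)  # 2 (3, 2)

col = df["x"]             # selecting one column gives back a Series
print(type(col).__name__)  # Series
```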

Pandas dtype

Pandas Type	Native Python Type	Description
object	string	The most general dtype. Will be assigned to your column if the column has mixed types (numbers and strings).
int64	int	Integers. 64 refers to the number of bits used to store each value.
float64	float	Numbers with decimals. If a column contains numbers and NaNs, pandas will default to float64, because NaN is a floating-point value.
datetime64[ns], timedelta64[ns]	N/A (but see the datetime module in Python's standard library)	Values meant to hold time data. Look into these for time series experiments.
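The dtype rules in the table can be observed with df.dtypes. A minimal sketch with made-up columns; the exact time unit shown for the datetime column (for example [ns]) may vary between pandas versions, so it is not listed here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ints": [1, 2, 3],                # whole numbers -> int64
    "floats": [1.5, 2.5, 3.5],        # decimals -> float64
    "with_nan": [1, np.nan, 3],       # numbers plus NaN -> float64
    "mixed": [1, "two", 3.0],         # mixed numbers and strings -> object
    "when": pd.to_datetime(["2020-04-24", "2020-04-26", "2020-04-28"]),
})

# One dtype per column.
print(df.dtypes)
```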

DataFrame Create

Create Empty DataFrame

>>>import pandas as pd

>>>df = pd.DataFrame()

>>>print(df)

Create From list: 1D

>>>df = pd.DataFrame([1,2,3,4,5])

>>>print(df)

>>>df.shape

(5, 1)

Create From ndarray:2D

>>>import numpy as np

>>> data = np.arange(12).reshape(2,6)

>>>df = pd.DataFrame(data)
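When no labels are given, pandas numbers the rows and columns starting from 0. A small sketch using the same 2x6 array:

```python
import numpy as np
import pandas as pd

# 12 consecutive integers arranged as 2 rows x 6 columns.
data = np.arange(12).reshape(2, 6)
df = pd.DataFrame(data)

print(df.shape)          # (2, 6)
print(list(df.index))    # [0, 1]
print(list(df.columns))  # [0, 1, 2, 3, 4, 5]
```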

Set Columns Labels

>>>import pandas as pd

>>>data = [['Alexa', 30],['Rose',25],['Jeremy',35]]

>>>df = pd.DataFrame(data, columns=['Name','Age'])

>>>print(df)

Set Row Labels

>>>df = pd.DataFrame(data,columns=['Name','Age'],index=['a','b','c'])

>>>print(df)

Change Labels After Creation

>>>df.index = ['A','B','C']; print(df) ## "index" for row labels

>>>df.columns = ["Leader_Name","Leader_Age"] ## "columns" for column labels
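Assigning to df.index or df.columns replaces every label at once. As an alternative sketch, rename() changes only the labels you list and returns a new DataFrame. The sample data is the same as above; the name renamed is just for illustration.

```python
import pandas as pd

data = [["Alexa", 30], ["Rose", 25], ["Jeremy", 35]]
df = pd.DataFrame(data, columns=["Name", "Age"], index=["a", "b", "c"])

# Only row label "a" and column label "Name" are changed.
# df itself is left untouched; the result is a new DataFrame.
renamed = df.rename(index={"a": "A"}, columns={"Name": "Leader_Name"})
print(renamed)
```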

Specify dtype, change dtype

>>>data = [['Alexa', 30],['Rose',25],['Jeremy',35]]

>>>df = pd.DataFrame(data, columns=['Name','Age'])

>>>df.Age = df.Age.astype(float) ## 'astype' : advanced (column names are case-sensitive)

>>>print(df) ## access a column by "." column name
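To confirm the conversion, the dtype can be compared before and after astype. This sketch uses bracket access (df["Age"]), which also works for column names that contain spaces or clash with DataFrame method names.

```python
import pandas as pd

data = [["Alexa", 30], ["Rose", 25], ["Jeremy", 35]]
df = pd.DataFrame(data, columns=["Name", "Age"])

before = df["Age"].dtype             # int64
df["Age"] = df["Age"].astype(float)  # astype returns a converted copy
after = df["Age"].dtype              # float64

print(before, "->", after)
```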

Create : From Dict

>>>d = {'col1':[1,2], 'col2':[3,4]}

>>>df = pd.DataFrame(data=d)

>>>print(df)

>>>data = {'Name':['John','Thomas','Stephen','Julia'],'Age':[34,35,36,28]}

>>>df = pd.DataFrame(data)

>>>print(df)

Create: From List of Dicts

>>>import pandas as pd

>>>data = [{'a':1,'b':2},{'a':5,'b':10,'c':20}]

>>>df = pd.DataFrame(data)

>>>print(df)
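With a list of dicts, each dict becomes one row and the keys become columns. A key missing from a row is filled with NaN, as this sketch with the same data shows:

```python
import pandas as pd

data = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(data)

# The first dict has no "c", so that cell is NaN,
# which also makes the whole "c" column float64.
print(df)
print(df.dtypes)
```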

DataFrame Operation

Select by column

>>>print(df['Column_Name'])

Add a Column

Adding a new column is similar to adding a key to a dict.

df['three']=pd.Series([10,20,30], index=['a','b','c'])

-->> Adds a column named 'three'. The data is a Series (1D data), and index= specifies which row each value goes into.

Different way to add column

df['four'] = df['one'] + df['three']

-->> Adds the values of columns 'one' and 'three' and stores the result in the new column 'four'.
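The two ways of adding a column can be tried on a small DataFrame. The columns 'one' and 'two' and their values are made up here; the Series deliberately omits row 'c' to show that values are placed by index label.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]},
                  index=["a", "b", "c"])

# The Series only has labels "a" and "b", so row "c" gets NaN.
df["three"] = pd.Series([10, 20], index=["a", "b"])

# Column arithmetic is element-wise; NaN propagates into the result.
df["four"] = df["one"] + df["three"]
print(df)
```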

Delete a column

Using the del statement

>>>del df['one']

Using pop

>>>df.pop('two')
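The difference between the two is that pop also returns the removed column. A third option, drop(columns=...), leaves the original untouched and returns a new DataFrame. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [5, 6]})

del df["one"]            # removes the column in place, returns nothing
popped = df.pop("two")   # removes in place and returns the column as a Series

# drop() does not modify df; it returns a new DataFrame.
dropped = df.drop(columns=["three"])

print(list(popped))          # [3, 4]
print(list(df.columns))      # ['three']
print(list(dropped.columns)) # []
```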

Delete a row

>>>df = pd.DataFrame([[1,2],[3,4]], columns=['a','b'])

>>>df = df.drop(0)

-->> Deletes the row labeled 0. (No row labels were given when df was created, so 0 and 1 were assigned automatically as the row labels.)
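After a row is dropped, the remaining labels are not renumbered. If a fresh 0-based index is wanted, reset_index(drop=True) is one way to get it, sketched here:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])

df = df.drop(0)            # drop by row label, not by position
print(list(df.index))      # [1]

# Renumber the rows from 0 and discard the old labels.
df_reset = df.reset_index(drop=True)
print(list(df_reset.index))  # [0]
```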

Slicing

>>>print(df[2:4]) #slicing by row

-->> Selects rows by position: from position 2 up to, but not including, position 4.
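Row slicing follows the usual Python rule that the end position is excluded. A small sketch with a made-up single-column DataFrame; df.iloc[2:4] is shown as the explicit positional equivalent.

```python
import pandas as pd

df = pd.DataFrame({"n": [10, 20, 30, 40, 50, 60]})

part = df[2:4]            # rows at positions 2 and 3
part_iloc = df.iloc[2:4]  # same rows, using explicit positional indexing

print(list(part["n"]))    # [30, 40]
```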

Add Row

>>>df = pd.concat([df, df2]) # DataFrame.append was removed in pandas 2.0

-->> Appends the rows of df2 to the end of df.
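DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions rows are added with pd.concat. A minimal sketch with made-up data, including ignore_index=True to renumber the rows:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[5, 6]], columns=["a", "b"])

# Rows of df2 are placed after the rows of df; original labels are kept.
stacked = pd.concat([df, df2])
print(list(stacked.index))     # [0, 1, 0]

# ignore_index=True assigns a fresh 0..n-1 index.
renumbered = pd.concat([df, df2], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2]
```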

Select Row

>>>df.iloc[2] #iloc : by integer location

-->> Counting 0, 1, 2, this selects the row at position 2 (the third row).

>>>df.loc['b'] #loc : by row index

-->> Selects the row labeled 'b'.
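The difference between iloc (position) and loc (label) is easiest to see when the row labels are not numbers. A sketch reusing the Name/Age sample data from earlier:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alexa", "Rose", "Jeremy"], "Age": [30, 25, 35]},
    index=["a", "b", "c"],
)

row_by_position = df.iloc[2]   # third row, regardless of its label
row_by_label = df.loc["b"]     # the row whose label is "b"
cell = df.loc["b", "Age"]      # a single value: row label, then column label

print(row_by_position["Name"])  # Jeremy
print(row_by_label["Name"])     # Rose
print(cell)                     # 25
```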

DataFrame Attributes

Creating sample data for the examples

>>>import numpy as np
>>>dates = pd.date_range('20130101', periods=6)
>>>print(dates)
>>>df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

-->> dates is created to serve as the row labels: 6 consecutive dates starting from 2013-01-01.

-->> np.random.randn draws random numbers from the standard normal distribution (mean 0, standard deviation 1). (6,4) produces 2D data with 6 rows and 4 columns. The column labels are set to A, B, C, D.

Peeking: head/tail

>>>df.head()

>>>df.tail()

-->> With no argument, the default n=5 is used. head returns the first n rows; tail returns the last n rows.
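A short sketch of the default and of passing n explicitly, using a made-up DataFrame of 10 rows:

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

print(len(df.head()))         # 5 (default n=5)
print(list(df.head(3)["n"]))  # [0, 1, 2]
print(list(df.tail(2)["n"]))  # [8, 9]
```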

Read Index / Columns

>>>df.index

-->> Interestingly, the output shows more than the labels themselves: for this DatetimeIndex it also includes the dtype and freq.

>>>df.columns

Values

>>>df.values

-->> This returns only the contents, without the row and column labels. Checking it with type() shows that it is a numpy.ndarray.
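The type check can be sketched as follows, with a small made-up DataFrame. df.to_numpy() is shown alongside as the method form that returns the same kind of array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

arr = df.values        # contents only, no row or column labels
arr2 = df.to_numpy()   # method form returning the same data

print(type(arr).__name__)  # ndarray
print(arr.shape)           # (2, 2)
```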

Describe

>>>df.describe()

-->> Display a quick summary of data
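For numeric columns, the summary contains count, mean, std, min, the quartiles, and max. A minimal sketch with a made-up column whose statistics are easy to check by hand:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})

summary = df.describe()
print(summary)

# Individual statistics can be read with loc: row is the statistic name.
print(summary.loc["mean", "A"])  # 2.5
```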

DataFrame Manipulation

Transpose

>>>df.T

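-->> Rows and columns are swapped.

Sort by Value

>>>df.sort_values(by='B')

-->> Sorts the rows in ascending order by the values in column B.

>>>df.sort_values(by='2013-01-01',axis=1, ascending=True)

-->> Sorts by the values in the row labeled '2013-01-01'. axis=1 means sorting horizontally (the columns are reordered); axis=0, the default, sorts vertically (the rows are reordered). The order is ascending.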
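Since the sample data above is random, the effect of axis is easier to follow with fixed values. This sketch uses made-up numbers and plain string row labels ('x', 'y') instead of dates.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1], "B": [1, 2], "C": [2, 3]},
                  index=["x", "y"])

# axis=0 (default): reorder the rows by the values in column "A".
by_column = df.sort_values(by="A")
print(list(by_column.index))   # ['y', 'x']

# axis=1: reorder the columns by the values in row "x" (3, 1, 2).
by_row = df.sort_values(by="x", axis=1)
print(list(by_row.columns))    # ['B', 'C', 'A']

# ascending=False reverses the order.
desc = df.sort_values(by="A", ascending=False)
print(list(desc.index))        # ['x', 'y']
```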

Filter by Boolean Index

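Let's filter by true/false conditions.

>>> df[df.A>0]

-->> Returns only the rows of df where the value in column A is greater than 0.

>>> df[df>0]

-->> Applies the condition to the whole DataFrame: values greater than 0 are kept, and every other cell becomes NaN.

That's it for now. Time to study harder; I'm still a beginner among beginners.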
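The two forms behave differently: a boolean Series drops whole rows, while a boolean DataFrame keeps the shape and blanks out individual cells. A sketch with fixed, made-up values:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -2, 3], "B": [-4, 5, -6]})

# df["A"] > 0 is a boolean Series: one True/False per row.
mask = df["A"] > 0
rows = df[mask]            # keeps only rows where the mask is True
print(list(rows.index))    # [0, 2]

# df > 0 is a boolean DataFrame: one True/False per cell.
cells = df[df > 0]         # same shape; cells that fail become NaN
print(cells)
```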
