
TIL: Python Pandas Basics & Installation

청렴결백한 만능 재주꾼 2020. 4. 26. 18:06

What is pandas?

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way towards this goal.

 

  • Rich relational tool, built on top of NumPy
  • Highly efficient, reportedly with much better performance compared to R
  • Easy to use, and the foundation for data analysis in Python

Main Features

Here are just a few of the things that pandas does well:

 

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging.
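To make two of the features above more concrete, here is a minimal sketch of automatic label alignment and a split-apply-combine group by. The Series, the column names (`team`, `score`), and all values are made up for illustration.

```python
import pandas as pd

# Two Series whose labels only partly overlap (made-up data).
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition aligns on labels; a label found in only one Series gives NaN.
total = s1 + s2
print(total)

# Split-apply-combine: sum the 'score' column within each 'team' group.
df = pd.DataFrame({"team": ["x", "y", "x", "y"], "score": [1, 2, 3, 4]})
per_team = df.groupby("team")["score"].sum()
print(per_team)
```

Here `total` has the labels a, b, c, d, with NaN for a and d because each of those labels exists in only one of the two Series.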

Installation

Terminal: pip3 install pandas --user

When I left off the trailing --user, the install failed with a permission error.
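To check that the install worked, something like the following should print the installed version. The exact version string depends on your environment, so no expected output is shown.

```python
import pandas as pd

# If the import succeeds, pandas is installed for this interpreter.
version = pd.__version__
print(version)
```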

 

Pandas Data type

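Data Type | Dim (dimensions) | Description | Analogy
Series | 1 | 1D labeled array | Python list, NumPy 1D array
DataFrame | 2 | General 2D labeled tabular structure | Python list of lists (tuples), NumPy 2D array
Panel | 3 | General 3D labeled array | (none given)

Note: Panel has since been removed from pandas (it is gone as of version 0.25), so in current versions only Series and DataFrame from this table are available.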

 

 

Pandas dtype

Pandas Type | Native Python Type | Description
object | str | The most general dtype. Will be assigned to your column if the column has mixed types (numbers and strings).
int64 | int | Integer numbers. 64 refers to the number of bits of memory allocated to hold each value.
float64 | float | Numbers with decimals. If a column contains numbers and NaNs, pandas will default to float64, because NaN itself is a floating-point value.
datetime64[ns], timedelta64[ns] | N/A (but see the datetime module in Python's standard library) | Values meant to hold time data. Look into these for time series experiments.
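A quick way to see these dtypes in practice is to build a small DataFrame and inspect `df.dtypes`. This is only a sketch; the column names and values below are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ints": [1, 2, 3],              # whole numbers only
    "with_nan": [1, np.nan, 3],     # a missing value forces a float column
    "mixed": [1, "two", 3.0],       # numbers and strings together
    "when": pd.to_datetime(["2020-04-24", "2020-04-25", "2020-04-26"]),
})

# One dtype per column.
print(df.dtypes)
```

The `ints` column comes out as int64, `with_nan` as float64, `mixed` as object, and `when` as a datetime64 dtype.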

 

 

DataFrame Create 

Create Empty DataFrame

>>>import pandas as pd

>>>df = pd.DataFrame()

>>>print(df)

 

Create From list: 1D

>>>df = pd.DataFrame([1,2,3,4,5])

>>>print(df)

>>>df.shape

(5, 1)

 

Create From ndarray:2D

>>>import numpy as np

>>> data = np.arange(12).reshape(2,6)

>>>df = pd.DataFrame(data)
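As a self-contained version of the ndarray example, the sketch below also checks the resulting shape and the default column labels, which pandas numbers from 0 when none are given.

```python
import numpy as np
import pandas as pd

# 12 consecutive integers arranged as 2 rows x 6 columns.
data = np.arange(12).reshape(2, 6)
df = pd.DataFrame(data)

print(df.shape)          # (2, 6)
print(list(df.columns))  # [0, 1, 2, 3, 4, 5]
```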

 

Set Columns Labels

>>>import pandas as pd

>>>data = [['Alexa', 30],['Rose',25],['Jeremy',35]]

>>>df = pd.DataFrame(data, columns=['Name','Age'])

>>>print(df)

 

Set Columns Labels result

Set Row Labels

>>>df = pd.DataFrame(data,columns=['Name','Age'],index=['a','b','c'])

>>>print(df)

 

Set Row Labels result

Change Labels After Creation

>>>df.index = ['A','B','C']; print(df)        ## "index" for row labels

>>>df.columns = ["Leader_Name","Leader_Age"]  ##"columns" for column labels

 

Changed Labels

Specify dtype, change dtype

>>>data = [['Alexa', 30],['Rose',25],['Jeremy',35]]

>>>df = pd.DataFrame(data, columns=['Name','Age'])

 

>>>df.Age = df.Age.astype(float)     ## 'astype' : advanced

>>>print(df)                                     ## access a column by "." column name

Age column dtype is changed to float
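The heading mentions both specifying and changing a dtype, so here is a small sketch of each. Changing uses `astype` as above; for specifying at creation time I use a standalone Series with the `dtype=` argument, which is my own illustrative choice rather than something from the original example.

```python
import pandas as pd

data = [["Alexa", 30], ["Rose", 25], ["Jeremy", 35]]
df = pd.DataFrame(data, columns=["Name", "Age"])

# Change the dtype of an existing column.
before = df["Age"].dtype
df["Age"] = df["Age"].astype(float)
after = df["Age"].dtype
print(before, "->", after)

# Specify the dtype up front when creating the data.
ages = pd.Series([30, 25, 35], dtype="float64")
print(ages.dtype)
```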

 

Create : From Dict

>>>d = {'col1':[1,2], 'col2':[3,4]}

>>>df = pd.DataFrame(data=d)

>>>print(df)

From dictionary to pd.DataFrame

>>>data = {'Name':['John','Thomas','Stephen','Julia'],'Age':[34,35,36,28]}

>>>df = pd.DataFrame(data)

>>>print(df)

Create: From List of Dicts

>>>import pandas as pd

>>>data = [{'a':1,'b':2},{'a':5,'b':10,'c':20}]

>>>df = pd.DataFrame(data)

>>>print(df)

Create From List of Dicts
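One thing worth noticing in the list-of-dicts case: the first dict has no 'c' key, so pandas fills that cell with NaN, and the column becomes float64 as a result. A minimal sketch using the same data:

```python
import pandas as pd

data = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(data)

# Columns are the union of all keys; missing entries become NaN.
print(df)
print(df.dtypes)
```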

 

DataFrame Operation

Select by column

>>>print(df['Column_Name'])

 

Add a Column

Adding a new column is similar to adding a key to a dict.

df['three']=pd.Series([10,20,30], index=['a','b','c'])

-->> Adds a column named 'three'. The data is a 1D Series, and index= specifies which rows the values go into (this assumes df has row labels 'a', 'b', 'c').

 

Another way to add a column

df['four'] = df['one'] + df['three']

-->> Adds the values of columns 'one' and 'three' together and stores the result in a new column 'four'.
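The snippets in this section assume a df with columns 'one' and 'two' and row labels 'a', 'b', 'c', which the post never defines. Here is a self-contained sketch with made-up values. I deliberately leave 'c' out of the new Series to show that rows without a matching label get NaN.

```python
import pandas as pd

# Hypothetical starting frame; the values are made up.
df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=["a", "b", "c"])

# Values are placed by row label; 'c' has no match, so it becomes NaN.
df["three"] = pd.Series([10, 20], index=["a", "b"])

# Column arithmetic is element-wise; NaN propagates into the result.
df["four"] = df["one"] + df["three"]
print(df)
```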

 

Delete a column

Using the del statement

>>>del df['one']

Using pop

>>>df.pop('two')
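The difference between the two, as far as I understand it, is that `del` just removes the column, while `pop` removes it and also returns it as a Series. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [5, 6]})

# del removes the column in place and returns nothing.
del df["one"]

# pop removes the column in place and hands it back as a Series.
popped = df.pop("two")

print(list(df.columns))
print(popped.tolist())
```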

 

Delete a row

>>>df = pd.DataFrame([[1,2],[3,4]], columns=['a','b'])

>>>df = df.drop(0)

-->> Deletes the row labeled 0. (No row labels were given when df was created, so 0 and 1 were assigned automatically as the row labels.)
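Note that `drop` returns a new DataFrame and leaves the original untouched, which is why the result is assigned back to df above. A sketch that keeps both around to show this (the variable name `dropped` is mine):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])

# drop(0) removes the row labeled 0 and returns a new DataFrame.
dropped = df.drop(0)

print(len(df))              # the original still has both rows
print(list(dropped.index))  # only label 1 remains
```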

 

Slicing

 

>>>print(df[2:4])      #slicing by row

-->> Selects rows by position: from the start position up to, but not including, the end position.

 

Add Row

>>>df = pd.concat([df, df2])      ## df.append(df2) was removed in pandas 2.0

-->> Appends the rows of df2 to the end of df.
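A minimal sketch of adding rows with `pd.concat`, using two small made-up frames. By default the original row labels are kept, so duplicates can appear; passing `ignore_index=True` renumbers the rows.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[5, 6]], columns=["a", "b"])

# Row labels are carried over as-is, so label 0 appears twice.
combined = pd.concat([df, df2])

# ignore_index=True assigns fresh labels 0..n-1.
renumbered = pd.concat([df, df2], ignore_index=True)

print(list(combined.index))
print(list(renumbered.index))
```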

 

Select Row

>>>df.iloc[2]   #iloc : by integer location

-->> Selects the row at integer position 2 (counting 0, 1, 2), i.e. the third row.

>>>df.loc['b']   #loc : by row index

-->> Selects the row labeled 'b'.
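Here is a self-contained sketch of `iloc` versus `loc` on a small labeled frame (the data is reused from the earlier Name/Age example). One detail that is easy to miss: position slices with `iloc` exclude the end, while label slices with `loc` include it.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alexa", "Rose", "Jeremy"], "Age": [30, 25, 35]},
    index=["a", "b", "c"],
)

third_row = df.iloc[2]        # by integer position
row_b = df.loc["b"]           # by row label
age_b = df.loc["b", "Age"]    # a single cell: row label, column label

by_position = df.iloc[0:2]    # rows a and b (end position excluded)
by_label = df.loc["a":"b"]    # rows a and b (end label included)
```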

 

 

DataFrame Attributes

Creating data to use in the examples

>>>import numpy as np
>>>dates = pd.date_range('20130101', periods=6)
>>>print(dates)
>>>df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

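-->> dates is created to serve as the row labels: 6 consecutive dates starting from 20130101.

-->> np.random.randn draws random numbers from a normal distribution with mean 0 and standard deviation 1. (6,4) produces 2D data with 6 rows and 4 columns. The column labels are set to A, B, C, D.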

 

Peeking: head/tail


>>>df.head()

>>>df.tail()

-->> With no argument, the default n=5 is used. head returns the first n rows; tail returns the last n rows.
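A runnable sketch of `head` and `tail` on the same kind of example data. The values are random, so only the row counts and labels are predictable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

first_five = df.head()   # default n=5
last_two = df.tail(2)    # explicit n=2

print(len(first_five), len(last_two))
```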

 

Read Index / Columns

>>>df.index

Reading index

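-->> Interestingly, the output shows not only the labels but also information about the index itself, such as its type (DatetimeIndex), dtype, and frequency.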

 

>>>df.columns

Reading columns

Values

>>>df.values

-->> This returns only the contents, without the row and column labels. Checking its type shows numpy.ndarray.

df.values result
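A small sketch confirming the type and shape of `df.values`. I also show `to_numpy()`, which does the same job and is the spelling the pandas documentation recommends; the data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

arr = df.values       # the underlying data, without any labels
same = df.to_numpy()  # equivalent method form

print(type(arr))
print(arr.shape)
```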

Describe

>>>df.describe()

-->> Display a quick summary of data

 

DataFrame Manipulation

Transpose

>>>df.T

-->> Rows and columns are swapped.

 

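Sort by Value

>>>df.sort_values(by='B') 

-->> Sorts the rows in ascending order of the values in column B.

>>>df.sort_values(by='2013-01-01',axis=1, ascending=True)

-->> Sorts by the values in the row labeled '2013-01-01'. axis=1 means sorting along the horizontal direction, so the columns are reordered; axis=0 (the default) reorders the rows. Ascending order.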
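Because the example data is random, the effect of `axis` is hard to see there. Below is a sketch on a tiny fixed frame with made-up row labels (`r0`, `r1`, `r2`), where the result is easy to verify by eye.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2], "B": [9, 7, 8]}, index=["r0", "r1", "r2"])

# Default axis=0: rows are reordered by the values in column A.
by_col = df.sort_values(by="A")

# axis=1: columns are reordered by the values in row r0 (A=3, B=9).
by_row = df.sort_values(by="r0", axis=1, ascending=False)

print(list(by_col.index))    # ['r1', 'r2', 'r0']
print(list(by_row.columns))  # ['B', 'A']
```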

 

 Filter by Boolean Index

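Let's filter using True/False conditions.

>>> df[df.A>0]

-->> Keeps only the rows of df where the value in column A is greater than 0.

>>> df[df>0]

-->> Keeps only the values greater than 0 across the whole of df. The shape stays the same, and every other value becomes NaN.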
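The two forms behave differently, which a fixed example makes clearer than random data. In this sketch (values made up), the column condition drops whole rows, while the frame-wide condition keeps the shape and masks the failing cells with NaN.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-1.0, 5.0, -6.0]})

# Row filter: keep rows where column A is positive (rows 0 and 2).
rows = df[df.A > 0]

# Element-wise mask: same shape, non-positive cells become NaN.
masked = df[df > 0]

print(rows)
print(masked)
```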


That's it for now. Time to study harder, since I'm still a beginner among beginners.
