What is pandas?
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way towards this goal.
- Rich relational tool, built on top of NumPy
- Highly efficient, much better performance compared to R
- Easy to use, foundation for data analysis in Python
Main Features
Here are just a few of the things that pandas does well:
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes (possible to have multiple labels per tick)
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging.
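As a small illustration of the automatic alignment item in the list above, here is a minimal sketch. The two Series and their labels are made-up data, not something from the pandas documentation.

```python
import pandas as pd

# Two Series with partially overlapping labels (hypothetical data).
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on labels, not on positions. Labels that exist in
# only one operand produce NaN in the result.
total = s1 + s2
print(total)  # a: NaN, b: 12.0, c: 23.0, d: NaN
```

Because 'a' and 'd' each appear in only one Series, the result holds NaN for them, which is also why the result is a float Series.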
Terminal: pip3 install pandas --user
When I left off the trailing --user, the install failed with a permission error.
Pandas Data type
Data Type | Dim (dimensions) | Description | Analogy |
Series | 1 | 1D labeled array | Python list, NumPy 1D array |
DataFrame | 2 | General 2D labeled tabular structure | Python list of lists (or tuples), NumPy 2D array |
Panel | 3 | General 3D labeled array (removed from pandas in version 0.25) | |
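To make the Dim column concrete, here is a minimal sketch that builds one Series and one DataFrame and checks their number of dimensions. The values and labels are made up for illustration.

```python
import pandas as pd

# A Series is a 1D labeled array (hypothetical values and labels).
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# A DataFrame is a 2D labeled table; each column is itself a Series.
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

print(s.ndim, df.ndim)  # 1 2
print(type(df['x']))    # a single column comes back as a Series
```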
Pandas dtype
Pandas Type | Native Python Type | Description |
object | str | The most general dtype. Assigned to a column that holds strings or mixed types (numbers and strings). |
int64 | int | Integer values. 64 is the number of bits used to store each value. |
float64 | float | Numbers with decimals. If a column contains numbers and NaNs (see below), pandas defaults to float64, because NaN is itself a floating-point value. |
datetime64[ns], timedelta64[ns] | N/A (but see the datetime module in Python's standard library) | Values meant to hold time data. Look into these for time series experiments. |
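The dtype rules in the table can be checked with a small experiment. This is only a sketch: the column names and values are made up, and the exact datetime resolution shown may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical columns, one per row of the dtype table.
df = pd.DataFrame({
    'ints': [1, 2, 3],               # whole numbers only
    'with_nan': [1, np.nan, 3],      # numbers plus a missing value
    'mixed': [1, 'two', 3.0],        # numbers and strings together
    'when': pd.to_datetime(['2013-01-01', '2013-01-02', '2013-01-03']),
})

# ints -> int64, with_nan -> float64, mixed -> object,
# when -> a datetime64 dtype
print(df.dtypes)
```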
DataFrame Create
Create Empty DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame()
Create From list: 1D
>>> df = pd.DataFrame([1, 2, 3, 4, 5])
Create From ndarray: 2D
>>> import numpy as np
>>> data = np.arange(12).reshape(2, 6)
>>> df = pd.DataFrame(data)
Set Column Labels
>>> import pandas as pd
>>> data = [['Alexa', 30], ['Rose', 25], ['Jeremy', 35]]
>>> df = pd.DataFrame(data, columns=['Name', 'Age'])
Set Row Labels
>>> df = pd.DataFrame(data, columns=['Name', 'Age'], index=['a', 'b', 'c'])
Change Labels After Creation
>>> df.index = ['A', 'B', 'C']  # "index" holds the row labels
>>> print(df)
>>> df.columns = ['Leader_Name', 'Leader_Age']  # "columns" holds the column labels
Specify dtype, change dtype
>>> data = [['Alexa', 30], ['Rose', 25], ['Jeremy', 35]]
>>> df = pd.DataFrame(data, columns=['Name', 'Age'])
>>> df.Age = df.Age.astype(float)  # access a column by "." plus its name (case-sensitive); astype converts the dtype
>>> print(df)
Create: From Dict
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> data = {'Name': ['John', 'Thomas', 'Stephen', 'Julia'], 'Age': [34, 35, 36, 28]}
>>> df = pd.DataFrame(data)
Create: From List of Dicts
>>> import pandas as pd
>>> data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
>>> df = pd.DataFrame(data)
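One thing worth seeing in the list-of-dicts case is what happens to the key 'c', which only the second dict has. A minimal sketch using the same data:

```python
import pandas as pd

# Same data as above: only the second dict has a 'c' key.
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)

# The union of all keys becomes the columns. The first row has no
# value for 'c', so pandas fills it with NaN, and the column becomes
# float64 as a result.
print(df)
print(df.dtypes)
```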
DataFrame Operation
Select by column
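The example for this heading did not survive, so here is a minimal sketch of column selection. The frame below, with columns 'one' and 'two' and rows 'a' to 'c', is my own stand-in chosen to match the names used in the following examples.

```python
import pandas as pd

# Hypothetical frame; the column names mirror the later examples.
df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]},
                  index=['a', 'b', 'c'])

col = df['one']            # a single label returns a Series
sub = df[['one', 'two']]   # a list of labels returns a DataFrame

print(col)
print(sub)
```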
Add a Column
Adding a new column is similar to adding a key to a dict:
>>> df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
-->> Adds a column named 'three'. The data is a 1D Series, and index= specifies which row each value goes into.
Another way to add a column
>>> df['four'] = df['one'] + df['three']
-->> Adds the values of columns 'one' and 'three' and stores the result in a fourth column, 'four'.
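Since the new column is matched to rows by label, it helps to see what happens when the Series does not cover every row. A minimal sketch, assuming a starting frame with column 'one' and rows 'a' to 'c' (my own stand-in data):

```python
import pandas as pd

# Hypothetical starting frame.
df = pd.DataFrame({'one': [1, 2, 3]}, index=['a', 'b', 'c'])

# The Series only has labels 'a' and 'b', so row 'c' gets NaN.
df['three'] = pd.Series([10, 20], index=['a', 'b'])

# Column arithmetic is element-wise per row; NaN propagates.
df['four'] = df['one'] + df['three']
print(df)
```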
Delete a column
Using the del statement
>>> del df['one']
Using Pop
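The pop example is missing here, so this is a minimal sketch of how `DataFrame.pop` behaves. The frame is made-up data.

```python
import pandas as pd

# Hypothetical frame with two columns.
df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]},
                  index=['a', 'b', 'c'])

# pop removes the column from df in place and returns it as a Series,
# which is the difference from del.
popped = df.pop('two')

print(df)      # only column 'one' is left
print(popped)  # the removed column
```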
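Delete a row
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
>>> df = df.drop(0)
-->> Deletes the row labeled 0. (No row labels were given when df was created, so 0 and 1 were assigned automatically.)
>>> print(df[2:4])  # slicing by row
-->> Selects rows from one position up to, but not including, another.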
Add Row
>>> df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
-->> Appends the rows of df2 to df.
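A minimal sketch of adding rows, written with `pd.concat` because `DataFrame.append` is not available in pandas 2.0 and later. The two frames are made-up data; the point is what happens to the row labels.

```python
import pandas as pd

# Two hypothetical frames with the same columns.
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# By default the original row labels are kept: 0, 1, 0, 1.
stacked = pd.concat([df, df2])

# ignore_index=True renumbers the rows: 0, 1, 2, 3.
renumbered = pd.concat([df, df2], ignore_index=True)

print(stacked)
print(renumbered)
```

Duplicate row labels are allowed, but they make `loc` lookups return several rows, so renumbering is often the more convenient choice.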
Select Row
>>> df.iloc[2]  # iloc: by integer location
-->> Selects the row at integer position 2 (counting 0, 1, 2), i.e. the third row.
>>> df.loc['b']  # loc: by row label
-->> Selects the row labeled 'b'.
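To see loc and iloc side by side, here is a minimal sketch that reuses the Name/Age data from earlier with row labels 'a' to 'c'.

```python
import pandas as pd

data = [['Alexa', 30], ['Rose', 25], ['Jeremy', 35]]
df = pd.DataFrame(data, columns=['Name', 'Age'], index=['a', 'b', 'c'])

by_pos = df.iloc[2]      # position 2 is the third row: Jeremy
by_label = df.loc['b']   # the row whose label is 'b': Rose

print(by_pos)
print(by_label)
```

Both calls return a Series whose index is the column names.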
DataFrame Attributes
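Create sample data for the examples
>>> import numpy as np
>>> import pandas as pd
>>> dates = pd.date_range('20130101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
-->> dates is created to serve as the row labels: 6 consecutive dates starting from 20130101.
-->> np.random.randn draws random numbers from a normal distribution with mean 0 and standard deviation 1. (6, 4) makes 2D data with 6 rows and 4 columns. The column labels are set to A, B, C, D.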
Peeking: head/tail
-->> When no argument is given, the default n=5 is used. head returns the first n rows; tail returns the last n rows.
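The head/tail calls themselves are not shown, so here is a minimal sketch on the sample data. The choice of `tail(2)` is just an example argument.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

first = df.head()      # no argument: the first 5 rows
last_two = df.tail(2)  # the last 2 rows

print(first)
print(last_two)
```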
Read Index / Columns
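-->> Interestingly, the output even shows the name the index is stored under.
-->> This returns only the contents, without the row and column labels; checking its type shows numpy.ndarray, as in the figure.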
-->> Display a quick summary of data
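The code for the three notes above is missing. My assumption is that they refer to `df.index` / `df.columns`, `df.values`, and `df.describe()`; the sketch below is written on that assumption.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

print(df.index)    # row labels (a DatetimeIndex here)
print(df.columns)  # column labels

values = df.values    # contents only, without any labels
print(type(values))   # numpy.ndarray

summary = df.describe()  # count, mean, std, min, quartiles, max per column
print(summary)
```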
DataFrame Manipulation
-->> The rows and columns are swapped.
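The call behind this note is missing; I am assuming it was the transpose attribute `df.T`. A minimal sketch on the sample data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# T returns a new frame where the old columns are the rows and the
# old row labels are the columns. df itself is not modified.
transposed = df.T

print(df.shape)          # (6, 4)
print(transposed.shape)  # (4, 6)
```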
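Intro: Sort by Value
-->> Sorts by the values in column B in ascending order.
>>> df.sort_values(by='2013-01-01', axis=1, ascending=True)
-->> Sorts by the values in the row labeled '2013-01-01'. axis=1 means the horizontal direction, so the columns are reordered; axis=0 is the vertical direction. Ascending order.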
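Here is a minimal sketch of both sorts on the sample data. The first call, `sort_values(by='B')`, is my assumption for the missing column-B example. For the row-wise sort I pass the row label as a Timestamp (`dates[0]`) rather than a string, simply to be explicit about the label type.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# Default axis=0: reorder the rows by the values in column B.
by_column = df.sort_values(by='B')

# axis=1: reorder the columns by the values in the first row.
by_row = df.sort_values(by=dates[0], axis=1, ascending=True)

print(by_column)
print(by_row)
```

Neither call changes df; each returns a new, sorted frame.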
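Filter by Boolean Index
Let's filter using True/False values.
>>> df[df.A > 0]
-->> Keeps only the rows of df where the value in column A is greater than 0.
>>> df[df > 0]
-->> Keeps only the values greater than 0 across the whole df; every other cell becomes NaN.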
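Because the sample data is random, the difference between the two filters is easier to see on fixed numbers. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical fixed data so the result is predictable.
df = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [-1.0, 5.0, -6.0]})

mask = df['A'] > 0   # a boolean Series: True, False, True
rows = df[mask]      # whole rows where the mask is True (rows 0 and 2)

cells = df[df > 0]   # same shape as df; non-positive cells become NaN

print(mask)
print(rows)
print(cells)
```

A boolean Series selects rows, while a boolean DataFrame masks individual cells and keeps the original shape.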
That's it for now. I need to study harder, since I'm still a beginner among beginners.
