Pandas


1. Introduction

  • Pandas provides two fundamental data structures: Series and DataFrame

1.1. Series

  • One-dimensional array-like object: pd.Series(data, index=index)
  • Two components: data and index
    • data: list, NumPy array, scalar, or dictionary (keys become the index, values become the values)
    • index: can be non-sequential, contain duplicates, or be non-numeric
import pandas as pd
import numpy as np

# Creating a Series with values [0.1, 0.2, 0.3] and index RangeIndex(start=0, stop=3, step=1)
# List
data = pd.Series([0.1, 0.2, 0.3])
print(data)
0    0.1
1    0.2
2    0.3
dtype: float64

# NumPy array
print(pd.Series(np.array([11, 8, 3, 1])))

# Scalar
print(pd.Series(100))                  # a single-element Series
print(pd.Series(100, index=range(3)))  # the scalar is broadcast over the index
0    100
1    100
2    100
dtype: int64
# Index example
val = np.arange(0, 5)
ind = val * 2 # non-sequential index
data = pd.Series(val, index=ind)
print(data)

0    0
2    1
4    2
6    3
8    4
dtype: int32

ind = [3, 9, 1, 10, 3] # duplicate index values are allowed
data = pd.Series(val, index=ind)

ind = ['a', 'b', 'c', 'd', 'e'] # non-numeric index
data = pd.Series(val, index=ind)
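The dictionary case from the bullet list above has no example in the code; a minimal sketch (the dictionary d is made up for illustration):

# Dictionary: keys become the index, values become the values
d = {'a': 1, 'b': 2, 'c': 3}
print(pd.Series(d))
a    1
b    2
c    3
dtype: int64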

1.2. DataFrame

  • Two-dimensional: a row index and a column index
  • Each column is a pandas Series
  • “Holes” in the Series (index labels missing from a column) are filled with the NaN missing value

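A minimal sketch of the “holes” behaviour, building a DataFrame from two Series with partially overlapping indexes (s1 and s2 are made up for illustration):

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([3.0, 4.0], index=['b', 'c'])
print(pd.DataFrame({'col1': s1, 'col2': s2}))
#    col1  col2
# a   1.0   NaN
# b   2.0   3.0
# c   NaN   4.0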

# Example of pd.DataFrame
years = [2000, 2001, 2002, 2003, 2004]
population = {'area_1': [1.1, 1.2, 1.5, 1.7, 2.0],
              'area_2': [0.4, 0.5, 0.7, 0.6, 0.7]}

df = pd.DataFrame(population, index=years)

# Example of the set_index() function
marks = {'Names': ['James', 'Andrew', 'Bob', 'Carlson'],
         'English': [58, 62, 77, 65],
         'Chinese': [55, 32, 64, 80],
         'Math': [61, 70, 81, 54]}

df = pd.DataFrame(marks)
df.set_index("Names", inplace=True)

# Example of the "columns" parameter
marks = {'James': [58, 55, 61],
         'Andrew': [62, 32, 70],
         'Bob': [77, 64, 81],
         'Carlson': [65, 80, 54]}

df = pd.DataFrame(list(marks.values()),
                  index=list(marks.keys()),
                  columns=['English', 'Chinese', 'Math'])
      
#         English  Chinese  Math
#James         58       55    61
#Andrew        62       32    70
#Bob           77       64    81
#Carlson       65       80    54

1.2.1. Column View

  • Column index: df.columns
  • Column access:
    • Single column: df[col_name] or df.col_name (attribute access works only if col_name is a valid identifier, e.g., contains no spaces)
    • Multiple columns: pass the column names in a list: df[[col_name1, col_name2, ...]]
# Single-column access
print(df['area_1'])
print(df.area_2)

# Multiple-column access
print(df[['area_1', 'area_2']])
  • Column addition: df[new_col] = seq, where seq can be a list or a pandas Series (see the sketch below)
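A minimal sketch using the population DataFrame from 1.2 (the column name area_3 and its values are made up for illustration; later examples assume this third column exists):

df['area_3'] = [0.9, 1.0, 1.1, 1.3, 1.4] # add a column from a list
# seq can also be a Series, e.g. df['area_3'] = df['area_1'] + df['area_2']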

1.2.2. Row View

  • Row index: df.index
  • Row access with .iloc (see the sketch below):
    • .iloc indexer: integer position-based
      • df.iloc[row_num]
      • df.iloc[[row_nums]] select multiple rows
      • df.iloc[start_row_num:end_row_num] NOT including end_row_num
  • Row access with .loc:
    • .loc indexer: label-based, which means you specify rows and columns by their row and column labels
    • Note: unlike .iloc, the end label IS included, not excluded
      • df.loc[row_label]
      • df.loc[[row_labels]] select multiple rows
      • df.loc[start_row_label:end_row_label] including end_row_label
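A minimal sketch of both indexers, using the population DataFrame from 1.2 (index = years):

df.iloc[0]           # first row, by position
df.iloc[[0, 2]]      # rows at positions 0 and 2
df.iloc[1:3]         # positions 1 and 2 (position 3 excluded)

df.loc[2000]         # row with label 2000
df.loc[[2000, 2002]] # rows with labels 2000 and 2002
df.loc[2000:2002]    # labels 2000 through 2002 (2002 included)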

1.2.3. Data Selection

.loc and .iloc

  • Select a portion of data using .iloc[rows, cols] (integer position-based) and .loc[rows, cols] (label-based)
# .iloc examples
df.iloc[:3, -1] # first 3 rows of the last column
df.iloc[[0, 2], :3] # rows 0 and 2, first 3 columns

# .loc examples
df.loc[[2000, 2001, 2003], :'area_2'] # rows 2000, 2001 and 2003, columns up to and including 'area_2'
df.loc[2001:2002, 'area_3'] # rows 2001 through 2002 (inclusive), column 'area_3'

1.2.4. Filter

Data selection with a filter mask

  • Filter mask: df[filter_mask]
  • Filter mask with column selection: df.loc[filter_mask, [columns]] where columns is OPTIONAL
  • String membership: df[df["col"].isin([value1, value2, value3])]
  • String matching: df[df["col"].str.contains(r"value1|value2")]
df[(df['age'] >= 10) & (df['height'] >= 10)]

df.loc[df['height'] >= 10, ['age', 'weight']]

# isin
data[data['Crime Code'].isin(['624', '510', '230'])]

# .str.contains
df[df["Name"].str.contains("Jonkheer")]['Name']
df[df["Name"].str.contains(r'Tyler|Goodwin')] # .str.contains with a regex

1.2.5. Index Manipulation

  • columns and index: both are Index objects
  • Immutable: you cannot change part of an Index in place
  • Replaceable: assign a completely new index, or use set_index(col_name) to promote a column to the row index
df.index[0] = 1999 # TypeError: Index is immutable
df.index = range(len(df)) # len(df) gives the number of rows of the DataFrame
df.columns = ['a1', 'a2', 'a3'] # replace with new column labels

df.set_index('year', inplace=True) # remove the column 'year' and set it as the index of df

2. Updating Rows and Columns

  • Updating rows: always use the .loc/.iloc indexer. Otherwise, the change only applies to a copy/view
# SettingWithCopyWarning: the change only applies to a copy/view, NOT to df itself
df[df['last name'] == 'Smith']['last name'] = 'Anderson' # chained indexing; use .loc/.iloc instead, as in the solution below
# Solution:
df.loc[df['last name'] == 'Smith', ['last name']] = 'Anderson' # the change is applied to df itself
  • Updating columns:
    • Assign a sequence to the column, or
    • .str class under Series class provides many utilities.
      • lower, upper, title, capitalize
      • isupper, islower, isalpha, isalnum
      • find, replace, contains
df['last name'] = list(df['last name'])[::-1] # assign a sequence to the column (here, the column's own values reversed)

# Make use of the .str class under the Series class
df['last name'] = df['last name'].str.lower()
df["Name"].str.contains("Mrs. Martin") # note: the pattern is treated as a regex by default

(Back to top)

3. Apply Function

3.1. Series Apply

  • Syntax: pandas.Series.apply(func): func is applied to each item in the Series, not to the Series itself
  • Pass the function name without parentheses (the function object itself, not a call)
# .apply(provided function)
df['last name'] = df['last name'].apply(str.title)

# .apply(self-defined function)
def change_Josh(fname):
    return 'Joe' if fname == 'Josh' else fname

df['first name'] = df['first name'].apply(change_Josh)

# .apply(lambda function)
import re
p = re.compile(r"\b(\w+)\.")
df["Title"] = df["Name"].apply(lambda s: p.search(s).group(1)) # apply a regex to split the title (e.g. 'Mr', 'Mrs') out of "Name"; assumes every name matches

3.2. DataFrame Apply

  • Syntax: pandas.DataFrame.apply(func): apply a function along an axis of the DataFrame
  • Axis parameter: by default axis=0, i.e., the function is applied to each column
    • 0 or ‘index’: apply the function to each column
    • 1 or ‘columns’: apply the function to each row
# Given the DataFrame below
df = pd.DataFrame({'A': [4, 4, 4], 'B': [9, 9, 9]})
   A  B
0  4  9
1  4  9
2  4  9
# Apply np.sqrt to every column of df
df.apply(np.sqrt)
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0
# Apply np.sum to every row of df
df.apply(np.sum, axis=1)
0    13
1    13
2    13

(Back to top)

4. Add or Remove Row Column and DataFrame

4.1. Concatenating a DataFrame to Another

4.1.1 Concat DataFrame

  • pd.concat([df1, df2, df3])
    • axis = 0 (default): append the next df by row, similar to append
    • axis = 1: append the next df by column
      • axis=1, join="outer": take the union of the indexes (default)
      • axis=1, join="inner": take the intersection of the indexes
# By default, axis=0 => similar to appending by row
result = pd.concat([df1, df2, df3])

# With axis=1, the default join="outer" takes the union of the indexes
result = pd.concat([df1, df4], axis=1)

# join="inner" keeps only the intersection of the indexes
result = pd.concat([df1, df4], axis=1, join="inner")
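The frames df1 through df4 were only shown in the original's screenshots; a minimal self-contained sketch with made-up frames that have a partially overlapping index:

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df4 = pd.DataFrame({'B': ['B2', 'B3'], 'D': ['D2', 'D3']}, index=[1, 2])

print(pd.concat([df1, df4], axis=1))               # union of indexes 0, 1, 2; NaN where a frame has no row
print(pd.concat([df1, df4], axis=1, join="inner")) # only the shared index 1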

4.1.2. Append DataFrame

  • A useful shortcut to concat() is the append() instance method on Series and DataFrame.
  • These methods actually predate concat. They concatenate along axis=0, namely the index.
  • Note: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; prefer pd.concat() in new code.
  • df = df.append(df2, ignore_index=True): appending a DataFrame to another
    • The resulting DataFrame must be assigned to the original or a new df, since there is no inplace parameter to apply the change to the original DataFrame.
    • ignore_index=True: reset the index
result = df1.append([df2, df3], ignore_index=True) # append multiple DataFrames to df1
result = df1.append(df4, sort=False) # df1 and df4 have different columns, so the missing values are filled with NaN


4.2. Removing Columns Or Rows

  • pandas.Series.drop(index)
    • remove items with the given indices from a Series
  • pandas.DataFrame.drop(index, axis=0, inplace=True)
    • remove rows or columns from a DataFrame
    • remove columns: axis = 1 or axis = ‘columns’
  • Keep rows via a negated mask: df = df[~(df['gender'] == 'male')]
# Series
df['gender'].drop([3, 4]) # drop the items with index 3 and 4 from the Series 'gender'

# DataFrame
df.drop([3, 4]) # drop row 3 and row 4
df.drop(df[df['gender'] == 'male'].index) # drop the rows whose index appears in the selection df[df['gender'] == 'male']
df.drop('year', axis=1, inplace=True) # drop the column 'year'

# Drop rows using ~ (keep only the rows whose value is NOT in the list)
df = df[~df['your column'].isin(['list of strings', 'A'])]

(Back to top)

5. Sorting Data

5.1. Sort by Columns

  • pandas.DataFrame.sort_values(by=col_names, ascending=True, inplace=True) sorts the DataFrame by the given columns
df.sort_values(by='last name')
df.sort_values(by=['last name', 'first name']) # sort by two columns
df.sort_values(by=['last name', 'first name'], ascending=[True, False]) # per-column sort order

5.2. Sort by Row Index

  • pandas.DataFrame.sort_index(inplace=True) sorts the DataFrame by its row index
df.sort_index(inplace=True)

df["SibSp"].value_counts().sort_index()

(Back to top)

6. EDA

6.1. Categorical & Numerical Data

6.1.1. Categorical Data

  • Check the types of the columns, and convert categorical columns to type 'category'
  • .astype('category') converts the data type
df["Survived"] = df["Survived"].astype('category')

6.1.2. Numerical Data

  • List of Numerical Columns
numerical_col = [c for c in df.columns if df[c].dtype in ('float64', 'int64')]

6.2. NaN Values

6.2.1. Check NaN Values

  • NaN or Null values can be counted with the .isna() or .isnull() method.
df["Age"].isnull().sum()

df['Age'].isna().sum()

6.2.2. Drop NaN Values

  • df.dropna(axis = 1) (see the sketch below)
    • axis = 0 (default): drop rows that contain NaN
    • axis = 1: drop columns that contain NaN
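A minimal sketch (dropna returns a new DataFrame unless inplace=True):

df.dropna()       # drop every row that contains a NaN (axis=0 is the default)
df.dropna(axis=1) # drop every column that contains a NaN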

6.2.3. Fill NaN

  • df['Age'].fillna(df["Age"].mean()) fills the missing values, here with the column mean (see the sketch below)
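A minimal sketch; fillna returns a new Series, so assign the result back:

df['Age'] = df['Age'].fillna(df['Age'].mean()) # replace the NaN ages with the mean age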

6.3. Unique Values

  • .nunique() counts how many unique values a column has
df["Ticket"].nunique() #681

6.4. Distribution of Values

  • .value_counts(normalize=True, sort=False) helps us find the distribution of the values in a column
  • .describe() gives an overview of the distribution
  • .quantile(np.arange(1, 11) * 0.1) returns quantile information; the parameter values must be between 0 and 1
  • .plot.hist() plots the distribution as a histogram
df["Survived"].value_counts()
df["Fare"].describe()

df["Age"].quantile(np.arange(1,11)*0.1) # np.arange(1, 11) gives an np.array containing 1 to 10.
#0.1    14.0
#0.2    19.0  <- the densest area of the distribution is between 19 and 31.8 (deciles 0.2 to 0.6)
#0.3    22.0
#0.4    25.0
#0.5    28.0
#0.6    31.8
#0.7    36.0
#0.8    41.0
#0.9    50.0
#1.0    80.0

df["Age"].plot.hist() # plot a histogram

6.5. Multi-Column Analysis

6.5.1. Groupby

  • df.groupby() creates groups within a DataFrame by a key, i.e., the DataFrame is separated into groups according to the values of the key.

    • __str__() doesn’t give you much information, as groupby() is lazy in nature: it doesn’t really do any operations to produce a useful result until you say so.
  • Various methods of .groupby():

    • Meta methods & properties: .get_group(key) returns the sub-DataFrame for the group with that key value
    • Aggregation methods: .value_counts(normalize=True, sort=False) and the other aggregation methods we can apply to the grouped dataset, as shown in the table below
      • The agg() method allows us to apply multiple such methods at once,
        • e.g.: grouped_by_df.agg(['count', 'mean', 'median', 'std', 'max', 'min'])
    Aggregation Method | Description
    count              | Number of non-NA values in the group
    sum                | Sum of non-NA values
    mean               | Mean of non-NA values
    median             | Arithmetic median of non-NA values
    std, var           | Sample standard deviation and variance
    min, max           | Minimum and maximum of non-NA values
by_state = df.groupby("state")
# Example of __str__()
print(by_state) #__str__() doesn’t give you much information 
#<pandas.core.groupby.generic.DataFrameGroupBy object at 0x107293278>

# Example of .get_group()
by_state.get_group("PA")
#     last_name first_name   birthday   gender type state               party
# 4       Clymer     George 1739-03-16      M  rep    PA                  NaN
#19       Maclay    William 1737-07-20      M  sen    PA  Anti-Administration
#21       Morris     Robert 1734-01-20      M  sen    PA   Pro-Administration
  • For categorical column analysis: .value_counts()
gend = df.groupby(by='Survived')['Sex'].value_counts(sort=False)


pclass_df = df.groupby(by='Survived')["Pclass"].value_counts(sort=False)
pclass_pct_df = df.groupby(by='Survived')["Pclass"].value_counts(sort=False, normalize=True)
pd.concat([pclass_df, pclass_pct_df], axis = 1)


  • For numerical column analysis: agg()
fare_gp = df.groupby(by="Survived")["Fare"]
fare_gp.mean() # mean fare for each group of the "Survived" column
#Survived
#0    22.117887
#1    48.395408

fare_gp.agg(['count', 'mean', 'median', 'std', 'max', 'min'])


6.5.2. Cut

  • pd.cut(col, intervals) bins values into discrete intervals.
    • For example, bucketize the values of df['Fare'] into the intervals [(0, 20] < (20, 100] < (100, 513]]
buckets = [0, 20, 100, 513]
df['Fare_buckets'] = pd.cut(df['Fare'], buckets)

PassengerId
1        (0, 20]
2      (20, 100]
3     (100, 513]

(Back to top)