Lecture Notebooks 26, 27 and 28
Combining Datasets: Concat

What are we going to learn in Note 26
Consider the following function, which we will use to quickly build example DataFrames:

import pandas as pd
import numpy as np

def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

# example DataFrame
make_df('ABC', range(3))

Output:
Simple Concatenation with pd.concat

pd.concat() can be used for a simple concatenation of Series or DataFrame objects, just as np.concatenate can be used for simple concatenations of arrays in NumPy.

Example: Concatenate Series

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])   # 3-row Series
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])   # 3-row Series
pd.concat([ser1, ser2])                              # 3 + 3 = 6-row Series

Output:
1    A
2    B
3    C
4    D
5    E
6    F
dtype: object
Example: Concatenate DataFrames

df1 = make_df('AB', [1, 2])   # recall that make_df() is the function we defined above
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')

Output:
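Note: display(...) as used in these slides, with expression strings as arguments, is not a built-in; it is a small helper for showing several objects at once. A minimal plain-text sketch, assuming each string names an expression visible in the calling scope, might look like this:

import sys

def display(*expressions):
    """Evaluate each expression string in the caller's namespace and print it."""
    frame = sys._getframe(1)   # frame of the caller
    for expr in expressions:
        print(expr)
        print(eval(expr, frame.f_globals, frame.f_locals))
        print()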
Example: Concatenation can also take place along the column axis (i.e., axis=1 or axis='columns').

df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis='columns')")

Output:
Duplicate indices

Example: pd.concat() preserves indices even when the result ends up with duplicate index values.

x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index   # make duplicate indices! (leave this line out to keep the original, distinct indices)
display('x', 'y', 'pd.concat([x, y])')

Output:
Catching the repeats as an error

Example: Set verify_integrity=True so that pd.concat() raises an error if the indices in the result overlap.

try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

Output:
ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
Ignoring the index

Example: Sometimes the index itself does not matter, and you would prefer it simply be ignored. This can be specified with the ignore_index flag; with it set to True, the concatenation creates a new integer index for the result.

display('x', 'y', 'pd.concat([x, y], ignore_index=True)')

Output:
Adding MultiIndex keys

Example: Use the keys option to specify a label for each data source; the result is a hierarchically indexed DataFrame containing the data.

display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")

Output:
Concatenation with joins

Example: Consider the following DataFrames.

df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')   # by default: axis=0, i.e., row concatenation

Output:
Note: By default, the entries for which no data is available are filled with NA values. To change this, we can specify one of several options for the join parameter of the concatenate function. By default, the join is a union of the input columns (join='outer'), but we can change this to an intersection of the columns using join='inner'.

display('df5', 'df6', "pd.concat([df5, df6], join='inner')")

Output:
Combining Datasets: Merge and Join

What are we going to learn in Note 27
One-to-one joins

Example: Consider the following two DataFrames:

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')

Output:
Example: To combine this information into a single DataFrame, we can use the pd.merge() function:

df3 = pd.merge(df1, df2)
df4 = pd.concat([df1, df2])
print(df3)
print(df4)

Output:

Note: The pd.merge() function recognizes that each DataFrame has an "employee" column, and automatically joins using this column as a key. The result of the merge is a new DataFrame that combines the information from the two inputs.
Example (cont.):

x1 = make_df("abc", range(3))
x2 = make_df("bcd", range(3))
display('x1', 'x2', 'pd.merge(x1, x2)')

Output:
Example (cont.): Keep in mind that the merge in general discards the index, except in the special case of merges by index. The indexes are changed!

x1 = make_df("abc", range(3))
x2 = make_df("bcd", range(1, 4))
display('x1', 'x2', 'pd.merge(x1, x2)')

Output:
Many-to-one joins

Example: Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate.

df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')

Output:
Many-to-many joins

Example: If the key column in both the left and right arrays contains duplicates, then the result is a many-to-many merge.

df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")

Output:
The on keyword

Example: One can explicitly specify the name of the key column using the on keyword, which takes a column name or a list of column names.

display('df1', 'df2', "pd.merge(df1, df2, on='employee')")

Output:
The left_on and right_on keywords

Example: At times you may wish to merge two datasets with different column names; for example, we may have a dataset in which the employee name is labeled "name" rather than "employee". In this case, we can use the left_on and right_on keywords to specify the two column names.

df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')

Output:
Example (cont.): Note that the result has a redundant column, which we can drop if desired, for example by using the drop() method of DataFrames.

pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)

Output:
The left_index and right_index keywords

Example: You can also merge on an index. Use the index as the key for merging by specifying the left_index and/or right_index flags in pd.merge():

df1a = df1.set_index('employee')   # Note: employee is now the index
df2a = df2.set_index('employee')   # Note: employee is now the index
display('df1a', 'df2a', "pd.merge(df1a, df2a, left_index=True, right_index=True)")

Output:
Example (cont.): Alternatively, DataFrames implement the join() method, which performs a merge that defaults to joining on indices:

display('df1a', 'df2a', 'df1a.join(df2a)')

Output:
Example (cont.): If you'd like to mix indices and columns, you can combine left_index with right_on, or left_on with right_index, to get the desired behavior:

display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")

Output:
Specifying Set Arithmetic for Joins

Example: When we merge two datasets that have only a single "name" entry in common, by default the result contains the intersection of the two sets of inputs; this is what is known as an inner join. We can specify this explicitly using the how keyword, which defaults to 'inner'.

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
pd.merge(df6, df7, how='inner')

Output:
Example (cont.): An outer join returns a join over the union of the input columns, and fills in all missing values with NAs.

display('df6', 'df7', "pd.merge(df6, df7, how='outer')")

Output:
Example (cont.): The left join and right join return joins over the left entries and right entries, respectively.

display('df6', 'df7', "pd.merge(df6, df7, how='left')")

Output:
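For symmetry (not shown on the slide), the right join keeps every row of the right-hand input instead; the call mirrors the one above:

display('df6', 'df7', "pd.merge(df6, df7, how='right')")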
Overlapping Column Names: The suffixes Keyword

Example: Consider the following example.

df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')

Output:
Note: Because the output would otherwise have two conflicting column names, the merge function automatically appends the suffixes _x and _y to make the output columns unique. If these defaults are inappropriate, it is possible to specify custom suffixes using the suffixes keyword:

display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=["_Sem1", "_Sem2"])')

Output:
Example: US States Data (Please read this part after class)
Aggregation and Grouping

What are we going to learn in Note 28
Simple Aggregation in Pandas

Example: Aggregation for a Series

rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser

Output:
0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

ser.sum()

Output: 2.811925491708157

ser.mean()

Output: 0.5623850983416314
Example: For a DataFrame, by default the aggregates return results within each column.

df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df.mean()

Output:

By specifying the axis argument, you can instead aggregate within each row:

df.mean(axis='columns')

Output:
Planets Data

Example: We are going to use the Planets dataset, which is available from the Seaborn package.

import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape

Output: (1035, 6)

planets.head(10)

Output:
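If Seaborn is not installed, the same table can be read with pandas alone; the URL below is an assumption about where load_dataset fetches its CSV files (the seaborn-data GitHub repository):

# fallback: load the planets table directly as a CSV (assumed URL)
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/planets.csv'
planets = pd.read_csv(url)
planets.shape   # should match the (1035, 6) shown on the slide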
Planets Data

Example (cont.): The describe() method computes several common aggregates for each column and returns the result.

planets.dropna().describe()   # dropping rows with missing values

Output:
Planets Data

The following table summarizes some other built-in Pandas aggregations:

Aggregation          Description
count()              Total number of items
first(), last()      First and last item
mean(), median()     Mean and median
min(), max()         Minimum and maximum
std(), var()         Standard deviation and variance
mad()                Mean absolute deviation
prod()               Product of all items
sum()                Sum of all items
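As a quick sketch (not on the slides), most of these can be called directly on the ser and df objects defined earlier:

ser.min(), ser.max()    # minimum and maximum
ser.std(), ser.var()    # standard deviation and variance
ser.prod()              # product of all items
df.count()              # number of non-null items in each column
df.median()             # median of each column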
GroupBy: Split, Apply, Combine
The groupby operation accomplishes the following:
The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.
The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
The combine step merges the results of these operations into an output array.
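To make the three steps concrete, here is a rough sketch (not from the slides) that reproduces df.groupby('key').sum() by hand on the small DataFrame used in the next example:

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])

# split: one sub-frame per distinct key
pieces = {k: df[df['key'] == k] for k in df['key'].unique()}

# apply: aggregate each piece (here, a sum of the data column)
sums = {k: piece['data'].sum() for k, piece in pieces.items()}

# combine: assemble the per-group results into a single output
result = pd.Series(sums, name='data').sort_index()
print(result)   # matches df.groupby('key')['data'].sum()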
Example:

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df
df.groupby('key').sum()

Output:
Example (Planets Data): A particular Series group can be selected from the original DataFrame group by referring to its column name.

planets.groupby('method')['orbital_period'].median()

Output:
Dispatch methods

Example: One can use the describe() method of DataFrames to perform a set of aggregations that describe each group in the data.

planets.groupby('method')['year'].describe()

Output:
Aggregate, filter, transform, apply

Example: We consider the following DataFrame.

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(1, 7),
                   'data2': rng.choice(range(1, 11), 6, replace=False)},
                  columns=['key', 'data1', 'data2'])
df

Output:
Example (cont.): The aggregate() method allows for more flexibility. It can take a string, a function, or a list thereof, and compute all the aggregates at once.

df.groupby('key').aggregate(['min', np.median, max])

Output:
Example (cont.): Another useful pattern is to pass a dictionary mapping column names to the operations to be applied to those columns.

df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})

Output:
Example (cont.): The filter() operation allows you to drop data based on the group properties. For example, we might want to keep all groups in which the standard deviation is larger than some critical value.

def filter_func(x):
    return x['data2'].std() > 4

display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")

Output:
Example (cont.): A transformation can return some transformed version of the full data to recombine. For such a transformation, the output has the same shape as the input.

display("df", "df.groupby('key').mean()", "df.groupby('key').transform(lambda x: x - x.mean())")

Output:
Example (cont.): The apply() method lets you apply an arbitrary function to the group results. The function should take a DataFrame and return either a Pandas object (e.g., a DataFrame or Series) or a scalar; the combine operation will be tailored to the type of output returned.

pd.DataFrame(df.groupby('key').data2.sum())

Output:
Example (cont.):

rng = np.random.RandomState(0)

def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x

# pd.DataFrame() is needed here to turn the Series result into a DataFrame
df_group_sums = pd.DataFrame(df.groupby('key')['data2'].sum())

display("df", "df_group_sums", "df.groupby('key').apply(norm_by_data2)")
Output:
Specifying the split key

Example: The key can be any series or list with a length matching that of the DataFrame.

L = [0, 1, 2, 0, 1, 2]   # A=0, B=1, C=2; this reproduces the df.groupby('key') grouping
display('df', 'df.groupby(L).sum()')

or

display('df', "df.groupby(df['key']).sum()")

Output:
A dictionary or series mapping index to group

Example: Another method is to provide a dictionary that maps index values to the group keys.

df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
display('df2', 'df2.groupby(mapping).sum()')

Output:
Any Python function

Example: One can pass any Python function that takes the index value as input and returns the group.

display('df2', 'df2.groupby(str.lower).mean()')

Output:
A list of valid keys

Example: Any of the preceding key choices can be combined to group on a multi-index.

df2.groupby([str.lower, mapping]).mean()

Output: