lab03_tables

School: Temple University
Course: 1013
Subject: Computer Science
Date: Dec 6, 2023
Type: html
Uploaded by samzahroun

Lab 3: Tables

Welcome to lab 3! This week, we will focus on manipulating tables. We will import our data sets into tables and complete the majority of our analysis using these tables. Tables are described in Chapter 6 of the Inferential Thinking text. A related approach in Python is the pandas DataFrame, which we will occasionally need to fall back on. Pandas is a mainstay data science tool.

First, set up the tests and imports by running the cell below.

In [1]:
    import numpy as np
    from datascience import *  # Brings the datascience Table object into Python

    # These lines load the tests.
    from gofer.ok import check

In [2]:
    # Enter your name as a string
    # Example
    dogname = "Fido"
    # Your name
    name = "Sam Zahroun"

1. Introduction

For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US states, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection.

In most data science applications, we have data about many entities, but we also have several kinds of data about each entity.

For example, in the cell below we have two arrays. The first one contains the world population in each year (estimated by the US Census Bureau), and the second contains the years themselves. These elements are in order, so the year and the world population for that year have the same index in their corresponding arrays.
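To make the shared-index idea concrete, here is a minimal sketch. It uses plain Python lists as stand-ins for the lab's arrays (an assumption, so that it runs without any extra packages) and only the first three years and population values from the world population data.

```python
# Plain Python lists stand in for the lab's arrays (an assumption made so the
# sketch runs without extra packages); positional indexing works the same way.
years = [1950, 1951, 1952]
population_amounts = [2557628654, 2594939877, 2636772306]

# Find the position of a year, then read the population at that same position.
i = years.index(1951)
population_in_1951 = population_amounts[i]

print(i, population_in_1951)
```

The lookup only works because both sequences are in the same order, which is the property the paragraph above relies on.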
In [3]:
    population_amounts = Table.read_table("world_population.csv").column("Population")
    years = np.arange(1950, 2016, 1)
    print("Population column:", population_amounts)
    print("Years column:", years)

Population column: [2557628654 2594939877 2636772306 2682053389 2730228104 2782098943
 2835299673 2891349717 2948137248 3000716593 3043001508 3083966929
 3140093217 3209827882 3281201306 3350425793 3420677923 3490333715
 3562313822 3637159050 3712697742 3790326948 3866568653 3942096442
 4016608813 4089083233 4160185010 4232084578 4304105753 4379013942
 4451362735 4534410125 4614566561 4695736743 4774569391 4856462699
 4940571232 5027200492 5114557167 5201440110 5288955934 5371585922
 5456136278 5538268316 5618682132 5699202985 5779440593 5857972543
 5935213248 6012074922 6088571383 6165219247 6242016348 6318590956
 6395699509 6473044732 6551263534 6629913759 6709049780 6788214394
 6866332358 6944055583 7022349283 7101027895 7178722893 7256490011]
Years column: [1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
 2010 2011 2012 2013 2014 2015]

Suppose we want to answer this question:

When did world population cross 6 billion?

You could technically answer this question just by staring at the arrays, but it's a bit convoluted: you would have to count to the position where the population first crossed 6 billion, then find the element at the same position in the years array.

In cases like these, it might be easier to put the data into a Table, a 2-dimensional type of dataset.

The expression below:

• creates an empty table using the expression Table(),
• adds two columns to the table by calling with_columns with four arguments (a label and a data array for each column),
• assigns the result to the name population, and finally
• evaluates population so that we can see the table.

The strings "Year" and "Population" are column labels that we have chosen. The names population_amounts and years were assigned above to two arrays of the same length. The function with_columns (described in the datascience documentation) takes alternating strings (the column labels) and arrays (the data in those columns), all separated by commas.

Tip: Both population_amounts and years need the same number of data points, or an error will be raised when you attempt to construct the table.

In [4]:
    population = Table().with_columns(
        "Population", population_amounts,
        "Year", years
    )
    population

Out[4]:
    Population    Year
    2557628654    1950
    2594939877    1951
    2636772306    1952
    2682053389    1953
    2730228104    1954
    2782098943    1955
    2835299673    1956
    2891349717    1957
    2948137248    1958
    3000716593    1959
    ... (56 rows omitted)

Now the data are all together in a single table! It's much easier to parse this data: if you need to know what the population was in 1959, for example, you can tell at a single glance. We'll revisit this table later.

Question 1

From the example in the cell above, identify the variable for each of the following: which variable contains the table, and which variable contains an array? Provide the correct variable name on the right of each equals sign.

In [5]:
    table_var = population
    array_var = years

In [6]:
    check('tests/q1.py')

Out[6]:
    All tests passed!

2. Creating Tables

Question 2

In the cell below, we've created 2 arrays. In these examples, we're going to be looking at the Environmental Performance Index, which describes the state of sustainability in each country. More information can be found at Yale EPI.

Using the steps above, assign top_10_epi to a table that has two columns called "Country" and "Score", which hold top_10_epi_countries and top_10_epi_scores respectively.

In [7]:
    top_10_epi_scores = make_array(82.5, 82.3, 81.5, 81.3, 80., 79.6, 78.9, 78.7, 77.7, 77.2)
    top_10_epi_countries = make_array(
        'Denmark', 'Luxembourg', 'Switzerland', 'United Kingdom', 'France',
        'Austria', 'Finland', 'Sweden', 'Norway', 'Germany'
    )
    top_10_epi = Table().with_columns(
        "Country", top_10_epi_countries,
        "Score", top_10_epi_scores
    )
    # We've put this next line here so your table will get printed out when you
    # run this cell.
    top_10_epi
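The Tip from Section 1 applies to these two arrays as well: the columns can only be combined into a table if they have the same number of values. The following is a minimal sketch of that pre-check, not part of the lab's tests. It uses plain Python lists as stand-ins for the arrays built with make_array (an assumption, so that it runs without the datascience package), and columns_line_up is a hypothetical helper name introduced only for illustration.

```python
# Plain Python lists stand in for the arrays built with make_array above
# (an assumption, so that this sketch needs no extra packages).
top_10_epi_scores = [82.5, 82.3, 81.5, 81.3, 80.0, 79.6, 78.9, 78.7, 77.7, 77.2]
top_10_epi_countries = [
    'Denmark', 'Luxembourg', 'Switzerland', 'United Kingdom', 'France',
    'Austria', 'Finland', 'Sweden', 'Norway', 'Germany',
]


def columns_line_up(*columns):
    """Hypothetical helper: True when every column has the same length."""
    return len({len(column) for column in columns}) <= 1


lengths_match = columns_line_up(top_10_epi_countries, top_10_epi_scores)

# Pair the values by position, the way they would sit side by side in a row.
rows = list(zip(top_10_epi_countries, top_10_epi_scores))

print(lengths_match)
print(rows[0])
```

Pairing by position is the same shared-index idea used for the population and year arrays: each country is matched with the score at the same index.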
