
Question
A *corpus* is a technical term for a collection of texts used to analyze a language and verify its linguistic properties. The first modern, computer-readable corpus was the *Brown Corpus of Standard American English*, compiled by Henry Kučera and W. Nelson Francis of Brown University. The Brown Corpus draws from American English texts printed in 1961 and was for many years a widely cited resource in computational linguistics.

The five most frequently occurring words in the Brown Corpus are *the, of, and, to,* and *a*. Consider a data set consisting of all occurrences of these words in the Corpus. The values of the variable named Word are *a, to, and, of,* and *the,* so Word is a nominal variable with five categories.

Frequency and relative frequency distributions are constructed to summarize the data. They are shown in the table that follows, but the table is incomplete. Use the dropdown menus to complete the table.

---

**Table 1**

| Word | Frequency (Thousands of occurrences) | Relative Frequency |
|------|--------------------------------------|--------------------|
| a    | 23.1                                 | 0.1252             |
| to   | 26.1                                 | ▼                  |
| and  | ▼                                    | 0.1566             |
| of   | 36.4                                 | 0.1973             |
| the  | 70.0                                 | 0.3794             |
| **Total** | 184.5                           | ▼                  |

---

The Brown Corpus contains about 1 million words. The frequency of the word *a* in the entire corpus is about ______ occurrences. The relative frequency of the word *a* in the entire corpus is about ______.
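The blanks in Table 1 and in the paragraph above all follow from one relation: relative frequency = frequency / total. Here is a minimal Python sketch of that arithmetic. It assumes the corpus holds exactly 1,000,000 words (the text says only "about 1 million"), and the variable names are illustrative.

```python
# Known entries from Table 1; frequencies are in thousands of occurrences.
total = 184.5
freq_to = 26.1
freq_a = 23.1
rel_and = 0.1566

# Relative frequency = frequency / total, so each blank follows from the other two.
rel_to = freq_to / total      # missing relative frequency for "to"
freq_and = rel_and * total    # missing frequency for "and", in thousands
rel_total = total / total     # relative frequencies over the five words sum to 1

# Whole-corpus figures for "a".
# ASSUMPTION: the corpus size is taken as exactly 1,000,000 words.
corpus_size = 1_000_000
count_a = freq_a * 1000       # convert thousands to occurrences
rel_a_corpus = count_a / corpus_size

print(f"to:    relative frequency = {rel_to:.4f}")
print(f"and:   frequency = {freq_and:.1f} thousand")
print(f"Total: relative frequency = {rel_total:.4f}")
print(f"a in whole corpus: {count_a:.0f} occurrences, relative frequency {rel_a_corpus:.4f}")
```

Note that the whole-corpus relative frequency of *a* differs from its entry in Table 1, because the table divides by the 184.5 thousand occurrences of the five words rather than by the size of the corpus.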

A *census* is an enumeration of a population. The U.S. Census Bureau conducts a census every 10 years, but in addition, the Population Estimates Program of the bureau publishes estimates for incorporated places every year. According to 2007 estimates, the five largest U.S. cities (by population) are New York City, Los Angeles, Chicago, Houston, and Phoenix.
### Table 1: Frequency and Relative Frequency of Populations in U.S. Cities

| City          | Frequency (Millions of people) | Relative Frequency |
|---------------|-------------------------------|-------------------|
| Phoenix       | 1.55                          | 0.0829            |
| Chicago       | 2.84                          |                   |
| Houston       |                               | 0.1182            |
| Los Angeles   | 3.83                          | 0.2048            |
| New York City | 8.27                          | 0.4422            |
| **Total**     | 18.70                         |                   |

#### Explanation:

- **Frequency**: This column lists the population in millions for each city.
- **Relative Frequency**: This column shows each city’s population as a fraction of the total population listed.
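The two column definitions above are enough to fill the blank cells in the city table. The sketch below is one way to do it, assuming the listed total of 18.70 million is the sum of the five city populations. The dictionary layout and names are illustrative.

```python
# Populations in millions, from the city table; None marks the blank cell.
total = 18.70
population = {
    "Phoenix": 1.55,
    "Chicago": 2.84,
    "Houston": None,
    "Los Angeles": 3.83,
    "New York City": 8.27,
}

# ASSUMPTION: the five frequencies add up to the listed total,
# so the missing one is the total minus the known four.
known = sum(p for p in population.values() if p is not None)
population["Houston"] = total - known

# Relative frequency = each city's population / total population listed.
relative = {city: pop / total for city, pop in population.items()}

for city, pop in population.items():
    print(f"{city:<14} {pop:5.2f}  {relative[city]:.4f}")
print(f"{'Total':<14} {total:5.2f}  {sum(relative.values()):.4f}")
```

As a consistency check, the Houston relative frequency computed this way can be compared with the value already printed in the table.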

#### Contextual Information:

- **U.S. Population**: Approximately 300 million.
- **Zipf’s Law**: The frequency of the k-th most frequent word (or, here, the population of the k-th largest city) is approximately proportional to 1/k.

#### Examples from Zipf’s Law:
- Under Zipf’s law, the third most frequent word in the Brown Corpus should occur about one-third as often as the most frequent word, just as the third-largest U.S. city should have about one-third the population of the largest.
- Likewise, the fifth most frequent word and the fifth-largest city should each come in at about one-fifth of the top-ranked value.

This table and the accompanying information illustrate how Zipf's Law can apply not only to language but also to city populations, suggesting that similar rank-frequency distributions arise in quite different domains.
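To see how closely the two data sets follow the 1/k rule, each value can be divided by the top-ranked value and compared with 1/k. The sketch below does this. The helper `ratio_to_top` is a hypothetical name, and the two blank table cells are filled in as the listed total minus the four known entries, which is an assumption about how the totals were formed.

```python
def ratio_to_top(values):
    """Return (rank, observed ratio to rank 1, Zipf prediction 1/rank) per value."""
    ranked = sorted(values, reverse=True)
    top = ranked[0]
    return [(k, v / top, 1 / k) for k, v in enumerate(ranked, start=1)]

# Word frequencies in thousands: the, of, and, to, a.
# ASSUMPTION: the blank entry for "and" is the total minus the other four.
words = [70.0, 36.4, 184.5 - (70.0 + 36.4 + 26.1 + 23.1), 26.1, 23.1]

# City populations in millions: New York City, Los Angeles, Chicago, Houston, Phoenix.
# ASSUMPTION: the blank entry for Houston is the total minus the other four.
cities = [8.27, 3.83, 2.84, 18.70 - (8.27 + 3.83 + 2.84 + 1.55), 1.55]

word_rows = ratio_to_top(words)
city_rows = ratio_to_top(cities)

for label, rows in (("words", word_rows), ("cities", city_rows)):
    print(label)
    for rank, observed, predicted in rows:
        print(f"  rank {rank}: observed {observed:.2f}, Zipf 1/k {predicted:.2f}")
```

With these values, the city populations track the 1/k prediction fairly closely, while the word frequencies fall off more slowly than Zipf's law would suggest for the top five ranks.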