How to Open and Read Txt File Pandas University_towns.txt
An important part of the data science process is asking a question, gathering and analyzing the information needed to answer that question, and testing our hypothesis on the topic against that information.
One such question is: are housing prices in university towns less affected by recessions? In this data analysis project, we will formulate a hypothesis around this question, gather and process the required data, and finally test our hypothesis.
Strategy
We will use three data sources for this project.
- We will use GDP data from the Bureau of Economic Analysis, US Department of Commerce: the GDP of the United States over time in current dollars (we use the chained value in 2009 dollars), in quarterly intervals, in the file gdplev.xls. We will use this data to determine recession periods. For this project, we will only look at GDP data from the first quarter of 2000 onward.
- We will use housing price data from the Zillow research data site, which has housing data for the United States. In particular, the data file for all homes at a city level, City_Zhvi_AllHomes.csv, has median home sale prices at a fine-grained level.
- We will use a list of university towns collected from Wikipedia to divide the housing price data into two sets: university towns and non-university towns. We have this information in the university_towns.txt file.
So, we will use the above-mentioned data sources and manipulate and transform them to test our hypothesis.
All the data and the notebook can be found on my GitHub profile.
Hypothesis
University towns have their mean housing prices less affected by recessions.
Required Definitions
- A quarter is a specific three-month period: Q1 is January through March, Q2 is April through June, Q3 is July through September, Q4 is October through December.
- A recession is defined as starting with two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth.
- A recession bottom is the quarter within a recession which had the lowest GDP.
- A university town is a city which has a high percentage of university students compared to the total population of the city.
First, let's import necessary libraries.
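The original import cell is not shown here, so the following is only a plausible minimal set for the steps described in this post: pandas and NumPy for the data wrangling and SciPy for the t-test. A plotting library would be added on top of this for the charts.

```python
import numpy as np                   # numeric helpers (for example, NaN-aware means)
import pandas as pd                  # tabular data loading and transformation
from scipy.stats import ttest_ind    # two-sample t-test used in the last step
```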
We need to make a list of university towns out of the text file. Let's sort that out.
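A minimal sketch of such a parser is given below. The layout of university_towns.txt is not shown in this post, so the sketch assumes the usual Wikipedia copy-paste format: a state heading ends in "[edit]", and each town line may carry a university name in parentheses and a footnote marker in brackets. The function name `parse_university_towns` and the sample lines are made up for illustration.

```python
import io
import re

import pandas as pd


def parse_university_towns(lines):
    """Turn the raw town list into a DataFrame with State and RegionName columns."""
    rows = []
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith("[edit]"):
            # Assumed convention: a state heading is marked with "[edit]".
            state = line[: -len("[edit]")].strip()
        else:
            # Assumed convention: keep only the text before the first "(" or "[".
            town = re.split(r"\s*[(\[]", line, maxsplit=1)[0].strip()
            rows.append((state, town))
    return pd.DataFrame(rows, columns=["State", "RegionName"])


# Made-up sample in the assumed layout; the real input would be
# open("university_towns.txt").
sample = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Florence (University of North Alabama)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)
towns = parse_university_towns(sample)
print(towns)
```

Passing any iterable of lines keeps the function easy to try on a few hand-written rows before pointing it at the real file.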
Let's take a look at the output of this function.
Series | State | RegionName |
---|---|---|
0 | Alabama | Auburn |
1 | Alabama | Florence |
2 | Alabama | Jacksonville |
3 | Alabama | Livingston |
4 | Alabama | Montevallo |
So far so good.
Now, let's get the GDP data sorted out. We will only use data from 2000 onwards.
Here, we are reading the Excel file into a DataFrame, skipping some initial rows to only include data from 2000 onwards and keeping only two columns: one that holds the quarter and one that holds the GDP in billions of chained dollars.
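The exact row offsets and column positions in gdplev.xls are not shown here, so the sketch below does not guess them. Instead of skipping rows while reading, it filters an already loaded table by quarter label, which gives the same two-column result. The column names "Quarter", "GDP current" and "GDP" and all the numbers are placeholders; loading the real file would go through `pd.read_excel` with `skiprows` and `usecols` suited to the sheet.

```python
import pandas as pd


def gdp_from_2000(raw):
    """Keep the quarter label and the chained GDP column, from 2000q1 onward."""
    gdp = raw[["Quarter", "GDP"]]
    # Labels such as "1999q4" and "2000q1" sort correctly as plain strings.
    gdp = gdp[gdp["Quarter"] >= "2000q1"]
    return gdp.reset_index(drop=True)


# Placeholder values standing in for the loaded spreadsheet.
raw = pd.DataFrame({
    "Quarter": ["1999q3", "1999q4", "2000q1", "2000q2"],
    "GDP current": [90.0, 91.0, 92.0, 93.0],
    "GDP": [100.0, 101.0, 102.0, 103.0],
})
gdp = gdp_from_2000(raw)
print(gdp)
```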
Let's look at the GDP data visually first.
We can see there's a small dip in the middle. Let's find out the recession period and look at it more closely.
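A function for finding the start of a recession could look like the sketch below. It applies the definition from earlier directly: the start is the first quarter that falls below the previous one and is followed by another fall. The name `recession_start`, the column names and the GDP values are assumptions for illustration, not the real BEA figures.

```python
import pandas as pd


def recession_start(gdp):
    """Return the first quarter of the first run of two consecutive GDP declines."""
    quarters = gdp["Quarter"].tolist()
    values = gdp["GDP"].tolist()
    for i in range(1, len(values) - 1):
        # Quarter i is below quarter i-1, and quarter i+1 falls again.
        if values[i] < values[i - 1] and values[i + 1] < values[i]:
            return quarters[i]
    return None  # no recession found in the window


# Placeholder GDP values, not the real figures.
gdp = pd.DataFrame({
    "Quarter": ["2008q1", "2008q2", "2008q3", "2008q4", "2009q1"],
    "GDP": [100.0, 101.0, 100.0, 98.0, 97.0],
})
start = recession_start(gdp)
print(start)  # 2008q3 for these placeholder values
```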
The above function finds the start of a recession based on the definition above. Let's run this and find the starting point of the recession.
So, within our window of analysis, a recession started in the third quarter of 2008, as is well known.
This function finds the end of a recession according to the definition.
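As a sketch of what that can look like: starting from the recession start, we scan forward for the first two consecutive quarters of growth and report the second of them as the end. The name `recession_end`, the column names and the GDP values are assumptions for illustration.

```python
import pandas as pd


def recession_end(gdp, start):
    """Return the second of the first two consecutive growth quarters after start."""
    quarters = gdp["Quarter"].tolist()
    values = gdp["GDP"].tolist()
    begin = quarters.index(start)
    for i in range(begin + 2, len(values)):
        # Quarter i-1 grew over quarter i-2, and quarter i grew again.
        if values[i - 1] > values[i - 2] and values[i] > values[i - 1]:
            return quarters[i]
    return None  # the recession has not ended within the window


# Placeholder GDP values, not the real figures.
gdp = pd.DataFrame({
    "Quarter": ["2008q2", "2008q3", "2008q4", "2009q1", "2009q2", "2009q3", "2009q4"],
    "GDP": [101.0, 100.0, 98.0, 97.0, 96.0, 97.0, 98.0],
})
end = recession_end(gdp, "2008q3")
print(end)  # 2009q4 for these placeholder values
```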
Let's run this and find out the end of the recession period.
So, the recession ended in the final quarter of 2009. It lasted over a year!
Now let's figure out the bottom of the recession.
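Given the start and end quarters, the bottom is simply the quarter with the lowest GDP between them. A minimal sketch, again with hypothetical names and placeholder values:

```python
import pandas as pd


def recession_bottom(gdp, start, end):
    """Return the quarter with the lowest GDP between start and end, inclusive."""
    by_quarter = gdp.set_index("Quarter")["GDP"]
    # Label slicing with .loc includes both endpoints.
    window = by_quarter.loc[start:end]
    return window.idxmin()


# Placeholder GDP values, not the real figures.
gdp = pd.DataFrame({
    "Quarter": ["2008q2", "2008q3", "2008q4", "2009q1", "2009q2", "2009q3", "2009q4"],
    "GDP": [101.0, 100.0, 98.0, 97.0, 96.0, 97.0, 98.0],
})
bottom = recession_bottom(gdp, "2008q3", "2009q4")
print(bottom)  # 2009q2 for these placeholder values
```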
Now that we have figured out the recession period, let's visualize the period.
So, the second quarter of 2009 had the lowest GDP. We can see two consecutive decreases starting from 2008q3 and two consecutive quarters of growth starting from 2009q3. So, 2009q4 marks the end of this recession period.
So, we have determined the important data points related to the recession. Let's look into the housing price data now.
So, after a first look at the data, it seems we need the following cleaning:
- The states are given as abbreviations, so we need to map them to their full names.
- It contains month-by-month prices from 1996; we only need data from 2000, so we drop the unnecessary columns.
- Use the state and region name for indexing.
- Finally, convert the month-by-month data to quarterly data.
So, let's get to work!
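The four cleaning steps can be sketched as follows. This assumes the monthly columns are labelled "YYYY-MM" and that the quarterly value is the mean of the months in that quarter; the post does not state either. The `STATES` mapping is cut down to the two abbreviations used in the sample, and the helper names and prices are made up.

```python
import pandas as pd

# Only the abbreviations used in the sample below; a full table covers every state.
STATES = {"AL": "Alabama", "NY": "New York"}


def quarter_label(month):
    """Map a "YYYY-MM" label to a quarter label, e.g. "2000-02" -> "2000q1"."""
    year, mon = month.split("-")
    return f"{year}q{(int(mon) - 1) // 3 + 1}"


def to_quarterly(housing, first_month="2000-01"):
    """Expand state names, index by (State, RegionName), and average months per quarter."""
    housing = housing.copy()
    housing["State"] = housing["State"].map(STATES)
    housing = housing.set_index(["State", "RegionName"])
    # Drop every column that comes before the first month we care about.
    monthly = housing.loc[:, first_month:]
    labels = pd.Index([quarter_label(col) for col in monthly.columns])
    # Transpose so months become rows, group them by quarter, then transpose back.
    return monthly.T.groupby(labels).mean().T


# Made-up prices in the assumed layout.
housing = pd.DataFrame({
    "RegionName": ["Auburn", "New York"],
    "State": ["AL", "NY"],
    "1999-12": [90.0, 190.0],
    "2000-01": [100.0, 200.0],
    "2000-02": [110.0, 210.0],
    "2000-03": [120.0, 220.0],
    "2000-04": [130.0, 230.0],
})
quarterly = to_quarterly(housing)
print(quarterly)
```

Grouping the transposed frame avoids grouping along the column axis directly, which newer pandas versions discourage.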
Let's take a look at the shape of the transformed Zillow data. It should have the same number of rows as the original file, that is 10730, and the number of columns should be 67 (4 quarters in 17 years minus the last quarter in 2016).
So, the data is prepared as desired. Now, we will move into the final phase of the analysis.
- We only need to look at the recession period, so we will narrow the columns to the recession period.
- We need a price ratio column to compare the prices between the start and bottom of the recession.
- We need to split the data into university towns and non-university towns to run a t-test on our hypothesis.
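The three steps above can be sketched as follows. The ratio is taken here as the price at the recession start divided by the price at the bottom, which is one reading of the description; the function names, the "PriceRatio" column name and all prices are assumptions for illustration.

```python
import pandas as pd


def add_price_ratio(quarterly, start, bottom):
    """Keep the columns from start to bottom and add start price / bottom price."""
    window = quarterly.loc[:, start:bottom].copy()
    window["PriceRatio"] = window[start] / window[bottom]
    return window


def split_by_town(prices, towns):
    """Split rows into (university towns, all other towns) by (State, RegionName)."""
    town_keys = list(zip(towns["State"], towns["RegionName"]))
    is_town = prices.index.isin(town_keys)
    return prices[is_town], prices[~is_town]


# Made-up quarterly prices indexed by (State, RegionName).
index = pd.MultiIndex.from_tuples(
    [("Alabama", "Auburn"), ("Alabama", "Mobile"), ("New York", "New York")],
    names=["State", "RegionName"],
)
quarterly = pd.DataFrame(
    {
        "2008q2": [105.0, 210.0, 410.0],
        "2008q3": [100.0, 200.0, 400.0],
        "2008q4": [95.0, 160.0, 380.0],
        "2009q1": [90.0, 120.0, 350.0],
        "2009q2": [80.0, 100.0, 320.0],
    },
    index=index,
)
towns = pd.DataFrame({"State": ["Alabama"], "RegionName": ["Auburn"]})

prices = add_price_ratio(quarterly, "2008q3", "2009q2")
university, other = split_by_town(prices, towns)
print(university["PriceRatio"])
print(other["PriceRatio"])
```

With this definition, a ratio close to 1 means prices barely moved between the start and the bottom, while a larger ratio means a steeper fall.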
Now, let's run the t-test to validate our hypothesis!
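A minimal sketch of this step with `scipy.stats.ttest_ind` is shown below. It compares the price ratios of the two groups, skips missing values, and reports which group has the lower mean ratio. The function name `compare_ratios`, the 0.05 default for `alpha` as a parameter, and the sample ratios are assumptions for illustration, not the real results.

```python
import numpy as np
from scipy.stats import ttest_ind


def compare_ratios(university_ratios, other_ratios, alpha=0.05):
    """Two-sample t-test on price ratios; returns (different, p_value, lower_group)."""
    # nan_policy="omit" skips towns with missing prices.
    result = ttest_ind(university_ratios, other_ratios, nan_policy="omit")
    p_value = float(result.pvalue)
    different = p_value < alpha
    if np.nanmean(university_ratios) < np.nanmean(other_ratios):
        lower_group = "university town"
    else:
        lower_group = "non-university town"
    return different, p_value, lower_group


# Made-up ratios, only to show the call.
university_ratios = np.array([1.01, 1.02, 1.00, 1.03, 1.02])
other_ratios = np.array([1.10, 1.12, 1.08, 1.11, 1.09, np.nan])

different, p_value, lower_group = compare_ratios(university_ratios, other_ratios)
print(different, p_value, lower_group)
```

A p-value below the threshold only says the two means differ; checking which group has the lower mean ratio is what tells us the direction of the difference.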
The p-value is well below the 0.05 threshold, so we can reject the null hypothesis and claim that housing prices in university towns are less affected by recessions.
So, we have gone from collecting the data to cleaning and visualizing it, and finally testing our hypothesis against it.
Source: https://modasserbillah.ml/2018/07/01/hypothesis-testing/