Update README.md
baksho authored Jul 27, 2024
1 parent f790abf commit 0b60c52
Showing 1 changed file with 19 additions and 79 deletions.
98 changes: 19 additions & 79 deletions datasets/california_housing/README.md
@@ -3,92 +3,32 @@
#### Source
This dataset was downloaded from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices?resource=download). The dataset was originally featured in the paper: _Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297._

### Data description
#### About Dataset

Int64Index: 3292 entries, 0 to 3291
Data columns (total 17 columns):
"LOCATION" 3292 non-null object
Country 3292 non-null object
INDICATOR 3292 non-null object
Indicator 3292 non-null object
MEASURE 3292 non-null object
Measure 3292 non-null object
INEQUALITY 3292 non-null object
Inequality 3292 non-null object
Unit Code 3292 non-null object
Unit 3292 non-null object
PowerCode Code 3292 non-null int64
PowerCode 3292 non-null object
Reference Period Code 0 non-null float64
Reference Period 0 non-null float64
Value 3292 non-null float64
Flag Codes 1120 non-null object
Flags 1120 non-null object
dtypes: float64(3), int64(1), object(13)
memory usage: 462.9+ KB
##### Context
This dataset is used in the second chapter of Aurélien Géron's book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'. It serves as an excellent introduction to implementing machine learning algorithms because it requires only rudimentary data cleaning, has an easily understandable list of variables, and is of a manageable size.

### Example usage with Python pandas

>>> life_sat = pd.read_csv("oecd_bli_2015.csv", thousands=',')

>>> life_sat_total = life_sat[life_sat["INEQUALITY"]=="TOT"]

>>> life_sat_total = life_sat_total.pivot(index="Country", columns="Indicator", values="Value")

>>> life_sat_total.info()
<class 'pandas.core.frame.DataFrame'>
Index: 37 entries, Australia to United States
Data columns (total 24 columns):
Air pollution 37 non-null float64
Assault rate 37 non-null float64
Consultation on rule-making 37 non-null float64
Dwellings without basic facilities 37 non-null float64
Educational attainment 37 non-null float64
Employees working very long hours 37 non-null float64
Employment rate 37 non-null float64
Homicide rate 37 non-null float64
Household net adjusted disposable income 37 non-null float64
Household net financial wealth 37 non-null float64
Housing expenditure 37 non-null float64
Job security 37 non-null float64
Life expectancy 37 non-null float64
Life satisfaction 37 non-null float64
Long-term unemployment rate 37 non-null float64
Personal earnings 37 non-null float64
Quality of support network 37 non-null float64
Rooms per person 37 non-null float64
Self-reported health 37 non-null float64
Student skills 37 non-null float64
Time devoted to leisure and personal care 37 non-null float64
Voter turnout 37 non-null float64
Water quality 37 non-null float64
Years in education 37 non-null float64
dtypes: float64(24)
memory usage: 7.2+ KB
The data come from the 1990 California census. Although they will not help you predict current housing prices the way the Zillow Zestimate dataset does, they provide an accessible introductory dataset for teaching the basics of machine learning.

## GDP per capita
### Source
Dataset obtained from the IMF's website at: http://goo.gl/j1MSKe
##### Content
The data pertain to the houses found in a given California district, with summary statistics about them based on the 1990 census data. Be warned: the data aren't cleaned, so some preprocessing steps are required. The columns are as follows; their names are fairly self-explanatory:

### Data description
1. `longitude`: A measure of how far west a house is; California longitudes are negative, so a more negative value is farther west
2. `latitude`: A measure of how far north a house is; a higher value is farther north
3. `housing_median_age`: Median age of a house within a block; a lower number is a newer building
4. `total_rooms`: Total number of rooms within a block
5. `total_bedrooms`: Total number of bedrooms within a block
6. `population`: Total number of people residing within a block
7. `households`: Total number of households within a block (a household is a group of people residing within a home unit)
8. `median_income`: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
9. `median_house_value`: Median house value for households within a block (measured in US Dollars)
10. `ocean_proximity`: Location of the house relative to the ocean or sea
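
The column list above can be checked against a loaded frame. The following is a minimal sketch, assuming pandas is installed; `SAMPLE_CSV`, `EXPECTED_COLUMNS` and the two rows of values are hypothetical stand-ins made up for illustration and are not taken from `housing.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the column layout listed above.
# The values are invented for illustration; they are NOT from housing.csv.
SAMPLE_CSV = """\
longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.2,37.8,40.0,900.0,130.0,320.0,125.0,8.3,450000.0,NEAR BAY
-118.3,34.1,25.0,1500.0,280.0,1100.0,350.0,3.1,180000.0,INLAND
"""

# Column names exactly as documented in the list above.
EXPECTED_COLUMNS = [
    "longitude",
    "latitude",
    "housing_median_age",
    "total_rooms",
    "total_bedrooms",
    "population",
    "households",
    "median_income",
    "median_house_value",
    "ocean_proximity",
]

housing = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Any documented column that is absent from the loaded frame.
missing_columns = [c for c in EXPECTED_COLUMNS if c not in housing.columns]

# Separate the numeric columns from the categorical one (ocean_proximity).
numeric_columns = list(housing.select_dtypes(include="number").columns)
categorical_columns = [c for c in housing.columns if c not in numeric_columns]

print("missing:", missing_columns)
print("numeric:", numeric_columns)
print("categorical:", categorical_columns)
```

With the real file, the same checks would apply after replacing the `io.StringIO(SAMPLE_CSV)` argument with the path to `housing.csv`.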

Int64Index: 190 entries, 0 to 189
Data columns (total 7 columns):
Country 190 non-null object
Subject Descriptor 189 non-null object
Units 189 non-null object
Scale 189 non-null object
Country/Series-specific Notes 188 non-null object
2015 187 non-null float64
Estimates Start After 188 non-null float64
dtypes: float64(2), object(5)
memory usage: 11.9+ KB
Here, the dependent variable is `median_house_value`. The dataset contains 20,640 records on California housing, drawn from the 1990 census.
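
As a rough sketch of what separating the dependent variable and doing basic cleaning might look like, assuming pandas: the sample rows below are hypothetical, and one `total_bedrooms` value is left blank purely so there is something to clean. Median filling is only one possible choice, not a method prescribed by this README.

```python
import io

import pandas as pd

# Hypothetical three-row sample; values are invented for illustration
# and the blank total_bedrooms entry is deliberate.
SAMPLE_CSV = """\
longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.2,37.8,40.0,900.0,130.0,320.0,125.0,8.3,450000.0,NEAR BAY
-118.3,34.1,25.0,1500.0,,1100.0,350.0,3.1,180000.0,INLAND
-117.1,32.7,30.0,1200.0,200.0,800.0,260.0,4.0,210000.0,NEAR OCEAN
"""

TARGET = "median_house_value"  # the dependent variable named above

housing = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Split into features (X) and the dependent variable (y).
X = housing.drop(columns=[TARGET])
y = housing[TARGET]

# Count missing values per feature column before cleaning.
missing_per_column = X.isna().sum()

# One simple option: fill numeric gaps with each column's median.
X_filled = X.fillna(X.median(numeric_only=True))

print(missing_per_column[missing_per_column > 0])
print(X_filled["total_bedrooms"].tolist())
```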

### Example usage with Python pandas

>>> gdp_per_capita = pd.read_csv(
... datapath+"gdp_per_capita.csv", thousands=',', delimiter='\t',
... encoding='latin1', na_values="n/a", index_col="Country")
>>> cal_housing = pd.read_csv(datapath + "housing.csv")
...
>>> gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)

