Update README.md

baksho · Jul 27, 2024 · 0b60c52
1 parent f790abf
commit 0b60c52
Showing 1 changed file with 19 additions and 79 deletions.
diff --git a/datasets/california_housing/README.md b/datasets/california_housing/README.md
@@ -3,92 +3,32 @@
 #### Source
 This dataset was downloaded from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices?resource=download). The dataset was originally featured in the paper: _Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297._
 
-### Data description
+#### About Dataset
 
-    Int64Index: 3292 entries, 0 to 3291
-    Data columns (total 17 columns):
-    "LOCATION"              3292 non-null object
-    Country                  3292 non-null object
-    INDICATOR                3292 non-null object
-    Indicator                3292 non-null object
-    MEASURE                  3292 non-null object
-    Measure                  3292 non-null object
-    INEQUALITY               3292 non-null object
-    Inequality               3292 non-null object
-    Unit Code                3292 non-null object
-    Unit                     3292 non-null object
-    PowerCode Code           3292 non-null int64
-    PowerCode                3292 non-null object
-    Reference Period Code    0 non-null float64
-    Reference Period         0 non-null float64
-    Value                    3292 non-null float64
-    Flag Codes               1120 non-null object
-    Flags                    1120 non-null object
-    dtypes: float64(3), int64(1), object(13)
-    memory usage: 462.9+ KB
+##### Context
+This dataset is used in the second chapter of Aurélien Géron's book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'. It serves as a good introduction to implementing machine learning algorithms because it requires only rudimentary data cleaning, has an easily understandable list of variables, and is of a manageable size.
 
-### Example usage using python Pandas
-
-    >>> life_sat = pd.read_csv("oecd_bli_2015.csv", thousands=',')
-
-    >>> life_sat_total = life_sat[life_sat["INEQUALITY"]=="TOT"]
-
-    >>> life_sat_total = life_sat_total.pivot(index="Country", columns="Indicator", values="Value")
-
-    >>> life_sat_total.info()
-    <class 'pandas.core.frame.DataFrame'>
-    Index: 37 entries, Australia to United States
-    Data columns (total 24 columns):
-    Air pollution                                37 non-null float64
-    Assault rate                                 37 non-null float64
-    Consultation on rule-making                  37 non-null float64
-    Dwellings without basic facilities           37 non-null float64
-    Educational attainment                       37 non-null float64
-    Employees working very long hours            37 non-null float64
-    Employment rate                              37 non-null float64
-    Homicide rate                                37 non-null float64
-    Household net adjusted disposable income     37 non-null float64
-    Household net financial wealth               37 non-null float64
-    Housing expenditure                          37 non-null float64
-    Job security                                 37 non-null float64
-    Life expectancy                              37 non-null float64
-    Life satisfaction                            37 non-null float64
-    Long-term unemployment rate                  37 non-null float64
-    Personal earnings                            37 non-null float64
-    Quality of support network                   37 non-null float64
-    Rooms per person                             37 non-null float64
-    Self-reported health                         37 non-null float64
-    Student skills                               37 non-null float64
-    Time devoted to leisure and personal care    37 non-null float64
-    Voter turnout                                37 non-null float64
-    Water quality                                37 non-null float64
-    Years in education                           37 non-null float64
-    dtypes: float64(24)
-    memory usage: 7.2+ KB
+The data contains information from the 1990 California census. So although it may not help you with predicting current housing prices like the Zillow Zestimate dataset, it does provide an accessible introductory dataset for teaching people about the basics of machine learning.
 
-## GDP per capita
-### Source
-Dataset obtained from the IMF's website at: http://goo.gl/j1MSKe
+##### Content
+The data pertain to the houses found in a given California district, with some summary statistics about them based on the 1990 census data. Be warned: the data aren't cleaned, so some preprocessing steps are required. The columns are listed below; their names are largely self-explanatory:
 
-### Data description
+1. `longitude`: A measure of how far west a house is; California longitudes are negative, so a lower (more negative) value is farther west
+2. `latitude`: A measure of how far north a house is; a higher value is farther north
+3. `housing_median_age`: Median age of a house within a block; a lower number is a newer building
+4. `total_rooms`: Total number of rooms within a block
+5. `total_bedrooms`: Total number of bedrooms within a block
+6. `population`: Total number of people residing within a block
+7. `households`: Total number of households (groups of people residing within a home unit) within a block
+8. `median_income`: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
+9. `median_house_value`: Median house value for households within a block (measured in US Dollars)
+10. `ocean_proximity`: Location of the house with respect to the ocean or sea
 
-    Int64Index: 190 entries, 0 to 189
-    Data columns (total 7 columns):
-    Country                          190 non-null object
-    Subject Descriptor               189 non-null object
-    Units                            189 non-null object
-    Scale                            189 non-null object
-    Country/Series-specific Notes    188 non-null object
-    2015                             187 non-null float64
-    Estimates Start After            188 non-null float64
-    dtypes: float64(2), object(5)
-    memory usage: 11.9+ KB
+Here, the dependent variable is `median_house_value`. The dataset contains 20,640 records about housing in California from the 1990 census.
 
 ### Example usage with Python pandas
 
-    >>> gdp_per_capita = pd.read_csv(
-    ...     datapath+"gdp_per_capita.csv", thousands=',', delimiter='\t',
-    ...     encoding='latin1', na_values="n/a", index_col="Country")
+    >>> cal_housing = pd.read_csv(datapath + "housing.csv")
     ...
-    >>> gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
+
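The README's warning that the data need preprocessing can be made concrete with a short sketch. This is one possible approach, not the method used in the repository. `prepare_housing` is a hypothetical helper, and the sample rows are illustrative stand-ins for `housing.csv` so the snippet runs without the file. Median imputation and one-hot encoding are assumptions chosen for brevity.

```python
import io

import pandas as pd

# Illustrative stand-in for housing.csv: same columns, made-up rows,
# with one missing total_bedrooms value to exercise the cleaning step.
SAMPLE_CSV = """longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-122.22,37.86,21.0,7099.0,,2401.0,1138.0,8.3014,358500.0,NEAR BAY
-119.67,36.33,19.0,1241.0,244.0,850.0,237.0,2.9375,81700.0,INLAND
"""


def prepare_housing(df):
    """Hypothetical helper: fill gaps, encode the category, split off the target."""
    df = df.copy()
    # Fill missing numeric values with each column's median (an assumed strategy).
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    # One-hot encode the only text column.
    df = pd.get_dummies(df, columns=["ocean_proximity"])
    # Separate the dependent variable from the features.
    y = df.pop("median_house_value")
    return df, y


# With the real file this would be: pd.read_csv(datapath + "housing.csv")
cal_housing = pd.read_csv(io.StringIO(SAMPLE_CSV))
X, y = prepare_housing(cal_housing)
print(X.isna().sum().sum())  # count of remaining missing values
```

Returning the features and the target separately keeps `median_house_value` out of the model inputs, which matches the README's description of it as the dependent variable.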