about datasets
Columns in the two Datasets
We fetch crime data from the Los Angeles Open Data Platform using the requests
library in Python. Here is an example of how to retrieve and load the data into a Pandas DataFrame:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import missingno as msno
from folium import plugins
from folium.plugins import HeatMap
plt.style.use('fivethirtyeight')
pd.set_option('display.max_rows', 500)
Dataset 1
about:
Crime Data from 2020 to Present:
Update 1/18/2024 - LAPD is facing issues with posting the Crime data, but we are taking immediate action to resolve the problem. We understand the importance of providing reliable and up-to-date information and are committed to delivering it.
As we work through the issues, we have temporarily reduced our updates from weekly to bi-weekly to ensure that we provide accurate information. Our team is actively working to identify and resolve these issues promptly.
We apologize for any inconvenience this may cause and appreciate your understanding. Rest assured, we are doing everything we can to fix the problem and get back to providing weekly updates as soon as possible.
This dataset reflects incidents of crime in the City of Los Angeles dating back to 2020. This data is transcribed from original crime reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0°, 0°). Address fields are only provided to the nearest hundred block in order to maintain privacy. This data is as accurate as the data in the database. Please note questions or concerns in the comments. Type: accessing APIs
Method: Gather data by accessing APIs from source data.lacity.org . The data was gathered using the "API Access" method from (https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8/data_preview) source.
data=requests.get('https://data.lacity.org/resource/2nrs-mtv8.json').json()
data_crime_2020_2024 = pd.DataFrame.from_dict(data)
#store the raw data in your local data store
data_crime_2020_2024.to_csv('data_crime_2024',index=False)
#retrive the data form my local device
data_crime_2024=pd.read_csv('data_crime_2024')
Dataset 2
dataset Data Crime from 2010-2019 (data_cirme_2019):
about:
This dataset reflects incidents of crime in the City of Los Angeles from 2010 - 2019. This data is transcribed from original crime reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0°, 0°). Address fields are only provided to the nearest hundred block in order to maintain privacy. This data is as accurate as the data in the database. Please note questions or concerns in the comments.
Type: CSV File
Method: The data was gathered using the Download data manually method from (https://data.lacity.org/) source.
#2nd data gathering and loading method
data_crime_2019=pd.read_csv('Crime_Data_from_2010_to_2019_20240118.csv')
Variable | Description | Type |
---|---|---|
DR_NO | Division of Records Number: Official file number made up of a 2 digit year, area ID, and 5 digits. | Plain Text |
Date Rptd | MM/DD/YYYY | Date & Time |
DATE OCC | MM/DD/YYYY | Date & Time |
TIME OCC | In 24 hour military time. | Plain Text |
AREA | The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21. | Plain Text |
AREA NAME | The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for. For example 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles. | Plain Text |
Rpt Dist No | A four-digit code that represents a sub-area within a Geographic Area. All crime records reference the "RD" that it occurred in for statistical comparisons. Find LAPD Reporting Districts on the LA City GeoHub at this link | Plain Text |
Part 1-2 | Number | |
Crm Cd | Indicates the crime committed. (Same as Crime Code 1) | Plain Text |
Crm Cd Desc | Defines the Crime Code provided. | Plain Text |
Mocodes | Modus Operandi: Activities associated with the suspect in commission of the crime. See attached PDF for list of MO Codes in numerical order. Download MO Codes PDF | Plain Text |
Vict Age | Two character numeric | Plain Text |
Vict Sex | F - Female M - Male X - Unknown | Plain Text |
Vict Descent | Descent Code: (A - Other Asian B - Black C - Chinese D - Cambodian F - Filipino G - Guamanian H - Hispanic/Latin/Mexican I - American Indian/Alaskan Native J - Japanese K - Korean L - Laotian O - Other P - Pacific Islander S - Samoan U - Hawaiian V - Vietnamese W - WhiteX - Unknown Z - Asian Indian) | Plain Text |
Premis Cd | The type of structure, vehicle, or location where the crime took place. | Number |
Premis Desc | Defines the Premise Code provided. | Plain Text |
Weapon Used Cd | The type of weapon used in the crime. | Plain Text |
Weapon Desc | Defines the Weapon Used Code provided. | Plain Text |
Status | Status of the case. (IC is the default) | Plain Text |
Status Desc | Defines the Status Code provided. | Plain Text |
Crm Cd 1 | Indicates the crime committed. Crime Code 1 is the primary and most serious one. Crime Code 2, 3, and 4 are respectively less serious offenses. Lower crime class numbers are more serious. | Plain Text |
Crm Cd 2 | May contain a code for an additional crime, less serious than Crime Code 1. | Plain Text |
Crm Cd 3 | May contain a code for an additional crime, less serious than Crime Code 1. | Plain Text |
Crm Cd 4 | May contain a code for an additional crime, less serious than Crime Code 1. | Plain Text |
LOCATION | Street address of crime incident rounded to the nearest hundred block to maintain anonymity. | Plain Text |
Cross Street | Cross Street of rounded Address | Plain Text |
LAT | Latitude | Number |
LON | Longtitude | Number |
Quality Issue 1: completeness => Missing data both datasets
its completeness issue Quality issue there is a lot of missing values in both datasets : data_crime_2019 dataset : there is missing values in Crm Cd 2 ,Crm Cd 3 ,Crm Cd 4 most of the columns are missing values and vict_sex,vict_descent, weapon_used_cd ,weapon_desc more than 100 K missing values data_crime_2024 dataset : there is missing values incrm_cd_2 ,cross_street most of the columns are minssing values and weapon_used_cd , weapon_desc half of the columns are missing values.
Quality Issue 2: Uniqueness => duplicates values both datasets
its Uniqueness issue in data Quality issues there is 230021 rows duplicte depand in dr_no column in data_crime_2019 dataset.
Tidiness Issue 1: uppercase letters are used for column names in data_crime_2019 and lowercase letters are used for column names in data_crime_2024 we have to change data_crime_2019 columns name to lower case letters and change columns to snake_case so we can index it easliy.
Tidiness Issue 2:its Multiple variables are stored in one column
there is muliple variables in the same cell in one columns crm cd desc the crime description and grand and many crimes
Clean the data to solve the 4 issues corresponding to data quality and tidiness found in the assessing step. Make sure you include justifications for your cleaning decisions.
After the cleaning for each issue, please use either the visually or programatical method to validate the cleaning was succesful. At this stage, you are also expected to remove variables that are unnecessary for your analysis and combine your datasets. Depending on your datasets, you may choose to perform variable combination and elimination before or after the cleaning stage. Your dataset must have at least 4 variables after combining the data.
Make copies of the datasets to ensure the raw dataframes
data_crime_2019_copy=data_crime_2019.copy()
data_crime_2024_copy=data_crime_2024.copy()
#drop columns thats contain 60% or above missing values and Mocodes column bc there is no data source about it
data_crime_2019_copy.drop(columns=['Weapon Used Cd','Weapon Desc','Crm Cd 2','Crm Cd 3','Crm Cd 4','Cross Street','Mocodes'],inplace=True)
#change the missing values and - to be X-unknowing
data_crime_2019_copy['Vict Sex'] = data_crime_2019_copy['Vict Sex'].replace([np.nan, '-'], 'X')
data_crime_2019_copy['Vict Descent'] = data_crime_2019_copy['Vict Descent'].replace([np.nan,'-','unknowing'], 'UN')
#exlude the result in Premis Desc thats contain missing values
data_crime_2019_copy=data_crime_2019_copy[data_crime_2019_copy['Premis Desc'].notnull()]
data_crime_2019_copy=data_crime_2019_copy[data_crime_2019_copy['Vict Age'] >= 1]
data_crime_2024_copy=data_crime_2024_copy[data_crime_2024_copy['vict_age'] >= 1]
We will change data_crime_2019 columns name from upper case into lower case by lambda func
data_crime_2019_copy.rename(columns=lambda x: x.lower(), inplace=True)
data_crime_2019_copy.columns = data_crime_2019_copy.columns.str.replace(' ', '_')
def clean_and_transform(dataframe):
# Perform the split and stack operation on 'crm_cd_desc'
crm_cd_desc_split = dataframe['crm_cd_desc'].str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
crm_cd_desc_split.name = 'crime' # Renaming the resulting Series for clarity
# Join the expanded 'crm_cd_desc' with the original DataFrame based on the index,
# which contains 'dr_no' and any other additional columns
dataframe = dataframe.drop(columns=['crm_cd_desc']).join(crm_cd_desc_split).reset_index(drop=True)
# Change the new 'crime' column from being the last one to be the tenth column from the left
columns = list(dataframe.columns)
# First, remove the 'crime' column name
columns.remove('crime')
# Then insert 'crime' at the desired position
columns.insert(9, 'crime')
# Now, reindex the DataFrame with the new columns order
dataframe = dataframe.reindex(columns=columns)
return dataframe
# Assuming data_crime_2019_copy and data_crime_2024_copy are your DataFrames
data_crime_2019_copy = clean_and_transform(data_crime_2019_copy)
data_crime_2024_copy = clean_and_transform(data_crime_2024_copy)
#Remove unnecessary variables and combine datasets
data_crime_2019_copy.drop(columns=['crm_cd_1','lat','lon'],inplace=True)
data_crime_2024_copy.drop(columns=['crm_cd_1','crm_cd_2','lat','lon','mocodes','weapon_used_cd','weapon_desc','cross_street','crm_cd_3'],inplace=True)
data_crime_2019_copy.rename(columns={'area_': 'area'}, inplace=True)
data_crime_2019_copy.rename(columns={'part_1-2': 'part_1_2'}, inplace=True)
Update your data store
rom sqlalchemy import create_engine
cleaned_data_crime=pd.concat([data_crime_2019_copy,data_crime_2024_copy])
cleaned_data_crime.head()
#optimize cleaned data
cleaned_data_crime=cleaned_data_crime.astype({'time_occ':'int8','area':'int8','rpt_dist_no':'int8','part_1_2':'int8','crm_cd':'int8','vict_age':'int8','premis_cd':'int8'})
cleaned_data_crime.info()
#row data
second_copy_2019=data_crime_2019.copy()
second_copy_2024=data_crime_2020_2024.copy()
# ### Saving Row data
second_copy_2019.rename(columns=lambda x: x.lower(), inplace=True)
second_copy_2019.columns = second_copy_2019.columns.str.replace(' ', '_')
#row data
row_data=pd.concat([second_copy_2019,second_copy_2024])
#saving_raw_data
row_data.to_csv('row_data.csv',index=False)
# Save DataFrame to SQLite database
engine = create_engine('sqlite:///datacrime.db') # Replace 'database.db' with your desired SQLite database file name
row_data.to_sql('row_data_2019_2024', engine, index=False, if_exists='replace')
# ### saving Clean Data crime
#saving cleaned_data_crime to csv file
cleaned_data_crime.to_csv('clean_data_crime.csv',index=False)
#saving cleaned_data_crime to sqlite table
cleaned_data_crime.to_sql('cleaned_data_crime_2019_2024',engine,index=False,if_exists='replace')