Skip to content

masiod/data_crime

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 

Repository files navigation

Crime_data

about datasets

Columns in the two Datasets

1-Data Gathering

We fetch crime data from the Los Angeles Open Data Platform using the requests library in Python. Here is an example of how to retrieve and load the data into a Pandas DataFrame:

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import missingno as msno
from folium import plugins
from folium.plugins import HeatMap

plt.style.use('fivethirtyeight')
pd.set_option('display.max_rows', 500)

Dataset 1

about:

Crime Data from 2020 to Present:

Update 1/18/2024 - LAPD is facing issues with posting the Crime data, but we are taking immediate action to resolve the problem. We understand the importance of providing reliable and up-to-date information and are committed to delivering it.

As we work through the issues, we have temporarily reduced our updates from weekly to bi-weekly to ensure that we provide accurate information. Our team is actively working to identify and resolve these issues promptly.

We apologize for any inconvenience this may cause and appreciate your understanding. Rest assured, we are doing everything we can to fix the problem and get back to providing weekly updates as soon as possible.

This dataset reflects incidents of crime in the City of Los Angeles dating back to 2020. This data is transcribed from original crime reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0°, 0°). Address fields are only provided to the nearest hundred block in order to maintain privacy. This data is as accurate as the data in the database. Please note questions or concerns in the comments. Type: accessing APIs

Method: Gather data by accessing APIs from source data.lacity.org . The data was gathered using the "API Access" method from (https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8/data_preview) source.

data=requests.get('https://data.lacity.org/resource/2nrs-mtv8.json').json()

data_crime_2020_2024 = pd.DataFrame.from_dict(data)
#store the raw data in your local data store
data_crime_2020_2024.to_csv('data_crime_2024',index=False)
#retrive the data form my local device 
data_crime_2024=pd.read_csv('data_crime_2024')

Dataset 2

dataset Data Crime from 2010-2019 (data_cirme_2019):

about:

This dataset reflects incidents of crime in the City of Los Angeles from 2010 - 2019. This data is transcribed from original crime reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0°, 0°). Address fields are only provided to the nearest hundred block in order to maintain privacy. This data is as accurate as the data in the database. Please note questions or concerns in the comments.

Type: CSV File

Method: The data was gathered using the Download data manually method from (https://data.lacity.org/) source.

#2nd data gathering and loading method 
data_crime_2019=pd.read_csv('Crime_Data_from_2010_to_2019_20240118.csv')

Variables in both datasets:

Variable Description Type
DR_NO Division of Records Number: Official file number made up of a 2 digit year, area ID, and 5 digits. Plain Text
Date Rptd MM/DD/YYYY Date & Time
DATE OCC MM/DD/YYYY Date & Time
TIME OCC In 24 hour military time. Plain Text
AREA The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21. Plain Text
AREA NAME The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for. For example 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles. Plain Text
Rpt Dist No A four-digit code that represents a sub-area within a Geographic Area. All crime records reference the "RD" that it occurred in for statistical comparisons. Find LAPD Reporting Districts on the LA City GeoHub at this link Plain Text
Part 1-2 Number
Crm Cd Indicates the crime committed. (Same as Crime Code 1) Plain Text
Crm Cd Desc Defines the Crime Code provided. Plain Text
Mocodes Modus Operandi: Activities associated with the suspect in commission of the crime. See attached PDF for list of MO Codes in numerical order. Download MO Codes PDF Plain Text
Vict Age Two character numeric Plain Text
Vict Sex F - Female M - Male X - Unknown Plain Text
Vict Descent Descent Code: (A - Other Asian B - Black C - Chinese D - Cambodian F - Filipino G - Guamanian H - Hispanic/Latin/Mexican I - American Indian/Alaskan Native J - Japanese K - Korean L - Laotian O - Other P - Pacific Islander S - Samoan U - Hawaiian V - Vietnamese W - WhiteX - Unknown Z - Asian Indian) Plain Text
Premis Cd The type of structure, vehicle, or location where the crime took place. Number
Premis Desc Defines the Premise Code provided. Plain Text
Weapon Used Cd The type of weapon used in the crime. Plain Text
Weapon Desc Defines the Weapon Used Code provided. Plain Text
Status Status of the case. (IC is the default) Plain Text
Status Desc Defines the Status Code provided. Plain Text
Crm Cd 1 Indicates the crime committed. Crime Code 1 is the primary and most serious one. Crime Code 2, 3, and 4 are respectively less serious offenses. Lower crime class numbers are more serious. Plain Text
Crm Cd 2 May contain a code for an additional crime, less serious than Crime Code 1. Plain Text
Crm Cd 3 May contain a code for an additional crime, less serious than Crime Code 1. Plain Text
Crm Cd 4 May contain a code for an additional crime, less serious than Crime Code 1. Plain Text
LOCATION Street address of crime incident rounded to the nearest hundred block to maintain anonymity. Plain Text
Cross Street Cross Street of rounded Address Plain Text
LAT Latitude Number
LON Longtitude Number

2-Assess data

Quality Issue 1: completeness => Missing data both datasets

its completeness issue Quality issue there is a lot of missing values in both datasets : data_crime_2019 dataset : there is missing values in Crm Cd 2 ,Crm Cd 3 ,Crm Cd 4 most of the columns are missing values and vict_sex,vict_descent, weapon_used_cd ,weapon_desc more than 100 K missing values data_crime_2024 dataset : there is missing values incrm_cd_2 ,cross_street most of the columns are minssing values and weapon_used_cd , weapon_desc half of the columns are missing values.

Quality Issue 2: Uniqueness => duplicates values both datasets

its Uniqueness issue in data Quality issues there is 230021 rows duplicte depand in dr_no column in data_crime_2019 dataset.

Tidiness Issue 1: uppercase letters are used for column names in data_crime_2019 and lowercase letters are used for column names in data_crime_2024 we have to change data_crime_2019 columns name to lower case letters and change columns to snake_case so we can index it easliy.

Tidiness Issue 2:its Multiple variables are stored in one column

there is muliple variables in the same cell in one columns crm cd desc the crime description and grand and many crimes

3-Cleaning data

Clean the data to solve the 4 issues corresponding to data quality and tidiness found in the assessing step. Make sure you include justifications for your cleaning decisions.

After the cleaning for each issue, please use either the visually or programatical method to validate the cleaning was succesful. At this stage, you are also expected to remove variables that are unnecessary for your analysis and combine your datasets. Depending on your datasets, you may choose to perform variable combination and elimination before or after the cleaning stage. Your dataset must have at least 4 variables after combining the data.

Make copies of the datasets to ensure the raw dataframes

data_crime_2019_copy=data_crime_2019.copy()
data_crime_2024_copy=data_crime_2024.copy()

#drop columns thats contain 60% or above missing values and Mocodes column bc there is no data source about it

data_crime_2019_copy.drop(columns=['Weapon Used Cd','Weapon Desc','Crm Cd 2','Crm Cd 3','Crm Cd 4','Cross Street','Mocodes'],inplace=True)

#change the missing values and - to be X-unknowing

data_crime_2019_copy['Vict Sex'] = data_crime_2019_copy['Vict Sex'].replace([np.nan, '-'], 'X')
data_crime_2019_copy['Vict Descent'] = data_crime_2019_copy['Vict Descent'].replace([np.nan,'-','unknowing'], 'UN')

#exlude the result in Premis Desc thats contain missing values

data_crime_2019_copy=data_crime_2019_copy[data_crime_2019_copy['Premis Desc'].notnull()]

### Quality Issue 2: Accuracy issue

Apply the cleaning strategy

data_crime_2019_copy=data_crime_2019_copy[data_crime_2019_copy['Vict Age'] >= 1]
data_crime_2024_copy=data_crime_2024_copy[data_crime_2024_copy['vict_age'] >= 1]

Apply the cleaning strategy

We will change data_crime_2019 columns name from upper case into lower case by lambda func

data_crime_2019_copy.rename(columns=lambda x: x.lower(), inplace=True)
data_crime_2019_copy.columns = data_crime_2019_copy.columns.str.replace(' ', '_')

Tidiness Issue 2: FILL IN

def clean_and_transform(dataframe):
    # Perform the split and stack operation on 'crm_cd_desc'
    crm_cd_desc_split = dataframe['crm_cd_desc'].str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
    crm_cd_desc_split.name = 'crime'  # Renaming the resulting Series for clarity

    # Join the expanded 'crm_cd_desc' with the original DataFrame based on the index,
    # which contains 'dr_no' and any other additional columns
    dataframe = dataframe.drop(columns=['crm_cd_desc']).join(crm_cd_desc_split).reset_index(drop=True)

    # Change the new 'crime' column from being the last one to be the tenth column from the left
    columns = list(dataframe.columns)
    # First, remove the 'crime' column name
    columns.remove('crime')
    # Then insert 'crime' at the desired position
    columns.insert(9, 'crime')
    # Now, reindex the DataFrame with the new columns order
    dataframe = dataframe.reindex(columns=columns)
    return dataframe

# Assuming data_crime_2019_copy and data_crime_2024_copy are your DataFrames
data_crime_2019_copy = clean_and_transform(data_crime_2019_copy)
data_crime_2024_copy = clean_and_transform(data_crime_2024_copy)

#Remove unnecessary variables and combine datasets
data_crime_2019_copy.drop(columns=['crm_cd_1','lat','lon'],inplace=True)
data_crime_2024_copy.drop(columns=['crm_cd_1','crm_cd_2','lat','lon','mocodes','weapon_used_cd','weapon_desc','cross_street','crm_cd_3'],inplace=True)
data_crime_2019_copy.rename(columns={'area_': 'area'}, inplace=True)
data_crime_2019_copy.rename(columns={'part_1-2': 'part_1_2'}, inplace=True)

Update your data store

rom sqlalchemy import create_engine





cleaned_data_crime=pd.concat([data_crime_2019_copy,data_crime_2024_copy])
cleaned_data_crime.head()





#optimize cleaned data

cleaned_data_crime=cleaned_data_crime.astype({'time_occ':'int8','area':'int8','rpt_dist_no':'int8','part_1_2':'int8','crm_cd':'int8','vict_age':'int8','premis_cd':'int8'})
cleaned_data_crime.info()





#row data 
second_copy_2019=data_crime_2019.copy()
second_copy_2024=data_crime_2020_2024.copy()


# ### Saving Row data




second_copy_2019.rename(columns=lambda x: x.lower(), inplace=True)
second_copy_2019.columns = second_copy_2019.columns.str.replace(' ', '_')
#row data 
row_data=pd.concat([second_copy_2019,second_copy_2024])
#saving_raw_data
row_data.to_csv('row_data.csv',index=False)
# Save DataFrame to SQLite database
engine = create_engine('sqlite:///datacrime.db')  # Replace 'database.db' with your desired SQLite database file name
row_data.to_sql('row_data_2019_2024', engine, index=False, if_exists='replace') 


# ### saving Clean Data crime




#saving cleaned_data_crime to csv file
cleaned_data_crime.to_csv('clean_data_crime.csv',index=False)
#saving cleaned_data_crime to sqlite table
cleaned_data_crime.to_sql('cleaned_data_crime_2019_2024',engine,index=False,if_exists='replace')


About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published