Brief introduction to pandas and folium
Let’s download Vehicle Collisions in NYC dataset from kaggle. This data was collected by the NYPD and published by NYC OpenData. After downloading database.csv file, place it in the same folder as the jupyter notebook you want to work in.
To begin with, we will need to import pandas (to work with csv file), folium (to show locations on the map) and datetime (to convert data information into MM/DD/YYYY format). Then we will create a variable df (short for dataframe) and read the database.csv file into it. Implementing df.shape will give us a general overview of how much information we have. In this particular case, we have 477,732 rows and 29 columns. That’s a lot of data! Let’s take our time looking around.
Use df.columns to see a list of the column names. The vehicle collision dataset includes the following information:
- 'UNIQUE KEY' to identify each collision;
- 'DATE' and 'TIME' of the collision;
- Information about the location: 'BOROUGH', 'ZIP CODE', 'LATITUDE', 'LONGITUDE', 'LOCATION', 'ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME';
- Information about who was injured: 'PERSONS INJURED', 'PERSONS KILLED', 'PEDESTRIANS INJURED', 'PEDESTRIANS KILLED', 'CYCLISTS INJURED', 'CYCLISTS KILLED', 'MOTORISTS INJURED', 'MOTORISTS KILLED';
- Information about vehicles involved: 'VEHICLE 1 TYPE', 'VEHICLE 2 TYPE', 'VEHICLE 3 TYPE', 'VEHICLE 4 TYPE', 'VEHICLE 5 TYPE', 'VEHICLE 1 FACTOR', 'VEHICLE 2 FACTOR', 'VEHICLE 3 FACTOR', 'VEHICLE 4 FACTOR', 'VEHICLE 5 FACTOR'.
To see a sample of the data, use df.head(10) . The number 10 specifies the number of rows shown. If not specified, only 5 rows will be seen, and the maximum number of rows is 50. Additionally, only the first 10 and the last 10 columns will be shown by default. To see all columns, run pd.set_option(‘display.max_columns’, None).
df.describe() provides statistical information by column: count of items in a column, column mean, standard deviation, min, 25/50/75 percentile and max values.
While working with pandas dataset, it’s always a good idea to check for NaN (stands for not a number) values using df.isna().sum() or df.isna().mean().
Because we want to show the accident location on the map, we need latitude and longitude information; however, that information is missing for 121,132 rows. One way to deal with that is to remove those rows using df.dropna(subset = [‘LATITUDE’], inplace = True). Now if we run df.shape, the output is 356,600 rows and 29 columns.
Let’s run df.head(10) again and see what changed. Row #4 used to have NaN value in LATITUDE column and was removed.
Next, let’s review DATE column. df[‘DATE’] will give us a brief view of the column. We can see right away that it has information for 2015–2017 years.
df[‘DATE’][0] will return ‘01/01/2015’, the first row in the column.
And type(df[‘DATE’][0]) returns string, meaning 01/01/2015 is not actually written in the date format. Let’s change that by creating a new column df[‘DATE_formatted’] = pd.to_datetime(df[‘DATE’]). Now type(df[‘DATE_formatted’][0]) returns pandas._libs.tslibs.timestamps.Timestamp.
We can also create a new column containing only YEAR information by using df[‘DATE_YEAR’] = df[‘DATE_formatted’].dt.year.
Let’s see the breakdown of rows by year using df[‘DATE_YEAR’].value_counts()
That’s too much information for our purpose, so we can change it by removing years other than 2015. Let’s create a new dataset only for 2015 information: df_2015 = df[df[‘DATE_YEAR’] == 2015]. Running df_2015.shape shows us that we have 182,845 rows and 31 columns now.
The code for information above is:
Now we can show the information for 2015 accidents in NYC on the map. Let’s start with showing one location. Googling NYC location gives us 40.730610 latitude and -73.935242 longitude that we can add to folium.Map function. Then we can add a marker to it with radius = 1. The default color for the marker is blue; you can see it in Queens.
Now, let’s show location of all accidents on the map. Take PERSONS INJURED column and plot green markers for 0 people injured and red markers for more than 0 people injured.
Well the previous map didn’t look pretty, but keep in mind that it showed all accidents that have happened in 2015 calendar year. The good think about folium library is that it’s interactive, and we can zoom in and out of it as if it was a regular map application.
As another example, we can show accidents involving cyclists on the map. Blue dots show where 1 cyclist was injured, and orange dots show where more than 1 cyclists were injured.
That was a brief over view of pandas dataset and folium map. If you would like to look further into visualizing geospacial data, geopandas and geoplot libraries can be used.