Map Area: Denver, CO
I have always considered moving to Denver, so I decided to use the area for this project. I am interested in what the data will reveal about the area.
Problems Encountered in the Map
Outputting the values for various data types revealed many problems with the data. Some of the problems were formatting issues, and some were incorrect entries. I decided to focus on a few of these problems:
- Abbreviated street types (“Apache Dr”)
- Incorrect city names (“Centenn”)
- Cities with the state included (“Aurora, CO”)
- Cities in all caps or all lowercase
- Invalid phone numbers with fewer than 10 digits
- Different phone number formats
- Two numbers in the same entry (“303-759-0316;720-666-3971”)
Cleaning Street Names
Running a function in Python to return a set of all the street names revealed that quite a few street types had been abbreviated. Using regular expressions made it easy to find the street type at the end of each name. All of the abbreviated types were then substituted with the full street type from a dictionary of mapping values:
import re

# Match the last word of the street name, which holds the street type
# (assumed definition, since the regex is not shown above)
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

def update_name(name, mapping):
    # Replace an abbreviated street type with its full name
    m = street_type_re.search(name)
    if m:
        street_type = m.group()
        if street_type in mapping:
            name = re.sub(street_type_re, mapping[street_type], name)
    return name
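A quick usage sketch, with illustrative mapping entries (the full mapping dictionary is much longer):

mapping = {'Dr': 'Drive', 'St': 'Street', 'Ave': 'Avenue'}
print(update_name('Apache Dr', mapping))  # Apache Drive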
Cleaning City Names
Fixing the city name data used a solution similar to the street names. Using a regular expression to find every instance where "CO" was included, I was able to substitute it with an empty string. Some of the values were input as Unicode, so I had to encode those first. Then I could capitalize all the values and apply another mapping function to correct any incorrectly entered names:
import string

def updateCity(city, cityMapping):
    # Unicode values must be encoded before the string operations below
    if type(city) == unicode:
        city = city.encode('ascii', 'ignore')
    # Remove the state (and any trailing digits) at the end
    city = re.sub(r'(?i),? (co)*\d*$', '', city)
    # Normalize capitalization, then fix known incorrect names
    city = string.capwords(city)
    city = mapCities(city, cityMapping)
    return city
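A minimal sketch of the mapCities helper, assuming it is a simple dictionary lookup over known bad entries:

def mapCities(city, cityMapping):
    # Return the corrected name if this city is a known bad entry,
    # otherwise return the name unchanged
    return cityMapping.get(city, city)

cityMapping = {'Centenn': 'Centennial'}  # illustrative entry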
Once the city names were clean, I could run a query to observe the number of entries for each city.
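The queries in this report assume pandas and an open connection to the SQLite database from the file-size list below; a minimal setup sketch:

import sqlite3
import pandas as pd

# Connect to the SQLite database built from the OSM data
conn = sqlite3.connect('mapData.db')

With the connection in place, the city query: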
df1 = pd.read_sql_query('SELECT tags.value, COUNT(*) as count \
FROM (SELECT * FROM nodes_tags UNION ALL \
SELECT * FROM ways_tags) tags \
WHERE tags.key LIKE "city" \
GROUP BY tags.value \
ORDER BY count DESC;', conn)
This produced the following output for the first five entries:
value | count |
---|---|
Denver | 29522 |
Lafayette | 3684 |
Boulder | 3263 |
Aurora | 1915 |
Broomfield | 1005 |
Overall there were 80 unique cities in total. The bulk of the entries come from Denver, the largest city, but the data also covers quite a few neighboring towns in the area.
Cleaning Phone Numbers
The phone numbers required more extensive cleaning than the street and city names. I wanted all of the data to be in the same format so it would be easy to read. Looking through the data, some entries contained two phone numbers in a single value. By using a short regular expression:
findMultiples = re.compile(r'[\;]')
match = re.search(findMultiples, phone)
I was able to find any instances where two numbers were in the same entry and separate them, as sketched below. Then I decided that any number with fewer than 10 digits was not valid and shouldn't be included with the data, so the invalid numbers were removed.
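A minimal sketch of the separation step, assuming entries are simply split on the semicolon:

def splitMultiples(phone):
    # Split entries like "303-759-0316;720-666-3971" into two numbers
    if re.search(findMultiples, phone):
        return phone.split(';')
    return [phone]

To find numbers already in the format I wanted, I used another regular expression: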
correctFormat = re.compile(r'^(\+1) \(\d{3}\) \d{3}-\d{4}')
I could find any data that was already in the format I wanted and return it unchanged. If a number was not in this format, I stripped out every formatting character, leaving only the digits (and any leading plus sign). This made it easy to put each number into the same format with the following code:
def standardize(phone):
    # Strip formatting characters, leaving the digits (and any leading '+')
    phone = [item.replace('(', '').replace(')', '').replace(' ', '')
             .replace('-', '').replace('.', '')
             for item in phone]
    phone = "".join(phone)
    # Rewrite each number in the '+1 (XXX) XXX-XXXX' format
    if phone.startswith('+01'):
        phone = '+1 ' + '(' + str(phone[3:6]) + ')' + ' ' \
            + str(phone[6:9]) + '-' + str(phone[9:])
    elif phone.startswith('+1'):
        phone = '+1 ' + '(' + str(phone[2:5]) + ')' + ' ' \
            + str(phone[5:8]) + '-' + str(phone[8:])
    elif phone.startswith('1'):
        phone = '+1 ' + '(' + str(phone[1:4]) + ')' + ' ' \
            + str(phone[4:7]) + '-' + str(phone[7:])
    else:
        phone = '+1 ' + '(' + str(phone[:3]) + ')' + ' ' \
            + str(phone[3:6]) + '-' + str(phone[6:])
    return phone
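A few illustrative calls showing the resulting format (sample inputs built from a number that appears in the data above):

print(standardize('303.759.0316'))     # +1 (303) 759-0316
print(standardize('1-303-759-0316'))   # +1 (303) 759-0316
print(standardize('+1 303 759 0316'))  # +1 (303) 759-0316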
Data Overview
File Sizes
- denver-boulder_colorado.osm : 956 MB
- mapData.db : 79 MB
- nodes_tags.csv : 14.6 MB
- nodes.csv : 374.4 MB
- ways_nodes.csv : 119.8 MB
- ways_tags.csv : 60.8 MB
- ways.csv : 30.2 MB
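For reference, sizes like these can be reproduced with a small script (file names as listed above):

import os

files = ['denver-boulder_colorado.osm', 'mapData.db', 'nodes_tags.csv',
         'nodes.csv', 'ways_nodes.csv', 'ways_tags.csv', 'ways.csv']
for f in files:
    # Report each file's size in megabytes
    print('%s : %.1f MB' % (f, os.path.getsize(f) / 1000000.0))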
Number of Unique Users
df2 = pd.read_sql_query('SELECT COUNT(DISTINCT(x.uid)) \
FROM (SELECT uid FROM nodes UNION ALL \
SELECT uid FROM ways) x;', conn)
print df2
2294
Number of Nodes
df3 = pd.read_sql_query('SELECT COUNT(*) FROM nodes;', conn)
print df3
4360962
Number of Ways
df4 = pd.read_sql_query('SELECT COUNT(*) FROM ways;', conn)
print df4
487537
Top 10 Contributors
df5 = pd.read_sql_query('SELECT x.user, COUNT(*) as count \
FROM (SELECT user FROM nodes UNION ALL \
SELECT user FROM ways) x \
GROUP BY x.user \
ORDER BY count DESC \
LIMIT 10;', conn)
print df5
user | count |
---|---|
chachafish | 823201 |
Your Village Maps | 734747 |
woodpeck_fixbot | 340630 |
GPS_dr | 313217 |
jjyach | 309141 |
DavidJDBA | 186559 |
Stevestr | 170228 |
CornCO | 156123 |
russdeffner | 124346 |
Berjoh | 84235 |
Number of Users with Fewer than 10 Posts
df6 = pd.read_sql_query('SELECT COUNT(*) \
FROM (SELECT x.user, COUNT(*) as num \
FROM (SELECT user FROM nodes UNION ALL \
SELECT user FROM ways) x \
GROUP BY x.user \
HAVING num < 10);', conn)
print df6
950
Additional Statistics
With a total of 4,848,499 entries, the top user chachafish contributed approximately 17%. At first glance the contributions might seem decently distributed, but despite the large number of unique users, the bulk of the entries belongs to the top 10 users, who account for nearly 67% of the total. That leaves roughly 2,300 other users contributing comparatively little; 950 of the 2,294 users, over 40%, have fewer than 10 posts.
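A quick sanity check of these figures from the query results above:

total = 4360962 + 487537  # nodes + ways = 4848499
top10 = [823201, 734747, 340630, 313217, 309141,
         186559, 170228, 156123, 124346, 84235]
print(823201.0 / total)           # ~0.17, chachafish's share
print(sum(top10) / float(total))  # ~0.67, top 10 share
print(950.0 / 2294)               # ~0.41, users with fewer than 10 posts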
I play a good amount of video games in my spare time, and to me the top 10 contributors table looks very much like a leaderboard in a video game. To give users an incentive to contribute more, it could be enough to display that table publicly somewhere on the OpenStreetMap site. Competition typically drives people to aim for the top, even if it is just a list of who the current top contributors are.
Ideas for Improvement
Building on the leaderboard idea from the top-user query, it may be useful to combine the data cleaning code with an actual video game. The first one that comes to mind is the popular Pokemon Go app that so many people use. If the developers added a “Where did you find this Pokemon?” option when players catch certain Pokemon, the answers could be linked to the open map data, and players could earn points as an incentive to explore new areas. There are advantages and disadvantages to this, however:
Advantages | Disadvantages |
---|---|
Create content very quickly | Possible wrong data |
Data from a wide range of areas | May drain a phone's battery quicker |
More users instead of bots | Areas without players would lack data |
Top 10 Amenities
df7 = pd.read_sql_query('SELECT tags.value, COUNT(*) as num \
FROM (SELECT * FROM nodes_tags UNION ALL \
SELECT * FROM ways_tags) tags \
WHERE key = "amenity" \
GROUP BY tags.value \
ORDER BY num DESC \
LIMIT 10;', conn)
print df7
value | num |
---|---|
parking | 16809 |
restaurant | 2294 |
school | 1507 |
fast_food | 1075 |
bench | 963 |
place_of_worship | 959 |
bicycle_parking | 919 |
fuel | 752 |
shelter | 622 |
bank | 543 |
Top 5 Religions
df8 = pd.read_sql_query('SELECT nodes_tags.value, COUNT(*) as num \
FROM nodes_tags \
WHERE nodes_tags.key="religion" \
GROUP BY nodes_tags.value \
ORDER BY num DESC \
LIMIT 5;', conn)
print df8
value | num |
---|---|
Christian | 573 |
Jewish | 9 |
Buddhist | 4 |
Muslim | 3 |
Multifaith | 2 |
Streets to live on if you like Mexican food
df9 = pd.read_sql_query('SELECT nodes_tags.value, COUNT(*) as num \
FROM nodes_tags \
JOIN (SELECT DISTINCT(id) FROM nodes_tags \
WHERE value = "mexican") x \
ON nodes_tags.id = x.id \
WHERE nodes_tags.key="street" \
GROUP BY nodes_tags.value \
ORDER BY num DESC \
LIMIT 5;', conn)
print df9
value | num |
---|---|
East Colfax Avenue | 14 |
East Hampden Avenue | 5 |
Colorado Boulevard | 4 |
East Mississippi Avenue | 4 |
South Parker Road | 4 |
Streets to live on if you like Beer
df10 = pd.read_sql_query('SELECT nodes_tags.value, COUNT(*) as num \
FROM nodes_tags \
JOIN (SELECT DISTINCT(id) FROM nodes_tags \
WHERE value = "pub") x \
ON nodes_tags.id = x.id \
WHERE nodes_tags.key="street" \
GROUP BY nodes_tags.value \
ORDER BY num DESC \
LIMIT 5;', conn)
print df10
value | num |
---|---|
Market Street | 8 |
Blake Street | 6 |
Tennyson Street | 4 |
East Iliff Avenue | 3 |
15th Street | 2 |
Conclusion
After cleaning and reviewing the data, it is apparent that a large amount of data could still use formatting. Further cleaning of the phone numbers could include separating extensions from the actual phone numbers. Abbreviations at the beginning of street names were not taken into consideration within the scope of this project; they should be the next focus for future improvements. Most, if not all, of the top users appear to be programmed bots. If the bots' programming were merged with the code for this project, the data could be entered correctly in the first place, though that introduces the difficulty of getting all the bots to use the same format. Overall, the cleaned data proved useful for the scope of this project. I believe that if I ever do decide to move to Denver, I will most likely be living on Market Street.