Skip to content

Creating database for a new city

Michael Saugstad edited this page Sep 6, 2024 · 39 revisions

This is intended to be a nearly comprehensive guide for how to download a new city's data from OpenStreetMap, transform it to be in the format we need for Project Sidewalk, and put it into a fresh database. It is a work in progress, and is only based off of my (Mikey's) experience creating the database for Newberg, OR; Columbus, OH; and Mexico City thus far.

Prepare the road network using QGIS

Downloading road network data

We typically use https://extract.bbbike.org/ to download the data from OpenStreetMap. Just make a bounding box that encompasses the entire city (but try to make it as small as possible while still getting the whole city). You can see the boundaries of a city using lots of different tools, including Google Maps. I generally use OSM XML 7z, though the file can be quite large for bigger cities. An ESRI Shapefile is more compact, but I usually find weird errors in Shapefiles from bbbike. Right after importing the street data into QGIS, I usually export it again as a Shapefile locally to improve performance. Note that on the bbbike status page you can get a link back to your map so that you can download it in a different format (really useful if you carefully sculpted around city boundaries).

If the city is too large to use bbbike, another option that accepts a wider range would be the HOT Export Tool. You can make a many-sided polygon to cover the area you're looking for. Or you can import a geojson file containing a polygon that you want to query for.

The HOT Export Tool also has a limit, so you might have to learn how to use the Overpass API for a very large area.

Downloading data on city limits / neighborhoods

  • First try and find the data available for download on the city's website. This was super easy for DC, as they have a big open data initiative, but other places (especially smaller places like Newberg, OR) may not have that data readily available for download online. If you can't find it online, it is best to contact the city to ask for this information, because I can't imagine how they wouldn't have those files somewhere.
  • If you need to get in touch with the city, you can try and find the boundary data elsewhere on the internet in the meantime. This is the site I've used for this: https://osm-boundaries.com/. I've found that the boundary data can also be out of date (can cross-reference with what the boundary looks like in Google Maps). If the boundary in this open dataset seems to match up with Google Maps exactly, I wouldn't worry so much about getting the exact data from the city government.

Install QGIS

https://www.qgis.org/en/site/forusers/download.html

Importing a shapefile into QGIS

Go to Layer -> Add layer -> Add vector layer, and find the .shp file

Make sure projection is EPSG:4326, WGS 84

  • Check the CRS (coordinate reference system) by going to the properties for each layer you add and checking the Information tab
  • If the CRS is not EPSG:4326, WGS 84, exit the properties, right click on the layer and choose Export -> Save Features As... Pick the correct CRS there. This will tranform the shapefile and save a new version in the correct CRS.

Create neighborhoods if we have none, or split up neighborhoods if they are too large

You'll only need to do this if you can't track down a dataset with neighborhoods for the city or if the neighborhoods you have turn out to be way too large (in the sum of the lengths of the streets). For neighborhoods that are too large, I typically wait until I get to the end of the process, then check the sum of the lengths of the streets in the neighborhood. Then I will start from the beginning, splitting up the neighborhoods that are too large on the 2nd attempt. I start over just to prevent us from splitting up neighborhoods that we don't need to. Click the arrow to expand if you need to do this.
  • Install the digitizing tools plugin
  • Make sure you can snap vertices together: Project -> Snapping Options -> set units to degrees & tolerance to ~0.001
  • Toggle editing for the polygons you want to split up
  • Click on "Split Features" in the Digitizing toolbar
  • To do the split, left click either on a vertex of the polygon (you know it's on a vertex because a purple box shows up on the cursor) or fully outside the polygon. Then continue to left click to add more vertices to a linestring that indicates where the cut should be.
  • After adding your final vertex (again on a vertex of the polygon or completely outside the polygon), you can right click to complete the cut.
  • You should now be able to see that the polygon has been split in two, both with identical attributes in the attributes table. If the neighborhood was called "Blah", then you'll probably want to rename the two parts into something like "Blah North" and "Blah South".
  • If you want to split the polygon into more parts, you can continue to do so! Just follow the same process again to split one of those smaller polygons into two even smaller ones.

Create a new unique index for the regions

  • Go to the properties for the region layer -> Fields -> Field Calculator
  • Give the output field the name region_id, choose row number to output and click Ok.

Import the street data

If it is in shapefile format, you should be able to just import the roads.shp file. If it is in the OSM format, you should be able to select just the "lines" (shouldn't need the multilinestrings).

Remove streets outside the city

This isn't strictly necessary, but you will definitely want to do it for large cities (or if you just requested the road network for a very large bounding box). It will drastically improve the running time of the algorithm that splits streets at intersections. You can do this by selecting features by area and deleting them.

Just make sure to check every time you highlight some streets to verify that none of them are actually part of the road network of the city. A caveat: if a street only intersects a neighborhood polygon just a tiny bit near it's endpoint, it is actually best to just delete that street. That little tiny piece of a street would end up being a street that someone has to audit, and it is a bit of a confusing experience for the user.

Filter the streets

  • In QGIS, right click on the polylines layer and click on "Filter"
  • Add this line
    "highway" IN ('trunk', 'primary', 'secondary', 'tertiary', 'residential', 'unclassified', 'pedestrian', 'living_street')
    
  • If you also want to include alleys, you filter would look like this
    "highway" IN ('trunk', 'primary', 'secondary', 'tertiary', 'residential', 'unclassified', 'pedestrian', 'living_street')
        OR ("highway" = 'service' AND other_tags LIKE '%"service"=>"alley"%')
    
  • NOTE if you downloaded data in another format, the attribute might be called "type" instead of "highway".

Check that we aren't missing important streets

  • In QGIS, add a Google Maps layer by right-clicking on XYZ Tiles -> New Connection. Name it "Google Maps". The URL is https://mt0.google.com/vt/lyrs=m&x={x}&y={y}&z={z}. Set Max. Zoom Level to 19.
  • Make sure that the streets are overlaid on Google Maps, and look around for streets that are not in our road network that look like they are streets according to Google.
  • As you find such streets, take a look at those streets in GSV. If the streets are clearly alleyways that we wouldn't want users to audit, you can move on to another street.
  • Try to sample around the city and look for streets that seem like they would be large enough that they should be included. It's also helpful to look for areas where the streets that are included or not look sort of random.
  • TODO what exactly should we do in this situation? It might depend on the scope of the problem and size of the city. We might just manually include streets that we realize should be there. Maybe we go back and fix those streets in OSM and do the data import again later. Or maybe we just include all the "service road" or "alley" streets in cities where the data looks particularly messed up.

Split lines along intersections

  • Processing -> Toolbox -> QGIS geoalgorithms -> Vector overlay tools -> Split lines with lines (use roads for input and split layers)

Filter within city limits and apply region_id to streets

  • Go to Vector -> Geoprocessing tools -> Intersection
  • Input layer should be the newly split roads, overlay layer the neighborhood polygons

Make sure you have single-linestrings not multi-linestrings

Use the Processing->Toolbox->Multipart to Singleparts tool.

TODO (not currently doing this) Split lines based on their length; no longer than 250ft?

  • Get an estimate for how long 250 ft is in degrees (possibly using this site).
  • Processing -> Toolbox -> v.split.length

Create a new unique index for the streets

  • Go to the properties for the streets layer -> Fields -> Field Calculator
  • Give the output field the name road_id, choose row number to output and click Ok.

Create a blank schema in your db:

Make sure you have psql version 16.2. Our schema was created using that version psql. We may not be able to restore database dumps created using a vastly different version of psql.

From the root SidewalkWebpage directory, run the following (where new_city_name is underscore_separated):

make import-dump db=sidewalk_init
make ssh target=db
psql -U postgres -d sidewalk
ALTER SCHEMA sidewalk_init RENAME TO sidewalk_<new-city-name>;
CREATE USER sidewalk_<new-city-name>;
GRANT sidewalk TO sidewalk_<new-city-name>;
ALTER SCHEMA sidewalk_<new-city-name> OWNER TO sidewalk;

Put the data in the database

  • Connect your database to QGIS by going to Layer -> Add Layer -> Add PostGISLayers... Use "sidewalk" as the user and pw.
  • Go to Database -> DB Manager -> DB Manager -> PostGIS -> sidewalk -> sidewalk_<new-city-name>
  • Click on Import Layer/File, choose the layer with your roads, put it in the sidewalk_<new-city-name> schema, call it qgis_road, fill in the primary key with road_id, make sure to check "Do not promote to multi-part".
  • Click on Import Layer/File, choose the layer with your polygons, put it in the sidewalk_<new-city-name> schema, call it qgis_region, fill in the primary key region_id, but DO NOT check "Do not promote to multi-part".

Getting the database tables filled correctly

Now we take the data that we put into the database and reformat it and put it into the correct tables so that we can actually run Project Sidewalk using this database. This is all in the form of SQL queries. The best way to run these is to run the queries as the new sidewalk_<new-city-name> user, because that user will run queries on the correct schema by default. This way you don't need to manually specify the schema name every time a table is referenced.

  • Fill in the street_edge table using the qgis_road table (note that highway might be called type depending on data source).
    INSERT INTO street_edge (street_edge_id, geom, way_type, deleted, timestamp, x1, y1, x2, y2)
        SELECT road_id, geom, highway, FALSE, now(),
               ST_X(ST_StartPoint(geom)), ST_Y(ST_StartPoint(geom)), ST_X(ST_EndPoint(geom)), ST_Y(ST_EndPoint(geom))
        FROM qgis_road;
    
  • Fill in the osm_way_street_edge table to link streets to their original OSM ways.
    INSERT INTO osm_way_street_edge (osm_way_id, street_edge_id)
        SELECT CAST(osm_id AS INT), road_id FROM qgis_road;
    
  • Add the tutorial street edge from DC into the database. We are just giving it a new street edge id that should be added to the cityparams.conf file.
    INSERT INTO street_edge(street_edge_id, geom, x1, y1, x2, y2, way_type, deleted, timestamp)
        SELECT MAX(street_edge_id) + 1, '0102000020E6100000040000007C9E3F6D544453C00A2E56D460784340ECF7C43A554453C02B685A6265784340F29A5775564453C0C4D2C08F6A784340F73DEAAF574453C0B0CBF09F6E784340', -77.067653, 38.940455, -77.067852, 38.940876, 'tertiary', FALSE, '2015-11-17 04:20:19.46+00'
        FROM street_edge;
    
  • Fill in the region table using the qgis_region table. NOTE fill in the data_source column in a way that seems appropriate. The name column should hold the name of the neighborhood; you can always fix that manually later.
    INSERT INTO region (region_id, data_source, name, geom, deleted)
        SELECT region_id, '', region_name, geom, FALSE
        FROM qgis_region;
    
  • Fill in the street_edge_region table.
    • First fill in the streets for the new city
      INSERT INTO street_edge_region (street_edge_id, region_id)
          SELECT road_id, region_id
          FROM qgis_road;
      
    • Then fill in the tutorial street with an arbitrary region_id. This is a hack that probably needs to be updated somehow.
      INSERT INTO street_edge_region (street_edge_id, region_id)
          SELECT MAX(street_edge_id), 1
          FROM street_edge;
      
  • Initialize the street_edge_priority table. If starting your city with only a subset of the neighborhoods, you should mark the ones that are closed as deleted = TRUE in the region table, then run this. Again, make sure that the tutorial street is in an opened neighborhood!
    UPDATE street_edge
    SET deleted = TRUE
    FROM street_edge_region
    INNER JOIN region ON street_edge_region.region_id = region.region_id
    WHERE street_edge.street_edge_id = street_edge_region.street_edge_id
        AND region.deleted = TRUE;
    
    INSERT INTO street_edge_priority (street_edge_id, priority)
        SELECT street_edge_id, 1
        FROM street_edge
        WHERE deleted = FALSE;
    

Optional: Filter for streets in boundary separate from neighborhoods

We often start off working on just a subset of streets around transit stations, for example. In this case, we'll be given a different set of region boundaries for it, and we want to only reveal those streets.

  • Use the "extract by location" tool back in qgis. Extract features from your finished street layer, comparing to features from your given boundary. Check the "intersect" box.
  • Manually examine the streets that were included. Depending on what you're looking for, you might remove some streets that are mostly outside the boundary or all streets that are not fully in the boundary. Sometimes I'll use aerial imagery to help decide.
  • Add this layer of streets to your database as well. Call it qgis_road_filtered.
  • Mark all streets except those in your smaller boundary as deleted:
    UPDATE street_edge SET deleted = TRUE;
    
    UPDATE street_edge
    SET deleted = FALSE
    FROM qgis_road_filtered
    WHERE street_edge.street_edge_id = qgis_road_filtered.road_id;
    
  • Delete the entries in the street_edge_priority table for all deleted streets:
    DELETE FROM street_edge_priority
    USING street_edge
    WHERE street_edge_priority.street_edge_id = street_edge.street_edge_id
        AND deleted = TRUE;
    

Run the site for the first time!

  • In the docker-compose.yml, set the DATABASE_USER to sidewalk_<new-city-name> and the SIDEWALK_CITY_ID to the valid ID for any other city for now. Run make dev and npm start. The landing page should load successfully, but don't visit other pages yet.
  • Now you need to update some configs into the database. The entries that should be updated to see most of the functionality for the website are the following:
    • open_status: either fully or partially
    • city_center_lat/lng
    • southwest/northeast_boundary_lat/lng: I typically do +/-5 degrees from the center
    • tutorial_street_edge_id
    • excluded_tags: I would typically copy from the most similar city
  • Add the English name for the city in the conf/messages file. Add the state and country names if they are not already there. Then enter the city name into Google Translate to see if a specific entry should be added to the files for any other languages.
  • There are just a few configs to add in the conf/cityparams.conf file. Mostly just things like the URL, etc. Instructions for the Google Analytics ID are below, and can be skipped for now.
  • Update the SIDEWALK_CITY_ID in the docker-compose.yml to use the new one that you created in the cityparams.conf. Rerun make dev and npm start. Visit the landing page using an incognito window again. Everything should still be working, and your landing page map should actually show the city you're working with!
  • Create logins for both Mikey and Jon. Set those accounts to both be "Owner" in the database (again, don't visit anything but the landing page yet unless you've already removed streets with no imagery).
  • Delete the qgis_* table from the databases.

Removing streets with no imagery

  • Create a CSV from the following query, naming it street_edge_endpoints.csv. Note that if you are starting with only a subset of the neighborhoods, you probably want to filter for just those regions in this query.
    SELECT street_edge.street_edge_id, region_id, x1, y1, x2, y2, geom
    FROM street_edge
    INNER JOIN street_edge_region ON street_edge.street_edge_id = street_edge_region.street_edge_id;
    
  • Run the check_streets_for_imagery.py file (python2 check_streets_for_imagery.py). Depending on the city, this could take a very long time. You can leave it to run in the background. But it will occasionally fail to connect with the Google Maps servers and quit. Just restart the script and it will start from where it left off. When it is completely finished, you will have a CSV named streets_with_no_imagery.csv.
  • Mark the deleted column as TRUE in the street_edge table for all of the IDs listed in streets_with_no_imagery.csv.
    UPDATE street_edge
    SET deleted = TRUE
    WHERE street_edge_id IN ();
    
  • Delete the entries in the street_edge_priority table for all deleted streets:
    DELETE FROM street_edge_priority
    USING street_edge
    WHERE street_edge_priority.street_edge_id = street_edge.street_edge_id
        AND deleted = TRUE;
    
  • Force the region_completion table to update by running the query TRUNCATE TABLE region_completion;
  • Note that the total street distance (which is used to calculate percentage of the city that is complete on the landing page) does not update automatically when the streets are marked as deleted. If you are doing this on your dev environment, you can just restart the web server to invalidate the cache. If this is being done on a live server, you can use the 'Clear Play cache' button on the admin page.

If you want to start with a subset of neighborhoods

  • Figure out the region_ids of the regions you want to start with. For the example queries, I'll use regions 1, 2, and 3.
  • Make sure that tutorial street is in one of the regions that is not deleted by modifying it's entry in the street_edge_region table.
  • Mark the other regions as deleted (UPDATE region SET deleted = TRUE WHERE region_id NOT IN (1, 2, 3);).
  • Mark the streets outside those regions as deleted:
    UPDATE street_edge
    SET deleted = TRUE
    FROM street_edge_region
    WHERE street_edge.street_edge_id = street_edge_region.street_edge_id
        AND street_edge_region.region_id NOT IN (1, 2, 3);
    
  • Delete the entries in the street_edge_priority table for streets outside those regions:
    DELETE FROM street_edge_priority
    USING street_edge
    WHERE street_edge_priority.street_edge_id = street_edge.street_edge_id
        AND street_edge.deleted = TRUE;
    
  • Truncate the region_completion table (TRUNCATE TABLE region_completion;).

Final changes to config/cityparams.conf file and other configuration changes

  • You'll need to create a new Google Analytics ID for this city (if you're working on our team). Log in to Google Analytics, click on "Admin" in the bottom left, create a new account and call it "Project Sidewalk <new-city-name>". Try to copy the same setup parameters as for the other accounts. Once you create the account you should get a new ID that you can add to the cityparams.conf file. You'll need to make a separate property for the prod and test server. When you create a property, you'll then create a data stream; the data stream is what will give you your Google Analytics ID.
  • You'll need to log in to the Google Cloud Console, click on APIs & Services -> Credentials. Click on the main API key, and add the future URLs for the test and prod servers to the list of websites that can be used with the API key.
  • Once you've made the changes above, you can visit the audit page and test to make sure the entire site works.
  • Try to audit the tutorial street by going to /audit/street/<tutorial-street-id>. This is just to avoid end users having the DC tutorial street pop up for them.
  • Finally you'll then want to fill in any remaining params in the config table.
    • update-offset-hours can be set to any integer, we just use it to spread out work on the server.
    • All of the API page params will need to be filled in at some point. If you have time now, you can do some auditing in an area, manually run clustering, then center all the APIs over the data you just collected so that there is something to show. If in a hurry, at a to-do item to fix these later, once others have collected more data.
  • TODO check that the mission_type table looks correct, compare mission_type_id to current servers. Not sure if this is being set correctly, needs testing.

Get a server set up with this database

This is internal documentation for setting up the server at UW. We do not have experience setting up a dedicated server elsewhere, so we won't be able to assist with this step, unfortunately!

Start by creating a dump of the database: pg_dump -Fc -U sidewalk -d sidewalk -n '<new-schema-name>' > <new-dump-name>

Then message Matt or Jason asking to set up a new server for this city. Here is a template email: Hey Jason,

I believe we are ready to roll out a Seattle test server! Here are the specs I remember needing...

  • The URLs should be sidewalk-sea-test.cs.washington.edu and sidewalk-sea.cs.washington.edu.
  • We would like a redirect from sidewalk-seattle-test.cs.washington.edu / sidewalk-seattle.cs.washington.edu to the above.
  • I have attached a database dump to use.
  • The environment variable "SIDEWALK_CITY_ID" should be set to "seattle-wa".
  • The environment variable "ENV_TYPE" should be set to "test".

Setting up pano scraper (optional)

This is section is only necessary if the goal of setting up this database is to create a dataset for training computer vision models. You can skip if your goal is just to collect accessibility data through Project Sidewalk. Much of these instructions will need to be adapted for your own situation either way.

  • Make sure the production server is set up before setting up the pano scraper.
  • Log on to the sftp UW server (ask Mikey how to get on it) and create a new directory for the city.
  • In your own file system, create a file called log.csv with the same headers we have for the scraper logs for other cities. Make sure the file does not end in a newline with the command truncate -s -1 log.csv.
  • On the SFTP server with the panos, create a new directory sidewalk_panos/Panoramas/scrapes_dump_<city-name>.
  • Optionally, chmod 775 <new-dir-name> to match the others.
  • In that directory, run put /<local>/<path>/<to>/log.csv to upload that file to the remote server.
  • Log on to the EC2 Instance (ask Mikey how to get on it). Edit the /var/spool/cron/crontabs/ubuntu and add a new entry for the new city. Just copy one of the other commands and change the storage location on the sftp server and the new city's URL.
  • Change the hour that the scraper runs for each city (2nd column in the crontab file) to space them out evenly throughout the day.

Setting up Uptime Robot to ping servers (optional)

  • Log in to Uptime Robot
  • Click "Add New Monitor", choose the HTTP(s) monitor type at a 5 minute interval.
  • Set the URL to use https and set it to the signIn endpoint (e.g., https://sidewalk-pittsburgh-test.cs.washington.edu/signIn).
  • Share prod server alerts with Jon, Mikey, and the ML Lab, but share test server alerts with just Mikey.
Clone this wiki locally