Project 4: Building a Data Website

Corrections / Clarifications

none yet

Handin

When you're done, you'll hand in a .zip file containing main.py, main.csv, and any .html files necessary (basically whatever we need to run your website).

You can create a zip file from the terminal. Let's say you're already in a directory named p4. You can run this to create a compressed p4.zip file alongside your directory:

zip ../p4.zip main.py main.csv *.html

Important: make sure your program is named main.py (we've been more flexible about this in the past, but naming it something else causes problems for us now that it's in a zip).

Overview

In this project, you'll build a website for sharing a dataset. You get to pick the dataset (more on possible data sources later)!

You'll use the flask framework for the site, which will have the following features: (1) multiple plots on the home page, (2) a page for browsing through the table behind the plots, (3) a link to a donation page that is optimized via A/B testing, (4) a subscribe button that only accepts valid email addresses, and (5) robots.txt and 429 responses discouraging access to the browse page.

Your .py file may be short, perhaps <100 lines, but it will probably take a fair bit of time to get those lines right.

Setup

First, install some things:

pip3 install Flask lxml html5lib beautifulsoup4

Group Part (75%)

Data

You get to choose the dataset for this project. Find a CSV you like somewhere, then download it as a file named main.csv.

The file should have between 10 and 1000 rows and between 3 and 15 columns. Feel free to drop rows/columns from your original data source if necessary.

Mandatory: Leave a comment in your main.py about the source of your data.

Two good places to check while looking for a dataset are Kaggle and Google's Dataset Search.

Pages

Your web application should have three pages:

index.html
browse.html
donate.html

We have some requirements about what is on these, but you have quite a bit of creative freedom for this project.

To get started, consider creating a minimal index.html file:

<html>
  <body>
    <h1>Welcome!</h1>

    <p>Enjoy the data.</p>
  </body>
</html>

Then create a simple flask app in main.py with a route for the homepage that loads index.html:

import re  # needed later by the /email handler
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
# df = pd.read_csv("main.csv")

@app.route('/')
def home():
    with open("index.html") as f:
        html = f.read()

    return html

if __name__ == '__main__':
    app.run(host="0.0.0.0", debug=True, threaded=False) # don't change this line!

# NOTE: app.run never returns (it runs forever, unless you kill the process)
# Thus, don't define any functions after the app.run call, because it will
# never get that far.

Try launching your application by running python3 main.py:

trh@instance-1:~/p4$ python3 main.py
 * Serving Flask app "main" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)

This program runs indefinitely, until you kill it with CTRL+C (meaning press CTRL and C at the same time). Open your web browser and go to http://your-ip:5000 to see your page ("your-ip" is the IP you use to SSH to your VM).

Requirements:

Going to http://your-ip:port/browse.html should return the content for browse.html, and similarly for the other pages.
The index.html page should have hyperlinks to all the other pages. Be sure not to include your IP in these links! Relative paths are necessary to pass our tests.
You should put whatever content you think makes sense on the pages. Just make sure that they all start with an <h1> heading, giving the page a title.

Browse

The browse.html page should show an HTML table with all the data from main.csv. Don't truncate the table (meaning we want to see all the rows). Don't have any other tables on this page, so as not to confuse our tester.

The page might look something like this:

Hint 1: you don't necessarily need to have an actual browse.html file just because there's a browse.html page. For example, here's a hi.html page without a corresponding hi.html file:

@app.route('/hi.html')
def hi_handler():
    return "howdy!"

For browse, instead of returning a hardcoded string, you'll need to generate a string containing HTML code for the table, then return that string. For example, "<html>{}</html>".format("hello") would insert "hello" into the middle of a string containing HTML code.

Hint 2: look into _repr_html_ for DataFrames (or possibly to_html()).
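To make Hint 1 concrete, here is a minimal sketch that builds the table string by hand using only the standard library. The helper name csv_to_html_table and the two-row sample data are made up for illustration. In your project, the DataFrame methods from Hint 2 can produce the table markup for you.

```python
import csv
import html
import io


def csv_to_html_table(csv_text):
    """Turn CSV text into an HTML <table> string, escaping every cell."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    parts = ["<table>"]
    # The first CSV row becomes the header row.
    parts.append("<tr>" + "".join(f"<th>{html.escape(c)}</th>" for c in header) + "</tr>")
    # Every remaining row becomes one table row (no truncation).
    for row in body:
        parts.append("<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>")
    parts.append("</table>")
    return "\n".join(parts)


# Made-up sample data standing in for the contents of main.csv.
sample = "name,value\na,1\nb,2\n"
page = "<html><body><h1>Browse</h1>{}</body></html>".format(csv_to_html_table(sample))
```

A handler for the browse.html route could return a string built like page.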

Emails

There should be a button on your site that allows people to share their email with you to get updates about changes to the data:

When the button is clicked, some JavaScript code will run that does the following:

pops up a box asking the user for their email
sends the email to your flask application
depending on how your flask application responds, the JavaScript will either tell the user "thanks" or show an error message of your choosing

We'll give you the HTML+JavaScript parts, since we haven't taught that in class.

Add the following <head> code to your index.html, before the <body> code:

  <head>
    <script src="https://code.jquery.com/jquery-3.4.1.js"></script>
    <script>
      function subscribe() {
        var email = prompt("What is your email?", "????");

        $.post({
          type: "POST",
          url: "email",
          data: email,
          contentType: "application/text; charset=utf-8",
          dataType: "json"
        }).done(function(data) {
          alert(data);
        }).fail(function(data) {
          alert("POST failed");
        });
      }
    </script>
  </head>

Then, in the main body of the HTML, add this code for the button somewhere:

<button onclick="subscribe()">Subscribe</button>

Whenever the user clicks that button and submits an email, it will POST the data to the /email route in your app, so add that to your main.py:

@app.route('/email', methods=["POST"])
def email():
    email = str(request.data, "utf-8")
    if re.match(r"????", email): # 1
        with open("emails.txt", "a") as f: # open file in append mode
            f.????(email + ????) # 2
        return jsonify(f"thanks, you're subscriber number {n}!")
    return jsonify(????) # 3

Fill in the ???? parts in the above code (marked # 1, # 2, and # 3) so that it:

1. uses a regex to determine whether the email is valid
2. writes each valid email address on its own line in emails.txt
3. sternly warns users who enter an invalid email address to stop being so careless (you choose the wording)

Also find a way to fill the variable n with the number of users that have subscribed so far, including the user that just got added.
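One possible shape for the validation and counting logic is sketched below, kept separate from flask so it is easy to test. The names EMAIL_RE and add_subscriber are hypothetical, and the regex is a deliberately loose assumption (something@something.something), not necessarily the pattern our tests expect. Counting the lines in the file is just one way to obtain n.

```python
import re

# Loose illustrative pattern: non-space text, "@", text, ".", text.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def add_subscriber(email, path="emails.txt"):
    """Append a valid email to path and return the subscriber count, else None."""
    if not EMAIL_RE.match(email):
        return None
    with open(path, "a") as f:  # append mode keeps earlier subscribers
        f.write(email + "\n")
    with open(path) as f:  # one email per line, so lines == subscribers
        return sum(1 for _ in f)
```

The /email handler could call a helper like this and pass the result (or a warning when it is None) to jsonify.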

Note: see the Flask documentation for details on jsonify.

Donations

On your donations page, write some text, making your best plea for funding. Then, let's find the best design for the homepage, so that people are most likely to click the link to the donations page.

We'll do an A/B test. Create two versions of the homepage, A and B. They should differ in some way, perhaps trivial (e.g., maybe the link to donations is blue in version A and red in version B).

The first 10 times your homepage is visited, alternate between version A and B each time. After that, pick the best version (the one where people click to donate most often), and keep showing it for all future visits to the page.

Hint 1: consider having a global counter in main.py to keep track of how many times the home page has been visited. Consider whether this number is 10 or less and whether it is even/odd when deciding between showing version A or B.

Hint 2: when somebody visits donate.html, we need to know if they took a link from version A or B of the homepage. The easiest way is with query strings. On version A of the homepage, instead of having a regular link to "donate.html", link to "donate.html?from=A", and on version B, link to "donate.html?from=B". Then the handler for the "donate.html" route can keep count of how many people use the link on each version of the home page.

Hint 3: You don't necessarily need to have two different versions of your homepage to make this work. You could use the templating approach: once you read your index.html file into your program, you can edit it. At that point it should be a string, so you could add something to it or replace something in it.
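The three hints can be combined roughly as follows. This is a sketch of the decision logic only, written as plain functions so it can be tested without flask. The names choose_version and render_home, the 1-based visit counter, the tie-breaking in favor of A, and the exact href text being replaced are all assumptions.

```python
def choose_version(visit_number, clicks):
    """Pick "A" or "B" for the visit_number-th homepage visit (1-based).

    clicks maps "A"/"B" to how many donate clicks came from each version.
    """
    if visit_number <= 10:
        # Alternate during the test phase: odd visits see A, even visits see B.
        return "A" if visit_number % 2 == 1 else "B"
    # After the test phase, stick with the version that earned more clicks.
    return "A" if clicks["A"] >= clicks["B"] else "B"


def render_home(template, version):
    """Tag the donate link so donate.html can tell which version sent the visitor."""
    return template.replace('href="donate.html"', f'href="donate.html?from={version}"')


template = '<html><body><h1>Home</h1><a href="donate.html">Donate</a></body></html>'
first_page = render_home(template, choose_version(1, {"A": 0, "B": 0}))
```

In main.py, the visit counter and the clicks dictionary would be globals updated by the home and donate handlers.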

Individual Part (25%)

robots.txt

Your application should have a "robots.txt". Most crawlers/agents should be allowed to crawl anything. User-agent "busyspider" should be blocked from everything and User-agent "hungrycaterpillar" should be blocked from one page only: "browse.html".

You can manually test your robots.txt with the following:

from urllib.robotparser import RobotFileParser
r = RobotFileParser("http://VM_IP:5000/robots.txt")
r.read()
r.can_fetch("hungrycaterpillar", "http://VM_IP:5000/browse.html") # should be False
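You can also check candidate rules offline, before wiring them into a route, by feeding the text straight to the parser. The rules below are one possible way to express the requirements, offered as a sketch rather than the only accepted answer. ROBOTS_TXT and rp are illustrative names.

```python
from urllib.robotparser import RobotFileParser

# One possible robots.txt: block busyspider everywhere, block
# hungrycaterpillar from browse.html only, allow everyone else.
ROBOTS_TXT = """\
User-agent: busyspider
Disallow: /

User-agent: hungrycaterpillar
Disallow: /browse.html

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse the text directly, no server needed

print(rp.can_fetch("hungrycaterpillar", "/browse.html"))
print(rp.can_fetch("hungrycaterpillar", "/index.html"))
print(rp.can_fetch("busyspider", "/index.html"))
```

A flask route for /robots.txt could then return this string, ideally with a plain-text content type.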

Dashboard

Implement a dashboard on your homepage showing at least 3 SVG images. The SVG images must correspond to at least 2 different flask routes, i.e., one route must be used at least twice with different query strings (resulting in different plots), similar to the lecture reading.

Requirements

All plots are based on the data chosen for browse.html, but you are free to choose what is plotted. Plots should have labels for both axes and optionally a title.
Similarly, there is no restriction on the choice of query string parameters, except that the resulting plots should be distinct.

For example, a dashboard could be built by adding the following lines to the index.html file (you're encouraged to use more descriptive names for your .svg routes):

<img src="dashboard_1.svg"><br><br>
<img src="dashboard_1.svg?cmap=damage"><br><br>
<img src="dashboard_2.svg"><br><br>

The dashboard SVGs may look something like this:

dashboard_1.svg

dashboard_1.svg?cmap=damage

Here, the query string uses cmap, which specifies an additional third column to use for a colormap.

dashboard_2.svg

When using query strings, ensure appropriate default values are supplied.

Important

Ensure you are using the "Agg" backend for matplotlib, by explicitly setting
```
matplotlib.use('Agg')
```
right after importing matplotlib.
Ensure that app.run is launched with threaded=False.
Further, use fig, ax = plt.subplots() to create the plots and close the plots after savefig with plt.close(fig) (otherwise you may run out of memory).
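Putting these points together, an SVG route might look roughly like the sketch below. DATA is a made-up stand-in for the DataFrame loaded from main.csv, and the column names are hypothetical. The route name and the cmap parameter follow the example above. When cmap is missing or names an unknown column, the handler falls back to a plain scatter plot, which serves as the default.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-GUI backend, set right after importing matplotlib
import matplotlib.pyplot as plt
from flask import Flask, Response, request

app = Flask(__name__)

# Made-up data standing in for the contents of main.csv.
DATA = {"x": [1, 2, 3, 4], "y": [2, 4, 3, 5], "damage": [10, 40, 20, 30]}


@app.route("/dashboard_1.svg")
def dashboard_1():
    # request.args.get returns None when ?cmap= is absent (the default case).
    cmap_col = request.args.get("cmap")
    fig, ax = plt.subplots()
    if cmap_col in DATA:
        ax.scatter(DATA["x"], DATA["y"], c=DATA[cmap_col])
    else:
        ax.scatter(DATA["x"], DATA["y"])
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    buf = io.BytesIO()
    fig.savefig(buf, format="svg")
    plt.close(fig)  # release the figure so memory does not grow per request
    return Response(buf.getvalue(), mimetype="image/svg+xml")
```

The app.run line from the starter code is omitted here; the handler would sit above it in main.py.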

Concluding Thoughts

Get started early, test often, and, above all, have fun with this one!
