Flask Web Scraping




For this lesson, we'll start off with a boilerplate, one-file Flask app, i.e. just app.py. We won't worry yet about making multiple pages or multiple files.


See if you can create the boilerplate from memory:
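For comparison, here is a minimal sketch of such a boilerplate. The function name `homepage` and the message text are illustrative choices, not something this lesson prescribes.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def homepage():
    # Flask sends the returned string to the browser as the response body.
    return "Hello world!"

# One way to start the app: run `flask run` from the folder that holds app.py.
# The Flask command-line tool looks for a file named app.py by default.
```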

And switch to the command-line and get it running:

Then visit 127.0.0.1:5000 (i.e. localhost:5000)

Program your app.py to return a multi-line text string message for the / route, e.g.
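One possible version, with placeholder message text (the `MESSAGE` name is just a convenience for this sketch):

```python
from flask import Flask

app = Flask(__name__)

# A triple-quoted string keeps its newline characters exactly as typed.
MESSAGE = """Hello world!
This is the second line.
And this is the third."""


@app.route("/")
def homepage():
    # The newlines are sent to the browser as part of the response body.
    return MESSAGE
```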

No matter how many newlines of text are in your message, your web browser will render it as one single line:

By default, the Flask app responds to the web browser with a heads-up that it is sending along data that should be interpreted as 'text/html'.

Inspecting a web server's response

A quick segue that will be more relevant when we learn web-scraping: let's see the metadata behind our localhost web server's response. More specifically, let's confirm that the Flask app is sending its response with the indication that it is meant to be interpreted as HTML.

If you know how to use your browser's developer tools, you can view the headers of the response:

Or heck, get some more Python practice in. Open up a new Terminal window/tab, jump into ipython and perform an HTTP request against your local web server. Then examine the response object's headers:
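Here is a sketch of the same check that doesn't need a running server. It assumes Flask's built-in test client, which makes the request in-process and hands back a response object with the same headers. The tiny app defined here is a stand-in for your own app.py.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def homepage():
    return "Hello world!"


# The test client issues the request in-process, without a live server.
response = app.test_client().get("/")

# Headers can be looked up by name, much like a dictionary.
content_type = response.headers["Content-Type"]
print(content_type)  # the value begins with text/html
```

Against a live server, the third-party requests library gives you the same information: `requests.get("http://127.0.0.1:5000").headers` returns the headers of the real HTTP response.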

HTML and whitespace insensitivity

If you haven't figured it out by now, the Python language is whitespace sensitive; or, in other words, white space is significant.

Or, in more specific words: Python cares about exactly how many consecutive space characters are at the beginning of each line of code.

This works:

And this throws an error:
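Both cases can be sketched in one snippet. The two code samples are stored as strings and passed to Python's built-in `compile()`, so the broken one can be demonstrated without crashing the script. The names `GOOD_SOURCE`, `BAD_SOURCE`, and `compiles` are made up for this illustration.

```python
# Two nearly identical snippets; only the indentation of the last line differs.
GOOD_SOURCE = "if True:\n    a = 1\n    b = 2\n"
BAD_SOURCE = "if True:\n    a = 1\n      b = 2\n"  # two extra leading spaces


def compiles(source):
    """Return True if Python accepts the source, False on an indentation error."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError:
        return False


print(compiles(GOOD_SOURCE))  # True
print(compiles(BAD_SOURCE))   # False
```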

HTML, on the other hand, does not care. So when a browser sees one or more whitespace characters – and whitespace characters include spaces and newlines – within a string of text, it will render those consecutive whitespace characters as a single space. Well, as long as those whitespace characters occur between non-whitespace characters.

This means that the following browser output:

– can be represented by any of the following HTML strings:
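As an illustration with made-up text, the three strings in this sketch differ only in their whitespace. The helper `collapse_whitespace` is a rough approximation of what the browser does with ordinary text, not its actual algorithm, and it reduces all three to the same result.

```python
VARIANTS = [
    "Hello world",
    "Hello      world",
    "Hello\n\n\n   world",
]


def collapse_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace,
    # so re-joining with single spaces mimics what the browser displays.
    return " ".join(text.split())


rendered = {collapse_whitespace(variant) for variant in VARIANTS}
print(rendered)  # {'Hello world'}
```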

So if HTML is just plain text, then we can write our web app's responses as plain text strings. But what makes HTML more than just plain text is its syntax and specification, particularly, its concept of text elements enclosed in tags.

To have a browser render 2 separate lines of text, we don't rely on newline whitespace characters. We enclose each line of text, separately, within HTML tags. Traditionally, the paragraph tag – represented by <p> – is used to denote paragraphs of body text:

Add this as a string to app.py; I include the entirety of app.py, so far, just in case you're lost:
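A sketch of what app.py might look like at this point, with placeholder text inside the paragraphs:

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def homepage():
    # Each line of text gets its own pair of <p> tags; the newlines and
    # indentation inside the string make no difference to the browser.
    return """
    <p>Hello world!</p>
    <p>This is the second line.</p>
    """
```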

Reload localhost:5000 in your browser to see the result:

For reasons that are beyond the scope of this lesson, it is not necessary to memorize that <p> stands for 'paragraph', or that it is the only way to denote blocks of text. Or that all browsers/sites render <p> the same way.

It is important to understand the nature of HTML and its syntax, such as how the angled brackets denote tags that are interpreted by the browser, but are not rendered by the browser…i.e. the <p> and </p> parts of our web app response do not show up to the web browser user. They are code that is meant only for the browser.

This concept of what you see is not what you get is fundamental to understanding HTML, and well, computational languages in general. It is definitely a core concept in understanding how web scraping works…

Making a hyperlink

We won't be learning 1% of all the menial details of HTML and its syntax. But it is important to understand one common feature: making a hyperlink. You do want to memorize that the tags for hyperlinks are denoted with <a>. To designate that a string of text is intended to be a link to be clicked, and also to give the destination of the hyperlink, we use the following syntax:

Alter your app.py:
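A minimal sketch of the altered app, assuming `https://example.com` as a stand-in destination:

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def homepage():
    # Only the text between <a ...> and </a> becomes the clickable link;
    # the href attribute holds the destination.
    return '<p>Hello <a href="https://example.com">world</a>!</p>'
```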

And the result:

Don't worry about the style of the hyperlink – changing what it looks like is beyond the scope of our HTML lesson. But note how only the word world is a hyperlink. And note how, in the HTML, the href attribute denotes the destination, which has nothing to do with the content of the visible text itself. That is, the web page would look exactly the same if we did this:

(note, again, how whitespace is insignificant in what the browser renders)
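To make the split between destination and visible text concrete, here is a sketch that uses Python's standard-library `html.parser` to pull the two apart. The class name `LinkCollector`, the helper `extract_links`, and both example URLs are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Same visible text, two different destinations.
print(extract_links('<p>Hello <a href="https://example.com">world</a></p>'))
print(extract_links('<p>Hello <a href="https://example.org/other">world</a></p>'))
```

Both pages show the word world as the link, yet the extracted destinations differ. This gap between what is displayed and what is in the markup is the one a scraper has to deal with.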

What Is Web Scraping

The fact that a modern web browser will actually show something for a web server response as simple as:


– or even, just:

is a reflection of the fact that modern web browsers have been engineered to just 'roll with it' when it comes to malformed HTML that doesn't follow the actual official HTML specification…this lenient attitude is obviously not the way the Python interpreter has been designed…but it makes for a nicer user experience when visiting a webpage that is slightly off.

When making an actual web app, though, it's worth following as much of the HTML spec as possible just so that when your served-up webpages don't act like you think they should, it won't be because you had overestimated how much the web browser would cover up for your sloppiness.

So what is the most minimalist HTML document that meets the HTML5 spec? According to the answer to this Stack Overflow question:

While that's a valid webpage, it will display exactly nothing in a web browser. So here's a 'fatter' webpage that contains the typical tags of most websites in production today:
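As a sketch of what such a page can look like (the title, heading, and paragraph text are placeholders), here it is as a Python string that a Flask route could return unchanged:

```python
# A fuller HTML document: doctype, <html>, <head> with metadata, and <body>.
PAGE = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Hello world</title>
  </head>
  <body>
    <h1>Hello world</h1>
    <p>This is a paragraph of body text.</p>
  </body>
</html>"""

print(PAGE)
```

The <head> element carries information about the page, such as its title, while everything the visitor actually sees lives inside <body>.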

Adding an image

Of course, webpages generally have more than just text. So let's use the image tag to include the image at this URL:


As I've said at the beginning of this set of simple Flask lessons, we're trying to keep our initial web apps contained to a single file, for simplicity's sake. So for now, we 'hotlink' a remote image (retrieve it from a remote server):

In general, it's better to host our own images rather than use another web server's bandwidth. But let's at least give the original author credit for the image by including the source URL:

And let's make the image itself a clickable link by wrapping the <img> element with an <a> element:
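A sketch of that wrapping, written as a small helper. The function name `linked_image` and both URLs are placeholders, not the image this lesson uses.

```python
from html import escape


def linked_image(image_url, link_url, alt_text):
    """Return an <img> element wrapped in an <a> element."""
    # escape() guards against quotes or ampersands breaking the attributes.
    return (
        f'<a href="{escape(link_url)}">'
        f'<img src="{escape(image_url)}" alt="{escape(alt_text)}">'
        "</a>"
    )


# Placeholder URLs: swap in the image and the page you want to credit.
IMAGE_URL = "https://example.com/photo.jpg"
SOURCE_URL = "https://example.com/photo-page"

snippet = linked_image(IMAGE_URL, SOURCE_URL, "A placeholder photo")
print(snippet)
```

The resulting string can be dropped into the HTML that the Flask route returns. Clicking anywhere on the image then follows the link in the surrounding <a> element.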

Adding an external CSS style sheet

Even with a pretty image, our webpage looks a bit dated with the default typography styles and spacing:

CSS – the language used to define the visual style of a website – is yet another thing we'll skim over. But we can get its benefits by including the CSS from a stylesheet someone else has already created.

Let's include the stylesheet from the popular Foundation framework by using the <link> tag within the <head> element. Here's the URL for that particular stylesheet, which you can click through to see the raw code if you're curious:

(again, note that we're hotlinking to a remote file)

(Read more about CSS and stylesheets at the W3C)
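Here is a sketch of where the <link> tag goes. The stylesheet URL is a made-up placeholder rather than the real Foundation address, so substitute the one you actually want to use.

```python
# Hypothetical stylesheet location, for illustration only.
STYLESHEET_URL = "https://example.com/css/foundation.min.css"

PAGE = f"""<!DOCTYPE html>
<html>
  <head>
    <title>Hello world</title>
    <link rel="stylesheet" href="{STYLESHEET_URL}">
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>"""

print(PAGE)
```

The `rel="stylesheet"` attribute tells the browser what kind of resource the `href` points to. The tag sits inside <head> because it describes the page rather than adding visible content.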

A little inline CSS

If you view the webpage now generated by the app.py, you'll see that it's a little too flush against the edge of the browser:

We can change that by adding some CSS to the style attribute of the entire <body> element, like so:

Note that this is most definitely not best practice. I'm just doing it here as a quick hack because we're almost done with this lesson. And because I'll be using it again in subsequent examples to keep the entire web app contained in one file. Best practice is to include CSS code in a self-contained stylesheet that we link to, as we did with the Foundation CSS.
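A sketch of that quick hack, with illustrative CSS values (the exact margin and width are assumptions; tune them to taste):

```python
# Inline CSS: a semicolon-separated list of property: value pairs.
BODY_STYLE = "margin: 2em auto; max-width: 40em; padding: 0 1em;"

PAGE = f"""<!DOCTYPE html>
<html>
  <head>
    <title>Hello world</title>
  </head>
  <body style="{BODY_STYLE}">
    <p>Hello world!</p>
  </body>
</html>"""

print(PAGE)
```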


The result of this styling:

Much nicer!

All together

Here is all the code needed for the app.py that runs our little pretty-HTML generating Flask app:
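What follows is a sketch of how the pieces from this lesson fit together, not a copy of the original listing. The image, source, and stylesheet URLs are all placeholders, and so are the text and the CSS values.

```python
from flask import Flask

app = Flask(__name__)

# Placeholder URLs: substitute the image, source page, and stylesheet you use.
IMAGE_URL = "https://example.com/photo.jpg"
SOURCE_URL = "https://example.com/photo-page"
STYLESHEET_URL = "https://example.com/css/foundation.min.css"

PAGE = f"""<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Hello world</title>
    <link rel="stylesheet" href="{STYLESHEET_URL}">
  </head>
  <body style="margin: 2em auto; max-width: 40em;">
    <h1>Hello <a href="https://example.com">world</a></h1>
    <a href="{SOURCE_URL}"><img src="{IMAGE_URL}" alt="Placeholder image"></a>
    <p>Image source: <a href="{SOURCE_URL}">{SOURCE_URL}</a></p>
  </body>
</html>"""


@app.route("/")
def homepage():
    # All of the HTML is one Python string; Flask just sends it along.
    return PAGE
```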

And this concludes the extent of the HTML knowledge that we need to cover in order to build a web application…for now. If you're really fascinated by the specifics of Hypertext Markup Language, one of the best introductions to HTML (and CSS and JS) can be found in Chapter 3 of Scott Murray's Interactive Data Visualization, which can be read online for free.

In subsequent lessons, we'll be creating more complex webpages with more HTML tags and syntax, but it's enough to just know that writing HTML has nothing to do with writing the Python code that runs the Flask app, other than changing the contents of a text string.

How to make an HTML page without a web application

If HTML is just text, then why can't we just type all that HTML in the above example and save it as an HTML file? In fact, you can do that. Go ahead and copy this:

And save it somewhere on your computer with a .html extension. Double-clicking it should make your operating system open it with your web browser. And it will look exactly the same as the webpage generated by our Flask app…which should make sense because it's the exact same HTML.
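If you'd rather let Python do the saving, here is a sketch using the standard library's `pathlib`. The helper name `save_page`, the filename, and the sample HTML are assumptions for illustration. The demo writes into a throwaway temporary folder so that nothing is left behind.

```python
import tempfile
from pathlib import Path

HTML = "<!DOCTYPE html>\n<html><body><p>Hello world!</p></body></html>"


def save_page(html, directory, filename="hello.html"):
    """Write the HTML text to directory/filename and return the file's path."""
    target = Path(directory) / filename
    target.write_text(html, encoding="utf-8")
    return target


# Demo: save into a temporary folder, then read the file back.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_page(HTML, tmp)
    round_trip = saved.read_text(encoding="utf-8")

print(round_trip == HTML)  # True
```

To keep the file, pass a real folder instead of the temporary one. The saved file is plain text, so it opens in a browser the same way as one typed by hand.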

So what's the whole point of this Flask app business and programming in Python? In the next lesson, we'll create our first dynamic web app functionality.