Web Scraping with Python: Collecting More Data from the Modern Web Paperback – 4 April 2018
|New from||Used from|
Frequently bought together
Customers who viewed this item also viewed
From the Preface
What Is Web Scraping?
The automated gathering of data from the internet is nearly as old as the internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as screen scraping, data mining, web harvesting, or similar variations. General consensus today seems to favor web scraping, so that is the term I use throughout the book, although I also refer to programs that specifically traverse multiple pages as web crawlers or refer to the web scraping programs themselves as bots.
In theory, web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of HTML and other files that compose web pages), and then parses that data to extract needed information.
In practice, web scraping encompasses a wide variety of programming techniques and technologies, such as data analysis, natural language parsing, and information security. Because the scope of the field is so broad, this book covers the fundamental basics of web scraping and crawling in Part I and delves into advanced topics in Part II. I suggest that all readers carefully study the first part and delve into the more specific in the second part as needed.
About This Book
This book is designed to serve not only as an introduction to web scraping, but as a comprehensive guide to collecting, transforming, and using data from uncooperative sources. Although it uses the Python programming language and covers many Python basics, it should not be used as an introduction to the language.
If you don’t know any Python at all, this book might be a bit of a challenge. Please do not use it as an introductory Python text. With that said, I’ve tried to keep all concepts and code samples at a beginning-to-intermediate Python programming level in order to make the content accessible to a wide range of readers. To this end, there are occasional explanations of more advanced Python programming and general computer science topics where appropriate. If you are a more advanced reader, feel free to skim these parts!
If you’re looking for a more comprehensive Python resource, 'Introducing Python' by Bill Lubanovic (O’Reilly) is a good, if lengthy, guide. For those with shorter attention spans, the video series 'Introduction to Python' by Jessica McKellar (O’Reilly) is an excellent resource. I’ve also enjoyed 'Think Python' by a former professor of mine, Allen Downey (O’Reilly). This last book in particular is ideal for those new to programming, and teaches computer science and software engineering concepts along with the Python language.
Technical books are often able to focus on a single language or technology, but web scraping is a relatively disparate subject, with practices that require the use of databases, web servers, HTTP, HTML, internet security, image processing, data science, and other tools. This book attempts to cover all of these, and other topics, from the perspective of 'data gathering.' It should not be used as a complete treatment of any of these subjects, but I believe they are covered in enough detail to get you started writing web scrapers!
What other items do customers buy after viewing this item?
No customer reviews
|5 star (0%)||0%|
|4 star (0%)||0%|
|3 star (0%)||0%|
|2 star (0%)||0%|
|1 star (0%)||0%|
Review this product
Most helpful customer reviews on Amazon.com
The read is quick, because the author structured the book so only the first half of the books is "required reading." The second half is more of a reference section, that you can selectively read if the topic applies to your project. I find myself referring to a couple of these later chapters for my project, but not all.
Other than a familiarity with python, the author assumes no prior technical knowledge on the subjects covered. Some folks might find this too basic, but was fine for me, because I had little knowledge of how the internet works, beyond a superficial level.