Automated Data Collection with R Blog

Win a hardcover copy of ADCR

Posted by The Authors on December 03, 2014

The rapid growth of the World Wide Web over the past two decades has fundamentally changed the way we share, collect and publish data. Firms, public institutions and private users provide every imaginable type of information; new channels of communication generate vast amounts of data on human behavior. To help researchers cope with this avalanche of data, a variety of new techniques for collecting and analyzing large data sets has been devised.

As the relevant information for large data applications is frequently unstructured and spread across numerous webpages, collecting data manually quickly becomes infeasible. If you have identified online data as an appropriate resource for your project, you will likely need to automate the data collection and tidying process, particularly if you plan to update your database regularly. The same is true if the collection task is non-trivial in terms of scope and complexity, or if you want others to be able to replicate the collection process. As it happens, R has developed into a powerful and flexible tool for tasks of that kind. It has caught up with the other programming languages that used to be most commonly applied to web scraping and text mining.

For these reasons, the market for books on data science, machine learning and big data applications has grown notably in recent years. Surprisingly often, however, these books remain silent on how the data for data science applications is actually acquired. To fill this gap, we have written an accessible introduction to data collection with R that will be published by Wiley in January 2015. Our book serves as a preparatory step for data analyses but also provides guidance on how to manage available information and keep it up to date. While writing the book we had at least three groups of readers in mind.

  1. Autodidacts who are looking for a gentle introduction to the necessary techniques that covers the basics and also presents real-life examples.
  2. Lecturers and students who are looking for a book to accompany a course on web data collection.
  3. Initiates who have some familiarity with web scraping already but who would like a volume to fill remaining gaps.

To accommodate all three, the book has a three-part structure that starts with the fundamental web technologies, proceeds to solutions for individual scraping problems and concludes with full-fledged web scraping projects that cover the whole research arc of planning, download, extraction, cleansing and analysis.

Win a hardcover copy of ADCR!

Thanks to our publisher Wiley, we are able to raffle off three hardcover copies of the book. You can participate by following the book's Twitter account @RDataCollection. The winners will be selected randomly from all @RDataCollection followers. The closing date for the competition is December 22nd, 2014. Winners will be notified within seven days of the drawing date.