Whether your business is looking to capture data about US highways, real estate property values, or seasonal retail shopping trends, chances are, this data exists in fragments across multiple locations.
At Icreon, we've built multiple enterprise solutions that scrape data from the web – spanning retail, real estate, and consumer electronics. We've scraped data across many sites, and we've also built tools to extract large volumes of data from digital documents such as PDFs and image files.
Large businesses typically publish APIs to provide access to proprietary data. For example, retrieving data from Yelp or Facebook to build new software is relatively simple using the tools that are already available.
When no API is available, we write scripts that scrape data directly. We read content straight from a webpage – whether it's restaurant listings, event calendars, eCommerce product catalogs, or some other set of data.
Every scraping project needs to account for junk data. We ensure that we don't pull in duplicate sets of data. Sites change on a daily basis, so we build tools that detect when a scraped site has changed and handle the change gracefully.
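As a minimal sketch of the deduplication idea (not Icreon's actual tooling), one common approach is to fingerprint each scraped record with a stable hash and skip records whose fingerprint has already been seen:

```python
import hashlib
import json

def record_fingerprint(record):
    """Build a stable hash of a record; sorting keys makes field order irrelevant."""
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def deduplicate(records):
    """Yield each record only the first time its fingerprint is seen."""
    seen = set()
    for record in records:
        fp = record_fingerprint(record)
        if fp not in seen:
            seen.add(fp)
            yield record

# Hypothetical scraped listings: the first two are the same data in different key order.
listings = [
    {"name": "Cafe Roma", "city": "New York"},
    {"city": "New York", "name": "Cafe Roma"},
    {"name": "Blue Bottle", "city": "Oakland"},
]
unique = list(deduplicate(listings))
```

In a real pipeline the `seen` set would typically live in a database or key-value store so deduplication survives across scraping runs.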
If the data your business is scraping is more nuanced, techniques such as microdata and microformat parsing let us read each site's Document Object Model (DOM) to capture the information your scraping project needs.
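To illustrate what microdata parsing looks like in practice, here is a small sketch using only Python's standard-library `html.parser` (the sample restaurant markup is invented for the example; production scrapers usually use richer libraries):

```python
from html.parser import HTMLParser

class MicrodataParser(HTMLParser):
    """Collect itemprop/value pairs from a page's microdata annotations."""

    def __init__(self):
        super().__init__()
        self._current_prop = None
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._current_prop = attrs["itemprop"]

    def handle_data(self, data):
        # Record the first non-whitespace text inside an itemprop element.
        if self._current_prop and data.strip():
            self.properties[self._current_prop] = data.strip()
            self._current_prop = None

html = """
<div itemscope itemtype="https://schema.org/Restaurant">
  <span itemprop="name">Cafe Roma</span>
  <span itemprop="telephone">(212) 555-0100</span>
</div>
"""
parser = MicrodataParser()
parser.feed(html)
```

Because microdata attributes follow the schema.org vocabulary, the same parser works across any site that annotates its markup this way.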
The web changes so quickly that the data you're looking to scrape may not be centralized on just a few sites. Like search engines such as Google and Bing, we use machine learning tools that can crawl across the entire web to retrieve data.
The majority of scraping projects require pulling in massive amounts of data on a regular basis. With large datasets, we build tools that let you, your employees, and your customers gather insight from the terabytes you're pulling in.
The web is the greatest source of publicly available data in the world. However, much of this data is not easily accessible. To access data from the web, one of the key skills required is data scraping. This is a technique where a piece of software surveys and combs through a website to gather and extract data.
Every Internet user knows this process on a smaller scale: we visit websites, see interesting information, and copy it for later use. Yet this approach breaks down when the necessary information is too large in scale or is spread across multiple websites.
The main advantage of data scraping is its ability to work with virtually any website – from government sites to weather forecasts to organizational data. Hidden data can also be extracted from PDFs and web pages using data scraping techniques. It is one of the most useful tools for software applications or websites in need of valuable information.
A computer creates machine-readable data to enable efficient processing. This structured machine-readable data comes in different formats such as CSV, JSON and XML. Most of the data available on the web is published in these formats.
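As a quick illustration of those formats (this snippet is an example, not part of the article), the same set of scraped records can be serialized to JSON or CSV with Python's standard library:

```python
import csv
import io
import json

# Hypothetical scraped records.
records = [
    {"product": "Widget", "price": 9.99},
    {"product": "Gadget", "price": 24.50},
]

# JSON: self-describing keys, supports nesting.
json_text = json.dumps(records, indent=2)

# CSV: flat rows, compact and spreadsheet-friendly.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["product", "price"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()
```

CSV suits tabular data headed for a spreadsheet or database import, while JSON preserves nested structure; XML plays a similar role to JSON in many older feeds.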
The goal of data scraping is to access machine-readable as well as end-user-facing data and combine it with other datasets so a user can explore it independently of the source websites. When one is looking for data to use in individual applications, it is not always in the required format.
For instance, government sites are known for publishing PDFs instead of raw data. However, any content that can be viewed on a webpage can be scraped. During screen scraping, structured content is extracted from a web page with the help of a scraping tool or by writing a small piece of code.
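Here is a minimal sketch of such a "small piece of code", using only Python's standard library to pull the rows out of an HTML table (the sample markup is invented; a real scraper would first fetch the page, e.g. with `urllib.request`):

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Extract the text of each cell in each table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

page = """
<table>
  <tr><th>Agency</th><th>Budget</th></tr>
  <tr><td>Transportation</td><td>$12M</td></tr>
</table>
"""
scraper = TableScraper()
scraper.feed(page)
```

Once the rows are in a Python list, they can be written straight to CSV or loaded into a database – which is exactly the structured output screen scraping aims for.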
While this method is quite powerful, it requires a bit of understanding about how the web works and what can be and what cannot be scraped.
Data scraping is conducted either with a scraping tool or by writing pieces of code referred to as web scrapers. Many tools effectively scrape data from websites. Depending on the browser, a tool like Readability helps extract text from any web page.
Another tool, DownThemAll, allows users to download many files from a website in one go. Chrome's Scraper extension also helps with extracting tables from websites. Then there are web scrapers written in programming languages such as Python, PHP or Ruby.
These web scrapers are pointed at pages and the elements within them, and the desired data is extracted. For effective scraping, understanding the structure of the website, its pages, and the underlying database is very important.
If someone wants to get started with scraping without the hassle of setting up a coding environment then ScraperWiki is a web site that allows users to code scrapers in Python, PHP or Ruby.
There are of course limits to what can be scraped. Some sites are simply structured in ways that make their data hard to extract.
Another set of limitations is legal and regulatory. Many countries recognize database rights that limit the re-use of some online information. Commercial organizations and NGOs, for instance, forbid data scraping in most cases.
Scraping freely available governmental data is acceptable, but information that infringes the privacy of individuals and violates data privacy laws is not. Some websites try to prevent scraping by prohibiting it in their online Terms of Service. However, depending on jurisdiction one may have special rights to access (journalists for example).
In a perfect world, all data would be easily available to everyone. Unfortunately this is far from the truth (especially when it comes to government research!). But all-in-all, data scraping helps people retrieve and extract specific data efficiently.
Web scraping might be one of the best ways to aggregate content from across the internet, but it comes with a caveat: It's also one of the hardest tools to parse from a legal standpoint.
For the uninitiated, web scraping is a process whereby an automated piece of software extracts data from a website by "scraping" through the site’s many pages. While search engines like Google and Bing do a similar task when they index web pages, scraping engines take the process a step further and convert the information into a format which can be easily transferred over to a database or spreadsheet.
It's also important to note that a web scraper is not the same as an API. While a company might provide an API to allow other systems to interact with its data, the quality and quantity of data available through APIs is typically lower than what is made available through web scraping. In addition, web scrapers provide more up-to-date information than APIs and are much easier to customize from a structural standpoint.
The applications of this "scraped" information are widespread. A journalist like Nate Silver might use scrapers to monitor baseball statistics and create numerical evidence for a new sports story he's working on. Similarly, an eCommerce business might bulk scrape product titles, prices, and SKUs from other sites in order to further analyze them.
While web scraping is an undoubtedly powerful tool, it's still undergoing growing pains when it comes to legal matters. Because the scraping process appropriates pre-existing content from across the web, there are all kinds of ethical and legal quandaries that confront businesses hoping to leverage scrapers for their own processes.
In this "wild west" environment, where the legal implications of web scraping are in a constant state of flux, it helps to get a foothold on where the legal needle currently falls. The following timeline outlines some of the biggest cases involving web scrapers in the United States, and allows us to better understand the precedents set by those rulings.
For years after they first came into use, web scrapers went largely unchallenged from a legal standpoint. In 2000, however, the use of scrapers came under heavy and consistent fire when eBay fired the first shot against an auction data aggregator called Bidder's Edge. In this very early case, eBay argued that Bidder's Edge was using scrapers in a way that violated Trespass to Chattels doctrine. While the lawsuit was settled out of court, the judge upheld eBay’s original injunction, stating that heavy bot traffic could very well disrupt eBay’s service.
Then in 2003's Intel Corp. v. Hamidi, the California Supreme Court overturned the basis of eBay v. Bidder's Edge, ruling that Trespass to Chattels could not extend to the context of computers if no actual damage to personal property occurred.
So in terms of legal action against web scraping, Trespass to Chattels no longer applied, and things were back to square one. This began a period in which the courts consistently rejected Terms of Service as a valid means of prohibiting scrapers, in cases including Perfect 10 v. Google and Cvent v. Eventbrite.
In 2009, Facebook turned the tide of the web scraping war when Power.com, a site that aggregated multiple social networks into one centralized interface, included Facebook in its service. Because Power.com was scraping Facebook's content instead of adhering to Facebook's established standards, Facebook sued Power on grounds of copyright infringement.
In denying Power.com's motion to dismiss the case, the judge ruled that scraping can constitute copying, however momentary that copying may be. And because Facebook's Terms of Service do not allow scraping, that act of copying constituted an infringement of Facebook's copyright. With this decision, the waters regarding the legality of web scrapers began to shift in favor of content creators.
In 2010, hacker Andrew "Weev" Auernheimer found a security flaw in AT&T's website, which would display the email addresses of users who visited the site via their iPads. By exploiting the flaw using some simple scripts and a scraper, Auernheimer was able to gather thousands of emails from the AT&T site.
Although these email addresses were publicly available, Auernheimer’s exploit led to his 2012 conviction, where he was charged with identity fraud and conspiracy to access a computer without authorization.
Earlier this year, the court vacated Auernheimer's conviction, ruling that the trial’s New Jersey venue was improper. But even though the case turned out to be mostly inconclusive, the court noted the fact that there was no evidence to show that "any password gate or code-based barrier was breached." This seems to leave room for the web scraping of publicly-available personal information, although it’s still very much open to interpretation and not set in stone.
Meltwater is a software company whose "Global Media Monitoring" product uses scrapers to aggregate news stories for paying clients. The Associated Press took issue with Meltwater's scraping of its original stories, some of which had been copyrighted. In 2012, the AP filed suit against Meltwater for copyright infringement and hot news misappropriation.
While it's already been established that facts cannot be copyrighted, the court decided that the AP’s copyrighted articles—and more specifically, the way in which the facts within those articles were arranged—were not fair game for copying. On top of this, Meltwater's use of the articles failed to meet the established fair use standards, and could not be defended on that front either.
By closely observing the outcomes of previous rulings, you'll find a few guidelines that anyone operating a scraper should attempt to adhere to.
While all of these guidelines are important to understand before using scrapers, there are other ways to acclimate to the legal nuances. In many cases, you’ll find that a simple conversation with a business software developer or consultant will lead to some satisfying conclusions: Odds are, they’ve used scrapers in the past and can shed light on any snags they’ve hit in the process. And of course, talking with a lawyer is always an ideal course of action when treading into questionable legal territory.
Digital marketing strategies have reinvigorated the importance of quality content. For most websites, content is the primary driver of vital web traffic that equates to leads and transactions. The company blog is often a designated platform for internal marketing content distribution and exposure.
Over the past several years, a tactic has emerged of creating a separate site focused on a niche, segmented product or service area. The strategy is known as the micro-site. Micro-sites are websites associated with an organization but carrying a unique, separate domain, with their own user interface, design layout, and content.
This new tactic is used to frame content in a very specific and strategic fashion. From contextually targeted messages to stronger delivery of calls to action, microsites are maturing into the digital marketer's new best friend.
Controlling and specifying the context in which content is displayed is a major imperative in digital marketing. There is a reason that blogs, news stories, and social media shares are each written in their own way: users interact with these various channels in certain ways and for specific reasons.
For instance, when accessing a site via mobile, minimal touch points and optimal button placement are crucial to the user experience. In a similar way, microsites are able to frame content regarding a focused product group to deliver the marketing message in an optimal fashion.
With a microsite, the distractions of a standard website (menu bars, sidebars, banner ads, and so on) are removed. Stripping away unnecessary functionality exposes the user to exactly the information they are meant to experience. By using microsites, the direct impact of messaging is laser-focused and thus more effective.
One of the most valuable aspects of designing microsites is that the design and exposure of content align directly with the intended context for receiving that content. Whereas the main company website is designed for any and all of your customers, a microsite attends to a particular segment or product line.
Microsites are a laser-focused approach to marketing a brand: a specific product or campaign is singled out by the microsite. eCommerce sites especially stand to benefit from this tactic, given the breadth of product offerings on a main site. Certain products may become the focus of a microsite campaign to drive interest and exposure.
Social sharing specifically is well suited to correspond with light-weight, easily sharable microsites with creative domain names. For instance, Starbucks incorporated a microsite to drive exposure and interest in a new beverage line with the domain Frappuccino.com. In some instances domains are expensive, but if you are creative you can land a domain name that drives the impact of microsite content even more.
Given that a microsite is targeted for specific demographics and focused segments of your entire customer base, the usage data covering the interactions with the microsite will provide equally niche insights. Focused data on a specific segment is crucial to marketing campaigns.
Not only does the microsite provide initial value by exposing your brand's new offerings; once the product launch and initial reception are complete, the site remains valuable because of the user data it has captured. Analyzing how users interact with the microsite, compared with how they interact with the main site, provides strategic perspective on how well the microsite reached its target groups.
Understanding that your customer base is diverse is the first step toward more impactful marketing. Microsites can provide exact data on which online customers are interested in specific product groups or web experiences. If you create microsites for multiple segments of your audience, you can track which groups responded most energetically to the strategy.
As a relatively new online marketing tactic for eCommerce operations, microsites have both immediate and long-term value. The contextualization of content has never been more important than it is in today's market. Multiple devices and varied segments allow an entirely new opportunity to differentiate your organization's marketing from competitors'.
A recent Forrester report predicts that online retail sales will exceed $370 billion by 2017. Accompanied by growing innovation in the mobile and social space, eCommerce brands have a lot to deal with in 2014.
Like every industry that relies on the Internet, retail brands are constantly contending with the rise of disruptive technologies. eCommerce giants like eBay and Amazon altered the entire retail landscape. In 2014, social media and mobile devices are driving even greater change in the way customers shop online.
From the 'Pinterization' of web design to the rising tide of mobile customers shopping online and in-store, eCommerce is set for some interesting developments in 2014.
Pinterest and Instagram have had a tremendous impact on the way brands design their eCommerce websites. Large product images, infinite scrolling, and social trending are features that originated with social media. As software development techniques advance to build better apps and websites, expect some new trends this year.
As 2014 progresses, aspects and functions of social media networks will begin to appear more and more on eCommerce websites. Even Amazon has taken a cue from Pinterest with the introduction of ‘Amazon Collections’ (a similar feature to Pinterest boards).
Over 30% of tablet owners use the device as a go-to option for shopping online. In the past two years alone tablet ownership has increased by 282%.
When it comes to eCommerce development in 2014, tablet applications will be a priority for brands. Creating touch-friendly interfaces with appropriate finger-friendly layouts will drive mobile transactions.
High-quality zooming to inspect a product in detail, and high-quality videos describing an item, will be commonplace in 2014. With advances in pixel density and screen resolution, shoppers expect interactive media experiences online.
In the same way that Pinterest has placed a renewed emphasis on product images, the gallery setup and interactive features are also set to improve.
On Singles Day, the Chinese equivalent to Cyber Monday, shoppers purchased $5.7 billion in transactions from the Alibaba Group alone in a 24 hour period. Compare that to the $1.7 billion from Cyber Monday. In 2014, American brands will increase the targeting of eCommerce on a global scale.
Social shopping platform Wanelo, which boasts more than 28 million visits a month, recently introduced a content marketing initiative. eCommerce stores and independent sellers alike can create 'Stories' that detail the history and production of a product.
Similar to a short form blog post, content marketing that accompanies eCommerce products will rise in 2014. The trend will not be limited to the written word. Trending videos and even specified mobile apps will enhance the brand experience for shoppers.
Fancy, a socially curated eCommerce website, offers a service where shoppers pay $40 a month to receive a celebrity-curated package of products. Despite significant commentary questioning the subscription model, it is growing in prevalence across multiple businesses.
The model is even becoming a successful approach for small and medium sized businesses. LoveWithFood is an organic grocer that delivers all-natural snacks to customers for $10 a month. In 2014, expect a resurgence of subscription models in the eCommerce industry.
Social media and eCommerce have combined to produce several innovative websites and brands. Wanelo, Fancy and Polyvore are socially driven eCommerce experiences where users discover a crowd-curated list of trending products. Heavy design inspiration from Pinterest and Instagram can be seen in the web development and design approaches used to create these sites.
As soon as a shopper arrives at the website, they are treated to a newsfeed of product suggestions. The more the shopper browses and likes different products, the more personalized the product-feed. In 2014, expect socially driven curation of products throughout the eCommerce space.