What are the Ethical Aspects of Web Scraping?

Understanding legal considerations and ethics in web scraping

Poonam Rao
8 min read · Jul 26, 2021

Purpose
This paper critically reviews the literature on ethics in web scraping.

Web Scraping Explained
Big Web Data is dynamic content: HTML tables, blog posts, tweets, photos, audio, video, and other structured and unstructured data. It evolves at extreme velocity and exhibits high volume and variety.

Web scraping, a revolutionizing research practice, is described as the automated extraction and harvesting of publicly available web data (Luscombe et al., 2021). Macapinlac (2019) defines a web scraper as a bot that combines website analysis, web crawling, and data organization. A request is sent to the website, the response is parsed, the relevant information is extracted, and the result is reformatted. Crawling applications are typically written in R or Python, though advanced “point-and-click” tools are also available.
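
To make that request-parse-extract-reformat cycle concrete, here is a minimal Python sketch using the requests and BeautifulSoup libraries. The URL and the “price-table” id are hypothetical placeholders, not references to any real site:

```python
# Minimal request-parse-extract-reformat cycle with requests + BeautifulSoup.
# The URL and the "price-table" id are hypothetical placeholders.
import csv

import requests
from bs4 import BeautifulSoup

url = "https://example.com/listings"  # hypothetical target page
response = requests.get(url, headers={"User-Agent": "research-bot/0.1"}, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors

# Parse the HTML and locate the table of interest.
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table", id="price-table")  # assumed table id
if table is None:
    raise SystemExit("expected table not found on the page")

# Extract the cell text from every row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

# Reformat: persist the harvested rows as CSV.
with open("listings.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```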

At present there are fewer than 150 papers on this topic (Krotov et al., 2020), and few of them discuss socio-technical perspectives. In this paper, we evaluate the benefits, ethical aspects, and concerns of web scraping.

Source: http://applied-r.com/web-scrapping/

Benefits of Web Scraping
Academic & business researchers find it arduous, expensive, and cumbersome to gather the data necessary for research projects. Vast volumes of publicly available digital and transactional information are not organized or easily accessible for download, which hampers research activities. Web scraping makes it easy to mine website data.

Use cases include lead generation, scraping real estate data and reviews, monitoring competitors’ pricing, academic research, and gathering testing and training data. Krotov et al. (2020) note that as organizations focus on growth, they leverage web scraping to formulate strategies and ultimately gain competitive advantage by improving organizational performance. Luscombe et al. (2021) and Krotov et al. (2020) suggest that web scraping offers researchers opportunities to gain new insights into both novel and long-standing research questions.

Ethical & Legal Considerations
The following are some key considerations (Luscombe et al., 2021; Krotov et al., 2020):

  • Minimal viable scope: In a traditional survey or interview, researchers seek consent and carefully select a narrow, representative sample. Web scraping brings a paradigm shift. In the “gather-narrow-extract” framework, researchers collect exhaustive sets of potentially valuable data and then narrow them to the subset of interest (a.k.a. data wrangling). The sample may not be representative of the population, which could lead to biased decisions and discrimination. For example, can Facebook insights be used to generalize human behavior and attitudes? Additionally, proxy and incomplete data make it hard to prove the validity of the insights.
  • Performance degradation: The performance of scraped websites may degrade as scraping algorithms overload the server with constant data collection. This can amount to a denial-of-service (DoS) attack, crashing the website and making its content inaccessible to others. One naive user could inadvertently deploy a bot that rapidly consumes content that would otherwise have been available to a wider audience (see the rate-limiting sketch after this list).
  • Lack of transparency: Researchers often do not disclose creative scrapes that may have violated a website’s terms of use. Contentious decisions can go undiscussed, lacking transparency. Without coherent policies, ethical review boards, and guiding principles for data scraping algorithms, there is potential for harm.
  • Potential for harm: Researchers may be required by their funding bodies to make collected data publicly available, putting them at risk of litigation. They may also inadvertently violate terms of use by overlooking a site’s “scrapability” or required scraping approvals. Confidential organizational information may be exposed, a website’s value and brand tarnished, and individual privacy compromised. Website terms of use may not always serve the best interests of the public. Consequently, we might see more lawsuits against researchers in the future, with the ramification that researchers shy away from legally ambiguous endeavors.
  • Values of sharing & algorithmic thinking in public interest: Web scraping can revitalize the important values of sharing and fostering trust. The concept of algorithmic thinking in the public interest makes the complex dimensions of web scraping visible to the public. Researchers could share their code as supplemental material that future researchers can leverage and repurpose. Evaluating the code, raw data, and insights would help the public understand, in a transparent fashion, the reasoning behind decisions and policies. Code inspections could help identify problems and biases.
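
To illustrate the performance concern above, here is a minimal Python sketch of a “polite” scraper: it checks robots.txt before fetching and pauses between requests so the target server is not overloaded. The base URL, page paths, and user-agent string are hypothetical placeholders:

```python
# A "polite" scraper: honors robots.txt and rate-limits its requests
# so it does not degrade the target server. URLs are hypothetical.
import time
import urllib.robotparser

import requests

BASE = "https://example.com"
USER_AGENT = "research-bot/0.1"

# Fetch and parse the site's robots.txt before scraping anything.
rp = urllib.robotparser.RobotFileParser()
rp.set_url(f"{BASE}/robots.txt")
rp.read()

pages = [f"{BASE}/listings?page={i}" for i in range(1, 6)]  # hypothetical pages
for url in pages:
    if not rp.can_fetch(USER_AGENT, url):
        print(f"robots.txt disallows {url}; skipping")
        continue
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    print(url, response.status_code)
    time.sleep(2)  # pause between requests to avoid overloading the site
```

The two-second pause here is a conservative placeholder; a more considerate scraper would also honor any Crawl-delay directive the site publishes.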

Critical Position
I agree with the concerns raised by the authors. Web scraping is surrounded by inherent paradoxes around its legal and ethical aspects. On one hand, openness and sharing are supposed to help the public at large. On the other, website data could be viewed as a proprietary, private asset that its owners need to protect. There are also questions about who owns the data presented on a website, and whether it is truly the website owner.

The use of web-scraped data poses ethical questions about research ethics, including consent, privacy, anonymity, trust, and transparency. Individuals’ confidential information and companies’ sensitive information could be exposed. The methods used for storing harvested data may not be compliant. APIs such as those of Dice, LinkedIn, Facebook, Twitter, and Craigslist could be bypassed to retrieve more than the approved volumes of information and use it with malicious intent. Backward Google analysis could lead to policy violations. This could be perceived as an intentional act of hacking, a breach of confidentiality, or copyright infringement and illegal use of data. While web scraping that causes no harm and is cognizant of website performance is legal, massive data collection and unauthorized or prolonged excessive use is a violation. Illegally harvested data may then be put to commercial use, as in eBay v. Bidder’s Edge, where Bidder’s Edge crawled eBay more than 100K times a day (constituting about 2% of eBay’s daily requests, and crawling the entire site) to compile its own auction database while overloading the eBay network. Copyright infringement is common, with scrapers using data and content for financial gain while staying “safe” through minimal transformation. Uber has been accused of spying on competing companies and individual drivers, using web scraping to reveal trade secrets and threatening organizational privacy.

Policies need to be established that prevent irresponsible scraping, overburdening of servers, and data abuse, and that address “softer issues”. Researchers need to be aware of the unintended consequences of their activities and the scale of potential harm, and to exercise reasonable precaution. They should be able to justify on ethical grounds that their scraping activities, methods of gathering and mining, and the associated risks are in the public interest, and that they neither violate the rights of research subjects nor change the perceived value of websites.

Alternative Perspective
All web scraping activities can, in principle, be done manually by humans (Macapinlac, 2019). Web scraping technology enables doing so at scale and at a fast pace, and it constitutes approximately 25% of Internet traffic (Macapinlac, 2019). Search engines, which play a crucial role in the digital ecosystem, are in fact built on web scraping, helping us link to pertinent information. Mint is another application of legal and authorized web scraping, where users authorize access to their financial information to track expenses, spending habits, and budgets. Web scraping allows many small businesses to compete and grow in the marketplace by automating laborious work with little investment and diverting their energies to more productive activities (Macapinlac, 2019).

Social media profiles are considered public information, freely accessible and impossible to “hack”. Such data is equivalent to a “public square” (Macapinlac, 2019), open to be observed at will; the concepts of stealing or invading privacy do not apply. Big Web Data thus provides lucrative opportunities. Web scraping is immensely helpful to any researcher who needs to work with large volumes of data that would otherwise be unavailable. Consumers may find it helpful too, especially when they use price-crawler portals to avail themselves of the best possible deal.

Ethically conscious scraping helps overcome the drawbacks of traditional data sources and cumbersome access methods. Researchers can keep up with their need for valid data, efficiently organized, which aids the timeliness and precision of their undertakings. Responsible scraping can help organizations gain insights and execute impactful strategies. Organizations that master the craft of web data scraping can better understand their customers and refine their product strategy.

Counter Arguments on Alternative Perspective
Currently there is no legislation specific to web scraping, and older laws barely cover its legal and ethical nuances. That fewer than 5% of the roughly 150 research papers on the topic take a socio-technical view suggests a dearth of oversight. Several cases have come before the federal courts. Giant corporations want to run price-scraping software against small companies but do not want bots scraping their own sites. Some organizations thrive on gaining insights from web scraping and selling them. Most disputes involve using web scraping to better one’s own business model at the cost of others, rather than for academic or research purposes.

Harms caused by web scraping cannot always be predicted or anticipated. Layering information from online and offline sources could expose personal and sensitive data and individual identities. Such data could encompass health records, biometrics, search data, political affiliations, hobbies, photos, job history, etc., potentially enabling customer profiles that are then used to make decisions about those individuals with detrimental consequences. Website owners could do their part by prohibiting automated scraping to protect their customers’ information (a robots.txt sketch follows below). Proactive measures to address these concerns could avert expensive, time-draining lawsuits and reputation damage. Companies like LinkedIn, Dice, etc. should educate their consumers about privacy options and potential threats while also mitigating privacy implications and securing data.
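
One concrete (if only advisory) mechanism for site owners is a robots.txt file published at the site root, signaling which paths automated crawlers may not fetch. Note that robots.txt is a convention honored by compliant bots, not an access control, and the paths below are hypothetical:

```
# A hypothetical robots.txt served at https://example.com/robots.txt.
# Advisory only: compliant bots honor it, but it does not block access.
User-agent: *
Disallow: /accounts/
Disallow: /private/
Crawl-delay: 10
```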

The Computer Fraud and Abuse Act (CFAA), “trespass to chattels” (tort law expanded to electronic property), and copyright infringement laws are outdated and inadequate to cover the technological impacts and ramifications of web scraping, yet they are still predominantly used to resolve disputes over illegal access to and use of data. Web scraping is here to stay. There is a pressing need to define and amend its legal and ethical boundaries to ensure that public data is truly available to all entities availing themselves of it. Some countries take a deontological approach to ethics and privacy (focused on means and duties rather than outcomes), while others take a utilitarian approach (the greatest good for the greatest number). A holistic sociotechnical perspective needs to be adopted to establish universal ethical norms, principles, and best practices that protect the business entities hosting public information, those trying to access it without facing litigation, and ultimately the individuals whose data is at stake.

