Articles on Data Collection

Governments are the best sources of data

A column in Bloomberg Businessweek (July 2014, at 10) argues against House Republicans’ efforts to prevent the Department of Education from collecting and publishing data on college costs. Without good information on such matters as the all-in cost of attending a school or graduation rates, prospective students will be left mostly in the dark.

The column brought to mind that when governments require data to be submitted and make it available to the public, the data is much more reliable, comprehensive, and timely than data collected by other means. Voluntary efforts lead to low compliance and selection bias; efforts by publishers or players in a market can never reach a government agency’s level of certitude; and privately collected data is, well, private. If you want data collected over time so that you can tease out trends, the problems of non-governmental data are magnified.

To my knowledge, no federal or state government agency obtains and makes public any information about either corporate legal departments or private law firms. There is data about the legal industry sector and labor numbers (employees, gross revenue, possibly numbers of firms) but nothing else. In particular, data is lacking about individual law firms. You can painfully extract some data from sources such as EDGAR filings or patent applications, but the collecting agencies are not focused on metrics regarding legal industry participants.

The best that legal-industry analysts can obtain comes from their own efforts or from the data collection of others, flawed and incomplete though those sources may be. Even with that somewhat pessimistic summary, I stoutly maintain that much more can be learned from legal industry data sources and analyses.

Understanding the underlying metrics: an example from search engine results

I was interested in how many times certain law departments show up in Google search results. When I searched “Google law department”, Google returned what it determined to be the top 10 web pages for that search. At the top of the first page, in modest grey font, it said “Page 1 of 4,060 results (0.16 seconds)”. In fact, those “results” merely estimate the total number of hits the search would have found had the engine exhaustively scoured everything it has indexed on the Web. The figure is an estimate, not a count of actual hits.

Moreover, the grey results number drops as you call up subsequent pages of 10 results each. The second page showed 4,050 results, while the third and final page showed 25. Eventually the estimate stabilizes, as it did on this search at 25. My second search, for “Microsoft law department”, started at 242 results, but that estimate had shrunk to 37 by the fifth and final page.

Out of curiosity, I ran identical searches on Bing. The “Google law department” search settled at 15 results on the last of its two pages, while the “Microsoft law department” search settled at 22 on its second page. I do not know why Google stabilized at roughly two-thirds more results than Bing for both searches (25 versus 15, and 37 versus 22).
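Anyone who wants to track these estimates systematically, rather than by clicking through result pages, could do so through a search API. The sketch below uses Google’s Custom Search JSON API, which is a separate service from the public search page, so its counts will differ; the API key, search engine ID, and query are placeholders you would supply yourself.

    import requests

    # Placeholder credentials -- you would supply your own API key and
    # custom search engine ID from the Google developer console.
    API_KEY = "YOUR_API_KEY"
    ENGINE_ID = "YOUR_SEARCH_ENGINE_ID"
    QUERY = "Google law department"

    # Walk through successive pages of ten results and record the
    # estimated total that the API reports for each page.
    for page in range(5):
        params = {
            "key": API_KEY,
            "cx": ENGINE_ID,
            "q": QUERY,
            "start": page * 10 + 1,  # 1-based index of the first result on the page
        }
        response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
        data = response.json()
        estimate = data.get("searchInformation", {}).get("totalResults")
        items = data.get("items", [])
        print(f"Page {page + 1}: estimated {estimate} results, {len(items)} actual items")
        if len(items) < 10:  # reached the last page of results
            break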

My point goes to the heart of data analysis. You have to do your best to understand the accuracy and reliability of numbers that you use. Then, you owe it to those who might rely on your results to explain them as well as you can and to point out possible limitations in those numbers.

Web-scraping for data and the power of Application Program Interfaces (APIs)

An API, short for application program interface, is software that lets programmers interact with a program, for example to extract information from an online site. Many websites offer their own APIs; Amazon’s and eBay’s, for instance, let developers build specialized web stores on those platforms.
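As a simple illustration of what working with a program through an API looks like, the sketch below sends an HTTP request to a REST-style endpoint and reads the structured data it returns. The URL, parameters, and field names here are invented stand-ins; a real service publishes its own endpoint and usually requires registration.

    import requests

    # A hypothetical REST endpoint -- substitute the documented URL of
    # whatever service you are authorized to query.
    ENDPOINT = "https://api.example.com/v1/companies"

    # Most web APIs accept parameters in the query string and return JSON,
    # which Python turns into ordinary dictionaries and lists.
    response = requests.get(ENDPOINT, params={"industry": "legal", "limit": 25})
    response.raise_for_status()          # stop if the server reports an error
    for company in response.json():      # assumes the API returns a JSON list
        print(company.get("name"), company.get("headcount"))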

To search Twitter, for example, and find out how many tweets there have been about a particular general counsel, you have to be authorized to use the Twitter API. That step takes a bit of work, but you eventually receive a personalized set of access codes (called keys).
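Twitter’s API has gone through several versions, so take this only as a rough illustration: assuming the current v2 API and a bearer token among the keys you receive, a search request might look like the sketch below. The token and the general counsel’s name are placeholders.

    import requests

    # Placeholder credential issued when you register a Twitter application.
    BEARER_TOKEN = "YOUR_BEARER_TOKEN"

    # Twitter's v2 "recent search" endpoint returns tweets from roughly the
    # past week that match the query string.
    url = "https://api.twitter.com/2/tweets/search/recent"
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    params = {"query": '"Jane Doe" general counsel', "max_results": 100}

    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    payload = response.json()
    print(payload.get("meta", {}).get("result_count"), "matching tweets returned")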

Once you have API access, you can search and retrieve. Using the data returned, you can turn to other software to analyze the frequency, volume, and content of the tweets. In a world dense with online information, some facility with APIs will be crucial for anyone who wants to harvest that trove.
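To give a feel for that “other software” step, the short sketch below takes a list of tweet texts, however retrieved, and tallies how often each word appears: a crude but useful first pass at content analysis. The sample tweets are invented for illustration.

    from collections import Counter
    import re

    # Invented sample data standing in for tweets returned by an API call.
    tweets = [
        "New general counsel announced at Example Corp",
        "Example Corp law department expands its compliance team",
        "General counsel of Example Corp speaks on data privacy",
    ]

    # Simple content analysis: split each tweet into lowercase words and
    # count how often each word appears across the whole set.
    words = Counter()
    for tweet in tweets:
        words.update(re.findall(r"[a-z']+", tweet.lower()))

    print("Total tweets:", len(tweets))
    for word, count in words.most_common(5):
        print(f"{word}: {count}")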