Articles Posted in Data Collection

Published on:

A column in Bloomberg BusinessWeek (July 2014, at 10) argues against House Republicans’ efforts to prevent the Department of Education from collecting and publishing data on college costs. Without good information on such matters as the all-in cost of attending a school or its graduation rate, prospective students will be left mostly in the dark.

The column brought to mind that when governments require data to be submitted and make it available to the public, the data is much more reliable, comprehensive, and timely than data collected by other means. Voluntary efforts lead to low compliance and selection bias; efforts by publishers or players in a market can never reach a government agency’s level of certitude; and privately collected data is, well, private. If you want data collected over time so that you can tease out trends, the problems of non-governmental data are magnified.

To my knowledge, no federal or state government agency obtains and makes public any information about either corporate legal departments or private law firms. There is data about the legal industry sector and its labor numbers (employees, gross revenue, possibly the number of firms), but nothing else. In particular, data is lacking about individual law firms. You can painfully extract some data from sources such as EDGAR filings or patent applications, but the collecting agencies are not focused on metrics regarding legal industry participants.
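
By way of illustration, here is a rough sketch of what that kind of extraction might look like, using EDGAR’s public Atom feed of company filings. The endpoint and parameter names reflect my understanding of EDGAR’s browse interface and should be treated as illustrative rather than definitive.

```python
# Sketch: listing a company's recent 10-K filings from SEC EDGAR's Atom feed.
# Endpoint and parameter names are my best understanding of the public browse
# interface; treat them as illustrative.
import requests
import xml.etree.ElementTree as ET

EDGAR_URL = "https://www.sec.gov/cgi-bin/browse-edgar"
params = {
    "action": "getcompany",
    "company": "Microsoft",  # search by company name
    "type": "10-K",          # filing type to list
    "output": "atom",        # ask for machine-readable Atom XML
    "count": "10",
}
headers = {"User-Agent": "research-script example@example.com"}  # EDGAR expects a contact in the user agent

response = requests.get(EDGAR_URL, params=params, headers=headers)
response.raise_for_status()

# Walk the Atom entries and print each filing's date and title.
ns = {"atom": "http://www.w3.org/2005/Atom"}
root = ET.fromstring(response.content)
for entry in root.findall("atom:entry", ns):
    updated = entry.findtext("atom:updated", default="", namespaces=ns)
    title = entry.findtext("atom:title", default="", namespaces=ns)
    print(updated, title)
```

Even when a script like this works, it yields filings and dates, not the law-firm or law-department metrics we actually want.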

Published on:

I was interested in how many times certain law departments show up in Google search results. When I searched “Google law department”, Google returned what it determined to be the top 10 web pages for that search. At the top of the first page, in modest grey font, it said “Page 1 of 4,060 results (0.16 seconds)”. In fact, those “results” merely estimated the total number of “hits” the search would have found had the search engine carefully scoured everything it had indexed on the Web. They are not actual hits.

Moreover, the grey results number drops as you call up subsequent pages of 10 results each. The second page showed 4,050 results, while the third and final page showed only 25, and there the estimate stabilized. My second search, for “Microsoft law department”, started at 242 results, but that estimate shrank to 37 by the fifth and final page.
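
You can watch the same shrinking estimate programmatically. The sketch below uses Google’s Custom Search JSON API, which reports the estimate as totalResults; the API key and search engine ID are placeholders you would have to obtain from Google, and the query simply mirrors mine.

```python
# Sketch: watching Google's result-count estimate page by page through the
# Custom Search JSON API. API_KEY and CX are placeholders obtained from Google.
import requests

API_KEY = "YOUR_API_KEY"           # placeholder
CX = "YOUR_SEARCH_ENGINE_ID"       # placeholder
ENDPOINT = "https://www.googleapis.com/customsearch/v1"

query = '"Google law department"'
for page in range(1, 6):                    # up to five pages of 10 results each
    start = (page - 1) * 10 + 1             # 1-based index of the first result
    params = {"key": API_KEY, "cx": CX, "q": query, "start": start, "num": 10}
    data = requests.get(ENDPOINT, params=params).json()
    estimate = data.get("searchInformation", {}).get("totalResults")
    items = data.get("items", [])
    print(f"page {page}: estimated {estimate} results, {len(items)} returned")
    if len(items) < 10:                     # a short page means we have hit the end
        break
```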

Out of curiosity, I ran identical searches on Bing. There, the “Google law department” search returned 15 results by the last of its two pages, while the “Microsoft law department” search returned 22 on its second page. I do not know why Google stabilized at roughly two-thirds more results than Bing for both searches (25 versus 15, and 37 versus 22).
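
Bing exposes its own estimate through the Bing Web Search API as totalEstimatedMatches. A companion sketch, with the subscription key again a placeholder and the endpoint reflecting version 7 of that API as I understand it:

```python
# Sketch: pulling Bing's estimated match counts for the same two searches
# through the Bing Web Search API (v7). SUBSCRIPTION_KEY is a placeholder.
import requests

SUBSCRIPTION_KEY = "YOUR_BING_KEY"   # placeholder
ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"

for query in ('"Google law department"', '"Microsoft law department"'):
    resp = requests.get(
        ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
        params={"q": query},
    )
    resp.raise_for_status()
    estimate = resp.json().get("webPages", {}).get("totalEstimatedMatches")
    print(query, "-> estimated", estimate, "matches")
```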

Published on:

An API, an acronym for application programming interface, is software that lets programmers work with a program, such as to extract information from an online site. Many websites offer APIs of their own. For example, the Amazon and eBay APIs allow developers to build specialized web stores on top of those platforms.
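
In practice, working with an API usually means a program requesting structured data from an endpoint instead of a person reading a web page. As a simple illustration (using GitHub’s public API only because it requires no keys; it is not one of the APIs mentioned above):

```python
# Illustration: a program asks an API endpoint for structured data (JSON)
# rather than scraping a web page. GitHub's public REST API is used only
# because it needs no credentials.
import requests

response = requests.get("https://api.github.com/repos/python/cpython")
response.raise_for_status()
repo = response.json()                       # structured data, not HTML
print(repo["full_name"], "-", repo["stargazers_count"], "stars")
```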

To search Twitter, for example, and find out how many tweets there have been about a particular general counsel, you have to be authorized to use the Twitter API. That step takes a bit of work, but you eventually receive a personalized set of access codes (called keys).
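
Once you have those keys, the authorization itself is only a few lines. A sketch using the tweepy library; the general counsel’s name is made up, and the exact method names have shifted across tweepy and Twitter API versions, so take this as a pattern rather than gospel:

```python
# Sketch: authenticating to the Twitter API with the keys Twitter issues and
# running a search with the tweepy library. Method names vary by version;
# the name searched for is made up.
import tweepy

CONSUMER_KEY = "..."            # the four credentials come from registering
CONSUMER_SECRET = "..."         # an application with Twitter
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

tweets = api.search(q='"Jane Doe" general counsel', count=100)
print(len(tweets), "tweets retrieved")
```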

Once you have API access, you can search and retrieve. Using the data returned, you can turn to other software to analyze the frequency, volume, and content of the tweets. In a world dense with online information, some facility with APIs will be crucial for anyone who wants to harvest that trove.
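
That analysis can start very simply. Continuing the sketch above, and assuming each returned tweet object carries a created_at timestamp and its text, counting tweets by day and by word takes only the standard library:

```python
# Sketch: simple frequency and content analysis of the tweets retrieved above,
# assuming each tweet object has .created_at (a datetime) and .text attributes.
from collections import Counter

tweets_per_day = Counter(t.created_at.date() for t in tweets)
common_words = Counter(
    word.lower()
    for t in tweets
    for word in t.text.split()
    if len(word) > 4                # crude filter for substantive words
)

print("Tweets per day:", dict(tweets_per_day))
print("Most common words:", common_words.most_common(10))
```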