Articles on Data Science

One of the Ten Commandments of programmers working with data: record each change you make!

One hugely important lesson branded into me from analyzing data is the importance of step-by-step procedures. This may sound elementary, but when you start with an Excel file of data from a client, it is crucially important to keep an audit trail of each step of your transformations and calculations.

If you change the names of columns so that they are consistent with code you have already written, you should record and store each change. If you add another variable [think of a variable as a column in Excel], you need to track how you made that addition. Do likewise for any calculations, such as external spending per lawyer. And, by the way, comments along the way complement your efforts to be logical and measured.

Data preparation always involves learning as you go, so if you haven’t saved the steps you have taken, you create nightmares of uncertainty about the quality of your data. Or you can’t figure out how you got (or failed to get) some result later on.

I visualize data preparation as starting from the original data and then methodically molding it: cleaning it, re-arranging it, adding to it, sub-setting it, and naming it. When you save that sculpting, you can go back and confirm each step, or alter one or more of them, and be confident that the final data set gives you a consistent, accurate, and reliable platform for graphics and exploratory data analysis.
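A minimal sketch of such an audit trail, using pandas and entirely made-up client data (the firm names, lawyer counts, and spend figures are hypothetical); because every step lives in the script, rerunning it from the original file reproduces the final data set exactly:

```python
import pandas as pd

# Hypothetical client data as it might arrive from an Excel file
raw = pd.DataFrame({
    "Firm Name": ["Alpha LLP", "Beta LLC"],
    "Lawyers": [120, 45],
    "Outside Spend": [3_600_000, 900_000],
})

# Step 1: rename columns so they match code written earlier
df = raw.rename(columns={
    "Firm Name": "firm",
    "Lawyers": "lawyers",
    "Outside Spend": "ext_spend",
})

# Step 2: add a derived variable -- external spending per lawyer
df["spend_per_lawyer"] = df["ext_spend"] / df["lawyers"]

print(df[["firm", "spend_per_lawyer"]])
```

Each transformation is recorded in code rather than done by hand in the spreadsheet, so every step can be confirmed or altered later.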

Subsetting and aggregating: two fundamental programming steps for analysts of data

Two very common steps for a data analyst are to subset data or to aggregate data. When you write code that subsets data, you instruct the computer to pick out a portion of the data and work with that smaller set. For example, if you have data on law firm mergers, you might want to isolate the mergers in a single state or in a particular year. You would subset the larger data collection so that only the particular state or year would be worked on thereafter. Or you might want to isolate the states of a particular region. In all these instances, you would need the workhorse of programming: subsetting.
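As a sketch of how subsetting looks in code, here is a pandas version using a made-up table of law firm mergers (the states, years, and lawyer counts are invented for illustration):

```python
import pandas as pd

# Hypothetical law firm merger data
mergers = pd.DataFrame({
    "state": ["NY", "CA", "NY", "TX"],
    "year": [2013, 2013, 2014, 2014],
    "acquirer_lawyers": [200, 150, 80, 300],
})

# Subset to a single state ...
ny_mergers = mergers[mergers["state"] == "NY"]

# ... or to a particular year within a region
northeast = {"NY", "NJ", "CT", "MA"}
regional_2014 = mergers[(mergers["year"] == 2014) &
                        mergers["state"].isin(northeast)]

print(len(ny_mergers), len(regional_2014))
```

Everything after these lines would then work only on the smaller sets.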

The complementary operation to subsetting is aggregating. Pivot tables in Excel perform aggregation quite easily, and every programming language that does quantitative analysis has the same capability. Very commonly, a data analyst writes code so that data is combined. Staying with the law-firm merger example, a short program segment (actually, only a line or two of code) would add up all of the lawyers in the acquiring firms of a particular state. The computer will dutifully aggregate that amount.
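Continuing the same hypothetical merger table, the aggregation really is a single line of code:

```python
import pandas as pd

# Same hypothetical merger data as in the subsetting sketch
mergers = pd.DataFrame({
    "state": ["NY", "CA", "NY", "TX"],
    "year": [2013, 2013, 2014, 2014],
    "acquirer_lawyers": [200, 150, 80, 300],
})

# Aggregate: total lawyers in acquiring firms, by state
lawyers_by_state = mergers.groupby("state")["acquirer_lawyers"].sum()

print(lawyers_by_state["NY"])  # the two NY rows: 200 + 80 = 280
```

This is exactly what an Excel pivot table does, expressed as one line of a reproducible script.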

Many graphical plots present either subsetted data or aggregated data, or both. The two concepts and the program code that carry them out are ubiquitous in data science.

Ideologies trump arguments based on data

I had just written about levels of state regulatory burdens when I read two editorials in the New York Times, July 7, 2014 at A17. One of them describes four ways that GDP calculations mismeasure the size of our economy. For example, the author writes “In its first 20 years, the Clean Air Act generated health savings and other benefits valued at $22 trillion, compared with $500 billion in compliance costs.” He points out that the net gain is not counted in GDP. But my point is that some people will cheer that finding and accept it; others will jeer at it and vehemently reject the methodology as well; almost no one will reconsider their views.

Coincidentally, right next to that editorial, Paul Krugman bemoans the disjunction he perceives in many people between the beliefs they hold and how they process facts: “Confronted with a conflict between evidence and what they want to believe for political and/or religious reasons, many people reject the evidence.” Worse, the better informed they are, the more fervently opponents will toss out the contrary findings.

Those of us who collect and analyze data that reflect law department or law firm management decisions come to realize that the best benchmark data, the most insightful correlations, and the clearest graphs stand almost no chance of persuading, or even informing, those who “just know” something different. Incentives work; money matters; technology speeds up; law firms gouge; convergence saves …. All of us, even as we cherish our self-image as being thoughtful, willing to change our minds, and open to different beliefs, are for the most part in ideological straitjackets.

The legal industry loses when legal management data stays guarded and proprietary

I am deeply invested in this topic, but I am struggling with it. It is a huge topic that deserves multiple thoughtful posts, but I will stick my toe into the water because of a piece in MIT Tech. Rev., May/June 2014 at 10. David Lazer, a professor of political science and computer science at Northeastern University, wrote about methodological shortcomings of data analysis.

One lesson Lazer draws is that “methods and data should be more open.” Applied to the world of data that managers of lawyers would like, it means that those who collect legal management data and publish results should explain how they collected it, what pre-processing they did (meaning, how did they clean the data before they ran their analyses), and what limitations they are aware of in their methodology.

Unfortunately, those of us not in academia who arduously and expensively gather hard-to-get data do so ultimately in order to make money. Vendors, consultants, publishers, and trade groups are not eleemosynary institutions. We don’t want to give away our blood, sweat, and metrics, let alone expose to the critical world all the trade-offs, data messes, and tough decisions we made regarding that data. Yet if we are not more open about our efforts, others can’t help us improve, nor can they reuse the data for other purposes or complement it with related metrics. Proprietary data stunts progress.

In short, while the underlying data and analytic steps on law firm revenue, numbers of law suits, law firms used, staffing, and other important management areas remain closed and largely unexplained, managers of lawyers can’t improve how they manage as rapidly.

Why it’s almost certainly wrong to claim that law firm costs “are rising exponentially”

When something increases by the same fraction or percentage, rather than by the same amount, during each period of time, the numerically savvy call it exponential growth. It would almost certainly be inaccurate for a general counsel to announce that outside counsel spending by her department, for example, has grown exponentially for a period of years. It might have doubled from one year to the next, but it is highly unlikely that it doubled again the following year, let alone doubled once more in the most recent year. Too many people use “exponential” as an adjective when they really mean “dramatic”.

Had the general counsel’s spend been $1 million in year one, exponential doubling would mean $2 million in year two, $4 million in year three, and $8 million in year four. That level of explosive growth can only happen for a short time and starting from a small base. This observation was prompted by Samuel Arbesman, The Half-Life of Facts: Why Everything We Know Has an Expiration Date (Current 2012).
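The arithmetic behind that contrast, sketched in a few lines of Python with the $1 million starting figure from above:

```python
# A $1 million spend doubling each year (exponential growth)
# versus one growing by $1 million each year (linear growth)
base = 1_000_000
exponential = [base * 2 ** t for t in range(4)]
linear = [base + 1_000_000 * t for t in range(4)]

print(exponential)  # [1000000, 2000000, 4000000, 8000000]
print(linear)       # [1000000, 2000000, 3000000, 4000000]
```

By year four the doubling path reaches $8 million while the steady-increase path reaches only $4 million, and the gap widens every year thereafter, which is why genuinely exponential legal spend cannot persist for long.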

For lawyers, the importance of “using data to make decisions”

Rick Klau has spent most of his professional life with Internet startups. Now with Google Ventures, Klau keynoted this year’s ABA Techshow. The ABA Journal, June 2014 at 37, summarized three lessons Klau offered for the legal industry.

His first lesson “emphasized the importance of using data to make decisions.” Klau argued that “facts and figures are exponentially more important than gut feelings and informed opinions.”

This blogger agrees completely. Whether a decision concerns the choice of counsel for a matter, the promotion of a lawyer, the selection of a matter management system, the maintenance of a patent, the client satisfaction initiatives to undertake, or other calls, data is available, or can be collected, that informs thinking. No, algorithms can’t replace experience and judgment, but data analysis significantly strengthens the quality of decisions by managers of lawyers. Even when data simply challenges commonly held assumptions, that helps. Numbers help people make better decisions.