
Subsetting and aggregating: two fundamental programming steps for analysts of data

Two very common steps for a data analyst are subsetting data and aggregating data. When you write code that subsets data, you instruct the computer to pick out a portion of the data and work with that smaller set. For example, if you have data on law firm mergers, you might want to isolate the mergers in a single state or in a particular year. You would subset the larger data collection so that only that state or year is worked on thereafter. Or you might want to isolate the states of a particular region. In all these instances, you would need a workhorse of programming: the subset.
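As a rough illustration of subsetting, here is a minimal Python sketch using only the standard library. The merger records, the field names, the `subset` helper, and the choice of states that make up the region are all invented for this example; real data would more likely sit in a spreadsheet or a data frame.

```python
# Hypothetical merger records: firm names, states, years, and lawyer
# counts are invented purely for illustration.
mergers = [
    {"acquirer": "Firm A", "state": "NY", "year": 2015, "acquirer_lawyers": 120},
    {"acquirer": "Firm B", "state": "CA", "year": 2015, "acquirer_lawyers": 80},
    {"acquirer": "Firm C", "state": "NY", "year": 2016, "acquirer_lawyers": 45},
    {"acquirer": "Firm D", "state": "TX", "year": 2016, "acquirer_lawyers": 200},
]


def subset(rows, **conditions):
    """Return only the rows whose fields equal every given condition."""
    return [
        row
        for row in rows
        if all(row[field] == value for field, value in conditions.items())
    ]


# Isolate the mergers in a single state, or in a particular year.
ny_mergers = subset(mergers, state="NY")
mergers_2016 = subset(mergers, year=2016)

# Isolate the states of a region (this set of states is an assumption).
northeast = {"NY", "NJ", "CT"}
northeast_mergers = [row for row in mergers if row["state"] in northeast]
```

Every later step can then work on `ny_mergers` or `mergers_2016` instead of the full collection.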

The complement of subsetting is aggregating. Pivot tables in Excel perform aggregation quite easily, and every programming language used for quantitative analysis offers a way to do it. Very commonly, a data analyst writes code that combines data into totals, counts, or averages. Staying with the law-firm merger example, a short program segment, often only a line or two of code, would add up all of the lawyers in the acquiring firms of a particular state. The computer will dutifully compute that total.
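A sketch of that aggregation in Python might look like the following. The records and field names are again hypothetical, and `defaultdict` is just one of several ways to accumulate totals per group.

```python
from collections import defaultdict

# Hypothetical merger records: all names and numbers are invented.
mergers = [
    {"acquirer": "Firm A", "state": "NY", "acquirer_lawyers": 120},
    {"acquirer": "Firm B", "state": "CA", "acquirer_lawyers": 80},
    {"acquirer": "Firm C", "state": "NY", "acquirer_lawyers": 45},
    {"acquirer": "Firm D", "state": "TX", "acquirer_lawyers": 200},
]

# Aggregate for one state: add up the lawyers in its acquiring firms.
ny_total = sum(
    row["acquirer_lawyers"] for row in mergers if row["state"] == "NY"
)

# Aggregate for every state at once, much as a pivot table would.
lawyers_by_state = defaultdict(int)
for row in mergers:
    lawyers_by_state[row["state"]] += row["acquirer_lawyers"]
```

The single-state total first subsets the rows to one state and then sums them, which shows how often the two steps appear together.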

Many graphical plots present subsetted data, aggregated data, or both. The two concepts, and the program code that carries them out, are ubiquitous in data science.