Opening Doors with Open Data

Posted on March 3, 2017

[Image: social network analysis visualization]

Disclosure: The author is a member of the Harvard Open Data Project. The group pushes for open data policies at Harvard and conducts projects with public datasets.

Stores of data are growing so quickly that we now create as much data in two days as the entirety of mankind did up until 2003. Now that we have the power to collect data, what are we doing with it?

The recent rise of big data

Big data is the practice of using collected data to create algorithms, models, and analytics to better understand human behavior. It allows companies to provide targeted advertisements to consumers, suggest shows to watch, and perhaps most frighteningly, detect when customers are pregnant. Data has become a key engine for businesses to optimize their business practices and offer personalized experiences to consumers, benefits that research estimates create $3 trillion in additional economic value every year. While corporations have led the way in harnessing the powers of big data, government agencies are slowly modernizing their efforts as well. For instance, the United States Postal Service was able to drastically increase the efficiency of its delivery routes, saving millions of dollars in fuel costs and decreasing the number of trucks used on a daily basis.

Data proponents argue that big data can be bolstered when data is made widely available as open data. When organizations release their data to the public, innovation occurs more frequently. For instance, Google attempted to use search analytics to track the spread of influenza but failed during the 2013 flu season. Its real-time predictions of flu levels differed from Centers for Disease Control and Prevention ground-level data by 140 percent. Using the same data that drove Google Flu Trends, researchers led by Harvard statistics professor Samuel Kou were able to devise an improved model that is many times more accurate than Google Flu Trends. Situations like this are prime examples of why opening up data is important: datasets are growing larger and more chaotic, so more brains tinkering with models and analytics will inevitably result in more successes like Kou’s.

While it is understandable that companies like Google are unwilling to release their in-house datasets because they contain valuable business insights, data scientists argue that government agencies at all levels should do so by having their data “default to open.” Default-to-open policies would require governments to openly release all datasets, with the exception of those that are classified or otherwise private. This differs from the status quo, where government data are closed by default and third parties must petition to gain access to even the most innocuous datasets. Several governments have already adopted open data policies. Harvard, as a leading research institution, should learn from the successes of these initiatives and take steps toward implementing its own open data policy.

Open Data in Government

The arguments for open data inspired United States Chief Information Officer Vivek Kundra to create Data.gov in May 2009 as a centralized location for over 180,000 datasets that belong to the United States federal government. Data.gov represents the first comprehensive open data effort in the federal government.

Researchers have already taken advantage of these datasets. A public-private partnership between Microsoft and the United States Department of Agriculture (USDA) encouraged researchers to make use of USDA datasets in fostering innovations that help American farmers prepare for the agricultural impact of climate change. Collaboration and innovation are made much easier when the means of innovating, datasets in many cases, are easily accessible.

Local governments such as those in Cambridge and Boston have followed suit, developing open data ordinances of their own. Cambridge’s Open Data Initiative cites “creating meaningful opportunities for the public to help solve complex challenges” and “improving delivery of city services” among its many goals. Open data proponents believe that requiring governments to openly release data will foster a greater sense of trust and transparency between constituents and their governments.

Open Data at Harvard

Attitudes towards open data at Harvard and similar institutions of higher learning fall somewhere between those in the private and public sectors. The case for data being open remains the same, but open data policies at Harvard have been met with bureaucratic roadblocks and privacy concerns. Jim Waldo, Chief Technology Officer at the School of Engineering and Applied Sciences, said, “Part of the problem here is that Harvard as a whole doesn’t have an institutional approach [to open data].” The absence of a University-wide policy about data creates the current situation where oftentimes it is unclear even to administrators where ownership of a certain dataset lies. Across Harvard’s twelve degree-granting schools, data is usually controlled by each school’s respective Registrar. According to Waldo, there are policy ambiguities regarding whether registrars hold the power to open up their school’s data.

Harvard has created its own version of Data.gov to house public datasets at data.harvard.edu. Though the university presumably owns thousands of datasets, only 23 are available through the portal. This limitation on the supply of datasets is prime evidence for the amount of bureaucratic red tape that surrounds the University’s data.

In the conversation surrounding open data at Harvard, it is important to recognize legitimate privacy concerns that prevent the University from fully releasing all data. The Family Educational Rights and Privacy Act (FERPA) protects the privacy of student records. Additionally, data that reveals how applicants are admitted to the University and unfinished research ought not be made available to the public. These complications may mean that a “default to open” data policy is unrealistic for Harvard, but there are still ways Harvard can improve its current system.

Firstly, the need for a consistent University-wide open data policy is clear. According to Waldo, “all of the data is around, how to open it is more a problem of bureaucracy than it is of technology […] One of the aspects of a large bureaucracy like we have at Harvard is lots of people have the ability to say ‘no’ and it’s not clear who has the ability to say ‘yes.’” Vesting a single University officer appointed by the President with the power to open up datasets would streamline the current dysfunctional system. This officer could manage open data through a system built around the University’s current information security policy on data. Data currently classified at Level 2, “information the University has chosen to keep confidential but the disclosure of which would not cause material harm,” is an obvious candidate to default to open.

Secondly, Harvard could simply decide to open more data requested by students. Waldo points to efforts led by Harvard University IT (HUIT) to increase the amount of data that is being opened to members of the Harvard community as an example of the win-win nature of open data. In opening University data, HUIT grants students the power to help develop the applications that they want most, like Mange for House grill orders or Omni for shuttle tracking. In the cases of both Mange and Omni, the student-led projects were eventually transferred to the appropriate University office for widespread use on campus. The successes of these programs offer a glimpse into a future where students can use University data and resources to improve their own experiences.

Another notable example of open data driving innovation and having a measurable effect on the student experience is the development of courses.cs50.net. The unofficial course tool is incredibly popular among students and has usage that rivals the official my.harvard offering. When FAS course data and Q guide feedback were made available, developers associated with Harvard’s introductory computer science course were able to create a more intuitive web application to help students better navigate course selection and shopping week.

What is the future of open data?

A new student group at Harvard is spearheading the push towards open data. After observing the growing role of open data at the federal level, Harvard undergraduates Neel Mehta ’18, Athena Kan ’19, and Brian Sapozhnikov ’19 founded the Harvard Open Data Project (HODP) in the spring of 2015. The group began with the construction of the Harvard Open Data Portal to serve as a single destination for all data belonging to Harvard University. According to Mehta, HODP’s mission is twofold: it aims to generate awareness of and excitement for what can be done with Harvard’s existing datasets while advocating for a new policy that requires more university data to be opened up.

HODP soon grew beyond the maintenance of the data catalog, adding numerous project teams led by students seeking to use Harvard data to drive change. HODP projects currently in motion include a collaboration with the Office for Sustainability to monitor energy consumption and encourage efficiency in Houses and a partnership with the Harvard University Police Department to map crime reports and identify the crime-prone areas on campus.

Co-founder Athena Kan’s motivation towards achieving HODP’s goals lies in her belief that oftentimes “working with data has a high barrier to entry” and that “needless secrecy” regarding closed datasets impedes innovation. Over the course of its relatively short existence, members of HODP have run into the same bureaucratic and institutional roadblocks mentioned by Waldo. The group hopes that through their work, they can showcase to administrators the many areas in which open data policies can positively affect student life.

At Harvard, an institution renowned for knowledge and understanding, open data policies will help further the University’s mission. Put best by HODP co-founder Brian Sapozhnikov, “Open data serves as both a common ground which allows for inclusivity and publicity of knowledge, and a foundation for creating models and applications which allow us to understand ourselves and the world around us better.”

Image Source: Wikimedia/Martin Grandjean