
Today is the deadline for the White House OSTP RFI on digital data, and due to this time constraint I have generated my answer largely (but not exclusively!) by copying and pasting parts of John Wilbanks' and Kitware's responses. Now you should go and send them your response as well; they're expecting it, and Open Access depends on it!
To (1):
First, standards should be developed for grading data sharing plans, so that grant review panels know whether a specific data sharing plan is satisfactory, and so that for any given call for submissions the reviewers have a sense of how important data sharing is relative to the scientific goals of the project. Second, data sharing plans should be made public alongside the notices of awards and the contact information of the principal investigators, so that both taxpayers and scientists know what promises were made and how to contact a scientist to ask for data under the approved plan.
Third, tracking should make it possible to estimate compliance: annual grant review forms should contain fields where the researcher is obliged to enter URLs to data shared under the plan (or, if left blank, explain why). It should also be easy to create a data request system in which those asking for data send a copy of their request to the grants database, which can then be cross-referenced against the review forms to provide at least a rough estimate of compliance. And fourth, scientists with a record of subpar execution against their data sharing plans should be downgraded in their applications for new funding. Taken together, these four elements create an incentive structure that would substantially strengthen scientists' incentive to provide public access to the digital data resulting from federally funded research.
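
As a rough illustration of the cross-referencing step, here is a minimal sketch in Python. The record layout (grant IDs, reported URLs, request records) is entirely hypothetical; a real grants database would of course be far richer.

```python
# Minimal sketch of the compliance estimate described above. The two record
# sets -- URLs reported on annual review forms and data requests cc'd to the
# grants database -- use made-up field names, not any real schema.

# URLs a researcher reported on the annual review form, keyed by grant.
review_forms = {
    "GRANT-0001": {"https://example.org/data/flies"},
    "GRANT-0002": set(),  # left blank: would have to be explained
}

# Copies of data requests sent to the grants database.
data_requests = [
    {"grant_id": "GRANT-0001", "requested_url": "https://example.org/data/flies"},
    {"grant_id": "GRANT-0002", "requested_url": "https://example.org/data/mice"},
]

def estimate_compliance(review_forms, data_requests):
    """Cross-reference requests against reported URLs for a rough compliance rate."""
    satisfied = sum(
        request["requested_url"] in review_forms.get(request["grant_id"], set())
        for request in data_requests
    )
    return satisfied / len(data_requests) if data_requests else 1.0

print(f"Estimated compliance: {estimate_compliance(review_forms, data_requests):.0%}")
# -> Estimated compliance: 50%
```
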
In tandem, the funding agencies might develop financial models for the preservation of these digital data, in much the same way that models exist for estimating overhead and other baseline costs as a percentage of the grant. This could not only fund new library services and jobs in the research enterprise but also serve as a non-dilutive funding source for a new breed of data science startup companies focused on preservation, governance, querying, integration, and access to digital data.
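
To make the percentage-of-grant idea concrete, a preservation budget could be computed exactly like an overhead rate. The 3% rate in this toy Python example is purely an assumed placeholder, not a proposed figure.

```python
# Toy illustration: a preservation set-aside computed like overhead,
# as a fixed percentage of a grant's direct costs. The rate is assumed.
def preservation_set_aside(direct_costs, rate=0.03):
    return direct_costs * rate

print(f"${preservation_set_aside(2_000_000):,.0f} set aside on a $2M grant")
# -> $60,000 set aside on a $2M grant
```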

To (2):
In addition to the stakeholders listed in this question, it is critical to note that the general public is one of the primary (if not the primary) stakeholders to be considered here. In the context of federally funded scientific research, it is the public's tax dollars that pay for the research being undertaken, and thus the public's interest is the first one that should be considered when making trade-offs between the available options.

Scientists who gathered data in federally funded scientific research did so as part of their job duties, and under U.S. copyright law they were therefore performing "work for hire." This means that their employers hold the copyright to any creative aspect of that data gathering (as pointed out above, this covers only the organization of data collections). Given that the scientists' employers received funds from the federal government, they should be expected to be subject to the same demands of the Federal Acquisition Regulations (FAR) as other contractors of the federal government, in particular with respect to the licensing of data acquired as part of federal contracts.
Some of the best examples of proper licenses are:


Federal agencies should identify a set of licenses that ensure the rights of the general public to deal with the data, in particular to copy, distribute, and create derivative works, thereby ensuring that the data reach their maximum economic potential and foster the growth of the economy.

To (3):
Working groups should be established for the different disciplines, involving representatives of each discipline's leading research institutions.
These working groups should accommodate discipline-specific differences in how data are represented, indexed, stored, and exchanged, but should not have the latitude to restrict the free dissemination of information in any way. All policies should share one common requirement: the immediate and full release of data, unconstrained by embargo periods or licensing restrictions. Credit for the acquisition of data could be ensured by data publications (e.g., http://datacite.org) that can be cited by subsequent works.
In this process, it is vital to invest in and commit to the emergence of standards that enable the interoperability, and thus the reuse, of digital data. Standards lie at the heart of the Internet and the World Wide Web, and together they lower the cost of failure so far that companies built on the web can begin in garages. Such is not the case in the sciences, and interoperability will not emerge spontaneously, even if data flow onto the web. As long as those data sit in a Tower of Babel of formats and incoherent names, and might move about every day, they will be a slippery surface on which to build value and create jobs. Federal policy could call for a standard method of providing names and descriptions both for digital data and for the entities represented in those data, like the proposed standard of the Shared Names project at http://sharedname.org.
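
As a toy illustration of why shared names matter, consider two hypothetical datasets that tag the same entities with the same standard identifiers (the URIs below are invented, in the spirit of the Shared Names proposal); joining them becomes a mechanical operation.

```python
# Illustrative sketch only: identifiers, labs, and values are made up.
# Two labs publish data about the same genes under a shared naming scheme.
lab_a = {
    "http://sharedname.example/gene/Adh": {"expression_level": 4.2},
    "http://sharedname.example/gene/per": {"expression_level": 1.1},
}
lab_b = {
    "http://sharedname.example/gene/Adh": {"phenotype": "ethanol tolerance"},
}

# Join on the shared identifier -- trivial when names agree, hopeless when
# every lab invents its own naming scheme.
merged = {
    name: {**lab_a.get(name, {}), **lab_b.get(name, {})}
    for name in lab_a.keys() | lab_b.keys()
}
print(merged["http://sharedname.example/gene/Adh"])
# -> {'expression_level': 4.2, 'phenotype': 'ethanol tolerance'}
```
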
Standards also make it far easier to give credit back to the scientists who make data available, and they increase the odds that a user gets enough value from the data to decide to give that credit. Embracing a standard identifier system for those who post data will make it easier to link back unambiguously to a researcher, and easier for grant review committees and universities to get a full picture of a scientist's impact, not just their publication list.

To (4):
The working groups in the different disciplines (from Question 3) should establish guidelines on dissemination and storage practices for different types of data. For example, in genomics it may be reasonable to store the secondary sequence information but not the primary sequences (given the great difference in data size). Analogously, the guidelines may require primary sequences to be stored for only 2 years, while secondary sequences should be stored for 10 years.

In astronomy, it may be required that certain types of images be stored for different periods of time, or with different compression ratios, thereby correlating storage cost with the expected benefit for future studies. In this cost-benefit evaluation, the original cost of acquiring the data should be taken into account: a project that invested $50M in acquiring data should not try to save a few hundred dollars on storage.
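
A hedged sketch of what such guidelines might look like in practice: the retention periods, storage price, and data sizes below are illustrative placeholders, not proposed values.

```python
# Discipline-specific retention guidelines plus the cost-benefit check
# described above. All numbers are assumed for illustration.
RETENTION_YEARS = {
    "genomics/primary_sequence": 2,     # large raw data, kept briefly
    "genomics/secondary_sequence": 10,  # much smaller derived data
    "astronomy/raw_image": 5,
    "astronomy/compressed_image": 25,
}

STORAGE_COST_PER_TB_YEAR = 50.0  # assumed dollars; falls as technology advances

def storage_budget(data_type, size_tb, acquisition_cost):
    """Lifetime storage cost and its share of the original acquisition cost."""
    cost = size_tb * STORAGE_COST_PER_TB_YEAR * RETENTION_YEARS[data_type]
    return cost, cost / acquisition_cost

cost, share = storage_budget("astronomy/raw_image", size_tb=100, acquisition_cost=50e6)
print(f"${cost:,.0f} over the retention period = {share:.3%} of acquisition cost")
# -> $25,000 over the retention period = 0.050% of acquisition cost
```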

Economists must be involved in the working groups chartered with providing guidelines for storage and dissemination, given that this is a problem in which the trade-offs for the benefit of society at large must be continually evaluated.

The policies of federal agencies should track the constant advances in storage technology and the rapid decrease in the cost of storage. The federal government should stimulate the development of storage technology, whether by creating large decentralized storage facilities, by creating consortia to manage data storage services, or by involving the public in distributed (and redundant) storage systems based on peer-to-peer technology, which has already proven capable of handling large amounts of data.

All these guidelines should be prepared following open and transparent procedures, in order to avoid proprietary standards and vendor lock-in, which would prevent the policies from maximizing the utility of federally funded scientific research to the general public.

To (5):
If the policies suggested above are implemented, all stakeholders will have sufficient incentives to implement data management plans.

To (6):
Preserving and making digital data accessible is closely related to the issue of preserving and making scientific publications accessible. If libraries and other non-profit organizations take over these tasks from the current commercial publishers, as suggested in my answers to the RFI on scientific literature, the current publisher profits will free up more than enough funds for libraries to store digital data and make them publicly accessible.
Once data and literature are stored in a database where both are linked semantically, innovators will have a bounty of opportunities to provide commercial services and to develop new applications and drugs/therapies from which to generate a profit.
In the current system, this information is restricted to a small set of academics, and innovators are largely barred from access.

To (7):
For existing data, researchers, innovators, and other stakeholders will demand compliance from the data stewards and provide feedback for improvements. Compliance with the policies for making data accessible as they are generated can be achieved as described above, by developing proper data tracking technology.

To (8):
Once all data and literature are available to innovators, market forces should be allowed to take over without any additional policy interference, as the government is already funding the establishment of this resource.

To (9):
This can be achieved by a combination of attribution metadata and an attribution system awarding attribution scores to researchers. A commonly defined set of metadata annotations will facilitate tagging data with identifiers that point to the funding source, researcher name, research lab, institution, and other key attribution information.
Publication venues, in their turn, should require researchers, when submitting articles, to disclose whether they used data from third parties and, if so, to provide proper attribution using the standard annotation identifiers of the third-party data source.
Researchers would then be able to accumulate attribution scores not only for publications and their citations, as is done today, but also for generating data and for their use and re-use.
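
A minimal sketch of how such attribution metadata and scores could fit together; the field names, the DOI- and ORCID-style identifiers, and the one-point-per-reuse scoring rule are all assumptions for illustration.

```python
# Each dataset carries a common set of attribution tags (all values made up).
datasets = {
    "doi:10.0000/example.1": {
        "researcher": "orcid:0000-0000-0000-0001",
        "lab": "Example Lab",
        "institution": "Example University",
        "funding_source": "GRANT-0001",
    },
}

# Third-party data uses disclosed to publication venues at submission time.
disclosures = [
    {"article": "article-42", "used_data": "doi:10.0000/example.1"},
    {"article": "article-43", "used_data": "doi:10.0000/example.1"},
]

def attribution_scores(datasets, disclosures, points_per_reuse=1):
    """Accumulate a reuse-based attribution score per researcher."""
    scores = {}
    for d in disclosures:
        researcher = datasets[d["used_data"]]["researcher"]
        scores[researcher] = scores.get(researcher, 0) + points_per_reuse
    return scores

print(attribution_scores(datasets, disclosures))
# -> {'orcid:0000-0000-0000-0001': 2}
```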

I'm not sure I'm competent to provide expert answers to questions (10)-(13).

Posted on Thursday 12 January 2012 - 14:47:51