Discovery Net
Discovery Net is one of the earliest examples of a scientific workflow system, allowing users to coordinate the execution of remote services based on Web service and Grid service (OGSA, the Open Grid Services Architecture) standards.
The system was designed and implemented at Imperial College London as part of the Discovery Net pilot project funded by the UK e-Science Programme. Many of the concepts pioneered by Discovery Net were later incorporated into a variety of other scientific workflow systems.

History: The Discovery Net e-Science Pilot Project

The Discovery Net system was developed as part of the Discovery Net pilot project (2001–2005), a £2m research project funded by the EPSRC under the UK e-Science Programme.
The research on the project was conducted at Imperial College London as a collaboration between the Departments of Computing, Physics, Biochemistry and Earth Science & Engineering. As a single-institution project, it was unique among the pilot projects funded by the EPSRC; the other 10 were all multi-institutional.

The aims of the Discovery Net project were to investigate and address the key issues in developing an e-Science platform for scientific discovery from the data generated by a wide variety of high-throughput devices.
It originally considered requirements from applications in life science, geo-hazard monitoring, environmental modelling and renewable energy. The project successfully delivered on all its objectives, including the development of the Discovery Net workflow platform and workflow system. Over the years the system evolved to address applications in many other areas, including bioinformatics, cheminformatics, health informatics, text mining, and financial and business applications.

Discovery Net Scientific Workflow System

The Discovery Net system developed within the project is one of the earliest examples of scientific workflow systems. It is an e-Science platform based on a workflow model that supports the integration of distributed data sources and analytical tools, enabling end users to derive new knowledge from devices, sensors, databases, analysis components and computational resources that reside across the Internet or grid.

Architecture and Workflow Server

The system is based on a multi-tier architecture, with a workflow server providing a number of supporting functions needed for workflow authoring and execution, such as integration and access to remote computational and data resources, collaboration tools, visualisers and publishing mechanisms. The architecture evolved over the years, with work focusing on the internals of the workflow server (Ghanem et al. 2009) to support extensibility across multiple application domains as well as different execution environments.

Visual Workflow Authoring

Discovery Net workflows are represented and stored using DPML (Discovery Process Markup Language), an XML-based representation language for workflow graphs supporting both a data flow model of computation (for analytical workflows) and a control flow model (for orchestrating multiple disjoint workflows).

As with most modern workflow systems, the system supported a drag-and-drop visual interface enabling users to easily construct their applications by connecting nodes together.

Within DPML, each node in a workflow graph represents an executable component (e.g.
a computational tool or a wrapper that can extract data from a particular data source). Each
component has a number of parameters that can be set by the user and also a number of input
and output ports for receiving and transmitting data.

Each directed edge in the graph represents a connection from an output port, namely the tail of the edge, to an
input port, namely the head of the edge. A port is connected if there is one or more connections
from/to that port.
In addition, each node in the graph provides metadata describing the input and output ports
of the component, including the type of data that can be passed to the component and parameters of the service that a user might want to change. Such information is used for the verification of
workflows and to ensure meaningful chaining of components. A connection between an input
and an output port is valid only if the types are compatible, which is strictly enforced.
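
The graph model described above can be sketched in a few lines of Python. This is only an illustration of typed ports and connection checking, not DPML itself or the Discovery Net API: the class names (Port, Node, Workflow), the type names ("sequence", "table") and the rule that types must match exactly are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Port:
    name: str
    data_type: str  # hypothetical type names, e.g. "sequence" or "table"


@dataclass
class Node:
    """An executable component with parameters and typed ports."""
    name: str
    inputs: dict = field(default_factory=dict)      # port name -> Port
    outputs: dict = field(default_factory=dict)     # port name -> Port
    parameters: dict = field(default_factory=dict)  # user-settable values


class Workflow:
    def __init__(self):
        self.nodes = {}
        self.edges = []  # (source node, output port, target node, input port)

    def add_node(self, node):
        self.nodes[node.name] = node

    def connect(self, src, out_port, dst, in_port):
        """Add a directed edge from an output port (tail) to an input port (head)."""
        out = self.nodes[src].outputs[out_port]
        inp = self.nodes[dst].inputs[in_port]
        # Assumption: "compatible" is simplified here to "identical type name".
        if out.data_type != inp.data_type:
            raise TypeError(
                f"{src}.{out_port} ({out.data_type}) cannot feed "
                f"{dst}.{in_port} ({inp.data_type})"
            )
        self.edges.append((src, out_port, dst, in_port))

    def is_connected(self, node, port, direction):
        """A port is connected if at least one edge starts or ends at it."""
        if direction == "output":
            return any(s == node and o == port for s, o, _, _ in self.edges)
        return any(d == node and i == port for _, _, d, i in self.edges)


wf = Workflow()
wf.add_node(Node("import_fasta", outputs={"out": Port("out", "sequence")}))
wf.add_node(Node("align",
                 inputs={"in": Port("in", "sequence")},
                 outputs={"out": Port("out", "table")}))
wf.add_node(Node("filter_rows",
                 inputs={"in": Port("in", "table")},
                 parameters={"threshold": 0.5}))

wf.connect("import_fasta", "out", "align", "in")
wf.connect("align", "out", "filter_rows", "in")

# Chaining a sequence output directly into a table input is rejected.
try:
    wf.connect("import_fasta", "out", "filter_rows", "in")
    rejected = False
except TypeError:
    rejected = True
```

Checking types at connection time, rather than at execution time, is what allows a workflow to be verified before any component runs.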

Separation between Data and Control Flows

A key contribution of the system is its clean separation between the data flow and control flow models of computation within a scientific workflow. This is achieved through the concept of embedding, which enables complete data flow fragments to be embedded within block-structured fragments of control flow constructs. This results in simpler workflow graphs compared to other scientific workflow systems, e.g. the Taverna workbench and the Kepler scientific workflow system, and also provides the opportunity to apply formal methods to the analysis of their properties.
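
The idea of embedding can be loosely illustrated as follows. The article does not describe the actual DPML constructs, so the class names (DataFlowFragment, ForEach, IfElse) and their behaviour are assumptions chosen for the sketch: data flow fragments are self-contained chains of transformations, and control flow blocks only decide which fragment runs and how often.

```python
class DataFlowFragment:
    """A complete chain of data transformations, executed as one unit."""

    def __init__(self, *steps):
        self.steps = steps

    def run(self, data):
        for step in self.steps:
            data = step(data)
        return data


class IfElse:
    """Control flow block: picks one of two embedded fragments."""

    def __init__(self, predicate, then_body, else_body):
        self.predicate = predicate
        self.then_body = then_body
        self.else_body = else_body

    def run(self, data):
        branch = self.then_body if self.predicate(data) else self.else_body
        return branch.run(data)


class ForEach:
    """Control flow block: runs the embedded body once per item."""

    def __init__(self, body):
        self.body = body

    def run(self, items):
        return [self.body.run(item) for item in items]


# The data flow fragments know nothing about loops or branches.
normalise = DataFlowFragment(str.strip, str.upper)
passthrough = DataFlowFragment()

# The control flow blocks nest in a block-structured way around them.
block = ForEach(IfElse(lambda s: s.strip() != "", normalise, passthrough))
result = block.run([" acgt ", "", "ttga"])
```

Because the loop and branch live outside the fragments, each fragment stays a plain acyclic pipeline, which is one way to read the claim about simpler workflow graphs.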

Data Management and Multiple Data Models

A key feature of the design of the system has been its support for data management within the workflow engine itself. This is an important feature since scientific experiments typically generate and use large amounts of heterogeneous and distributed data sets. The system was thus designed to support persistence and caching of intermediate data products and also to support scalable workflow execution over potentially large data sets using remote compute resources.
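
Caching of intermediate data products can be pictured with the minimal sketch below. It is not how Discovery Net implements caching, which the article does not describe; the IntermediateCache class, its key scheme (a hash of component name, parameters and input data) and the in-memory dictionary are all assumptions. A persistent version would replace the dictionary with disk or database storage.

```python
import hashlib
import json


class IntermediateCache:
    """Stores each component's output, keyed by what produced it."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(component, params, data):
        # Assumption: parameters and data are JSON-serialisable.
        payload = json.dumps([component, params, data], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, component, func, params, data):
        key = self._key(component, params, data)
        if key in self._store:
            self.hits += 1  # reuse the stored intermediate product
            return self._store[key]
        result = func(data, **params)
        self._store[key] = result
        return result


def keep_above(rows, threshold):
    """Hypothetical component: keep values greater than the threshold."""
    return [r for r in rows if r > threshold]


cache = IntermediateCache()
first = cache.run("keep_above", keep_above, {"threshold": 2}, [1, 2, 3, 4])
second = cache.run("keep_above", keep_above, {"threshold": 2}, [1, 2, 3, 4])
changed = cache.run("keep_above", keep_above, {"threshold": 3}, [1, 2, 3, 4])
```

Re-running a step with the same inputs and parameters returns the stored result, while changing a parameter produces a new cache entry and recomputes only that step.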

A second important aspect of the Discovery Net system is its typed workflow language and its extensibility to support arbitrary data types defined by the user. Data typing simplifies scientific workflow development, enables optimization of workflows and improves error checking during workflow validation. The system included a number of default data types to support data mining in a variety of scientific applications. These included a relational model for tabular data, a bioinformatics data model (FASTA) for representing gene sequences and a stand-off markup model for text mining based on the Tipster architecture.

Each model has an associated set of data import and export components, as well as specific visualizers, which integrate with the generic import, export and visualization tools already present in the system. As an example, chemical compounds represented in the widely used SMILES (simplified molecular-input line-entry specification) format can be imported into data tables, where they can be rendered using either a three-dimensional representation or their structural formula. The relational model also serves as the base data model for data integration, and is used for the majority of generic data cleaning and transformation tasks.

Applications

The system won the “Most Innovative Data Intensive Application Award” at the ACM SC02 (Supercomputing 2002) conference and exhibition, based on a demonstration of a fully interactive distributed genome annotation pipeline for a malaria genome case study. Many of the features of the system (architecture features, visual front-end, simplified access to remote Web and Grid Services and inclusion of a workflow store) were considered novel at the time, and have since found their way into other academic and commercial systems, especially bioinformatics workflow management systems.

Beyond the original Discovery Net project, the system has been used in a large number of scientific applications, for example the BAIR (Biological Atlas of Insulin Resistance) project funded by the Wellcome Trust, and in a large number of projects funded by both the EPSRC and BBSRC in the UK. The Discovery Net technology and system have also evolved into commercial products through the Imperial College spinout company InforSense Ltd, which further extended and applied the system in a wide variety of commercial applications as well as through further research projects, including SIMDAT, TOPCOMBI, BRIDGE and ARGUGRID.

External links

1. List of e-Science Pilot Projects funded by the EPSRC "http://www.epsrc.ac.uk/about/progs/rii/escience/Pages/fundedprojects.aspx"

2. SIMDAT "http://www.simdat.org/"

3. The BRIDGE Project "http://www.bridge-grid.eu/"

4. The ARGUGRID Project "http://www.argugrid.eu/"

5. BAIR project "http://www.bair.org.uk/"

6. InforSense Ltd. "http://www.inforsense.com/"

See also

  • Workflow

  • Bioinformatics workflow management systems

  • Kepler scientific workflow system

  • Taverna workbench

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 