Saturday, March 9, 2019
Achieving Fault-Tolerance in Operating System Essay
Introduction

Fault-tolerant computing is the art and science of building computing systems that continue to operate satisfactorily in the presence of faults. A fault-tolerant system may be able to tolerate one or more fault types, including i) transient, intermittent or permanent hardware faults, ii) software and hardware design errors, iii) operator errors, or iv) externally induced upsets or physical damage. An extensive methodology has been developed in this field over the past 30 years, and a number of fault-tolerant machines have been developed, most dealing with random hardware faults, while a smaller number deal with software, design and operator faults to varying degrees. A large amount of supporting research has been reported.

Fault tolerance and dependable systems research covers a wide spectrum of applications ranging across embedded real-time systems, commercial transaction systems, transportation systems, and military/space systems, to name a few. The supporting research includes system architecture, design techniques, coding theory, testing, validation, proof of correctness, modelling, software reliability, operating systems, parallel processing, and real-time processing. These areas often involve widely diverse core expertise ranging from formal logic, mathematics of stochastic modelling, graph theory, hardware design and software engineering.

Recent developments include the adaptation of existing fault-tolerance techniques to RAID disks, where information is striped across several disks to improve bandwidth and a redundant disk is used to hold encoded information so that data can be reconstructed if a disk fails.
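As a rough illustration of how a redundant disk lets data be reconstructed, the sketch below assumes the encoded information is a simple byte-wise XOR parity block. The essay does not say which code is used, so the parity scheme, the function names, and the in-memory "disks" are all assumptions made for this example.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block (hypothetical layout)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def reconstruct(stripe, failed_index):
    """Rebuild the block at failed_index from all surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)

# Three data "disks" plus one parity "disk"; the contents are arbitrary.
data = [b"faul", b"t-to", b"lera"]
stripe = make_stripe(data)

# Pretend disk 1 failed and rebuild its block from the other three.
recovered = reconstruct(stripe, 1)
```

With a single parity block only one failed disk per stripe can be rebuilt; a second simultaneous failure would lose data.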
Another area is the use of application-based fault-tolerance techniques to detect errors in high-performance parallel processors. Fault-tolerance techniques are expected to become increasingly important in deep sub-micron VLSI devices to combat increasing noise problems and improve yield by tolerating defects that are likely to occur on very large, complex chips.

Fault-tolerant computing already plays a major role in process control, transportation, electronic commerce, space, communications and many other areas that impact our lives. Many of its next advances will occur when it is applied to new state-of-the-art systems such as massively parallel scalable computing, promising new unconventional architectures such as processor-in-memory or reconfigurable computing, mobile computing, and the other exciting new things that lie around the corner.

Basic Concepts

Hardware Fault-Tolerance

The majority of fault-tolerant designs have been directed toward building computers that automatically recover from random faults occurring in hardware components. The techniques employed to do this generally involve partitioning a computing system into modules that act as fault-containment regions. Each module is backed up with protective redundancy so that, if the module fails, others can assume its function. Special mechanisms are added to detect errors and implement recovery. Two general approaches to hardware fault recovery have been used: 1) fault masking, and 2) dynamic recovery.

Fault masking is a structural redundancy technique that completely masks faults within a set of redundant modules. A number of identical modules execute the same functions, and their outputs are voted to remove errors created by a faulty module. Triple modular redundancy (TMR) is a commonly used form of fault masking in which the circuitry is triplicated and voted.
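The voting step in TMR can be sketched in software as a majority vote over three module outputs. This is only an illustration: real TMR voters are hardware circuits, and the names `tmr_vote`, `run_tmr`, `good` and `faulty` are hypothetical.

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Return the value that at least two of the three modules agree on."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three outputs differ, so no majority exists and the vote is invalid.
        raise RuntimeError("no majority among the three module outputs")
    return value

def run_tmr(modules, x):
    """Run three replicated modules on the same input and vote on the outputs."""
    outputs = [module(x) for module in modules]
    return tmr_vote(*outputs)

def good(x):
    return x * x

def faulty(x):
    return x * x + 1  # simulated faulty module producing a wrong result

# One faulty module out of three is masked by the vote.
result = run_tmr([good, faulty, good], 7)
```

Note that the vote only masks a single disagreeing module; if two modules produced the same wrong value, this voter would return it.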
The voting circuitry can also be triplicated so that individual voter failures can also be corrected by the voting process. A TMR system fails whenever two modules in a redundant triplet create errors, so that the vote is no longer valid. Hybrid redundancy is an extension of TMR in which the triplicated modules are backed up with additional spares, which are used to replace faulty modules, allowing more faults to be tolerated. Voted systems require more than three times as much hardware as non-redundant systems, but they have the advantage that computations can continue without interruption when a fault occurs, allowing existing operating systems to be used.

Dynamic recovery is required when only one copy of a computation is running at a time (or in some cases two unchecked copies), and it involves automated self-repair. As in fault masking, the computing system is partitioned into modules backed up by spares as protective redundancy. In the case of dynamic recovery, however, special mechanisms are required to detect faults in the modules, switch out a faulty module, switch in a spare, and instigate those software actions (rollback, initialization, retry, and restart) necessary to restore and continue the computation. In single computers special hardware is required along with software to do this, while in multicomputers the function is often managed by the other processors.

Dynamic recovery is generally more hardware-efficient than voted systems, and it is therefore the approach of choice in resource-constrained (e.g., low-power) systems, and especially in high-performance scalable systems in which the amount of hardware resources devoted to active computing must be maximized. Its disadvantages are that computational delays occur during fault recovery, fault coverage is often lower, and specialized operating systems may be required.
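The dynamic recovery sequence described above (detect a fault, switch out the faulty module, switch in a spare, roll back and retry) can be sketched as follows. The `Module` class, the simulated fault trigger, and the single-value checkpoint are assumptions made for this example, not a description of any particular machine.

```python
class ModuleFault(Exception):
    """Raised by a module's (hypothetical) error-detection mechanism."""

class Module:
    def __init__(self, name, fail_on_call=None):
        self.name = name
        self.calls = 0
        self.fail_on_call = fail_on_call  # simulate a fault on the n-th call

    def compute(self, state, step):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise ModuleFault(f"{self.name} detected an internal fault")
        return state + step

def run_with_dynamic_recovery(active, spares, steps, state=0):
    """Run the steps on one active module, switching in spares on faults."""
    spares = list(spares)
    log = []
    i = 0
    while i < len(steps):
        checkpoint = state  # state saved before the step, used for rollback
        try:
            state = active.compute(state, steps[i])
            i += 1
        except ModuleFault:
            if not spares:
                raise RuntimeError("spares exhausted; recovery not possible")
            log.append(f"switch out {active.name}")
            active = spares.pop(0)  # switch in a spare
            state = checkpoint      # roll back, then retry the same step
    return state, log

primary = Module("A", fail_on_call=2)
total, events = run_with_dynamic_recovery(primary, [Module("B")], [1, 2, 3])
```

Unlike the voted case, the computation pauses while the spare is switched in and the failed step is retried, which is the recovery delay the text mentions.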
Software Fault-Tolerance

Efforts to attain software that can tolerate software design faults (programming errors) have made use of static and dynamic redundancy approaches similar to those used for hardware faults. One such approach, N-version programming, uses static redundancy in the form of independently written programs (versions) that perform the same functions, and their outputs are voted at special checkpoints. Here, of course, the data being voted may not be exactly the same, and a criterion must be used to identify and reject faulty versions and to determine a consistent value (through inexact voting) that all good versions can use. An alternative dynamic approach is based on the concept of recovery blocks. Programs are partitioned into blocks and acceptance tests are executed after each block. If an acceptance test fails, a redundant code block is executed.

An approach called design diversity combines hardware and software fault-tolerance by implementing a fault-tolerant computer system using different hardware and software in redundant channels. Each channel is designed to provide the same function, and a method is provided to identify if one channel deviates unacceptably from the others. The goal is to tolerate both hardware and software design faults. This is a very expensive technique, but it is used in very critical aircraft control applications.

The key technologies that make software fault-tolerant

Software involves a system's conceptual model, which is easier than a physical model to engineer to test for things that violate basic concepts.
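As a rough illustration of the recovery-block approach described earlier in this section, the sketch below runs a primary routine, checks its result with an acceptance test, and falls back to a redundant alternate when the test fails. The square-root routines and the tolerance are hypothetical choices for the example; because the routines are pure functions, no state needs to be restored between alternates.

```python
import math

def recovery_block(alternates, acceptance_test, x):
    """Try each alternate in order until one passes the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # a crash is treated like a failed acceptance test
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

def buggy_sqrt(x):
    return x / 2  # simulated design fault in the primary block

def newton_sqrt(x):
    """Independently written alternate using Newton's iteration."""
    guess = x if x > 1 else 1.0
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess

def sqrt_acceptance(x, result):
    """Accept a result whose square is close to the input."""
    return math.isclose(result * result, x, rel_tol=1e-9)

value = recovery_block([buggy_sqrt, newton_sqrt], sqrt_acceptance, 2.0)
```

The quality of the scheme depends on the acceptance test: a faulty result that happens to satisfy the test would be accepted.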
To the extent that a software system can evaluate its own performance and correctness, it can be made fault-tolerant, or at least error aware; to the extent that a software system can check its responses before activating any physical components, a mechanism for improving error detection, fault tolerance, and safety exists.

We can use three key technologies (design diversity, checkpointing, and exception handling) for software fault tolerance, depending on whether the current task should be continued or can be lost while avoiding error propagation (ensuring error containment and thus avoiding total system failure).

Tolerating solid software faults for task continuity requires diversity, while checkpointing tolerates soft software faults for task continuity. Exception handling avoids system failure at the expense of current task loss.

Runtime failure detection is often accomplished through an acceptance test or comparison of results from a combination of different but functionally equivalent system alternates, components, versions, or variants. However, other techniques, ranging from mathematical consistency checking to error coding to data diversity, are also useful. There are many options for effective system recovery after a problem has been detected. They range from complete rejuvenation (for example, stopping with a full data and software reload and then restarting) to dynamic forward error correction to partial state rollback and restart.

The relationship between software fault tolerance and software safety

Both require good error detection, but the response to errors is what differentiates the two approaches. Fault tolerance implies that the software system can recover from, or in some way tolerate, the error and continue correct operation. Safety implies that the system either continues correct operation or fails in a safe manner. A safe failure is an inability to tolerate the fault.
So, we can have low fault tolerance and high safety by safely shutting down a system in response to every detected error.

It is certainly not a simple relationship. Software fault tolerance is related to reliability, and a system can certainly be reliable and unsafe or unreliable and safe, as well as the more usual combinations. Safety is intimately associated with the system's capacity to do harm. Fault tolerance is a very different property.

Fault tolerance is, together with fault prevention, fault removal, and fault forecasting, a means for ensuring that the system function is implemented so that the dependability attributes, which include safety and availability, satisfy the user's expectations and requirements. Safety involves the notion of controlled failures: if the system fails, the failure should have no catastrophic consequence; that is, the system should be fail-safe. Controlling failures always includes some form of fault tolerance, from error detection and halting to complete system recovery after component failure. The system function and environment dictate, through the requirements in terms of service continuity, the extent of fault tolerance required.

You can have a safe system that has little fault tolerance in it. When the system specifications properly and adequately define safety, then a well-designed fault-tolerant system will also be safe. However, you can also have a system that is highly fault tolerant but that can fail in an unsafe way. Hence, fault tolerance and safety are not synonymous. Safety is concerned with failures (of any nature) that can harm the user; fault tolerance is primarily concerned with runtime prevention of failures in any shape or form (including prevention of safety-critical failures).
A fault-tolerant and safe system will minimize overall failures and ensure that when a failure occurs, it is a safe failure.

Several standards for safety-critical applications recommend fault tolerance, for hardware as well as for software. For example, the IEC 61508 standard (which is generic and application sector independent) recommends, among other techniques, failure assertion programming, the safety bag technique, diverse programming, and backward and forward recovery. Also, the Defense standard (MOD 00-55), the avionics standard (DO-178B), and the standard for space projects (ECSS-Q-40-A) list design diversity as a possible means for improving safety.

Usually, the requirement is not so much for fault tolerance (by itself) as it is for high availability, reliability, and safety. Hence, IEEE, FAA, FCC, DOE, and other standards and regulations appropriate for reliable computer-based systems apply. We can achieve high availability, reliability, and safety in different ways. They involve a proper reliable and safe design, proper safeguards, and proper implementation.

Fault tolerance is just one of the techniques that assure that a system's quality of service (in a broader sense) meets user needs (such as high safety).

History

The SAPO computer built in Prague, Czechoslovakia was probably the first fault-tolerant computer. It was built in 1950-1954 under the supervision of A. Svoboda, using relays and a magnetic drum memory. The processor used triplication and voting (TMR), and the memory implemented error detection with automatic retries when an error was detected.

A second machine developed by the same group (EPOS) also contained comprehensive fault-tolerance features.
The fault-tolerant features of these machines were motivated by the local unavailability of reliable components and a high probability of reprisals by the ruling authorities should the machine fail.

Over the past 30 years, a number of fault-tolerant computers have been developed that fall into three general types: (1) long-life, unmaintainable computers, (2) ultra-dependable, real-time computers, and (3) high-availability computers.

Long-Life, Unmaintained Computers

Applications such as spacecraft require computers to operate for long periods of time without external repair. Typical requirements are a probability of 95% that the computer will operate correctly for 5-10 years. Machines of this type must use hardware in a very efficient fashion, and they are typically constrained to low power, weight, and volume.

Therefore, it is not surprising that NASA was an early sponsor of fault-tolerant computing. In the 1960s, the first fault-tolerant machine to be developed and flown was the on-board computer for the Orbiting Astronomical Observatory (OAO), which used fault masking at the component (transistor) level.

The JPL Self-Testing-and-Repairing (STAR) computer was the next fault-tolerant computer, developed by NASA in the late 1960s for a 10-year mission to the outer planets. The STAR computer, designed under the leadership of A. Avizienis, was the first computer to employ dynamic recovery throughout its design. Various modules of the computer were instrumented to detect internal faults and signal fault conditions to a special test and repair processor that effected reconfiguration and recovery.

An experimental version of the STAR was implemented in the laboratory and its fault tolerance properties were verified by experimental testing. Perhaps the most successful long-life space application has been the JPL-Voyager computers that have now operated in space for 20 years.
This system used dynamic redundancy in which pairs of redundant computers checked each other by exchanging messages, and if a computer failed, its partner could take over the computations. This type of design has been used on several subsequent spacecraft.

Ultra-Dependable Real-Time Computers

These are computers for which an error or delay can prove to be catastrophic. They are designed for applications such as control of aircraft, mass transportation systems, and nuclear power plants. The applications justify massive investments in redundant hardware, software, and testing.

One of the first operational machines of this type was the Saturn V guidance computer, developed in the 1960s. It contained a TMR processor and duplicated memories (each using internal error detection). Processor errors were masked by voting, and a memory error was circumvented by reading from the other memory. The next machine of this type was the Space Shuttle computer. It was a rather ad-hoc design that used four computers that executed the same programs and were voted. A fifth, non-redundant computer was included with different programs in case a software error was encountered.

During the 1970s, two influential fault-tolerant machines were developed by NASA for fuel-efficient aircraft that require continuous computer control in flight. They were designed to meet the most stringent reliability requirements of any computer to that time. Both machines employed hybrid redundancy. The first, designated Software Implemented Fault Tolerance (SIFT), was developed by SRI International. It used off-the-shelf computers and achieved voting and reconfiguration primarily through software.

The second machine, the Fault-Tolerant Multiprocessor (FTMP), developed by the C. S. Draper Laboratory, used specialized hardware to effect error and fault recovery. A commercial company, August Systems, was a spin-off from the SIFT program.
It has developed a TMR system intended for process control applications. The FTMP has evolved into the Fault-Tolerant Processor (FTP), used by Draper in several applications, and the Fault-Tolerant Parallel Processor (FTPP), a parallel processor that allows processes to run in a single machine or in duplex, tripled or quadrupled groups of processors. This highly innovative design is fully Byzantine resilient and allows multiple groups of redundant processors to be interconnected to form scalable systems.

The new generation of fly-by-wire aircraft exhibits a very high degree of fault-tolerance in their real-time flight control computers. For example, the Airbus airliners use redundant channels with different processors and diverse software to protect against design errors as well as hardware faults. Other areas where fault-tolerance is being used include control of public transportation systems and the distributed computer systems now being incorporated in automobiles.

High-Availability Computers

Many applications require very high availability but can tolerate an occasional error or very short delays (on the order of a few seconds) while error recovery is taking place. Hardware designs for these systems are often considerably less expensive than those used for ultra-dependable real-time computers. Computers of this type often use duplex designs. Example applications are telephone switching and transaction processing.

The most widely used fault-tolerant computer systems developed during the 1960s were in electronic switching systems (ESS) that are used in telephone switching offices throughout the country. The first of these AT&T machines, No. 1 ESS, had a goal of no more than two hours downtime in 40 years. The computers are duplicated, to detect errors, with some dedicated hardware and extensive software used to identify faults and effect replacement. These machines have since evolved over several generations to No.
5 ESS, which uses a distributed system controlled by the 3B20D fault-tolerant computer.

The largest commercial success in fault-tolerant computing has been in the area of transaction processing for banks, airline reservations, etc. Tandem Computers, Inc. was the first major producer and is the current leader in this market. The design approach is a distributed system using a sophisticated form of duplication. For each running process, there is a backup process running on a different computer. The primary process is responsible for checkpointing its state to duplex disks. If it should fail, the backup process can restart from the last checkpoint.

Stratus Computer has become another major producer of fault-tolerant machines for high-availability applications. Their approach uses duplex self-checking computers, where each computer of a duplex pair is itself internally duplicated and compared to provide high-coverage concurrent error detection. The duplex pair of self-checking computers is run synchronously so that if one fails, the other can continue the computations without delay.

Finally, the venerable IBM mainframe series, which evolved from S/360, has always used extensive fault-tolerance techniques of internal checking, instruction retries and automatic switching of redundant units to provide very high availability. The newest CMOS-VLSI version, G4, uses coding on registers and on-chip duplication for error detection, and it contains redundant processors, memories, I/O modules and power supplies to recover from hardware faults, providing very high levels of dependability.

The server market represents a new and rapidly growing market for fault-tolerant machines, driven by the growth of the Internet and local networks and their needs for uninterrupted service. Many major server manufacturers offer systems that contain redundant processors, disks and power supplies, and automatically switch to backups if a failure is detected.
Examples are SUN's ft-SPARC and the HP/Stratus Continuum 400.

Other vendors are working on fault-tolerant cluster technology, where other machines in a network can take over the tasks of a failed machine. An example is the Microsoft MSCS technology. Information on fault-tolerant servers can readily be found in the various manufacturers' web pages.

Conclusion

Fault-tolerance is achieved by applying a set of analysis and design techniques to create systems with dramatically improved dependability. As new technologies are developed and new applications arise, new fault-tolerance approaches are also needed. In the early days of fault-tolerant computing, it was possible to craft specific hardware and software solutions from the ground up, but now chips contain complex, highly integrated functions, and hardware and software must be crafted to meet a variety of standards to be economically viable. Thus a great deal of current research focuses on implementing fault tolerance using COTS (Commercial-Off-The-Shelf) technology.

References

Avizienis, A., et al. (Ed.) (1987) Dependable Computing and Fault-Tolerant Systems Vol. 1: The Evolution of Fault-Tolerant Computing, Vienna: Springer-Verlag. (Though somewhat dated, the best historical reference available.)
Harper, R., Lala, J. and Deyst, J. (1988) Fault-Tolerant Parallel Processor Architectural Overview, Proc. of the 18th International Symposium on Fault-Tolerant Computing FTCS-18, Tokyo, June 1988. (FTPP)
Computer (Special Issue on Fault-Tolerant Computing) 23, 7 (July 1990).
Lala, J., et al. (1991) The Draper Approach to Ultra Reliable Real-Time Systems, Computer, May 1991.
Jewett, D. (1991) A Fault-Tolerant Unix Platform, Proc. of the 21st International Symposium on Fault-Tolerant Computing FTCS-21, Montreal, June 1991. (Tandem Computers)
Webber, S, and Jeirne, J. (1991) The Stratus computer architecture, Proc. of the 21st International Symposium on Fault-Tolerant Computing FTCS-21, Montreal, June 1991.
Briere, D., and Traverse, P. (1993) AIRBUS A320/A330/A340 Electrical Flight Controls: A Family of Fault-Tolerant Systems, Proc. of the 23rd International Symposium on Fault-Tolerant Computing FTCS-23, Toulouse, France, IEEE Press, June 1993.
Sanders, W., and Obal, W. D. II (1993) Dependability Evaluation using UltraSAN, Software Demonstration in Proc. of the 23rd International Symposium on Fault-Tolerant Computing FTCS-23, Toulouse, France, IEEE Press, June 1993.
Beounes, C., et al. (1993) SURF-2: A Program For Dependability Evaluation Of Complex Hardware And Software Systems, Proc. of the 23rd International Symposium on Fault-Tolerant Computing FTCS-23, Toulouse, France, IEEE Press, June 1993.
Blum, A., et al. (1994) Modeling and Analysis of System Dependability Using the System Availability Estimator, Proc. of the 24th International Symposium on Fault-Tolerant Computing FTCS-24, Austin, TX, June 1994. (SAVE)
Lala, J. H., and Harper, R. E. (1994) Architectural Principles for Safety-Critical Real-Time Applications, Proc. IEEE, V82 n1, Jan 1994, pp. 25-40.
Jenn, E., Arlat, J., Rimen, M., Ohlsson, J. and Karlsson, J. (1994) Fault injection into VHDL models: the MEFISTO tool, Proc. of the 24th Annual International Symposium on Fault-Tolerant Computing FTCS-24, Austin, Texas, June 1994.
Siewiorek, D., ed. (1995) Fault-Tolerant Computing Highlights from 25 Years, Special Issue of the 25th International Symposium on Fault-Tolerant Computing FTCS-25, Pasadena, CA, June 1995. (Papers selected as especially significant in the first 25 years of Fault-Tolerant Computing.)
Baker, W. E., Horst, R. W., Sonnier, D. P., and Watson, W. J. (1995) A Flexible ServerNet-Based Fault-Tolerant Architecture, Proc. of the 25th International Symposium on Fault-Tolerant Computing FTCS-25, Pasadena, CA, June 1995. (Tandem)
Timothy K. Tsai and Ravishankar K. Iyer (1996) An Approach Towards Benchmarking of Fault-Tolerant Commercial Systems, Proc.
26th Symposium on Fault-Tolerant Computing FTCS-26, Sendai, Japan, June 1996. (FTAPE)
Kropp, Nathan P., Philip J. Koopman, and Daniel P. Siewiorek (1998) Automated Robustness Testing of Off-the-Shelf Software Components, Proc. of the 28th International Symposium on Fault-Tolerant Computing FTCS-28, Munich, June 1998. (Ballista)
Spainhower, L., and T. A. Gregg (1998) G4: A Fault-Tolerant CMOS Mainframe, Proc. of the 28th International Symposium on Fault-Tolerant Computing FTCS-28, Munich, June 1998. (IBM)
Kozyrakis, Christoforos E., and David Patterson, A New Direction for Computer Architecture Research, Computer, Vol. 31, No. 11, November 1998.