











Availability - Human factors

28 June 11 - 16:02

Usually only 20% of failures are caused by technology. In the other 80% of cases, human error is the reason. For instance, a system administrator accidentally pulls the wrong cable or enters an incorrect command. Users sometimes delete important (system) files.

Of course it helps to have highly qualified and trained personnel with a healthy sense of responsibility. To err is human, however, and there is no cure for it. End users can introduce downtime by misusing the system. When a user, for instance, starts generating 10 very large reports at the same time, the performance of the system can suffer to such a degree that the system is in effect unavailable to other users.

Also, when a user forgets a password (and maybe tries an incorrect password more than 5 times), he is locked out and the system is unavailable to him. If that person has a very responsible job, for instance approving certain steps in a business process, being locked out could mean that the business process is unavailable to other users as well.
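
The lockout rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name LockoutPolicy, the user name "approver" and the threshold of 5 failed attempts are hypothetical and not taken from any particular system.

```python
class LockoutPolicy:
    """Locks an account after more than `max_failures` consecutive failed logins.

    Hypothetical sketch: real systems usually add timed unlocks and auditing.
    """

    def __init__(self, max_failures=5):
        self.max_failures = max_failures
        self.failures = {}  # user name -> consecutive failed attempts

    def is_locked(self, user):
        return self.failures.get(user, 0) > self.max_failures

    def record_attempt(self, user, success):
        """Record a login attempt; return True if the login is accepted."""
        if self.is_locked(user):
            return False  # even the correct password is refused now
        if success:
            self.failures[user] = 0
            return True
        self.failures[user] = self.failures.get(user, 0) + 1
        return False


policy = LockoutPolicy(max_failures=5)
for _ in range(6):
    policy.record_attempt("approver", success=False)
print(policy.is_locked("approver"))
```

Once the account is locked, every step that waits for this user's approval is blocked too, which is how one forgotten password can make a whole business process unavailable.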

Most unavailability issues, however, are the result of actions by system managers. Some typical actions (or the lack thereof) are:

  • Performing a test in the production environment (hopefully by accident - testing in production is of course not recommended at all)
  • Switching off a wrong component (not the defective server that needs repair, but the one still operating)
  • Swapping a good working disk in a RAID set instead of the defective one
  • Restoring the wrong back-up tape to production
  • Accidentally removing files (mail folders, configuration files, database files)
  • Incorrect changes to configuration files (for instance the routing table of a network router, or a change in the Windows registry)
  • Tripping over cables, creating a broken or disconnected cable
  • Incorrect labeling of cables, later leading to errors when a change is performed
  • Stopping an incorrect virtual machine (the one in production instead of the one in the test environment)
  • Making a typo in a system command (in UNIX: sudo rm -rf / *.back instead of sudo rm -rf /*.back, where one space too many leads to the complete erasure of a hard disk - did you notice the difference?)
  • Insufficient testing, for instance the fall-back procedure to move operations from the primary datacenter to the secondary one was never tested, and failed when it was most needed
  • A system manager or architect made a mistake in the design of the infrastructure, leading to downtime (we thought the Windows cluster was designed in a good way, but when one of the cluster nodes failed, we found that the complete cluster went down)
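
To see why the extra space in the rm example above is so destructive, it helps to look at how a command line is split into words. The following is a minimal Python sketch, not a safety tool: it uses the standard shlex module, which separates words the way a POSIX shell does but does not expand wildcards, and the helper name targets is an assumption made for this illustration.

```python
import shlex

intended = "sudo rm -rf /*.back"
typo = "sudo rm -rf / *.back"


def targets(command):
    """Return the path arguments of a 'sudo rm ...' command line.

    Hypothetical helper: it skips the first two words (sudo, rm)
    and every option that starts with a dash.
    """
    words = shlex.split(command)
    return [word for word in words[2:] if not word.startswith("-")]


# One path argument: a pattern for files ending in .back in the root directory.
print(targets(intended))

# Two path arguments: the root directory itself, plus a pattern.
print(targets(typo))
```

With the extra space, the root directory becomes a separate argument, so the recursive delete is applied to the whole file system rather than to a handful of back-up files.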

Many of these mistakes can be avoided by using proper system management procedures, like having a standard template for creating new servers, using formal deployment strategies with the appropriate tools, using administrative accounts only when absolutely needed, etc.

In some UNIX environments, a user who first runs a command with administrative (root) privileges through sudo automatically gets the following message:

We assume you have received the usual lecture from the local System Administrator. 
It usually boils down to these three things: 
#1) Respect the privacy of others. 
#2) Think before you type. 
#3) With great power comes great responsibility. 

I think this message makes people aware, leading to fewer mistakes.

Business Continuity Management (BCM) and Disaster Recovery Plan (DRP)

14 June 11 - 15:44

Business Continuity Management is a management process that identifies potential impacts that threaten an organisation and provides a framework for building resilience and the capability for an effective response that safeguards the interests of its key stakeholders, reputation, brand and value-creating activities.

BCM is not about IT alone - handling processes and the availability of people and workplaces are also part of BCM. BCM includes disaster recovery, business recovery, crisis management, incident management, emergency management, product recall and contingency planning.

A Business Continuity Plan (BCP) describes the measures to be taken when a critical incident occurs. The measures relate to the continuation of critical operations, but also to halting non-critical processes. The plan also describes the BCM organization during the crisis: who will be part of it, what their duties and responsibilities are, etc.

Out of scope of the BCP are the repairs that are needed to return to normality. This is the subject of the Disaster Recovery Plan (DRP).

The BS 25999 standard provides guidelines on how to implement BCM.

Two measurable objectives are related to BCM: the RTO and the RPO. The Recovery Time Objective (RTO) is the length of time, and the service level, within which a business process must be restored after a disaster (or disruption) in order to avoid the unacceptable consequences of a break in business continuity. It includes the time for fixing the problem without a recovery, the recovery itself (if needed), tests, and communication to users. Decision time for the user representative is not included in the RTO.

The Recovery Point Objective (RPO) is the point in time to which data must be recovered, as defined by the organization. It describes the amount of data loss, measured in time, that the organization considers an "acceptable loss" in a disaster situation.

If the RPO of a company is 2 hours and it takes 5 hours to get the data back into production, the RPO is still 2 hours: the restored data may be no more than 2 hours older than the moment of the disaster. The 5 hours needed for the restore count towards the RTO, not the RPO.
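
The difference between the two objectives can be made concrete with a little date arithmetic. The Python sketch below assumes a hypothetical disaster at 12:00 with the last usable back-up made at 10:30. The function names and the RTO values of 4 and 8 hours are illustrative assumptions, and the sketch simply measures from the disaster to the restored service without separating out decision time.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup, disaster, rpo):
    """True if the data lost (time since the last back-up) is within the RPO."""
    return disaster - last_backup <= rpo


def meets_rto(disaster, service_restored, rto):
    """True if the service was restored within the RTO."""
    return service_restored - disaster <= rto


# Hypothetical scenario for illustration only.
disaster = datetime(2011, 6, 14, 12, 0)
last_backup = datetime(2011, 6, 14, 10, 30)
service_restored = disaster + timedelta(hours=5)

# At most 1.5 hours of data is lost, so an RPO of 2 hours is met,
# no matter how long the restore itself takes.
print(meets_rpo(last_backup, disaster, rpo=timedelta(hours=2)))

# The 5-hour restore only matters for the RTO.
print(meets_rto(disaster, service_restored, rto=timedelta(hours=4)))
print(meets_rto(disaster, service_restored, rto=timedelta(hours=8)))
```

In this scenario the RPO is met, while the RTO is met only if the organization accepts at least 5 hours of downtime.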

Disaster Recovery Planning (DRP) involves making preparations for a disaster, but also addresses the procedures to be followed during and after a loss. A Disaster Recovery Plan contains the set of measures to take when, after a serious incident, (parts of) the business must be accommodated in an alternative location.

The measures taken to repair the damage and/or disruption at the original location are also part of the Disaster Recovery Plan. Key staff in preparing the plan are the IT and facilities managers and their staff.

The IT disaster recovery standard BS 25777 can be used to implement DRP in practice.

DRP assesses the risk of failing IT systems and provides solutions. A typical DRP solution is the use of fall-back facilities and having a Computer Incident Response Team (CIRT) in place. A CIRT is usually a team of system managers and senior managers that decides how to handle a crisis once it becomes reality.

The steps that need to be taken to resolve a disaster depend highly on the type of disaster. It could be that the company's building is damaged or destroyed (for instance in case of a fire), and maybe people even got hurt or died. The first concern is of course to save people. After that, procedures must be followed to restore operations as soon as possible.

A new (temporary) building might be needed, temporary staff might be needed, and new equipment must be installed or temporarily hired. Then steps must be taken to get the systems up and running again and to have the data restored. Connections to the outside world must be established (not only to the Internet, but also to business partners) and business processes must be started again.

The CIRT can use multiple teams to perform these tasks and to restore operations. A recovery team ensures that IT services are recovered at a remote site.

A salvage team ensures the primary site is available again as soon as possible (after cleanup if necessary).

Finally, a team needs to ensure that normal operations are resumed at the primary site. A good practice is to bring the least critical work back first, to see if the primary site is working again as expected, and then bring back the more critical systems one by one.
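
The "least critical first" practice can be sketched as a simple ordered loop. This is a minimal Python illustration, assuming each system has a numeric criticality score (1 is least critical). The system names, the scores and the move_and_check callback are hypothetical.

```python
def failback_order(systems):
    """Sort systems so that the least critical one is moved back first."""
    return sorted(systems, key=lambda system: system["criticality"])


def fail_back(systems, move_and_check):
    """Move systems back to the primary site one by one.

    `move_and_check` is a hypothetical callback that moves one system and
    returns True if it works as expected. The loop stops at the first
    failure, so more critical systems stay safely at the remote site.
    """
    moved = []
    for system in failback_order(systems):
        if not move_and_check(system):
            break
        moved.append(system["name"])
    return moved


# Hypothetical inventory for illustration only.
systems = [
    {"name": "order processing", "criticality": 3},
    {"name": "intranet wiki", "criticality": 1},
    {"name": "reporting", "criticality": 2},
]

print(fail_back(systems, move_and_check=lambda system: True))
```

Stopping at the first failed check keeps the damage limited to the least important systems if the primary site turns out not to be ready yet.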

Of course this procedure needs to be tested regularly.




About Sjaak Laan

I am 46 years old, married to Angelina, and we have 3 children aged 13, 8 and 6. I live in Drachten, in Friesland.

I work for Logica as a Principal IT Architect. I have 20 years of IT experience.

I hold the following certifications:

ITAC Master Certified IT Architect

CISSP (Certified Information Systems Security Professional)

TOGAF Certified Architect

I maintain my business contacts through LinkedIn.

You can also follow me on Twitter: twitter.com/sjaaklaan

You can reach me at sjaak.laan [ a t ] gmail [dot] com.

This site contains my own opinions, which are not necessarily those of my employer or of the clients I work for.