Thursday, April 4, 2019

How to define a CloudWatch Alarm in CloudFormation using Metrics Math

It took me quite some time to figure out how to define a CloudWatch Alarm with metric math in CloudFormation, mainly because I could not find a proper example, especially not a YAML one.

After some searching I found a JSON example and converted it into YAML for my purposes.


The starting point for my example is an RDS Oracle DB instance defined in the same CloudFormation template, but you can use any resource that exposes metrics:

Oracle:
  Type: "AWS::RDS::DBInstance"
  DeletionPolicy: Retain
  Properties:
    AllocatedStorage: !Ref DBInstanceStorage
    DBInstanceClass: !Ref DBInstanceClass
    ....

I wanted to create an alert based on the available storage space on the instance, expressed as a percentage. So I used metric math to convert the FreeStorageSpace metric provided by RDS. The metric comes in bytes, while the RDS storage size is given in gigabytes. So the storage size is converted into bytes and put into relation to the free storage to get a percentage value.
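As a sanity check, the expression used in the alarm below does the same arithmetic as this small Python sketch (the function name and the example values are my own):

```python
def free_storage_percent(free_storage_bytes: float, allocated_gib: int) -> float:
    """Mirror of the metric math expression
    m1/(DBInstanceStorage*1024*1024*1024)*100:
    convert the allocated storage from GiB to bytes and relate it
    to the free storage to get a percentage value."""
    allocated_bytes = allocated_gib * 1024 * 1024 * 1024
    return free_storage_bytes / allocated_bytes * 100

# A 20 GiB instance with 2 GiB free is at exactly 10% free storage,
# i.e. right at the alarm threshold of 10.
print(free_storage_percent(2 * 1024**3, 20))
```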



The Metrics section is the important one. To better understand it, read it bottom up.
The second entry (Id: m1) pulls in the RDS FreeStorageSpace metric with a 5-minute period. "ReturnData: False" makes the data available for the calculation but does not return it as a result of the alarm's metric query.
The first entry (Id: e1) does the actual math and returns the percentage value for free storage.

LowStorageSpace:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: !Sub 'The FreeStorageSpace of ${Oracle} DB is below X%'
    Metrics:
      - Id: e1
        Expression: !Sub 'm1/(${DBInstanceStorage}*1024*1024*1024)*100'
        Label: free storage percentage
      - Id: m1
        MetricStat:
          Metric:
            Dimensions:
            - Name: DBInstanceIdentifier
              Value: !Ref Oracle
            MetricName: FreeStorageSpace
            Namespace: AWS/RDS
          Period: 300
          Stat: Average
          Unit: Bytes
        ReturnData: False
    Threshold: 10
    ComparisonOperator: LessThanOrEqualToThreshold
    EvaluationPeriods: 5
    AlarmActions:
    - Fn::ImportValue: alarming-topic-alarmNotificationTopicArn
    OKActions:
    - Fn::ImportValue: alarming-topic-alarmNotificationTopicArn
    InsufficientDataActions:
    - Fn::ImportValue: alarming-topic-alarmNotificationTopicArn


An overview of all available metrics provided by AWS services can be found in the docs.

The AWS CLI is also helpful to better understand metrics:

aws cloudwatch list-metrics --namespace "AWS/RDS" --metric-name "FreeStorageSpace"

for example shows the available Dimensions.

aws cloudwatch get-metric-statistics --namespace "AWS/RDS" --metric-name "FreeStorageSpace" --start-time 2019-04-04T05:00:00Z --end-time 2019-04-04T10:00:00Z --period 300 --statistics Average

allows you to experiment with exactly the parameters you also have to specify in the MetricStat section.

Unfortunately, the "Unit:" parameter in the MetricStat section does not let you choose the unit you would like to receive; rather, it is used as a filter, so it has to match the unit of the metric you want to load.
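The metric math expression itself can also be tried out with get-metric-data before baking it into the template. A sketch of the query file and call (the instance identifier and the 100 GiB storage size are placeholders):

```shell
# queries.json mirrors the Metrics section of the alarm;
# "my-db-instance" and the 100 GiB in the expression are placeholders.
cat > queries.json <<'EOF'
[
  {
    "Id": "e1",
    "Expression": "m1/(100*1024*1024*1024)*100",
    "Label": "free storage percentage"
  },
  {
    "Id": "m1",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/RDS",
        "MetricName": "FreeStorageSpace",
        "Dimensions": [
          { "Name": "DBInstanceIdentifier", "Value": "my-db-instance" }
        ]
      },
      "Period": 300,
      "Stat": "Average"
    },
    "ReturnData": false
  }
]
EOF

aws cloudwatch get-metric-data \
  --metric-data-queries file://queries.json \
  --start-time 2019-04-04T05:00:00Z \
  --end-time 2019-04-04T10:00:00Z
```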


Sunday, April 24, 2016

How to get Ricoh Theta S ready for live streaming on a mac

Being curious about 360-degree photography and the video live streaming possibilities just announced by YouTube, I acquired a Ricoh Theta S camera.

At first sight, the Ricoh Theta S has a rather unusual design compared to usual cameras. But wait, this is a 360-degree camera, so it has to be somehow different. To understand how it can make 360°x360° images and videos, it helps to view it from the side.


There are fisheye lenses on both sides of the device, each of which can see more than 180°, so the camera can create a full 360° view by stitching the two images together.

Stand-alone photography or video shooting is straightforward:

  • Power on the device using the power switch (topmost button on the side)
  • choose between photo or video mode using the mode switch
  • position the camera and press the trigger button on the front side, bam, that's it.
As you need to hold the camera in your hand, you will see a lot of your arm in the image. If you do not want to be so huge in the image, you can use a tripod and a mobile app (available for iOS and Android) to remotely control the camera.

Mobile apps for iOS and Android allow remote controlling the camera

The Theta S comes with built-in wifi, which you can enable using the middle button on the side of the device. So you can use it wherever you want.
  • Download the Theta S app from the App Store
  • Switch on the camera
  • Enable wifi
  • Connect your smartphone
I only tested the iOS app. This app allows you to remote control the Theta S. In photo mode you can preview the 360° images; in video mode you do not see the actual camera view, you can only start and stop recording. You can later download the recorded video into the app and view it, also in VR mode, which uses the gyroscope of your phone to control movement, and in stereoscopic view for use with Google Cardboard or other devices of this kind. The iOS app will not work with the camera when the camera is in live streaming mode.

By the way, you might want to change the wifi password, as the default is the numeric part of the wifi SSID, which is visible to everyone close enough to access your wifi signal. You can do this within the app, after connecting it.

Live Streaming mode

Before you start

Most of this section I learned from this blog post on schleeh.de (in German).

Download and install two pieces of software from Ricoh for the camera: "Theta App" and "Theta UVC Blender".

The first one is helpful to update the firmware of the camera. Connect the camera via USB to your Mac and start the app, then select the firmware update option in the file menu. An update will only happen if the battery is sufficiently charged.

The second app registers a video driver that provides a stitched video channel with the full 360° video. If you use the Theta S camera directly, you will instead see a video with two circular images side by side, each showing a half sphere. So for live streaming you should use UVC Blender.

Now that everything is prepared you can ...

Start streaming 

I always start with a disconnected, powered off camera.

  1. To enable live streaming mode, hold down the mode button while powering on the Theta S. The camera will then start in live streaming mode, indicated by a blue "LIVE" on the camera front.
  2. Connect the camera to your computer. As I currently can not stream an HDMI signal to my Mac, I use USB. HDMI video quality will likely be better, but I would need extra hardware like the Blackmagic Design UltraStudio Mini Recorder, which I currently do not own.
  3. Choose the channel you want to stream to. I tried Google Hangouts on Air. This seems to be the easiest way to live stream, as it does not require any additional broadcaster software. A captured video of the stream will be available in your YouTube channel (this might require some user validation before you can start).
  4. I created an event, started the hangout with only myself to share with, and then selected the UVC Blender camera as the video source. To do this in the hangout, select the settings icon and choose "THETA UVC Blender". That was basically it. I could start broadcasting.

Wednesday, September 26, 2012

Single Minute Exchange of Die or how to counter "this will not work"

Just learned some details about the software development process of a big German enterprise. They have a 4-month iteration cycle, and any project has to run through several (3-4) of these iterations from idea to live. Thus an idea takes around 12 to 16 months to go live. And only one iteration actually involves working on the code; the others are analysis, specification, testing and deployment. Then (hopefully) every 4 months the whole system, consisting of a lot of components, will be deployed.

Thinking about this situation, I wondered what it would be like to try to convince the responsible people to reduce this cycle time and/or decouple the components in terms of their release cycles. In this discussion I would expect a lot of "this will not work" statements.

Changing an enterprise's mindset to replace "this will not work" with "we currently do not know how this can possibly work, but we will give it a try" sounds like a good first challenge to take on in such an organization. Especially as it felt like the middle management was not much looking for a change in this process.
This change in language could bring discussions to focus more on the "how" than on the "if". Unbelievably, they changed the iteration length from three to four months just recently, which sounds like a change in the wrong direction, not following one of my favourite agile quotes: "If it hurts, do it more often".

So how to convince a group of middle management folks to try this first step and get a change going?

That is where I remembered the Single Minute Exchange of Die part of "just in time production", which I heard about in the context of the Toyota Production System. I would assume that Shigeo Shingo experienced a lot of "this will not work" when he proposed his idea of shrinking a changeover process that usually took hours or days down to minutes. He definitely did not know the perfect solution when he started; they changed the process iteration by iteration to make it more efficient. But they started and made it a great success.

The previously mentioned process change from a three- to a four-month iteration interval additionally indicates a batch size or complexity problem. And still it seems to be counterintuitive to many people that smaller batch sizes are more efficient. I remember reading a blog post discussing batch size in a software development context, but unfortunately cannot remember where I read it. When searching for batch size, however, I found this article very helpful. And the types of constraints section from this page on the same site fits even better. Actually the whole site seems worth reading.

Maybe I should give Goldratt's book What Is This Thing Called Theory of Constraints a try.

Saturday, June 30, 2012

What is wrong with log levels, Takeaway from #devopsdays

Looking at logfiles, it should be easy to determine the required actions depending on some attribute of each log message. The most obvious attribute, from my perspective, should be the log level. But an open space discussion group at #devopsdays Mountain View 2012 came to the conclusion that this does not currently work.

Developers and operations people have a different understanding of what actions are required to happen on log messages depending on their log level. One of the participants (sorry, I could not remember her name) suggested that we should be able to agree on the convention that log entries with either error or warning level should be acted upon if seen in production. Surprisingly for me, many disagreed. The loudest one was Jordan Sissel. He said this is a language problem, as devs use a different interpretation than ops, and we will not be able to change the world and make all third-party apps and tools stick to this convention. And thinking about our logfiles, and how many errors show up there that nobody would like (and be required) to be woken up for at 3 o'clock in the morning, I think he is right.

But why? What is wrong with our current log levels? Shouldn't it be obvious how log levels translate into (required) actions?

Having a Java history, and being used to log4j, I revisited the available log levels and how I used them.
  • FATAL -> Very rarely used, only on startup failure; the app will shut down after this message
  • ERROR -> There is something wrong that the application can not suitably compensate for
  • WARN -> There is something wrong that should not be, but the application can compensate
  • INFO -> Some context on what is currently going on in the app, what relevant business activities are happening (major workflow steps)
  • DEBUG -> Shows a lot of details about what is going on in the app
  • TRACE -> Never used this level
I would like to see my production systems run on INFO level. Only in special situations would I switch the log level to DEBUG for some interesting classes in production, to get more context on what is happening. As log4j allows setting the log level by category, which is usually bound to the class name of the class calling the logger, I can usually do that safely for some time (without restarting the app), as I will do it for a single class or package only.
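In log4j 1.x properties syntax, such a temporary override could look like this (the class name is made up); with PropertyConfigurator.configureAndWatch the file is even re-read at runtime, so no restart is needed:

```
# everything runs on INFO by default
log4j.rootLogger=INFO, console
# temporarily raise one interesting class to DEBUG
log4j.logger.com.example.shop.CheckoutService=DEBUG
```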

Unfortunately log4j is not the only logging framework in the world, not even in Java. There are lots of them, and many of these try to differentiate themselves on the log levels. This is what the Java built-in Logger uses:
  • SEVERE (highest value)
  • WARNING
  • INFO
  • CONFIG
  • FINE
  • FINER
  • FINEST (lowest value)
The lower level values of this logger indicate something that I did not get from the log4j levels: there are different concerns addressed by the levels. Severity of an issue at the higher levels, and detail or volume control at the lower levels.

From the "what should I (or an automatic log analyser) do" perspective, and to remove the language problem, more explicit levels would be helpful, like:
  • PAGE_SOMEONE_INSTANTLY
  • PAGE_SOMEONE_IF_REPEATED
  • CONTEXT
For detailed tracing, a LOGFILE_KILLER level could be added. This should solve the language problem in terms of the actions required (assuming a basic understanding of the English language), as each level explicitly says what should happen.

Maybe it should even be simpler, only PAGE_SOMEONE, as the instantly vs. if repeated decision could happen outside of the system.
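As a sketch of this idea using Python's standard logging module (the level numbers and the messages are arbitrary choices of mine), such action-oriented levels can be registered like this:

```python
import logging

# Action-oriented levels as proposed above; the numeric values
# are arbitrary, chosen to slot between the built-in levels.
PAGE_SOMEONE_INSTANTLY = 45
PAGE_SOMEONE_IF_REPEATED = 35
CONTEXT = 25
LOGFILE_KILLER = 5

for value, name in [
    (PAGE_SOMEONE_INSTANTLY, "PAGE_SOMEONE_INSTANTLY"),
    (PAGE_SOMEONE_IF_REPEATED, "PAGE_SOMEONE_IF_REPEATED"),
    (CONTEXT, "CONTEXT"),
    (LOGFILE_KILLER, "LOGFILE_KILLER"),
]:
    logging.addLevelName(value, name)

logger = logging.getLogger("demo")
logger.setLevel(CONTEXT)  # production setting: context and above

logger.log(PAGE_SOMEONE_INSTANTLY, "primary database unreachable")
logger.log(CONTEXT, "nightly import started")
logger.log(LOGFILE_KILLER, "suppressed at this level")
```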

And of course we will not be able to change the world, but we could start to change the part of the world, we can control.

log4j is a trademark of The Apache Software Foundation, Oracle and Java are registered trademarks of Oracle and/or its affiliates.

Bug Levels, or how simplicity made our life easier

Had a conversation with Jordan Sissel at devops days regarding log levels and the different understanding between dev and ops people. Jordan called this a language problem. I will write another post about that topic, but the discussion reminded me of another situation, where some confusion existed, due to unclear definition of levels. Here is the story.
 
Some years ago, we had more than 300 bugs in our bug tracking system, about 50% of them older than 3 months and some even older than a year. It took us some effort to manage these bugs, and we were quite frustrated with this situation. Especially as IT seemed to be the only department that cared about the bugs. Product management did not care; they focussed more on new features.

Here are some of the things we did and what we achieved:

Many bugs were left unassigned for quite a long time, and sometimes bugs would ping-pong back and forth between teams. To overcome that, we introduced a dedicated role to manage these bugs, called the bug dispatcher. This guy would try to find out which team was most probably responsible for fixing a bug and push them to fix it. He would, among other things, be involved in discussions with product owners about whether a given bug should be fixed now or not. This helped a lot to bring new bugs to the right teams fast, but it did not really reduce the number of open bugs.

So next we decided to close all bugs older than 3 months. If they were not fixed within this time, they could not be important. But this also required some effort, as we were not consistent enough to just automatically close them; instead we ran around and talked with the POs to close them.

As this was not such a great success, we had a look at our bug levels and how long a bug would stay open depending on the level assigned. We had quite a lot of different bug levels in our tracking system at this time:
  • Blocker
  • Critical
  • Major
  • Normal
  • Minor
  • Enhancement
As far as I remember, we basically took the levels the tool was shipped with. Nobody could explain if and how the handling of a minor bug should differ from that of a major bug, or what resolution time would be acceptable.

It turned out that bugs with one of the first three levels assigned would be fixed within a reasonable amount of time; all other levels were interpreted as "we will never fix this".
    
Having understood that, we drastically changed the number of bug levels and the rules applied for handling these.

We now have only two bug levels left:
  • Critical
  • Normal
Critical means we are or will be losing money or reputation due to this bug, which means we fix it immediately if found in production, or we will not push the release to production if found on staging.

Normal means, we fix it with one of the next releases.

If a bug is in the tracking system, it will be fixed, unless the product owner declares the behaviour acceptable for the product and closes the bug.

This reduced the number of open bugs drastically. And it also took away a lot of discussions. Some simple and easy to understand rules made solving this problem much easier. Getting rid of all the other levels enforced a decision.

Simplicity makes our lives much easier now.

Monday, April 30, 2012

I want office desks with wheels

The one problem I remember from all IT retrospective meetings, many from back in the times before DevOps was practiced, is "communication", or more precisely "missing communication".

One of the principles behind the agile manifesto that I think fits here is:

The most efficient and effective method of conveying information to and within a development
team is face-to-face conversation. 

Although I would replace "development team" with "organization".
I think it is well known that geographical distance affects communication. Somewhere (I can't remember where) I read the hypothesis that you basically have four classes of communication depending on distance: same room, same building, same country, other country. And between each class, communication quality/rate/probability will drop significantly. Maybe by an order of magnitude?

Communication needs people to overcome a wittingly or unwittingly present barrier. The more complicated communication is, the higher this barrier will be. Will you pick up the phone, will you move to check if the other one is at their desk, or do you fall back to writing an email, which makes communication asynchronous and one-way, prone to misunderstanding?

If you are in the same room, you can communicate with everybody just by speaking up. Hopefully only people involved with the same product are in this room, and there are no (cubicle) walls.

If you are in the same building, you can stand up and walk to the person you need, but you may not find her and waste time.

In the same country, you will most probably use electronic communication options.


Other countries may add timezone and language issues.  

In short, people should talk, face to face if possible. 

But what if you are not in the same room, but in the same building? And your issue with somebody else is more than a short talk, something you have to work on together for, let's say, hours or even days.

Well, I think you should sit together for this time. People and teams should relocate to optimize communication. And this should be as easy as possible. Great if you have laptops and wifi, and there is some free desk space around. If not, why not use desks with wheels, so it is easy to move them around, plug in power and network, and off we go.

I have wanted that for years and have not gotten it yet. But now I read the Valve employee handbook. Wow, these guys, among other very interesting ideas, have desks with wheels, and an automatic tool to locate where people are sitting, depending on where their workstations are plugged in.

I like that idea. Especially as I think it is not a coincidence that the word agile is a term related to motion and movement.

Let's move!  

Wednesday, March 21, 2012

Reading "The Design of Everyday Things" will change your life

Some months ago, I read a tweet from @oschoen about the book "The Design of Everyday Things" by Donald A. Norman, stating that reading this book (quote) "will ruin your ability to walk through doors without complaining". I wondered why, and started reading.

Now, some months later, after reading the book, I know what @oschoen meant.
I just visited the office of a law firm. Very stylish office design, but even people who have worked there for some time do not know how to operate the doors of the wardrobe. And that is only one example. Doors that give no clues about how to use them, or even worse, give misleading cues. Taps that win design prizes, but until you know how they work, you always have to fiddle around with them for some time.

To make it worse, there are catastrophes that happen because of the bad design of switches or other things needed to operate power plants, trains, airplanes or whatever. And people feel bad about themselves for not understanding how things work, instead of blaming the designers who built them.

Everybody designing products, mobile or PC applications, or web interfaces should give this book a try.