About Gene Kim

I've been researching high-performing technology organizations since 1999. I'm a multiple award-winning CTO, Tripwire founder, and co-author of The DevOps Handbook, The Phoenix Project, and Visible Ops. I'm a DevOps Researcher, Theory of Constraints Jonah, a certified IS auditor and a rabid UX fan.

I am passionate about IT operations, security and compliance, and how IT organizations successfully transform from "good to great."

Entries in talks (30)

Thursday, Jan 26, 2012

Talk Notes: "Why Does Bad Software Happen To Good People?", Matt Tesauro: LASCON Keynote

LASCON 2011: October 27, 2011

Matt Tesauro was the project lead for the LiveCD OWASP Project and is on the OWASP board. My notes are below...

Thursday, Jan 26, 2012

Talk Notes: A Statistical Journey through the Web Application Security Landscape: Jeremiah Grossman: LASCON 2011

LASCON 2011: October 27, 2011

Jeremiah Grossman is the founder of WhiteHat Security, where my good friend Stephanie Fohn is currently CEO (she helped us with our first initiatives and product launches at Tripwire over a decade ago, for which I'll be forever grateful). Jeremiah is also very well-known for his work on metrics and benchmarking all aspects of vulnerabilities.

Here are my notes/tweets from Jeremiah's presentation:

Thursday, Jan 26, 2012

Talk Notes: The Infosec Perspective of DevOps: James Wickett: LASCON 2011

LASCON 2011: October 27, 2011

James Wickett and his ex-boss @ernestmueller are both a very special breed of people. James is well-known for his experience as an information security practitioner and his leadership in the OWASP community (he is the conference chair for the upcoming 2012 OWASP USA conference). But what makes him so interesting to me is that he is a boundary spanner. Beyond just infosec, he has experience doing IT Operations, as well as Development and DevOps practices.

(Incidentally, I consider his presentation "The Rugged Way in the Cloud--Building Reliability and Security into Software" to be one of the seminal works on how information security integrates into DevOps-style practices. It is shown below, even though that isn't the topic of this talk note:)

At LASCON, he presented with Peco Karayanev on the PIE tool they built to integrate security practices into daily development and IT operations work. It will look very similar to a DevOps presentation, but it hints at how organizations can integrate and deliver the non-functional requirements from the Rugged Computing initiative (e.g., scalable, available, survivable, securable, supportable, etc.).

Here's how they describe PIE, which is a tool they developed at National Instruments to support developing applications that are served up in the cloud:

Monday, Jan 23, 2012

Talk Notes: Gamification: Gabe Zichermann: ISEPP Lecture Series

ISEPP Lecture Series: November 17, 2011

This was a fantastic talk. Gabe Zichermann helped codify gamification, writing a number of books on the topic, including "Game-Based Marketing: Inspire Customer Loyalty Through Rewards, Challenges, and Contests" and the O'Reilly book "Gamification by Design".

My tweeted notes are below:

Wednesday, Jan 4, 2012

Talk Notes: John Allspaw and Paul Hammond: "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr": Velocity 2009

2009 Velocity Conference: 6/22-24, 2009, Santa Clara, CA

I'm re-watching John Allspaw's (@allspaw) seminal 2009 presentation, "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr." This talk is widely credited with showing the world what #devops could achieve, showing how Flickr was routinely deploying features into production at a rate scarcely imaginable for typical IT organizations that were doing quarterly or annual updates.

This SlideShare presentation and the Blip.tv video can be found on John's page here.

John presented with Paul Hammond (@ph), who was his VP Engineering counterpart at Flickr/Yahoo.

This is an awesome talk -- it was even better than I remembered it being. John and Paul discuss the incredible Dev and Ops challenges running one of the largest Internet sites, and how they created the breakthroughs. Nice job, guys!

Talk notes:

  • Funny how the Ops stereotype is oft the same as Infosec: "no, no, no"; "who wants to work w/that person?"
  • "Ops job is not to keep the site up; it's to enable the business; requires ability to enable business change"
  • "Problem: change is the root cause of most outages; Ops paranoia is warranted"
  • "Options: discourage change for stability (crotchety) OR allow change to happen as often as needed (smart)"
  • "You need Ops people who think like Dev; and Dev people who think like Ops"

  • Automated infrastructure:
    • "Manually managing more than a dozen servers makes Dev job almost impossible"
    • "Enablers: OS imaging and role & config management; all this enables cloud (e.g., EC2)"
  • Version control:
    • "Flickr source code was in CVS, but Ops stuff was in Perforce; 1 repository critical"
    • "One repository where all dev/ops changes reside, you can quickly see what change to mitigate issue"
  • One-Step Build:
    • showing screenshot of build/stage button: click, SVN checkout, compiles all templates...
    • "...copies everything to staging server for testing, automatically. No manual running commands, undocumented steps"
    • "Obviates issue of Dev/QA/Production config drift, undocumented steps, etc."
    • "After that comes One-Step Deploy; showing Flickr deployment screen; deploy log is poor man's change control"
    • "Viewing deploy log may show other deploy in progress, so deployment can be aborted/delayed"
    • "Pressing 'I'm feeling lucky' will deploy code; no manual steps that can go wrong; continuous deployment/integration"
    • "Deploy log: we know who, when and what; deploy timestamp goes on top of all monitoring tools"
    • "You can't deploy 10 times/day if you're crashing 10 times/day. That's not agile, that's retarded" (haha)
    • "You can use capistrano, makefiles, RPM; we use Hudson to generate packages for ops"
    • "We can now make each deploy smaller, less risky, and more frequent changes; aids in faster recovery"
  • Feature flags
    • "aka branching; lots of branching comes from 'desktop software' lifecycle artifacts"
    • "For online services, there's only one version that matters: Production; we always ship trunk"
    • "We don't do all dev work in trunk; but by always shipping trunk, you always know which code/env is running"
    • "Instead of branching code, we enable all new features in code with configurable settings; enables private betas"
    • "Allows private betas on production servers with production traffic; we have great staging environments, but..."
    • "...you may not notice new diffs betw QA & Production; allows bucket testing (eg, enable for 5% of users/traffic)"
    • "Obviates need for taking servers in/out of rotation, different code bases in production, etc. Do it in code"
    • "Allows dark launches, silently turning on new features, but not making it visible: gives ops experience w/o risk"
    • "For Ops, it takes away all the fear and suspense, because Ops gains experience before it goes live"
    • "Eg, new Flickr homepage had new features that created massive new db load; for weeks, db was being queried, but"
    • "...data thrown away. Dark launch period gave Dev/Ops time to prepare, improve, so launch was flawless"
    • (Brilliant dark launch techniques being discussed here -- I can think of so many times I would have used this!)
    • "We currently have several hundred feature enable/disable flags; we can always turn things off; if db cluster..."
    • "...starts having problem, we can disable features to lessen database loads; we don't rollback, we fix forward"
  • Shared Metrics:
    • "We gather tons of operational metrics: Dev watch these metrics as obsessively as Ops"
    • "Each Dev person will have some tab open to Ops metrics (e.g., monitoring for 37 cluster ganglia install)"
    • "We show application level metrics, combined with CPU load, network stats; app metrics give context to it all"
    • Showing graph of, for previous minute, how long each image operation took (after you uploaded kitten pic)
    • Showing graph of queue size for some image processing step
    • "John's team makes it easy for us to create graphs: just create file w/{key,val} pair & it shows up in ganglia"
    • "We create adaptive feedback loops: if database is overloaded or queue size too lg, app will throttle back"
    • Describing multi-month process of Yahoo! shutting down photos site and migrating to Flickr; enormous async queues
    • "It takes a lot of time to take years of all your photos into Flickr; petabytes of image data, tons of metadata"
    • "We knew how much storage was coming online; unknown: how many people would click 'Migrate to Flickr'"
    • "Predicting when we'd run out of storage space was a huge challenge."
    • "We put last deploy time on every monitoring tool" (showing example of impact of 'small image optimization')
  • IRC and IM robots:
    • "We use IRC everywhere, lots of balls in the air, Dev & Ops on it; we squirt events into IRC"
    • "We put build & deploy logs, critical alerts into IRC; and then shove it into search engine"
    • "Now we can ask 'has this happened before?' and 'what did we do about it?'"
  • Respect.
    • "Most important culture element at Flickr is respect, avoiding Dev/Ops stereotypes."
    • "Respect different people's responsibilities: John will get hauled in front of mgmt when site goes down"
    • "I'm going to get hauled in front of mgmt when we don't ship features on time or enough of them"
    • "Saying 'no' is another way of saying 'I don't care about your problems'"
    • "Memcache is a marvelous example of what can be created when Dev/Ops work together"
    • "Dev hiding things from Ops is a bad idea: there's prob a good reason why Ops is afraid"
    • "Dev: ask Ops abt: what metrics will change & how? what are the risks? what are signs that something went wrong?"
    • "Dev: ask what are the contingencies? how can Ops recover and help site keep running?"
    • "Dev should come up with answers to all of these before going to Ops"
  • Trust:
    • "Imagine Dev person who says deploy this & if something goes wrong, set this to zero and blame me."
    • "That's obviously a Dev guy who cares about the site, and doesn't want to wake up my team unnecessarily"
    • "Dev needs to bring in Ops when it comes to features; Dev needs to bring in Ops when it comes to upgrading tools"
    • "It sounds obvious, but all too often, I've seen where this working relationship doesn't exist: those are cowboys"
    • "To encourage this, we create shared runbooks & escalation plans: how will new features be supported?"
    • "Provide knobs/levers: provide monitoring for features, enable Ops to change things (eg, # of threads)"
  • "Controversial: give Dev access to production systems: playing phone tag over shell commands is dumb"
    • "Dev should have shell into production systems that are read-only: let them see the system, logs, etc."
    • "Non-root accounts are low risk. Solving problems without it is too difficult"
  • Healthy attitudes around failures:
    • "Airline pilots spend days each month in simulators, training for emergencies; they develop procedures"
    • "If you have heart attack, do you want treatment from EMT who deals with it once/year, or once/week?" Practice.
    • "Fire drills: During Flickr outages, junior engrs observe & practice diagnoses. after site up, check answers"
    • Showing fingerpointyness slide: showing "mean time to innocence" principles
    • "Flickr culture: we figure out stuff, fix it; and often have multiple people blaming themselves!"
  • Avoiding blame:
    • "Developers: remember that other people will be woken up when your code breaks"; "Saying sorry next day helps"
    • "Saying sorry makes people feel better about it and shows respect for Ops"
    • "Ask what you'd do if someone weren't there in middle of the night picking up your slack? what would you chg?"
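
To make the feature flag notes above a bit more concrete, here is a minimal Python sketch of the three ideas John and Paul describe: enabling features through configurable settings, bucket testing for a percentage of users, and dark launches where the code runs but the result is hidden. This is my own illustration, not Flickr's code. The flag names, the `FLAGS` table, and the helper functions (`bucket`, `is_active`, `is_visible`, `render_homepage`) are all hypothetical.

```python
import hashlib

# Hypothetical flag table. Flag names and fields are made up for illustration.
FLAGS = {
    "new_homepage": {"enabled": True, "percent": 5, "dark": False},    # 5% bucket test
    "new_db_query": {"enabled": True, "percent": 100, "dark": True},   # dark launch
    "photo_notes": {"enabled": False, "percent": 100, "dark": False},  # switched off
}


def bucket(flag_name, user_id):
    """Map a (flag, user) pair to a stable bucket from 0 to 99."""
    key = f"{flag_name}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % 100


def is_active(flag_name, user_id, flags=FLAGS):
    """True if the code path behind the flag should run for this user."""
    flag = flags.get(flag_name)
    if flag is None or not flag["enabled"]:
        return False  # unknown or disabled flags never run
    return bucket(flag_name, user_id) < flag["percent"]


def is_visible(flag_name, user_id, flags=FLAGS):
    """True if the user should see the feature; dark launches run but stay hidden."""
    return is_active(flag_name, user_id, flags) and not flags[flag_name]["dark"]


def render_homepage(user_id, flags=FLAGS):
    """Build the page, exercising the new query even when its output is hidden."""
    parts = ["classic homepage"]
    if is_active("new_db_query", user_id, flags):
        result = "new query result"  # stand-in for the real database work
        if is_visible("new_db_query", user_id, flags):
            parts.append(result)
        # Otherwise this is a dark launch: the work ran, the result is discarded.
    return parts
```

Hashing the flag name together with the user ID keeps each user in the same bucket across requests, so a 5% test always hits the same 5% of users. Setting `enabled` to false acts as the "we can always turn things off" switch, with no redeploy or rollback needed.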