Welcome. Check out the end of the file and dive in to actual editing.

(This first section is copied and pasted from OmniOutliner. It could be reformatted. All these notes follow the slides closely, with an emphasis on capturing more information and being more immediately usable with useful context.)

- a little ragging on Java programmers
- we'll be talking about how to build enterprise apps without needing a lot of
  - Java programmers
  - time
  - money
- history of Flickr, Game Neverending
- don't use a datacenter on the 24th floor :)
- basic hardware platform, first year -- 70, pretty much whitelabel
  - now: 200+ machines; Dell, HP, NetApp
- "we really only have data, metadata, and the serving of it"
- listen to this man: Don Knuth
  - forget about small efficiencies 97% of the time
- most webapps don't need to scale because they just don't get enough users
- you don't know where your bottleneck is going to be, so you don't need to waste money up front on hardware
- build with the simplest tools to begin with
- hardware mgmt scale
  - local
  - shared
  - dedicated
  - colo -- most large web applications
  - self hosted -- huge company with your own datacenters
  - self hosting not recommended unless it's really your core competency
- get hardware based on
  - availability -- how easy is it to get the part?
  - lead times -- how long until you can get it?
  - reliable vendors, importing and shipping
- non-hardware needs -- easy to forget
  - rack space if you're at a colo and growing -- enough, in a good place, for you to grow into
  - power, bandwidth, network ports
  - plan for one or two network ports to be broken in a switch
  - there's nothing worse than spending a bunch of money on hardware, plugging it in, and not being able to use it
- hardware redundancy as a part of platform design
  - hot spares -- both are running; when one dies, traffic goes to the other
  - cold spares -- gotta turn it on and plug it in -- for stuff that doesn't fail much
  - don't forget spare power strips and other cheap stuff
  - hard disks fail all the time! but are pretty cheap to replace
    - need at least one spare of each kind of disk you're using
    - also good to standardize on one kind of disk
  - keep spares of everything: switches, firewalls, network cables, LBs...
  - everything will break at least once
  - avoid having single points of failure without cold spares
    - if you don't have spares, then you need to know the replacement lead times
  - redundant hardware platform is expensive -- do what you can afford
  - entire datacenter redundancy
- gigabit ethernet is really, really fast. so is fast ethernet. not too big a deal.
- network layout
  - completely flat is good -- Flickr is completely flat
  - subnetting is only really useful for security stuff these days
  - multi-homing can be useful
    - split off certain kinds of traffic and guarantee QoS
    - e.g. user traffic not affecting service between two servers
    - very useful for redundancy against network failures, but not really necessary
    - better to have a subset of servers on one switch and a subset on another
  - really, if you're not doing anything with video, just go completely flat
- load balancing
  - hardware (expensive!)
  - software (cheap!)
    - Pound -- can't do it all, but simple
    - Perlbal from Danga
    - LVS
    - Zebra
    - LNAT -- very complex, if you can understand the docs
    - need hardware to run it, but it's a lot cheaper than $100k
      - just a processor, some RAM, and a slow disk. maybe $1k.
  - good for distribution, not so much for redundancy
  - layer 4 -- balance based on packet destination
    - round robin
    - least connections -- some connections take longer than others
    - least load -- load average, memory usage, whatever
      - e.g. "shortest queue for image rotations"
  - layer 7
    - URL hashing
      - a unique URL will always go to the same machine
      - really useful for cache files! (and cool!)
      - cache files only need to exist on one cache server
      - many more cache hits
      - very new technique -- lots of hardware can't do it, only some software can
      - mod_proxy and mod_rewrite -- not as fast as hardware but easier to implement
      - hash table or hash function
        - table has the disadvantage of needing a lot of space
        - function isn't as good when a server goes down
      - careful of premature optimization!
  - huge scale balancing
    - GSLB
    - AkaDNS
    - LB Trees
  - non-WWW balancing
    - MySQL / DB -- we'll be talking about this kind of stuff in depth later
    - SMTP -- very similar to HTTP
    - application specific, might need to write something from scratch
    - mostly completely in software on the webapp side
    - poor man's load balancing
- Software Architecture
  - trifle
    - Sponge / MySQL -- supports everything else -- database -- data and its relationships
    - Jelly / PHP -- rules of how we manipulate data
      - complete separation
      - no access from cream to sponge!
        has to go through jelly
    - Fruit -- in the jelly, objects
    - Custard / PHP -- page logic -- how people interact with the business logic
      - very separated from jelly so things don't get messy
    - Cream / Smarty -- markup and content delivery
    - Fruit / CSS -- presentation
  - lots of different technologies for these things
  - the importance of interface design -- v. important for complete separation
  - in the old days things were really smushed together
  - layer abstraction (don't need to invent your own stuff)
    - Database to Business Logic -- SQL
    - Business Logic to Page Logic -- this is the interesting stuff for us
      - very clear when you're going C <=> PHP for example
      - less clear when you're using the same technology
      - "what function calls do we need?"
        - add a photo to the database
        - get all comments to a photo
  - designing a feature
    - the different actions that will be required become the feature
- The web apps scale of stupidity
  - OGF (one giant function) <== Sanity ==> OOP
  - OGF: old perl apps, KB
  - OOP: zope, plone, rails
  - OGF: fine grained control
  - OOP: easy to maintain, maybe, we'll see
- Tech overview -- what Flickr uses
  - Linux and FreeBSD
    - 2.6 on i386, 2.4 on x86_64
    - Distro doesn't matter! Consistency more important.
  - Apache
    - Apache 2
    - Prefork MPM
    - mod_php
  - MySQL
    - InnoDB, character sets
    - MySQL.com binaries
    - Compiling it yourself tends to end up misconfigured and slow -- waste of time
  - PHP
    - ensure you have consistent versions of everything
    - PEAR's PHP_Compat is your friend with PHP
  - JVM
  - Smarty
  - LAMP is cheap and easy
  - easy to find Java developers but their webapp experience is questionable
- The twelve three rules
  - ABSOLUTE MUST
    - Use source control
    - Have a one step build
    - Use a bug tracker
- Source Control -- What is it?
  - Versioning
  - rollback
  - blame
  - Tagging -- "at this point, this is a release"
  - Branching -- fix bugs on just a release, typically for a longer release cycle
  - technologies
    - CVS (RCS)
    - Subversion / SVK -- all of CVS without a lot of its problems
    - lots of others (perforce, darcs, bitkeeper, mercurial, etc)
  - nice to have but not essential
    - web interface -- don't write your own, waste of time
    - mailing list
    - RSS feed
    - Bonsai
  - really useful to actually describe your changes
  - What goes in it? Everything!
    - App code
    - Website assets
    - Documentation (or on a wiki)
      - getting devs to write it: use emails and IM
      - when we're spending time asking other devs how things work
      - when we're spending time reviewing and relearning our own code
    - Configs -- anything you need to modify after a fresh install
    - Build tools -- scripts used to support the building and deployment
  - if you use source control, you'll never lose more than a few hours of code
    - back up the source control repository
- edit it live!
  - insane, not a good long term plan
- Good long term plan
  - we need a setup that supports rapid iteration -- fast
  - but which supports some rigor -- a little disciplined
  - minimum of
    - development
      - working copy of the site (not local developer copies)
      - shared version of the site
      - all dev work happens here
      - before it goes anywhere it starts here
      - all dev services on one box is okay if you can manage that
      - we've got a handful
    - staging (is "testing" a misnomer?)
      - all changes going to production need to go here
      - testing is a loose term, more on that later
    - production
      - customer-facing
      - only updated via the staging site
  - In a rapid environment we're going to want to make a lot of releases
    - 10-20 releases a day, 30m release cycle (or even just one or two a week)
    - "the simpler you make the release process, the less likely you are to mess it up"
- (previous are notes from OmniOutliner)

(I missed some of the stuff before this)

Sync the staging server to the prod servers

== Build Tools ==

Just two buttons. About as simple as it gets. Quite hard to go wrong.

== Bug Tracking ==

=== Track everything ===

* bugs
* features
* ops
* store info for planned downtime, upgrades, etc
* support cases

=== Minimal feature set ===

* title, description
* notes
* status
* owner - person that created it
* assigning
* (how about project?)

=== Bug tracking software ===

* FogBugz -- not free, but Cal likes it
* Mantis -- Flickr uses this one, free, PHP
* RT -- very complex, free, Perl
* Bugzilla -- very ("insanely") complex, free, Perl
* "Trac's fairly good"

(urls?)

=== Dealing with bugs ===

Fix bugs first before adding new features. Fix easy-to-fix bugs first. The fewer bugs you have, the less daunting the list looks and feels, which makes it easier to develop.

Categorize your bugs (P1, P2, etc)

* P1 - Must Fix. "I might even fix this in the middle of the night"
* P4 - "hopefully nobody will find this bug"

The fewer categories you have, the less likely you are to get them in the wrong category. Flickr uses Immediate, Urgent, High. (P1-P3)

=== CADT (Cascade of Attention-Deficit Teenagers) ===

Developers like writing new features, and hate fixing bugs. Bugs that are filed just get ignored, then mass marked as closed. This is not a good approach.

Good book: Steve McConnell, Rapid Development http://www.amazon.com/exec/obidos/tg/detail/-/1556159005?v=glance

== Coding Standards ==

> It's more important for people on a team to agree on a single coding style...
Set standards for

* File naming (directory names, etc)
* DB table/column naming (case, underscores, etc)
* Function and variable naming
* Indentation, whitespace, comments
* Braces, etc.

(why not use existing standards like the GNU coding standards? or even just follow the lead of something such as Rails. See quote above.)

"Having a standard early on really pays as soon as you hire a second developer."

12 developers working on Flickr. Engineering team is about 6 or 7.

== Testing Webapps ==

Testing webapps is *hard*. Anything that we can test easily, we can build an automated test for. For example, testing against a public API. Testing against pages is hard because they change so much. Easiest is to just use the features. Eat your own dog food.

Flickr has 0 dedicated testers.

"Nearly everything we've done we've done wrong and fixed later."

Editors used -- everyone uses something different. They do have shared documentation and configs checked in to source control.

* Emacs
* Vim
* BBEdit
* Eclipse
* TextMate

= Part 2: data and protocols =

== Unicode ==

Unicode is *a* character set, and *some* encodings. It is a standard (ISO 10646). It is good for i18n. It is not for L10n. L10n does require some i18n.

ASCII is both a character set and an encoding. "a" is code point 0x61, and maps to byte 0x61. In Unicode, Bengali Vocalic RR is code point 0x09E0 in the character set, which maps to different bytes in different encodings. For example, 0x09 0xE0 in UCS-2. In UTF-8, 0x09E0 => 0xE0 0xA7 0xA0.

UTF-8 is our friend. It is ASCII transparent, so it's easy to upgrade to. Basically, it's easy to implement in HTML, XML, email, and others... Set utf-8 in the HTML or XML header, content-type, charset=utf-8 and so on.

You'll need a new substr, because UTF-8 is no longer 1 byte per char. You can use verify_utf8 in PHP. (Java? others? ask him. they do some Java stuff maybe he'll mention it.)

JavaScript mostly has native support, _except_ the escape() function.
So you'll need to implement your own escape functionality.

Email uses a header for the content block, but non-ASCII text needs to be encoded inline for header fields (like To, From and Subject). See RFC 1342 (since superseded by RFC 2047).

== Data Integrity ==

You need a data integrity policy. It is sensible to filter data at the borders, so that inside the application you only manipulate valid data. You can't trust what's coming in.

Strip out some characters, such as anything below 0x20. Normalize line returns, which are different on every platform. Many XML parsers barf on carriage returns in value strings.

A public API will often get bad data that needs to be checked. Filter using iconv: tell it you want it to convert UTF-8 to UTF-8. Don't try to write a big regex for it. In PHP, check out utf8_encode to go from ISO-8859-1.

== Filtering (X)HTML ==

Displaying user HTML in your application is a really bad idea. Style and JavaScript are big spoofing holes. Best to use a whitelist, otherwise you're just in an arms race. Even then we still need to be careful, because parsers can be tricked.

IE7 is just about to come out with a whole list of new tags to exploit "for fun and profit". :) (yes, people actually still use IE)

For example, try matching against href="javascript:foo" and its variants, which legitimately work in a lot of browsers: whitespace, tabs, 0x0, encoding, encoding with zero padding, encoding in decimal or hex, mixing case, whatever you want. For starters.

So use a good library like lib_filter from http://code.iamcal.com/ (PHP).

== SQL Injection Attack ==

This ought to be good.

E.g., pass in the query string password="foo' OR '1'='1" (pretty sure)

Just-in-time escaping: escape just before sending SQL to the db, so you're not assuming that the data is good. Also avoids action-at-a-distance problems.

Example of just-in-time escaping in PHP:

 db_insert('table_name', array(
   'field_1' => db_escape($val1)
   ...
 ));

In Java, use PreparedStatement.
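The attack string from the notes above can be tried against a throwaway database. The following is a minimal sketch in Python, not anything shown in the talk: the `users` table, the two function names and the use of `sqlite3` are all assumptions made for illustration. It contrasts pasting user input into the SQL text with handing it to the driver as a bound parameter, which is the same idea as a prepared statement.

```python
import sqlite3


def find_user_unsafe(conn, name, password):
    """Vulnerable: user input is pasted straight into the SQL text."""
    sql = ("SELECT id FROM users WHERE name = '%s' AND password = '%s'"
           % (name, password))
    return conn.execute(sql).fetchone()


def find_user(conn, name, password):
    """Safer: values are bound as parameters, so the driver does the quoting."""
    cur = conn.execute(
        "SELECT id FROM users WHERE name = ? AND password = ?",
        (name, password),
    )
    return cur.fetchone()


# Hypothetical schema; a plaintext password column is used only to keep the demo short.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, password TEXT)")
conn.execute("INSERT INTO users (name, password) VALUES (?, ?)", ("cal", "s3cret"))

attack = "foo' OR '1'='1"
print(find_user_unsafe(conn, "cal", attack))  # the injected OR clause matches a row
print(find_user(conn, "cal", attack))         # None: the whole string is compared as a password
```

Binding happens at the last possible moment, right where the query is sent, which matches the just-in-time approach described above.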
Rails ActiveRecord does this automatically as well, with transparent support for the syntax of the db you're using.

== Dealing with Email ==

Email allows mobile blogging, file uploads, support tracking.

Be lazy: don't write an SMTP server! Reuse existing software. Set up pipes in /etc/aliases for a script that reads from stdin. Use the same processing logic as a file upload form. Except for some caveats, of course...

=== MIME and attachments ===

Relevant email RFCs:

* RFC 561 -- Sept 1973
* RFC 822 -- Aug 1982
* RFC 1521 -- Sept 1993

Basic explanation of MIME...

Don't reinvent the wheel; parse MIME with a library that isn't too broken.

MSFT has a special attachment format for Outlook. Fairly easy to unpack once you know the format. The spec might be on the MSFT site, good luck finding it...

Incoming email might be in a number of different character sets. For old clients we can mostly assume it's Latin-1. iconv does the heavy lifting pretty much for the rest.

Special cases from wireless carriers. Subject lines, attachments, links, and so on. Be careful.

It's useful to capture weird emails and add them to the test suite. Flickr has 200 special cases for email that are all unit tested from one button.

== Talking to other services ==

We can use XML but there is some ugliness. Namespaces, for one, plus it's kinda heavy. Sometimes you need a lightweight protocol, so write your own.

In general, you need to always assume everything is going to fail. You should have safe reads and writes for things like data replication. Also, in the event of failure, it's pretty easy and beneficial to implement hot failover to another service. Always assume failure!

Easy to use HTTP to communicate with other services. Well documented. Tried and true. GET, PUT, POST, DELETE.

The first rule of remote-services club is always expect the service to fail. The second rule of remote-services club is always expect the service to fail.

== Asynchronous Systems ==

Some services take a long -- or variable -- amount of time.
Anything that takes longer than a few milliseconds. In Flickr, any page should take less than 300 ms to return.

Rather than keeping a connection open, use a callback or return a ticket which you can use to query for the status later. Or just assume it'll happen, if you don't care when.

Flickr daemons written in Java: image daemon, storage master. They manage queues of job requests and assign a ticket which you can use to check on the job's status later.

Why Java? Well, just anything but PHP, which leaks too much memory to be a system daemon.

-----

= Bottlenecks =

Bottlenecks are where the most time is spent. Most of the time, they are not what you expect.

CPU is not usually the bottleneck. Only when you're doing heavy crunching, like crunching images, audio, video, etc. Just don't do dumb things, and add more RAM.

Databases are usually a big bottleneck. Flickr uses interesting monitoring feedback that measures time spent in the database. It is only enabled in the staging environment, and shown when a certain GET parameter is provided.

One major bottleneck is disk IO somewhere, which is really slow. Some tools for examining disk IO: iostat, Bonnie/IOZone. Flickr uses mostly SATA, with some SCSI for db servers. Ultimately, it is pretty easy to just add disks.

(Missed the section on memory and swap.) Basically, don't let MySQL or Squid swap.

Query profiling. Benchmark queries and turn off the cache. Track all the queries and group them into classes. Count how often each class runs and track the total time spent in it. Gather objective data so we can go for the low hanging fruit.

=== Speeding up queries ===

Indexes, caching, denormalization.

MySQL indexing, partial indexing. Getting the correct indexes is essential. Just hire a contractor to optimize for you.

Some advice on MySQL indexes and column ordering and such. Columns with highest cardinality to the left.

EXPLAIN is your friend: EXPLAIN SELECT * FROM Accounts WHERE id = 287 will give you a table of info about the query.
Rule of thumb is "the more stuff in the extras box, the slower your query will be" (the "extras box" being the Extra column of the EXPLAIN output).

Memcached -- resident memory cache. Basically a big hash table that runs on any computer with some memory to spare. Easy to drop in to a well designed app.

"Leverage existing technology for explosive synergistic growth"

== Monitoring ==

Monitoring is important because a lot of bottlenecks only occur when you have real traffic, so it's easier to measure than to anticipate.

Use a 'beacon' -- a 1x1 transparent gif at the top of the page which you track requests for.

Long term data tracking: Ganglia. ganglia.sourceforge.net. "Ganglia kicks ass!" Better at tracking trends than spotting emergencies. Ganglia is also used in the "Rocks" Beowulf Cluster package.

Nagios for real-time emergency spotting (ping, http, mysql, disk usage, replication lag, whatever). They use Nagios. We almost use Nagios.

----

= Scaling =

Scalability is not:

* Raw speed
* Using XML - it's just a format for data
* Using Java - you can build scalable stuff in Java. And assembly.
* or byte-code
* Separating page and business logic - that's for maintainability
* Having a persistence layer

What is scalability?

* Platform growth (quote?)
* Dataset growth ("the ability to store more images than we can now when we run out of space")
* Maintainability ("doesn't necessarily take a team of hundreds of people")

Vertical vs. horizontal scaling. With horizontal scaling, the costs are linear, though it does result in more administration, more power, more plugs, more ports, and more space.
== MySQL Back ends ==

MyISAM

* Very fast SELECTs
* Table level locks
* FULLTEXT indexes -- word phrases, frequency, stuff like that

BDB

* Pretty fast selects
* Better concurrency than MyISAM with page level locks

InnoDB (used by Flickr for pretty much everything)

* ACID - Atomicity, Consistency, Isolation, Durability
* MVCC - Multi Versioned Concurrency Control
* Foreign keys

Heap

* Super fast, but all in memory
* Data not persistent
* Good as a cache with full SQL abilities

== MySQL Replication (lots of reads) ==

Data is literally copied from one to another.

# All writes go to master
# Master writes to a binary log
# Slave replays the log
# Reads come from both master and slave

Typically, webapps have many more selects than inserts and updates. A lot more reading than writing, which master-slave addresses well. It's pretty easy to add multiple slaves to handle ever-increasing reads.

Master <--> master replication

Each master writes for its own set of tables, and each replays the other's binary log. Can be confusing to use and set up. Useful for redundancy, even hot redundancy if you set things up right. Doesn't necessarily help scaling. Only really useful for pairs of machines.

A tree setup lets us have a lot of connections for reading. It adds some more points of failure and makes writes take longer to propagate. Bringing us to...

Replication lag. For example, comments taking a while to show up. There are techniques around this. For example, have the user read from the master for a little while so they can see the most recent stuff. Also affects backups, hot-cold failovers, and so on.

Partial replication lets us have specialized slaves with only certain tables. We can also write to a slave, and that data isn't written to the master. For example, the Flickr search clusters had a slave replicating just one table, in MyISAM for fulltext search.

By the way, choosing which database to read from is done from a PHP array which is shuffled.
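The two read-routing ideas above (pick a read server from a shuffled list, and keep a user on the master for a little while after they write) can be sketched together. This is a minimal illustration in Python rather than Flickr's PHP, and everything named here is an assumption: the `ReplicaRouter` class, the host names and the five-second window are invented for the example.

```python
import random
import time


class ReplicaRouter:
    """Choose a database host per query: master for writes, a shuffled slave for reads."""

    def __init__(self, master, slaves, sticky_seconds=5.0, clock=time.monotonic):
        self.master = master
        self.slaves = list(slaves)
        self.sticky_seconds = sticky_seconds  # assumed upper bound on replication lag
        self.clock = clock
        self._last_write = {}  # user id -> time of that user's most recent write

    def for_write(self, user_id):
        """All writes go to the master; remember when this user last wrote."""
        self._last_write[user_id] = self.clock()
        return self.master

    def for_read(self, user_id):
        """Read from the master shortly after a write, otherwise from a random slave."""
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.sticky_seconds:
            return self.master  # the user's own write may not have replicated yet
        candidates = self.slaves[:]
        random.shuffle(candidates)  # spread reads across the slaves
        return candidates[0] if candidates else self.master


# Hypothetical host names, for illustration only.
router = ReplicaRouter("db-master", ["db-slave-1", "db-slave-2", "db-slave-3"])
print(router.for_read(user_id=42))   # one of the slaves
print(router.for_write(user_id=42))  # db-master
print(router.for_read(user_id=42))   # db-master, until the sticky window passes
```

Other users keep reading from the slaves during that window, so only the person who just posted a comment pays the cost of hitting the master.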
"our database schema is very complicated nowadays because we have lots of features"

Database clustering splits data from tables across different databases. Makes for harder management and a bit more overhead per page, but it's usually not too big a deal for just a handful.

"Always connect on demand" - re: mysql

Federation is hard, and sometimes expensive.

"within yahoo, we have no large commercial apps that run oracle. it's too expensive."

== Denormalization ==

"Normalization is for sissies"

Denormalization is basically a copy of what's in the database already, stored in a different format. Flickr has one normalized schema with denormalized views (extra tables and columns) on top of it. Requires maintaining data consistency in the application. Write the denormalized data whenever you write the normal. Or rebuild the denormalized view from the normal when required.

Example: a normalized contacts table could be used to query the photos table; the denormalized table includes the ids of the five most recent photos. Basically a kind of caching within the database, to get in a single query what would otherwise take many.

Swapping masters = PITA. See slides for steps to do that sort of thing.

== Scaling File IO ==

* SCSI vs. SATA
* Disk caching
* RAID

Ultimately, just add more disks. Plus an overview of various other technologies.

They use RAID 10 on their 1U's and RAID 4 on the larger servers. About 600 TB of raw storage (including RAID overhead).

Storage Master: middleware for storage abstraction.

== Scaling Network IO ==

More switches and routers, and gigabit ethernet is fast.
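To close, a small sketch of the denormalization notes earlier in this part: the "ids of the five most recent photos" example, with the denormalized copy written whenever the normalized data is written. This is an illustration only, in Python with an in-memory SQLite database; the table names, column names and helper functions are assumptions, not Flickr's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- normalized source of truth
    CREATE TABLE photos (id INTEGER PRIMARY KEY, owner_id INTEGER, title TEXT);
    -- denormalized view: one row per owner, ids stored as a comma-separated string
    CREATE TABLE recent_photos (owner_id INTEGER PRIMARY KEY, photo_ids TEXT);
""")


def rebuild_recent(conn, owner_id, limit=5):
    """Rebuild the denormalized row for one owner from the normalized table."""
    rows = conn.execute(
        "SELECT id FROM photos WHERE owner_id = ? ORDER BY id DESC LIMIT ?",
        (owner_id, limit),
    ).fetchall()
    ids = ",".join(str(row[0]) for row in rows)
    conn.execute(
        "INSERT OR REPLACE INTO recent_photos (owner_id, photo_ids) VALUES (?, ?)",
        (owner_id, ids),
    )


def add_photo(conn, owner_id, title):
    """Write the normalized row and its denormalized copy in one transaction."""
    with conn:
        cur = conn.execute(
            "INSERT INTO photos (owner_id, title) VALUES (?, ?)", (owner_id, title)
        )
        rebuild_recent(conn, owner_id)
    return cur.lastrowid


def recent_photo_ids(conn, owner_id):
    """Single-row lookup instead of a sort over the photos table."""
    row = conn.execute(
        "SELECT photo_ids FROM recent_photos WHERE owner_id = ?", (owner_id,)
    ).fetchone()
    return [int(x) for x in row[0].split(",")] if row and row[0] else []


for n in range(7):
    add_photo(conn, 1, "photo %d" % n)
print(recent_photo_ids(conn, 1))  # [7, 6, 5, 4, 3]
```

The application carries the consistency burden here: any other code path that inserts or deletes photos has to call the rebuild step too, which is the trade-off the notes describe.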