Look, I’ll admit, I’m not a fan of the Rolling Stones and I am not a fan of the cloud.
OK, put down the torches and pitchforks, and give me a chance to explain.
I deal with data, and trying to get data off the cloud so you can do analysis can be harder than watching Mick Jagger dance. It’s awkward, it hurts, and often it just can’t be done (because there are no APIs or only a limited subset of data is available through the APIs).
So when I hear that a business group wants to move to a cloud-based data management platform, I intercede and explain the benefits of managing the platform on-premise.
When applications are on-premise:
- you have control of the platform
- you have control of the database
- you can access the data as needed and do with it as you want
- your data tends to be more protected/less vulnerable
Why would you want to give someone else control and limit your access to valuable data?!
As such, I can appreciate one Rolling Stones song: Hey! You! Get Off of My Cloud!
However, there are many good reasons to manage your data in the cloud to meet IT and business needs. And, when I went to AWS (Amazon Web Services) Summit 2017 in San Francisco, dare I say, it opened my mind to even more reasons for utilizing cloud platforms. The advancements in technology and improvements in services continue to make AWS a compelling solution to meet many requirements.
I like it, I love it, I want some more of it. Let me explain.
Cloud Database Services
You want to perform analysis on your data, and perhaps even join it with data from other sources, but that requires having your own database. And managing your own databases can be costly.
You have to procure the needed hardware; manage the hardware (disk or memory failures); set up production, test, and development environments; have disaster recovery processes; implement and continually test backups, snapshots and recovery; test and implement OS and database patches; and so on.
Well, AWS offers a plethora of cloud services, one of which is fully managed databases, and that will get you started on the road to data management. The service includes:
- Hosting: AWS is fully responsible for the hardware, operating system, and database software (upgrades and patches)
- Security and Compliance:
  - Network isolation
  - Database instance IP firewall protection
  - AWS IAM-based resource-level permission controls
  - Encryption at rest
- High Availability
- Scale up or down as needed (cost will adjust accordingly)
- Backups and Snapshots
- Multi-AZ: hardware problems trigger automatic failover to a standby instance in another Availability Zone (configured when you set up the instance)
AWS takes care of the grunt work, and at the end of the day, you are responsible for schema design, query construction, query optimization, and of course getting data into the database via applications or ETL.
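To make that concrete, here's a rough sketch of what provisioning one of these managed databases can look like through the AWS SDK for Python (boto3). Every name, size, and credential below is an illustrative placeholder, and in real life you'd pull the password from a secrets store rather than hard-coding it.

```python
import boto3

# Assumes AWS credentials and a default region are already configured.
rds = boto3.client("rds")

# Create a small, encrypted, Multi-AZ PostgreSQL instance.
rds.create_db_instance(
    DBInstanceIdentifier="analytics-db",
    Engine="postgres",
    DBInstanceClass="db.t2.small",       # scale up or down later as needed
    AllocatedStorage=100,                # GB
    MasterUsername="admin_user",
    MasterUserPassword="change-me-now",  # use a secrets store in practice
    MultiAZ=True,                        # automatic failover to a standby AZ
    StorageEncrypted=True,               # encryption at rest
    BackupRetentionPeriod=7,             # automated backups, kept for 7 days
)

# Wait until the instance is available, then grab its endpoint.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="analytics-db")
endpoint = rds.describe_db_instances(DBInstanceIdentifier="analytics-db")[
    "DBInstances"][0]["Endpoint"]["Address"]
print(endpoint)
```

That handful of parameters covers hardware, encryption, backups, and failover; compare that to racking servers and installing the OS and database yourself.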
Cloud databases offered by AWS are as follows:
- Oracle – requires license
- SQL Server – requires license
- MySQL – open source
- PostgreSQL – open source
- MariaDB – open source
- Amazon Aurora (MySQL and PostgreSQL compatible) – no separate license required
- Redshift (PostgreSQL-compatible columnar data warehouse) – no separate license required
Once you have your cloud database, you can work on extracting data from your various sources, loading it into your database, and transforming the data as needed. The end result is data from multiple sources that you can join together for analysis and reporting with your favorite tool, such as Tableau. That gives you deeper insight into your data and lets you create reports and executive dashboards that support informed business decisions.
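To give a flavor of one step in such an ETL flow, here's a minimal sketch of loading an extract that has already landed in S3 into a Redshift table with a COPY command via psycopg2. The cluster endpoint, table, bucket, and IAM role are all placeholders.

```python
import psycopg2

# Placeholder connection details for a Redshift cluster.
conn = psycopg2.connect(
    host="analytics-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="change-me",
)

# COPY is Redshift's bulk-load path: it reads files straight from S3
# using an IAM role attached to the cluster, in parallel across nodes.
copy_sql = """
    COPY web_analytics.sessions
    FROM 's3://my-etl-bucket/google_analytics/2017-04-19/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # loads every file under the S3 prefix
conn.close()
```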
Working with a recent client, we created a Redshift database and an ETL process to store their Google Analytics, DoubleClick, Dstillery, MediaMath, AdWords, and Bing data. They are now able to join the data from these different data sources, allowing them to easily report across the entire spectrum of their digital data, gleaning observations not previously possible.
But what if you already have a database and you want to migrate it to AWS?
Amazon has you covered with their Database Migration Service (DMS), which lets you easily and securely migrate your database (on-premise or cloud) to AWS, or vice versa, at no cost. It even lets you migrate from one database platform (say, Oracle) to another (Aurora). Based on the live walkthrough I saw, it takes about 15 minutes to set up a migration.
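For a rough sense of what that 15-minute setup covers, a DMS migration boils down to four pieces: a replication instance, a source endpoint, a target endpoint, and a replication task. Here's a hedged boto3 sketch of those calls; every identifier, host, credential, and ARN below is a placeholder.

```python
import json
import boto3

dms = boto3.client("dms")

# 1. A replication instance does the actual data movement (placeholder sizing).
instance = dms.create_replication_instance(
    ReplicationInstanceIdentifier="migration-instance",
    ReplicationInstanceClass="dms.t2.medium",
)
dms.get_waiter("replication_instance_available").wait(
    Filters=[{"Name": "replication-instance-id", "Values": ["migration-instance"]}]
)

# 2. Endpoints describe where the data comes from and where it goes.
source = dms.create_endpoint(
    EndpointIdentifier="oracle-source",
    EndpointType="source",
    EngineName="oracle",
    ServerName="onprem-db.example.com",
    Port=1521,
    Username="migrator",
    Password="change-me",
    DatabaseName="ORCL",
)
target = dms.create_endpoint(
    EndpointIdentifier="aurora-target",
    EndpointType="target",
    EngineName="aurora",
    ServerName="aurora-cluster.example.us-east-1.rds.amazonaws.com",
    Port=3306,
    Username="admin",
    Password="change-me",
)

# 3. The task ties everything together; "full-load-and-cdc" keeps replicating
#    ongoing changes so you can cut over with minimal downtime.
dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-aurora",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn=instance["ReplicationInstance"]["ReplicationInstanceArn"],
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```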
I love that they offer open source databases so you don't have to deal with licensing or the dreaded licensing audit. Think about it: small and medium-sized businesses that would have been hard pressed to set up a proper database infrastructure can now have the same secure, fully supported, highly available database environment as the big players, for as little as a couple thousand dollars a year (assuming open source engines and the smallest sizing).
Cloud Platforms Offer Fully Managed Servers
AWS also offers servers through its Elastic Compute Cloud (EC2) service. What does this mean?
- Computing is on demand: pay by the hour, with no long-term commitments
- Resizable: increase or decrease performance and memory as needed, almost seamlessly
- Flexibility: low-cost options for development and test
- No up-front investment: if you buy your own hardware, you are committed to what you purchased
- Fully managed: hardware issues are handled as part of the service; most of the time you won't even notice that a disk failed or memory had to be replaced
- Dedicated options: Dedicated Instances and Dedicated Hosts are available if you don't want to share underlying hardware with other customers
Running your own servers requires full-time staff and a dedicated facility with proper cooling, power management, and more. Capacity planning is also complex: you can easily over- or under-size your computing needs. AWS takes care of all of the above and gives you the flexibility to resize your computing with a click of a mouse.
In addition, they even implemented ways to help you save money!
How many companies do you know that work to creatively provide solutions to lower your costs?
Examples:
- Not using your development or testing instance? Turn it off until it is needed; you won't pay for it, and you can turn it back on as soon as you need it again. You can't do this if you buy your own servers (well, you can turn them off, but you still paid for them).
- Monitor your computing. If you determine you have excess capacity, downsize your server and you'll pay less.
- Or monitor your computing and determine usage trends. If you find that usage is lower from 5pm to 8am, and even lower on weekends, set up a scheduled script to resize your servers during those time frames and bring them back up to the size needed for peak times (see the sketch after this list). You'll pay less.
- Use spot pricing. AWS has unused compute capacity, so it auctions off that capacity at a steep discount. Be aware that if you don't bid high enough you could wind up without the full capacity you need, but this is a slick option.
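Here's the sketch I promised for the scheduled resize idea. The resize itself is just a handful of EC2 API calls through boto3; you'd run something like this from a cron job (or a scheduled Lambda) in the evening and run the reverse in the morning. The instance ID and instance types are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical ETL/app server

def resize(instance_id: str, instance_type: str) -> None:
    """Stop an instance, change its type, and start it again."""
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # The instance type can only be changed while the instance is stopped.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": instance_type},
    )

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# Evening run: drop to a small instance for the quiet hours.
resize(INSTANCE_ID, "t2.small")
# Morning run (a separate scheduled job): back up to peak capacity.
# resize(INSTANCE_ID, "m4.xlarge")
```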
AWS shared a very interesting customer story. A biotech company needed to analyze its data to identify over 200 chemical compounds. They worked with their internal IT team to see what it would take: the research determined it would be a $20 million project that would take two years to staff, build the needed data center infrastructure, and procure and set up the hardware.
The company went to AWS instead, and for a little over $4,000 they were able to set up servers with the necessary computing power (using spot pricing) and got results from their data within a week. When they were done, they turned it all off.
Nice savings, wouldn’t you say?
Serverless Computing
I also learned about a service called Lambda, which is serverless computing. This means you can build applications that run without your having to provision a server (an EC2 instance) to run them. Instead, the application runs on compute that AWS allocates on demand, and you pay only for the processing you actually use.
Think about it: you don't need to spend money keeping an EC2 instance available around the clock just in case the application runs. When the application is invoked, AWS allocates the needed computing.
I envision changing our ETL processes to use Lambda. Just recently we had to upgrade our EC2 instance to handle some large files that are part of a daily ETL process loading a Redshift database. We are paying for that larger EC2 instance even though we only use it for ETL during certain hours of the day.
Instead, we can use Lambda and shut down our EC2 instance. Money saved.
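As a hedged sketch of what that could look like: a Lambda function subscribed to S3 "object created" events that runs the Redshift COPY for each new extract, so no EC2 instance has to be running at all. The table, IAM role, and environment variable names are placeholders, and the psycopg2 driver would need to be packaged with the function.

```python
import os
import urllib.parse
import psycopg2

def handler(event, context):
    """Triggered by an S3 'object created' event; loads each new file into Redshift."""
    # Cluster details come from the function's environment variables (placeholders).
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )

    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        copy_sql = f"""
            COPY web_analytics.sessions
            FROM 's3://{bucket}/{key}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            FORMAT AS CSV
            IGNOREHEADER 1;
        """
        with conn, conn.cursor() as cur:
            cur.execute(copy_sql)  # compute is billed only while this runs

    conn.close()
    return {"files_loaded": len(event["Records"])}
```

One caveat: Lambda functions have a hard execution time limit, so very large loads may still be better suited to a short-lived EC2 instance.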
Bonus Material
The Amazon Aurora database looks like a winner. As mentioned, it is MySQL and PostgreSQL compatible, but AWS has modified it to improve performance and to add a great deal of functionality. They compare it to Oracle at one-tenth the cost.
Aurora is integrated with the rest of AWS: you can invoke Lambda functions, load data directly from S3, store snapshots and backups on S3, and use AWS IAM roles to manage database access, among other things. It is also compatible with tools like Tableau.
You may ask: since MySQL and PostgreSQL are open source, will all of these improvements be contributed back to those projects? The answer is no.
The changes are so tightly integrated with AWS (and that integration is what enables many of the improvements) that it isn't practical to contribute them back to the open source projects. This leads me to believe that, over time, you'll see Aurora diverge from MySQL and PostgreSQL (just as Redshift has diverged from PostgreSQL). Not necessarily bad, but something to be aware of.
As for Redshift, AWS has recently implemented, or will soon implement, the following improvements:
- New data type TIMESTAMPTZ – timestamp with time zone
- New encoding type Zstandard (ZSTD) – AWS reports compression improvements of up to 30% compared to LZO, which is impressive
- IAM authentication – map Redshift users to IAM users (allowing single sign-on)
- Auto vacuum – reclaims space and sorts when Redshift clusters are idle
- Redshift Spectrum – run SQL queries directly against data in S3 (files can be CSV, TSV, Parquet, Sequence, RCFile, or JSON). This lets you query data in Redshift and join it with files on S3 (files you don't have to load or transform), using the same SQL syntax you use in Redshift (see the sketch after this list).
- They gave an example of running a query that joined data in Redshift to files on S3 totaling an exabyte in size, and it returned results in under 3 minutes.
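Here's the Spectrum sketch referenced in the list above: register an external schema backed by the data catalog, define an external table over raw files sitting in S3, and then join it to a regular Redshift table with ordinary SQL. All names, the bucket, and the IAM role are placeholders.

```python
import psycopg2

# Placeholder connection details for a Redshift cluster.
conn = psycopg2.connect(
    host="analytics-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="change-me",
)
conn.autocommit = True  # external DDL can't run inside a transaction block

ddl_statements = [
    # External schema backed by the Athena/Glue data catalog.
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG
    DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """,
    # External table over raw CSV files that stay in S3 -- nothing is loaded.
    """
    CREATE EXTERNAL TABLE spectrum.ad_impressions (
        campaign   VARCHAR(128),
        user_id    VARCHAR(64),
        cost       DECIMAL(10,4),
        event_date DATE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://my-etl-bucket/ad_impressions/';
    """,
]

# Join the S3-backed table to a table that lives in Redshift, in one query.
report_sql = """
    SELECT i.campaign, SUM(i.cost) AS spend, COUNT(s.session_id) AS sessions
    FROM spectrum.ad_impressions AS i
    JOIN web_analytics.sessions AS s ON s.user_id = i.user_id
    GROUP BY i.campaign;
"""

cur = conn.cursor()
for stmt in ddl_statements:
    cur.execute(stmt)
cur.execute(report_sql)
print(cur.fetchall())
conn.close()
```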
Final Thought
I'm reformed. I can see the value of cloud data management. It isn't always the right solution, but there are many cases where cloud computing provides huge cost savings and advantages over on-premise. You'll need to evaluate each scenario and determine what's best.
| Considerations | Database | EC2 | Lambda |
|---|---|---|---|
| Need servers but don't want to purchase and maintain hardware | Y | Y | Y |
| Don't have the headcount or skill set to manage servers, hardware, and software patches/upgrades | Y | Y | Y |
| Flexible computing needs, with the ability to increase or decrease based on need | Y | Y | Y |
| Ability to save money by spinning instances up and down as needed (e.g., test and QA systems) | Y | Y | |
| Don't have a dedicated DBA to manage administrative functions such as backups, snapshots, etc. | Y | | |
| Need a highly available, secure, multi-zone failover system but don't have the time/resources to implement it | Y | Y | |
| Don't have the resources to run an application, or don't want to purchase the hardware required | | | Y |
I think the biggest win is for small to medium-sized companies, which can now get a completely compliant, fully supported IT environment for a fraction of the cost of implementing it on their own, freeing up resources and money to focus on other opportunities, like deeper analysis that leads to customer insights and product/service innovations.
So, I request that the Rolling Stones revise their song to “Hey! You! Get Onto the Cloud!” But I’m still not a fan of the Rolling Stones; frankly, I can’t get no satisfaction.