Posted: January 23rd, 2012 | Author: Pete Hunt | Filed under: Uncategorized | 4 Comments »
There’s a lot of discussion on HN right now in response to Paul Graham’s call to disrupt Hollywood.
A few commenters hit the nail on the head. With the movie business it’s no longer about distribution. The music business has been disrupted even more: distribution was (sort of) solved in the late 90s, and production is now so cheap and accessible that it’s no longer a bottleneck either.
The problem is filtering. How do I cut through all the crap to find the music I actually want to consume?
Traditionally, it’s been solved by tastemakers like radio DJs and syndicated label-sponsored playlists. Many think that it’ll be disrupted by an Amazon-style personalization system.
I think they’re missing the point. For the average person, entertainment is as much about creating a shared culture as it is about the content itself. Most people want to feel like they’re part of a movement. They want to go to a huge arena filled with people similar to them and enjoy an event together. They want to find common ground with each other and gossip about celebrities in their spare time. Or they want everyone else to do these things, and differentiate themselves by enjoying more obscure artists or genres.
Pandora, perhaps the most well-known music personalization service, realized this a long time ago; that’s why most of their non-niche channels very closely resemble the Billboard charts of their respective genres.
This is why the music industry is consolidated into a few powerful multinational corporations who own or influence the full stack. This consolidation has traditionally been the only way to have enough marketing firepower to create a shared culture around an artist. If you want to disrupt this industry, you need to solve this problem; otherwise you’ll never create a product that fits the masses. Right now, the only things that come close are Pandora and social media like Facebook and YouTube, but a service that deftly walks the line between personalizing for the listener and being opinionated could disrupt this industry in a major way.
Posted: January 11th, 2012 | Author: Pete Hunt | Filed under: Uncategorized | 1 Comment »
Recently the following presentation has been circulating the Python community: http://python-for-humans.heroku.com/#1
I think it hits the nail on the head regarding the usability of Python packages, and towards the end he talks about a lot of places for improvement in the Python world. Specifically, we need to get back to our philosophy of “there should be only one obvious way to do it” and come up with a set of conventions for common tasks. Here’s my take on it.
Packages and distribution/dependency management
pip and requirements.txt are great, but they feel crude. Additionally, it’s well-known best-practice to use virtualenv for all of your projects, but it’s as if it’s a glued-on Python feature (which it is). What we’re really trying to do with virtualenv is specify a small set of required Python packages at specific versions for the package you are working on. There’s no reason why we can’t build a better way of doing this.
I propose adding a __dependencies__ attribute to __init__.py which specifies dependencies and module names, effectively running each Python package in its own pseudo-virtualenv. It’d look something like this:
# __init__.py
# include Django 1.3.1 or later and any version of PyMySQL
__dependencies__ = (('Django>=1.3.1', 'django'), ('PyMySQL', 'pymysql'), ('deb:apache2', None))
Now, any Python file in this package can import modules from “django” and “pymysql” and they’ll go to the right version. Note that we specify a name as the second item of the tuple — this is the name that we’ll use for importing. This allows us to use multiple distributions that have the same package names.
The last one (apache2) is an idea that I’m kicking around. It would be nice if we could specify platform-specific dependencies as well (using a tool like yum or apt-get to resolve them). This part requires more thought.
When importing this package, we have enough information to automatically install dependencies from PyPI (no more python setup.py install or easy_install!), which makes the whole getting started process extremely easy. Additionally, if we try to import a module that isn’t there, we can rig the ImportError to include a message that says “hey, we couldn’t resolve pymysql, maybe you should add the PyMySQL dependency to __init__.py”
You need to be careful with this approach (for example, you could accidentally end up using two different versions of the same package and passing their objects around haphazardly), but I think this is a tractable problem, perhaps through runtime warnings.
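To make the proposal a bit more concrete, here is a minimal sketch of how a loader might read __dependencies__ and produce the friendlier ImportError text. Nothing like this exists in Python today; parse_dependencies and missing_import_hint are hypothetical names invented for illustration, and the requirement parsing is deliberately naive.

```python
import re

# Hypothetical declaration in the proposed format:
# (requirement string, import name or None for non-Python dependencies).
__dependencies__ = (
    ('Django>=1.3.1', 'django'),
    ('PyMySQL', 'pymysql'),
    ('deb:apache2', None),
)


def parse_dependencies(dependencies):
    """Split declarations into Python and platform-specific dependencies."""
    python_deps = {}    # import name -> (distribution name, version spec or None)
    platform_deps = []  # (tool prefix, package name), e.g. ('deb', 'apache2')
    for requirement, import_name in dependencies:
        if ':' in requirement:
            prefix, _, package = requirement.partition(':')
            platform_deps.append((prefix, package))
            continue
        # Naive split: a distribution name followed by an optional version spec.
        match = re.match(r'^([A-Za-z0-9_.-]+)\s*(.*)$', requirement)
        if match is None:
            raise ValueError('unrecognized requirement: %r' % requirement)
        python_deps[import_name] = (match.group(1), match.group(2) or None)
    return python_deps, platform_deps


def missing_import_hint(module_name, python_deps):
    """Build the kind of ImportError message suggested in the proposal."""
    if module_name in python_deps:
        distribution, _ = python_deps[module_name]
        return "couldn't resolve %s; is %s installed?" % (module_name, distribution)
    return ("couldn't resolve %s; maybe you should add a dependency "
            "for it to __dependencies__ in __init__.py" % module_name)


python_deps, platform_deps = parse_dependencies(__dependencies__)
```

A real implementation would hook into the import machinery and call out to an installer; this only shows that the tuple format carries enough information to do so.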
I also think setup metadata belongs in the package rather than in a setup.py script. We could accomplish this with a __meta__ attribute that holds the package metadata.
“Blessed” packages
Since the above system makes it so easy to install any package from PyPI, we should have a drop-dead, obvious place that provides concrete answers for questions like “what is the best process manipulation library” and “what is the best HTTP library.” This would ideally replace the notion of an included standard library.
You could also imagine a package or author “karma” system to decide how good a package is. This could be based on test coverage, lint information and user reviews.
Configuration management
There are infinite ways people create configs for their Python packages, but there isn’t a clear standard for this yet. We need a standard way to do this, while at the same time maintaining flexibility.
I propose adding a __configure__() method to every package which takes an arbitrary set of parameters. Through here you’d pass all of your configuration data for a given package. Now, most of the time you won’t need it since packages just contain code that you call without configuration, but many projects do need configuration, like Django.
Often, we’ll want to create configs specific to an instance of our project. Our project will have its own __configure__() method which may or may not call __configure__() on its dependencies. But we’ll need to put the initial configuration data somewhere. I propose a file called __configure__.py which is auto-loaded when running a package. Its sole purpose is to call __configure__() on the correct packages.
If you don’t want to do your configs in Python, that’s fine; just call __configure__() with the path to your custom config file format.
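Here is a rough sketch of how that could fit together. The two "packages" are simulated with bare module objects so the example runs in a single file; dbpackage, mywebapp and their settings are made-up names, and __configure__() is the proposed hook, not an existing Python feature.

```python
import types

# Stand-in for a dependency, such as a database package (hypothetical).
dbpackage = types.ModuleType('dbpackage')
dbpackage.settings = {}


def _configure_db(**params):
    dbpackage.settings.update(params)


dbpackage.__configure__ = _configure_db

# Stand-in for our own project (hypothetical).
mywebapp = types.ModuleType('mywebapp')
mywebapp.settings = {}


def _configure_webapp(debug=False, database=None):
    mywebapp.settings['debug'] = debug
    # A project may forward part of its configuration to its dependencies.
    if database is not None:
        dbpackage.__configure__(**database)


mywebapp.__configure__ = _configure_webapp

# What a __configure__.py in the project directory might contain:
mywebapp.__configure__(debug=True,
                       database={'host': 'localhost', 'name': 'blog'})
```

The point of the design is that the instance-specific file only ever talks to the top-level package, and each package decides what to pass down.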
You’ll notice that the system I’ve just described is almost a complete rip-off of Django. They have django.conf.settings.configure() and settings.py files. I’m proposing a generalized version of this.
Providing services
Check this out: http://blog.ianbicking.org/2011/03/31/python-webapp-package/. The key takeaway for me is: “an application requests services, and the container tells the application where those services are.” We already have a way for configuring those services, we just need a way to provide them. The canonical examples for this are DB-API modules and WSGI applications.
I propose a new package method, called __service__(), which takes a service name and a set of unspecified arguments and returns an object implementing this service. You can imagine how great it would be to just __configure__() your web app in __configure__.py in the current directory and then call __service__('wsgi') to get the WSGI app.
We would come up with a standard set of service names as well (like wsgi, db, unittests etc), and standard __configure__() parameters for these services.
System integration
Last, I’d like to briefly touch on system integration. I find myself dealing with roughly three pain points when doing platform-specific integration:
- Where do my command-line scripts go and how do I ensure they’re set up correctly?
- My package needs a cron job. Where do I set this up? How do I do it in a platform-agnostic way?
- Ditto daemons
I think providing ‘scripts’, ‘cron’, and ‘daemon’ services as described above would be a natural way to provide this functionality. The API for this needs to be fleshed out more, but I would love to just be able to do this for a package:
python --install --package=mywebapp --configuration=__configure__.py
Or even:
python --install --distribution="MyWebApp==1.0.1" --configuration=/home/web-user/__configure__.py
and it would call __service__() appropriately, install the scripts to /usr/local/bin and /etc/init.d/ and ensure that __configure__.py was being used.
In closing
So that’s it. I may play around with building a prototype when work slows down. If you’re interested in building this, let me know; I think it could be done in 1-2 day sprints.
Posted: December 27th, 2011 | Author: Pete Hunt | Filed under: Uncategorized | No Comments »
I made this when I was bored during the holidays. Maybe someone out there finds it funny.
http://hipstergrammers.tumblr.com/
Posted: November 8th, 2011 | Author: Pete Hunt | Filed under: Uncategorized | No Comments »
I just released PyMySQL 0.5. This version should be much more stable than previous versions as I’ve fixed a lot of the unicode handling.
As always, we support an extremely broad range of Python versions, including the 2.x and 3.x series. The 2.x version is on PyPI as PyMySQL; the 3.x version is listed as PyMySQL3.
Check it out at http://www.pymysql.org/
Posted: October 18th, 2011 | Author: Pete Hunt | Filed under: Uncategorized | 12 Comments »
Over a year ago when I was a lowly graduate student I wrote a blog post about web templating engines. I’ve recently seen some posts on HN that are related to this topic and I’d like to clarify and expand on my position on this.
I’ll come right out and say it: templating languages have no valid use cases. Well, except for “platform X doesn’t have any better tools,” which is a lame excuse.
Most web templating languages are designed to be accessible to non-programmers or something. I think the rationale is to allow designers, who are presumed to be inferior engineers, to write the templates, and to let the engineers just write the backend code.
This is a mythical use case and it has never happened. Either your designer is not an engineer, in which case you need to have engineers adapt the design to the project you’re working on, or your designer has engineering chops, in which case you should trust them to work with real tools (provided that you have good ones).
At this point I’m sure someone will point out that ERB and its kin are designed for engineers. Yeah, maybe, but it’s really just a crude way of generating text that isn’t much better than string concatenation.
So now if you’ve committed to using one of these templating engines, you have condemned your front-end engineers to use a language that throws out most of the lessons we’ve learned about programming language design, best practices, and encapsulation over the past 20 years.
Damn. So what are we supposed to do? There isn’t a one-size-fits-all solution; here are two examples that I think fit a lot of use cases.
Use case #1: you are prototyping and you don’t have a huge front-end engineering staff
Chances are your designer hands you a bunch of HTML, CSS, and images, and you need to turn that into a dynamic template as quickly as possible since you’re strapped for time, engineering resources, or both. This is the case when you use something like PyQuery (or another user-friendly DOM manipulation library) to programmatically manipulate the markup. Your designer doesn’t need to know anything, and your engineers get a tool that is actually good at its job. They get to use a real programming language rather than half-baked templating language constructs.
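To show the shape of this approach without requiring any third-party install, here is a small sketch that swaps PyQuery for the standard library's xml.etree.ElementTree. The markup, the id values and render_post are all invented for the example, and ElementTree needs the designer's HTML to be well-formed XML, which PyQuery is more forgiving about.

```python
import xml.etree.ElementTree as ET

# Static markup as it might arrive from a designer (placeholder content included).
DESIGN = """<div class="post">
  <h1 id="title">Placeholder title</h1>
  <ul id="comments"><li>Placeholder comment</li></ul>
</div>"""


def render_post(title, comments):
    root = ET.fromstring(DESIGN)
    root.find(".//*[@id='title']").text = title
    comment_list = root.find(".//*[@id='comments']")
    # Drop the designer's placeholder items, then add the real ones.
    for placeholder in list(comment_list):
        comment_list.remove(placeholder)
    for comment in comments:
        ET.SubElement(comment_list, 'li').text = comment
    # Serialization escapes text, so user input cannot inject markup.
    return ET.tostring(root, encoding='unicode')


html = render_post('Hello & welcome', ['First!', '<script>alert(1)</script>'])
```

The designer's file stays plain HTML with no template syntax in it, and the dynamic parts are ordinary Python.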
Use case #2: you are building at scale
Facebook has a ton of front-end code. We use XHP to write it. Long story short, markup becomes a first-class expression in PHP and you can leverage all of the battle-tested object-oriented techniques to manage it. We build XHP components in a modular fashion which allows incredible levels of code reuse, protection from XSS attacks, and easy cross-team collaboration. Working with XHP vs something like Smarty (or even worse, vanilla PHP) is like building a huge project in Python vs C. Your level of abstraction is much higher, it’s safer, and you can move much faster.
The important thing to take away here is that we need to stop thinking of this as generating raw text. Instead, we need to understand that we’re working with markup that has semantic meaning. This lets us tap into the power of abstraction and encapsulation to make our job easier.
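This is not XHP, but a rough Python analogy of the same idea may help: markup is built from objects instead of concatenated strings, so escaping happens in one place and components compose like any other code. Element and user_badge are hypothetical names made up for this sketch.

```python
from html import escape


class Element:
    """A tiny markup node: a tag, attributes, and children (nodes or strings)."""

    def __init__(self, tag, *children, **attrs):
        self.tag = tag
        self.children = children
        self.attrs = attrs

    def render(self):
        attrs = ''.join(
            ' %s="%s"' % (name, escape(str(value), quote=True))
            for name, value in sorted(self.attrs.items()))
        # Plain strings are always escaped; only Element children emit tags.
        inner = ''.join(
            child.render() if isinstance(child, Element) else escape(str(child))
            for child in self.children)
        return '<%s%s>%s</%s>' % (self.tag, attrs, inner, self.tag)


def user_badge(name, profile_url):
    """A reusable component built from smaller elements."""
    return Element('span', Element('a', name, href=profile_url), title='user')


page = Element('div', user_badge('<b>Pete</b>', '/users/pete?a=1&b=2'), 'says hi')
html = page.render()
```

Because a string can never become a tag by accident, the XSS protection falls out of the data model instead of depending on every template author remembering to escape.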
It just pains me to see all of these projects jump through hoops to make their projects fit into their templating engine’s idea of what the world should be like. Why not let your engineers leverage every tool in their toolbox to build scalable front-end code, rather than stick with these crude tools and perpetuate this myth of non-programmers building front-end templates?
Posted: July 28th, 2011 | Author: Pete Hunt | Filed under: Uncategorized | Tags: github, pymysql | 1 Comment »
PyMySQL, a pure-Python drop-in replacement for MySQLdb that works across all major Python implementations including PyPy, IronPython, Jython and Python 3, has moved to GitHub. The old domain name should be redirecting there soon (http://www.pymysql.org/), or you can just check out http://www.github.com/petehunt/pymysql. With GitHub it will be much easier to manage community contributions, and after a few months at Facebook I’ve entirely forgotten how to use SVN.
Posted: April 8th, 2011 | Author: Pete Hunt | Filed under: Uncategorized | 2 Comments »
I’ve been meaning to write a blog post on Node.js for a long time, but it wasn’t until now that I found an excuse to make myself sit down and write it. This afternoon the author of Node.js, Ryan Dahl, gave a talk at Facebook. It was pretty well attended, with around half of the audience having used Node.js beyond “hello world.”
Pythonistas could think of Node.js as the JavaScript equivalent of CPython, Twisted, and setuptools packaged together in a single binary residing server-side. It includes a package repository (npm), and is all about event-driven I/O. That means that every time you make a call that would block, you pass a callback to it, kind of like how Twisted’s Deferreds work.
Anyway, going into the presentation I was cautiously positive about the technology and had a pretty negative opinion about the Node.js community. I’m an avid reader of Hacker News and the Node.js spam is almost unbearable. But to his credit, Ryan acknowledged that in his presentation and on his Twitter.
What strikes me about Node.js is that it’s not particularly innovative. On the surface, it’s just a reimagined Twisted Python – but Twisted’s Deferred functionality is much cleaner in terms of error handling with errbacks and flow control with DeferredList. The syntactic advantages JavaScript has with anonymous functions are, to a degree, mitigated with inlineCallbacks() (the developers of Node.js had no solution to nested “callback hell”).
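For readers who haven’t used Twisted, here is a toy illustration of the callback/errback chaining I mean. MiniDeferred is a made-up stand-in written for this post, not Twisted’s actual Deferred class, and it leaves out almost everything a real one does.

```python
class MiniDeferred:
    """A toy stand-in for the Deferred idea; not Twisted's real class."""

    def __init__(self):
        self._handlers = []   # (callback, errback) pairs, run in order
        self._fired = False
        self._failed = False
        self._result = None

    def add_callbacks(self, callback, errback=None):
        self._handlers.append((callback, errback))
        if self._fired:
            self._run()
        return self

    def fire(self, result):
        """Deliver a successful result (an event loop would call this)."""
        self._fired, self._failed, self._result = True, False, result
        self._run()

    def fail(self, error):
        """Deliver a failure instead of a result."""
        self._fired, self._failed, self._result = True, True, error
        self._run()

    def _run(self):
        while self._handlers:
            callback, errback = self._handlers.pop(0)
            handler = errback if self._failed else callback
            if handler is None:
                continue  # nothing registered for this state; pass it along
            try:
                self._result = handler(self._result)
                self._failed = False
            except Exception as exc:
                self._result, self._failed = exc, True


log = []
d = MiniDeferred()
d.add_callbacks(lambda body: body.upper())
d.add_callbacks(lambda body: int(body))   # raises ValueError for 'HELLO'
d.add_callbacks(log.append,
                lambda err: log.append('recovered from %s' % type(err).__name__))
d.fire('hello')
```

An exception raised in any callback skips ahead to the next errback, much like a try/except would in blocking code, and that is the part that nested anonymous callbacks don’t give you for free.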
With that said, I’m totally stoked about Node.js, and I think it’s going to explode in growth. JavaScript has matured so much in recent years and there is a powerful set of best practices that turn it into a rather nice language, especially when coupled with Node.js’s implementation of CommonJS.
The design of Node.js is beautiful, largely because of JavaScript’s anonymous functions. JavaScript just seems like a better fit for this sort of framework than Python is. Additionally, one of the lesser-advertised features of Node.js is that it exposes a lot of V8’s internals to the JavaScript side (i.e. raw I/O buffers instead of strings), which results in awesome performance. Before the talk, I had no idea that Node.js was tied to V8 for any particular reason; now I know.
There’s also a ton of JavaScript developers who could be enlisted on the server-side and develop the client and the server without switching mental gears between languages. During the talk, Ryan cited a bunch of interesting npm packages, and apparently the library is extensive.
And finally, Node.js can do things that Twisted can’t – it explicitly forbids any blocking calls, which makes it a lot more difficult for developers to accidentally block the main thread. All of the Node.js packages are designed with this in mind. In contrast, the Python world needs all dependencies to expose a nonblocking API. Because of this, I think Node.js could revolutionize parallel I/O in a similar way that garbage collection revolutionized memory management.
As for me, I’m really excited. I plan on using Node.js for “real projects” in the future, but I’m holding out for V8 to support “use strict”. I am growing to like JavaScript, but until “use strict” comes out with its support for strong dynamic typing and lack of semicolon insertion, I’m going to stick with Python for my real projects.
Posted: February 4th, 2011 | Author: Pete Hunt | Filed under: Uncategorized | 11 Comments »
Hey everyone -
I am currently trying to find a solution for managing a large number of EC2 instances via the shell. For example, if I’m experimenting with running some sort of operation on the cloud, I often find myself trying to run “apt-get install” on multiple machines. Because I have Open MPI installed on these boxes, for now I’ve just been using mpiexec, but there has to be a better way. Is there some sort of SSH client that can manage a large number of concurrent nodes? I’ve googled around and found some half-baked projects, but I wanted to know if there was something that people were using in their day-to-day work.
Posted: December 27th, 2010 | Author: Pete Hunt | Filed under: Uncategorized | 4 Comments »
I’m proud to announce the immediate availability of PyMySQL 0.4. New features/bugfixes since the previous release:
- Implementation of SSL support
- Implementation of kill()
- Cleaned up charset functionality
- Fixed BIT type handling
- Connections raise exceptions after they are close()’d
- Full Py3k support
This release passes tests on CPython 2.4+, CPython 3.1.2, Jython 2.5.2+, and PyPy 1.4. There is one unicode-related case that didn’t pass on IronPython 2.6 (mono), but it’s close enough to be usable.
Also, if anyone knows how to register side-by-side source distributions on PyPI for Python 2.x and Python 3.x, please let me know! The 2.x version is on PyPI and Google Code; for Python 3, the source distribution is available on Google Code.
EDIT: The Python 3 version is now on PyPI as “PyMySQL3”. For Python 2, it is simply listed as PyMySQL. Thanks, Marius!
Go get it! http://www.pymysql.org/
Posted: November 16th, 2010 | Author: Pete Hunt | Filed under: Uncategorized | 2 Comments »
I’ve been really, really busy lately, hence the lack of updates. I’m going to be finishing up grad school this coming December and I’ve just finished the very time-consuming process of finding a job.
Since this blog is syndicated on Planet Python, I’d also like to recommend that Python developers should really, really apply to Yelp. They are a great group of hackers, offer a cushy work environment, are extremely Python-centric and they have a very popular product with lots of data to play around with.
In other news, I’m working on finishing up an M.Eng. project on scientific computing using EC2. Hopefully by the end of the semester we will have something useful that we can share with the community. One of the interesting things that we’ve found is that latency on EC2 isn’t particularly bad, but it’s very variable. That means that if you have a latency-bound problem (such as a low-CPU bulk synchronous parallel, or BSP, problem), EC2 is a particularly bad platform to work on. We’ve been working on an implementation of OpenMPI to run efficiently on EC2.
For those not in the know, MPI is a message passing interface (of which OpenMPI is an implementation) written primarily for distributed scientific computing applications. The basic idea is that there are multiple processes identified by a rank (a monotonically increasing integer identifier) which communicate using messages. A BSP computation cannot move to the next time step until all processes have finished communicating with each other. On EC2, this means that your average time step will be bound by the high end of the latency distribution rather than its mean.
By replicating each process across multiple independent nodes, one can reduce this latency. In a 5-process computation, for example, we create 3 replicas of each process for a total of 15 processes. Messages addressed to a given rank are sent to all three of its replicas; the response from the first replica to reply is used and the others are dropped. Obviously this model assumes some level of determinism in the processes, but when that assumption holds, we’ve seen an order-of-magnitude increase in performance on certain benchmarks. We are very excited.
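The intuition can be checked with a small simulation. This is only a sketch: the latency distribution below (roughly 1 ms with an occasional 20 ms outlier) is invented for illustration and is not measured EC2 data, and simulate is a hypothetical helper, not part of our project.

```python
import random


def simulate(steps=1000, processes=5, replicas=3, seed=0):
    """Compare mean BSP step time with and without replication."""
    rng = random.Random(seed)
    plain_total = 0.0
    replicated_total = 0.0
    for _ in range(steps):
        # One latency sample per replica of each process (made-up distribution:
        # usually around 1 ms, occasionally a slow outlier).
        samples = [
            [rng.expovariate(1.0) + (20.0 if rng.random() < 0.05 else 0.0)
             for _ in range(replicas)]
            for _ in range(processes)
        ]
        # Without replication, the step waits for the slowest process.
        plain_total += max(row[0] for row in samples)
        # With replication, each rank is as fast as its fastest replica,
        # but the step still waits for the slowest rank.
        replicated_total += max(min(row) for row in samples)
    return plain_total / steps, replicated_total / steps


plain_mean, replicated_mean = simulate()
```

The step time is a max over ranks in both cases; replication helps because it turns each rank’s latency into a min over its replicas, which cuts off the tail that the max would otherwise pick up.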
Well this is a bit of a rambling post and I’m sure it didn’t interest very many people, but I wanted to give an update of where I’ve been for the past few months. It’s going to be time to close some PyMySQL tickets soon, too.