Friday, 8 July 2016

Learning Scrapy: How to write a book about your favourite Python framework

I gave a talk at the 24th PyData London Meetup on "Learning Scrapy: How to write a book about your favourite Python framework". Enjoy:





> Now I hand over to Dimitrios. Dimitrios, what are you speaking on? Writing a book on Scrapy, right?

Learning Scrapy, exactly!

> Yeah, perfect! Please welcome Dimitrios...

Hello! My name is Dimitrios Kouzis-Loukas and I'm going to talk a little bit about my experience writing this book "Learning Scrapy" and, before we start, I would like to ask...

I think this is a silly question, but is there anybody here who is using open source software? Nobody?

Oh my God, yeah - I think I'm in the right place!

So, is there anybody here who has made any contribution, a small one, a patch, something, to open source software?

Yeah, wow! Amazing!

Is there anybody here who has written some technical text like a blog post, about, you know, their favourite something, blog post, article, master thesis, PhD?

Ah, that's quite good!

And, how many of you have really thought about writing a book? Secretly or not secretly... Oh!

This looks good!

And how many of you have actually written a book?

Yeah - a lot less! Still a handful!

That's impressive, we are in London. In other cities there are not that many but we're in a great place with great opportunities!

That said, all of us are technical authors to some degree, right?

So we've written stuff, but writing a book is a slightly different process. This is what I'm going to try to describe to you, sharing my experiences from writing this book. I will also give you some shortcuts and some tips, things that will probably save you a year or two while writing your book. "Small tips"! And of course I will explain the process to you, because I believe that to a great extent we don't know the process, and this is what makes it so scary to start with.

But why should somebody write a book, right? It helps the community. As contributors to open source software, one of the best contributions you can make is actually to contribute to a book or, potentially, to write one, and this was the case with Scrapy.

You can see here amazing numbers, like 15 thousand stars and forks, and it has been used for ages. Right now, I think, it's 7 or 8 years since the authors started writing it, with 200 contributors... and it's as if you have a community which screams...

"please write a book... write a book!"

I can tell you, right now, there's only my book out there for this library. Why is that? The community tells you... "you waited too long, just go write a book about this"

And, it's great - open source is fantastic - there are so many projects... If you contribute, you might consider it.

What slows book writing down a little, as an initiative, is that many projects already have fantastic documentation, and when a project has fantastic documentation the first thing that people tell you is:

"hey, why should you write a book? just contribute to the documentation!"

And this is quite true, but the documentation is always generic, right? It has multiple aims. It will be a tutorial, it will be a reference, it will be a little bit of everything and that's exactly what it should be.

A book will have a specific audience and a specific target. It will be "to help you learn this, when you're at this level, and I want to take you to that level". So a book is always necessary, I would say, even when documentation exists, and it helps the community grow.

It gives access to audiences that really love books and read them, and it also helps the manager: it scales learning. As a manager or as a teacher you can tell your people "just go read this book and you will know what this is all about".

Now, what is in it for the author? I can see from the questions I get here and there, I can see the perception, that it is about the money.

Questions like "how many books have you sold up to now?", which is a slightly funny question. Or "did you negotiate your contract? How many dollars do they give you per copy, or something?". Also a slightly funny question, especially in the context I'm talking about today.

"writing a book for your favourite python framework"

The size of your audience is limited by the audience of the framework essentially or how much you can expand it in the next few years. So if you want to write a best-seller write Harry Potter! Not a technical book, right?

Those numbers don't count; they are not the key metric for a technical book on a very specific thing. Also, if you want to write a technical book and have a best-seller, try "Microsoft Office" - another guide - I'm sure it will sell, right?

So - I would say it's not about the money, I haven't received any cheques yet - of course it's only 6 months, but - who cares - I mean, probably I will use them to write more software.

The main reason that we do write books is because it feels good, right? You get this feeling of contribution, of connection, you get this feeling of "there is this little territory, this material, that I have really mastered", and it feels really good, right? More or less this is also the reason why we write software.

And, networking, of course. You get access to some people that you might not have otherwise, because you write on the topic that they're interested in.

So - how?

I can guess what will happen tomorrow but anyway - How easy is it to start writing a book? Very easy!

All you do is go to any publisher's website, search for proposals or something like that @whatever publisher.com, and send an e-mail like this:

You can even see it here. It was kind of naively written but, no matter what, it worked:

"Hello, I'm Dimitris. There's a library called Scrapy. It has lots of users and there's no book out there. I think it would be a cool idea to write a book."

And then, two hours later, I receive a reply:

"Ok, let's do it - send us an outline, table of contents, a brief bio and it will get us going". So there will be an e-mail, there will be an outline, and later - I mean like 3 days later - there will be a contract. People will do a little bit of market research. So, probably, if it's a library with two stars, you might not actually get published at that stage, but anyway. So there will be some research, you will have a contract and you will have a plan!

A plan like - let's write a small book to start with, like let's see if there's any market. 100 pages, 6 months, great idea - go do it!

Yeah right! ok!

So, the plan is not going to work, and why is that?

In order to explain why, I have to tell you a little bit about the author's life. I'm an instance of an author, so I will tell you a little bit about my life. It was very easy to start with. I was "born geek", apparently. I made my first circuit when I was - I don't remember what, you know how it is with those ages - very young anyway. Later I wrote my first piece of software in Basic, in order to program my circuits. So the first thing I wrote was a programmer, actually. Then I did a Master's degree, accidentally, in Applied Mathematics and Physics, but it proved to be awesome; we will talk about this another time. Then I said - "I want to do my hobby", I really love this thing, so I will go back and study hardware. I did a Master's in Microelectronic System Design in Southampton and then I started working for ARM in Cambridge - a great company to work at - creating microprocessors that everybody here has in their mobile phones. Probably seven or ten in each mobile - such an amazing thing! So this was my life over there: quite linear, quite predictable, quite perfect, or something!

And suddenly, wow! Maybe you could call it burn-out? Maybe you could call it - how is it? - the feeling that you get after you work a long time for one company... One day I quit the company, ARM, I take the airplane, I go to Poland and I start being my own boss. You know - I had read "The 4-Hour Workweek" and all these books, and all these great things; there was maybe a little bit of hype around it, also, in 2009. So I went like - OK! I will be an entrepreneur, I will run my own business - and this is what I was doing, based in Birmingham, writing lots of software, helping lots of people, startups. I can tell you it was tons of work, like work, work, work. People around you really don't know when you work and when you don't. If you try this, working at home, wear a hat and say "I'm working now, don't talk to me", then take the hat off and say "now you can speak to me". If you work from home this is something that you have to do. Later I moved to Budapest and then to Seville. So, great stories, really lovely!

And at that point, somewhere in the timeline, somewhere over there actually, when you're working on your own business, you don't have a clue where you are. Whether it is up or down, you are somewhere, ok? And this is actually when I became an author. It was a strange time in my life, and if you know somebody who came back home from work and said - "ah - today I will write a book!", please show this person to me; I think there's nobody. I think everybody who starts writing a book is in a weird phase of his life. He has some extra time, he has some confusion, some freedom to do whatever he wants.

So this is one of the big reasons why the plan will not go as planned, but there is another big reason: a book is a product, right? It's very much a product. You can say software is a product - I don't know if it is, or sometimes it isn't, when you think of it as "software as a service" - but a book is really a product. It's finalized. Customers hold it in their hands, so the audience, your reader, is your customer, right? And so you have to start with a normal lean startup mentality, with a hypothesis. Who is this customer of mine? And the first thing you think is "hmmm... probably he would be somebody like me". This is how you start: "yeah... makes sense, somebody like me". So this is how you start with your first outline, and you can see here the first title I used: "Scrapy: Minimum Viable Product Faster than Fast!"

So the idea was: ok, you are a start-up, you need a Minimum Viable Product and you want to take it out there. You can write your nice application, but you don't have data. Obviously you don't have data, and developing without data means testing with "hello world". This is not good... So the whole concept was: go, use Scrapy, get data out of the real world. It's ok, ok? If you don't publish it, it's not something like a copy. Ok, consult your lawyers, but the point is you can do it for internal use without people understanding - get some real data, create a nice, cool Minimum Viable Product that looks mature, show it to people and they go like "okay, good, I get what you're telling me!" It's not the same if you have "hello world" on the forms or the applications you're writing.

So this was the first outline, and now I will step a little out of the chronological order of things and start talking about chapters, which more or less follow it anyway, so it's a good way to do both at the same time. When you start writing a book, the title, table of contents, cover and first chapter are all marketing material, guys, ok? Believe me, it's marketing material. And remember, if you have a mission, if you have a key message that you want to pass to your people, this is where it goes. And guess what, people will get this message whether they buy or not: it will be out there, it will be on Amazon, it will be discussed on social media. So if you really have a message, here it is... make it your title, put it on your cover, put it in your introduction, make your table of contents flow! Make short sentences. Make sure that when somebody reads your table of contents he goes like "Wow! I want this book, I want *this* book!" Ok? And then the reader will go to the next step and buy the book... otherwise that was all the discussion you could have with the reader...

"Here's my title; ah, I'm very reserved, I'm very safe, I have a very safe title". No, no, no! So consult your marketing friends or your sales friends, get a title that works, get a cover that works and convince people to read more, if you really believe in the cause.

And now it's time to introduce you to... the editor. Who is the editor? The editor is that guy who sends you, every now and then, some annoying emails like "when will you send me Chapter seven or Chapter three... I think we're a little bit behind schedule" and stuff like that. So, at the most basic level, an editor is a project manager for your project. He has skin in the game, he really wants the book delivered, and he might or might not have a clue about the specific thing you are working on. Highly likely he will not. The point is, an editor might have an area of expertise like Java, but he will not know the specific framework, so it's not always wise to consult him on technical stuff. You are the author, and I will come back to this point. So at some point the editor comes and tells me:

"ok good, I have read your drafts and stuff, so would you mind putting in a few extra pages to explain to people what a URL is?" "What?"... "What a URL is!"

It might sound funny, but for me this was a big... like... "I don't know the customer! I thought that the customer was like me... after all he's not like me... I mean partly... he is not like me! He's probably someone who is 14 years old and has written some programs, but he's not an expert... even a URL is not quite clear to him!" And thanks a lot to the editor for giving me that perspective - again, customer discovery. And remember that, from an editor's perspective, a beginner audience is like a goldmine, right? However many experts there are, there are lots more beginners; it's a much wider audience. So if in your life you write just a single book, make sure that you target the widest audience possible; write it so that it's beginner friendly, okay?

So I start with my first beginner-friendly chapters - Chapter two, basic crawling - a little bit of introduction: what XPath is, terminology, how to install the software, some background knowledge so we can go fast. Essentially those are a bit tutorial-like: you see a website, you want the data, how do you do it, how do you create the expressions, so you end up with the data in a file, more or less, by Chapter 3. But this is not enough, this is not what this book was about; data in an Excel spreadsheet is not very inspiring. You want something a little bit more, and this is why in Chapter four... I don't want to teach them web development or mobile development, so I use a service called Appery.io, where they can quickly drag and drop some stuff. It has its own database; they run Scrapy, data goes into that database, and here it is: within like 13 pages I can tell them "you can create a mobile application so you can demonstrate your data". Here it is... a Minimum Viable Product for everybody who wants to do something with their data, here you are, Chapter 4. And there is an interesting thing here... if you think about it, you're in Chapter 4 and you haven't had the opportunity to tell people how to write stuff to databases or make REST API calls. You just have a beginner; you have taught him some things, but how do you do it?
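
As a rough illustration of the tutorial flow described here (a page, a few XPath expressions, data in a file), below is a minimal sketch. It is not taken from the book: it uses only Python's standard library, whose `xml.etree.ElementTree` supports a limited XPath subset, rather than Scrapy's own selectors, and the page markup, field names and values are all invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed listing page. A real crawl would download
# and parse the page with Scrapy instead of using a hard-coded string.
PAGE = """
<html><body>
  <div class="listing"><h1>Flat in Camden</h1><span class="price">1200</span></div>
  <div class="listing"><h1>Studio in Hackney</h1><span class="price">950</span></div>
</body></html>
"""


def extract_items(page):
    """Apply XPath-style expressions to pull one record out of each listing."""
    root = ET.fromstring(page.strip())
    items = []
    for node in root.findall(".//div[@class='listing']"):
        items.append({
            "title": node.findtext("h1"),
            "price": int(node.findtext("span[@class='price']")),
        })
    return items


def to_csv(items):
    """Serialise the records as CSV text, i.e. the "data in a file" step."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


items = extract_items(PAGE)
csv_text = to_csv(items)
print(csv_text)
```

The same two steps, expressions first and export second, are what a beginner has to get comfortable with before anything like a database or a REST API enters the picture.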

And the answer is: it's a great thing to write open source software. It's a great thing to be contributing to those communities. It's a great thing to be writing Python. Why? Because all the things that are complicated for this level, all the complicated hacky code, boilerplate code or, you know, code that really is ugly, you don't want to put on paper. So, just as in hardware design we were partitioning between what we put into hardware and what we put into software, the same thing goes for the book.

The book is like the hardware: it's out there, you cannot modify it - be careful what you put there. Make sure, as the previous presentation said, that what you have in the book reads like poetry, and all the ugly stuff, put it in a library! Then in future versions, when Scrapy updates and its API changes, you can go back and change the library as well, without people having to make any changes. The book will be there, I don't know if forever, but for a long time. So upload a Python library to PyPI and you're done, and this is how I can show them this thing in Chapter four.

At this point I'm like okay, I've written... I have actually given my editor some really first drafts of something and he comes and says...

"okay, 100 pages, we are done!"
"We are what? We are?"
"Yeah, we're done, don't worry! Let's get it through review"
"but... but..."
"Let's get it through review, what's the worst thing that can happen?"

Okay! So here's the reviewer! I didn't ask this question, but I believe and hope many of you here are reviewers of books. A book reviewer is very important and very hard to find, because a reviewer has to have domain expertise. He has to know Scrapy, potentially, but at the very least he has to know Python well, right? He has limited time; he's probably a professional who does a full-time job and then in the afternoon he will check your book or something. There is almost always not a single reviewer but several, so think of the reviewer as a persona, and he has to be bad. A good reviewer is a bad reviewer, a guy who will tell you "you are wrong, you are wrong, you are wrong". If the guy comes and tells you "everything looks awesome", he actually didn't help you write a good book. So you want somebody who pays attention to detail and can give you nice feedback. He's the "proxy" for the customer, and so here I come and I get some comments...

I put the most polite comments over here. I wanted to show people how to use Rackspace so they can get their things directly to the cloud, because that's important for a start-up *in my mind* - obviously I removed this later - but he goes like "Rackspace actually called me at home and woke up my wife at 2:00 am"... ok! "Scrapy installation - I installed it on my Mac and it gave 32 warnings during installation". Ok - and what should I do about it? "Do you have permission to crawl this website?"... that's a good one, that's a good one - I hadn't thought about this one, okay? And the thing is... this guy identifies himself as a person who, in terms of knowledge of Python on a scale of 1 to 10, is at 8. So he's actually quite a strong guy, and I can claim that he is not my target audience. So this is not good; you want somebody who is good but is in your target audience, ok?

But anyway, I will get back to this. The point is that he gave me the perspective of this guy, part of the audience, who knows his stuff really well, and you actually don't want to annoy him. Probably he will read your book and he will not learn much, but you don't want to annoy him. For example, every now and then I was making the mistake of referring to Python lists as "arrays". This was pissing him off! He was going like: "What the hell? How can I trust that you know what you're talking about? This is Python, you should call it a list". Ok!

And some things that I've actually seen in books I have reviewed... somebody thought that it was funny to talk about models and put in a photo of a female model, and I'm like okay, good... but this doesn't sell that well in the US and UK, so be careful! Or another thing... A book is not exactly the best place to bitch about management... to say things like "we are the developers, you are the managers". It's easy to write a book like that, right? You might be angry, but it's not a good idea to go like "ah - those and us". So forget about "those and us", because somebody will be a manager and he will be reading your book, right? And you don't want to make him negative.

So it's all about being a little bit careful with your phrasing and your writing. But at this point I have the editor more or less telling me... "This is your book, you got it wrong, there is no audience, I'm so sorry and, you know... you have to make lots of changes" and stuff like that... And actually most of the stuff was his ideas, like the length and "let's make it more beginner friendly"... "no - we're missing code - we need code"... anyway. At some point, you know, our relationship stopped being that good.

I realised that as an author I am on my own, and this is what you have to remember: you are responsible for what you put out there. No matter how many people contribute, it is your name and your reputation which is at stake. So at that point I changed publisher and it was that simple, just one email, again: "Scrapy book that I will not publish with ABC. Guys, I have a very nice book, it's attached, and I will not publish with those guys for this and that very specific reason. I want to publish with you." As you can imagine... half a day later: "yeah, okay, let's do it!"

So I have a brand new editor, publisher and everything, and now, with the new understanding of who the customer is, I re-review all the material, add an extra 150 pages and a few years, you know, of writing, and here's my new book. Drop the title - "minimum viable product" and all this, now I understand, nonsense - and we have the "Learning Scrapy" book. Chapters 5-9 are full of very technical, very deep... I really love the material, it's cool material - mastering the internals of Scrapy - and I would say that in there you will find stuff that is not on Stack Overflow. Especially with Scrapy: it's written on top of Twisted, it's asynchronous. So on Stack Overflow you will find lots of advice that is complete nonsense and kills performance, so... try at the very least, if you want, to copy-paste a few of the things from this book to Stack Overflow so that we make it a little bit more reliable, in that sense... for this particular case...

The guys who started Scrapy have a start-up called Scrapinghub. In Chapter 6 I devoted a few pages to them, I think they deserve it: you have a spider, you run it in the cloud using their cloud, quite cool.

And then I want to tell you a little secret here. If you have heard about new technologies, you know, everybody is very enthusiastic - think about "deep learning" for example, everybody going "wow - this is the solution for everything". Then there is this phase of disillusionment - actually it works with images and stuff but it doesn't work with some other stuff... Then there's the enlightenment, going like "I know when it works", setting the expectations right, really. And then there's productivity, where it really... becomes the de facto technology.

This happens with everything and the same thing used to happen with me with every single chapter!

I was starting like - "this is going to be the best chapter, this will be just amazing, it's the most important". Then I was thinking, you know, in my brain, "graphs, and this is cool, I should put in this and that", and then I was getting an empty page, writing the first sentence, and I was like "hmm, ok, this will take a few months longer than I expected". So this was my usual experience. But Chapter 7 is configuration: it has to be there, it's important for the user, but configuration is also covered quite well in the documentation and it's not something that I'm really that passionate about, so I thought about trying another approach. What I did is: once there was an outline and the backbone of the chapter, I just hired a few people to write some drafts... in the form of "write me a tutorial explaining this setting and that setting, and that setting". And then... when I got the drafts back, suddenly I skipped all this painful part, because somehow my creative mind worked much better when I had to synthesize things, some of moderate quality, some of good quality. It was much easier to create this chapter than any other chapter and... it allowed me to focus on what an author is really there for: to put in kind of the poetry, to put in the good example, to distinguish between the important and the unimportant. So... think about it... if you write a book, consider this approach as well, ok? It's not ghostwriting... it's just getting some drafts, and it saves lots of time.

Chapter nine was a little bit of a pain because it uses Elasticsearch, Redis, MySQL, several of those technologies and, trust me, when you write a book, if you're an honest person, you have to explain to people how to install those things, and you have to explain that on Windows, on Mac, on all versions of Linux and Unix. You don't have the space, but you still want to give some cool examples, right? So you want to use MySQL but you don't want to write about how to install MySQL. This is where I was lucky, because the technologies had become mature enough that I could use them in my book. So right now... my readers just do vagrant up, they download a few gigabytes, ok? And then they have 9 ssh terminals ssh'ing to different servers, so something that looks a lot like what they would face in production, something realistic... and it works in the airport, it works offline... I even put in there a web server that serves a fake website, so this problem of "do you have permission to crawl that site?" is also solved. It's fantastic, guys; copy-paste those techniques. They are also related to reproducible research. Use Vagrant... right now we have the technologies to go like... "here is a box, click a few buttons and you can reproduce" - with zero support you can have your customers using your stuff.

Then I have to tell you that by that stage I was a different author... so yes, the editor came and said... "maybe we can cut some corners, you know, we can release before December" and I'm like "no, no thank you! My reputation is at stake. I will be next to the two stars on Amazon, you will not be there, right? So I will take my time, I will do the thing the way I want". And this is what brought kind of the last piece of the customer into my awareness, ok? Which is - my manager, my friend, myself, people like us, right? You want to write a book and at the very least have something in it for the people that you love.

And so - physics, right? So I created a model of Scrapy, a complete performance model: URLs flow like water through the pipeline, downloading here, where the bottleneck is, and more or less with this picture you know more about Scrapy performance than most other people around. How exactly to debug it, so if you have slow performance, which metrics you should look at and what you should hack - there's really no reliable information on the internet. Where are the bottlenecks? Again, one of the biggest problems - you will always say "my server is not strong enough". It's not your server - it's actually that you don't give the scraper enough work to do. You have to scrape the indices much faster than you consume the URLs through the pipeline.
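
To make the "not enough work" point concrete, here is a deliberately simplified back-of-the-envelope sketch. It is not the performance model from the book: it only applies Little's law (throughput = concurrency / latency) to the downloader and compares that with the rate at which index pages supply new URLs. The function names and all the numbers are assumptions chosen for illustration.

```python
def downloader_capacity(concurrent_requests, avg_latency_s):
    """Pages per second the downloader can sustain (Little's law)."""
    return concurrent_requests / avg_latency_s


def url_supply_rate(index_pages_per_s, links_per_index_page):
    """New item URLs per second produced by crawling index pages."""
    return index_pages_per_s * links_per_index_page


def diagnose(concurrent_requests, avg_latency_s,
             index_pages_per_s, links_per_index_page):
    """Return (achievable pages/s, name of the limiting stage)."""
    capacity = downloader_capacity(concurrent_requests, avg_latency_s)
    supply = url_supply_rate(index_pages_per_s, links_per_index_page)
    throughput = min(capacity, supply)
    bottleneck = "url supply" if supply < capacity else "downloader"
    return throughput, bottleneck


# Hypothetical numbers: 16 concurrent requests, 0.25 s average latency,
# and one index page per second that yields 30 item links.
print(diagnose(16, 0.25, 1.0, 30))
# Same downloader, but index pages are crawled four times faster.
print(diagnose(16, 0.25, 4.0, 30))
```

In the first call the downloader could handle more pages than the index crawl supplies, so a stronger server would not help; only in the second call does the downloader itself become the limit.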

Chapter eleven, distributed crawling - everybody was asking about that. Again, Stack Overflow - most of the replies are wrong. They are thinking per item: "I will have an item, I will get it through my pipeline". No, this is not how it works. Actually you have to think of something closer to what Apache Spark has, which is micro-batches: put things on S3 or something, then put a URL, a link to the file, on Kafka, and go and process many of them. This is the only way Scrapy will give you the performance you expect... On a mid-range server, actually a slow one, I achieved 700 pages per second, which is amazing.
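
The micro-batch idea can be sketched in a few lines. This is only an illustration under loud assumptions: a temporary local directory stands in for S3, a plain Python list stands in for a Kafka topic, and the helper names (`micro_batches`, `publish_batches`, `consume`) are made up for this example rather than taken from the book or from any library.

```python
import json
import tempfile
from pathlib import Path


def micro_batches(items, batch_size):
    """Group a stream of items into lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def publish_batches(items, batch_size, storage_dir, queue):
    """Write each batch to one file and enqueue only a link to that file."""
    for index, batch in enumerate(micro_batches(items, batch_size)):
        path = Path(storage_dir) / f"batch-{index:05d}.jl"
        with path.open("w", encoding="utf-8") as handle:
            for item in batch:
                handle.write(json.dumps(item) + "\n")
        queue.append(str(path))  # one small message per batch, not per item


def consume(queue):
    """Downstream worker: follow each link and process a whole batch at once."""
    for link in queue:
        with open(link, encoding="utf-8") as handle:
            yield [json.loads(line) for line in handle]


with tempfile.TemporaryDirectory() as storage:
    queue = []
    items = [{"url": f"http://example.com/item/{n}"} for n in range(7)]
    publish_batches(items, 3, storage, queue)
    batch_sizes = [len(batch) for batch in consume(queue)]

print(batch_sizes)
```

Seven items produce three messages instead of seven, which is the whole point: the per-message overhead is paid once per batch rather than once per item.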

Appendix - help your beginners, and at the very least have a reference so you can tell them "what you are asking about is on page 230".

Production: you are over there, you have handed in the final drafts, but it's not ready yet. PDFs will come back and they will be full of mistakes. Text will be broken. Especially for people here: sometimes you speak much better English than the people who proofread your book, ok? They will download some of your text - remember, you are the author, you have the responsibility. Diagrams might be cut or low resolution - tell them to fix them. Code, code, code - read it. Go stupid - I don't know how, make your brain not work - copy-paste the code and make sure that it works, especially in Python where spaces count. So be very careful with the code.
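
One cheap way to do part of that check mechanically is to paste each listing from the PDF into a string and ask Python to compile it. The sketch below is only an assumption about how one might do this, with a made-up helper name (`check_snippet`); note that `compile` catches broken indentation and other syntax damage but does not prove that the code behaves correctly, so running the examples is still needed.

```python
def check_snippet(source):
    """Return None if the snippet compiles, else a short error description."""
    try:
        compile(source, "<book listing>", "exec")
    except SyntaxError as error:  # IndentationError is a subclass of SyntaxError
        return f"{type(error).__name__}: {error.msg}"
    return None


# A listing as the author wrote it.
good_listing = "def parse(response):\n    return response\n"
# The same listing after typesetting stripped the leading spaces.
bad_listing = "def parse(response):\nreturn response\n"

print(check_snippet(good_listing))
print(check_snippet(bad_listing))
```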

And after all this one day you will be really proud and happy to see yourself on Amazon, great days!

Overall, summarizing: it's all about the customer, all about understanding who the reader is. He is many people, he's not just one. The main audience is people who are... this is independent of the depth of the book - you might be writing something extremely specific, and again there will be the high-end people and the masses, right?

So take the masses from there and bring them up; this is how your book is really judged: "you start somewhere over there, you will end up somewhere over there; you will not be an expert but you will know some stuff". Probably your environment is somewhere like that, so don't be fooled - you are not exactly writing for your environment. You really sell to those guys; these are your managers or whatever, so if you think about career and stuff, make sure that you send the right messages there. Do not offend these guys, the high-end guys; do not say "arrays" instead of "lists". And the guys who are really on top, they have no problem, they get the vision, they go like "do whatever you want, I can see the benefit".

And for all of you here: try to contribute, this is my main message today. Write a book on your favorite *Python* framework. Be the author, be the co-author. Don't forget: four co-authors, no problem, no problem. "It's not fair - somebody did more work than the other" - who cares? Who cares? If it's not something completely unfair, unless somebody said "I will write" and disappeared, just put in a few co-authors, it's not that bad. Be reviewers - very important. You here have the expertise - be reviewers of books, and be polite, okay? Because we want the books to go out. So find a nice balance between being polite and reviewing books really harshly - give valuable feedback. And also support the authors, in the sense of, you know, if you hear of somebody writing a book, go and say "okay, cool, how can I help you?" or something like "do you need this change in the code that will help you? Send me a draft". Make it open - the industry is a bit secretive, but make it open, I think we can do it. And so, after all this:

Let's make the world a better place! Thank you!

> Thank you Dimitris. We have 2 questions.
> Thank you, so how did you actually answer the question, can I scrape this website?

So the point is that everybody is a little bit on his own, and there is a whole universe of data and copyrights and stuff like that. As a company, whoever scrapes has to have their lawyers, hopefully, so go ask them and make sure that you negotiate in advance, ok? And this is a good thing about scraping: it makes you think about the ecosystem. You go like "I get the data from those guys, can I trust them? Could we collaborate?" So do that early. What I did in terms of the book is I just did some statistics on top of what I scraped and then I created simulated pages with Monte Carlo, and you can see the source code in there. So in the end what I give to my readers is completely copyright free, but it follows the format, and all the XPaths are exactly the same as the XPaths that you can use on the actual website - it is Gumtree. So it is realistic, but we don't hit the website, and this also makes it available offline, right? I can be in the airport and hit my own web server. So I think this is more or less the best solution you can have, but yes, it's a good thing to... keep in mind that data is important, and get your lawyers ready, or do the negotiations about data early enough, because it's an important issue.
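
For readers curious what "simulated pages" might look like in practice, here is a minimal sketch of the general technique, not the book's actual generator: random values are sampled from assumed distributions and poured into a fixed HTML template, so every generated page answers to the same XPath expressions. The template, the `itemprop` attributes, the word lists and the log-normal price parameters are all invented for this example; in the real case they would come from statistics computed on scraped data.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical page template: the structure stays fixed, so one set of
# XPath expressions works on every simulated page.
TEMPLATE = (
    "<html><body>"
    "<h1 itemprop='name'>{title}</h1>"
    "<span itemprop='price'>{price}</span>"
    "</body></html>"
)
KINDS = ["Flat", "Studio", "Room", "House"]
AREAS = ["Camden", "Hackney", "Brixton", "Ealing"]


def simulate_page(rng):
    """Sample one fake listing and render it into the fixed template."""
    title = f"{rng.choice(KINDS)} in {rng.choice(AREAS)}"
    # Assumed log-normal price distribution with a floor of 300.
    price = max(300, round(rng.lognormvariate(7.0, 0.4)))
    return TEMPLATE.format(title=title, price=price)


rng = random.Random(42)  # fixed seed, so the fake site is reproducible
pages = [simulate_page(rng) for _ in range(5)]

# The same expression extracts the price from every simulated page.
prices = [
    int(ET.fromstring(page).findtext(".//span[@itemprop='price']"))
    for page in pages
]
print(prices)
```

Serving such pages from a local web server gives readers something realistic to crawl without touching the original site.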

> And for the final question, so what is the tool that you used to write the book?

Yeah, every publisher has their own thing and it gets a little bit messy, because some of the tools they use are not straightforward. So in the end you end up learning, you know, AsciiDoc or whatever, one flavor of it, and if you want to change publisher, it is not completely open. I can see an opportunity there, but it's not completely open; you cannot go like "okay, I don't like this publisher, I'll go to another one", so there will be re-markup that needs to be done. But to be honest... it's markup, you learn it; after a while it probably takes one day, two days to put in the markup... The images are important; they're a completely different story. Images - keep them in whatever the source format is so you can always create higher quality, always use vector, and actually they typically go into the draft very late, right, they get integrated late. So the tools are a little bit of a pain, but not the biggest pain.

Thank you!

5/7/2016
Dimitrios Kouzis-Loukas
