Monday, 9 December 2019

Find large files in Gmail ("low on storage space")

One day I started getting this very annoying warning in Gmail:

"You're running low on storage space. Try freeing up space or purchase additional storage."

Buying more storage space would cost me about $9 a month, which isn't much, but I already pay about as much to Apple, and it's very annoying to start paying for something I always considered free (so much for rational thinking). Several times I used "in:all size:10000000" to delete a few emails, but after a few weeks or months the warning kept coming back. So I decided to take a better look.

Encouraged by this SO thread, I used the Gmail API getting-started guide to create the following script:

import re
import pickle
import os.path
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from googleapiclient.errors import HttpError


credentials = "gmail-credentials.json"  # OAuth client secrets downloaded from the Cloud Console


# If modifying these scopes, delete the file token.pickle.
SCOPES = ['https://www.googleapis.com/auth/gmail.readonly']


def main():
    creds = None
    # token.pickle caches the user's access and refresh tokens between runs
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            creds = pickle.load(token)
    # No valid cached credentials: refresh them or run the OAuth flow again
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                credentials, SCOPES)
            creds = flow.run_local_server(port=0)
        with open('token.pickle', 'wb') as token:
            pickle.dump(creds, token)

    service = build('gmail', 'v1', credentials=creds)

    user_id = "lookfwd@gmail.com"  # the special value "me" also works here

    # Everything roughly 2Mb and up - see the size-band estimates below
    query = "in:all size:2000000"
    response = service.users().messages().list(userId=user_id,
                                               q=query).execute()
    messages = []
    if 'messages' in response:
        messages.extend(response['messages'])
    # Follow nextPageToken to collect every page of results
    while 'nextPageToken' in response:
        page_tk = response['nextPageToken']
        response = service.users().messages().list(userId=user_id, q=query,
                                                   pageToken=page_tk).execute()
        messages.extend(response.get('messages', []))

    # One TSV line per message: id, size estimate, sender address, From, Subject
    for msg_ref in messages:
        msg_id = msg_ref.get('id')
        try:
            msg = service.users().messages().get(userId=user_id,
                                                 id=msg_id).execute()
            fromv = ",".join([h['value'] for h in msg['payload']['headers']
                              if h.get('name') == 'From'])
            # Prefer the bare address out of e.g. 'Some Name <name@host.com>'
            m = re.search(r'<(.*?@.*?)>', fromv)
            mail = m.group(1) if m else fromv
            subject = ",".join([h['value'] for h in msg['payload']['headers']
                                if h.get('name') == 'Subject'])
            print(f"{msg_id}\t{msg['sizeEstimate']}\t{mail}\t"
                  f"{fromv}\t{subject}")
        except HttpError:  # Not found
            print(f"{msg_id}\t\t\t\t")


if __name__ == '__main__':
    main()


Most of the above is copy-pasted from the getting-started guide. A reasonable question is: why did I draw the line at "in:all size:2000000"?

What I did was query the message counts at different size thresholds.

    # in:all             : 179926 messages
    # in:all size:25000  :  58025 messages => less than   3Gb x<25k
    # in:all size:50000  :  34796 messages => less than 1.6Gb 25k<x<50k
    # in:all size:100000 :  14307 messages => less than   2Gb 50k<x<100k
    # in:all size:200000 :   6143 messages => less than 1.6Gb 100k<x<200k
    # in:all size:500000 :   3416 messages => less than 1.3Gb 200k<x<500k
    # in:all size:1000000:   2154 messages => less than 1.1Gb 500k<x<1Mb
    # ---------------------------------------------------------------------
    # in:all size:2000000:   1195 messages => less than   2Gb 1Mb<x<2Mb
    # in:all size:5000000:    409 messages => less than   3Gb 2Mb<x<5Mb
    # in:all size:9900000:    143 messages => less than 1.6Gb 5Mb<x<10Mb

Since "size:" matches messages larger than the given byte count, querying at successive thresholds and subtracting the counts tells you how many messages fall in each band, and from that you can estimate lower/upper bounds on the total GB per band. You may of course decide to download all (in my case 179926) messages, but that would take some time, and we can see that most of them are smaller than 25k, so why bother? I set the bar at >=2Mb, which according to the estimates means at least a couple of Gb of messages (1195 messages of 2Mb or more is at least ~2.4Gb).
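As a sanity check, here's a small sketch of that band arithmetic (the threshold counts are copied from the table above; the Gb figures are only bounds, since "size:N" counts messages larger than N bytes):

thresholds = [0, 25_000, 50_000, 100_000, 200_000, 500_000,
              1_000_000, 2_000_000, 5_000_000, 9_900_000]
counts = [179926, 58025, 34796, 14307, 6143, 3416, 2154, 1195, 409, 143]

# counts[i] = number of messages larger than thresholds[i], so the band
# thresholds[i]..thresholds[i+1] holds counts[i] - counts[i+1] messages.
for i in range(len(thresholds) - 1):
    in_band = counts[i] - counts[i + 1]
    low_gb = in_band * thresholds[i] / 1e9
    high_gb = in_band * thresholds[i + 1] / 1e9
    print(f"{thresholds[i]:>9}..{thresholds[i+1]:>9}: {in_band:6d} messages,"
          f" {low_gb:5.2f}..{high_gb:5.2f} GB")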

So here's the somewhat interesting part. After exporting my top messages to a spreadsheet, I could create a pivot table and see how much data each sender had sent me.
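If you'd rather stay in Python, the same pivot is a few lines of pandas (a sketch, assuming the script's output was saved as large_messages.tsv as above):

import pandas as pd

# The script prints rows with no header: id, size, sender address, From, Subject
cols = ["id", "size", "mail", "from", "subject"]
df = pd.read_csv("large_messages.tsv", sep="\t", names=cols)

# Total bytes per sender address, largest senders first
by_sender = df.groupby("mail")["size"].sum().sort_values(ascending=False)
print(by_sender.head(20))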


It turns out that my biggest fan is... me. Not a big surprise there, since I'd been using my email as a todo list and sending myself images and other notes for some time. Then, ranked between myself and my mother (!), a friend of mine had sent me about half a Gb. It turns out that between 2009 and 2012 he was sending me funny videos, and I guess I found them funny but never really deleted them afterwards - since Gmail was infinite, wasn't it? An easy half a Gb there! :)

To delete all emails with attachments from a given sender, you search for:

from:(whoever@whatever.com) has:attachment
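You could also script the cleanup instead of clicking through the web UI. A hypothetical sketch reusing the service and user_id objects from the script above - note that batchDelete permanently deletes (it skips the Trash) and requires the full https://mail.google.com/ scope rather than gmail.readonly:

query = "from:(whoever@whatever.com) has:attachment"
response = service.users().messages().list(userId=user_id, q=query).execute()
ids = [m["id"] for m in response.get("messages", [])]
if ids:
    # Permanently delete every matching message in one call
    service.users().messages().batchDelete(
        userId=user_id, body={"ids": ids}).execute()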

Finally, you might ask: was it worth the effort, compared to $9 a month for the rest of my life? I think yes. I might have lost money by spending time on this, but it gives me some confidence that I have control over my data and my spending. Well... next time I will likely just archive everything with Google Takeout and delete everything more than 5 years old. Looking at old emails makes you realize you're getting older. :)
