git.madduck.net Git - code/twitter-archiver.git/blob - tweetfetch.py

madduck's git repository

Every one of the projects in this repository is available at the canonical URL git://git.madduck.net/madduck/pub/<projectpath> — see each project's metadata for the exact URL.

All patches and comments are welcome. Please squash your changes to logical commits before using git-format-patch and git-send-email to patches@git.madduck.net. If you'd read over the Git project's submission guidelines and adhered to them, I'd be especially grateful.

SSH access, as well as push access, can be individually arranged.

If you use my repositories frequently, consider adding the following snippet to ~/.gitconfig and using the third clone URL listed for each project:

[url "git://git.madduck.net/madduck/"]
  insteadOf = madduck:

#!/usr/bin/python3
#
# tweetfetch.py
#
# Fetches tweets newer than a given tweet ID and stores the data in JSON
# files, as well as an HTML dump for each, in the subdirectory ./tweets.
#
# Usage: ./tweetfetch.py 1159606554041040896
#
# The ID of the last tweet fetched is stored in ./tweets/.sentinel
#
# Copyright © 2017–2019 by martin f. krafft <madduck@madduck.net>
# Released under the Artistic Licence 2.0
#

# authdata.py must define the four OAuth credentials used below.
from authdata import (consumer_key, consumer_secret,
                      access_token, access_secret)

from twython import Twython
import json
import os
import sys

twitter = Twython(app_key=consumer_key,
                  app_secret=consumer_secret,
                  oauth_token=access_token,
                  oauth_token_secret=access_secret,
                  oauth_version=1)

# Request parameters for the user timeline: original tweets only
# (no retweets, no replies), at most 200 per call, without the
# embedded user object.
config = {'include_rts': False,
          'count': 200,
          'trim_user': True,
          'exclude_replies': True,
          }

# An optional first argument restricts the results to tweets newer than
# the given tweet ID.
if len(sys.argv) > 1:
    config['since_id'] = sys.argv[1]
    print("Limiting results to tweets since ID {}".format(config['since_id']),
          file=sys.stderr)

user_timeline = twitter.get_user_timeline(screen_name="martinkrafft",
                                          **config)

# Start from the given ID so the sentinel never moves backwards.
max_id = int(config.get('since_id', 0))

print("Fetched {} tweets, writing them to disk…".format(len(user_timeline)),
      file=sys.stderr)

# Make sure the output directory exists before writing into it.
os.makedirs("tweets", exist_ok=True)

for tweet in user_timeline:
    # Raw tweet data as returned by the API
    with open("tweets/{}.json".format(tweet['id_str']), "wt",
              encoding="utf-8") as tf:
        print(json.dumps(tweet), file=tf)

    # HTML rendering of the tweet text with links expanded
    with open("tweets/{}.html".format(tweet['id_str']), "wt",
              encoding="utf-8") as tf:
        print(Twython.html_for_tweet(tweet, use_expanded_url=True), file=tf)

    print("  wrote tweet ID {}".format(tweet['id_str']),
          file=sys.stderr)

    max_id = max(tweet['id'], max_id)

print("Writing ID {} to sentinel file…".format(max_id), file=sys.stderr)
with open("tweets/.sentinel", "wt") as tf:
    print('{0:d}'.format(max_id), file=tf)
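
The script writes the highest tweet ID it has seen to ./tweets/.sentinel but takes its starting ID only from the command line. A minimal sketch of how that file could be read back to resume an archive run is given below. The helper name read_sentinel is hypothetical and not part of tweetfetch.py; the sketch assumes the sentinel holds a single decimal ID, as written by the script, and treats a missing, empty, or zero sentinel as "nothing archived yet".

```python
import os
import tempfile


def read_sentinel(path="tweets/.sentinel"):
    """Return the tweet ID stored in the sentinel file, or None.

    Hypothetical helper for illustration; tweetfetch.py itself only
    writes the sentinel and never reads it.
    """
    try:
        with open(path, "rt") as sf:
            content = sf.read().strip()
    except FileNotFoundError:
        # No sentinel yet: nothing has been archived.
        return None
    if not content:
        return None
    tweet_id = int(content)
    # The script writes 0 when it has seen no tweets and was given no ID.
    return tweet_id if tweet_id > 0 else None


if __name__ == "__main__":
    # Demonstrate with a throwaway sentinel instead of ./tweets/.sentinel.
    with tempfile.TemporaryDirectory() as tmpdir:
        sentinel = os.path.join(tmpdir, ".sentinel")
        print(read_sentinel(sentinel))   # no file yet
        with open(sentinel, "wt") as sf:
            print("1159606554041040896", file=sf)
        print(read_sentinel(sentinel))   # the ID just written
```

Under the same assumption, the value returned here could be passed as the script's first argument, or used as the since_id default when no argument is given.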