New Instagram module to avoid bans
We now use instagrapi (https://github.com/adw0rd/instagrapi), a more stable library for fetching data from Instagram

It requires logging in with a username and password, which must be set as the INSTAGRAM_ACCOUNT_USERNAME and INSTAGRAM_ACCOUNT_PASSWORD environment variables before execution
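
As a rough illustration of the environment-variable requirement, the sketch below reads the two credentials and fails early when one is missing. The variable names come from the README changes in this commit; the `read_instagram_credentials` helper and the `MissingCredentialsError` class are hypothetical and not part of the project.

```python
import os


class MissingCredentialsError(RuntimeError):
    """Raised when a required environment variable is unset or empty."""


def read_instagram_credentials(env=os.environ):
    """Return (username, password) read from the environment.

    Hypothetical helper: only the variable names are taken from the
    project, the function itself is an illustration.
    """
    names = ("INSTAGRAM_ACCOUNT_USERNAME", "INSTAGRAM_ACCOUNT_PASSWORD")
    # Collect every missing name so the error reports all of them at once
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise MissingCredentialsError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return env[names[0]], env[names[1]]
```

The returned pair could then be passed to `cl.login(username, password)`, as instagram.py does with its module-level constants.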

The instagram.py module has been rewritten from scratch, although it still follows the module structure shared across the project

I added a new function, *download(url, filename)*, to utils.py to download any file type, since instagrapi doesn't come with a method to download media
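
A minimal sketch of what such a helper might look like follows. It is not the code added in this commit: it swaps the project's requests call for `urllib.request` from the standard library so that it has no third-party dependency, and it streams the body to disk in chunks. The `timeout` parameter is an assumption added for illustration.

```python
import shutil
import urllib.request


def download(url, filename, timeout=30):
    """Save the resource at `url` to `filename` and return the filename.

    Sketch only: uses urllib instead of the requests package, and copies
    the response in chunks so large videos are not held in memory.
    """
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses,
    # so an error page is never written to disk as if it were media
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(filename, "wb") as file:
            shutil.copyfileobj(response, file)
    return filename
```

Returning the filename keeps the helper usable inline, for example as an argument to a function that expects a path to the saved file.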

I also updated the requirements and the documentation to reflect the changes to the Instagram module

The old Instagram module is still available as instagram_old.py; I will delete it if a few weeks of testing the new module turn up no issues
marco97pa committed on Nov 22
1 parent 820ac5d commit 7f006a1f7660cc98df16ef4bdef31d2ff09fbab5
Showing with 250 additions and 70 deletions.
  1. +2 −2 .github/workflows/main.yml
  2. +3 −2 README.md
  3. +5 −16 docs/source/index.rst
  4. +67 −47 instagram.py
  5. +154 −0 instagram_old.py
  6. +3 −3 requirements.txt
  7. +16 −0 utils.py
@@ -3,8 +3,8 @@
name: Runner

on:
# schedule:
# - cron: "*/10 * * * *" # Runs every 10 minutes, everyday (actual timing may vary, depending on GitHub)
schedule:
- cron: "*/10 * * * *" # Runs every 10 minutes, everyday (actual timing may vary, depending on GitHub)
workflow_dispatch:

jobs:
@@ -78,7 +78,7 @@ The project is really modular and by editing the YAML file you can easily fork t
* Python 3
* [Tweepy](https://pypi.org/project/tweepy/)
* [pillow](https://pypi.org/project/Pillow/)
* [insta-scrape](https://pypi.org/project/insta-scrape/)
* [instagrapi](https://github.com/adw0rd/instagrapi)
* [python-youtube](https://pypi.org/project/python-youtube/)
* [spotipy](https://github.com/plamere/spotipy)
* [billboard.py](https://github.com/guoguo12/billboard-charts)
@@ -119,7 +119,8 @@ in that case all the python command will be 'python' instead of 'python3' and 'p
export YOUTUBE_API_KEY='xxxx'
export INSTAGRAM_SESSION_ID='xxxx'
export INSTAGRAM_ACCOUNT_USERNAME='xxxxxx'
export INSTAGRAM_ACCOUNT_PASSWORD='xxxxxx'
export SPOTIPY_CLIENT_ID='xxxx'
export SPOTIPY_CLIENT_SECRET='xxxx'
@@ -92,24 +92,13 @@ keys. Then set them as environment variables, by running these lines:
``export SPOTIPY_CLIENT_ID='xxxx'``
``export SPOTIPY_CLIENT_SECRET='xxxx'``

Instagram SESSION_ID cookie
Instagram USERNAME and PASSWORD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can get your Instagram **Session ID** if you are logged in to
Instagram using any browser. This guide uses Google Chrome.

1. Open Google Chrome
2. Go to `Instagram`_ and log in to your account
3. Right click somewhere on the webpage and select **Inspect** on the
dropdown menu to open Chrome Developer tools
4. Click the **Application** tab under the Chrome Developer Tools window
5. Under the **storage** header on the left-hand menu, expand
**Cookies** and click on the entry for ``https://www.instagram.com/``
6. Find the row with the **Name** equal to ``sessionid``: this is your
**Session ID**

| Then set it as environment variable by running:
| ``export INSTAGRAM_SESSION_ID='xxxxx'``
| Set your username and password as environment variables by running:
| ``export INSTAGRAM_ACCOUNT_USERNAME='xxxxxx'``
| ``export INSTAGRAM_ACCOUNT_PASSWORD='xxxxxx'``


Fork
--------------------
@@ -1,17 +1,21 @@
#!/usr/bin/env python3
import os
from instascrape import *
from utils import display_num, convert_num, download_image
from instagrapi import Client
from PIL import Image
import requests
from utils import display_num, convert_num, download
from tweet import twitter_post, twitter_post_image
from random import randint
from time import sleep

module = "Instagram"

# Get Instagram cookies
instagram_sessionid = os.environ.get('INSTAGRAM_SESSION_ID')
headers = {"user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
"cookie": f"sessionid={instagram_sessionid};"}
ACCOUNT_USERNAME = os.environ.get('INSTAGRAM_ACCOUNT_USERNAME')
ACCOUNT_PASSWORD = os.environ.get('INSTAGRAM_ACCOUNT_PASSWORD')

# Login
cl = Client()
cl.login(ACCOUNT_USERNAME, ACCOUNT_PASSWORD)

module = "Instagram"

def instagram_data(group):
"""Runs all the Instagram related tasks
@@ -26,26 +30,23 @@ def instagram_data(group):
"""

print("[{}] Starting tasks...".format(module))
group, ig_profile = instagram_profile(group)
wait_random()
group = instagram_last_post(group, ig_profile)
group, user_id = instagram_profile(group)
group = instagram_last_post(group, user_id)

for artist in group["members"]:
wait_random()
artist, ig_profile = instagram_profile(artist)
wait_random()
artist = instagram_last_post(artist, ig_profile)
artist, user_id = instagram_profile(artist)
artist = instagram_last_post(artist, user_id)

print()
return group

def instagram_last_post(artist, profile):
def instagram_last_post(artist, user_id):
"""Gets the last post of a profile
It tweets if there is a new post, that is, if a fetched post has a newer timestamp than the latest stored post
Args:
- profile: a Profile instance, already scraped
- user_id: a profile ID
- artist: a dictionary with all the details of the artist
Returns:
@@ -54,32 +55,33 @@ def instagram_last_post(artist, profile):

print("[{}] ({}) Fetching new posts".format(module, artist["instagram"]["url"][26:-1]))

recents = profile.get_recent_posts()
medias = cl.user_medias(user_id, 5)

for recent in recents:
recent.scrape(headers=headers)
for media in medias:
# If the last post timestamp is greater (post is newest) or the saved post does not exist
if "last_post" not in artist["instagram"] or recent.timestamp > artist["instagram"]["last_post"]["timestamp"]:
url = "https://www.instagram.com/p/" + recent.shortcode
if recent.is_video:
if "last_post" not in artist["instagram"] or media.taken_at.timestamp() > artist["instagram"]["last_post"]["timestamp"]:
url = "https://www.instagram.com/p/" + media.code
if media.resources[0].media_type == 2:  # media_type is an int in instagrapi (2 = video), so comparing with "2" never matches
content_type = "video"
filename = "temp.mp4"
source = media.resources[0].video_url
else:
content_type = "photo"
filename = "temp.jpg"
recent.download(filename)
source = media.resources[0].thumbnail_url
download(source, filename)
twitter_post_image(
"{} posted a new {} on #Instagram:\n{}\n{}\n{}\n\n{}".format(artist["name"], content_type, clean_caption(recent.caption), recent.timestamp, url, artist["hashtags"]),
"{} posted a new {} on #Instagram:\n{}\n{}\n{}\n\n{}".format(artist["name"], content_type, clean_caption(media.caption_text), media.taken_at.timestamp(), url, artist["hashtags"]),
filename,
None
)
else:
break

last_post = {}
last_post["url"] = "https://www.instagram.com/p/" + recents[0].shortcode
last_post["caption"] = recents[0].caption
last_post["timestamp"] = recents[0].timestamp
last_post["url"] = "https://www.instagram.com/p/" + medias[0].code
last_post["caption"] = medias[0].caption_text
last_post["timestamp"] = medias[0].taken_at.timestamp()

artist["instagram"]["last_post"] = last_post

@@ -95,34 +97,35 @@ def instagram_profile(artist):
Returns:
- a dictionary containing all the updated data of the artist
- a Profile instance
- a Profile ID
"""
username = artist["instagram"]["url"][26:-1]

print("[{}] ({}) Fetching profile details".format(module, artist["instagram"]["url"][26:-1]))
print("[{}] ({}) Fetching profile details".format(module, username))

profile = Profile(artist["instagram"]["url"])
profile = profile.scrape(headers=headers, inplace=False)
artist["instagram"]["posts"] = profile.posts
user_id = cl.user_id_from_username(username)
info = cl.user_info(user_id)
artist["instagram"]["posts"] = info.media_count
# Update profile pic
artist["instagram"]["image"] = profile.profile_pic_url_hd
artist["instagram"]["image"] = info.profile_pic_url

# Add followers if never happened before
if "followers" not in artist["instagram"]:
artist["instagram"]["followers"] = profile.followers
artist["instagram"]["followers"] = info.follower_count

# Update followers only if there is an increase (fixes https://github.com/marco97pa/Blackpink-Data/issues/11)
if profile.followers > artist["instagram"]["followers"]:
print("[{}] ({}) Followers increased {} --> {}".format(module, artist["instagram"]["url"][26:-1], artist["instagram"]["followers"], profile.followers))
if convert_num("M", artist["instagram"]["followers"]) != convert_num("M", profile.followers):
if info.follower_count > artist["instagram"]["followers"]:
print("[{}] ({}) Followers increased {} --> {}".format(module, artist["instagram"]["url"][26:-1], artist["instagram"]["followers"], info.follower_count))
if convert_num("M", artist["instagram"]["followers"]) != convert_num("M", info.follower_count):
twitter_post_image(
"{} reached {} followers on #Instagram\n{}".format(artist["name"], display_num(profile.followers), artist["hashtags"]),
download_image(artist["instagram"]["image"]),
display_num(profile.followers, short=True),
"{} reached {} followers on #Instagram\n{}".format(artist["name"], display_num(info.follower_count), artist["hashtags"]),
download_profile_pic(artist["instagram"]["image"]),
display_num(info.follower_count, short=True),
text_size=50
)
artist["instagram"]["followers"] = profile.followers
artist["instagram"]["followers"] = info.follower_count

return artist, profile
return artist, user_id

def clean_caption(caption):
"""Removes unnecessary parts of an Instagram post caption
@@ -148,7 +151,24 @@ def clean_caption(caption):

return clean[:90]

def wait_random():
sleeptime = randint(5,50)
print("Sleeping for {} sec".format(sleeptime))
sleep(sleeptime)
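def download_profile_pic(url):
    """Downloads an image, given a URL
    The image is resized to 400x400 and saved in the download.jpg file
    Args:
      url: source URL to download the image from
    Returns:
      the name of the saved file
    """

    filename = "download.jpg"
    response = requests.get(url)
    # Fail early on HTTP errors instead of saving an error page as an image
    response.raise_for_status()

    with open(filename, "wb") as file:
        file.write(response.content)

    img = Image.open(filename)
    # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
    img = img.resize((400, 400), Image.LANCZOS)
    img.save(filename)

    return filename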
@@ -0,0 +1,154 @@
#!/usr/bin/env python3
import os
from instascrape import *
from utils import display_num, convert_num, download_image
from tweet import twitter_post, twitter_post_image
from random import randint
from time import sleep

# Get Instagram cookies
instagram_sessionid = os.environ.get('INSTAGRAM_SESSION_ID')
headers = {"user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
"cookie": f"sessionid={instagram_sessionid};"}

module = "Instagram"

def instagram_data(group):
"""Runs all the Instagram related tasks
It scrapes data from Instagram for the whole group and the single artists
Args:
group: dictionary with the data of the group to scrape
Returns:
the same group dictionary with updated data
"""

print("[{}] Starting tasks...".format(module))
group, ig_profile = instagram_profile(group)
wait_random()
group = instagram_last_post(group, ig_profile)

for artist in group["members"]:
wait_random()
artist, ig_profile = instagram_profile(artist)
wait_random()
artist = instagram_last_post(artist, ig_profile)

print()
return group

def instagram_last_post(artist, profile):
"""Gets the last post of a profile
It tweets if there is a new post, that is, if a fetched post has a newer timestamp than the latest stored post
Args:
- profile: a Profile instance, already scraped
- artist: a dictionary with all the details of the artist
Returns:
a dictionary containing all the updated data of the artist
"""

print("[{}] ({}) Fetching new posts".format(module, artist["instagram"]["url"][26:-1]))

recents = profile.get_recent_posts()

for recent in recents:
recent.scrape(headers=headers)
# If the last post timestamp is greater (post is newest) or the saved post does not exist
if "last_post" not in artist["instagram"] or recent.timestamp > artist["instagram"]["last_post"]["timestamp"]:
url = "https://www.instagram.com/p/" + recent.shortcode
if recent.is_video:
content_type = "video"
filename = "temp.mp4"
else:
content_type = "photo"
filename = "temp.jpg"
recent.download(filename)
twitter_post_image(
"{} posted a new {} on #Instagram:\n{}\n{}\n{}\n\n{}".format(artist["name"], content_type, clean_caption(recent.caption), recent.timestamp, url, artist["hashtags"]),
filename,
None
)
else:
break

last_post = {}
last_post["url"] = "https://www.instagram.com/p/" + recents[0].shortcode
last_post["caption"] = recents[0].caption
last_post["timestamp"] = recents[0].timestamp

artist["instagram"]["last_post"] = last_post

return artist

def instagram_profile(artist):
"""Gets the details of an artist on Instagram
It tweets if the artist reaches a new followers goal
Args:
artist: a dictionary with all the details of the artist
Returns:
- a dictionary containing all the updated data of the artist
- a Profile instance
"""

print("[{}] ({}) Fetching profile details".format(module, artist["instagram"]["url"][26:-1]))

profile = Profile(artist["instagram"]["url"])
profile = profile.scrape(headers=headers, inplace=False)
artist["instagram"]["posts"] = profile.posts
# Update profile pic
artist["instagram"]["image"] = profile.profile_pic_url_hd

# Add followers if never happened before
if "followers" not in artist["instagram"]:
artist["instagram"]["followers"] = profile.followers

# Update followers only if there is an increase (fixes https://github.com/marco97pa/Blackpink-Data/issues/11)
if profile.followers > artist["instagram"]["followers"]:
print("[{}] ({}) Followers increased {} --> {}".format(module, artist["instagram"]["url"][26:-1], artist["instagram"]["followers"], profile.followers))
if convert_num("M", artist["instagram"]["followers"]) != convert_num("M", profile.followers):
twitter_post_image(
"{} reached {} followers on #Instagram\n{}".format(artist["name"], display_num(profile.followers), artist["hashtags"]),
download_image(artist["instagram"]["image"]),
display_num(profile.followers, short=True),
text_size=50
)
artist["instagram"]["followers"] = profile.followers

return artist, profile

def clean_caption(caption):
"""Removes unnecessary parts of an Instagram post caption
It removes all the hashtags and converts tags in plain text (@marco97pa --> marco97pa)
Args:
caption: a text
Returns:
the same caption without hashtags and tags
"""

clean = ""

words = caption.split()
for word in words:
if word[0] != "#":
if word[0] == "@":
clean += word[1:] + " "
else:
clean += word + " "

return clean[:90]

def wait_random():
sleeptime = randint(5,50)
print("Sleeping for {} sec".format(sleeptime))
sleep(sleeptime)
@@ -8,10 +8,10 @@ pyyaml
# Needed to post or retrieve data from Twitter
tweepy
#
# insta-scrape
# https://pypi.org/project/insta-scrape/
# instagrapi
# https://adw0rd.github.io/instagrapi/
# Needed to retrieve data from Instagram
insta-scrape
instagrapi
#
# python-youtube
# https://pypi.org/project/python-youtube/
@@ -88,3 +88,19 @@ def download_image(url):
file.close()

return filename

def download(url, filename):
    """Downloads a file, given a URL and a filename
    Args:
      url: source URL to download the file from
      filename: name of the file to save
    Returns:
      the name of the saved file
    """

    response = requests.get(url)
    # Fail early on HTTP errors instead of saving an error page
    response.raise_for_status()

    with open(filename, "wb") as file:
        file.write(response.content)

    return filename
