How to organize data in a JSON file created through web scraping
I'm trying to get article titles from Yahoo News and organize them in a JSON file. When I dump the data to a JSON file it is confusing to read. How would I go about organizing the data, either after the dump or from the beginning?
This is for a web scraping project where I have to get top news articles and their bodies and export them to a JSON file, which will then be sent to someone else's program. For now, I'm just working on getting the titles from the Yahoo Finance homepage.
import requests
import json
from bs4 import BeautifulSoup

# Getting webpage
page = requests.get("https://finance.yahoo.com/")
soup = BeautifulSoup(page.content, 'html.parser')  # creating instance of class to parse the page

# Getting article titles
title = soup.find_all(class_="Mb(5px)")
desc = soup.find_all(class_="Fz(14px) Lh(19px) Fz(13px)--sm1024 Lh(17px)--sm1024 LineClamp(3,57px) LineClamp(3,51px)--sm1024 M(0)")

# Getting article bodies
page2 = requests.get("https://finance.yahoo.com/news/warren-buffett-suggests-read-19th-204800450.html")
soup2 = BeautifulSoup(page2.content, 'html.parser')
body = soup.find_all(class_="canvas-atom canvas-text Mb(1.0em) Mb(0)--sm Mt(0.8em)--sm", id="15")

# Organizing data for export
data = {'title1': title[0].get_text(),
        'title2': title[1].get_text(),
        'title3': title[2].get_text(),
        'title4': title[3].get_text(),
        'title5': title[4].get_text()}

# Exporting the data to results.json
with open("results.json", "w") as write_file:
    json.dump(str(data), write_file)
This is what ends up being written in the JSON file (at the time of writing this post):
"{'title1': 'These US taxpayers face higher payments thanks to new law',
'title2': 'These 12 Stocks Are the Best Values in 2019, According to Pros
Who\u2019ve Outsmarted the Market', '\ntitle3': 'The Best Move You Can
Make With Your Investments in 2019, According to 5 Market Professionals',
'title4': 'The auto industry said goodbye to a lot of cars in 2018',
'title5': '7 Stock Picks From Top-Rated Wall Street Analysts'}"
I would like the code to show each article title on a separate line and remove the stray \u2019 escape sequences that appear in the middle.
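The root of the unreadable output is the str(data) call: it serializes the whole dict as one Python-repr string instead of a JSON object. A minimal sketch contrasting the two (the sample titles here are made up):

```python
import json

data = {"title1": "Example headline",
        "title2": "Who\u2019ve outsmarted the market"}

# Dumping str(data) produces a single quoted string full of escapes...
as_string = json.dumps(str(data))

# ...while dumping the dict itself with indent= gives one key per line.
as_object = json.dumps(data, indent=4, ensure_ascii=False)

print(as_string)
print(as_object)
```

Parsing the first form back yields a string, not a dict, which is why the receiving program would also struggle with it.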
python json beautifulsoup repl.it
JSON is relatively human-readable, but isn't a 'pretty' output format. If you want pretty output, you need to read the file back in and parse it for output. Though, as you say, this is for import into another program, so I'm not sure why you're worried about this?
– match, Jan 1 at 18:11
Try json.dump(data, write_file, indent=4)
– t.m.adam, Jan 1 at 18:40
@match I mainly wanted to remove the unnecessary \u2019 escapes to make it easier for the next group to analyze.
– Ganlas, Jan 2 at 1:05
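As match's comment notes, pretty output means reading the file back and formatting it yourself. A minimal round-trip sketch (the filename and sample titles are illustrative, not from the scraper):

```python
import json

# Dump a dict, read it back, and print one title per line.
sample = {"title1": "First headline", "title2": "Second headline"}

with open("results.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False, indent=4)

with open("results.json", encoding="utf-8") as f:
    data = json.load(f)

for key, title in data.items():
    print(f"{key}: {title}")
```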
asked Jan 1 at 18:01 by Ganlas
2 Answers
I ran your code but didn't get the output you describe: you define the key 'title3' as a constant, yet your output shows '\ntitle3', which I couldn't reproduce. As for the stray \u2019 escapes, you were getting them because the file wasn't opened with an explicit 'utf8' encoding and ensure_ascii wasn't set to False. I would suggest two changes: use the 'lxml' parser instead of 'html.parser', and write the file with this snippet:
with open("results.json", "w", encoding='utf8') as write_file:
    json.dump(str(data), write_file, ensure_ascii=False)
This worked for me; both the stray escapes and the ASCII issues were resolved.
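The ensure_ascii behavior this answer describes can be seen in isolation. A minimal sketch with a made-up title containing a curly apostrophe:

```python
import json

title = "Who\u2019ve Outsmarted the Market"  # contains a curly apostrophe

# The default ensure_ascii=True escapes every non-ASCII character...
print(json.dumps({"title": title}))

# ...while ensure_ascii=False writes the character itself,
# so the file must then be opened with a utf-8 encoding.
print(json.dumps({"title": title}, ensure_ascii=False))
```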
import requests
import json
from bs4 import BeautifulSoup

# Getting webpage
page = requests.get("https://finance.yahoo.com/")
soup = BeautifulSoup(page.content, 'html.parser')  # creating instance of class to parse the page

# Getting article titles
title = soup.find_all(class_="Mb(5px)")
desc = soup.find_all(class_="Fz(14px) Lh(19px) Fz(13px)--sm1024 Lh(17px)--sm1024 LineClamp(3,57px) LineClamp(3,51px)--sm1024 M(0)")

# Getting article bodies
page2 = requests.get("https://finance.yahoo.com/news/warren-buffett-suggests-read-19th-204800450.html")
soup2 = BeautifulSoup(page2.content, 'html.parser')
body = soup.find_all(class_="canvas-atom canvas-text Mb(1.0em) Mb(0)--sm Mt(0.8em)--sm", id="15")

title = [x.get_text().strip() for x in title]
limit = len(title)  # change this to 5 if you need only the first 5
data = {"title" + str(i + 1): title[i] for i in range(0, limit)}

with open("results.json", "w", encoding='utf-8') as write_file:
    write_file.write(json.dumps(data, ensure_ascii=False, indent=4))
results.json:
{
"title1": "These 12 Stocks Are the Best Values in 2019, According to Pros Who’ve Outsmarted the Market",
"title2": "These US taxpayers face higher payments thanks to new law",
"title3": "The Best Move You Can Make With Your Investments in 2019, According to 5 Market Professionals",
"title4": "Cramer Remix: Here's where your first $10,000 should be i...",
"title5": "The auto industry said goodbye to a lot of cars in 2018",
"title6": "Ocado Pips Adyen to Take Crown of 2018's Best European Stock",
"title7": "7 Stock Picks From Top-Rated Wall Street Analysts",
"title8": "Buy IBM Stock as It Begins 2019 as the Cheapest Dow Component",
"title9": "$70 Oil Could Be Right Around The Corner",
"title10": "What Is the Highest Credit Score and How Do You Get It?",
"title11": "Silver Price Forecast – Silver markets stall on New Year’s Eve",
"title12": "This Chart Says the S&P 500 Could Rebound in 2019",
"title13": "Should You Buy Some Berkshire Hathaway Stock?",
"title14": "How Much Does a Financial Advisor Cost?",
"title15": "Here Are the World's Biggest Billionaire Winners and Losers of 2018",
"title16": "Tax tips: What you need to know before you file your taxes in 2019",
"title17": "Kevin O’Leary: Make This Your Top New Year’s Resolution",
"title18": "Dakota Access pipeline developer slow to replace some trees",
"title19": "Einhorn's Greenlight Extends Decline to 34% in Worst Year",
"title20": "4 companies to watch in 2019",
"title21": "What Is My Debt-to-Income Ratio?",
"title22": "US recession unlikely, market volatility to continue in 2019, El-Erian says",
"title23": "Fidelity: Ignore stock market turbulence and stick to long-term goals",
"title24": "Tax season: How you can come out a winner",
"title25": "IBD 50 Growth Stocks To Watch"
}
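Since the project eventually needs titles and bodies together, a list of records can be easier for the receiving program to iterate than numbered keys. A minimal sketch with placeholder data (this structure is a suggestion, not part of either answer):

```python
import json

titles = ["Headline one", "Headline two"]
bodies = ["Body text one", "Body text two"]  # placeholder article bodies

# Pair each title with its body as a list of records, so the consumer
# can simply loop over the list instead of guessing key names.
articles = [{"title": t, "body": b} for t, b in zip(titles, bodies)]

print(json.dumps(articles, indent=4, ensure_ascii=False))
```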
answered Jan 1 at 18:56 by Mobasshir Bhuiyan
answered Jan 1 at 19:05, edited Jan 1 at 19:11, by Bitto Bennichan