I'm trying to extract data from a .txt file and while my regex did work for the most part, it fails when it comes across single quotes within the text I'm trying to extract.
{'pro_id':'1692423', 'pro_model':'SKUF42051', 'pro_category':'accessories', 'pro_name':'Gants tactiques Escalade en plein air Gants antidérapants résistants à l'usure Formation Gants de moto d'équitation', 'pro_current_price':'27.99', 'pro_raw_price':'27.99', 'pro_discount':'36', 'pro_likes_count':'11'}
This is what my text in the .txt file looks like.
I'm looping through and creating dicts from them. I do that by extracting the content from within the single quotes and appending the "key" and "value" pairs to a dict.
I first extracted the content from within the curly brackets, then split it at ", " to get the "items" in a list. I then looped through the list and used this regex to extract the "key" and "value":
key, value = re.findall(r"\'([^']+)\'", element)
I'm a regex as well as a programming novice, so I could use some expert help.
I did ask ChatGPT for a regex, but that fails too:
'([^']+(?:\\'[^']+)*?)':'([^']+(?:\\'[^']+)*?)'
I want re.findall to return a list that holds
['pro_name', 'Gants tactiques Escalade en plein air Gants antidérapants résistants à l'usure Formation Gants de moto d'équitation']
but instead I get
['Gants tactiques Escalade en plein air Gants antidérapants résistants à l', 'équitation']
Your string is malformed. Strings containing literal single quotes should be enclosed in double quotes, else it can't be parsed correctly.
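To illustrate why the quoting matters, here is a minimal sketch using the standard library's ast.literal_eval. The helper name try_parse and the two shortened sample strings are made up for this illustration; they are not taken from your file.

```python
import ast

def try_parse(s):
    """Return the parsed literal, or None if s is not a valid Python literal."""
    try:
        return ast.literal_eval(s)
    except (SyntaxError, ValueError):
        return None

# Hypothetical shortened samples: the same value, quoted two different ways.
good = """{'pro_name': "résistants à l'usure"}"""  # value in double quotes
bad = "{'pro_name':'résistants à l'usure'}"        # apostrophe ends the string early

print(try_parse(good))
print(try_parse(bad))
```

The first string parses into a dict, while the second returns None because the apostrophe in l'usure closes the value prematurely. If whatever produces the file can be changed to quote such values with double quotes, no custom parsing is needed at all.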
It is extremely difficult to sort this out with regex, or with a for loop.
But I have found a simple pattern. Since every string is enclosed in single quotes, the key-value pairs are separated by a comma followed by a space, and each key is separated from its value by a single colon, you can identify the key-value pairs by first splitting the string on "', '" and then splitting each substring on "':'".
You can then convert the result to a dict, with cleanup if necessary.
Example:
import re
text = "{'pro_id':'1692423', 'pro_model':'SKUF42051', 'pro_category':'accessories', 'pro_name':'Gants tactiques Escalade en plein air Gants antidérapants résistants à l'usure Formation Gants de moto d'équitation', 'pro_current_price':'27.99', 'pro_raw_price':'27.99', 'pro_discount':'36', 'pro_likes_count':'11'}"
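# Split into pairs on "', '", then split each pair into key and value on "':'"
arr = [i.split("':'") for i in text.split("', '")]

def clean(s):
    # Strip leftover braces and quotes from both ends of the string
    return re.sub(r"^[{']+|[}']+$", '', s)

{clean(a): clean(b) for a, b in arr}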
The result is:
{'pro_id': '1692423',
'pro_model': 'SKUF42051',
'pro_category': 'accessories',
'pro_name': "Gants tactiques Escalade en plein air Gants antidérapants résistants à l'usure Formation Gants de moto d'équitation",
'pro_current_price': '27.99',
'pro_raw_price': '27.99',
'pro_discount': '36',
'pro_likes_count': '11'}
Wrap it in a function:
def dictify(text):
    arr = [i.split("':'") for i in text.split("', '")]
    return {clean(a): clean(b) for a, b in arr}
I assume you have many more strings like the one above in your text file. Since I don't know the exact format, I can only demonstrate how to convert the file to a list of dicts as if it were newline-separated.
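with open('/path/to/file', 'r') as f:
    text = f.read()

# Skip blank lines (such as one left by a trailing newline), which dictify cannot unpack
[dictify(row) for row in text.split('\n') if row.strip()]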
You need to change the file path placeholder to the actual path. The above won't work if your file isn't newline separated.
And my method won't work if your string deviates from the format, for example if there are spaces after the key-value delimiting colons, or there aren't spaces after the commas that separate key-value pairs.
If that is the case, I cannot help you and you will need to figure out a different method. My example does work on the sample you have given, though.
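For what it's worth, one possible direction for such a different method is a regex with a lookahead that only accepts a closing quote when it is followed by the next key or by the closing brace. This is a rough sketch, not something tested against your real file: the name dictify_tolerant and the sample string are hypothetical, and it assumes every key consists only of letters, digits and underscores and that each record sits on one line and ends with "}".

```python
import re

# Assumptions (not confirmed by the question): keys match \w+, and a value
# ends at a quote followed by ", 'next_key':" or by the closing brace.
PAIR = re.compile(r"'(\w+)'\s*:\s*'(.*?)'(?=\s*,\s*'\w+'\s*:|\s*\}\s*$)")

def dictify_tolerant(text):
    """Build a dict from one record, tolerating optional spaces around ':' and ','."""
    return dict(PAIR.findall(text))

# Hypothetical sample with irregular spacing and apostrophes inside a value.
sample = "{'pro_id' : '1692423','pro_name': 'Gants de moto d'équitation' , 'pro_discount':'36'}"
print(dictify_tolerant(sample))
```

It will still break if a value itself contains something that looks like the start of the next pair, for example the text ', 'x': inside a product name, so treat it as a starting point only.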