-
Downloaded and unpacked the dataset.
-
wc -l yelp_academic_dataset_review.json (number of lines in the dataset) -> 8021122
-
Split the dataset with "split -l 350000 yelp_academic_dataset_review.json" -> 23 files of ~250 MB each
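As a sanity check, the number of chunks follows from the line count and the chunk size given above. A minimal sketch; the function name `chunk_plan` is hypothetical and not part of the repository:

```python
def chunk_plan(total_lines, lines_per_chunk):
    """Return (number of chunks, lines in the last chunk) for `split -l`."""
    # Ceiling division with integers only, so no float rounding is involved.
    n_chunks = -(-total_lines // lines_per_chunk)
    last_chunk = total_lines - (n_chunks - 1) * lines_per_chunk
    return n_chunks, last_chunk


# Numbers taken from the notes: 8021122 lines, 350000 lines per file.
print(chunk_plan(8021122, 350000))  # (23, 321122)
```

The first 22 files hold 350000 lines each and the last one holds the remainder, which matches the 23 files reported.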
-
Determined the data structure -> {"review_id":"qCMDfOjWdoyNE-oU3h9DKg","user_id":"JtmLdyw4k1xV78jgjhKM_w","business_id":"vdR_vmmgfI56bwSZYYHtXg","stars":5.0,"useful":0,"funny":0,"cool":0, "text":"Tasty burgers and sandwiches..Staff were super friendly including the owner..I was just visiting the area and found this place yelp and thought would give this place a try..highly recommended!","date":"2019-08-01 06:51:51"}
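The field names and types can also be read programmatically, since each line of the file is one JSON object. A minimal sketch using the sample record above; the helper name `describe_record` is hypothetical:

```python
import json

# Sample record copied verbatim from the notes.
SAMPLE = (
    '{"review_id":"qCMDfOjWdoyNE-oU3h9DKg","user_id":"JtmLdyw4k1xV78jgjhKM_w",'
    '"business_id":"vdR_vmmgfI56bwSZYYHtXg","stars":5.0,"useful":0,"funny":0,'
    '"cool":0, "text":"Tasty burgers and sandwiches..Staff were super friendly '
    'including the owner..I was just visiting the area and found this place '
    'yelp and thought would give this place a try..highly recommended!",'
    '"date":"2019-08-01 06:51:51"}'
)


def describe_record(line):
    """Map each field name of one JSON line to the name of its Python type."""
    record = json.loads(line)
    return {key: type(value).__name__ for key, value in record.items()}


for field, type_name in describe_record(SAMPLE).items():
    print(f"{field}: {type_name}")
```

Only the `text` field is needed for the w2v experiment; the other fields are identifiers, ratings, vote counts and a timestamp.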
-
Trial run of w2v -> file containing only the review text, one review per line
-
grep -o -P '(?<=text":").*(?=",)' yelp_academic_dataset_review.json > review_text_only.txt
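The regex approach leaves JSON escapes such as `\n` and `\"` in the output. An alternative sketch, not the method used in these notes, parses each line with Python's `json` module so that escapes are decoded; the function name `iter_review_texts` is hypothetical:

```python
import json


def iter_review_texts(lines):
    """Yield the "text" field of each JSON line as a single-line string."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        text = json.loads(line)["text"]
        # Collapse every whitespace run (including real newlines) into one
        # space, so each review occupies exactly one output line.
        yield " ".join(text.split())


# Possible usage with the file names from the notes:
#
# with open("yelp_academic_dataset_review.json", encoding="utf-8") as src, \
#         open("review_text_only.txt", "w", encoding="utf-8") as dst:
#     for text in iter_review_texts(src):
#         dst.write(text + "\n")
```

Because the file is read line by line, memory use stays flat regardless of the dataset size.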
-
convert to lowercase (GNU sed; -E is needed so that the parentheses form a capture group): sed -E 's/(.*)/\L\1/' regextest.txt > output.txt
-
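remove special characters
-
remove newline characters: sed -i 's/\n/ /g' small-samlple.txt (sed reads its input line by line, so \n never matches inside a line; if the target is the literal \n escape left over from the JSON, the pattern would have to be 's/\\n/ /g')
-
remove everything not a word or space: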
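The command for the last step is not recorded in these notes. As a rough sketch of the whole cleanup (lowercasing, removing `\n` escapes, removing everything that is neither a word character nor whitespace), the following Python function could be used. The regex `[^\w\s]` and the name `normalize` are assumptions, not the original command:

```python
import re

# Assumption: "not a word or space" means anything outside \w and \s.
# Note that \w also keeps digits, underscores and non-ASCII letters.
_NON_WORD = re.compile(r"[^\w\s]")


def normalize(text):
    """Lowercase a review and strip escapes and special characters."""
    text = text.lower()
    # Replace literal backslash-n escapes first; otherwise the regex below
    # would remove the backslash and leave a stray "n" behind.
    text = text.replace("\\n", " ")
    text = _NON_WORD.sub(" ", text)
    # Collapse the whitespace runs created by the substitutions.
    return " ".join(text.split())


print(normalize("Tasty burgers..Staff were SUPER friendly!\\nGreat"))
```

The order of the steps matters: escapes are handled before the general character filter, and whitespace is collapsed last.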