Commit 11f68cd5 authored by Silas Dohm

readme update

parent a1df8bed
1. Downloaded and unpacked the dataset.
2. wc -l yelp_academic_dataset_review.json (how many lines the dataset has) -> 8021122
3. Split the dataset with "split -l 350000 yelp_academic_dataset_review.json" -> 23 files of ~250 MB each
4. Determined the data structure ->
{"review_id":"qCMDfOjWdoyNE-oU3h9DKg","user_id":"JtmLdyw4k1xV78jgjhKM_w","business_id":"vdR_vmmgfI56bwSZYYHtXg","stars":5.0,"useful":0,"funny":0,"cool":0,
"text":"Tasty burgers and sandwiches..Staff were super friendly including the owner..I was just visiting the area and found this place yelp and thought would give
this place a try..highly recommended!","date":"2019-08-01 06:51:51"}
5. Trial w2v run -> file containing only the "review text", one review per line
6. grep -o -P '(?<=text":").*(?=",)' yelp_academic_dataset_review.json > review_text_only.txt
7. Convert to lowercase:
sed -e 's/\(.*\)/\L\1/' regextest.txt > output.txt
8. Remove special characters
Replace escaped newline sequences (a literal \n in the JSON text) with a space:
sed -i 's/\\n/ /g' small-samlple.txt
remove everything not a word or space:
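
Steps 6 to 8 can be sketched as a single filter, which makes it easier to try the cleanup on a small sample before running it on the full dataset. This is only an illustration under assumptions, not the commands actually used in the project: `clean_reviews` is a hypothetical helper name, the extraction uses a portable `sed` expression instead of `grep -P` and assumes that `"date"` directly follows `"text"` as in the record shown in step 4, lowercase conversion uses `tr` instead of GNU sed's `\L`, and the last stage is a guess at "remove everything not a word or space", since that command is not given above.

```shell
#!/bin/sh
# Sketch of steps 6-8 as one filter: JSON lines on stdin, cleaned review text on stdout.
clean_reviews() {
  # Step 6: keep only the value of the "text" field
  # (assumes "date" directly follows "text", as in the sample record).
  sed -n 's/.*"text":"\(.*\)","date":.*/\1/p' |
  # Step 7: convert to lowercase.
  tr '[:upper:]' '[:lower:]' |
  # Step 8: replace JSON-escaped newlines (backslash followed by n) with a space.
  sed 's/\\n/ /g' |
  # Step 8 (assumed command): drop everything that is not a letter, digit, underscore, or space.
  sed 's/[^[:alnum:]_ ]//g'
}

# Usage on a one-record sample (hypothetical review with the same field layout).
sample='{"review_id":"x","stars":5.0,"text":"Tasty burgers..Staff were\nsuper friendly!","date":"2019-08-01 06:51:51"}'
result=$(printf '%s\n' "$sample" | clean_reviews)
printf '%s\n' "$result"
# prints: tasty burgersstaff were super friendly
```

Note that deleting punctuation outright merges neighbouring words ("burgers..Staff" becomes "burgersstaff"). Replacing the unwanted characters with a space instead would keep such words separate, at the cost of producing runs of multiple spaces.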
# Term paper for the module "Maschinelles Lernen" (Machine Learning)
Documentation: https://gitlab.cvh-server.de/w2v/w2vp/-/tree/master/Dokumentation/w2v.pdf
Slidecast: https://hs-bochum.sciebo.de/s/xOkCb5a8A65jify
Models: https://gitlab.cvh-server.de/w2v/w2vp/-/tree/master/models